Tuesday, May 22, 2007

Why Web 2.0 needs Grid Computing


[Excellent article in Grid Today. I fully agree with Paul Strong that traditional grids have long been associated with academic high-performance computing, with a relatively small number of jobs running on a limited number of resources. Web 2.0 and web services bring in a whole new range of complexity and scaling issues, where there may be hundreds of millions of services being invoked across thousands of distributed servers. Some excerpts from Grid Today--BSA]



If one were asked to cite companies whose datacenters epitomize the idea of Grid 2.0, he could do a lot worse than to point to any of the Internet giants that have taken advantage of grid technologies to forever transform the ways in which we shop, access information, communicate, and just about every other aspect of our lives.

Companies like Google, Yahoo!, Amazon and eBay set the standard because they are using their massive, distributed computing infrastructures not only to host applications and store data, but also to host countless services, both internal and external, and to handle hundreds of millions to billions of transactions every single day -- in real time.

Paul Strong takes on just this topic, discussing grid computing's expansion from being solely an HPC technology to being the basis for the distributed platforms necessary to make the Web 2.0 world run.

Historically, the term "grid computing" has been intimately associated with the high-performance and technical computing community. The term was, of course, coined by some leading members of that community and, unsurprisingly, to many it has been almost completely defined within the context of this specific type of use. Yet when you look closely at what grid computing actually is predicated upon, it becomes apparent that the notion of grid computing is far more universally applicable than perhaps many people think. Indeed, one could make the assertion that grids are the integrated platforms for all network-distributed applications or services, whether they are computationally or transactionally intensive.

Grids are about leveraging the network to scale, solving problems or achieving throughput in a way that is just not possible using individual computers -- at least not computers in the traditional sense. Whatever your class of workload, you can do so much more on a network: your ability to scale is limited only by the number of resources you and your collaborators can connect together. Scaling is often achieved through division of workload and replication/duplication, which potentially yields the very desirable side effects of great resilience and, at least for appropriately implemented transactional applications, almost continuous availability.
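To make the division-of-workload idea concrete, here is a minimal, hypothetical sketch (my illustration, not from the article): a batch of work is split into chunks and farmed out to a pool of identical workers, so throughput scales with the number of workers and a failed worker's share can simply be re-run elsewhere. The worker pool, chunking scheme and function names are assumptions for illustration only.

```python
# Illustrative sketch only: partitioning a workload across replicated workers.
# Names and structure are hypothetical, not taken from any real grid middleware.
from concurrent.futures import ThreadPoolExecutor


def process_item(item):
    """Stand-in for a unit of work (a query, a transaction, a render job)."""
    return item * item


def run_on_grid(items, num_workers=8):
    # Divide the workload: each worker gets a slice of the items.
    chunks = [items[i::num_workers] for i in range(num_workers)]

    # Replicate the same worker logic across the pool; a chunk could be
    # resubmitted to another worker if one node fails, which is where the
    # resilience described above comes from.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = pool.map(lambda chunk: [process_item(x) for x in chunk], chunks)

    # Recombine the partial results.
    return [r for chunk_result in results for r in chunk_result]


if __name__ == "__main__":
    print(sum(run_on_grid(list(range(1000)))))
```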

In the Web 2.0 world, one has to use the network to scale. Individual servers cannot handle the transaction rates required of business on the Internet. The volume of data stored and manipulated by the likes of Google, eBay, Yahoo! and Amazon cannot fit into single instances or clusters of traditional databases or file systems. You have to do something different. Similarly, manipulating and presenting this data, or the services based on it, to hundreds of millions of consumers cannot be achieved without harnessing tens of thousands of computers, load balancers, firewalls and so forth. All of these applications or services treat the network as the platform, hence the assertion that infrastructures, such as eBay’s, are, in fact, grids.

While these Internet behemoths are at the extreme end of things, almost all application developers and datacenters are using similar techniques to scale their services. SOA and the like are just the latest milestones in the long journey from monolithic, server-centric applications to ever more fine-grained, agile, network-distributed services. As more and more instances of this class of application are deployed, and deployed on typically larger and larger numbers of small commodity components (small servers, cheap disks, etc.) instead of large multi-processor servers or complex storage arrays, the platform moves inexorably toward being the network itself, rather than the server. In fact, for most datacenters, it moves toward being a grid.

Just like a traditional operating system, [the grid] maps workload onto resources in line with policy. Unlike a traditional operating system, however, the resources are no longer just processors, memory and IO devices. Instead, they are servers, operating system instances, virtual machine monitors, disk arrays, load balancers, switches, database instances, application servers and so forth. The workload has shifted from being relatively simple binaries and executables to being distributed business services or complex simulations, and the policies are moving from simple scheduling priorities and the like to service-level objectives. This meta-operating system is realized today by various pieces of software, ranging from what we think of as traditional grid middleware to systems- and enterprise-management software.
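To picture the meta-operating-system analogy, here is a simplified, hypothetical sketch (not drawn from the article or any real grid middleware) of mapping workloads onto heterogeneous resources according to a service-level policy rather than a simple priority number. The resource kinds, policy fields and greedy matching rule are all assumptions for illustration.

```python
# Hypothetical sketch of the "meta-operating system" idea: workloads are
# matched to grid resources (servers, database instances, load balancers...)
# according to service-level objectives. All names and rules are illustrative.
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    kind: str            # e.g. "server", "db-instance", "load-balancer"
    capacity: int        # crude stand-in for available headroom


@dataclass
class Workload:
    name: str
    needs_kind: str      # class of resource required
    demand: int          # units of capacity required
    max_latency_ms: int  # service-level objective, not a scheduling priority


def place(workloads, resources):
    """Greedily map each workload onto a resource that satisfies its policy."""
    placement = {}
    # Tightest latency objectives get placed first.
    for w in sorted(workloads, key=lambda w: w.max_latency_ms):
        for r in resources:
            if r.kind == w.needs_kind and r.capacity >= w.demand:
                placement[w.name] = r.name
                r.capacity -= w.demand
                break
    return placement


if __name__ == "__main__":
    resources = [Resource("web-042", "server", 100),
                 Resource("db-007", "db-instance", 50)]
    workloads = [Workload("checkout-service", "server", 30, max_latency_ms=200),
                 Workload("orders-db", "db-instance", 20, max_latency_ms=50)]
    print(place(workloads, resources))
```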

The trouble with this massive scale-out approach is that management becomes a serious problem. What do you do when you find yourself managing 15,000-plus servers or an application with 15,000 instances? These environments become exponentially more complex as you add new elements. One new component or class of component might mean tens or hundreds of new relationships or interactions with the thousands of elements already in the infrastructure. Indeed, a 10,000-server infrastructure might actually comprise hundreds of thousands of managed components and relationships. Components can be as simple as servers or as complex as business workflows, and the relationships between all of these must be understood.
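A rough back-of-the-envelope sketch of why the bookkeeping grows so fast (my arithmetic, not figures from the article): even if each component relates to only a handful of others, the tracked relationships quickly dwarf the components themselves.

```python
# Illustrative arithmetic only: relationship counts for a large managed estate.
# The "10 relationships per component" figure is an assumption for the example.
def managed_objects(num_components, avg_relationships_per_component=10):
    relationships = num_components * avg_relationships_per_component
    return num_components + relationships


for n in (1_000, 10_000, 15_000):
    print(f"{n:>6} components -> ~{managed_objects(n):,} managed components and relationships")

# At 10,000 components and 10 relationships each, that is already on the order
# of 110,000 managed objects, consistent with "hundreds of thousands" above.
```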

This is becoming the next big challenge, especially for commercial organizations. They must be able to map the value of the services delivered to the underlying platform elements and their costs. They must be able to detect infrastructure failures and understand how these impact the business services, or capture breaches of service-level objectives for business processes and trace these to problems within individual software or hardware components. They must be able to be agile: to make changes in a small part of their infrastructure and predict or prevent any negative impact on the rest. With massively distributed, shared platforms -- i.e., grids -- this is extremely difficult. Clearly, the organizations that feel this pain the most are probably those with the greatest scale, such as the big Internet businesses.
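One way to picture the mapping the article calls for, offered purely as a hypothetical sketch: model business services and infrastructure as a dependency graph and walk it upward from a failed component to find which services are affected. The topology, names and traversal here are invented for illustration.

```python
# Hypothetical sketch: a dependency graph from business services down to
# hardware, walked upward to see which services a failing component impacts.
from collections import defaultdict

# "X depends on Y" edges, from business services down to infrastructure.
DEPENDS_ON = {
    "checkout-service": ["app-server-17", "orders-db"],
    "search-service": ["app-server-17", "search-index-03"],
    "orders-db": ["disk-array-2"],
    "search-index-03": ["disk-array-2"],
}


def impacted_services(failed_component, depends_on):
    """Return every node that transitively depends on the failed component."""
    # Invert the edges so we can walk from the failure upward.
    dependents = defaultdict(set)
    for node, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(node)

    impacted, frontier = set(), [failed_component]
    while frontier:
        node = frontier.pop()
        for parent in dependents[node]:
            if parent not in impacted:
                impacted.add(parent)
                frontier.append(parent)
    return impacted


if __name__ == "__main__":
    # A disk array failing ripples up to both databases and both services.
    print(impacted_services("disk-array-2", DEPENDS_ON))
```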

So, where grid was once perhaps thought of as the context for one class of application, one can reasonably assert that it is, in fact, the universal context for network distributed applications or services. Grids are, in fact, a general-purpose platform and the context for commercial workloads within the enterprise. And where once the focus was almost exclusively on delivering scale, as the techniques and technologies that enable this mature, the focus has to shift to managing the resulting, enormously complex grid systems -- both the services and the platforms on which they run.