Entries Tagged as 'Cloud Computing'

Apache Hadoop Core

From: http://hadoop.apache.org

What Is Hadoop?

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing, including:

  • Hadoop Core, our flagship sub-project, provides a distributed filesystem (HDFS) and support for the MapReduce distributed computing metaphor.
  • HBase builds on Hadoop Core to provide a scalable, distributed database.
  • Pig is a high-level data-flow language and execution framework for parallel computation. It is built on top of Hadoop Core.
  • ZooKeeper is a highly available and reliable coordination system. Distributed applications use ZooKeeper to store and mediate updates for critical shared state.
  • Hive is a data warehouse infrastructure built on Hadoop Core that provides data summarization, ad hoc querying, and analysis of datasets.

[Figure: HDFS]

Who uses Hadoop?

A wide variety of companies and organizations use Hadoop for both research and production. Users are encouraged to add themselves to the Hadoop users wiki page.

Apache Hadoop Core is a software platform that lets one easily write and run applications that process vast amounts of data.

Here’s what makes Hadoop especially useful:

  • Scalable: Hadoop can reliably store and process petabytes.
  • Economical: It distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
  • Efficient: By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This avoids moving large data sets across the network and keeps processing fast.
  • Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks when failures occur.

Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS; see the figure). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
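The division of work described above can be illustrated with a small word count. This is a minimal single-process sketch of the MapReduce idea in plain Python, not Hadoop's own API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and the "blocks" are ordinary strings standing in for HDFS data blocks.

```python
from collections import defaultdict
from itertools import chain


def map_phase(block):
    """Map step: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(blocks):
    # Each block is mapped independently. On a real cluster these calls
    # would run in parallel on the nodes that already hold the blocks.
    mapped = chain.from_iterable(map_phase(block) for block in blocks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Three strings stand in for three data blocks (an assumption for the demo).
    print(word_count(["the quick brown fox", "the lazy dog", "the fox"]))
```

Because every call to map_phase depends only on its own block, the map step can be spread across as many machines as there are blocks, which is the property the paragraph above relies on.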

Hadoop has been demonstrated on clusters with 2,000 nodes. The current design target is 10,000-node clusters.

For more information about Hadoop, please see the Hadoop wiki.

Getting Started

The Hadoop project plans to scale Hadoop up to handle thousands of computers. However, you can start by installing it on a single machine or a very small cluster.

  1. Learn about Hadoop by reading the documentation.
  2. Download Hadoop from the release page.
  3. Follow the Hadoop Quickstart.
  4. Follow the Hadoop Cluster Setup guide.
  5. Discuss it on the mailing list.

Getting Involved

Hadoop is an open source volunteer project under the Apache Software Foundation. We encourage you to learn about the project and contribute your expertise. Here are some starter links:

  1. See our How to Contribute to Hadoop page.
  2. Give us feedback: What can we do better?
  3. Join the mailing list: Meet the community.

Scientists write guide to build supercomputer from Sony PlayStation 3

Researchers at the University of Massachusetts Dartmouth, US, have
created a step-by-step guide to building a home-brewed supercomputer
that can reduce the cost of university and general computing research.
The guide illustrates how to build a fully functioning,
high-performance supercomputer from Sony PlayStation 3 consoles.

Last year, the researchers’ construction of a small supercomputer from
eight Sony-donated PlayStation 3 gaming consoles made headlines
nationwide in the scientific community. The consoles are used to solve
complex equations designed to predict the properties of gravitational
waves generated by the black holes located at the centres of
galaxies.

Typically, scientists rent supercomputer time by the hour. A single
simulation can require more than 5,000 hours at USD 1 per hour, or more
than USD 5,000, on the National Science Foundation’s TeraGrid computing
infrastructure.

The guide is freely available to the public under an open source license
at www.ps3cluster.org.

PhysOrg.com / University of Massachusetts Dartmouth – December 17, 2008

http://www.merit.unu.edu/i&tweekly/ref.php?nid=3511

Public Data Sets – Amazon Web Services

From Amazon AWS

Public Data Sets on AWS provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. AWS is hosting the public data sets at no charge for the community, and like all AWS services, users pay only for the compute and storage they use for their own applications. An initial list of data sets is already available, and more will be added soon.

Previously, large data sets such as the mapping of the Human Genome and the US Census data required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets from their Amazon Elastic Compute Cloud (Amazon EC2) instances and start computing on the data within minutes. Users can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. For example, users can produce or use prebuilt server images with tools and applications to analyze the data sets. By hosting this important and useful data with cost-efficient services such as Amazon EC2, AWS hopes to provide researchers across a variety of disciplines and industries with tools to enable more innovation, more quickly.

Opening the Cloud: Open-source cloud-computing tools could give companies greater flexibility.

By Erica Naone from Technology Review

Cloud-computing platforms such as Amazon’s Elastic Compute Cloud (EC2), Microsoft’s Azure Services Platform, and Google App Engine have given many businesses flexible access to computing resources, ushering in an era in which, among other things, startups can operate with much lower infrastructure costs. Instead of having to buy or rent hardware, users can pay for only the processing power that they actually use and are free to use more or less as their needs change.

However, relying on cloud computing comes with drawbacks, including privacy, security, and reliability concerns. So there is now growing interest in open-source cloud-computing tools, for which the source code is freely available. These tools could let companies build and customize their own computing clouds to work alongside more powerful commercial solutions.

One open-source software-infrastructure project, called Eucalyptus, imitates the experience of using EC2 but lets users run programs on their own resources and provides a detailed view of what would otherwise be the black box of cloud-computing services.

Another open-source cloud-computing project is the University of Chicago’s Globus Nimbus, which is widely recognized as having pioneered the field. And a European cloud-computing initiative coordinated by IBM, called RESERVOIR, features several open-source components, including OpenNebula, a tool for managing the virtual machines within a cloud. Even some companies, such as Enomaly and 10gen, are developing open-source cloud-computing tools.

Rich Wolski, a professor in the computer-science department at the University of California, Santa Barbara, who directs the Eucalyptus project, says that his focus is on developing a platform that is easy to use, maintain, and modify. “We actually started from first principles to build something that looks like a cloud,” he says. “As a result, we believe that our thing is more malleable. We can modify it, we can see inside it, we can install it and maintain it in a cloud environment in a more natural way.”

Reuven Cohen, founder and chief technologist of Enomaly, explains that an open-source cloud provides useful flexibility for academics and large companies. For example, he says, a company might want to run most of its computing in a commercial cloud such as that provided by Amazon but use the same software to process sensitive data on its own machines, for added security. Alternatively, a user might want to run software on his or her own resources most of the time, but have the option to expand to a commercial service in times of high demand. In both cases, an open-source cloud-computing interface can offer that flexibility, serving as a complement to the commercial service rather than a replacement.
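The two scenarios Cohen describes can be combined into one placement rule. The sketch below is purely illustrative and is not taken from Enomaly, Eucalyptus, or any other project named here: the Job class, the place_jobs function, and the "local" and "commercial" labels are all hypothetical, and capacity is reduced to a simple CPU count.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cpus: int
    sensitive: bool = False


def place_jobs(jobs, local_capacity):
    """Assign each job to "local" or "commercial".

    Sensitive jobs must stay on local machines. Other jobs run locally while
    capacity remains and overflow to the commercial cloud otherwise.
    """
    placement = {}
    free = local_capacity
    # Place sensitive jobs first so other work cannot crowd them out.
    for job in sorted(jobs, key=lambda j: not j.sensitive):
        if job.cpus <= free:
            placement[job.name] = "local"
            free -= job.cpus
        elif job.sensitive:
            raise RuntimeError(f"no local capacity for sensitive job {job.name}")
        else:
            placement[job.name] = "commercial"
    return placement


if __name__ == "__main__":
    jobs = [Job("report", 4), Job("payroll", 2, sensitive=True), Job("render", 8)]
    print(place_jobs(jobs, local_capacity=8))
```

With eight local CPUs, the sensitive payroll job and the report job fit locally, while the render job overflows to the commercial service. The rule only works if both environments accept the same job description, which is the role a shared interface plays in the scenarios above.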

Indeed, Wolski says that Eucalyptus isn’t meant to be an EC2 killer (for one thing, it’s not designed to scale to the same size). However, he believes that the project can make a productive contribution by offering a simple way to customize programs for use in the cloud. Wolski says that it’s easier to assess a program’s performance when it’s possible to see how it operates both at the interface and from within a cloud.

Wolski says that Eucalyptus will also imitate Amazon’s popular Simple Storage Service, which allows users to access storage space on demand, as well as its Elastic IP addresses, which keep the address of Web resources the same, even if the physical location changes.
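The idea behind a stable address that survives a change of physical location can be shown with a simple lookup table. This is a toy sketch, not the Amazon or Eucalyptus API: the AddressTable class and the instance names are hypothetical, and 203.0.113.10 is a reserved documentation address.

```python
class AddressTable:
    """Maps stable public addresses to whichever instance currently serves them."""

    def __init__(self):
        self._targets = {}

    def associate(self, address, instance_id):
        # Re-associating an address replaces the old target. Clients keep
        # using the same address and never see the change.
        self._targets[address] = instance_id

    def resolve(self, address):
        return self._targets[address]


if __name__ == "__main__":
    table = AddressTable()
    table.associate("203.0.113.10", "instance-a")
    print(table.resolve("203.0.113.10"))
    # The service moves to different hardware; the public address stays the same.
    table.associate("203.0.113.10", "instance-b")
    print(table.resolve("203.0.113.10"))
```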

Ignacio Llorente, a professor in the distributed systems architecture group at the Universidad Complutense de Madrid, in Spain, who works on OpenNebula, says that Eucalyptus’s main advantage is that it uses the popular EC2 interface. However, he adds that “the open-source interface is only one part of the solution. Their back-end [the system's internal management of physical resources and virtual machines] is too basic. A complete cloud solution requires other components.” Llorente says that Eucalyptus is just one example of a growing ecosystem of open-source cloud-computing components.

Wolski expects many of Eucalyptus’s users to be academics interested in studying cloud-computing infrastructure. Although he doubts that such a platform would be used as a distributed system for ordinary computer users, he doesn’t discount the possibility. “You can argue it both ways,” he notes. But Wolski says that he thinks some open-source cloud-computing tool will become important in the future. “If it’s not Eucalyptus, I suspect [it will be] something else,” he says. “There will be an open-source thing that everyone gets excited about and runs in their environment.”