Hadoop
Saturday, April 9, 2011

What is Hadoop?
• Open source software for reliable, scalable, and distributed computing.
• A flexible infrastructure for large-scale computation and data processing on a network of commodity hardware.
• Often called "the Linux of distributed processing."

Why Hadoop?
• A very large distributed file system:
  • Data is distributed across data nodes.
• Reliability and availability:
  • Files are replicated to handle hardware failure.
  • The system detects failures and recovers from them automatically.
• Ability to run on cheap commodity hardware.
• Open source flexibility:
  • Runs on heterogeneous operating systems.
• Scalability:
  • The number of nodes in a cluster is not fixed; nodes can be added as the data grows.
• Parallel processing through MapReduce.

Main components:
• HDFS (the Hadoop Distributed File System) for storage.
• The MapReduce programming model for processing.

Hadoop Distributed File System (HDFS)
• A distributed file system based on the Google File System (GFS), serving as Hadoop's shared filesystem.
• Data is distributed across data servers.
• Data files are partitioned into large blocks (64 MB by default) and replicated on multiple DataNodes.
• The NameNode stores the metadata (block locations, directory structure); see the client sketch after these notes.

MapReduce
• A framework for distributed processing of large data sets (see the word-count sketch below).
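To make the HDFS picture above concrete, here is a minimal read sketch against the Hadoop Java client API. The client fetches block locations (metadata) from the NameNode, then streams the file bytes directly from the DataNodes. The NameNode address and the file path here are hypothetical placeholders, not values from the notes above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in a real cluster this comes from
        // core-site.xml (the property was fs.default.name in older releases).
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        // Opening the file asks the NameNode only for metadata (block
        // locations); the actual bytes stream from the DataNodes.
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.txt"); // hypothetical path

        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}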
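And here is the canonical word-count job, the standard illustration of the MapReduce model: the map phase emits a (word, 1) pair for every word in its input split, and the reduce phase sums the counts for each word. Input and output paths are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The combiner pre-aggregates map output locally to cut network traffic.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}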