What is Hadoop?
• Open source software for reliable, scalable, and distributed computing.
• Flexible infrastructure for large scale computation and data processing on a network of commodity hardware.
• The Linux of distributed processing.
Why Hadoop?
• Very large distributed file system.
  • Data is distributed across DataNodes.
• Reliability and availability.
  • Files are replicated to handle hardware failure (see the sketch after this list).
  • Failures are detected automatically and recovered from.
• Ability to run on cheap commodity hardware.
• Open source flexibility.
  • Runs on heterogeneous operating systems.
• Scalability.
  • The number of nodes in a cluster is not fixed; nodes can be added as data grows.
• Parallel processing through MapReduce.
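As an illustration of the replication point above, here is a minimal Java sketch of controlling the replication factor through the standard Hadoop client API. The path /data/example.txt is a hypothetical placeholder, and in practice the cluster-wide default comes from hdfs-site.xml rather than client code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.replication controls how many DataNodes hold a copy of each block;
        // the usual cluster default is 3.
        conf.set("dfs.replication", "3");

        // Assumes fs.defaultFS points at an HDFS cluster
        // (e.g. via a core-site.xml on the classpath).
        FileSystem fs = FileSystem.get(conf);

        // Replication can also be changed per file after it is written;
        // the path here is purely illustrative.
        fs.setReplication(new Path("/data/example.txt"), (short) 5);
        fs.close();
    }
}
```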
Main components:
• HDFS (Hadoop Distributed File System) for storage.
• The MapReduce programming model for processing.
Hadoop Distributed File System (HDFS)
• A distributed file system modeled on the Google File System (GFS), serving as Hadoop's shared filesystem.
• Data is spread across many data servers (DataNodes).
• Data files are partitioned into large blocks (64 MB by default), each replicated on multiple DataNodes.
• The NameNode stores the metadata (block locations, directory structure).
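To show how a client reads a file without knowing anything about blocks or replicas, here is a minimal sketch against the Java FileSystem API. The NameNode address hdfs://namenode:9000 and the path /data/example.txt are placeholder assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in a real cluster this usually
        // comes from core-site.xml (fs.defaultFS).
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // The NameNode resolves this path to block locations; the client then
        // streams each block directly from a DataNode holding a replica.
        Path file = new Path("/data/example.txt");
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```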
MapReduce:
• A framework for distributed processing of large data sets: a map phase transforms input records into intermediate key/value pairs, and a reduce phase aggregates all values that share a key.
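As a concrete example, below is a sketch of the canonical word-count job written against the org.apache.hadoop.mapreduce API: the mapper emits (word, 1) pairs, and the reducer sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, it would run with something like: hadoop jar wordcount.jar WordCount /input /output (the paths here are placeholders).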