2. What is Hadoop?


Apache Hadoop is an open-source software framework written in Java for distributed storage and processing of large data sets across clusters of commodity hardware using simple programming models. Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. 
Apache Hadoop has two core parts: the Hadoop Distributed File System (HDFS) for storage, and MapReduce for processing.
Hadoop splits files into large blocks and distributes them amongst the nodes in the cluster. To process the data, Hadoop MapReduce ships packaged code to the nodes so that each node works in parallel on the data it stores locally.
Hadoop is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, HDFS and a number of related projects such as Apache Hive, HBase and ZooKeeper.
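To make the storage/processing split concrete, the classic word-count job below is sketched against Hadoop's MapReduce Java API: the mapper runs next to the HDFS blocks and emits (word, 1) pairs, and the reducer sums the counts. This is a minimal sketch; the input and output HDFS paths are assumed to be passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs on the node that holds each block and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word across all mappers.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory (assumed)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (assumed)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A job like this is typically packaged into a jar and submitted with the hadoop jar command; YARN then schedules the map and reduce tasks across the cluster.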

Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any node in the cluster. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant.


The Hadoop framework is used by major players including Google, Yahoo and IBM, largely for applications involving search engines and advertising.

The project includes these modules:

     Hadoop Distributed File System (HDFS): A distributed file system (a minimal read/write sketch using the HDFS Java API follows this list).
     Hadoop YARN: A framework for job scheduling and cluster resource management.
     MapReduce: A parallel processing system for large data sets.
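
As a small illustration of the HDFS module, the sketch below writes and reads a file through the standard org.apache.hadoop.fs.FileSystem Java API. The path /tmp/hello.txt and the cluster address picked up from core-site.xml are assumptions for the example, not part of the original post.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (e.g. hdfs://namenode:8020) from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/tmp/hello.txt");  // hypothetical HDFS path used for illustration

    // Write a small file; HDFS transparently splits larger files into blocks
    // and replicates each block across several DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }

    fs.close();
  }
}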

Other Hadoop-related projects at Apache include:

     Avro: A data serialization system.
     Cassandra: A scalable multi-master database with no single points of failure.
     Chukwa: A data collection system for managing large distributed systems.
     HBase: A scalable, distributed database that supports structured data storage for large tables.
     Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
     Mahout: A scalable machine learning and data mining library.
     Sqoop: A tool that imports data from relational databases such as MySQL and Oracle into HDFS or Hive, and exports data from HDFS or Hive back to relational databases.
     Pig: A high-level data-flow language and execution framework for parallel computation.
     Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation (see the Spark Java sketch after this list).
     ZooKeeper: A high-performance coordination service for distributed applications.
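
For a taste of the Spark programming model mentioned above, here is a minimal word count using Spark's Java API (2.x-style lambdas). The local[*] master and the command-line input/output paths are assumptions for demonstration; on a real cluster the job would typically run on YARN and read from HDFS.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    // Local master for demonstration only; on a cluster this would be YARN or standalone.
    SparkConf conf = new SparkConf().setAppName("spark word count").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // Spark can read directly from HDFS paths; args[0] is assumed to be the input path.
      JavaRDD<String> lines = sc.textFile(args[0]);

      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // split lines into words
          .mapToPair(word -> new Tuple2<>(word, 1))                      // emit (word, 1) pairs
          .reduceByKey(Integer::sum);                                    // sum counts per word

      counts.saveAsTextFile(args[1]); // args[1] is assumed to be the output path
    }
  }
}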


