2. What is Hadoop?
Apache Hadoop is an open-source software framework written in Java for distributed storage and processing of
large data sets across clusters of commodity hardware using simple programming
models. Hadoop is designed to scale up from single servers to thousands of machines,
each offering local computation and storage.
Apache Hadoop has two core parts: the Hadoop Distributed File System (HDFS) for storage, and MapReduce for processing.
Hadoop splits files into large blocks and distributes them amongst the nodes in the cluster. To process the data, Hadoop MapReduce ships packaged code to the nodes, so that each node works in parallel on the data it stores locally. Hadoop is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
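To make the division between HDFS (storage) and MapReduce (processing) concrete, here is a minimal sketch of the classic WordCount job written against Hadoop's Java MapReduce API. It is an illustrative example rather than production code; the HDFS input and output paths are assumed to be supplied as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs on the nodes holding the input blocks and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: receives all counts emitted for a word and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, a job like this is normally submitted with the hadoop jar command, for example: hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output (the paths here are hypothetical).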
The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS) and a number of related projects such as Apache Hive, HBase and ZooKeeper.
Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any node in the cluster. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant.
The Hadoop framework is used by major players including Google, Yahoo and IBM, largely for applications involving search engines and advertising.
The project includes these modules:
Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data (a short usage sketch follows this list).
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A parallel processing system for large data sets.
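As a quick illustration of the HDFS module above, the sketch below uses Hadoop's Java FileSystem API to copy a local file into HDFS and list a directory. The cluster address is taken from the standard core-site.xml configuration on the classpath, and the file paths are made-up examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Reads fs.defaultFS (e.g. hdfs://namenode:8020) from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into HDFS (hypothetical paths).
    fs.copyFromLocalFile(new Path("/tmp/sample.txt"), new Path("/user/demo/sample.txt"));

    // List the contents of the target HDFS directory.
    for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
      System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
    }

    fs.close();
  }
}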
Other Hadoop-related projects at Apache include:
Avro: A data serialization system.
Cassandra: A scalable multi-master database with no single points of failure.
Chukwa: A data collection system for managing large distributed systems.
HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining library.
Sqoop: A tool that imports data from relational databases such as MySQL and Oracle into HDFS or Hive, and also exports data from HDFS or Hive back to relational databases.
Pig: A high-level data-flow language and execution framework for parallel computation.
Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
ZooKeeper: A high-performance coordination service for distributed applications.