Posts

What is Sqoop?

As we all know, relational databases are among the main data sources for Big Data, and Hadoop is the framework we use to analyze that data. Sqoop is a tool which imports data from relational databases into Hadoop HDFS and also exports data from Hadoop HDFS back to relational databases. The relational database can be MySQL, PostgreSQL, Oracle, Redshift or any other RDBMS. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance. Sqoop is an open source software product of the Apache Software Foundation.

Prerequisites
Before we start with Sqoop, the following prerequisite knowledge is required to run Sqoop jobs:
·         Basic knowledge of the Linux operating system and its commands
·         Concepts of relational database management systems
·         Concepts of Hadoop and HDFS, with basic commands

Starting with Sqoop:
Let's start with a very basic and important command, which will tell you about
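As a hedged sketch of what typical Sqoop commands look like (the JDBC URL, credentials, table names and HDFS paths below are made-up placeholders, not values from this post), an import and an export might be run as:

    # Import a table from MySQL into HDFS, using 4 parallel map tasks
    sqoop import \
        --connect jdbc:mysql://dbhost/sales \
        --username retail_user -P \
        --table customers \
        --target-dir /user/cloudera/customers \
        --num-mappers 4

    # Export processed results from HDFS back into a MySQL table
    sqoop export \
        --connect jdbc:mysql://dbhost/sales \
        --username retail_user -P \
        --table customer_summary \
        --export-dir /user/cloudera/customer_summary

Because Sqoop turns each of these invocations into a MapReduce job, the --num-mappers value controls how many parallel tasks read from or write to the database.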

How to install Cloudera QuickStart VM on VMware - Part1?

Start using Hadoop with Cloudera's QuickStart VMs. The QuickStart VMs contain a single-node Apache Hadoop cluster, complete with example data, queries and scripts. These virtual machines make it easy to get started with CDH (Cloudera's 100% open source Hadoop platform that includes Impala, Search, Spark, and more) and Cloudera Manager. They come complete with everything you need to learn Hadoop. The VMs run CentOS 6.4 and are available for VMware, VirtualBox, and KVM.

Requirements:
a) System Requirements:
These are 64-bit VMs. They require a 64-bit host OS and a virtualization product that can support a 64-bit guest OS. To use a VMware VM, you must use a player compatible with WorkStation 8.x or higher: Player 4.x or higher, ESXi 5.x or higher, or Fusion 4.x or higher. Older versions of WorkStation can be used to create a new VM using the same virtual disk (VMDK file), but some features in VMware Tools won't be available.
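Before downloading, it may help to confirm that the host machine can actually run a 64-bit guest. A minimal check on a Linux host (assuming the standard /proc/cpuinfo interface; on Windows the equivalent information is in the BIOS/UEFI settings and system properties) could look like this:

    # The "lm" (long mode) flag means the host CPU is 64-bit capable
    grep -m1 -o -w 'lm' /proc/cpuinfo

    # "vmx" (Intel VT-x) or "svm" (AMD-V) means hardware virtualization is exposed;
    # it usually also needs to be enabled in the BIOS for 64-bit guests to boot
    egrep -c '(vmx|svm)' /proc/cpuinfo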

Different flavours of Hadoop?

There are many flavours of Hadoop available, such as Cloudera, Hortonworks, etc. The nice thing is that you can download and try a free version. The base Apache distribution is good when you are just learning and getting started. The real advantage of the bundled distributions such as Cloudera and Hortonworks is that you don't have to think about whether version x of Hive will work with Hadoop version y and version z of HBase. They also come with better tools for deployment and management in an operational environment; for example, Cloudera comes with Cloudera Manager, which provides a web GUI for everything.

There are three modes in which to start a Hadoop cluster:
1.       Local (Standalone) Mode
2.       Pseudo-Distributed Mode
3.       Fully-Distributed Mode
A quick sketch of standalone mode follows below. In the next post we will start with the Cloudera QuickStart VM and see how to install it in a virtual machine.
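As a rough illustration of standalone mode (the paths assume an unpacked Apache Hadoop tarball, so adjust the directory names and the examples jar version to match your download), Hadoop can run one of its bundled example jobs directly against the local filesystem, with no daemons and no HDFS:

    # Use some of Hadoop's own config files as sample input
    mkdir input
    cp etc/hadoop/*.xml input

    # Run the bundled grep example as a single local process
    bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        grep input output 'dfs[a-z.]+'

    # Results land in a plain local directory
    cat output/*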

3. What is HDFS?

The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS is a distributed file system that provides high-performance access to data across Hadoop clusters. HDFS uses a master/slave architecture in which one device (the master) controls one or more other devices (the slaves). An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files. Because HDFS is typically deployed on low-cost commodity hardware, server failures are common. The file system is designed to be highly fault-tolerant, however, by facilitating the rapid transfer of data between compute nodes and enabling Hadoop systems to continue running if a node fails. That decreases the risk of catastrophic failure, even in the event that numerous nodes fail. When we copy data to HDFS, it breaks the information down into multiple pieces and distributes them to different nodes in the cluster, allowing
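A hedged sketch of that behaviour in practice (the file name and HDFS path are placeholders): copy a file into HDFS and then ask the NameNode how it was split into blocks and where the replicas live.

    # Copy a local file into HDFS; the client splits it into blocks as it writes
    hdfs dfs -put sales_2014.csv /user/cloudera/sales_2014.csv

    # Report the blocks of that file and the DataNodes holding each replica
    hdfs fsck /user/cloudera/sales_2014.csv -files -blocks -locations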

2. What is Hadoop?

Apache Hadoop is an open-source software framework written in Java for distributed storage and processing of large data sets across clusters of commodity hardware using simple programming models. Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Apache Hadoop has two core parts: the Hadoop Distributed File System (HDFS), for storage, and MapReduce, for processing. Hadoop splits files into large blocks and distributes them amongst the nodes in the cluster. To process the data, Hadoop MapReduce transfers packaged code to the nodes, which process it in parallel based on the data each node needs to handle. Hadoop is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce
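As a small, hedged illustration of that storage/processing split (the directory names and the examples jar path are assumptions and vary by distribution), a classic word-count run looks roughly like this:

    # Storage side: stage input data in HDFS
    hdfs dfs -mkdir -p /user/cloudera/wordcount/input
    hdfs dfs -put books/*.txt /user/cloudera/wordcount/input

    # Processing side: run the bundled MapReduce word-count example over that data
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
        wordcount /user/cloudera/wordcount/input /user/cloudera/wordcount/output

    # The reducers write their output back into HDFS
    hdfs dfs -cat /user/cloudera/wordcount/output/part-r-00000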

1. What is Big Data ?

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. According to IBM, 80% of the data captured today is unstructured, from sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. All of this unstructured data is Big Data. When dealing with larger data sets, organizations face difficulties in creating, manipulating, and managing big data. Big data is a particular problem in business analytics because standard tools and procedures are not designed to search and analyze