For some time now, the "digital expansion", or as some call it the "digital tsunami", has been changing the world of information as we knew it just ten years ago. The creation, storage and replication of digital data is growing exponentially, and that growth has not yet ended. In a very short time these changes have transformed the treatment and management of large volumes of data, so-called "big data". For example, while a decade ago relational database management systems (RDBMS) were the undisputed queens of this field, today the picture is different. Data files are no longer measured in megabytes but in terabytes, and the total amount of information is no longer measured in gigabytes but in petabytes (1 petabyte is 1 million gigabytes). All these changes have produced a response. One such response has been Hadoop.

Hadoop should not be seen as a competitor to the RDBMS but as a complement. Where relational databases begin to deliver unacceptable response times (because of the high volume of data), Hadoop delivers response times that are more than acceptable. But what does Hadoop have that an RDBMS does not? Basically, Hadoop is a distributed storage and processing system, which is essential when dealing with files larger than about 1 terabyte.


The Hadoop platform was born in 2006, building on two Google creations: MapReduce [2], a set of algorithms for distributed processing, and GFS (Google File System) [3], a distributed file system that is the father of the current Hadoop Distributed File System (HDFS). Since then the Apache Software Foundation has distributed this software as open source. Years later the company Cloudera began to distribute CDH (Cloudera Distribution for Hadoop), a software package that includes many of the frameworks of the Apache Hadoop ecosystem.

Why is Hadoop chosen?

Distributed processing and storage is certainly a good way to handle, process and store today's continuous flow of information. Where relational databases fall short, Hadoop has emerged as software that over the years has gained in performance, reliability, scalability and manageability, and is today stable, robust and flexible.

Hadoop meets all the requirements to be a good candidate for companies that need a software platform that will not choke on the millions and millions of records stored and received every day. In fact, many of the best-known social networks, search engines and e-commerce companies have chosen Hadoop as their software for processing and storing information (Yahoo, Twitter, Facebook, LinkedIn, Amazon, eBay, etc.). In 2010, Facebook engineers announced what they described as the largest Hadoop cluster in the world [4], with 21 petabytes of information. It currently holds over 100 PB [5], roughly a fivefold increase in just two years. Although in most cases Hadoop runs on a company's own ("in-house") cluster, the use of clusters in the cloud is increasing.

One of the best-known providers of such services is Amazon, which offers Amazon EC2 (Elastic Compute Cloud) and Amazon S3 (Simple Storage Service). In this way you can run an entire Hadoop cluster in the cloud without using your own resources. You can also contract services dedicated to Hadoop, such as Amazon EMR (Elastic MapReduce), together with the MapR tool for installing, configuring and managing Hadoop in the cloud.


One of the best-known books, "Hadoop: The Definitive Guide" [6], clearly describes what Hadoop is and is a good starting point for entering the Hadoop world. Some of Hadoop's inherent characteristics are listed below:

  • Hadoop is linearly scalable. Its architecture is designed to run on a cluster: work is distributed across a set of nodes (servers), one of which acts as the "master" while the rest act as slaves. The simplicity and ease with which a node can be added to the cluster make Hadoop extremely flexible and scalable in response to any change in the volume of data to be processed.
  • High availability. Files are replicated as many times as a configuration variable specifies, which gives the system high reliability.
  • Fault tolerance. The failure of a node, or of a set of nodes, in the cluster does not prevent the system from functioning properly.
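
The replication and fault tolerance points in the list above can be illustrated with a small Python sketch. It is a toy model, not Hadoop's actual placement policy: the function names `place_replicas` and `readable_blocks`, the node and block names, and the replication factor of 3 are all assumptions made for illustration (in HDFS the replication factor is controlled by the `dfs.replication` property).

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Toy placement only: real HDFS also considers racks and free space.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for i, block in enumerate(block_ids):
        # Consecutive nodes starting at a rotating offset, so replicas never repeat.
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


# Hypothetical five-node cluster holding ten blocks.
nodes = ["node1", "node2", "node3", "node4", "node5"]
blocks = [f"blk_{i}" for i in range(10)]
placement = place_replicas(blocks, nodes, replication=3)

# With three replicas per block, losing two nodes leaves every block readable.
still_complete = readable_blocks(placement, ["node1", "node2"]) == set(blocks)
```

With three copies of every block, any two nodes can fail without data becoming unreadable; only when all three holders of a block fail at once is that block lost, which is why raising the replication setting raises reliability at the cost of disk space.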

Professor Eric A. Brewer established that a distributed system cannot fully satisfy all three requirements of consistency, availability and partition tolerance at the same time (the CAP theorem [7]). Hadoop comes quite close to meeting these three requirements, and we can therefore describe it as a distributed system with very high performance.

MapReduce and HDFS.

MapReduce is the soul of Hadoop. It is a programming model for distributed data processing, built on a couple of very simple ideas: divide and distribute the data, then reduce the problem. For example, if we have a function X to apply to a data set D, D is divided into pieces and X is applied to each piece (the Map phase); the results of all the Maps are then collected and combined into a single result (the Reduce phase). MapReduce is inherently parallel: the data is distributed across n nodes running m tasks (managed by a "TaskTracker" on each node), and priority is given to processing data on the nodes where it is stored ("data locality"). These and other features make it ideal for processing large volumes of data.
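
The Map and Reduce phases described above can be sketched in a few lines of plain Python, using the classic word-count example. This is a single-process illustration of the programming model under stated assumptions, not Hadoop's API: the function names, the sample lines and the number of chunks are all hypothetical, and on a real cluster each map task would run in parallel on the node that stores its piece of the data.

```python
from collections import defaultdict


def split_into_chunks(records, n_chunks):
    """Divide the data set D into pieces, one per map task."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def map_phase(chunk):
    """Apply the function X to one piece: emit a (word, 1) pair per word."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Group the intermediate values by key between the two phases."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, n_chunks=3):
    chunks = split_into_chunks(lines, n_chunks)
    mapped = [map_phase(chunk) for chunk in chunks]  # parallel on a real cluster
    return reduce_phase(shuffle(mapped))


lines = ["Hadoop stores data", "Hadoop processes data", "data data everywhere"]
counts = word_count(lines)
```

The result does not depend on how the input is split into chunks, which is what allows the Map phase to be spread over as many nodes as are available.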

If MapReduce is the soul of Hadoop, HDFS is its body: Hadoop's distributed file system, in which data is replicated and distributed across the disks of the cluster. HDFS allows MapReduce to work on files as large as 10 terabytes. HDFS accesses data by streaming, which makes it very efficient at reading. Most of the user commands for working with HDFS resemble those of Linux-type operating systems, so it is intuitive and simple to manage.
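
As a rough back-of-the-envelope illustration of what storing a 10-terabyte file in a replicated, block-based file system involves, the Python sketch below counts the blocks and the raw disk space consumed. The block size of 128 MB and the replication factor of 3 are assumptions (they are common HDFS settings, but both are configurable), the helper name `hdfs_footprint` is hypothetical, and binary units are used for simplicity.

```python
MB = 1024 ** 2
TB = 1024 ** 4


def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, raw bytes used across the cluster).

    Assumed defaults: 128 MB blocks and 3 replicas; adjust to your cluster.
    """
    # Integer ceiling division: a partly filled last block still counts as a block.
    n_blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    raw_bytes = file_size_bytes * replication
    return n_blocks, raw_bytes


n_blocks, raw_bytes = hdfs_footprint(10 * TB)
print(f"{n_blocks} blocks, {raw_bytes / TB:.0f} TB of raw disk")
```

Each of those blocks can be read and processed independently, which is what lets MapReduce spread the work on a single very large file across the whole cluster.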

The Hadoop ecosystem

Programming languages such as Pig, data warehouses such as Hive, databases such as HBase, and other tools such as Oozie, Sqoop, Flume, ZooKeeper and Hue run on top of the Hadoop platform, and all of them can be installed from the CDH package provided by Cloudera. Pig is a framework of the Hadoop world that runs the powerful programming language Pig Latin, a "dataflow" language that is counted among the NoSQL languages. Hive is a data warehouse that brings a SQL-like language to Hadoop. HBase is a column-oriented database modeled on Google's Bigtable; the main difference between them is that HBase is open source.

New concepts, new tools.

Hadoop has only taken the first steps of a long road. The latest version of the package distributed by Cloudera, CDH4, introduces version 2.0 of Apache Hadoop. This new version adds new concepts such as "NameNode High Availability" [8], which addresses the fragility of the "NameNode" in previous versions. MapReduce and HDFS become more robust and stable and provide solutions for more complex systems.

Tools such as Cloudera Manager, Hortonworks Data Platform, IBM InfoSphere BigInsights and MapR, which install, configure and monitor the nodes of a Hadoop cluster quickly and easily, have now begun to be marketed. There are also "Business Intelligence" tools that run on Hadoop, such as Pentaho, Datameer and Jaspersoft, among others.

There is no doubt that Hadoop has a long life ahead of it and that this is only the beginning.