Hadoop

Hadoop

Hadoop

Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. Hadoop makes it possible to run applications on systems with thousands of commodity hardware nodes and to handle thousands of terabytes of data. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating in case of a node failure. This approach lowers the risk of catastrophic system failure and unexpected data loss, even if a significant number of nodes become inoperative. Consequently, Hadoop quickly emerged as a foundation for big data processing tasks, such as scientific analytics, business and sales planning and processing enormous volumes of sensor data, including from internet of things sensors. It was inspired by Google's Map Reduce, a software framework in which an application is broken down into numerous small parts. Any of these parts, which are also called fragments or blocks, can be run on any node in the cluster. After years of development within the open source community, Hadoop 1.0 became publically available in November 2012 as part of the Apache project sponsored by the Apache Software Foundation. Since its initial release, Hadoop has been continuously developed and updated. The second iteration of Hadoop (Hadoop 2) improved resource management and scheduling. It features a high-availability file-system option and support for Microsoft Windows and other components to expand the framework's versatility for data processing and analytics.

Hadoop also supports a range of related projects that can complement and extend Hadoop's basic capabilities. Complementary software packages include:

  • Apache Flume. A tool used to collect, aggregate and move huge amounts of streaming data into HDFS.
  • Apache HBase. An open source, nonrelational, distributed database;
  • Apache Hive. A data warehouse that provides data summarization, query and analysis;
  • Cloudera Impala. A massively parallel processing database for Hadoop, originally created by the software company Cloudera, but now released as open source software;
  • Apache Oozie. A server-based workflow scheduling system to manage Hadoop jobs;
  • Apache Phoenix. An open source, massively parallel processing, relational database engine for Hadoop that is based on Apache HBase;
  • Apache Pig. A high-level platform for creating programs that run on Hadoop;
  • Apache Sqoop. A tool to transfer bulk data between Hadoop and structured data stores, such as relational databases;
  • Apache Spark. A fast engine for big data processing capable of streaming and supporting SQL, machine learning and graph processing;
  • Apache Storm. An open source data processing system; and
  • Apache ZooKeeper. An open source configuration, synchronization and naming registry service for large distributed systems.