6 - Hadoop Ecosystem & Azure

Zakaas Inc. now understands what Spark is and its flavors. However, there is still a lot of confusion around Hadoop, Hive, Kafka and so on. Do we really need to learn them? What is the significance of these technologies in Azure?

Note: A junior developer can survive in the Big Data world knowing only Spark, even without being aware of the Hadoop Ecosystem. However, this is not true for senior developers and architects.

Let us understand the Hadoop Ecosystem in short statements.

Again, like Spark, the Hadoop Ecosystem is a set of engines, libraries and frameworks residing on Unix/Linux. Hadoop is made of storage + engine + Boss + Worker, similar to Spark... well, except for storage. For Spark, storage is external. Storage in Hadoop is called HDFS (Hadoop Distributed File System), and the engine is YARN (the resource manager) + MapReduce (the data processing program, mostly written in Java). YARN manages the Head Node and the Worker nodes (called Data Nodes in Hadoop).
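To make the "storage is external for Spark" point concrete, here is a minimal PySpark sketch that reads a file sitting in HDFS. The NameNode address and file path are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

# Minimal sketch: Spark processing data that lives in HDFS.
# The NameNode host/port and file path below are hypothetical.
spark = SparkSession.builder.appName("hdfs-read-sketch").getOrCreate()

df = spark.read.csv(
    "hdfs://namenode:8020/data/sales/orders.csv",
    header=True,
    inferSchema=True,
)
df.show(5)   # Spark (the engine) computes; HDFS (the storage) just holds the file
```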

Listed below are the parts of the Hadoop Ecosystem that share the same base - HDFS + YARN:
1) Hive -- The database (data warehouse) layer in Hadoop. We create databases, tables and views in Hive.
2) HQL -- Hive Query Language, a SQL-like language to query Hive data (see the Spark SQL sketch after this list).
3) Spark -- Zakaas Inc. knows it already.
4) Kafka -- A storage system to store and consume messages (small pieces of data) in real time. Used in real-time streaming applications.
5) Storm -- A real-time data processing engine. It does not store data. Offers many downstream connectors (e.g. Power BI).
6) Oozie -- A data orchestration tool (job/workflow scheduler).
7) Pig -- Effectively dead. Another data processing language that was used for transforming data.
8) MapReduce -- A Java programming model for processing data. Effectively dead for Spark developers.
9) Sqoop -- A command-line tool for bulk data transfer between HDFS and relational databases.
10) Ambari -- A tool to manage the Hadoop cluster (nodes and services).
11) HBase -- Similar to Hive but stores data in NoSQL format, mostly semi-structured. It does not store unstructured data.
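To show what Hive and HQL look like in practice (points 1 and 2), here is a minimal PySpark sketch that creates a Hive-style table and queries it with SQL. The database, table and column names are made up for illustration, and it assumes a Spark session built with Hive support.

```python
from pyspark.sql import SparkSession

# Minimal sketch: creating and querying a Hive-style table from Spark.
# Database, table and column names are hypothetical.
spark = (SparkSession.builder
         .appName("hive-hql-sketch")
         .enableHiveSupport()          # connect Spark to the Hive metastore
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS sales_db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_db.orders (
        order_id INT,
        amount   DOUBLE,
        country  STRING
    )
""")

# In Hive you would write almost the same HQL statement.
spark.sql("""
    SELECT country, SUM(amount) AS total_amount
    FROM sales_db.orders
    GROUP BY country
""").show()
```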

Where you should focus if you are going to work on the Azure Data Platform: numbers 1 and 3 (4 is also required for real-time data ingestion and processing).

Azure offers alternative services for all of the above.
Hive -- Databricks Tables, SQL DB, SQL DW
HQL -- Spark SQL, T-SQL
Spark -- Databricks (built on top of Spark itself), HDInsight Spark
Kafka -- Event Hubs, IoT Hub, HDInsight Kafka (see the streaming sketch after this list)
Storm -- Stream Analytics, Databricks, HDInsight Storm
Oozie -- ADF
Sqoop -- ADF Copy Activity
Ambari -- Present in HDInsight; otherwise not needed
HBase -- Cosmos DB, HDInsight HBase
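For the Kafka row above, here is a hedged sketch of Spark Structured Streaming consuming a Kafka topic. The broker address and topic name are placeholders; with Azure Event Hubs you would point the same code at its Kafka-compatible endpoint instead.

```python
from pyspark.sql import SparkSession

# Minimal sketch: Structured Streaming reading messages from Kafka.
# Broker address and topic name are hypothetical.
spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "clickstream")
          .load())

# Kafka delivers key/value as bytes; cast the value to a readable string.
messages = events.selectExpr("CAST(value AS STRING) AS message")

query = (messages.writeStream
         .format("console")            # in a real pipeline, write to a proper sink
         .outputMode("append")
         .start())
query.awaitTermination()
```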

Now that you understand the Hadoop Ecosystem at a high level, read a more detailed overview here: https://www.edureka.co/blog/hadoop-ecosystem
Do not miss: https://docs.microsoft.com/en-us/azure/hdinsight/hadoop/apache-hadoop-introduction

[Zakaas] : If Azure offers both HDInsight Spark and Databricks, which one should we go for?
The current trend in the market is Databricks. Why?
1) Databricks manages the servers for you; it is a PaaS service. HDInsight, on the other hand, is effectively IaaS: you have to manage the servers yourself, and security and performance are an integral part of server management.
2) The Databricks Spark engine is much more optimized for data processing, data science (ML) and streaming.
3) No need to buy separate clusters for each purpose -- HDInsight Hive + HDInsight Spark + HDInsight Storm: all of these alternative functionalities are present in a single Databricks cluster.
4) Auto-shutdown (auto-termination) of clusters -- Databricks offers it out of the box; HDInsight is also now offering it in public preview (see the cluster sketch after this list).
5) Portability -- Customers who are already working on Azure Databricks can easily port their workloads to Databricks on AWS.
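As an illustration of point 4, below is a hedged sketch that creates a Databricks cluster with auto-termination through the Clusters REST API. The workspace URL, access token, runtime version and VM size are placeholders you would replace with values from your own workspace.

```python
import requests

# Minimal sketch: creating a Databricks cluster that auto-terminates when idle.
# Workspace URL, token, runtime version and node type are placeholders.
WORKSPACE_URL = "https://adb-1234567890.12.azuredatabricks.net"   # hypothetical
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "demo-autoterminate",
    "spark_version": "11.3.x-scala2.12",   # pick a runtime available in your workspace
    "node_type_id": "Standard_DS3_v2",     # an Azure VM size
    "num_workers": 2,
    "autotermination_minutes": 30,         # shut down after 30 idle minutes
}

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())   # response contains the new cluster_id
```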

For Spark developers, there is roughly 80% similarity between the two Spark engines (Databricks and HDInsight).
