Spark is a framework.
Spark is an engine.
Spark is a set of libraries.
Spark is a set of SDKs.
In one sentence, Spark is a custom-built solution that resides on UNIX machines. The solution is built by Apache to process large amounts of data.
Spark contains numerous libraries that are ready for developers to use.
You learn Spark in the following way:
1) Architecture of Spark - Distributed Data Processing (explained later)
2) SDKs / Libraries - Spark SQL, Streaming, MLlib, GraphX
3) Programming Languages - Spark supports Python, Scala, Java & R
What does support for different programming languages mean?
Without a programming language there is no use for the Spark engine. I prefer Python, you prefer Scala and that guy prefers R. We all work on the same Spark engine, but with different languages and syntax; underneath, all of us are using the same core libraries.

As Data Engineers, we will focus on data processing using Python/Scala and Spark SQL/Streaming. We won't focus on MLlib and GraphX, to keep things simple.
Distributed Data Processing - Spark processes data on different machines at the same time. After the machines finish processing, Spark collects the results and sends them to the caller. Consider a construction site where many workers work under 2 bosses (who don't work :)). At any time only 1 boss plans the work and supervises the workers. Only if that boss is tired or sleeping does the 2nd boss become active. Spark works in a similar way: there are 2 head nodes (who don't work) and 1 or more worker nodes (who do the work on the ground).
Note: Depending on the workload, the boss brings in more or fewer workers to finish the work on time. This is a cloud offering. (explained later - Auto scaling)
As in typical programming, when we read data and store it in a variable, that variable is called a DataFrame. You can also fire queries in SQL format. There are various ways to read/process/write data; these are called APIs in Spark. RDD/DataFrame/Dataset/SQL are some of the APIs (i.e. interfaces to interact with the data).
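As a small taste, here is a minimal PySpark sketch of the DataFrame and SQL APIs working on the same data. The file path, column names and app name are made up for illustration only:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession - the entry point to the DataFrame and SQL APIs
spark = SparkSession.builder.appName("spark-basics-demo").getOrCreate()

# DataFrame API: read a CSV file into a DataFrame (hypothetical path and columns)
df = spark.read.csv("/data/sales.csv", header=True, inferSchema=True)
df.filter(df["amount"] > 100).show(5)   # filter rows and display the first 5

# SQL API: register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()

Both calls run on the same engine; the DataFrame and SQL forms are just two interfaces to the same data.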
RDD - Spark keeps data internally distributed and tracks its lineage while processing. A Resilient Distributed Dataset (RDD) is a collection of data held in memory across different nodes. The data can be plain values (strings, numbers etc.), collections of rows, dictionaries etc. If any node crashes during processing, the lost pieces of the RDD are recovered automatically and recreated on other nodes (they are lazily re-computed from the tracked lineage). This is not tabular-format data.
Dataset - An enhanced version of the RDD in tabular format. Available only in Scala & Java.
DataFrame - An enhanced version of the RDD & Dataset, available in Scala, Java, Python & R.
We will not learn these APIs in detail at this stage. One point to keep in mind: the DataFrame is the most widely used API for data processing in Spark.
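We won't go deep into these APIs yet, but to make the RDD idea above a bit more concrete, here is a minimal PySpark sketch (the numbers and partition count are just illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()

# Parallelize a small Python list into an RDD split across 4 partitions
rdd = spark.sparkContext.parallelize(range(10), numSlices=4)
print(rdd.getNumPartitions())    # 4 - the data sits in pieces across the workers

# map() is a transformation: it is recorded in the lineage but runs lazily
squared = rdd.map(lambda x: x * x)

# collect() is an action: it triggers execution on the workers and gathers the results
print(squared.collect())         # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]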
Can I play with data using:
Only Python on Spark -- Yes
Only Scala on Spark -- Yes
Only Spark SQL on Spark -- Yes
Combination of Python + Spark SQL -- Yes
Combination of Python + Scala -- Yes/No - it depends on the flavor of Spark (some notebooks let you mix languages, others don't).
More detail on what Spark is - recommended reading:
https://www.tutorialspoint.com/apache_spark/apache_spark_quick_guide.htm
https://www.bernardmarr.com/default.asp?contentID=1079
Zakaas Inc now understands what Spark is, but is still confused about what Cloudera / Databricks or any other Spark version is.
Companies like Cloudera or Databricks customize core Apache Spark with additional offerings (more libraries, network connectivity, security implementation, performance improvements, server management etc.). These are called Spark flavors.
As you can see above, core Apache Spark is bundled with additional offerings and packaged for sale. Further, these companies sell their offerings through cloud providers as well (Azure, AWS, Google etc.). That makes it very easy for customers to migrate from one cloud to another, since the underlying Spark is the same (e.g. Databricks).
Auto scaling: Databricks offers auto scaling. If the boss has to finish more work in the same amount of time, he calls in more workers. Similarly, the Databricks head node creates additional worker nodes automatically for big data processing. Once the work is done, the cluster scales back down to its default size. Moreover, when it is idle (no work), the cluster (the set of head nodes and worker nodes) shuts down automatically.
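For a rough idea of how this looks in practice, a Databricks cluster definition lets you give a range of workers instead of a fixed count. The sketch below is illustrative only - the runtime version, node type and numbers are placeholders, so check the Databricks documentation for the exact field names and values:

# Illustrative cluster settings for the Databricks Clusters API (all values are placeholders)
cluster_config = {
    "cluster_name": "demo-autoscaling-cluster",
    "spark_version": "13.3.x-scala2.12",   # example Databricks runtime version
    "node_type_id": "Standard_DS3_v2",     # example worker VM size
    "autoscale": {
        "min_workers": 2,                  # default / scaled-down size
        "max_workers": 8                   # upper limit when the workload grows
    },
    "autotermination_minutes": 30          # shut the cluster down after 30 idle minutes
}

With settings like these, the head node adds workers up to the maximum when the workload grows, removes them when it shrinks, and terminates the whole cluster after the idle timeout.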
Notebooks: Notebooks are online interfaces for writing and running code. There are many notebooks available in the market (they all work more or less the same way) -- Jupyter, Databricks Notebook, Zeppelin.
Usage of notebooks:
1) Coding - Python, Scala, R
2) Visualization for data analysis - Shows the output of selected data in various formats - Tables, Graphs, Charts etc.
3) Scheduling - The notebooks can be scheduled by job schedulers
4) Portable - You can export, clone, download and send notebooks to other parties