
What is Big Data and Why?

Before we start, let me tell you one simple rule - huge data is NOT Big Data. Even someone who is already working on Hive & Spark may not know what Big Data really is.

I will not start with the typical terms like Velocity, Variety and Volume and make it difficult to learn... let us rather start with an example:

Consider a consumer products (say chocolates) manufacturing company - Zakaas Inc. The company has its own EDW (Enterprise Data Warehouse) where project teams bring data in neatly using ETL tools (like Informatica or SSIS), and reports are generated out of the EDW.


Over the years, the company's data grows from GBs to TBs... is that Big Data? Absolutely not. As the company grows, the management realizes the need to understand trends in the market. But how? For that they need data. What data? The data here is nothing but information on how consumers rate the products, what they say about the products, how the products compare with competitors' products, and so on. On the other side, to improve operational efficiency, the management decides to install sensors in the factory, and along with that starts tracking human activities as well... let us see the picture below:
[Picture: consumer/external data, factory sensor data and human activity data flowing into Zakaas Inc.'s Data Lake]

To satisfy the analysis requirements, the management decides to bring in external data along with the sensor data from the factory. Bringing the data in is not enough; it has to be analysed... and apart from reporting users, there are users who do ad-hoc analysis of the data. The data can be structured, semi-structured or unstructured, and it can arrive at any frequency. This is called Big Data.

Thus Big Data is nothing but data (in any format) about many activities (by various entities), collected at any frequency, for analysis purposes.


Did you notice I have shown a Data Lake instead of an EDW? A typical water lake does not collect water based on its consumers' needs, right? People pull water out once the lake is full. Similarly, a Data Lake is not created as per consumer requirements in the first place. The aim of a data lake is to collect data first (by any means) and then analyse it. This approach is called ELT (Extract-Load-Transform), in contrast to the ETL approach used for the EDW.

Collecting data by any means (see the PySpark sketch after this list) ---

Variety of data - Reviews and ratings (JSON), Competitors' data (CSV/JSON), News (PDFs), Photos (Images)
Velocity of data - Data arriving every second, minute, day, week, month, quarter and what not
Volume of data - With so many sources, you can imagine the volume of data
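
To make this concrete, here is a minimal PySpark sketch of the idea, assuming a Spark environment is available; every path, file name and source below is a hypothetical example, not part of the original story. Each source lands in the lake's raw zone in whatever shape it arrives in, and transformation is deferred - which is the ELT approach described above.

# Land a variety of sources in the raw zone first; transform later (ELT).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zakaas-ingest").getOrCreate()

# Variety - a different reader per format
reviews     = spark.read.json("landing/reviews/*.json")                            # Reviews and ratings (JSON)
competitors = spark.read.option("header", True).csv("landing/competitors/*.csv")   # Competitors' data (CSV)
news        = spark.read.format("binaryFile").load("landing/news/*.pdf")           # News (PDFs, read as raw bytes; Spark 3+ binaryFile source)
photos      = spark.read.format("binaryFile").load("landing/photos/*.jpg")         # Photos (images, read as raw bytes)

# Velocity/Volume - the same job can run per second, minute or day, appending as it goes
for name, df in [("reviews", reviews), ("competitors", competitors),
                 ("news", news), ("photos", photos)]:
    df.write.mode("append").parquet(f"datalake/raw/{name}/")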

As an implementation plan, Zakaas Inc would have to invest in setting up machines, servers, networks and so on. However, what the management learned from Microsoft is that all of this can be made available without investing that time and effort, thanks to the cloud offerings Microsoft provides through its Azure platform. The Azure cloud provides all the infrastructure and services Zakaas needs for its data lake.

How is Big Data analysed?

Step 1 - Bring and store the data in raw format (as it is)
Step 2 - Clean and transform the data, then store it again at another location
Step 3 - Prepare the data for consumption (store it again)
Step 4 - Analyse & view (no need to store)
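
To give a rough feel for these steps, here is a minimal PySpark sketch; the zone names (raw / cleansed / curated) and the column names (review_id, rating, product_id) are hypothetical examples, not a fixed standard.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("zakaas-pipeline").getOrCreate()

# Step 1 - the data has already been brought in and stored in raw format (as it is)
raw_reviews = spark.read.parquet("datalake/raw/reviews/")

# Step 2 - clean and transform, then store again at another location
cleansed = (raw_reviews
            .dropDuplicates(["review_id"])
            .filter(F.col("rating").isNotNull()))
cleansed.write.mode("overwrite").parquet("datalake/cleansed/reviews/")

# Step 3 - prepare data for consumption (aggregate and store it again)
curated = cleansed.groupBy("product_id").agg(F.avg("rating").alias("avg_rating"))
curated.write.mode("overwrite").parquet("datalake/curated/product_ratings/")

# Step 4 - analyse & view: reporting and BI tools read the curated zone; nothing more to store

Notice that each step writes to its own location - that is why the data gets stored again and again in the list above.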

Data Engineering involves steps 1, 2 and occasionally 3.
Data Analytics involves steps 3 & 4.
Data Science and BI people work with the business teams to get reports and visualizations from the prepared data.

More on these roles:
https://www.edureka.co/blog/data-analyst-vs-data-engineer-vs-data-scientist/ 
https://www.dataquest.io/blog/data-analyst-data-scientist-data-engineer/

We will learn more about concepts like batch data loads, real-time loads and the variety of architectures in upcoming courses.
