
In the Big Data world, data keeps coming in, and IT teams have to store it. Since the data can be structured, semi-structured, or unstructured, it cannot always be stored in a database. The usual place to store such data (as files) is a disk, a shared location, a NAS drive, etc.

Consider 3 teams working on a data lake at Zakaas Inc. Each team requires its own path to store files. The IT team provisioned the following:

\\ZakaasDrive\team1
\\ZakaasDrive\team2
\\ZakaasDrive\team3

If this were on the Azure cloud, how would the storage be provisioned?

Azure offers storage services for this purpose: Blob, File, Queue, Azure Data Lake Storage (ADLS), and Disk. In this article we focus on Blob and ADLS storage only, as these are widely used in Big Data applications. Links for the other storage types are provided at the end of this course.

Blob & ADLS Storage features:

  • Web storage – Everything you store is on the web. The data can be accessed using http(s).
  • Fast read/write from Hadoop/Spark – Both storages are HDFS-compatible, so Hadoop and Spark can read and write them much like an HDFS file system. To access the data this way, use wasb(s) or abfs(s) instead of http(s).
  • REST API – Data can be read and written using the REST API.
  • Security – Offered in several ways: Azure AD support, firewall, VNET (accessible only from office locations/networks), and access tokens.
  • Replication – Data is replicated across locations/zones/regions (configurable).
  • Encryption – Data is always encrypted at rest, which reduces the risk of data theft.
  • Scalable – No practical limit on size for most workloads. The storage scales automatically to accommodate new data.

The above folder structure maps to Blob/ADLS as follows:
Root path: https://<accountname>.blob.core.windows.net, e.g. https://ZakaasStorage.blob.core.windows.net
Folders:
   https://ZakaasStorage.blob.core.windows.net/team1/<files>
   https://ZakaasStorage.blob.core.windows.net/team2/<files>
   https://ZakaasStorage.blob.core.windows.net/team3/<files>

From Spark/Hadoop these folders are referred to as follows.
If the data is stored on Blob:
    wasbs://team1@ZakaasStorage.blob.core.windows.net/<files>
    wasbs://team2@ZakaasStorage.blob.core.windows.net/<files>
    wasbs://team3@ZakaasStorage.blob.core.windows.net/<files>

If the data is stored on ADLS Gen2 (note the dfs endpoint instead of blob):
    abfss://team1@ZakaasStorage.dfs.core.windows.net/<files>
    abfss://team2@ZakaasStorage.dfs.core.windows.net/<files>
    abfss://team3@ZakaasStorage.dfs.core.windows.net/<files>

wasbs and abfss are the URI schemes of the WASB and ABFS drivers. The trailing "s" means the connection is secured with TLS.
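The URI patterns above can be sketched as a small helper. This is only an illustration: the function name `storage_uri`, the `kind` parameter, and the sample path are assumptions made for this example and are not part of any Azure library. It relies on Blob using the `blob` endpoint and ADLS Gen2 using the `dfs` endpoint.

```python
def storage_uri(account: str, container: str, path: str = "", kind: str = "blob") -> str:
    """Build a Hadoop/Spark-style URI for a container in a storage account."""
    # Blob uses the wasbs scheme with the blob endpoint;
    # ADLS Gen2 uses the abfss scheme with the dfs endpoint.
    schemes = {"blob": ("wasbs", "blob"), "adls": ("abfss", "dfs")}
    if kind not in schemes:
        raise ValueError(f"unknown storage kind: {kind!r}")
    scheme, endpoint = schemes[kind]
    return f"{scheme}://{container}@{account}.{endpoint}.core.windows.net/{path.lstrip('/')}"


# Print the ADLS URI of a (hypothetical) file for each team container.
for team in ("team1", "team2", "team3"):
    print(storage_uri("ZakaasStorage", team, "sales/2024.csv", kind="adls"))
```

A string produced this way is what you would typically pass as the path when reading or writing from Spark, once the account credentials are configured.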

Account, Containers, Blobs:
wasbs://team1@ZakaasStorage.blob.core.windows.net/<files>
In the above example, ZakaasStorage is the account name and team1 is a container (team1 to team3 are all containers). Inside the containers there are files, which are called blobs.
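To make the account/container/blob split concrete, here is a minimal sketch that takes a URI apart. It assumes the `<container>@<account>` form used in the Spark/Hadoop examples earlier; the function name `split_storage_uri` and the sample file path are made up for illustration.

```python
from urllib.parse import urlsplit


def split_storage_uri(uri: str) -> dict:
    """Split a wasbs/abfss URI into scheme, account, container and blob path."""
    parts = urlsplit(uri)
    # The network location looks like "<container>@<account>.<endpoint>.core.windows.net".
    container, _, host = parts.netloc.rpartition("@")
    if not container:
        raise ValueError("expected <container>@<account> in the URI")
    account = host.split(".", 1)[0]
    return {
        "scheme": parts.scheme,
        "account": account,
        "container": container,
        "blob": parts.path.lstrip("/"),
    }


print(split_storage_uri("wasbs://team1@ZakaasStorage.blob.core.windows.net/raw/orders.csv"))
```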

Why is a container a separate entity?
The root-level folders are called containers. Each container has its own identity within the same account. Containers are mostly created to organize data. They also help establish security: each team/dept should have its own secured storage that is not accessible to other teams/depts, and containers make this possible. We could create separate Blob/ADLS accounts for the purpose, but that would mean more effort and cost in managing the storage.

How to connect storage:

Similar to a username/password, a storage account can be accessed using the account name and a key. The key is a random string generated by Azure, and it can be changed at any time. We are not allowed to provide our own text as the key. The process of changing the keys is called regenerating (or rotating) the keys.
  • With Key – Explained above. If this key is compromised (shared), whoever possesses it has full control over the storage.
  • AAD – Using your own Azure AD login. This is meant for the dev/support phase only; you should not use your own credentials in code.
  • SAS URL – A SAS URL carries a token (signed with a key) that is valid for a particular period and grants only selected permissions. You can restrict the holder of a SAS URL to read/write/list etc. operations. A SAS can be generated at account level as well as at container and file level. Note that this is not the same as an ACL (Access Control List); POSIX-style ACLs are a separate ADLS Gen2 feature.
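The restrictions carried by a SAS URL are visible in its query string: `sp` holds the permitted operations and `se` the expiry time. The sketch below only reads those two fields. The URL, the signature value `PLACEHOLDER`, and the function name `describe_sas` are made up for illustration; a real SAS URL is generated by Azure, and this code does not validate the signature.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit


def describe_sas(url: str, now: datetime) -> dict:
    """Report the permissions and expiry state encoded in a SAS URL."""
    query = parse_qs(urlsplit(url).query)
    permissions = query.get("sp", [""])[0]   # e.g. "rl" = read + list
    expiry_text = query["se"][0]             # UTC timestamp, e.g. 2030-01-01T00:00:00Z
    expiry = datetime.strptime(expiry_text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return {
        "can_read": "r" in permissions,
        "can_write": "w" in permissions,
        "can_list": "l" in permissions,
        "expired": now >= expiry,
    }


# Hypothetical container-level SAS URL that allows read and list only.
sas_url = (
    "https://ZakaasStorage.blob.core.windows.net/team1"
    "?sv=2022-11-02&sr=c&sp=rl&se=2030-01-01T00:00:00Z&sig=PLACEHOLDER"
)
print(describe_sas(sas_url, datetime.now(timezone.utc)))
```

Because the permissions and the expiry are part of the signed token, a holder cannot simply edit `sp` or `se` to gain more access; the service would reject the changed URL.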

Blob vs ADLS Storage:
Blob storage accounts come in 2 versions, V1 and V2, and ADLS comes in 2 generations, Gen1 and Gen2. You need not learn Blob V1, which is old. ADLS Gen2 is built on top of Blob V2. Many organizations still use ADLS Gen1 and are now migrating to Gen2.

All the sections above apply to Blob V1, Blob V2, and ADLS Gen2. ADLS Gen1 is a different kind of storage, which we do not cover because it is old (though still used by many organizations). The main differences between Blob and ADLS Gen2 are:



| Feature | Blob Storage | ADLS Gen2 Storage |
| Purpose | Store objects like media files, unstructured data, archival data, log files, streaming data | Store large files (terabytes/petabytes), typically for analytics |
| Folder | Folders are not supported (except containers at root level). Files can be uploaded with a folder path, but those folders are virtual. Empty folders are not permitted. | Folders can be created at any level. |
| Security | Key, AAD, SAS URL | Key, AAD, SAS URL, Service Principals, POSIX-style ACLs (we will cover these in advanced courses) |
| Structure | Flat (since folders are not allowed) | Hierarchical (since folders are allowed) |
| Root Folders | Called Containers | Called File Systems |
| Data stored as | Block Blob, Page Blob | Files |
| Driver | WASB | ABFS (for Gen1: ADL) |
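The "virtual folder" behaviour of Blob storage can be sketched in plain Python. In a flat namespace only the full blob names exist, and the folders are derived from the "/" separators in those names. The function name `list_virtual_folders` and the sample blob names are made up for this illustration, and no Azure service is called.

```python
def list_virtual_folders(blob_names, prefix=""):
    """Derive the 'folders' directly under prefix from flat blob names."""
    folders = set()
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        # A folder exists only because some blob name continues past a "/".
        if "/" in rest:
            folders.add(prefix + rest.split("/", 1)[0] + "/")
    return sorted(folders)


blobs = [
    "raw/2024/orders.csv",
    "raw/2024/items.csv",
    "curated/orders.parquet",
    "readme.txt",
]
print(list_virtual_folders(blobs))           # ['curated/', 'raw/']
print(list_virtual_folders(blobs, "raw/"))   # ['raw/2024/']

# Removing the only blob under "curated/" makes that folder disappear,
# which is why empty folders cannot exist in a flat namespace.
blobs.remove("curated/orders.parquet")
print(list_virtual_folders(blobs))           # ['raw/']
```

In ADLS Gen2, by contrast, a folder is a real object in the hierarchy, so it can exist while empty.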