4 - Data Orchestration

You are already aware of ETL (Extract, Transform, Load). Data Orchestration is a broader, cloud-era evolution of ETL. In ETL you connect to known, in-network sources, transform the data, and then load it to a target that is also in-network. In Data Orchestration you likewise connect to a source, but typically load the data first and transform it afterwards (ELT).

What makes Data Orchestration different from ETL is:

1) Varied sources: On the cloud you have both connected and non-connected sources. In other words, data sources include on-prem databases, FTP/SFTP, APIs, external cloud storage and so on.

2) Varied data formats: Each source can deliver data in a different format, such as CSV, JSON or a direct DB connection.

3) Strong security: In scenarios where data is pushed into a data lake, the source team should not be able to tamper with existing data. Within the organization, too, the data load jobs must access only their designated storage locations; under no circumstances should a job be able to reach an unintended location.

4) Decision making: Alerting and notifying designated contacts per configured rules, halting data load jobs, the ability to rerun without impacting performance, and auditing are all built-in features of Data Orchestration.

5) Monitoring: A built-in, rich UI for monitoring, with support for log collection and analysis.
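To make points 3 and 4 concrete, here is a conceptual sketch in Python of what "decision making" looks like inside an orchestrator: retries, alerting on failure, and an audit trail. All names here are hypothetical; in a real tool such as ADF these behaviors are configured, not hand-coded.

```python
# Conceptual sketch only: retry, alerting and auditing as an orchestrator
# might apply them around a data load job. Every name is made up for
# illustration; real orchestration tools provide this as built-in config.

def run_with_orchestration(job, max_retries=2, alert=print):
    """Run a job with rerun ability, failure alerts and a simple audit trail."""
    audit = []
    for attempt in range(1, max_retries + 2):
        try:
            result = job()
            audit.append(("success", attempt))
            return result, audit
        except Exception as exc:
            audit.append(("failure", attempt))
            alert(f"Attempt {attempt} failed: {exc}")  # notify designated contact
    return None, audit  # job halted after exhausting retries

# Example: a flaky load job that succeeds on its second attempt.
calls = {"n": 0}
def flaky_load():
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient source error")
    return "loaded"
```

Running `run_with_orchestration(flaky_load)` alerts once, then succeeds on the rerun, and the audit list records both attempts.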


Azure Data Factory:

Zakaas Inc now wants to learn more about ADF.

ADF is a Data Orchestration tool provided by Microsoft Azure. It is similar to your existing ETL tool. Let us correlate ETL and ADF. See the picture below.



There is no need to install any IDE on your machine. ADF is a cloud offering, i.e. you create pipelines, triggers (and other objects) by logging into portal.azure.com or adf.azure.com. However, it also supports creating ADF objects via PowerShell, the Azure REST API and SDKs.
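As a taste of the REST route, the sketch below builds the Azure Resource Manager URL used to create or update a pipeline. The subscription, resource group, factory and pipeline names are placeholders, and you should confirm the current `api-version` in the official REST API reference.

```python
# Sketch: constructing the ARM REST URL for creating/updating an ADF pipeline.
# The identifier values below are illustrative placeholders; the api-version
# shown is a commonly documented one, but verify it before use.

def pipeline_url(subscription_id, resource_group, factory, pipeline,
                 api_version="2018-06-01"):
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.DataFactory"
        f"/factories/{factory}"
        f"/pipelines/{pipeline}"
        f"?api-version={api_version}"
    )

url = pipeline_url("my-sub", "my-rg", "my-adf", "CopyDailySales")
```

A PUT request to this URL (with an authenticated token and a JSON pipeline body) is what PowerShell and the SDKs issue under the hood.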

Whatever you build in ADF is stored as JSON. That JSON can be exported to code files if you wish to keep a backup of your existing pipelines.
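To give a feel for that JSON, here is a heavily simplified sketch of a pipeline definition, written as a Python dict and serialized. The activity and dataset names are invented, and the real schema carries many more properties than shown here.

```python
import json

# Simplified sketch of an ADF pipeline as JSON. The names are made up;
# consult the ADF documentation for the full pipeline schema.
pipeline = {
    "name": "CopyDailySales",
    "properties": {
        "activities": [
            {
                "name": "CopyFromBlobToSql",
                "type": "Copy",
                "inputs": [{"referenceName": "SalesCsvDataset",
                            "type": "DatasetReference"}],
                "outputs": [{"referenceName": "SalesSqlDataset",
                             "type": "DatasetReference"}],
            }
        ]
    },
}

# Serializing the definition is how you keep a code backup of the pipeline.
backup = json.dumps(pipeline, indent=2)
```

Saving strings like `backup` to files under version control is exactly what the GitHub integration mentioned next automates for you.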

ADF can be directly integrated with GitHub for source code versioning and deployment (CI/CD).

Simple data flow with ADF:




More on ADF:
https://docs.microsoft.com/en-us/azure/data-factory/introduction

Wait, it's not over. One techie from Zakaas Inc asked: 'If ADF is on the cloud, how would it connect to my DB, which is located on our premises?'

ADF offers a connection capability called Integration Runtime (IR). There are three types of IR. The default Azure IR covers cloud-to-cloud data movement; the other two are:

1) Self-Hosted IR (SHIR): used to connect to an on-prem DB. SHIR is software installed on an on-prem machine from which the on-prem DB (or any other internal system) is reachable. Once it is installed and registered with an authentication key (the key tells ADF which machine to connect to), ADF treats the on-prem DB like any normal DB connection.

2) Azure-SSIS IR: used to execute SSIS packages on the cloud.

More on IR:
https://docs.microsoft.com/en-us/azure/data-factory/create-azure-integration-runtime
