Big Data Terminologies
Only the essential terms are listed here.
Layers -- Logical separation of storage and compute into stages for storing, processing (including transformation), consuming and reporting on data.
Data Ingestion -- Extracting data from a source and storing it across the layers of a data lake. Data can be pulled from the source or pushed by it.
Data Orchestration -- Includes connections to source and target, mappings, transformations and other functions (logging, alerting, notification, auditing, etc.).
Batch Data Load -- Data that is created, stored, processed, analyzed and visualized on a scheduled basis (daily, weekly, etc.) is known as batch data.
Real Time Data Load -- Data that is created, stored, processed, analyzed and visualized instantly, i.e. within milliseconds, is known as real-time data.
Near Real Time Data Load -- Data that is created, stored, processed, analyzed and visualized with a short delay (hourly, every x minutes) is known as near real-time data.
Latency -- Delay in processing and/or transferring data. Low latency means data arrives quickly; high latency means it arrives slowly.
Ingress -- Data transfer into your network or region from outside.
Egress -- Data transfer out of your network or region to the outside.
VM -- Virtual machine, e.g. a remote machine that is accessible using RDP or SSH.
Storage -- Disk or web service used to store data in any format. Includes file storage, HTTP storage, queues, shared drives and anywhere else data is stored.
Compute -- Use of CPU, memory and I/O. For example, moving data or triggering a job requires processing power. Databricks and VMs are examples of compute.
ELT -- Extract, Load, Transform. Data is extracted from the source, loaded into the target as-is and transformed there.
Spark -- An open-source engine for fast, distributed data processing, often deployed alongside Hadoop. It is covered in detail in upcoming courses.
NoSQL -- Data that is not relational or does not have a fixed schema. E.g. a bicycle and a truck are both vehicles but have different properties, and both can be stored in the same database in NoSQL format.
PaaS -- Platform as a Service -- For example, you use a SQL Server instance that is available over the internet and that you do not need to manage (installation, patching, etc.). You only connect to it and run queries (plus a little user management).
IaaS -- Infrastructure as a Service -- For example, you get a remote machine that is available over the internet. You are required to manage its security, install updates, protect it from viruses and malware, etc.
SaaS -- Software as a Service -- For example, you use an email service that is available over the internet. Gmail, ServiceNow and Jira are examples of SaaS.
Avro/Parquet/ORC -- These are file formats (like CSV, JSON, TXT) for storing data, used especially in the big data world. They are NOT mandatory for small or medium-sized data. For large data, these formats boost performance.
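To make the ELT and NoSQL entries above more concrete, here is a minimal Python sketch; it is an illustration under stated assumptions, not a production pattern. The sample records, the raw_layer list standing in for data lake storage, and the function names extract, load and transform are all hypothetical.

```python
import json

# Hypothetical source records: both are vehicles, but each has
# different properties (no fixed schema, as in the NoSQL entry).
SOURCE = [
    '{"type": "bicycle", "gears": 21, "frame": "aluminium"}',
    '{"type": "truck", "payload_tons": 12.5, "axles": 3}',
]


def extract(source):
    """Extract: pull the raw records from the source unchanged."""
    return list(source)


def load(raw_records, raw_layer):
    """Load: store every record in the raw layer without reshaping it."""
    raw_layer.extend(json.loads(record) for record in raw_records)


def transform(raw_layer):
    """Transform: runs after loading, on data already in the target."""
    return [
        # Count the properties of each vehicle, excluding the "type" field.
        {"type": doc["type"], "property_count": len(doc) - 1}
        for doc in raw_layer
    ]


raw_layer = []  # stands in for the raw layer of a data lake (assumption)
load(extract(SOURCE), raw_layer)
curated_layer = transform(raw_layer)  # stands in for a curated layer
print(curated_layer)
```

The order of the calls is the point: the records reach the raw layer first and are transformed afterwards, which is what distinguishes ELT from transforming data before it is loaded.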