1. What is ETL?
ETL stands for Extract, Transform, and Load. The ETL process starts by extracting data from one or more sources, then transforms that data to match the data warehouse schema, and finally loads the transformed data into the data warehouse.
An ETL system should enforce data quality and consistency standards, ensure that separate data sources can be used together, and provide data in a ready-to-use format for end users to make decisions and for other developers to build on.
ETL is a back-room process that isn’t visible to the end user.
The first step of the ETL process, extraction, collects data from the targeted sources. Extraction is the most complicated task in the ETL process, because many sources lack the required quality or quantity, and determining which data is eligible for extraction is not easy.
Extraction needs a lot of work during the research phase: before doing anything, you should understand your data correctly. It is also a continuous process; data is normally extracted not just once but several times in a periodic manner, to supply all changed data to the warehouse and keep it up to date.
There are two types of extraction, logical and physical, and each has its own subtypes, so let us go through them.
There are two kinds of logical extraction:
- Full extraction.
- Incremental extraction.
Full extraction is used when data is extracted and loaded for the first time: data from the source is extracted completely, so the extracted data reflects all the data currently available on the source system. Full extraction is also used when the system can’t identify which data has been updated; in that situation, we take a full copy on each extraction and then start identifying changes ourselves.
Incremental extraction keeps track of data updated in the source systems since the last successful extraction, so that only new or changed records are extracted and loaded, rather than the whole dataset as in full extraction. Changes are typically tracked using a last-changed timestamp in the source systems, so the extraction tool recognizes new or updated data by the time it was added or modified.
Incremental extraction depends on the source system being able to notify us when data is updated or added, and to describe the changed or added data.
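The timestamp-based approach above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical source table `orders` with a `last_modified` column; the watermark value would normally be persisted between runs.

```python
import sqlite3

# Hypothetical source table with a last_modified timestamp column.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount REAL, last_modified TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.00, "2024-01-01T00:00:00"),
    (2, 25.50, "2024-03-01T12:00:00"),
    (3, 7.25, "2024-03-02T09:30:00"),
])

def extract_incremental(conn, last_extraction_time):
    """Extract only rows added or changed since the last successful run."""
    cur = conn.execute(
        "SELECT id, amount, last_modified FROM orders WHERE last_modified > ?",
        (last_extraction_time,),
    )
    return cur.fetchall()

# A full extraction, by contrast, would simply select every row.
changed = extract_incremental(src, "2024-02-01T00:00:00")
print(changed)  # only orders 2 and 3 were modified after the watermark
```

ISO-8601 timestamps compare correctly as strings, which is why the plain `>` comparison works here.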
There are two kinds of physical extraction:
- Online extraction.
- Offline extraction.
In online extraction, data is extracted directly from the source systems: the extraction process connects directly to the sources, with no need for any intermediate file, which is why it is called online.
In offline extraction, data isn’t extracted directly from the source systems. First it is copied to an external file, then the extraction process connects to that file and starts processing. So when you want to start transforming your data, you fetch records from the external file instead of accessing the source directly.
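A minimal sketch of the offline pattern: the source exports its rows to a flat file, and the ETL job later reads that file instead of querying the source. An in-memory buffer stands in for a real file on disk here.

```python
import csv
import io

# Rows living in some source system (illustrative data).
source_rows = [
    {"id": "1", "name": "Alice"},
    {"id": "2", "name": "Bob"},
]

# Step 1: the source system (or a scheduled export job) dumps to a flat file.
external_file = io.StringIO()  # stands in for a real file on disk
writer = csv.DictWriter(external_file, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(source_rows)

# Step 2: the ETL process fetches records from the file, not from the source.
external_file.seek(0)
records = list(csv.DictReader(external_file))
print(records)
```

Decoupling the export from the extraction means the source system is only touched once, during the dump.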
After extraction, the second stage of the ETL process is transformation, in which the data is transformed to meet the schema and requirements of the destination. This involves data mapping, linking data from two or more sources, data conversion, and data cleansing. It generally takes more than one operation, such as joins, lookups, filters, expressions, and aggregations, to convert data into a consistent format. Common transformations include:
- Removing duplicated data.
- Mapping NULL values to 0, or to the mean or median of the column.
- Format conversion, e.g. converting an ID from integer to string.
- Establishing key relationships between tables.
- Splitting columns into multiple columns.
- Joining data from multiple sources.
- Filtering rows and columns.
- Deriving new columns from existing ones.
- Aggregating data from multiple sources.
- Data validation.
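Three of the transformations listed above can be sketched with plain Python on a small batch of extracted rows. The data and field names are illustrative, not from any particular system.

```python
from statistics import mean

# Raw extracted rows: a duplicate, a NULL amount, and integer IDs.
rows = [
    {"id": 1, "amount": 10.0},
    {"id": 1, "amount": 10.0},   # duplicate to be removed
    {"id": 2, "amount": None},   # NULL to be imputed
    {"id": 3, "amount": 30.0},
]

# 1. Remove duplicated rows (keep the first occurrence).
seen, deduped = set(), []
for r in rows:
    key = (r["id"], r["amount"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 2. Map NULL amounts to the mean of the column.
known = [r["amount"] for r in deduped if r["amount"] is not None]
col_mean = mean(known)
for r in deduped:
    if r["amount"] is None:
        r["amount"] = col_mean

# 3. Format conversion: ID from integer to string.
for r in deduped:
    r["id"] = str(r["id"])

print(deduped)
```

In practice these steps run inside an ETL tool or a staging-area SQL script, but the logic is the same.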
Loading is the last step of the ETL process, when the data is loaded into the data warehouse. This process needs to load a huge amount of data in a short period, so its performance must be optimized.
If the load process fails, a recovery mechanism should be configured to restart from the point of failure without any integrity problems. There are three types of load: initial load, incremental load, and full refresh.
The initial load is a one-time process, generally done when data is populated in the data warehouse for the very first time. When loading data for the first time, we don’t care about identifying new or modified records; we generally take the whole data set from the staging area and load it into the data warehouse.
In an incremental load, we apply all the ongoing changes from the source systems to the data warehouse periodically, loading only the records that have changed or been newly inserted.
- Only updated and new records are loaded.
- It is more difficult, because the ETL system has to check for new and updated rows.
A full refresh completely erases the contents of one or more tables and loads fresh data into them.
- Truncate the data, then load it again with new and updated records.
- Easier than an incremental load, but more time-consuming.
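The difference between incremental load and full refresh can be sketched with a dictionary standing in for a warehouse table keyed by record ID; real warehouses use SQL `MERGE`/upsert and `TRUNCATE` statements for the same effect.

```python
# A toy "warehouse table" keyed by record ID (illustrative data).
warehouse = {"1": {"id": "1", "amount": 10.0}}

# Staged records from the latest extraction: one update, one brand-new row.
staged = [
    {"id": "1", "amount": 12.0},  # updated record
    {"id": "2", "amount": 25.0},  # new record
]

def incremental_load(target, records):
    """Upsert: insert new rows, overwrite changed ones; other rows survive."""
    for r in records:
        target[r["id"]] = r

def full_refresh(target, records):
    """Truncate the table, then reload it entirely from the staged data."""
    target.clear()
    for r in records:
        target[r["id"]] = r

incremental_load(warehouse, staged)
print(sorted(warehouse))  # both rows present, id 1 updated in place
```

An incremental load must identify what changed; a full refresh skips that work at the cost of rewriting everything.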
2. What is ELT?
ETL systems connect only to schema-based destinations. Modern applications aim to ingest a variety of data sources, from structured schema-based sources to unstructured ones. These different data sources rarely share an identical schema, and each time a new source is added, implementing the logic in an ETL system becomes harder and more time-consuming.
ELT is an abbreviation for Extract, Load, and Transform. It is a process that extracts data from a source system, loads it into a target system, and then transforms the data.
Unlike ETL, where data transformation takes place in a staging area before loading into the target system, ELT extracts the raw data directly into the target system and transforms it there. We can therefore work with various types of data by moving it straight from source to destination, without worrying upfront about the schema that will handle it; after loading, we can decide which transformations are needed to shape the data and integrate it. Data lakes adopt ELT.
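The load-raw-first idea can be sketched with a SQLite table holding unparsed JSON payloads; this is a toy stand-in for a data lake or cloud warehouse with a semi-structured column type, and the event data is purely illustrative.

```python
import json
import sqlite3

# ELT sketch: load raw, schema-less records into the target first,
# and only transform them later, inside the target.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE raw_events (payload TEXT)")

# Extract + Load: raw records go straight in, no upfront schema mapping.
raw = ['{"user": "alice", "clicks": 3}', '{"user": "bob", "clicks": 5}']
target.executemany("INSERT INTO raw_events VALUES (?)", [(r,) for r in raw])

# Transform (later, on demand): shape the raw data once it is needed.
events = [json.loads(p) for (p,) in target.execute("SELECT payload FROM raw_events")]
total_clicks = sum(e["clicks"] for e in events)
print(total_clicks)  # 8
```

Because the payloads were stored untouched, a new transformation can be added at any time without re-extracting from the source.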
Data lakes: a popular kind of repository capable of storing a huge amount of data without maintaining any specific structure. Because of that, data lakes can store any type of data, structured or unstructured, including data whose purpose may not be defined yet (common purposes include building dashboards, machine learning, and real-time analytics). You can extract and load any type of data into a data lake, no matter the format.
Pros and cons of ELT
Pros:
- Data lakes and cloud data warehouses adopt ELT, and this is one of the factors that led to its spread.
- Better performance and data integrity, since it runs on high-end data platforms such as Hadoop clusters and the cloud.
- ELT needs less time and fewer resources, as the data is loaded and transformed in parallel, and the volume of data can be very large.
Cons:
- There are limited tools and expertise available that fully support ELT operations.
3. ETL vs ELT
Having demonstrated ETL and ELT, let us now cover the main differences between them:
- Availability of tools and experts
- Data size
- Order of the process and loading time
- Diversity of data
- Data warehousing support
- Data lakes support
3.1 Availability of Tools and Experts
ETL is a highly supported process that has been widely used for more than 15 years; there is plenty of ready expertise, many developers, and a large community.
ELT is a different approach from traditional ETL, so not all tools and solutions support it natively.
3.2 Data Size
ETL is the better solution for working with small to medium amounts of data that require complex transformations.
ELT is the better solution for working with huge amounts of data of any type: structured, semi-structured, or unstructured.
3.3 Order of the Processes
In ETL, transformations take place after extracting the data into the staging area; after transforming, the data is loaded into the data warehouse (the target system). The data therefore passes through a multi-stage process (loading into the staging area, then transforming) before reaching the data warehouse. Because of this multi-stage process, the loading time from source to destination is longer than in ELT.
In ELT, data is loaded directly from the source into the target system (commonly a data lake); after loading, transformations are applied to whatever data is needed for analytics or any other purpose. Loading time is better than in ETL because there is no multi-stage process; the data is loaded directly.
3.4 Data Warehousing Support
ETL commonly works with data warehouses, including cloud data warehouses, but only with structured, schema-based data and OLAP.
ELT can work with a Data Warehouse, but in this case, all transformations and processing will be pushed to the Data Warehouse system.
ETL and ELT are two different processes used to meet the same requirement: preparing data so that it can be analyzed and used to make business decisions. The implementation of ELT is more complex than that of ETL, yet it is now often preferred; its design and implementation may require more effort but offer more long-term benefits. In general, ELT is an economical process because it requires fewer resources and takes less time. However, if the target system is not powerful enough for ELT, ETL may be the more suitable option.