ETL stands for Extraction Transform Load is a common concept in data engineering, and as we can imply from the name of the concept that this concept has three types of operations, Extract which indicate the process of extracting data from the source system of information, Transform to represent the process of manipulating the data to reshape the data, apply data quality rules, filter bad data, or join different pieces of data together, and finally Load operation which is the final step of the process which is moving the data after applying the required transformation to the target system which could be a file system, RDBMS, or any other types of system.
One of the very common use cases in which we can see the implementation of ETL concept is in Data Warehouse Architecture, ETL solution used to move data from different source systems then transform/reshape it according to the required logic, and load the result into the Data Warehouse model with the required granularity (level of details) for each dataset.
Now when it comes to choose the best technology vendor to implement ETL concept in your architecture, you should know it totally depends on your use cases and what you need to achieve here, here we have some parameters that will help you choose the best fit for your organization
One of the important parameters that you need to check is the type of data sources currently exists or will exist soon, ETL solution must support read and write operations from all your data source or future strategic data sources. data sources could be:
– File system, such as Text files, JSON, Parquet, XML files
– RDBMS such as Oracle, Teradata, Microsoft SQL Server
– APIs with different types
– NoSQL databases such as MongoDB, Reddis, and so on.
In case you have or you are planning to integrate with Big Data sources such as Hadoop File system HDFS, Hive, Impala, Apache Spark, and so on, you need to make sure your ETL solution support reading, writing, and processing data in this kind of source systems. so at this point, you need to ask the following:
1- Do the ETL solution support my data sources with different versions we have?
2- What kind of source systems this ETL can integrate with?
3- Is this ETL solution will support my future data sources?
Having an ETL solution that can integrate with all your data sources is great, but when it comes to performance you need to take care because bad performance can become the main bottleneck in achieving your goals to get the maximum benefits out of your investments, it will be useless to have ETL solution that will take hours and hours to process your workload. Especially in ETL, all transformation and processing load will be carried out by the ETL solution itself, unlike other concepts such as ELT where data processing is carried out by the target system. At this point you need to ask the following:
1- What current workload we have such as how many records are being processed? , how much data (size) is being processed? What is the peak workload?
2- What is the future or predicted workload in the future?
3- What is the ETL solution benchmark?
4- Do the ETL solution will be able to handle the current, future, and peak workload?
Most of the ETL work can be done manually using bash scripts, database scripts or even Powershell scripts, but in this case, we will face a different kind of problems related to complexity, operations load, slow response to business requirements, and many other issues, that’s why we need to have an ETL solution to make our life easier and make enable us to focus on our business KPIs and get insights from data fast enough to make effective decisions, and this will not be achieved unless we have a Powerful ETL solution with a rich and flexible library of transformations and easy to use GUI which will help our team to respond to business requirements very fast and not consume time and effort to think how we are going to develop ETL requirements, but instead put all time and efforts in getting new insights of data. At this point you need to ask the following:
1- What kind of transformations or data enrichment this ETL solution supports?
2- Do ETL solution supports all data manipulation needed or we still have to do manual work?
3- What percentage of manual work needs to be done in case we will choose this solution? Of course, you shouldn’t have any manual work to do, but this question to assess the ETL solution from operations and development prospectives.
4- Is this ETL solution have an easy to use and friendly GUI?
Operational efficiency is one of the most critical points to consider when evaluating an ETL solution, adding operational headache is a disadvantage in any ETL solution, simply because once development is done, now the load will become an operational load to keep the development up and running and functioning correctly. ETL solution should provide a fully automated operational interface that will enable the operational team to monitor, enhance, and control production workload easily. At this point you need to ask the following:
1- Do the ETL solution provide a friendly and easy to use interface to manage daily production operations?
2- Do ETL solution provides automation capabilities, to automate daily production operations?
3- Do the ETL solution provides a scheduler to control execution times for different processes?
4- Do the ETL solution provides a notification system integration such as Emails, SMS, and so on?
5- Do we need to do any manual operational process after we select this ETL solution?
Recovery from Failures
Some ETL batches can take hours and hours due to large size of data to process, in this case, the ability of ETL solution to recover from failure and to restart the ETL batches from the point of failures is very critical and can save hours of re-executing the whole ETL workload from the beginning. In general, recovery from failure is a big plus to ETL solutions, especially in the type of organizations that expect heavy data processing on almost a daily basis. At the point you need to ask one simple question:
Do the ETL solution supports recovery from failures? What is the mechanism for the recovery?
Of course, Budget is one of the main factors of any decision in any type of organizations, and this parameter will always have a heavyweight when compared with the other factors, especially when you have more than one solution that is technically close when compared using the parameters we mentioned earlier, and at this point, you can ask the following questions:
1- What is the licensing scheme for the ETL solution?
2- Do we have to pay each year for our license?
3- Is the license per user or per service provided?
Famous ETL Solutions
Following are some of the famous vendors who provide ETL solutions beside other data integration solutions, have a look and try to apply the parameters we mentioned on each solution, enjoy 🙂
Oracle Data Integrator (ODI)
Microsoft SQL Server Integration Services