ETL is one of the major tasks for any data engineer, and we have many solutions either on-premise or cloud solutions available in the market to implement this concept, in Microsoft Azure, Azure Data Factory is the ETL solution to implement data pipelines using data from the cloud source or data from on-premise sources, Azure Data Factory is very flexible, GUI ETL solution which provides different types of connectors to on-premise and cloud sources, in this article, we will go through the basic components of Azure Data Factory and we will implement a Full Use Case step by step.
Data Factory Components
One of the main building blocks of data factory which represents simply the ETL job. Any ETL job has three components Source Dataset(s), Transformations, Target Dataset(s). In Data Factory we call ETL job as a Dataflow
Activity is one task or a collection of tasks that can be executed sequentially or in parallel, we have different types of tasks we can execute such as a Dataflow task which we already discussed in previous point, Copy Data task, Databricks task, and so on.
Linked Service is the connection details for a specific source, to connect to any data source or an kind of services in Data Factory we need to create a Linked Service for this source or service, you can think of it as a connection statement with all required details to connect to a specific source or service.
The pipeline is a workflow to orchestrate a group of activities together in sequential or parallel order. A common use case is when you have a group of activities that are related to the same use case and affect the output data you will need to create a pipeline to control the flow of data and dependencies between activities. The conclusion is that the Pipeline is a logical grouping of activities.
In this use case, we will use Copy Data activity to migrate data from Blob Storage to SQL Server database, Objective from this use case is to walk you through the development process of Data Factory, and see in action Data Factory components.
1- Login to Azure portal
2- Click on Create a Resource
3- Search for Data Factory and click on it
4- Click on Create
5- Choose Azure Subscription for your billing, Resource group, region, name of your data factory, in our case we will name our Data Factory as dvdevelopmentdf
6- You can link your data factory development with a Git repository either Github or Azure DevOps Repository , click on Configure Git Later
7- If needed add Tag to your Data Factory, Tags enables you track expenses, for example, we added team tag to track all resources that has the same team tag value
8- Review settings and click Create
9- Wait till resource is created successfully then click on Go to Resource
10- Click on Author & Monitor
11- Next, we need to create Linked Services (Connections) to our source and target for our Copy Data Activity, to do that click on Manage
12- These two demos will show you the process of creating a Linked Service for SQL and Storage services
13- Now, we are ready to create a Copy Data Activity, go back to the main Data Factory window, click on Data Factory, then click on Copy Data
14- Enter Task details, Task Name –> Copy Data Task name, then choose Task schedule to run it once or run it based on schedule
15- Next, we will enter source details, choose source connection
Choose the path to your file
Then choose file configurations, like file format, column delimiters, row delimiters, then check on First Row Header in case the first row has the column names, and check sample of data to make sure data is readed successfully
16- Choose Target Connection, in our case it is SQL connection
17- We can create target table dynamically, or choose an existing table from the target connection, in our case we will create the table dynamically using the source file schema
18- Next, we will choose column mapping between source and target columns, also we can change data types for the output columns
19- The last step is to configure general settings for our task, check Data Consistency verification to let data factory run validation tests between source and target data, choose from the list of fault tolerance approaches, here we will choose to skip incompatible rows, Enable Logging to log task events to a destination, here we selected storage connection, then we clicked on Browse and select a directory we already created for the logs, click Next
20- Review Summary, then click Next
21- The last step is deployment step, data factory will validate our task, create a target dataset, create a pipeline for our activity, then will execute our pipeline. Note, pipeline that will be created will contain only one activity which is our copy data activity
Go to Author window, and you can see we have a pipeline with our activity
22- Check Monitor window, you can see our pipeline executed successfully
Now let’s check the data in the target table to validate our pipeline
Congratulations, you have your first pipeline executed successfully, now you have a practical sense of Data Factory process end to end and of the Data Factory components.