loader image

Implement SCD Types 2 on Talend Open Studio

Introduction

In this article, we will explore together how to use Talend data integration capabilities to implement one of the most important use cases in Data Warehouse implementation which is Slowly Changing Dimensions (SCD) tables.

Before moving on and following the next steps, make sure you have and running Talend solution, you can check our previous article on how to set Talend on a Linux environment

What is SCD?

A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. It is considered and implemented as one of the most critical ETL tasks in tracking the history of dimension records.

What are the types of SCD? 

  • Type 0 Scd – Retain Original
    The Type 0 dimension attributes never change, and are assigned to attributes that have durable values or are described as ‘Original’. Examples: Date of Birth, Original Credit Score. Type 0 applies to most Date Dimension attributes.
  • Type 1  SCD – Overwriting
    In a Type 1 SCD: the new data overwrites the existing data. Thus the existing data is lost as it is not stored anywhere else. This is the default type of dimension you create. You do not need to specify any additional information to create a Type 1 SCD.
  • Type 2 SCD – Creating another Dimension [KEEP ALL HISTORY Data ]
    A Type 2 SCD retains the full history of values. When the value of a chosen attribute changes, the current record is closed. A new record is created with the changed data values and this new record becomes the current record. Each record contains the effective time and expiration time to identify the period between which the record was active.
  • Type 3 SCD – Creating Current Value Field [KEEP ONLY ONE VALUE OF HISTORY]
    A Type 3 SCD stores two versions of values for certain selected level attributes. Each record stores the previous value and the current value of the selected attribute. When the value of any of the selected attributes changes, the current value is stored as the old value and the new value becomes the current value.

In this article, we will track data changes using Slowly Changing Dimension type 2, we will walk through step by step how to create a Talend data integration job to track attribute updates in a database table

Scenario

This scenario describes a Job that stores and manages both the current and historical employee data in a MySQL table following SCD (Slowly Changing Dimensions) design. In the table below Employee data contain various details such as age, name, Role, salary,…etc.

idnameagerolesalarycity
1Mark Smith30Tester11000.00NewYork
2Thomas Johanson32Developer12000.00Paris
3Teddy Brown23Engineer13000.00Cairo
Source Table before the change

We will insert the data into the MySQL table using SCD 
Then we will update the MySQL table using SCD to be like that

idnameagerolesalarycity
1 Mark Smith31Tester11000.00NewYork
2Thomas Johanson32Developer12000.00Paris
3Teddy Brown23Data Engineer13500.00Giza
Source table after change

As you see the age of Mark Smith is updated from 30 to 31, the Role of Teddy Brown  is changed from Engineer to Data Engineer, and his salary raised from 13000 to be 13500

In this Scenario

  • We want to retain the full history of the role data, always create a new record with  the changed data, and close the previous record, so will perform Type 2 – SCD 

Let’s Start Our Journey:

Start the project

  • Start Talend Open Studio
  • open a new project
  • start a new job
  • add  tMysqlConnection component  from  palette
  • add two component of tFixedFlowInput ,tDBSCD,tDBInput,tLogRow, components  by typing their names in the dropping them from the palette .
  • add tMysqlClose to the design workspace

Link the component

  • Link the first tFixedFlowInput component to the first tDBSCD component using a Row > Main connection .
  • Link the First tDBInput component  to the first tLogRow Using a Row > Main Connection .
  • Link  the second tFixedFlowInput component to the second tDBSCD Component using a Row > Main .
  • Link the second tDBInput to the second tLogRow component using a Row> Main .
  • Do the same to link the first tFixedFlowInput component to the first tMysqlInput component, the first tDBInput component to the second tFixedFlowInput component, the second tFixedFlowInput component to the second tDBInput component, and the second tDBInput component to the tDBClose component.

Configure the components

  • Double click the component to open it’s basic setting view .
  • In the Host, Port, Database, Username, and Password     fields, enter the information required for the connection to the MySQL database

insert the employee Data into mysql table using SCD

  • Configure the tFixedFlowInput component and the tDBSCD
  • Double Click on tFixedFlowInput  to open the Basic View
  • Click the button next to Edit schema and in the pop up window will appear then add the columns : idage with Integer type name ,role type of String and salary type of Double
  • When Done click ok  to save the changes .
  • In the pop up dialog box click yes .
  • In the mode area , select Useonlinecontent ,then in the content field insert the following data 

New York;1;Mark Smith;30;tester;11000.00
Paris;2;Thomas johnson;31;developer;12000.00
Cairo;3;Teddy Brown;23;Engineer;13000.00

Configure tDBSCD_1 Component

  • Click the tDBSCD_1 to open its basic view.
  • Select  Use an existing connecting check box and choose tDBconnection_1
  • In the table field enter the name of scd_type2_data.
  • choose the Create table if does not exist
  • Click the button next to  SCD Editor.
  • In the name field of the surrogate Key panel, enter the name S_K in the Example.
  • From the unused drag the following :
    • Id to the Source key field.
    • Name, Role to the  Type 0 field ,no action will perform after the changing of the data  .
    • Age,Salary,city to the Type 2 field , to perform type 2 SCD .
  • In the Versioning panel ,select the version check box to hold the version numbers for the historical and current records in the SCD table, and select also the active check box to add the column that will hold the True value for the current active record or the False value for the historical records in the SCD table.
  • When done, click OK to save the changes and close the SCD editor .

Retrieving the inserted employee data from MySQL

Configure  the tDBInput_1 and the tLogRow_1 to retrieve the inserted employee data
  • Double click the tDBInput_1 to open basic view .
  • Select the use an existing connecting , from the drop down list choose the connection component you configured .
  • In the table name field  enter the name of the table that the data will retrieve from .
  • When Done click Ok , then the pop up window will appear click yes.
  • Click the Guess Query to fill the query field with generated columns and schema .
  • Double click on tLogrow_1 ,to open the basic view .
  • In the mode area ,select Table(print values in cells of table ) for better readability of the result .

Update the employee data in mysql using SCD

Configure tFixedFlowInput_2,tDBSCD_2
  • Configure tFixedFlowInput_2 .
  • Double click on tFixedFlowInput_2 to open basic view .
  • Click button next to Edit schema , enter the schema columns as we did first time on tFixedFlowInput_1 .
  • In mode panel choose use Inline Content , then enter the following data .

1;Mark Smith;31;developer;11500.00;New York
2;Thomas Johanson;31;tester;12000.00;paris
3;Teddy Brown;23;engineer;13500.00;giza

Configure tDBSCD_2
  • Double click on tDBSCD_2 to open it’s basic view .
  • Select Use an Existing connection ,choose the connection you configured before .
  • In the table field enter the name of the table ,choose create the table if it does not exist .
  • Click the button next to SCD Editor .
  • In the name field of surgget’s keys panel , enter the name .
  • Drag the columns as we added it before
    • ID at the source keys .
    • Name,Role  at the Type 0 SCD , no action will perform  upon dimension changed .
    • Age,Salary at the Type 2 ,to perform Type 2 SCD .
  • In the Versioning panel, select the version check box to hold the version numbers for the historical and current records in the SCD table, and select also the active check box to add the column that will hold the True value for the current active record or the False value for the historical records in the SCD table.
  • When done, click OK to save the changes and close the SCD editor.

  Retrieving the updated data 

Configure  the tDBInput_2 and the tLogRow_2 to retrieve the inserted employee data
  •  Double click the tDBInput_2 to open it’s basic view .
  • Choose use an existing connection,select your config connection .
  • Click the button  next to Edit schema ,will open a window to insert the columns .
  • Click on Guess Query 
  • Double click  the tLogRow_2 to open its Basic settings view .
  • In the Mode area, select Table (print values in cells of a table) for better readability of the result displayed on the console .

Close the Connection

  • Double click on the tDBClose_1 , to open it’s basic view .
  • From the component list select the tDBConnection_1 .

Execution the Job

  • Save the job , press ctrl+s .
  • Press F6 to execute the job .

Facebook
Twitter

Leave a Reply

Your email address will not be published. Required fields are marked *

Unlimited access to educational materials for subscribers

Ask ChatGPT
Set ChatGPT API key
Find your Secret API key in your ChatGPT User settings and paste it here to connect ChatGPT with your Tutor LMS website.
Hi, Welcome back!
Forgot?
Don't have an account?  Register Now