data engineering – Page 2

Detailed Guide for String Wrangling in SQL | MySQL

Extracting information from string columns is a recurring necessity in the day-to-day tasks of Data Engineers, Data Scientists, and Business Analysts. It can be done with a programming language such as Python or with SQL, depending on your application and the task required. In this tutorial, we will discover together how […]

Apache Hive Table Types | Apache Hive

Apache Hive is designed to give data engineers and data scientists SQL-like access to the big data available in a Hadoop cluster, so we can think of it as a normal RDBMS. In a normal RDBMS we have databases and tables; in Hive we have the same, except that in Hive we have two […]

Introduction to Hive | Apache Hive

Hive was initially developed by Facebook in 2007 to help the company handle massive amounts of new data. At the time Hive was created, Facebook had a 15TB dataset they needed to work with. A few short years later, that data had grown to 700TB. Their RDBMS data warehouse was taking too long to process […]

Implement SCD Type 2 on Talend Open Studio

Introduction: In this article, we will explore together how to use Talend's data integration capabilities to implement one of the most important use cases in Data Warehouse implementation: Slowly Changing Dimension (SCD) tables. Before moving on to the next steps, make sure you have a Talend solution up and running; you can check our […]

Setup Talend Open Studio on Linux

Introduction: Talend is an open-source data integration platform. It provides different solutions and services for data integration, data quality, cloud storage, and Big Data. According to the latest Gartner report, Talend is named in the Leaders quadrant among data integration solutions. In this article, we will show you step by step how to install and configure Talend […]

How to choose your ETL solution | Data Integration

ETL, which stands for Extract, Transform, Load, is a common concept in data engineering. As the name implies, it involves three types of operations: Extract, which indicates the process of extracting data from the source system; Transform, which represents the process of manipulating the data […]

Azure Storage Account | Microsoft Azure

Storage Account: A storage account is a container that groups a set of Azure Storage services together. Only data services from Azure Storage can be included in a storage account (Azure Blobs, Azure Files, Azure Queues, and Azure Tables). A storage account is an Azure resource, so it can be grouped under a Resource Group. Under […]

ER vs Dimensional Modeling simplified under 10 Minutes

In this video, we will go through the main differences between ER modeling and dimensional modeling using simple, straightforward examples, and we will understand the importance of dimensional modeling in Data Warehouse design.

Dimension Keys – Part 1 – Natural Keys | Data Warehouse

Dimension tables are a core part of any Data Warehouse model. In general, dimension tables store the descriptive details of an event or business process. For example, for a purchase at a retail store, we will have dimension tables to store customer information, product information, store information, and so on. On the other hand, Fact tables […]

Migrate Files from the Local File System to Amazon S3 with a Python Application

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. S3 fits well in different use cases, such as websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics. Amazon S3 provides easy-to-use management features so you can organize […]

Create Scala Project on IntelliJ with Scala Worksheets

Scala is a multi-paradigm language that supports both functional and object-oriented programming. With a growing community and many useful features, Scala is worth learning, and it has been adopted by big enterprises such as LinkedIn, Twitter, and many others. When you start experimenting with Scala, you can use the Scala interactive REPL (Read-Evaluate-Print Loop) […]

Setup Apache Spark environment on Windows | Apache Spark

Apache Spark is an easy-to-use, unified platform for all kinds of big data processing, equipped with a rich set of APIs for different application needs, such as Spark DataFrame and Spark SQL for structured data processing, Spark Streaming and Structured Streaming for streaming applications, Spark MLlib for machine learning applications, Spark GraphX for graph analytics […]
