data engineering - Data Valley

Dimensional Modeling | Part 1: Introduction and Fact Types

Dimensional modeling is one of the data modeling techniques used for designing data warehouses. It is also considered a suitable technique for representing analytic data, because it delivers data in a form users can understand and is optimized for query performance, which speeds up data retrieval. Normalized databases are very useful in transactional processing because […]

ETL vs ELT | Differences and Use Cases

1. What is ETL? ETL stands for Extract, Transform, and Load. The ETL process starts by extracting data from one or multiple sources, then transforms this data to match the data warehouse schema, and finally loads the transformed data into the data warehouse. An ETL system should enforce data quality and consistency standards and ensure that separated data […]
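The three steps described in this excerpt can be sketched in a few lines of Python. This is a minimal illustration and not the article's own code: the sample CSV, the `fact_orders` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read from files, APIs, or databases.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,3
3,carol,not_a_number
"""


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean values to match the target schema, skipping bad rows."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # simple data quality rule: drop rows with unparsable amounts
        clean.append((int(row["order_id"]), row["customer"].strip().lower(), amount))
    return clean


def load(rows, conn):
    """Load: write the transformed rows into the (assumed) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()


# In-memory SQLite stands in for the data warehouse in this sketch.
warehouse = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), warehouse)
loaded = warehouse.execute(
    "SELECT order_id, customer, amount FROM fact_orders ORDER BY order_id"
).fetchall()
print(loaded)
```

The transformation happens before the load, so only rows that pass the quality check ever reach the target table.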

Denormalization: When, Why, and How?

What is de-normalization? De-normalization is an optimization technique that makes a database respond faster to queries by reducing the number of joins needed to satisfy user needs. In de-normalization, we mainly aim to reduce the number of tables needed by joining these tables back together and adding redundant data. De-normalization is commonly used with […]
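As a rough sketch of the idea in this excerpt, the Python snippet below builds two normalized tables in an in-memory SQLite database and then a denormalized copy that carries the customer columns redundantly. The table names, column names, and sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized design: customer attributes live in their own table.
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Alice', 'Cairo'), (2, 'Bob', 'Giza');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 5.0), (12, 2, 7.5);

    -- Denormalized design: the join is done once and its result is stored,
    -- so customer name and city are repeated on every order row.
    CREATE TABLE orders_denorm AS
    SELECT o.order_id, o.amount, c.customer_id,
           c.name AS customer_name, c.city AS customer_city
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id;
    """
)

# The same question answered against both designs.
normalized = conn.execute(
    "SELECT c.city, SUM(o.amount) FROM orders AS o "
    "JOIN customers AS c ON c.customer_id = o.customer_id "
    "GROUP BY c.city ORDER BY c.city"
).fetchall()
denormalized = conn.execute(
    "SELECT customer_city, SUM(amount) FROM orders_denorm "
    "GROUP BY customer_city ORDER BY customer_city"
).fetchall()
print(normalized, denormalized)
```

Both queries return the same totals, but the second needs no join. The cost is the redundancy: if a customer's city changes, every matching row in `orders_denorm` has to be updated.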

Normalization in Depth

Designing and understanding a data model is all about understanding the concepts and the options available for your use case, and which use case each design option suits best. In this article we will go through the normalization types and understand how to implement each option and pros and cons […]

Building a data pipeline using Dataflow | GCP Dataflow

Data uncovers deep insights, supports informed decisions, and enables efficient processes. But when data comes from various sources, in varying formats, and is stored across different infrastructures, data pipelines are the first step to centralizing data for reliable business intelligence, operational insights, and analytics. By contrast, the data pipeline is a […]

Introduction to Impala: Architecture and Components | Impala

Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Cloudera Impala query UI in Hue) as Apache Hive. This provides a familiar […]

Dimensional Modeling: Design Methodology for Analytics-Oriented Data Warehouse

Data warehouses have been around since the 80s. Throughout these years, they have proven their capability to support decision-making and business analysis. Data warehouses allow the integration of many source systems such as databases, spreadsheets, and flat files. Cleansing and transformation can be applied to this data after integration, and the data is then organized in a way […]

Getting Started with Containers & Docker | Docker

Introduction: Containerization has revolutionized software development and has become a common building block in today’s architectures. Applications, big data environments, and data engineering applications can be developed and deployed inside containers. In this article, we will learn more about containers and their advantages, and we will discuss Docker, whose container images package all […]

Aggregation Queries in Apache Hive | Apache Hive

Introduction: Data aggregation is the process of gathering and expressing data in summary form to get more information about particular groups based on specific conditions. HiveQL offers several built-in aggregate functions, such as max, min, and avg. It also supports advanced aggregation using functions such as variance and standard deviation, and different types of window functions. […]

Quick Reference to six D’s of the data field

For any professional or beginner in the data field, regardless of the specialty or technology you work with, you will hear about one or more of the following concepts, and it is important for any data professional to know at least the general idea behind each of them. […]

Analyze COVID-19 Dataset with Databricks

In this article, we will analyze a COVID-19 dataset using the community edition of the Databricks unified analytics platform, which is completely free and which you can use as a playground to test Apache Spark applications in Python or R, depending on your preferred development API. The dataset that will be used in this […]

Data Engineering Detailed Roadmap | Data Engineering

Over the past few years, data engineering has become a critical part of almost any organization that uses data heavily in its systems. I am sure you have heard a lot about the comparison between data engineers and data scientists and which is better, but actually no role is better than another; each […]
