Data Engineering - Data Valley

Create a Kafka Pipeline using Java Application | Apache Kafka

Introduction: This article is about programming an Apache Kafka producer and consumer in Java. As we’ll see, with Java we can reproduce what the CLI does and more. Prerequisites: the Kafka installation and configuration article (to set up the cluster used in this article); any Java programming editor, e.g. (NetBeans – IntelliJ […]

Setup Apache Flink Environment Standalone on Windows | Apache Flink

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. For an introduction to Apache Flink components, please check our previous article. In this article we will learn together how to set up and run Apache Flink in standalone mode. Run Apache Flink Standalone: Flink has been designed to […]

Introduction to Apache Flink | Apache Flink

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale. Apache Flink is a powerful open source engine that provides: batch processing, interactive processing, real-time (streaming) processing, graph […]

Azure Storage Account | Microsoft Azure

Storage Account: A storage account is a container that groups a set of Azure Storage services together. Only data services from Azure Storage can be included in a storage account (Azure Blobs, Azure Files, Azure Queues, and Azure Tables). A storage account is an Azure resource, so it can be grouped under a resource group. Under […]

Setup Apache Kafka Environment | Apache Kafka

Introduction: This article is about configuring and starting an Apache Kafka server on Windows and Linux. This guide also provides instructions for setting up Java and Apache ZooKeeper, and after the setup we will create a simple pipeline to test our installation. Kafka on Windows: Make sure you have the following prerequisites […]

ER vs Dimensional Modeling simplified under 10 Minutes

In this video we will go through the main differences between ER modeling and dimensional modeling using simple, straightforward examples, and we will see why dimensional modeling matters in data warehouse design.

Apache Kafka Components

What Is Apache Kafka? Apache Kafka is an open source project, initially created by LinkedIn, that is designed to be a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design, which we will investigate in more detail in this article. Kafka was designed with a […]

Dimension Keys – Part 1 – Natural Keys | Data Warehouse

Dimension tables are a core part of any data warehouse model. In general, dimension tables store the descriptive details of an event or business process. For example, for a purchase at a retail store we will have dimension tables to store customer information, product information, store information, and so on. On the other hand, fact tables […]

Migrate Files from local files system to Amazon S3 with Python Application | AWS S3 | Python

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. S3 fits a wide range of use cases, such as websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics. Amazon S3 provides easy-to-use management features so you can organize […]

Create Scala Project on Intellij with Scala Worksheets | Scala

Scala is a multi-paradigm language that supports both functional and object-oriented programming. With a growing community and many useful features, Scala is worth learning, and it has been adopted by big enterprises such as LinkedIn, Twitter, and many others. When you start experimenting with Scala you can use the Scala interactive REPL (Read-Evaluate-Print Loop) […]

Setup Apache Spark environment on Windows | Apache Spark

Apache Spark is an easy-to-use, unified platform for big data processing, equipped with a rich set of APIs for different application needs, such as Spark DataFrame and Spark SQL for structured data processing, Spark Streaming and Structured Streaming for streaming applications, Spark MLlib for machine learning applications, Spark GraphX for graph analytics […]

Apache Spark Application Execution Mode | Apache Spark

Apache Spark is a powerful processing platform for big data applications that supports different types of big data processing. In this article we will discover together how an Apache Spark application can be executed in multiple modes, depending on the environment architecture and the application requirements. Before going into details, if you would like to set up […]

Category: Data Engineering