Big Data - Data Valley

Introduction to Impala: Architecture and Components | Impala

Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Cloudera Impala query UI in Hue) as Apache Hive. This provides a familiar […]

Analyze COVID-19 Dataset with Databricks

In this article, we will analyze a COVID-19 dataset on the Databricks unified analytics platform, using its Community Edition, which is free and can serve as a playground for testing Apache Spark applications in Python or R, depending on which API you prefer. The dataset that will be used in this […]

Apache Hive Table Types | Apache Hive

Apache Hive is designed to give data engineers and data scientists SQL-like access to the big data available in a Hadoop cluster, so we can think of it as a normal RDBMS. In a normal RDBMS we have databases and tables; in Hive we have the same, except that in Hive we have two […]

Introduction to Hive | Apache Hive

Hive was initially developed by Facebook in 2007 to help the company handle massive amounts of new data. At the time Hive was created, Facebook had a 15TB dataset they needed to work with. A few short years later, that data had grown to 700TB. Their RDBMS data warehouse was taking too long to process […]

Apache Kafka and Apache Spark Integration | Apache Kafka | Apache Spark

Introduction Apache Kafka is a scalable, high-performance, low-latency platform that allows reading and writing streams of data like a messaging system. We can start writing Kafka applications in Java fairly easily; see our previous article on how to design a Kafka pipeline in Java. If you research the variety of real-world use cases for Kafka, you will very […]

Create a Kafka Pipeline using Java Application | Apache Kafka

Introduction This article is about programming an Apache Kafka producer and consumer in Java. As we’ll see, with Java we can reproduce what the CLI does and more. Prerequisites Kafka installation and configuration article (to set up the cluster that will be used in this article) Any Java programming editor, e.g. (NetBeans – IntelliJ […]

Setup Apache Flink Environment Standalone on Windows | Apache Flink

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. For an introduction to Apache Flink components, please see our previous article. In this article we will learn how to set up and run Apache Flink in standalone mode. Run Apache Flink Standalone Flink has been designed to […]

Introduction to Apache Flink | Apache Flink

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale. Apache Flink is a powerful open source engine that provides: batch processing, interactive processing, real-time (streaming) processing, graph […]

Setup Apache Kafka Environment | Apache Kafka

Introduction This article is about configuring and starting an Apache Kafka server on Windows and Linux. This guide also provides instructions for setting up Java and Apache ZooKeeper, and after the setup we will create a simple pipeline to test our installation. Kafka on Windows Make sure you have the following prerequisites […]

Apache Kafka Components

What Is Apache Kafka? Apache Kafka is an open source project, initially created by LinkedIn, that is designed to be a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design, which we will investigate in more detail in this article. Kafka was designed with a […]

Migrate Files from local files system to Amazon S3 with Python Application

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. S3 fits a range of use cases, such as websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics. Amazon S3 provides easy-to-use management features so you can organize […]

Create Scala Project on Intellij with Scala Worksheets

Scala is a multi-paradigm language that supports both functional and object-oriented programming. With a growing community and many useful features, Scala is worth learning, and it has been adopted by big enterprises such as LinkedIn, Twitter, and many others. When you start experimenting with Scala, you can use the interactive Scala REPL (Read-Evaluate-Print Loop) […]

Category: Big Data