
Apache Kafka

In this article we are going through the Kafka streaming tool: we will define what Apache Kafka is, its components and structure, and its behavior, and finally end with a real streaming app.

Apache Kafka

Apache Kafka is an open-source stream-processing software platform. Written in Scala and Java, it was created by LinkedIn data engineers and handed over to the open-source community in 2011 as a highly scalable messaging system. Apache Kafka plays a significant role in the message-streaming landscape: data and logs involved in today's complex systems must be processed, reprocessed, analyzed, and handled, often in real time. The key design principles of Kafka were formed based on the growing need for high-throughput architectures that are easily scalable and provide the ability to store, process, and reprocess streaming data.

What can I use Kafka for?

To understand what Kafka is used for, let's discuss the concept of streaming first.

Streaming is the digital equivalent of the human body's central nervous system. It is the technological foundation for the 'always-on' world, where businesses are increasingly software-defined and automated, and where the user of software is more software. Technically speaking, streaming is the practice of capturing data in real time from sources like databases, sensors, mobile devices, cloud services, and software applications in the form of streams of events; storing these event streams durably for later retrieval; manipulating, processing, and reacting to the event streams in real time as well as retrospectively; and routing the event streams to different destination technologies as needed. Streaming thus ensures a continuous flow and interpretation of data so that the right information is at the right place, at the right time.

So we can use Kafka for the following use cases, among others:

  • To process payments and financial transactions in real time, such as in stock exchanges, banks, and insurance companies.
  • To track and monitor cars, trucks, fleets, and shipments in real time, such as in logistics and the automotive industry.
  • To continuously capture and analyze sensor data from IoT devices or other equipment, such as in factories and wind parks.
  • To monitor patients in hospital care and predict changes in condition to ensure timely treatment in emergencies.
  • To connect, store, and make available data produced by different divisions of a company.
  • To serve as the foundation for data platforms, event-driven architectures, and microservices.

So Kafka's implementation depends on three core capabilities:

  1. Publishing and subscribing to streams of events (streaming data), including continuously importing/exporting your data from other systems.
  2. Storing streams of events durably and reliably for as long as you want.
  3. Processing streams of events as they occur or retrospectively.

How does Kafka work?

Kafka is a distributed system consisting of servers and clients that communicate via the TCP protocol.

  • Kafka as a server: Kafka runs as a cluster of one or more servers that can span multiple data centers or cloud regions. Some of these servers form the storage layer, called the brokers; other servers run Kafka Connect.
  • Kafka as a client: Kafka allows users to write distributed apps and microservices that read, write, and process streaming data in parallel, as sketched below.
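
To make this concrete, here is a minimal sketch (not from the original article) of a Java client connecting to a cluster over TCP and listing its brokers. It assumes a broker is reachable at localhost:9092 and the kafka-clients library is on the classpath:

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class ClusterCheck {
    public static void main(String[] args) throws Exception {
        // Every Kafka client starts from a list of bootstrap brokers that it
        // reaches over TCP; localhost:9092 is an assumed local broker address.
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Ask the cluster which servers (brokers) are currently members.
            admin.describeCluster().nodes().get()
                 .forEach(node -> System.out.println("Broker: " + node));
        }
    }
}
```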

Kafka components:

  • Topics
  • Producer
  • Consumer

Kafka is built on three main components, so let's take an example before we discuss the components and how they work:

On your PC, when you need to write a plan, what will you do?

You will create a folder (topic) to identify the subject and keep it distinct from other folders. Then you will create the file (event) that you will write (produce) in; when you write in the file, you are writing the data (streaming). Then, when you want to check the plan and act on it (consume), you will read the plan again.

Topics

A feed of events and streaming data can be stored and saved in containers called topics.

Topics are multi-producer and multi-consumer: a topic can have zero or more producers that write to it and zero or more consumers that read from it. Streaming data or events are not deleted after consumption; instead, Kafka retains them for a configurable period.

Topics are partitioned, meaning a topic is spread over a number of 'buckets' located on different brokers. This achieves scalability because it allows clients to read and write from/to many brokers at the same time.
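
As an illustration, a partitioned topic can be created programmatically with the admin client. This is just a sketch: the topic name "orders", the partition count of 3, and the broker address are arbitrary choices for the example, and the replication factor of 3 (explained in the next paragraphs) assumes a cluster with at least three brokers:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // "orders" is a hypothetical topic spread over 3 partitions
            // (buckets); with a replication factor of 3, each partition is
            // copied to three brokers (see the replication discussion below).
            NewTopic orders = new NewTopic("orders", 3, (short) 3);
            admin.createTopics(Collections.singleton(orders)).all().get();
        }
    }
}
```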

When a new event is published to a topic, it is appended to one of the topic's partitions. Events with the same event key (e.g., a customer or vehicle ID) are written to the same partition, and Kafka guarantees that any consumer of a given topic-partition will always read that partition's events in exactly the same order as they were written.

Every topic's data is copied according to a replication factor (commonly 3); each copy is called a replica. This replication is performed at the level of topic-partitions: each partition usually has one or more replicas, meaning that partitions contain messages that are replicated over a few Kafka brokers in the cluster.

A topic with three partitions, each with replicas on several brokers.

Every partition has one replica acting as the leader and the rest acting as followers. The leader replica handles all read/write requests for the specific partition, and the followers replicate the leader. If the leader's server fails, one of the follower servers becomes the leader by default. You should strive for a good balance of leaders, so that each broker is the leader of an equal number of partitions and the load is distributed.
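
The leader assignment can be inspected with the admin client. Here is a sketch under the same assumptions as above (hypothetical "orders" topic, broker at localhost:9092); note that the exact result accessor names vary slightly across client versions:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class ShowLeaders {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin
                .describeTopics(Collections.singleton("orders"))
                .all().get().get("orders");
            // For each partition, print which broker is the leader and
            // which brokers hold the replicas (leader plus followers).
            desc.partitions().forEach(p ->
                System.out.printf("partition %d: leader=%s replicas=%s%n",
                                  p.partition(), p.leader(), p.replicas()));
        }
    }
}
```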

When a producer publishes a record to a topic, it is sent to the partition's leader. The leader appends the record to its commit log and increments its record offset. Kafka only exposes a record to consumers after it has been committed, and each piece of data that comes in is appended to this log on the cluster.

A producer must know which partition to write to; this is not up to the broker. It is possible for the producer to attach a key to the record, dictating the partition the record should go to: all records with the same key will arrive at the same partition. Before a producer can send any records, it has to request metadata about the cluster from the broker. The metadata contains information on which broker is the leader for each partition, and a producer always writes to the partition leader. The producer then uses the key to know which partition to write to; the default implementation uses a hash of the key to calculate the partition, but you can also skip this step and specify the partition yourself.
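
The idea behind this key-based routing can be sketched as follows. This is a simplified illustration, not Kafka's actual code: the real default partitioner applies murmur2 hashing to the serialized key, while this sketch uses hashCode() only to show the principle that the same key always maps to the same partition:

```java
public class PartitionSketch {
    public static void main(String[] args) {
        // Simplified illustration of key-based partition selection.
        // Kafka's real default partitioner applies murmur2 to the serialized
        // key; hashCode() is used here only to demonstrate the principle.
        int numPartitions = 3;              // partitions in the topic
        String key = "customer-42";         // hypothetical record key
        int partition = (key.hashCode() & 0x7fffffff) % numPartitions;
        // Every record with the key "customer-42" maps to the same partition,
        // which is what preserves per-key ordering.
        System.out.println("key " + key + " -> partition " + partition);
    }
}
```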

Producer

Producers are the processes that publish (write) events to a Kafka topic. The producer decides which partition the event will be written to, and it sends the event to the leader of that partition, not to the topic as a whole: the producer requests the metadata, uses the record key to pick a partition, and asks that partition's leader to store the event.

A producer sends data to specific topics, whose partitions live on different brokers.
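
Here is a minimal producer sketch in Java, assuming the hypothetical "orders" topic from above, a broker at localhost:9092, and string keys and values. The key "customer-42" routes the event to one specific partition's leader:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("customer-42") determines the partition, so all events
            // for this customer land in the same partition, in order.
            ProducerRecord<String, String> record =
                new ProducerRecord<>("orders", "customer-42", "order created");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("stored in partition %d at offset %d%n",
                                      metadata.partition(), metadata.offset());
                }
            });
        } // close() flushes any pending records
    }
}
```

The callback reports the partition and offset the leader assigned once the record has been committed, matching the commit-log behavior described above.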

Consumer

Kafka consumers read from topics: they consume the messages in a topic's partitions. Consumers belong to consumer groups; designating a consumer group is something you do by defining a group name and assigning it to your consumers. You can have multiple consumer groups, each consisting of one or more actual consumer processes. The consumer instances that belong to a consumer group can be in separate processes or even on separate machines.
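
A minimal consumer sketch under the same assumptions; the group name "order-processors" is a hypothetical choice, and every consumer started with this group id shares the topic's partitions with the rest of the group:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors"); // the consumer group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // start from the oldest retained events

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                // Poll the assigned partitions; records arrive in offset order
                // within each partition.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                                      record.partition(), record.offset(),
                                      record.key(), record.value());
                }
            }
        }
    }
}
```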

