What Is Apache Kafka?
Apache Kafka is an open source project, initially created by LinkedIn, that is designed to be a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design, which we will investigate in more detail in this Article. Kafka was designed with a few important characteristics in mind, one being that it is very fast. Think Big Data here – Kafka can handle hundreds of megabytes of reads and writes per second from a large number of clients. Kafka allows you to partition data streams and spread messages over a cluster of machines which allows you to have data streams larger than a single machine could handle. Additionally it allows for clusters of coordinated consumers to consume data from the system.
Overview of Kafka components
The first concept or component is Topics. In Kafka, a feed of messages is organized into what is called a topic. Think of it like a category of messages.
The next concept is Brokers. Kafka runs on a cluster of one or more servers which are called brokers.
Producers are processes that publish messages to Kafka topics.
Consumers are the processes that subscribe to these topics and process the feed of published messages.
A Kafka topic can be thought of as a user-defined category or feed name to which messages are published.
Let me give you a quick example. If you were using Kafka for website activity tracking, you might have one topic that receives messages when users login to the website and another topic that receives messages every time a user clicks a link on the web page.
Each topic consists of a partitioned log. Behind the scenes the log for a topic partition is stored as a directory of segment files. By partitioning topics, Kafka is able to scale horizontally. That means you can add additional servers and scale the log beyond what just a single server could handle. There’s also other great things that can be done, such as having multiple consumer processes read data for a single topic, and more.
Now, each of these topic partitions are an ordered, immutable sequence of messages that are continually appended to. Take a look at the above image. The topic in this example has 3 partitions – partition 0, partition 1, and partition 2. Partition 0 currently has messages with sequential id’s, known as the “offsets”, of 0 through 11 and a message with offset of 12 is being written to the partition. The offsets uniquely identify messages within the partition.
It is possible to have total order over all the messages in a topic, however your topic would need to have only one partition and that reduces the benefits of being able to parallelize reads.
Kafka retains all the messages that are published regardless if they have been consumed or not, for a configurable period of time. You could set log retention to one day which would mean that a message is kept for 1 day after it’s published, or you could set log retention to 30 days – it’s your choice. Kafka’s performance is not greatly affected regardless of the data size.
A Kafka cluster is comprised of one or more servers which are called “brokers”. Each of these Kafka brokers stores one or more partitions on it. Kafka is able to spread a single topic’s partitions across multiple brokers, which allows for the horizontal scaling we previously mentioned. By spreading the topic’s partitions across multiple brokers, consumers can read from a single topic in parallel. The Kafka brokers are stateless, in that consumers are responsible for keeping track of what offset position they would like to read. The broker does not keep track of this, which is one of the design features that helps Kafka be so fast. Kafka also allows the topic’s partitions to be replicated across the brokers, which we will look at next.
In Kafka, each partition can be replicated across a number of servers. The number of servers a partition is replicated across, is called the “replication factor” and it is configurable. By being able to replicate partitions, we are able to provide fault tolerance to the system, meaning that if there is a failure of a broker, the messages in the partition can remain available because Kafka is able to automatically failover to the replicas.
Each partition has one server which acts as the “leader” and then depending on the replication factor you configure, there will be zero or more servers acting as “followers”. As an example, a replication factor of 3 means there would be one leader and two followers. In that example, 2 of the 3 servers could fail before you’d lose access to the data. The leader handles all read and write requests for the partition. The followers passively replicate the leader. If a leader were to fail, then a follower is elected to be the new leader. The leaders keep track of what is called the In Sync Replicas, which is abbreviated to “ISR”. The In Sync Replicas are a set of replicas that are alive and are fully caught up with the leader. If one of the followers dies, gets stuck or falls behind, then it is removed from the list of In Sync Replicas.
A message is “committed” when all the In Sync Replicas for a partition have applied it to their log. Only committed messages are given to consumers. Producers on the other hand have the option of whether to wait for the commit or not. It’s a choice between latency and durability and likely depends on the business needs of your application.
In Kafka, producers are processes that publish messages to a Kafka topic. A producer decides which message is written to which partition in a topic. It can do this in a basic round-robin style for simple load balancing or something custom can be defined. As an example, you could have your producer write messages to certain partitions based on some sort of key.
when a producer has a message to write to a partition, it does the write to a single leader. This provides load balancing, as a different broker can then service each write – you may have one broker contain the leader for partition 1 and another broker contain the partition 2 leader.
Producers can choose to send a key with the message (string, number, ..etc.), if key is null data is sent round robin, and if key is sent, then all messages for that key will always go to the same partition.
A key is basically sent if you need message ordering for specific field (ex: truck_id)
Kafka consumers read from topics. They consume the messages in a topic’s partitions. Kafka comes with a command line consumer client or you have the option to write your own using APIs. Consumers belong to consumer groups. Designating a consumer group is something that you do by defining a group name and assigning it to your consumers. You can have multiple consumer groups, each consisting of 1 or more actual consumer processes. These consumer instances that belong to a consumer group, can be in separate processes or even on separate machines.
As an example, imagine that you have multiple consumer instances that all belong to the same consumer group. The load gets balanced across them. Each of the consumer instances in the group reads from a single unique partition each in this scenario. The group as a whole consumes all the messages in the topic. In this scenario, you would not need or want any more consumer instances than there are partitions to be read, because those additional consumers would just sit idle.
In another scenario, imagine we had consumer instances that belong to different consumer groups, that all subscribe to the same topic. In this scenario, each consumer group gets a copy of the messages. Let’s take a look at the diagram that the Kafka documentation uses. The Kafka Cluster in this scenario, consists of two brokers which are hosting 4 partitions – P0, P1, P2, and P3. The partitions are spread across the cluster – two on each server. In this example, there are two consumer groups – Consumer Group A and Consumer Group B. Consumer Group A consists of two consumers called C1 and C2 here. Consumer Group B consists of 4 consumers called C3, C4, C5 and C6 here. Both of the consumer groups are subscribed to the same topic. Consumer Group A’s two consumers are responsible for reading messages from all 4 of the partitions. That means that consumer C1 reads from two
partitions, and consumer C2 reads from the other two partitions. Consumer Group B has 4 consumers, which means each consumer in that group only has to read from one partition each. Having four consumers share the load is likely faster than having two in this scenario.
When a consumer in a group has processed data received from Kafka, it should be committing the offsets. So, if a consumer dies, it will be able to read back from where it left off thanks to the committed consumer offsets.
Kafka requires Apache Zookeeper. This means that you first need a ZooKeeper server started in order to run Kafka. There’s a script that comes with Kafka that starts a simple single-node ZooKeeper instance, though in a production environment you would likely want to have a more robust ZooKeeper installation. Kafka uses ZooKeeper for a variety of things, such as cluster membership, electing a broker as controller, topic configuration such as
which topics exists, who’s the leader and so on. In version 0.9 ZooKeeper is used for Quotas and ACLs also. In the older version of Kafka, high-level
consumers used ZooKeeper to do things such as keep track of offsets, however starting in version 0.9 there is a new consumer which uses Kafka instead of ZooKeeper for these purposes.
A simple use-case
You could have several IoT devices sending data to Kafka, much like users visiting your website. If the fleet of devices grows, you can configure Kafka to scale out or scale in to fulfill peak loads of traffic. Let’s say that you put an IoT device in every train in a city. Each IoT device will be sending information about the train. For instance, about the status of critical parts of the engine, like a sensor. An application can subscribe to a Kafka topic to programmatically alert when a sensor exceeds a certain threshold. A train could be flagged as a candidate for maintenance and temporarily disabled for security reasons. You wouldn’t want to wait for the end of the day to get all the information from each device. It might be too late. That’s why Kafka’s real-time capability is so valuable.