What is Apache Kafka?
Simply put, Kafka is a messaging system that moves data between systems. That description may not mean much to a newcomer, though. More precisely, Kafka is a distributed, partitioned, and replicated commit log service. It provides messaging that is fast, scalable, durable, and fault-tolerant, and it is distributed by design. For large-scale data systems, many teams choose it as the way their systems communicate.
To understand Kafka better, let us first look at some common messaging terminology.
First off, let us discuss the concept of “topics”. To send a message, you send it to a specific topic. The same holds true for reading: you read from one or more specific topics.
Messages are published to topics by producers. Those who subscribe to topics and read the messages, on the other hand, are called consumers. Kafka itself runs as a cluster of one or more servers, and each of those servers is called a broker.
Clients and servers communicate over a simple, high-performance, language-agnostic TCP protocol.
Topics and Logs
A topic is the key abstraction in Kafka. It is a category or feed name to which messages are published. Each topic is divided into partitions, and the partitions spread the topic’s data across different brokers. Each message within a partition is assigned a sequential ID number, known as the offset, which identifies it within that partition.
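To make topics, partitions, and offsets more concrete, here is a minimal in-memory sketch in Python. It is a toy model, not the Kafka client API: the Topic and Partition classes, the topic name, and the keys are made up for illustration, and hashing the key with crc32 is only a stand-in for whatever routing rule a real producer uses.

```python
from dataclasses import dataclass, field
from zlib import crc32


@dataclass
class Partition:
    """An append-only list of messages; the list index plays the role of the offset."""
    messages: list = field(default_factory=list)

    def append(self, value: str) -> int:
        self.messages.append(value)
        return len(self.messages) - 1  # offset assigned to the new message


@dataclass
class Topic:
    """A named feed split into a fixed number of partitions (toy model)."""
    name: str
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        self.partitions = [Partition() for _ in range(self.num_partitions)]

    def publish(self, key: str, value: str) -> tuple:
        # Assumption for illustration: route by hashing the key.
        index = crc32(key.encode("utf-8")) % self.num_partitions
        offset = self.partitions[index].append(value)
        return index, offset


topic = Topic("page-views", num_partitions=3)
first = topic.publish("user-42", "viewed /home")
second = topic.publish("user-42", "viewed /pricing")
```

Both calls use the same key, so both messages land in the same partition, and the second one gets the next offset in that partition.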
All published messages are retained in the cluster for a configurable period, regardless of whether they have been consumed. For instance, if the retention period is set to one week, a message remains available for one week after it is published. Once that period has lapsed, the message is deleted automatically, which frees up space.
On a per-consumer basis, the only metadata retained is the consumer’s position in the log, which is its offset. The consumer has complete control of this offset. A consumer usually advances its offset linearly as it reads, but because it controls the position, it can also go back to an older offset and reprocess messages.
Kafka consumers are lightweight and cheap, and they have little impact on the performance of the cluster. This differs from more traditional messaging systems. Every consumer maintains its own offset, so its actions do not affect other consumers.
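The one-week example can be sketched as a small time-based pruning routine. This is a simplified model under stated assumptions: the RetainedLog class is hypothetical, timestamps are plain seconds passed in by the caller, and entries are dropped one by one, whereas a real broker removes data in larger chunks (log segments).

```python
from collections import deque

RETENTION_SECONDS = 7 * 24 * 60 * 60  # one week, as in the example above
DAY = 24 * 60 * 60


class RetainedLog:
    """Toy log that forgets entries once they are older than the retention window."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.entries = deque()  # (offset, timestamp, value), oldest first
        self.next_offset = 0

    def append(self, value, now):
        self.entries.append((self.next_offset, now, value))
        self.next_offset += 1

    def expire(self, now):
        # Age is the only criterion; whether anyone consumed the entry is irrelevant.
        while self.entries and now - self.entries[0][1] > self.retention_seconds:
            self.entries.popleft()

    def offsets(self):
        return [offset for offset, _, _ in self.entries]


log = RetainedLog(RETENTION_SECONDS)
log.append("old event", now=0)
log.append("recent event", now=6 * DAY)
log.expire(now=8 * DAY)
remaining = log.offsets()
```

After eight days, the first entry has aged out and only offset 1 remains. Offsets are not renumbered when older entries disappear.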
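Here is a rough sketch of what “the consumer owns its offset” means in practice. The Consumer class below is a toy written for this article, not the real client library; the method names poll and seek are chosen only because they read naturally.

```python
class Consumer:
    """Toy consumer that tracks its own read position in a shared log."""

    def __init__(self, log):
        self.log = log    # shared view of one partition
        self.offset = 0   # position owned by this consumer alone

    def poll(self):
        """Return the next message and advance, or None at the end of the log."""
        if self.offset >= len(self.log):
            return None
        message = self.log[self.offset]
        self.offset += 1
        return message

    def seek(self, offset):
        """Move the read position, for example to reprocess older messages."""
        self.offset = offset


partition = ["m0", "m1", "m2"]
fast, slow = Consumer(partition), Consumer(partition)

read_by_fast = [fast.poll() for _ in range(3)]  # reads the whole log
first_for_slow = slow.poll()                    # unaffected by the fast consumer
fast.seek(1)                                    # rewind and reprocess
replayed = [fast.poll(), fast.poll()]
```

Reading never removes anything from the log, so the slow consumer still starts from the first message, and the fast consumer can rewind without disturbing anyone else.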
Why partitions? Partitions serve two purposes. The first is to let a topic grow beyond what fits on a single server: each individual partition must fit on the server that hosts it, but a topic can have many partitions spread across many servers. The second purpose is parallelism. Because partitions can be read independently of one another, the work of consuming a topic can be split across several consumers or threads at once. This will be discussed further later on.
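One way to picture the parallelism that partitions allow is to divide a topic’s partitions among several readers. The sketch below uses a simple round-robin split. The function name and the consumer names are hypothetical, and the assignment strategy a real deployment uses may differ.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers round-robin; each partition gets one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for position, partition in enumerate(partitions):
        owner = consumers[position % len(consumers)]
        assignment[owner].append(partition)
    return assignment


assignment = assign_partitions([0, 1, 2, 3], ["consumer-a", "consumer-b"])
```

With four partitions and two readers, each reader ends up with two partitions and can work through them independently of the other.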
Every partition is replicated for fault tolerance. For each partition, a single server acts as the “leader” and handles the read and write requests for that partition. Zero or more other servers act as “followers” and replicate the leader. If the leader fails, one of the followers automatically takes over as the new leader. We’ll cover more on Kafka fault tolerance in later tutorials.
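The leader and follower roles can be sketched with a small simulation. This is a deliberately simplified model: the ReplicatedPartition class and the broker names are made up, followers copy each write immediately, and the “election” just picks the next surviving server. Real leader election is considerably more involved.

```python
class ReplicatedPartition:
    """Toy partition with one leader and zero or more followers."""

    def __init__(self, servers):
        self.logs = {server: [] for server in servers}  # one log copy per server
        self.leader = servers[0]

    def followers(self):
        return [server for server in self.logs if server != self.leader]

    def write(self, message):
        """The leader appends first, then each follower copies the entry."""
        self.logs[self.leader].append(message)
        for follower in self.followers():
            self.logs[follower].append(message)

    def read(self):
        """Reads are served from the leader's copy."""
        return list(self.logs[self.leader])

    def fail_leader(self):
        """Drop the failed leader and promote one of the remaining followers."""
        del self.logs[self.leader]
        if not self.logs:
            raise RuntimeError("no replicas left")
        self.leader = next(iter(self.logs))


partition = ReplicatedPartition(["broker-1", "broker-2", "broker-3"])
partition.write("order created")
partition.fail_leader()           # broker-1 goes away
partition.write("order paid")     # the new leader keeps accepting writes
```

The message written before the failure is still readable afterwards, because a follower already held a copy when it took over.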
So, Why Kafka? When?
Compared to a conventional messaging system, one benefit of Kafka is its ordering guarantees. A traditional queue keeps messages on the server in the order in which they were received. It hands them out in that same order, but delivery to consumers is asynchronous, so when several consumers read from the queue, messages can reach different consumers out of order. Hold the phone. Did you just write “out of order”? Think about that for 5 seconds. That may be fine in some use cases, but not in others.
In a Kafka messaging system, there are essentially three guarantees:
- Messages sent by a producer to a particular topic partition are appended to the log in the order they are sent.
- A consumer sees messages in the order they are stored in the log.
- For a topic with replication factor N, Kafka tolerates up to N-1 server failures without losing any messages committed to the log.
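The guarantees above can be illustrated with a short toy example. Nothing here talks to a real cluster: the logs are plain Python lists, the partition numbers and message names are invented, and the replica check simply counts surviving copies.

```python
def consume(partition_log):
    """Yield messages in log order, which is the order they were appended."""
    yield from partition_log


def readable_after_failures(replication_factor, failures):
    """Return True if at least one copy of a committed message survives."""
    replicas = [["committed message"] for _ in range(replication_factor)]
    survivors = replicas[failures:]  # the first `failures` servers are lost
    return any("committed message" in log for log in survivors)


# Two partitions of one hypothetical topic; each list is an append-only log.
logs = {0: [], 1: []}

# A producer sends to explicit partitions in this order.
for partition_id, message in [(0, "a1"), (1, "b1"), (0, "a2"), (1, "b2")]:
    logs[partition_id].append(message)

seen_from_p0 = list(consume(logs[0]))
seen_from_p1 = list(consume(logs[1]))
```

Within each partition, the consumer sees messages in exactly the order the producer sent them. With a replication factor of 3, a committed message survives two server failures but not three.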
Hopefully, this article has provided some high-level insight into Apache Kafka’s architecture and its benefits over traditional messaging systems. Stay tuned for more tutorials on Apache Kafka.
And if there are any particular “topics” 😉 you would like to see, let us know.
Featured image based on https://flic.kr/p/eEQpPj