What is Apache Kafka?
Apache Kafka is a messaging system that provides a foundation for moving data between systems without tight coupling. It is often defined as a distributed log service that is partitioned and, optionally, replicated. It is designed to be fast, scalable, durable, and fault-tolerant. For large-scale data systems, many teams prefer it as the way for systems to communicate.
To have a better understanding of Kafka, let us first have a glimpse of some of the most common Apache Kafka terminology.
First off, let’s discuss the concept of “topics”. To send a message to Kafka from a Producer, you have to send it to a specific topic. The same holds true if you want to read a message: a Consumer reads from one or more specific topics.
You can think of it like this:
Messages are published to topics by Producers. Those who subscribe to topics and read the messages, on the other hand, are called Consumers.
Producers and Consumers communicate with Apache Kafka over a language-agnostic TCP protocol, which is simple and high-performing. This protocol handles all communication between clients and servers.
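To make the Producer, topic, and Consumer roles concrete, here is a minimal in-memory sketch. It is a toy model only: `ToyBroker`, `publish`, and `read` are hypothetical names made up for this illustration, not part of any Kafka client library, and no network protocol is involved.

```python
from collections import defaultdict


class ToyBroker:
    """Hypothetical in-memory stand-in for a Kafka cluster (not a real API)."""

    def __init__(self):
        # topic name -> messages, in the order they were published
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        """Producer side: every message is sent to one specific topic."""
        self.topics[topic].append(message)

    def read(self, topic_names):
        """Consumer side: read from one or more specific topics."""
        return {name: list(self.topics[name]) for name in topic_names}


broker = ToyBroker()
broker.publish("orders", "order-1 created")
broker.publish("payments", "order-1 paid")
broker.publish("orders", "order-2 created")

# A consumer that only subscribes to "orders" never sees "payments" messages.
received = broker.read(["orders"])
print(received)
```

The point of the sketch is the decoupling: the producer and the consumer only agree on a topic name and never talk to each other directly.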
Topics and Logs
A topic is a key abstraction in Kafka. It can be thought of as a feed name or category to which messages are published. Each topic is further subdivided into partitions, which Kafka brokers use to spread a topic’s data across different nodes. Each message within a partition is assigned a sequential ID number, technically known as the offset.
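The relationship between topics, partitions, and offsets can be sketched in a few lines. This is an illustrative model under simple assumptions (a partition is just an append-only list, and the offset is the list index); `ToyTopic` and `ToyPartition` are hypothetical names, not Kafka classes.

```python
class ToyPartition:
    """Append-only log; a message's offset is its position in the log."""

    def __init__(self):
        self.log = []

    def append(self, message):
        self.log.append(message)
        return len(self.log) - 1  # offset assigned to the new message


class ToyTopic:
    """A named topic split into a fixed number of partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [ToyPartition() for _ in range(num_partitions)]

    def append(self, partition_id, message):
        return self.partitions[partition_id].append(message)


topic = ToyTopic("page-views", num_partitions=2)
first = topic.append(0, "view /home")
second = topic.append(0, "view /pricing")
other = topic.append(1, "view /docs")

# Offsets are counted per partition, so partition 1 starts again at 0.
print(first, second, other)  # 0 1 0
```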
All published messages are retained in the cluster for a configurable period, regardless of whether they have been consumed. For instance, if the retention period is configured as one week, a message stays available for one week after it is published. Once that period has lapsed, the message is automatically deleted, which frees up space on the system.
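Here is a rough sketch of time-based retention, assuming a one-week period as in the example above. `RetainedLog` is a hypothetical class for illustration, and it is deliberately simplified: it drops individual messages, whereas a real Kafka broker removes data in larger chunks (log segments).

```python
DAY = 24 * 60 * 60      # seconds
ONE_WEEK = 7 * DAY      # retention period used in this example


class RetainedLog:
    """Toy log that forgets messages older than the retention period."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.entries = []  # (publish_time, message) pairs

    def append(self, message, now):
        self.entries.append((now, message))

    def expire(self, now):
        # Whether a message was consumed plays no part in this decision.
        cutoff = now - self.retention_seconds
        self.entries = [(ts, msg) for ts, msg in self.entries if ts >= cutoff]

    def messages(self):
        return [msg for _, msg in self.entries]


log = RetainedLog(ONE_WEEK)
log.append("a", now=0)
log.append("b", now=6 * DAY)

log.expire(now=8 * DAY)  # "a" is now 8 days old, "b" only 2 days old
print(log.messages())    # ['b']
```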
On a per-consumer basis, the only metadata retained is the consumer’s position in the log, technically known as the offset. The consumer has complete control of this offset. A consumer usually advances its offset linearly as it reads, but because it is in control, it can also go back to an older offset and reprocess messages.
Consumers of Kafka are lightweight and cheap, and they do not have a significant impact on how the cluster performs. This differs from more traditional message systems. Every consumer maintains its own offset, so its actions do not affect other consumers.
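The idea that each consumer owns nothing but an offset can be sketched as follows. `ToyConsumer`, `poll`, and `seek` are names chosen for this illustration and the log is a plain Python list; treat it as a sketch of the concept rather than how a real client is implemented.

```python
# One partition's messages; the list index plays the role of the offset.
log = ["m0", "m1", "m2", "m3"]


class ToyConsumer:
    """Toy consumer whose only state is its current offset."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_messages=1):
        """Read forward from the current offset and advance it."""
        batch = self.log[self.offset:self.offset + max_messages]
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        """Move to any offset, e.g. backwards to reprocess messages."""
        self.offset = offset


fast = ToyConsumer(log)
slow = ToyConsumer(log)

fast.poll(4)             # fast reads everything
slow.poll(1)             # slow has only read m0

fast.seek(1)             # fast rewinds to reprocess from offset 1
replayed = fast.poll(2)

print(replayed)          # ['m1', 'm2']
print(slow.offset)       # 1, untouched by what fast did
```

Because the log itself is never modified by reading, adding another consumer costs little more than one extra integer.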
Why partitions? Partitions serve two purposes. The first is to let a topic grow beyond the size that fits on a single node: each individual partition must fit on the node that hosts it, but a topic can have many partitions spread across nodes. The second purpose is parallelism, or performance tuning. Because partitions can be read independently, messages from one topic can be consumed concurrently, for example by several threads or consumer instances. This will be further discussed later on.
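One common way to spread messages across partitions is to hash a message key, so that all messages with the same key land in the same partition. The sketch below assumes that approach and uses `zlib.crc32` purely as a convenient, deterministic stand-in hash; `choose_partition` is a hypothetical helper, and Kafka's own partitioner uses a different hash function.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3


def choose_partition(key, num_partitions):
    """Map a key to a partition; the same key always maps to the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [
    ("alice", "login"),
    ("bob", "login"),
    ("alice", "click"),
    ("carol", "login"),
    ("alice", "logout"),
]

partitions = defaultdict(list)
for key, value in events:
    partitions[choose_partition(key, NUM_PARTITIONS)].append((key, value))

# Each partition can now be handed to a different thread or consumer,
# while all of alice's events stay together and in order.
alice_partition = choose_partition("alice", NUM_PARTITIONS)
alice_events = [v for k, v in partitions[alice_partition] if k == "alice"]
print(alice_events)  # ['login', 'click', 'logout']
```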
Partitions can be replicated for fault tolerance. Each partition has a single server labeled the “Leader.” The leader is responsible for handling the read and write requests for the partition. There can also be zero or more “Followers,” which are other nodes in the Kafka cluster. They are responsible for replicating the actions undertaken by the leader. If the leader fails, one of the followers automatically assumes the leadership position. We’ll cover more on Kafka fault tolerance in later tutorials.
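The leader and follower roles can be sketched with a small toy model. Everything here is an assumption made for illustration: `ToyReplicatedPartition` is a hypothetical class, followers copy each write instantly, and the new leader is simply the next surviving replica. A real cluster is more careful about which follower is allowed to take over.

```python
class ToyReplicatedPartition:
    """Toy partition with one leader and zero or more followers."""

    def __init__(self, broker_ids):
        # broker id -> that broker's copy of the partition log
        self.replicas = {broker_id: [] for broker_id in broker_ids}
        self.leader = broker_ids[0]

    def followers(self):
        return [b for b in self.replicas if b != self.leader]

    def write(self, message):
        # The leader takes the write; followers replicate it.
        self.replicas[self.leader].append(message)
        for follower in self.followers():
            self.replicas[follower].append(message)

    def read(self):
        # Reads are served by the current leader.
        return list(self.replicas[self.leader])

    def fail(self, broker_id):
        del self.replicas[broker_id]
        if broker_id == self.leader:
            if not self.replicas:
                raise RuntimeError("no replicas left for this partition")
            # Simplification: promote the first surviving replica.
            self.leader = next(iter(self.replicas))


partition = ToyReplicatedPartition(["broker-1", "broker-2", "broker-3"])
partition.write("a")
partition.write("b")

partition.fail("broker-1")   # the leader goes down
print(partition.leader)      # broker-2
print(partition.read())      # ['a', 'b']
```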
So, Why Kafka? When?
When compared to a conventional messaging system, one of the benefits of Kafka is that it has ordering guarantees. In the case of a traditional queue, messages are kept on the server in the order in which they were received. Messages are then handed out in the order of their storage, but delivery to consumers is asynchronous, so there is no guarantee about when each consumer actually receives and processes its message. This means that the order in which they arrive at different consumers could be random. Hold the phone. Did you just write “could be random”? Think about that for 5 seconds. That may be fine in some use cases, but not others. Kafka, by contrast, preserves the order of messages within each partition.
Kafka is often deployed as the foundational element in Streaming Architectures because it can facilitate loose coupling between various components in architectures such as databases, microservices, Hadoop, data warehouses, distributed file systems, search applications, etc.
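To see how "handed out in order" can still end up out of order, here is a small deterministic simulation. The delays are made-up numbers and `completion_order` is a hypothetical helper: message number i is handed out at time i to consumers in round-robin fashion, and each consumer needs a fixed amount of time per message. The model is deliberately simplistic (it ignores that a busy consumer cannot take new work), but it shows the effect.

```python
def completion_order(messages, consumer_delays):
    """Hand out messages in stored order; return the order they finish in."""
    finished = []
    for handout_time, message in enumerate(messages):
        consumer = handout_time % len(consumer_delays)   # round-robin
        done_at = handout_time + consumer_delays[consumer]
        finished.append((done_at, message))
    return [message for _, message in sorted(finished)]


messages = ["m0", "m1", "m2", "m3"]

# Traditional queue shared by a slow consumer (5 time units per message)
# and a fast one (1 time unit per message).
shared_queue = completion_order(messages, consumer_delays=[5, 1])

# A single reader working through the same messages, as with one
# consumer reading one partition.
single_reader = completion_order(messages, consumer_delays=[5])

print(shared_queue)    # ['m1', 'm3', 'm0', 'm2']
print(single_reader)   # ['m0', 'm1', 'm2', 'm3']
```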
In a Kafka messaging system, there are essentially three guarantees:
- A message sent by a producer to a topic partition is appended to the Kafka log based on the order sent.
- Consumers see messages in the order in which they are stored in the log.
- For topics with replication factor N, Kafka will tolerate up to N-1 server failures without losing the messages previously committed to the log.
Hopefully, this article provides more high-level insight into Apache Kafka architecture and Kafka benefits over traditional messaging systems. Stay tuned for additional forthcoming tutorials on Apache Kafka.
And if there are any particular “topics” 😉 you would like to see, let us know.
Featured image based on https://flic.kr/p/eEQpPj