Apache Kafka Overview

What is Apache Kafka?

Apache Kafka is a messaging system that provides a foundation for moving data between systems without tight coupling. It is often described as a distributed log service that is partitioned and optionally replicated. The result is messaging that is fast, scalable, durable, fault-tolerant, and distributed by design. For large-scale data systems, many teams prefer it as the way for systems to communicate.

To better understand Kafka, let us first look at some of the most common Apache Kafka terminology.

First off, let’s discuss the concept of “topics”. To send a message to Kafka from a Producer, you send it to a specific topic. The same holds for reading: a Consumer reads from one or more specific topics.

You can think of it like this:

[Figure: Kafka Topics]

Producers publish messages to topics. Those that subscribe to topics and read the messages are called Consumers.

Producers and Consumers communicate with Apache Kafka through a simple, high-performance, language-agnostic TCP protocol. This protocol handles all communication between clients and servers.

Topics and Logs

A topic is a key abstraction in Kafka. It can be thought of as a feed name or category to which messages are published. Each topic is subdivided into partitions, which allow a topic's data to be spread across different Kafka brokers (nodes). Each message within a partition is assigned a sequential ID number, known as the offset.
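To make the relationship between topics, partitions, and offsets concrete, here is a minimal in-memory sketch. `ToyTopic` and the topic name `page-views` are hypothetical names made up for illustration; this is not the Kafka client API, only a model of the idea that each partition is an append-only log and a message's position in it is its offset.

```python
class ToyTopic:
    """In-memory stand-in for a Kafka topic (illustration only, not the Kafka API)."""

    def __init__(self, name, num_partitions):
        self.name = name
        # Each partition is an append-only list; a message's index is its offset.
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        """Append a message to one partition and return the offset it was given."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1

    def read(self, partition, offset):
        """Return the message stored at the given offset of a partition."""
        return self.partitions[partition][offset]


topic = ToyTopic("page-views", num_partitions=3)
first = topic.append(0, "user-1 viewed /home")
second = topic.append(0, "user-2 viewed /pricing")
other = topic.append(1, "user-3 viewed /docs")

# Offsets are counted per partition, so partition 1 starts again at 0.
print(first, second, other)
print(topic.read(0, 1))
```

Note that an offset only identifies a message together with its partition: offset 0 exists in both partition 0 and partition 1 here.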

[Figure: Kafka Topic Partitions]

All published messages are retained in the cluster for a configurable period, whether or not they have been consumed. For instance, if the retention period is set to one week, a message is available for one week after it is published. Once that period has lapsed, the message is deleted automatically, which frees up space.

On a per-consumer basis, the only metadata retained is the consumer's position in the log, known as the offset. The consumer controls this offset. Messages are usually consumed linearly, but because the consumer is in control, it can move back to an older offset and reprocess.

Kafka consumers are lightweight and cheap, and they have little impact on cluster performance. This differs from more traditional messaging systems. Every consumer maintains its own offset, so its actions do not affect other consumers.
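The one-week example can be sketched as a simple time-based filter. This is a toy model under stated assumptions: `apply_retention` is a hypothetical helper, timestamps are plain integers in seconds, and retention is applied per message (real brokers work on larger log segments). The point it illustrates is that age, not consumption, decides what is kept.

```python
ONE_WEEK_SECONDS = 7 * 24 * 60 * 60
ONE_DAY_SECONDS = 24 * 60 * 60


def apply_retention(log, now, retention_seconds=ONE_WEEK_SECONDS):
    """Return only the (timestamp, message) records still inside the retention window."""
    cutoff = now - retention_seconds
    return [(ts, msg) for ts, msg in log if ts >= cutoff]


now = 1_000_000  # arbitrary "current time" in seconds, chosen for the example
log = [
    (now - 8 * ONE_DAY_SECONDS, "eight days old"),
    (now - 2 * ONE_DAY_SECONDS, "two days old, already consumed"),
    (now - 60, "one minute old, not yet consumed"),
]

kept = apply_retention(log, now)
print([msg for _, msg in kept])
```

Only the eight-day-old record falls outside the window; the consumed and unconsumed recent records are both kept.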
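A small sketch may help show why independent offsets keep consumers from interfering with each other. `ToyConsumer` is a hypothetical class for illustration, and the log is a plain Python list rather than a real Kafka partition.

```python
class ToyConsumer:
    """Reads from a shared list-based log while tracking its own position."""

    def __init__(self, log):
        self.log = log
        self.offset = 0  # next offset to read; owned by this consumer only

    def poll(self):
        """Return the next message, or None when the consumer has caught up."""
        if self.offset >= len(self.log):
            return None
        message = self.log[self.offset]
        self.offset += 1
        return message

    def seek(self, offset):
        """Move the position, for example back to 0 to reprocess old messages."""
        self.offset = offset


log = ["m0", "m1", "m2"]
fast, slow = ToyConsumer(log), ToyConsumer(log)

fast_seen = [fast.poll() for _ in range(3)]  # fast reads everything
slow_seen = [slow.poll()]                    # slow has read only one message

fast.seek(0)            # fast decides to reprocess from the beginning
replayed = fast.poll()

print(fast_seen, slow_seen, replayed, slow.offset)
```

Reading does not remove anything from the log, and rewinding one consumer leaves the other consumer's offset untouched.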

Kafka Partitions

Why partitions? Partitions serve two purposes. The first is scale: each partition must fit on the node that hosts it, but a topic can have many partitions, so the topic as a whole can grow beyond the size of a single node. The second is parallelism: different partitions can be read at the same time, so consumption of a topic can be spread across multiple consumers or threads. This will be discussed further later on.
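As a rough illustration of the parallelism point, the sketch below spreads the partitions of one topic across several readers. The round-robin scheme, the `assign_partitions` helper, and the consumer names are assumptions made for this example; they are not a description of how Kafka itself assigns partitions.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions across consumers round-robin (an illustrative scheme only)."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


assignment = assign_partitions(
    partitions=list(range(6)),
    consumers=["consumer-a", "consumer-b", "consumer-c"],
)
for consumer, owned in assignment.items():
    print(consumer, owned)
```

With six partitions and three readers, each reader handles two partitions, so the topic can be processed roughly three ways in parallel.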

Distribution

Each partition can be replicated across multiple servers for fault tolerance. Every partition has a single server that acts as the “Leader.” The leader handles all read and write requests for the partition. There are also zero or more “Followers,” which are other nodes in the Kafka cluster that replicate the leader's log. If the leader fails, one of the followers automatically takes over as the new leader. We’ll cover more on Kafka fault tolerance in later tutorials.
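The leader and follower roles can be sketched with a toy model. Everything here is a simplifying assumption for illustration: `ToyPartitionReplicas` is a hypothetical class, replication is done synchronously inside `write`, and the first remaining follower is promoted on failure. Real leader election is more involved.

```python
class ToyPartitionReplicas:
    """One partition's replicas: a leader plus zero or more followers (toy model)."""

    def __init__(self, brokers):
        self.leader = brokers[0]
        self.followers = list(brokers[1:])
        self.logs = {broker: [] for broker in brokers}

    def write(self, message):
        """The leader takes the write, and followers copy it (synchronously here)."""
        self.logs[self.leader].append(message)
        for follower in self.followers:
            self.logs[follower].append(message)

    def fail_leader(self):
        """Drop the current leader and promote the first remaining follower."""
        if not self.followers:
            raise RuntimeError("no follower left to promote")
        del self.logs[self.leader]
        self.leader = self.followers.pop(0)
        return self.leader


replicas = ToyPartitionReplicas(["broker-1", "broker-2", "broker-3"])
replicas.write("a")
replicas.write("b")

new_leader = replicas.fail_leader()
print(new_leader, replicas.logs[new_leader])
```

Because the follower already holds a copy of the log, the messages written before the failure are still available from the new leader.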

So, Why Kafka?  When?

Compared to a conventional messaging system, one benefit of Kafka is its stronger ordering guarantees. A traditional queue retains messages on the server in the order they arrive and hands them out in that same order. However, delivery to consumers is asynchronous, so when multiple consumers read from the queue, messages may arrive at different consumers out of order. In other words, the order in which they are processed could be random.  Hold the phone.  Did you just write “could be random”?  Think about that for 5 seconds.  That may be fine in some use cases, but not others.  Kafka, by contrast, preserves message order within each partition.

Kafka is often deployed as the foundational element in streaming architectures because it facilitates loose coupling between components such as databases, microservices, Hadoop, data warehouses, distributed file systems, search applications, etc.

Guarantees

In a Kafka messaging system, there are essentially three guarantees:

  1. Messages sent by a producer to a topic partition are appended to the Kafka log in the order they are sent.
  2. Consumers see messages in the order in which they are stored in the log.
  3. For a topic with replication factor N, Kafka tolerates up to N-1 server failures without losing messages committed to the log.
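The three guarantees can be restated as a short sketch. This is only a toy model: the log is a Python list, and `tolerated_failures` is a hypothetical helper that encodes the N-1 arithmetic from the third guarantee.

```python
def tolerated_failures(replication_factor):
    """Guarantee 3: a topic with replication factor N tolerates N-1 failures."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor - 1


# Guarantees 1 and 2 on a toy append-only log: read order equals send order.
log = []
for message in ["first", "second", "third"]:
    log.append(message)  # the producer's messages are appended in send order

consumed = [log[offset] for offset in range(len(log))]  # the consumer reads by offset

print(consumed)
print(tolerated_failures(3))
```

So a topic with replication factor 3 can lose two servers and still keep its committed messages, while a replication factor of 1 leaves no room for failure.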

Conclusion

Hopefully, this article provides more high-level insight into Apache Kafka architecture and Kafka benefits over traditional messaging systems.  Stay tuned for additional forthcoming tutorials on Apache Kafka.

And if there are any particular “topics” 😉 you would like to see, let us know.

Featured image based on https://flic.kr/p/eEQpPj

 

Jun 16, 2016, Todd M