3 usages of kafkacat that I really love

25 Sep 2018

What can I say? I love kafkacat. I use this handy command-line tool to interact with Kafka clusters on a regular basis. In dev environments, I typically use it to produce and consume messages from a local Kafka cluster. However, kafkacat also comes in handy when troubleshooting issues in our prod cluster. This article discusses some commands I have found particularly useful, and I hope it will encourage you to play around with kafkacat as well :)


Useful commands

Poor man’s mirror maker

MirrorMaker is a useful tool for mirroring data from one Kafka cluster to another. Sometimes, when I’m working on application code (usually a Kafka Streams app), mirroring prod data into my local dev cluster makes it much easier to test and troubleshoot. With kafkacat, you can achieve simple, MirrorMaker-like functionality by piping messages from one cluster into another. Here’s an example:

# pipe 100 messages from the prod cluster to my local dev cluster
$ kafkacat -b prod.cluster:9092 -t api_logs -c 100 | \
  kafkacat -b localhost:9092 -t api_logs -P

The above command reads 100 messages (-c 100) from a prod cluster and produces them directly to a local cluster. Now, if you have a dev version of your consumer application reading from your local cluster, it will see the messages from prod and process them according to your application logic. I find this particularly useful for testing new code locally.
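One caveat with the plain pipe: only the message values travel through it, so the keys do not reach the local cluster. A minimal sketch of a key-preserving variant follows, using kafkacat's -K flag (key delimiter) on both sides. The function name mirror_sample is hypothetical, and the sketch assumes keys and values are plain text that contain neither a tab nor a newline, so it will not work for binary payloads.

```shell
# mirror_sample: hypothetical helper, not part of kafkacat.
# Usage: mirror_sample SRC_BROKER DST_BROKER TOPIC [COUNT]
# Copies COUNT messages (default 100) from one cluster to another and
# keeps the message keys by using a literal tab as the key delimiter.
# Assumption: neither keys nor values contain a tab or a newline.
mirror_sample() (
    src_broker=$1 dst_broker=$2 topic=$3 count=${4:-100}
    tab=$(printf '\t')
    kafkacat -C -q -b "$src_broker" -t "$topic" -c "$count" -K "$tab" |
        kafkacat -P -b "$dst_broker" -t "$topic" -K "$tab"
)

# Example (needs reachable brokers, so it is left commented out):
# mirror_sample prod.cluster:9092 localhost:9092 api_logs 100
```

The modes are spelled out with -C and -P here so the behavior does not depend on kafkacat guessing the mode from the terminal.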

Key sampling

Some of our topics at Mailchimp contain binary payloads, but we often embed useful information inside the message key. While troubleshooting issues, I have found it helpful to sample the message keys coming across our topic and count the number of times each key appears.

Here is a trivial example. Imagine you have a topic that captures API usage logs for your website. You notice a sudden spike in traffic on one of your topic’s partitions. Luckily, your message keys contain the user ID of the person making the request. With kafkacat, you can use format strings (via the -f flag) to print the keys, and then use some basic operations (sort and uniq) to get the number of messages by key.

$ kafkacat -q -b localhost:9092 -t api_responses -o -10 -f "%k\n" -c 10 | sort | uniq -c
   5 user142494
   3 user209821
   1 user340293
   1 user402948

In this example, we only sample 10 records from the topic, but I typically use much larger samples (sometimes as many as 500k messages). Note: -c 10 tells kafkacat to pull 10 records, and -o -10 rewinds the offset by 10 records. Setting a relative offset like this allows us to pull the last 10 records that were produced to the topic. Without this flag, you’ll need to wait for 10 new records (or whatever your sample size is) to appear in the topic before the consumer exits (which, depending on your sample size and topic throughput, could take a while).
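With larger samples, the list of distinct keys gets long, and usually only the busiest keys matter. A small sketch of how the counting step could be wrapped up so that the most frequent keys come first (top_keys is a hypothetical name, not a kafkacat feature):

```shell
# top_keys: hypothetical helper, not part of kafkacat.
# Reads one key per line on stdin and prints the N most frequent keys
# (default 10) with their counts, most frequent first.
top_keys() {
    sort | uniq -c | sort -rn | head -n "${1:-10}"
}

# Example (needs a reachable broker, so it is left commented out):
# kafkacat -C -q -b localhost:9092 -t api_responses -o -500000 -c 500000 -f '%k\n' | top_keys 5
```

The second sort orders the uniq -c output by count, so a hot key shows up on the first line instead of being buried in alphabetical order.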

Locating offsets by time

Sometimes, you may want to start streaming messages from a fixed point in time (e.g. 5 minutes ago). This is possible with kafkacat, and is made easier with a simple bash function that I have included below. First, put this function in your ~/.bash_profile.

# get a timestamp (epoch in ms) for x minutes ago
minago () {
    echo $(( $(date +%s) * 1000 - $1 * 60000 ))
}

Now, let’s say you want to stream messages from the api_responses topic, partition 9, starting with messages that were produced around 20 minutes ago. Simply run the following command to get the offset for messages that were produced around this time:

$ kafkacat -Q -b prod.cluster:9092 -t api_responses:9:$(minago 20)

# example response
api_responses [9] offset 331086143

Now that you have the offset, you can pass it in the -o flag when consuming via kafkacat.

$ kafkacat -q -b prod.cluster:9092 -t api_responses -p 9 -o 331086143
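If you do this often, the two steps can be chained so you do not have to copy the offset by hand. The sketch below is one way to do it, not the only one. The names parse_offset and consume_since are hypothetical, minago is repeated so the snippet is self-contained, and the parsing assumes the query output looks like the example response above.

```shell
# All function names here are hypothetical helpers, not kafkacat features.

# Epoch timestamp in milliseconds for N minutes ago.
minago() {
    echo $(( $(date +%s) * 1000 - $1 * 60000 ))
}

# Extract the offset from a line like "api_responses [9] offset 331086143"
# (assumes the query output format shown in the example response).
parse_offset() {
    awk '$(NF-1) == "offset" { print $NF; exit }'
}

# Usage: consume_since BROKER TOPIC PARTITION MINUTES
# Looks up the offset for MINUTES ago, then consumes from that offset.
consume_since() (
    broker=$1 topic=$2 partition=$3 minutes=$4
    offset=$(kafkacat -Q -b "$broker" -t "$topic:$partition:$(minago "$minutes")" | parse_offset)
    # Refuse to continue unless the lookup produced a non-negative integer.
    case $offset in
        ''|*[!0-9]*) echo "no usable offset for $topic [$partition]" >&2; exit 1 ;;
    esac
    kafkacat -C -q -b "$broker" -t "$topic" -p "$partition" -o "$offset"
)

# Example (needs a reachable broker, so it is left commented out):
# consume_since prod.cluster:9092 api_responses 9 20
```

The guard on the offset is deliberate: if the lookup fails or returns something unexpected, it is safer to stop than to start consuming from an unintended position.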

What now

Thanks for reading. Kafkacat is a great tool to have if you work with Kafka clusters on a regular basis, so I encourage you to play around with it more.