Posts /

The most dangerous part of expanding a 0.10.0 or earlier Kafka cluster

Background

Expanding a Kafka cluster is pretty straight forward. However, one particular step can be fraught with danger when adding brokers to a 0.10.0 or earlier cluster: running the partition reassignment script. The danger involved in this step stems from the fact that in Kafka versions 0.10.0 or earlier, the replication process “will use all available resources to replicate as fast as possible” (KAFKA-1464). In other words, as soon as you kick off the partition reassignment script, your network could be saturated with replication traffic. The implications of this are far-reaching, and we will discuss this in detail below.

Note: The above issue was actually fixed in version 0.10.1, so if you are using that 0.10.1 or later, simply run the partition reassignments with the throttle flag when running the partition reassignments. If you are stuck with 0.10.0 or earlier brokers, keep reading.

What could go wrong?

As I mentioned above, the primary thing to be concerned about is saturating the network with replication traffic. When this happens, any existing producers or consumers you have runnning against the cluster may receive metadata timeouts since communication with the brokers will suffer.

Split the job up

When generating a candidate reassignment, split the job into small chunks.

Introduce one broker at a time

Stir in slowly.

Make sure you have plenty of diskspace

Think twice about killing the job

If something does go wrong, you may be tempted to kill the job. One method is to delete the /admin/reassign_partitions key in Zookeeper. However, doing this may leave your cluster in a weird state. Here is an actual of what I mean. You can see in the chart below

Upgrade to 0.10.1 or higher