
Expanding a Kafka cluster is pretty straightforward. However, one particular step can be fraught with danger when adding brokers to a 0.10.0 or earlier cluster: running the partition reassignment script. The danger stems from the fact that in Kafka versions 0.10.0 or earlier, the replication process “will use all available resources to replicate as fast as possible” (KAFKA-1464). In other words, as soon as you kick off the partition reassignment script, your network could be saturated with replication traffic. The implications of this are far-reaching, and we will discuss them in detail below.
Note: The above issue was fixed in version 0.10.1, so if you are using 0.10.1 or later, simply pass the --throttle flag when running the partition reassignment. If you are stuck with 0.10.0 or earlier brokers, keep reading.
As I mentioned above, the primary thing to be concerned about is saturating the network with replication traffic. When this happens, any existing producers or consumers you have running against the cluster may receive metadata timeouts, since communication with the brokers will suffer.
When generating a candidate reassignment, split the job into small chunks.
Stir in slowly.
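To make the "small chunks" advice concrete, here is a minimal sketch in Python. It assumes the candidate reassignment is in the usual JSON layout accepted by the reassignment script (a "version" field plus a "partitions" list of topic/partition/replicas entries). The function name split_reassignment and the chunk size are made up for illustration; pick a chunk size that suits your own network headroom.

```python
import json


def split_reassignment(plan, chunk_size):
    """Split one reassignment plan into several smaller plans.

    `plan` is a dict in the reassignment JSON layout:
    {"version": 1, "partitions": [{"topic": ..., "partition": ..., "replicas": [...]}]}
    Each returned plan holds at most `chunk_size` partitions, in the original order.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    partitions = plan["partitions"]
    version = plan.get("version", 1)
    return [
        {"version": version, "partitions": partitions[i:i + chunk_size]}
        for i in range(0, len(partitions), chunk_size)
    ]


if __name__ == "__main__":
    # Hypothetical candidate plan: five partitions of a topic named "events".
    candidate = {
        "version": 1,
        "partitions": [
            {"topic": "events", "partition": p, "replicas": [1, 2]}
            for p in range(5)
        ],
    }
    for n, chunk in enumerate(split_reassignment(candidate, 2)):
        # Each printed line is one self-contained plan to save to its own file.
        print(n, json.dumps(chunk))
```

Each chunk can then be saved to its own file and fed to the reassignment script one at a time, waiting for --verify to report completion before starting the next.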
If something does go wrong, you may be tempted to kill the job. One method is to delete the /admin/reassign_partitions key in ZooKeeper. However, doing this may leave your cluster in a weird state. Here is an actual example of what I mean. You can see in the chart below