What Has Our Team Been Up To?

Around the start of this year, I created this blog, to replace my static website. Since then, I have mostly been writing about talks I have given, and I have a few posts in preparation that detail what I am working on (in case you didn’t figure it out yet, it’s called Nakadi.)

Some of my colleagues have already written, on Zalando’s tech blog, about some of the things that we do in our team. Not only do we work on Nakadi, but we also operate it as a service, running on AWS. They wrote about some of the challenges we met, and how we tackled them. We are happy to report that, even when on call, we sleep very well at night: our services are pretty resilient, and out-of-hours calls are the exception, rather than the rule.

Last year, Andrey wrote about his work with Kafka and EBS volumes. We keep a lot of data in Kafka. It used to be that, every time we lost a Kafka broker, or every time we had to restart one, the broker would have to collect once again all its data from other replicas in the cluster. This would take a long time, and while the data was being replicated, we would only have two in-sync replicas for each of the partitions on that broker. Upgrading Kafka would take a week, during which brokers would use some bandwidth – and IO – to replicate data. Andrey solved the issue by making sure that Kafka’s data is stored on a permanent EBS volume, that does not get destroyed when the instance it is attached to goes down. He then worked on upgrade and recovery scripts, such that new brokers will automatically attach previously detached volumes, which greatly reduces the amount of data to synchronise: we only need to copy whatever had been written to Kafka since the broker went down. His work saved us, and continues to save us, considerable amounts of time. It also reduces dramatically the amount of time during which a number of partitions have less than 3 in-syn replicas.

Another post on our work was early this year, by Ricardo. He talked about how he solved one of our biggest pain points in terms of operations: Zookeeper. For the longest time, we were absolutely terrified of a Zookeeper node going down: it would come back, of course, but with a different IP address, and Kafka only takes a fixed list of IP addresses for Zookeeper. Losing a Zookeeper node is not the end of the world, of course, since we run an ensemble of 3. But it did require a rolling restart of Kafka (and a redeployment of Nakadi), which is a time-consuming operation. Losing 2 Zookeeper nodes would have been a catastrophe, but fortunately that hasn’t happened. In his work, Ricardo focused on making sure that Zookeeper nodes always get the same, private, IP address (EIPs were not an option for us). So now, when a Zookeeper node goes down, we know that it will be back a couple of minutes later, with the same address. No more rolling restarts of Kafka!

Last, but not least, Sergii very recently started writing about his previous experience with security while working for an airport in Ukraine. Go read it, it is both instructive and funny. I’m really looking forward to episode 3!

Stay tuned for more news from team Aruha (that’s our name!)