What Has Our Team Been Up To?

Around the start of this year, I created this blog to replace my static website. Since then, I have mostly been writing about talks I have given, and I have a few posts in preparation that detail what I am working on (in case you didn't figure it out yet, it's called Nakadi).

Some of my colleagues have already written, on Zalando's tech blog, about the things we do in our team. Not only do we work on Nakadi, but we also operate it as a service, running on AWS. They wrote about some of the challenges we have met, and how we tackled them. We are happy to report that, even when on call, we sleep very well at night: our services are pretty resilient, and out-of-hours calls are the exception rather than the rule.

Last year, Andrey wrote about his work with Kafka and EBS volumes. We keep a lot of data in Kafka. It used to be that, every time we lost a Kafka broker, or had to restart one, the broker would have to fetch all of its data again from the other replicas in the cluster. This would take a long time, and while the data was being replicated, we would only have two in-sync replicas for each of the partitions on that broker. Upgrading Kafka would take a week, during which brokers would use some bandwidth – and IO – to replicate data. Andrey solved the issue by making sure that Kafka's data is stored on a persistent EBS volume, one that does not get destroyed when the instance it is attached to goes down. He then worked on upgrade and recovery scripts, so that new brokers automatically attach previously detached volumes, which greatly reduces the amount of data to synchronise: we only need to copy whatever was written to Kafka after the broker went down. His work saved us, and continues to save us, considerable amounts of time. It also dramatically reduces the time during which some partitions have fewer than 3 in-sync replicas.
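To give an idea of what such a recovery step might look like, here is a minimal sketch using boto3. This is not Andrey's actual code: the tag names, device path and region are made up, and a real script would also wait for the attachment to complete and mount the filesystem.

```python
# Hypothetical sketch: re-attach a broker's detached data volume on startup.
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

def attach_broker_volume(broker_id: str, instance_id: str) -> None:
    # Look up the detached EBS volume previously used by this broker
    # (assumes volumes are tagged with a made-up "kafka-broker-id" tag).
    volumes = ec2.describe_volumes(
        Filters=[
            {"Name": "tag:kafka-broker-id", "Values": [broker_id]},
            {"Name": "status", "Values": ["available"]},
        ]
    )["Volumes"]
    if not volumes:
        raise RuntimeError(f"no detached volume found for broker {broker_id}")

    # Attach it to the replacement instance. Kafka's log.dirs points at
    # the filesystem on this device, so the broker only needs to catch
    # up on data written since it went down, not re-replicate everything.
    ec2.attach_volume(
        VolumeId=volumes[0]["VolumeId"],
        InstanceId=instance_id,
        Device="/dev/xvdk",
    )
```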

Early this year, Ricardo wrote another post about our work. He explained how he solved one of our biggest operational pain points: Zookeeper. For the longest time, we were terrified of a Zookeeper node going down: it would come back, of course, but with a different IP address, and Kafka only takes a fixed list of IP addresses for Zookeeper. Losing a Zookeeper node is not the end of the world, since we run an ensemble of 3, but it did require a rolling restart of Kafka (and a redeployment of Nakadi), which is a time-consuming operation. Losing 2 Zookeeper nodes would have been a catastrophe, but fortunately that has not happened. Ricardo focused on making sure that Zookeeper nodes always get the same private IP address (EIPs were not an option for us, as they are public). So now, when a Zookeeper node goes down, we know that it will be back a couple of minutes later, with the same address. No more rolling restarts of Kafka!
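One way to pin a private IP like this is to reserve a network interface per Zookeeper node and attach it to whatever instance currently plays that role. The sketch below is hypothetical (not Ricardo's actual approach, and the tag names are invented), but it shows the general idea.

```python
# Hypothetical sketch: give a replacement Zookeeper node its
# predecessor's fixed private IP by attaching a pre-created ENI.
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

def attach_fixed_ip(zookeeper_node: str, instance_id: str) -> None:
    # Find the ENI reserved for this Zookeeper node, created in advance
    # with a fixed private IP (e.g. 10.0.0.11) and tagged with a
    # made-up "zookeeper-node" tag.
    enis = ec2.describe_network_interfaces(
        Filters=[
            {"Name": "tag:zookeeper-node", "Values": [zookeeper_node]},
            {"Name": "status", "Values": ["available"]},
        ]
    )["NetworkInterfaces"]
    if not enis:
        raise RuntimeError(f"no free ENI for node {zookeeper_node}")

    # Attach it as a secondary interface: the node is reachable at the
    # same address as its predecessor, so the fixed list of Zookeeper
    # IPs in Kafka's configuration stays valid, and no restart is needed.
    ec2.attach_network_interface(
        NetworkInterfaceId=enis[0]["NetworkInterfaceId"],
        InstanceId=instance_id,
        DeviceIndex=1,
    )
```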

Last, but not least, Sergii very recently started writing about his previous experience with security while working for an airport in Ukraine. Go read it; it is both instructive and funny. I'm really looking forward to episode 3!

Stay tuned for more news from team Aruha (that's our name!).

Feb 3rd: At FOSDEM to talk about Nakadi

Back when I was studying in Belgium, I religiously attended FOSDEM – the Free and Open Source Software Developers' European Meeting – every year in Brussels. In fact, as a member of the NamurLUG, I was part of the team that recorded the talks at FOSDEM for quite a few years. Initially we recorded with consumer-grade cameras, but we soon upgraded to better equipment, and after a couple of years we even started streaming the events live. Since then, another team has taken over, and the quality of the recordings has improved quite a lot from our very amateur debuts.

This year, I will be back at FOSDEM, but this time I'll be on the other side: I will give a Lightning Talk about Nakadi, the event broker I work on at Zalando. Nakadi is open source software, and provides a RESTful API on top of Kafka-like queues (we have plans to support Kinesis in the future), as well as a bunch of other features: schema validation with json-schema, schema evolution, per-event-type authorization, and more. In this talk I will focus on one of my favourite features: Timelines. What is Timelines? Well, I guess you'll have to watch my talk to find out (or wait for the blog post explaining it, I am working on one)! If you can't make it to Brussels for FOSDEM, the talk will be streamed live and recorded.
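To give a flavour of the RESTful API, here is what publishing a batch of events to an event type looks like. The deployment URL, token and event type name below are made up; the publishing endpoint itself is Nakadi's `POST /event-types/{name}/events`.

```python
# Hypothetical sketch: publish a batch of events to a Nakadi event type.
import requests

NAKADI_URL = "https://nakadi.example.com"  # made-up deployment URL
TOKEN = "my-oauth-token"                   # made-up credentials

events = [
    {
        "metadata": {
            "eid": "8f2f2f64-1111-4f62-9e7a-2d6f3c1a0001",
            "occurred_at": "2018-01-15T10:00:00+01:00",
        },
        "order_number": "24873243241",
    }
]

# Each event is validated against the event type's json-schema before
# it is accepted into the stream.
resp = requests.post(
    f"{NAKADI_URL}/event-types/order_created/events",
    json=events,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
```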

Some of my colleagues will also speak at FOSDEM, and more will be in attendance. Oleksii Kliukin and Jan Mußler will give a talk called "Blue elephant on-demand: Postgres + Kubernetes" in the Postgres devroom, and Ferit Topcu will talk about "Automating styleguides with DocumentJS" in the "Tool the Docs" devroom.

See you all in Brussels in February!