What Has Our Team Been Up To?

Around the start of this year, I created this blog to replace my static website. Since then, I have mostly been writing about talks I have given, and I have a few posts in preparation that detail what I am working on (in case you haven’t figured it out yet, it’s called Nakadi).

Some of my colleagues have already written, on Zalando’s tech blog, about some of the things that we do in our team. Not only do we work on Nakadi, but we also operate it as a service, running on AWS. They wrote about some of the challenges we met, and how we tackled them. We are happy to report that, even when on call, we sleep very well at night: our services are pretty resilient, and out-of-hours calls are the exception, rather than the rule.

Last year, Andrey wrote about his work with Kafka and EBS volumes. We keep a lot of data in Kafka. It used to be that every time we lost a Kafka broker, or had to restart one, the broker would have to fetch all of its data again from the other replicas in the cluster. This took a long time, and while the data was being replicated, we only had two in-sync replicas for each of the partitions on that broker. Upgrading Kafka would take a week, during which brokers would spend bandwidth – and IO – replicating data. Andrey solved the issue by storing Kafka’s data on a persistent EBS volume, which does not get destroyed when the instance it is attached to goes down. He then worked on upgrade and recovery scripts, so that new brokers automatically attach previously detached volumes, which greatly reduces the amount of data to synchronise: we only need to copy whatever was written to Kafka after the broker went down. His work saved us, and continues to save us, considerable amounts of time. It also dramatically reduces the time during which some partitions have fewer than 3 in-sync replicas.
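
The core idea can be sketched in a few lines. This is a minimal, hypothetical illustration of matching detached volumes to brokers (the `kafka-broker-id` tag and the data shapes are made up for the example), not Andrey’s actual scripts:

```python
# Sketch of the volume-reattachment idea: on startup, a new broker looks for
# the detached EBS volume that belongs to its broker id and reattaches it
# instead of starting with an empty disk. The "kafka-broker-id" tag is a
# hypothetical convention; the dicts mimic the shape of EC2 DescribeVolumes
# results.

def find_reattachable_volume(volumes, broker_id):
    """Pick the detached ('available') volume tagged for this broker id."""
    for vol in volumes:
        tags = {t["Key"]: t["Value"] for t in vol.get("Tags", [])}
        if vol["State"] == "available" and tags.get("kafka-broker-id") == str(broker_id):
            return vol["VolumeId"]
    return None  # no detached volume: first start, full replication needed


volumes = [
    {"VolumeId": "vol-1", "State": "in-use",
     "Tags": [{"Key": "kafka-broker-id", "Value": "1"}]},
    {"VolumeId": "vol-2", "State": "available",
     "Tags": [{"Key": "kafka-broker-id", "Value": "2"}]},
]
print(find_reattachable_volume(volumes, 2))  # vol-2
```

If no matching volume is found, the broker falls back to the old behaviour: an empty disk and a full resync from the other replicas.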

Another post about our work appeared early this year, by Ricardo. He explained how he solved one of our biggest operational pain points: Zookeeper. For the longest time, we were absolutely terrified of a Zookeeper node going down: it would come back, of course, but with a different IP address, and Kafka only takes a fixed list of IP addresses for Zookeeper. Losing a Zookeeper node is not the end of the world, of course, since we run an ensemble of 3. But it did require a rolling restart of Kafka (and a redeployment of Nakadi), which is a time-consuming operation. Losing 2 Zookeeper nodes would have been a catastrophe, but fortunately that hasn’t happened. Ricardo focused on making sure that Zookeeper nodes always get the same, private, IP address (EIPs were not an option for us). So now, when a Zookeeper node goes down, we know that it will be back a couple of minutes later, with the same address. No more rolling restarts of Kafka!
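
To illustrate the idea of stable addresses: each Zookeeper node id can be deterministically mapped to a fixed address inside the subnet, so a replaced node can request the same private IP when it is relaunched. This is a toy sketch of that mapping, not Ricardo’s implementation; the offset of 10 (skipping the reserved addresses at the start of an AWS subnet) is an assumption for illustration:

```python
import ipaddress

# Map a Zookeeper node id to a fixed private IP inside the subnet, so a
# replacement node can always claim the same address. The offset is a
# made-up convention for this sketch.

def zookeeper_private_ip(subnet_cidr, node_id, offset=10):
    subnet = ipaddress.ip_network(subnet_cidr)
    return str(subnet.network_address + offset + node_id)


ips = [zookeeper_private_ip("10.0.1.0/24", i) for i in (1, 2, 3)]
print(ips)  # ['10.0.1.11', '10.0.1.12', '10.0.1.13']
```

Because the mapping is deterministic, Kafka’s Zookeeper connection string never changes, even as individual nodes come and go.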

Last, but not least, Sergii very recently started writing about his previous experience with security while working for an airport in Ukraine. Go read it, it is both instructive and funny. I’m really looking forward to episode 3!

Stay tuned for more news from team Aruha (that’s our name)!

Talk: Nakadi at Zalando’s first Kafka meetup

Back in November, a few colleagues and I (actually, our entire team) organised our first Kafka meetup. In this meetup we wanted to bring together engineers and devops who run software around Kafka, or maintain Kafka in production, to exchange knowledge and discuss our experiences. We wanted to talk about successes as well as failures and challenges. No sales pitches, just the truth of what we have struggled with.

For this first edition, all speakers were engineers at Zalando, as we didn’t know how much interest there would be from outside. We had short talks (10 minutes each, plus another 5 minutes for questions), and we had seven of them (yes, seven).

After an introduction by our team’s engineering lead, Himanshu Gahlaut, I talked about Nakadi for a bit. My colleague Ricardo de Cillo then talked about operating Kafka on AWS: choosing the right EC2 instance types, sizing the cluster, and how much disk space to use; failures, and how to recover from them; and configuring Kafka to run smoothly on virtual machines that could be terminated at any moment.

Dmitry Sorokin then spoke about Bubuku, our open source supervisor for running Kafka on AWS. Bubuku is a very interesting supervisor, with a lot of features. Not only can it control individual brokers, but it can also trigger rolling restarts of an entire cluster, calculate a fair distribution of partitions among brokers and trigger the appropriate rebalance operations, and much more.
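
To give a flavour of the partition-distribution step: a simple way to spread replicas fairly is round-robin over the brokers. This toy sketch only illustrates the idea; Bubuku’s real rebalance logic is more involved (it also has to trigger the actual Kafka reassignment, and minimise data movement):

```python
# Toy "fair distribution" of partition replicas over brokers: assign each
# partition's replica set round-robin, so leaders (the first replica) are
# spread evenly and no replica set repeats a broker.

def assign_partitions(partitions, brokers, replication_factor=3):
    """Return {partition: [broker, ...]} with replicas spread round-robin."""
    assignment = {}
    n = len(brokers)
    for i, partition in enumerate(partitions):
        assignment[partition] = [brokers[(i + r) % n] for r in range(replication_factor)]
    return assignment


layout = assign_partitions(["t-0", "t-1", "t-2", "t-3"], [101, 102, 103])
print(layout)
# {'t-0': [101, 102, 103], 't-1': [102, 103, 101],
#  't-2': [103, 101, 102], 't-3': [101, 102, 103]}
```

Each broker leads roughly the same number of partitions, which is what keeps the load on the cluster balanced.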

After the break, Andrey Dyachkov discussed how we upgrade Kafka brokers without losing their data, and how the same mechanism can deal with a broker that gets terminated in the middle of the night. Long story short: it comes back with the same storage, without manual intervention.

The next speaker was Max Schulze, from a team we work with very closely. He works on Zalando’s data lake, and talked about some aspects of how it was built.

Daniel Truemper, from yet another team, talked about how they operate Kafka for communication between microservices.

All the talks were recorded. As far as I know the video is not yet available online, but hopefully that will change soon. Unfortunately, something was wrong with the projector, and the slides were displayed with a very strong shade of alien green. I wish I had taken pictures. You’ll see it on the video. It’s really green.

The meetup turned out to be very successful: the room was full, and the feedback we got during and after the event was very positive (we’ll try to have more, and colder, beer next time). So we decided to have another one. We’ll have fewer talks, to leave more time for discussions, but we will keep the 10-15 minutes per talk format. Watch out for the announcement; it should be out around the end of February!

Feb 3rd: At FOSDEM to talk about Nakadi

Back when I was studying in Belgium, I religiously attended FOSDEM – the Free and Open Source Software Developers’ European Meeting – every year, in Brussels. In fact, as a member of the NamurLUG, I was part of the team that recorded the talks at FOSDEM for quite a few years. Initially we recorded with consumer-grade cameras, but we soon upgraded to better-quality equipment, and after a couple of years we even started streaming the events live. Since then, another team has taken over, and the quality of the recordings has improved quite a lot from our very amateur debuts.

This year, I will be back at FOSDEM, but this time I’ll be on the other side: I will give a Lightning Talk about Nakadi, the Event Broker I work on at Zalando. Nakadi is Open Source Software, and provides a RESTful API on top of Kafka-like queues (we have plans to support Kinesis in the future), as well as a bunch of other features: schema validation with json-schema, schema evolution, per-event-type authorization, and more. In this talk I will focus on one of my favourite features: Timelines. What are Timelines? Well, I guess you’ll have to watch my talk to find out (or wait for the blog post explaining it – I am working on one)! If you can’t make it to Brussels for FOSDEM, the talk will be streamed live and recorded.
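
To give a flavour of the schema-validation feature: Nakadi checks each published event against the json-schema registered for its event type, and rejects events that don’t match. The sketch below hand-rolls just two basic json-schema keywords (`required` and `type`) to show the principle; the `order_created` schema and the event fields are invented for the example, and real json-schema validation covers far more:

```python
# Minimal illustration of json-schema-style validation as an event broker
# might apply it to published events. Only "required" and "type" are handled
# here; the schema and events are made up.

TYPES = {"string": str, "number": (int, float), "object": dict}

def validate(event, schema):
    """Return a list of validation errors (an empty list means the event is accepted)."""
    errors = []
    for field in schema.get("required", []):
        if field not in event:
            errors.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field in event and not isinstance(event[field], TYPES[spec["type"]]):
            errors.append(f"wrong type for field: {field}")
    return errors


order_created = {
    "type": "object",
    "required": ["order_number", "amount"],
    "properties": {"order_number": {"type": "string"}, "amount": {"type": "number"}},
}

print(validate({"order_number": "24873", "amount": 42.0}, order_created))  # []
print(validate({"amount": "not a number"}, order_created))
# ['missing required field: order_number', 'wrong type for field: amount']
```

Rejecting malformed events at publish time means consumers can rely on every event they read conforming to the event type’s schema.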

Some of my colleagues will also speak at FOSDEM, and more will be in attendance. Oleksii Kliukin and Jan Mußler will give a talk called “Blue elephant on-demand: Postgres + Kubernetes” in the Postgres devroom, and Ferit Topcu will talk about “Automating styleguides with DocumentJS” in the “tool the docs” devroom.

See you all in Brussels in February!

Data Natives Panel: Open Source Data Projects in Berlin

Back in November, I was at the Data Natives conference in Berlin, to take part in a panel on ‘Open Source Data projects in Berlin’. The other panelists were Kostas Tzoumas, Co-Founder and CEO of data Artisans, Ines Montani, Founder of Explosion AI and spaCy developer, and Andreas Dewes, Founder at 7scientists GmbH. The panel was moderated by Dr. Kristian Rother, Python Trainer at Academis. I was there, representing Zalando, and in particular my team, that maintains two open source projects, Nakadi and Bubuku.

All of us are involved in open source software development, so it will surprise nobody that we all agreed that open source is a good thing. It is getting hard to find people who disagree with that statement these days. Still, our approaches to open source software differ: the way we do open source at Zalando, a relatively large tech company, is quite different from the way a two-person consultancy does it.

In our discussion, we had an interesting exchange on a variety of questions: the reasons for a company to open-source software, licensing, building and managing a community, and many more. The talk was recorded; you can find the video below.

And Now for Something Completely Different

Gone is the static website. Instead, this new blog.


I’m a software engineer, and currently I work on an open source event broker called Nakadi. I’m interested in data engineering, open source software, self-adaptive systems, and authorization.

On this blog I will talk about what I am currently working on, post (technical) book reviews, and updates about talks I attend or even give. And jokes. I have a few posts in drafts to get started with. Then, we’ll see how it goes.