Last Month in Nakadi: June 2018

This is the fifth installment in the series of blog posts, “Last Month in Nakadi”, where I try to give some more details on new features and bug fixes that were released over the previous month.

[Changed] Reduced logging

Pull Request

Nakadi logs a lot of stuff. It’s very useful, but also comes with a cost. Recently, we were looking at our logs, and noticed that our SLO logging accounts for a large percentage of them. So we combined it with our access log, which significantly reduced the number of lines logged.

[Changed] Stricter JSON parser

Pull Request 1

Pull Request 2

While working on an internal service that consumes data from Nakadi, we noticed that the JSON parser in Nakadi is a bit too lax, and allows event producers to publish incorrect JSON. The issue is that our new service’s parser couldn’t parse that. In this change, we implemented a new parser, which is both more strict and more efficient. So, Nakadi performs a little bit better, and consumers have stronger guarantees regarding the events they consume.

In order to avoid breaking existing producers, we released this feature in two parts. First, we logged events that would be accepted by the old parser but not by the new one. We ran this version for a few days and got in touch with the affected producers. Once we were satisfied that no producers would be affected, we released the second pull request, which only uses the new, stricter parser.
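The two-step rollout can be sketched as follows. This is only an illustration in Python, not Nakadi’s parser: the "lax" parser here is Python’s default json module, which accepts the non-standard constants NaN and Infinity, and the "strict" one rejects them. The function names and the choice of NaN as the example of laxness are assumptions made for the sketch.

```python
import json


def parse_lax(raw: str):
    """Python's default parser: accepts NaN, Infinity and -Infinity."""
    return json.loads(raw)


def _reject_constant(name: str):
    # Called by the json module whenever it meets a non-standard constant.
    raise ValueError(f"non-standard JSON constant: {name}")


def parse_strict(raw: str):
    """Same parser, but refusing the non-standard constants."""
    return json.loads(raw, parse_constant=_reject_constant)


def classify(raw: str) -> str:
    """Return 'ok', 'lax_only' or 'invalid' for a raw event.

    'lax_only' marks events that the old parser accepts and the new one
    rejects: the ones worth logging before switching parsers.
    """
    try:
        parse_lax(raw)
    except ValueError:
        return "invalid"
    try:
        parse_strict(raw)
    except ValueError:
        return "lax_only"
    return "ok"
```

In the first phase, only the events classified as lax_only would be logged, so that their producers can be contacted; in the second phase, the strict parser alone decides.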

[New] Feature flag to allow the deletion of event types with subscriptions

Pull Request

So far, deleting an event type has only been possible if there were no subscriptions for this event type. The reason behind this is to make sure that no consumers are taken by surprise when an event type is deleted. In our staging deployment, we found that it can sometimes cause issues and delays, especially when consuming applications are configured to automatically re-create subscriptions when they are deleted.

Now, there is a new feature flag, XXX, that allows users to delete event types that have subscriptions attached to them. Note that this will also delete the subscriptions, so we do not recommend turning this feature on for production systems – accidents happen!

[Removed] No more feature flag for subscriptions API

Pull Request

A few months ago, we announced that the low-level consumption API was now considered deprecated, and that clients should not use it anymore. We have not yet set a deadline for its removal from the code, but it will come eventually. To consume events from Nakadi, clients should now use the subscription API, also known as HiLA (High-Level API). This API could be turned on and off with a feature toggle, but since it is now the only supported one, the toggle doesn’t make much sense anymore. So we removed it, in order to make the code a little easier to read and maintain.

[New] New attributes

Pull Request 1

Pull Request 2

These two pull requests add optional attributes to an event type, for consistency with Zalando’s REST API guidelines.

The first one is ordering_key_fields, which can be used when the order of events across multiple partitions is important. Nakadi only guarantees ordering within partitions, so this can be used by consumers to re-order events once they have been consumed.

The second one is audience, which determines the target of an event type. The target is a class of consumers, such as component_internal, external_partner, or others.

Both attributes are optional, and Nakadi does not perform any kind of enforcement as a result of these being set – they are purely informational.
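As an illustration of how a consumer might use ordering_key_fields, here is a minimal Python sketch that re-orders a batch of consumed events. It assumes that each entry is a dot-separated path into the event, and the event fields (order_id, version) are made up for the example.

```python
from functools import reduce
from typing import Any


def field_value(event: dict, path: str) -> Any:
    """Follow a dot-separated path such as 'metadata.version' into an event."""
    return reduce(lambda node, key: node[key], path.split("."), event)


def reorder(events: list, ordering_key_fields: list) -> list:
    """Sort consumed events by the values found at each ordering key field."""
    return sorted(
        events,
        key=lambda e: tuple(field_value(e, p) for p in ordering_key_fields),
    )


# Hypothetical events, received out of order from different partitions.
events = [
    {"order_id": "A", "version": 2},
    {"order_id": "A", "version": 1},
    {"order_id": "B", "version": 1},
]
ordered = reorder(events, ["order_id", "version"])
```

Since Nakadi does not enforce anything based on this attribute, the re-ordering is entirely up to the consumer.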

[Fix] Latest_available_offsets

Pull Request

There was a bug in the subscription stats endpoint. In some rare cases, it could look like there were unconsumed events in a subscription, but that was not actually the case. No events would be lost, but monitoring systems would report unconsumed events – which couldn’t be consumed, since they didn’t actually exist.

And that’s it for June. If you would like to contribute to Nakadi, please feel free to browse the issues on github, especially those marked with the “help wanted” tag. If you would like to implement a large new feature, please open an issue first to discuss it, so we can all agree on what it should look like. We very much welcome all sorts of contributions: not just code, but also documentation, help with the website, etc.

Open Sourcing Nakadi-UI

Almost 2 years ago, my colleague Sergii Kamenskyi started working on a web UI for Nakadi. So far it has been used internally at Zalando, providing our users with an easy way to find out about the data that flows through Nakadi. Last Friday, after getting approval from our open source team, Sergii released nakadi-ui with an open source license, and anyone who deploys Nakadi can now deploy the web UI as well.

Nakadi-ui is written in Elm, a functional language for web development, which I learned by reviewing some of Sergii’s pull requests. As far as I know, nakadi-ui is one of the largest open source codebases in Elm so far. And pretty much all of it was written by Sergii alone!

Nakadi-ui allows users to create and browse event types and subscriptions. They can see the details of an event type, such as the schema, retention period, authorization policy, and more. They can also get a list of producers and consumers, and even inspect the events in the event type. They can make changes to the event types they are allowed to edit, and delete them if necessary. For subscriptions, users can get the details of the subscriptions, as well as statistics about the number of unconsumed events, or the lag for each position. As we add new features to Nakadi, Sergii keeps improving nakadi-ui, so you can expect exciting new things coming soon!


Users of Nakadi tell us that the UI makes it much easier to use Nakadi and monitor their event types and subscriptions. Engineers use it to debug issues and recover from incidents; users who do not have a strong technical background also use it, to inspect event types and find out who is consuming which data; and operators of Nakadi, such as myself, use it to help troubleshoot users’ issues, test new features, or keep tabs on the system health.

You can get the code on GitHub. Bug reports, feature requests, and of course, pull requests, are very welcome to help us make nakadi-ui even better. As per the rules at Zalando, nakadi-ui is currently in the Zalando incubator. We will spend time, together with Zalando’s open source team, to build a community around it. We hope that nakadi-ui will soon graduate to a “proper” open source project!

Last Month in Nakadi: May 2018

This is the fourth installment in the series of blog posts, “Last Month in Nakadi”, where I try to give some more details on new features and bug fixes that were released over the previous month.

[New] Admins can set unlimited retention time

Pull Request

Every user can set and change the retention time of their event types very easily, up to the maximum retention time set by the Nakadi administrators. From now on, Nakadi administrators can bypass this limitation, and set an arbitrary retention time for event types.

This is useful when a user suffers an incident that will require them to re-consume data after they fix their software. Sometimes, releasing a fix can take a while, and users can ask administrators to increase the retention time temporarily, to avoid data loss.
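The rule can be summed up in a few lines of Python. This is a sketch of the behaviour described above, not Nakadi’s code; the function name, the millisecond unit, and the exception types are assumptions.

```python
def effective_retention_ms(requested_ms: int, max_ms: int, is_admin: bool) -> int:
    """Return the retention time to apply, or raise if it is not allowed."""
    if requested_ms <= 0:
        raise ValueError("retention time must be positive")
    # Regular users are capped by the configured maximum; admins are not.
    if requested_ms > max_ms and not is_admin:
        raise PermissionError("only administrators may exceed the maximum")
    return requested_ms
```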

[Fix] Improve performance of listing subscriptions with status

Pull Request

Last month, I talked about the new flag in the subscriptions endpoint that gives the status of each subscription. We have improved the performance of this endpoint by quite a lot, and the response now comes back immediately.

[New] Extend subscription statistics with time-lag information

Pull Request

The subscription stats endpoint (/subscriptions/{subscription_id}/stats) can return a new, optional field: the time lag. The time lag is the time in seconds between when the first unconsumed event in a partition was produced to Nakadi, and the time the request to the stats endpoint is made. To get the time lag, just set `show_time_lag` to `true` in your stats request.

This is really useful for monitoring subscriptions: you can tell how far back you are currently. If the time lag increases for some time, you are probably not catching up with the rate at which events are produced.
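Here is a small Python sketch of how a monitoring script might use this. It assumes that show_time_lag is passed as a query parameter, that the base URL is a placeholder, and that the time-lag readings have already been extracted from the responses; the "three increasing samples" rule is an arbitrary choice for the example.

```python
from urllib.parse import urlencode


def stats_url(base_url: str, subscription_id: str, show_time_lag: bool = True) -> str:
    """Build the URL of the stats endpoint, asking for the time lag."""
    query = urlencode({"show_time_lag": str(show_time_lag).lower()})
    return f"{base_url}/subscriptions/{subscription_id}/stats?{query}"


def is_falling_behind(lag_samples: list, min_samples: int = 3) -> bool:
    """True if the last `min_samples` time-lag readings strictly increase."""
    recent = lag_samples[-min_samples:]
    if len(recent) < min_samples:
        return False
    return all(a < b for a, b in zip(recent, recent[1:]))
```

A script could poll the URL at a fixed interval, keep the readings for each partition, and raise an alert when is_falling_behind returns True.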

[Fix] Provide gzip compression for POST responses if requested

Pull Request

‘GET-with-body’ (so, POST) queries did not get gzip-compressed responses, even when they requested them. This is now fixed.

And that’s it for May. If you would like to contribute to Nakadi, please feel free to browse the issues on github, especially those marked with the “help wanted” tag. If you would like to implement a large new feature, please open an issue first to discuss it, so we can all agree on what it should look like. We very much welcome all sorts of contributions: not just code, but also documentation, help with the website, etc.

Last Month in Nakadi: April 2018

This is the third instalment in the series of blog posts, “Last Month in Nakadi”, where I try to give some more details on new features and bug fixes that were released over the previous month.

New URL

Nakadi now has its own domain name! You can check out https://nakadi.io

[Fixes] Don’t log a complete stack trace when a resource does not exist

Pull Request 1
Pull Request 2

These two fixes are very similar. We found that, when a user tries to perform an operation on a resource that does not exist, Nakadi logs a complete stack trace. This is unnecessary, and can be an issue if Nakadi processes a lot of requests for resources that don’t exist: disks may fill up. The first fix is for non-existing subscriptions, and the second one, for non-existing event types.

Now, we just log one line to describe the error.

[Fix] Fix event parsing during production

Pull Request

So far, Nakadi has been parsing events for validation, then storing them to a stream of bytes for Kafka. The issue was that numbers might change between the user-provided format and the one exported by the json library (e.g., 0.0 would become 0). This fix ensures that Nakadi does not modify the users’ events, except for enrichment of course.
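The same class of problem can be reproduced in Python, although with different numbers (Python’s json module keeps 0.0 as it is, so this sketch uses 1.50 and 1e2 instead). It also shows the general idea of the fix: parse for validation only, and forward the producer’s original bytes. The function name is hypothetical, and enrichment is left out.

```python
import json

raw = b'{"price": 1.50, "count": 1e2}'

# Parsing and then re-serializing rewrites the numbers.
reserialized = json.dumps(json.loads(raw)).encode()


def validate_then_forward(raw_event: bytes) -> bytes:
    """Parse only to validate, then forward the original bytes untouched."""
    json.loads(raw_event)  # raises ValueError on malformed input
    return raw_event
```

Here, reserialized is no longer byte-for-byte identical to raw, whereas validate_then_forward(raw) returns exactly what the producer sent.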

[New] Subscriptions status

Pull Request

It is now possible to get the status of each partition in a subscription, from the subscriptions endpoint. This is useful for monitoring, so users know which partitions are assigned to which consumers, if any. With the introduction in March of the ability to consume from specific partitions, this is even more valuable.

This is not really a new feature, as it was already available in the subscription’s /stats endpoint. However, this one is much faster, as it does not try to compute the number of unconsumed events – an expensive operation.

To use it, just set the show_status flag to true in your request to the subscriptions endpoint.

[Fix] Multiple bug fixes

Pull Request 1 
Pull Request 2
Pull Request 3

Finally, we released 3 bug fixes together with the subscription status feature: the first one fixes a bug that occurred when committing offsets for a new subscription under specific circumstances.

The second one fixes a bug that could occur when users tried to consume from busy event types with large values for batch_limit and max_uncommitted_events.

The third one is for consumers who reach the stream_timeout they have set. At that point, Nakadi should flush the events it has accumulated so far, before closing the stream.

And that’s it for April. If you would like to contribute to Nakadi, please feel free to browse the issues on github, especially those marked with the “help wanted” tag. If you would like to implement a large new feature, please open an issue first to discuss it, so we can all agree on what it should look like. We very much welcome all sorts of contributions: not just code, but also documentation, help with the website, etc.

Last Month in Nakadi: March 2018

This is the second instalment in the series of blog posts, “Last Month in Nakadi”, where I try to give some more details on new features and bug fixes that were released over the previous month.

March saw an important dependency update, as well as a new feature. The former is thanks to our colleague Peter Liske, who has been working on the issue for quite some time.

JSON-schema validation library now uses RE2/J for regex pattern matching

Peter alerted us to the problem, and fixed it upstream. It turns out that a well-crafted regular expression in a schema could become a regex bomb when used to evaluate a simple string. Peter demonstrated how easy it would be to “kill” one instance of Nakadi for several minutes with a single message – and to kill a whole cluster by sending a sufficient number of messages.

The issue is with Java’s default regex engine (java.util.regex), which uses backtracking and can take exponential time on some patterns. Peter swapped it for the RE2/J library, which matches in linear time, in the dependency we use for json-schema validation, and now Nakadi can survive evaluating even the nastiest of regular expressions.
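To get a feel for what a regex bomb looks like, here is a small Python sketch. Python’s re module is also a backtracking engine, so it shows the same kind of blow-up. The pattern is a textbook example of catastrophic backtracking, not the one Peter used, and the input sizes are kept small on purpose.

```python
import re
import time

# Nested quantifiers: a classic catastrophic-backtracking pattern.
EVIL = re.compile(r"(a+)+$")


def match_seconds(n: int) -> float:
    """Time a failed match against n 'a' characters followed by 'b'."""
    subject = "a" * n + "b"
    start = time.perf_counter()
    result = EVIL.match(subject)
    elapsed = time.perf_counter() - start
    assert result is None  # the trailing 'b' makes the match fail
    return elapsed


if __name__ == "__main__":
    # Each extra character roughly doubles the work; keep n small.
    for n in (10, 14, 18, 20):
        print(n, f"{match_seconds(n):.4f}s")
```

With a backtracking engine, a few dozen characters are enough to keep a thread busy for a very long time, which is why a single message with the right schema could stall an instance.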

New feature: select which partitions to read from in a subscription

In February, we made the decision to deprecate the low-level API in Nakadi. It is currently still supported, but will be removed in the future. The subscription API did not cover one common use case that the low-level API provided: the ability for a consumer to choose which partitions to consume events from. For some users, it is important to make sure that all events in a given partition will be consumed by the same consumer. Perhaps they do some form of de-duplication, aggregation, or re-ordering, and such a feature makes their job a lot easier.

When we announced the deprecation of the low-level API, we promised to implement that feature in the subscriptions API, to allow users to migrate without issues. This is now done, and users can check out the relevant part of the documentation.

Here is a simple example of how it works. The “usual” way to consume from the subscription API is by creating a stream to get events from a subscription. Nakadi will automatically balance partitions between the consumers connected to the subscription, so that each partition is always connected to exactly one consumer. Given a subscription with ID 1234, it works like this:

GET {nakadi}/subscriptions/1234/events

Pretty simple. Now, if you want to specify which partitions you want to consume from, you need to send a “GET with body” (so, a POST) request, and specify the partitions you want in the body. For example, if you want to get partitions 0 and 1 from event type my-event-type, you would do something like this:

POST {nakadi}/subscriptions/1234/events -d '{"partitions": [{"event-type": "my-event-type", "partition": "0"},{"event-type": "my-event-type", "partition": "1"}]}'

Simple. And of course, you can have both types of consumers simultaneously consuming from the same subscription. In this case, the automatically balanced consumers will share the partitions that have not been requested specifically.
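For those who prefer code to curl, the POST request above could be built like this in Python. This is a sketch only: the base URL is a placeholder, authentication is left out, the function name is made up, and the keys in the body are copied from the example above, so please check them against the API documentation before relying on them. The request is built but not sent.

```python
import json
from urllib.request import Request


def partition_stream_request(base_url, subscription_id, event_type, partitions):
    """Build (but do not send) a request for specific partitions."""
    body = {
        "partitions": [
            {"event-type": event_type, "partition": p} for p in partitions
        ]
    }
    return Request(
        url=f"{base_url}/subscriptions/{subscription_id}/events",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = partition_stream_request(
    "https://nakadi.example.org", "1234", "my-event-type", ["0", "1"]
)
```

Passing req to urllib.request.urlopen would open the stream; a real client would then read the response line by line, since the events arrive as a long-lived stream of batches.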

And that’s it for March. If you would like to contribute to Nakadi, please feel free to browse the issues on github, especially those marked with the “help wanted” tag. If you would like to implement a large new feature, please open an issue first to discuss it, so we can all agree on what it should look like. We very much welcome all sorts of contributions: not just code, but also documentation, help with the website, etc.

Last Month in Nakadi: February 2018

I’m experimenting with a new series of posts, called “Last Month in Nakadi”. In the Nakadi project, we maintain a changelog, that we update on each release. Each entry in the file is a one-line summary of a change that was implemented, but that alone is not always sufficient to understand what happened. There is still a fair amount of discussion and context that stays hidden inside Zalando, but we are working on changing that too.

Therefore, I will try, once a month, to provide some context on the changes that we released the month before. I hope that users of Nakadi, and people interested in deploying their own Nakadi-based service, will find this summary useful. Let’s start then, with what we released last month, February 2018.

2.5.7

Released on the 15th of February, this version includes one bug fix, and one performance improvement.

Fix: Problem JSON for authorization issues

A user of Nakadi reported that Nakadi does not provide a correct Problem JSON when authorization has failed.

Improvement: subscription rebalance

We found that, when rebalancing a subscription, Nakadi calls Zookeeper several times, which is costly. This improvement reduces the number of calls to Zookeeper when rebalancing subscriptions, improving the speed of rebalances.

2.5.8

Released on the 22nd of February, this version brings a new feature: the ability to allow a set of applications to get read access to all event types, overriding individual event types’ authorization policies, for archival purposes.

At Zalando, we maintain a data lake, where data is stored and made available to authorised users for analysis. One of the preferred ways to get data into the data lake is to push it to our deployment of Nakadi. Events are then consumed by the data lake ingestion applications, and saved there. Over time, we have noticed that event type owners, when setting or updating their event types’ authorisation policies, would on occasion forget to whitelist the data lake applications, causing delays in data ingestion. Another issue we noticed is that, should the data lake team use a different application to ingest data (they actually use several applications, working together), they would have to contact the owners of all event types from which data is ingested – that’s a lot of people, and a huge burden.

So, we decided to allow these applications to bypass the event types’ authorization policies, such that event type owners would not accidentally block the data lake’s read access. In a future release, we could add a way for the event type owner to indicate that they do not want their data ingested into the data lake.
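In pseudo-real terms, the read check now looks something like the Python sketch below. The function, the application names, and the way policies are represented are all hypothetical; the only point is that archiving applications are checked before the event type’s own policy.

```python
def can_read(application: str, event_type_readers: set, archivers: set) -> bool:
    """Archiving applications may read every event type; everyone else
    must be listed in the event type's own authorization policy."""
    return application in archivers or application in event_type_readers


ARCHIVERS = {"data-lake-ingestion"}  # hypothetical application name
readers = {"order-service"}  # hypothetical policy of one event type
```

With this, an event type owner who forgets to whitelist the archiving application no longer blocks its read access, while other applications are still subject to the policy.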

We also added an optional warning header, sent when an event type is created or updated. We use it to remind our users that their data may be archived, even if the archiving application is not whitelisted for their event type. You can choose the message you want – or no message at all.

And that’s it for February. If you would like to contribute to Nakadi, please feel free to browse the issues on github, especially those marked with the “help wanted” tag. If you would like to implement a large new feature, please open an issue first to discuss it, so we can all agree on what it should look like. We very much welcome all sorts of contributions: not just code, but also documentation, help with the website, etc.

What Has Our Team Been Up To?

Around the start of this year, I created this blog, to replace my static website. Since then, I have mostly been writing about talks I have given, and I have a few posts in preparation that detail what I am working on (in case you didn’t figure it out yet, it’s called Nakadi.)

Some of my colleagues have already written, on Zalando’s tech blog, about some of the things that we do in our team. Not only do we work on Nakadi, but we also operate it as a service, running on AWS. They wrote about some of the challenges we met, and how we tackled them. We are happy to report that, even when on call, we sleep very well at night: our services are pretty resilient, and out-of-hours calls are the exception, rather than the rule.

Last year, Andrey wrote about his work with Kafka and EBS volumes. We keep a lot of data in Kafka. It used to be that, every time we lost a Kafka broker, or every time we had to restart one, the broker would have to collect once again all its data from other replicas in the cluster. This would take a long time, and while the data was being replicated, we would only have two in-sync replicas for each of the partitions on that broker. Upgrading Kafka would take a week, during which brokers would use some bandwidth – and IO – to replicate data. Andrey solved the issue by making sure that Kafka’s data is stored on a permanent EBS volume that does not get destroyed when the instance it is attached to goes down. He then worked on upgrade and recovery scripts, such that new brokers will automatically attach previously detached volumes, which greatly reduces the amount of data to synchronise: we only need to copy whatever had been written to Kafka since the broker went down. His work saved us, and continues to save us, considerable amounts of time. It also dramatically reduces the amount of time during which a number of partitions have fewer than 3 in-sync replicas.

Another post on our work was early this year, by Ricardo. He talked about how he solved one of our biggest pain points in terms of operations: Zookeeper. For the longest time, we were absolutely terrified of a Zookeeper node going down: it would come back, of course, but with a different IP address, and Kafka only takes a fixed list of IP addresses for Zookeeper. Losing a Zookeeper node is not the end of the world, of course, since we run an ensemble of 3. But it did require a rolling restart of Kafka (and a redeployment of Nakadi), which is a time-consuming operation. Losing 2 Zookeeper nodes would have been a catastrophe, but fortunately that hasn’t happened. In his work, Ricardo focused on making sure that Zookeeper nodes always get the same, private, IP address (EIPs were not an option for us). So now, when a Zookeeper node goes down, we know that it will be back a couple of minutes later, with the same address. No more rolling restarts of Kafka!

Last, but not least, Sergii very recently started writing about his previous experience with security while working for an airport in Ukraine. Go read it, it is both instructive and funny. I’m really looking forward to episode 3!

Stay tuned for more news from team Aruha (that’s our name!)