Last Month in Nakadi: May 2018

This is the fourth instalment in the series of blog posts, “Last Month in Nakadi”, where I try to give some more details on new features and bug fixes that were released over the previous month.

[New] Admins can set unlimited retention time

Pull Request

Every user can easily set and change the retention time of their event types, up to the maximum retention time configured by the Nakadi administrators. From now on, administrators can bypass this limit and set an arbitrary retention time for an event type.

This is useful when a user suffers an incident that requires them to re-consume data after fixing their software. Releasing a fix can take a while, so users can ask administrators to temporarily increase the retention time to avoid data loss.
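As a sketch (the event type name and the numbers here are hypothetical, and the rest of the event type definition is omitted): retention time lives in the event type’s `options` object, expressed in milliseconds, so an administrator could raise it to, say, 14 days with a regular event type update:

PUT {nakadi}/event-types/my-event-type -d '{"options": {"retention_time": 1209600000}}'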

[Fix] Improve performance of listing subscriptions with status

Pull Request

Last month, I talked about the new flag in the subscriptions endpoint that returns the status of each subscription. We have significantly improved the performance of this endpoint, and responses now come back immediately.

[New] Extend subscription statistics with time-lag information

Pull Request

The subscription stats endpoint (/subscriptions/{subscription_id}/stats) can now return a new, optional field: the time lag. The time lag is the number of seconds between the moment the first unconsumed event in a partition was published to Nakadi and the moment the stats request is made. To get the time lag, just set `show_time_lag` to `true` in your stats request.

This is really useful for monitoring subscriptions: you can tell how far back you are currently. If the time lag increases for some time, you are probably not catching up with the rate at which events are produced.
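For example, assuming `show_time_lag` is passed as a query parameter, a stats request and an illustrative response fragment might look like this (the exact field names may differ, so check the API reference):

GET {nakadi}/subscriptions/1234/stats?show_time_lag=true

{
  "items": [
    {
      "event_type": "my-event-type",
      "partitions": [
        {"partition": "0", "unconsumed_events": 42, "consumer_lag_seconds": 360}
      ]
    }
  ]
}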

[Fix] Provide gzip compression for POST responses if requested

Pull Request

‘GET-with-body’ (so, POST) requests were not getting gzip-compressed responses even when they asked for them. This is now fixed.
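For example, a consumer selecting specific partitions (a POST request, as described in the March post below) can now also ask for compression with the standard Accept-Encoding header; the subscription ID and partition here are hypothetical:

POST {nakadi}/subscriptions/1234/events -H 'Accept-Encoding: gzip' -d '{"partitions": [{"event-type": "my-event-type", "partition": "0"}]}'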

 

And that’s it for May. If you would like to contribute to Nakadi, please feel free to browse the issues on GitHub, especially those marked with the “help wanted” tag. If you would like to implement a large new feature, please open an issue first to discuss it, so we can all agree on what it should look like. We very much welcome all sorts of contributions: not just code, but also documentation, help with the website, etc.

Last Month in Nakadi: April 2018

This is the third instalment in the series of blog posts, “Last Month in Nakadi”, where I try to give some more details on new features and bug fixes that were released over the previous month.

New URL

Nakadi now has its own domain name! You can check it out at https://nakadi.io

[Fixes] Don’t log a complete stack trace when a resource does not exist

Pull Request 1
Pull Request 2

These two fixes are very similar. We found that when a user tried to perform an operation on a resource that does not exist, Nakadi logged a complete stack trace. This is unnecessary, and can become a real problem if Nakadi processes a lot of requests for resources that don’t exist: disks may fill up. The first fix is for non-existing subscriptions, and the second for non-existing event types.

Now, we just log one line to describe the error.

[Fix] Event parsing during production

Pull Request

Until now, Nakadi parsed events for validation, then serialized them back to a byte stream for Kafka. The issue was that numbers could change between the user-provided representation and the one produced by the JSON library (e.g., 0.0 would become 0). This fix ensures that Nakadi does not modify users’ events, except for enrichment of course.
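As an illustration (the event type, field, and metadata values are all hypothetical), consider publishing a payload containing a float whose fractional part is zero:

POST {nakadi}/event-types/my-event-type/events -d '[{"metadata": {"eid": "829b3bf4-aa31-44a0-a3bb-c982a5c5bda8", "occurred_at": "2018-04-01T12:00:00Z"}, "price": 0.0}]'

Before the fix, a consumer could receive "price": 0, because the JSON library normalized the number when re-serializing the event. Now the bytes you publish are the bytes consumers read, enrichment aside.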

[New] Subscription status

Pull Request

It is now possible to get the status of each partition in a subscription from the subscriptions endpoint. This is useful for monitoring, so users know which partitions are assigned to which consumers, if any. With the ability to consume from specific partitions, introduced in March, this is even more valuable.

This is not really a new feature, as it was already available in the subscription’s /stats endpoint. However, this one is much faster, as it does not try to compute the number of unconsumed events – an expensive operation.

To use it, just set the `show_status` flag to `true` in your request to the subscriptions endpoint.
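A request and an illustrative response fragment might look like the following. I’m assuming `show_status` is a query parameter, and the field names below may differ from what your version of Nakadi actually returns:

GET {nakadi}/subscriptions?show_status=true

{
  "items": [
    {
      "id": "1234",
      "status": [
        {
          "event_type": "my-event-type",
          "partitions": [
            {"partition": "0", "state": "assigned", "assignment_type": "auto", "stream_id": "some-stream-id"}
          ]
        }
      ]
    }
  ]
}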

[Fix] Multiple bug fixes

Pull Request 1 
Pull Request 2
Pull Request 3

Finally, we released three bug fixes together with the subscription status feature. The first one fixes a bug that occurred when committing offsets for a new subscription under specific circumstances.

The second one fixes a bug that could occur when users tried to consume from busy event types with large values for `batch_limit` and `max_uncommitted_events`.

The third one is for consumers who reach the `stream_timeout` they have set: at that point, Nakadi now correctly flushes the events it has accumulated so far before closing the stream.

 

And that’s it for April. If you would like to contribute to Nakadi, please feel free to browse the issues on GitHub, especially those marked with the “help wanted” tag. If you would like to implement a large new feature, please open an issue first to discuss it, so we can all agree on what it should look like. We very much welcome all sorts of contributions: not just code, but also documentation, help with the website, etc.

Last Month in Nakadi: March 2018

This is the second instalment in the series of blog posts, “Last Month in Nakadi”, where I try to give some more details on new features and bug fixes that were released over the previous month.

March saw an important dependency update, as well as a new feature. The former is thanks to our colleague Peter Liske, who has been working on the issue for quite some time.

JSON-schema validation library now uses RE2/J for regex pattern matching

Peter alerted us to the problem, and fixed it upstream. It turns out that a well-crafted regular expression in a schema could become a regex bomb when used to evaluate even a simple string. Peter demonstrated how easy it would be to “kill” one instance of Nakadi for several minutes with a single message – and to kill a whole cluster by sending a sufficient number of messages.

The issue lies in Java’s default regex engine, which – like PCRE – uses backtracking and is therefore vulnerable to catastrophic backtracking. Peter swapped it for the RE2/J library, which guarantees linear-time matching, in the dependency we use for JSON-schema validation, and now Nakadi can survive evaluating even the nastiest of regular expressions.
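To see why this matters, consider a classic pathological pattern (illustrative – not the one Peter used) in a JSON-schema fragment:

{"type": "string", "pattern": "^(a+)+$"}

A backtracking engine matching this against a string of many a’s followed by a single b tries an exponential number of ways to split the a’s between the inner and outer repetitions before giving up; RE2/J rejects the same string in linear time.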

New feature: select which partitions to read from in a subscription

In February, we made the decision to deprecate the low-level API in Nakadi. It is currently still supported, but will be removed in the future. The subscription API did not cover one common use case that the low-level API provided: the ability for a consumer to choose which partitions to consume events from. For some users, it is important to make sure that all events in a given partition will be consumed by the same consumer. Perhaps they do some form of de-duplication, aggregation, or re-ordering, and such a feature makes their job a lot easier.

When we announced the deprecation of the low-level API, we promised to implement that feature in the subscriptions API, to allow users to migrate without issues. This is now done, and users can check out the relevant part of the documentation.

Here is a simple example of how it works. The “usual” way to consume from the subscription API is by creating a stream to get events from a subscription. Nakadi will automatically balance partitions between the consumers connected to the subscription, so that each partition is always connected to exactly one consumer. Given a subscription with ID 1234, it works like this:

GET {nakadi}/subscriptions/1234/events

Pretty simple. Now, if you want to choose which specific partitions to consume from, you need to send a “GET with body” (so, a POST) request and specify the partitions you want in the body. For example, to get partitions 0 and 1 from event type my-event-type, you would do something like this:

POST {nakadi}/subscriptions/1234/events -d '{"partitions": [{"event-type": "my-event-type", "partition": "0"},{"event-type": "my-event-type", "partition": "1"}]}'

Simple. And of course, you can have both types of consumers simultaneously consuming from the same subscription. In this case, the auto-balanced consumers will share the partitions that have not been specifically requested.

And that’s it for March. If you would like to contribute to Nakadi, please feel free to browse the issues on GitHub, especially those marked with the “help wanted” tag. If you would like to implement a large new feature, please open an issue first to discuss it, so we can all agree on what it should look like. We very much welcome all sorts of contributions: not just code, but also documentation, help with the website, etc.

Last Month in Nakadi: February 2018

I’m experimenting with a new series of posts, called “Last Month in Nakadi”. In the Nakadi project, we maintain a changelog that we update with each release. Each entry in the file is a one-line summary of a change, which alone is not always sufficient to understand what happened. There is still a fair amount of discussion and context that stays hidden inside Zalando, but we are working on changing that too.

Therefore, I will try, once a month, to provide some context on the changes we released the month before. I hope that users of Nakadi, and people interested in deploying their own Nakadi-based service, will find this summary useful. Let’s start, then, with what we released last month, February 2018.

2.5.7

Released on the 15th of February, this version includes one bug fix and one performance improvement.

Fix: Problem JSON for authorization issues

A user reported that Nakadi did not return a correct Problem JSON response when authorization failed. This is now fixed.
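For reference, a well-formed Problem JSON response for a failed authorization follows RFC 7807 and might look like this (the status line and detail message are illustrative):

HTTP/1.1 403 Forbidden
Content-Type: application/problem+json

{
  "type": "about:blank",
  "title": "Forbidden",
  "status": 403,
  "detail": "Access on read to subscription 1234 denied"
}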

Improvement: subscription rebalance

We found that, when rebalancing a subscription, Nakadi called Zookeeper several times, which is costly. This improvement reduces the number of Zookeeper calls during a rebalance, making rebalances faster.

2.5.8

Released on the 22nd of February, this version brings a new feature: the ability to grant a set of applications read access to all event types, overriding individual event types’ authorization policies, for archival purposes.

At Zalando, we maintain a data lake, where data is stored and made available to authorized users for analysis. One of the preferred ways to get data into the data lake is to push it to our deployment of Nakadi; events are then consumed by the data lake ingestion applications and saved there. Over time, we have noticed that event type owners, when setting or updating their event types’ authorization policies, would on occasion forget to whitelist the data lake applications, causing delays in data ingestion.

Another issue we noticed is that, should the data lake team use a different application to ingest data (they actually use several applications, working together), they would have to contact the owners of all event types from which data is ingested – that’s a lot of people, and a huge burden.

So, we decided to allow these applications to bypass the event types’ authorization policies, such that event type owners would not accidentally block the data lake’s read access. In a future release, we could add a way for the event type owner to indicate that they do not want their data ingested into the data lake.

We also added an optional Warning header, sent when an event type is created or updated. We use it to remind our users that their data may be archived, even if the archiving application is not whitelisted for their event type. You can choose the message you want – or no message at all.
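For example, using the standard HTTP warn-code 299 for miscellaneous persistent warnings, the response to an event type update might carry something like this (the exact wording is whatever you configure – this one is made up):

HTTP/1.1 200 OK
Warning: 299 nakadi "This event type may be archived to the data lake, even if the archiving applications are not listed in its authorization policy."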

And that’s it for February. If you would like to contribute to Nakadi, please feel free to browse the issues on GitHub, especially those marked with the “help wanted” tag. If you would like to implement a large new feature, please open an issue first to discuss it, so we can all agree on what it should look like. We very much welcome all sorts of contributions: not just code, but also documentation, help with the website, etc.
