Last Month in Nakadi: March 2018

This is the second instalment in the series of blog posts, “Last Month in Nakadi”, where I try to give some more details on new features and bug fixes that were released over the previous month.

March saw an important dependency update, as well as a new feature. The former is thanks to our colleague Peter Liske, who has been working on the issue for quite some time.

JSON-schema validation library now uses RE2/J for regex pattern matching

Peter alerted us to the problem and fixed it upstream. It turns out that a well-crafted regular expression in a schema could become a regex bomb when used to evaluate even a simple string. Peter demonstrated how easy it would be to “kill” one instance of Nakadi for several minutes with a single message, and to kill a whole cluster by sending a sufficient number of messages.

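The issue is with the default regex engine in Java (java.util.regex), a backtracking matcher whose running time can grow exponentially on certain patterns. In the dependency we use for json-schema validation, Peter swapped it for the RE2/J library, which matches in time linear in the size of the input. Nakadi can now survive evaluating even the nastiest of regular expressions.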
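To get a feel for how quickly a backtracking engine degrades, here is a minimal sketch in Python, whose built-in re module also backtracks. The pattern, the input lengths and the helper name time_failed_match are illustrative assumptions, not what Peter used against Nakadi.

```python
import re
import time

# Nested quantifiers: a classic shape that makes a backtracking engine
# try exponentially many ways of splitting the run of "a"s.
# (Illustrative pattern, not the one used against Nakadi.)
EVIL_PATTERN = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Return the seconds spent failing to match n 'a's followed by one 'b'."""
    text = "a" * n + "b"  # the trailing "b" guarantees that the match fails
    start = time.perf_counter()
    result = EVIL_PATTERN.match(text)
    elapsed = time.perf_counter() - start
    assert result is None
    return elapsed


if __name__ == "__main__":
    # Keep n small: each extra character roughly doubles the work.
    for n in (14, 16, 18, 20):
        print(f"n={n:2d}  {time_failed_match(n):.4f}s")
```

A linear-time engine such as RE2/J avoids this behaviour by not backtracking at all, at the cost of not supporting features like backreferences.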

New feature: select which partitions to read from in a subscription

In February, we made the decision to deprecate the low-level API in Nakadi. It is currently still supported, but will be removed in the future. The subscription API did not cover one common use case that the low-level API provided: the ability for a consumer to choose which partitions to consume events from. For some users, it is important to make sure that all events in a given partition will be consumed by the same consumer. Perhaps they do some form of de-duplication, aggregation, or re-ordering, and such a feature makes their job a lot easier.

When we announced the deprecation of the low-level API, we promised to implement that feature in the subscriptions API, to allow users to migrate without issues. This is now done, and users can check out the relevant part of the documentation.

Here is a simple example of how it works. The “usual” way to consume from the subscription API is by creating a stream to get events from a subscription. Nakadi will automatically balance partitions between the consumers connected to the subscription, so that each partition is always connected to exactly one consumer. Given a subscription with ID 1234, it works like this:

GET {nakadi}/subscriptions/1234/events

Pretty simple. Now, if you want to choose which partitions to consume from, you need to send a “GET with body” (so, a POST) request, and list the partitions you want in the body. For example, if you want to get partitions 0 and 1 from event type my-event-type, you would do something like this:

POST {nakadi}/subscriptions/1234/events -d '{"partitions": [{"event-type": "my-event-type", "partition": "0"},{"event-type": "my-event-type", "partition": "1"}]}'
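For readers who prefer code to raw requests, the two calls can be sketched in Python as follows. This is only a sketch under assumptions: the base URL and the helper name build_stream_request are hypothetical, the body field names are copied from the example above, and the request is built but never sent.

```python
import json
import urllib.request

# Hypothetical base URL; replace it with your own Nakadi endpoint.
NAKADI_URL = "https://nakadi.example.org"


def build_stream_request(subscription_id, partitions=None):
    """Build (without sending) a request that opens an event stream.

    partitions: optional list of (event_type, partition) pairs to read from.
    """
    url = f"{NAKADI_URL}/subscriptions/{subscription_id}/events"
    if not partitions:
        # No explicit partitions: plain GET, Nakadi balances them for us.
        return urllib.request.Request(url, method="GET")
    # Field names are copied from the example request above.
    body = {
        "partitions": [
            {"event-type": event_type, "partition": partition}
            for event_type, partition in partitions
        ]
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_stream_request(
        "1234", [("my-event-type", "0"), ("my-event-type", "1")]
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

A real client would most likely also need authentication headers, and would then read the streaming response from the open connection.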

Simple. And of course, you can have both types of consumers simultaneously consuming from the same subscription. In this case, the automatically balanced consumers will share the partitions that have not been requested specifically.
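The sharing described above can be pictured with a toy model. To be clear, this is not Nakadi's actual balancing algorithm: the function name assign_partitions, the consumer names and the round-robin strategy are all assumptions made for illustration.

```python
def assign_partitions(all_partitions, direct_requests, balanced_consumers):
    """Toy model of mixed consumption (not Nakadi's actual algorithm).

    all_partitions: list of partition ids in the subscription
    direct_requests: dict of consumer -> partitions it asked for explicitly
    balanced_consumers: list of consumers that did not ask for any partition
    """
    # Consumers that named their partitions get exactly those partitions.
    assignment = {c: list(ps) for c, ps in direct_requests.items()}
    requested = {p for ps in direct_requests.values() for p in ps}
    leftover = [p for p in all_partitions if p not in requested]
    for consumer in balanced_consumers:
        assignment[consumer] = []
    # Round-robin over the leftovers is an assumption chosen for simplicity.
    for i, partition in enumerate(leftover):
        if balanced_consumers:
            consumer = balanced_consumers[i % len(balanced_consumers)]
            assignment[consumer].append(partition)
    return assignment


if __name__ == "__main__":
    print(
        assign_partitions(
            ["0", "1", "2", "3"],
            {"direct": ["0", "1"]},
            ["auto-a", "auto-b"],
        )
    )
```

In this model the consumer that asked for partitions 0 and 1 keeps them, while the two other consumers split partitions 2 and 3 between themselves.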

And that’s it for March. If you would like to contribute to Nakadi, please feel free to browse the issues on GitHub, especially those marked with the “help wanted” tag. If you would like to implement a large new feature, please open an issue first to discuss it, so we can all agree on what it should look like. We very much welcome all sorts of contributions: not just code, but also documentation, help with the website, etc.