Data Engineering Meetup, 4th Edition: 25 September 2018

The fourth edition of our data engineering meetup is in exactly two weeks’ time, and we have just put up the event program. If you’re in Berlin on the 25th of September, and interested in data engineering, register quickly on the meetup page. Last time, the event was full in only a few hours!

For this edition, you will get to hear about BI, serverless data ingestion, streaming platforms, notebooks, and more, with speakers from Ifesca, Valohai, and Zalando.

The talks will be recorded, and we will make them available online shortly after the meetup. You can check out the videos from the last meetup on our YouTube channel.

The meetup is an event organised by engineers, for engineers. We don’t do sales pitches, but we talk about tales from the trenches, the not-always-pretty reality of data engineering. Sometimes we rant. Sometimes we celebrate. We keep talks short, and leave plenty of time for questions and informal discussions. So, if you are interested in data engineering, don’t hesitate, join us!

We aim to organise this meetup quarterly. Would you like to talk at the next meetup? Get in touch, and show us what you’d like to talk about!

PS: I didn’t tell you this, but if the event is full, and you want to join, just show up at the door. Chances are, we’ll find a way to squeeze you in!

Last Month in Nakadi: July 2018

This is the sixth installment in the series of blog posts, “Last Month in Nakadi”, where I try to give some more details on new features and bug fixes that were released over the previous month.

[New] Log-compacted topics

Pull Request 1
Pull Request 2 
Pull Request 3

Nakadi now supports log-compacted topics. A feature long available in Kafka can now be used from Nakadi. In a nutshell, a log-compacted topic is a topic where events are published as key-value pairs. At some interval, Kafka compacts the topic, which means that, for each key, it only keeps the latest published value; messages with the same key, but published earlier, are discarded. The Kafka documentation has more details about this feature.
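To make the idea concrete, here is a minimal sketch in Python of what compaction does to a log of key-value pairs. It is only an illustration of the semantics described above, not how Kafka or Nakadi implement it: the `compact` function is a hypothetical name, and real compaction runs in the background, so older values may stay visible for a while.

```python
from typing import Dict, Iterable, List, Tuple


def compact(log: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only the latest value for each key (illustrative only)."""
    entries = list(log)
    # Remember the position of the last entry seen for every key.
    last_index: Dict[str, int] = {}
    for index, (key, _value) in enumerate(entries):
        last_index[key] = index
    # An entry survives only if it is the last one published for its key.
    return [
        entry
        for index, entry in enumerate(entries)
        if last_index[entry[0]] == index
    ]


log = [("user-1", "a"), ("user-2", "b"), ("user-1", "c")]
print(compact(log))  # [('user-2', 'b'), ('user-1', 'c')]
```

The earlier value for `user-1` is discarded, while the surviving entries keep their relative order in the log.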

How to use it in Nakadi? Simply set the cleanup policy to ‘compact’ when creating the event type. It is not possible to convert a “classic” event type to a log-compacted one, because the events in the “classic” event type do not have a key.
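As a rough sketch, the body of the request that creates such an event type could be built like this in Python. Only the ‘compact’ cleanup policy comes from the text above; the `cleanup_policy` field name, the other fields, and the example values are my assumptions, so please check them against the Nakadi API definition before relying on them.

```python
import json


def compacted_event_type(name: str, owning_application: str, schema: dict) -> dict:
    """Build an event type definition with a 'compact' cleanup policy.

    Field names are assumptions for illustration; verify them against
    the Nakadi API definition.
    """
    return {
        "name": name,
        "owning_application": owning_application,
        "category": "business",  # assumed category, pick what fits
        "cleanup_policy": "compact",  # the setting this post is about
        "schema": {
            "type": "json_schema",
            "schema": json.dumps(schema),  # schema passed as a JSON string
        },
    }


body = compacted_event_type(
    "example.user-profile",      # hypothetical event type name
    "example-application",       # hypothetical application name
    {"type": "object", "properties": {"user_id": {"type": "string"}}},
)
print(json.dumps(body, indent=2))
```

The resulting dictionary would then be sent to Nakadi when creating the event type.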

Log-compacted topics currently do not support timelines, and it is possible to specify a different Kafka storage for them. There is also a feature flag to turn the creation of log-compacted event types on or off.

[Updated] Kafka client 1.1.1

Pull Request

We have updated the Kafka client to the latest version, 1.1.1. A few days later, Kafka 2.0.0 was released, so this is no longer the latest version, but at least it is a recent one. We also updated the version of Kafka used in the docker-compose file, when running Nakadi for development and acceptance tests.

[Fixed] Audience field

Pull Request

The audience field was accepting values with_underscores, while the requirement from the architects was to accept values with-hyphens instead. This is to be consistent across software developed by Zalando.
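If you have client code that still uses the old spelling, a small helper like the following Python sketch could convert it. The function names and the validation pattern are hypothetical, and the pattern is only my guess at the general shape of a hyphenated value, not the list of values that Nakadi actually accepts.

```python
import re

# Assumed shape of a hyphenated audience value: lowercase words joined by hyphens.
HYPHENATED = re.compile(r"^[a-z]+(-[a-z]+)*$")


def hyphenate_audience(value: str) -> str:
    """Convert an underscore-style audience value to the hyphenated style."""
    return value.replace("_", "-")


def looks_hyphenated(value: str) -> bool:
    """Check that a value uses hyphens rather than underscores."""
    return HYPHENATED.match(value) is not None


print(hyphenate_audience("component_internal"))  # component-internal
print(looks_hyphenated("external_partner"))      # False
```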

[Removed] Remove legacy feature toggles

Pull Request

This is the first PR opened by our new teammate, Suyash (welcome!). He did a bit of cleaning up on the code base, and removed unused feature toggles. Now the code looks a little bit nicer, and is easier to read!

And that’s it for July. If you would like to contribute to Nakadi, please feel free to browse the issues on GitHub, especially those marked with the “help wanted” tag. If you would like to implement a large new feature, please open an issue first to discuss it, so we can all agree on what it should look like. We very much welcome all sorts of contributions: not just code, but also documentation, help with the website, etc.

Call for Submissions: Berlin Data Engineering Meetup, 25 September 2018

Photo by Kane Reinholdtsen on Unsplash

The Berlin Data Engineering Meetup is a quarterly meetup organised by a few crazy people from Zalando’s data services department. The meetup is a venue for engineers to present their ideas, exchange best practices, and candidly talk about failures, accidents, and other catastrophes. What brings people together is their interest in all things data engineering: streaming platforms, machine learning, databases, storage formats, stream processing, etc.

For the next edition of the meetup, on the 25th of September 2018, we invite speakers of all levels of expertise to submit talks on any aspect of data engineering. Talks should be 20 minutes long, with a few extra minutes for questions. Topics of interest include, but are not limited to:

  • Stream processing
  • Data lakes
  • CQRS
  • Machine learning
  • Databases and data stores
  • Data formats
  • Data quality
  • BI
  • Infrastructure
  • Microservices
  • Access control
  • Data visualisation
  • When things go wrong

We favour talks where the speakers share their experiences, good and bad, talk about what they learned, and all the things you typically don’t find in the documentation.

How to submit

Send an email to [email protected] with the following information:

  • Name
  • Affiliation
  • Title of talk
  • Short abstract (up to 150 words)
  • Short bio (up to 100 words)

Important dates

  • 10 August 2018: submission deadline
  • 17 August 2018: speaker notification
  • 24 August 2018: schedule publication
  • 25 September 2018: meetup!

Location

The Hub @ Zalando
Tamara-Danz-Str. 1
10243 Berlin

Last Month in Nakadi: June 2018

This is the fifth installment in the series of blog posts, “Last Month in Nakadi”, where I try to give some more details on new features and bug fixes that were released over the previous month.

[Changed] Reduced logging

Pull Request

Nakadi logs a lot of stuff. It’s very useful, but also comes with a cost. Recently, we were looking at our logs, and noticed that our SLO logging accounts for a large percentage of them. So we combined it with our access log, which significantly reduced the number of lines logged.

[Changed] Stricter JSON parser

Pull Request 1

Pull Request 2

While working on an internal service that consumes data from Nakadi, we noticed that the JSON parser in Nakadi was a bit too lax, and allowed event producers to publish incorrect JSON, which our new service’s parser could not parse. In this change, we implemented a new parser, which is both stricter and more efficient. So, Nakadi performs a little bit better, and consumers have stronger guarantees regarding the events they consume.

In order to avoid breaking existing producers, we released this feature in two parts. The first pull request logs events that are accepted by the old parser but not by the new one. We ran this version for a few days and got in touch with affected producers. Once we were satisfied that no producers would be affected, we released the second pull request, which only uses the new, stricter parser.
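The first phase of such a rollout can be sketched in a few lines of Python. This is not Nakadi’s code (Nakadi is not written in Python, and its parsers differ): here the “lax” parser is the standard `json.loads`, which accepts the non-standard constant `NaN`, and the “strict” one is the same function configured to reject it. All function names are hypothetical.

```python
import json
from typing import Any, List, Tuple


def lax_parse(raw: str) -> Any:
    """Stand-in for the old parser: accepts non-standard constants like NaN."""
    return json.loads(raw)


def _reject_constant(name: str) -> Any:
    raise ValueError(f"non-standard JSON constant: {name}")


def strict_parse(raw: str) -> Any:
    """Stand-in for the new parser: rejects NaN, Infinity and -Infinity."""
    return json.loads(raw, parse_constant=_reject_constant)


def shadow_parse(raw: str, mismatches: List[Tuple[str, str]]) -> Any:
    """Phase one: keep accepting with the old parser, record what the new one rejects."""
    event = lax_parse(raw)  # still the parser that decides acceptance
    try:
        strict_parse(raw)
    except ValueError as error:
        # In a real service this would be a log line pointing at the producer.
        mismatches.append((raw, str(error)))
    return event


mismatches: List[Tuple[str, str]] = []
shadow_parse('{"price": 1}', mismatches)
shadow_parse('{"price": NaN}', mismatches)
print(len(mismatches))  # 1
```

In the second phase, `shadow_parse` would simply be replaced by `strict_parse`, once the list of mismatches stays empty.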

[New] Feature flag to allow the deletion of event types with subscriptions

Pull Request

So far, deleting an event type has only been possible if there were no subscriptions for this event type. The reason behind this is to make sure that no consumers are taken by surprise when an event type is deleted. In our staging deployment, we found that it can sometimes cause issues and delays, especially when consuming applications are configured to automatically re-create subscriptions when they are deleted.

Now, there is a new feature flag, XXX, that allows users to delete event types that have subscriptions attached to them. Note that this will also delete the subscriptions, so we do not recommend turning this feature on for production systems – accidents happen!

[Removed] No more feature flag for subscriptions API

Pull Request

A few months ago, we announced that the low-level consumption API was now considered deprecated, and that clients should not use it anymore. We have not yet set a deadline for its removal from the code, but it will come eventually. To consume events from Nakadi, clients should now use the subscription API, also known as HiLA (High-Level API). This API could be turned on and off with a feature toggle, but since it is now the only supported one, the toggle doesn’t make much sense anymore. So we removed it, in order to make the code a little easier to read and maintain.

[New] New attributes

Pull Request 1

Pull Request 2

These two pull requests add optional attributes to an event type, for consistency with Zalando’s REST API guidelines.

The first one is ordering_key_fields, which can be used when the order of events across multiple partitions is important. Nakadi only guarantees ordering within partitions, so this can be used by consumers to re-order events once they have been consumed.
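As an illustration, a consumer could re-order a batch of events along these lines. This Python sketch assumes that each entry in ordering_key_fields is a dot-separated path into the event, and that the values found there are comparable; the event layout and the helper names are made up for the example.

```python
from functools import reduce
from typing import Any, Dict, List


def field_at(event: Dict[str, Any], path: str) -> Any:
    """Follow a dot-separated path (assumed format) into a nested event."""
    return reduce(lambda node, key: node[key], path.split("."), event)


def reorder(events: List[Dict[str, Any]], ordering_key_fields: List[str]) -> List[Dict[str, Any]]:
    """Sort consumed events by the values of their ordering key fields."""
    return sorted(
        events,
        key=lambda event: tuple(field_at(event, path) for path in ordering_key_fields),
    )


# Hypothetical events consumed from two partitions, out of order.
events = [
    {"order": {"id": "A", "version": 2}},
    {"order": {"id": "A", "version": 1}},
]
print(reorder(events, ["order.version"]))
```

Since Python’s sort is stable, events with equal ordering keys keep the order in which they were consumed.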

The second one is audience, which determines the target of an event type. The target is a class of consumers, such as component_internal, external_partner, or others.

Both attributes are optional, and Nakadi does not perform any kind of enforcement as a result of these being set – they are purely informational.

[Fix] Latest_available_offsets

Pull Request

There was a bug in the subscription stats endpoint. In some rare cases, it could look like there were unconsumed events in a subscription, but that was not actually the case. No events would be lost, but monitoring systems would report unconsumed events – which couldn’t be consumed, since they didn’t actually exist.

And that’s it for June. If you would like to contribute to Nakadi, please feel free to browse the issues on GitHub, especially those marked with the “help wanted” tag. If you would like to implement a large new feature, please open an issue first to discuss it, so we can all agree on what it should look like. We very much welcome all sorts of contributions: not just code, but also documentation, help with the website, etc.

Open Sourcing Nakadi-UI

Almost 2 years ago, my colleague Sergii Kamenskyi started working on a web UI for Nakadi. So far it has been used internally at Zalando, providing our users with an easy way to find out about the data that flows through Nakadi. Last Friday, after getting approval from our open source team, Sergii released nakadi-ui with an open source license, and anyone who deploys Nakadi can now deploy the web UI as well.

Nakadi-ui is written in Elm, a functional language for web development, which I learned by reviewing some of Sergii’s pull requests. As far as I know, nakadi-ui is one of the largest open source codebases in Elm so far. And pretty much all of it was written by Sergii alone!

Nakadi-ui allows users to create and browse event types and subscriptions. They can see the details of an event type, such as the schema, retention period, authorization policy, and more. They can also get a list of producers and consumers, and even inspect the events in the event type. They can make changes to the event types they are allowed to edit, and delete them if necessary. For subscriptions, users can get the details of the subscriptions, as well as statistics about the number of unconsumed events, or the lag for each position. As we add new features to Nakadi, Sergii keeps improving nakadi-ui, so you can expect exciting new things coming soon!

Users of Nakadi tell us that the UI makes it much easier to use Nakadi and monitor their event types and subscriptions. Engineers use it to debug issues and recover from incidents; users who do not have a strong technical background also use it, to inspect event types and find out who is consuming which data; and operators of Nakadi, such as myself, use it to help troubleshoot users’ issues, test new features, or keep tabs on the system health.

You can get the code on GitHub. Bug reports, feature requests, and of course, pull requests, are very welcome to help us make nakadi-ui even better. As per the rules at Zalando, nakadi-ui is currently in the Zalando incubator. We will spend time, together with Zalando’s open source team, to build a community around it. We hope that nakadi-ui will soon graduate to a “proper” open source project!

Last Month in Nakadi: May 2018

This is the fourth installment in the series of blog posts, “Last Month in Nakadi”, where I try to give some more details on new features and bug fixes that were released over the previous month.

[New] Admins can set unlimited retention time

Pull Request

Every user can set and change the retention time of their event types very easily, up to the maximum retention time set by the Nakadi administrators. From now on, Nakadi administrators can bypass this limitation, and set an arbitrary retention time for event types.

This is useful when a user suffers an incident that requires them to re-consume data after they fix their software. Sometimes, releasing a fix can take a while, and users can ask administrators to increase the retention time temporarily, to avoid data loss.

[Fix] Improve performance of listing subscriptions with status

Pull Request

Last month, I talked about the new flag in the subscriptions endpoint that gives the status of each subscription. We have improved the performance of this endpoint by quite a lot, and now the response always comes immediately.

[New] Extend subscription statistics with time-lag information

Pull Request

The subscription stats endpoint (/subscriptions/{subscription_id}/stats) can return a new, optional field: the time lag. The time lag is the time in seconds between when the first unconsumed event in a partition was produced to Nakadi, and the time the request to the stats endpoint is made. To get the time lag, just set `show_time_lag` to `true` in your stats request.
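For example, the URL of such a request could be put together as in this short Python sketch. The endpoint path and the `show_time_lag` parameter come from the description above; the base URL and the subscription ID are placeholders, and `stats_url` is a hypothetical helper.

```python
from urllib.parse import quote, urlencode


def stats_url(base_url: str, subscription_id: str) -> str:
    """Build the stats URL for a subscription, asking for the time lag."""
    query = urlencode({"show_time_lag": "true"})
    path = f"/subscriptions/{quote(subscription_id, safe='')}/stats"
    return f"{base_url.rstrip('/')}{path}?{query}"


# Placeholder host and subscription ID.
print(stats_url("https://nakadi.example.org", "my-subscription-id"))
# https://nakadi.example.org/subscriptions/my-subscription-id/stats?show_time_lag=true
```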

This is really useful for monitoring subscriptions: you can tell how far behind you currently are. If the time lag keeps increasing for some time, you are probably not keeping up with the rate at which events are produced.
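A monitoring check based on this idea could look like the following Python sketch. It assumes you already collect the time lag (in seconds) at regular intervals; the function name and the threshold of three samples are arbitrary choices for the example, not something Nakadi provides.

```python
from typing import Sequence


def is_falling_behind(lag_samples: Sequence[float], min_samples: int = 3) -> bool:
    """Return True if the lag strictly increased over the last `min_samples` readings."""
    recent = lag_samples[-min_samples:]
    if len(recent) < min_samples:
        return False  # not enough data to call it a trend
    return all(earlier < later for earlier, later in zip(recent, recent[1:]))


print(is_falling_behind([5, 10, 20]))  # True
print(is_falling_behind([5, 10, 8]))   # False
```

A single high reading is not a problem in itself; it is the sustained increase that suggests the consumer cannot keep up.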

[Fix] Provide gzip compression for POST responses if requested

Pull Request

‘GET-with-body’ (so, POST) queries weren’t getting gzip responses if they requested it. This is now fixed.
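On the client side, requesting and handling a compressed response could be sketched as follows in Python, using only the standard library. The URL and the request body are placeholders, and the helper names are hypothetical; the sketch builds the request without sending it.

```python
import gzip
import json
import urllib.request
from typing import Optional


def build_request(url: str, body: dict) -> urllib.request.Request:
    """Build a POST request that asks the server for a gzip-compressed response."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Accept-Encoding": "gzip"},
        method="POST",
    )


def decode_body(raw: bytes, content_encoding: Optional[str]) -> bytes:
    """Decompress the response body if the server says it is gzip-encoded."""
    if content_encoding and content_encoding.lower() == "gzip":
        return gzip.decompress(raw)
    return raw


# Placeholder URL and body; nothing is sent over the network here.
request = build_request("https://nakadi.example.org/some-endpoint", {"example": True})
print(request.get_method())  # POST
print(decode_body(gzip.compress(b'{"ok": true}'), "gzip"))  # b'{"ok": true}'
```

Checking the `Content-Encoding` header of the response, rather than assuming compression, keeps the client working against servers that ignore the request header.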

And that’s it for May. If you would like to contribute to Nakadi, please feel free to browse the issues on GitHub, especially those marked with the “help wanted” tag. If you would like to implement a large new feature, please open an issue first to discuss it, so we can all agree on what it should look like. We very much welcome all sorts of contributions: not just code, but also documentation, help with the website, etc.

Data Engineering Meetup, 3rd Edition: Data Engineering for AI, 26 June 2018

The third edition of our data engineering meetup is in just about a month, and we have just put up the event program. If you’re in Berlin on the 26th of June, and interested in data engineering, register quickly on the meetup page. Last time, the event was full in only a few hours!

The theme for this edition is ‘data engineering for AI’, and I really look forward to listening to what our speakers have to say on the subject.

For this third edition, we are getting bolder: we invited speakers from outside of Zalando to talk alongside our colleagues, and we hope that the audience will appreciate the variety of views and approaches that the speakers will take.

If you join us, you will get to hear:

  • Our very own VP of engineering, Eric Bowman, will give the keynote talk
  • Kai Wehner, from Confluent, will talk about
  • Fabian Hüske, from data Artisans, will present SQL using Flink
  • Georg Hildebrand, from Zalando, will discuss asset management for machine learning
  • Sebastian Bolz and Maik Goetze, from Scout24, will tell us about how they predict vehicle and property prices

The meetup is an event organised by engineers, for engineers. We don’t do sales pitches, but we talk about tales from the trenches, the not-always-pretty reality of data engineering. Sometimes we rant. Sometimes we celebrate. We keep talks short, and leave plenty of time for questions and informal discussions. So, if you are interested in data engineering, don’t hesitate, join us!

We aim to organise this meetup quarterly. We don’t have the exact date for the next edition yet, but expect it to be towards the end of September. Would you like to talk at the next meetup? Get in touch, and show us what you’d like to talk about!

PS: I didn’t tell you this, but if the event is full, and you want to join, just show up at the door. Chances are, we’ll find a way to squeeze you in!