Performant Streaming in Production: Preventing Common Pitfalls when Productionizing Streaming Jobs

Running a stream in a development environment is relatively easy. In production, however, a handful of overlooked details can cause serious problems. In this presentation we cover four of them. The first topic considers what happens when the input parameters of your stream are not properly configured: the stream can suddenly have to process far more data than anticipated, which considerably degrades performance.

The second topic covers stateful streaming parameters and the consequences of not tuning them correctly: state can accumulate without bound, which is another source of degraded performance as well as memory issues. In the third topic we discuss Structured Streaming output parameters. Left unaddressed, these can lead to a severe case of the small files problem. In the final topic, we cover what to consider when you want to modify a streaming job that is already in production and relies on checkpoints. We will provide practical, hands-on examples of when the aforementioned issues manifest and how to prevent them from occurring in your production streams. By the end of the talk you will know what to look out for when designing performant and fault-tolerant streams.

Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.