Just when you thought Apache Kafka would dominate the world of streaming data forever, an alternative technology has stepped back into the limelight: Apache Flink! The big question: Why?!

The short answer is that the data industry has been shifting away from traditional batch processing toward real-time analytics, and Flink happens to be one of the best tools for handling true streaming data. A longer answer could go back to the earliest days of DM Radio, some 17 years ago, when I’d urge the data industry to “stop the madness” of constantly ETL’ing data around and around the enterprise, day after day after day.

Unlike Apache Spark Streaming, which still relies on micro-batching, Flink was designed from the ground up for low-latency, high-throughput stream processing. As businesses look to get instant insights from their data—whether it’s for fraud detection, AI-powered recommendations, or IoT telemetry—Flink is emerging as the go-to framework.

One of the biggest drivers behind Flink’s resurgence is the rise of AI and machine learning pipelines. Real-time data is becoming essential for training AI models, feeding feature stores, and improving inference accuracy. Flink’s stateful processing capabilities make it useful for these applications, especially when working with complex event-driven architectures. Companies like Netflix, Uber, and Alibaba have been leveraging Flink for years to power real-time personalization, dynamic pricing, and anomaly detection.

Another big factor is the explosive growth of cloud-native architectures. Today, many organizations are moving their data infrastructure to Kubernetes and fully managed cloud services. Flink fits right into this trend, thanks to its scalability, fault tolerance, and native support for Kubernetes deployments. Cloud providers like AWS, Google Cloud, and Alibaba Cloud have embraced Flink, integrating it into their streaming data services to make it easier for companies to adopt.

While Spark has dominated the data processing world for a decade or more, Flink has some clear advantages—especially when it comes to event-time processing, stateful computation, and exactly-once guarantees. These features are valuable in industries like financial services, ad tech, and cybersecurity, where processing data in real time isn’t just a luxury—it’s a necessity for compliance and risk management.

Beyond just the technical capabilities, Flink’s open-source ecosystem is booming. The Apache Flink community has been growing, with significant contributions from Alibaba, Confluent, AWS, and more. The introduction of Flink SQL and new connectors for Kafka, Pulsar, and Iceberg has made it even more accessible, allowing developers to build powerful streaming applications without diving deep into Java or Scala.
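To give a flavor of that accessibility, here is a rough sketch of what a streaming pipeline looks like in Flink SQL; the table, topic, broker address, and schema are hypothetical placeholders rather than anything from a real deployment:

```sql
-- A Kafka-backed table with an event-time column and a watermark.
CREATE TABLE orders (
  order_id STRING,
  amount   DOUBLE,
  ts       TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH (
  'connector' = 'kafka',
  'topic'     = 'orders',
  'properties.bootstrap.servers' = 'broker:9092',
  'format'    = 'json'
);

-- A continuous query: one-minute tumbling-window revenue, updated as events arrive.
SELECT window_start, window_end, SUM(amount) AS revenue
FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(ts), INTERVAL '1' MINUTES))
GROUP BY window_start, window_end;
```

The query runs continuously, refreshing its windowed totals as new events land on the topic, without a line of Java or Scala in sight.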

What we’re seeing now is a perfect storm of trends that make Flink more relevant than ever. Businesses want real-time insights, AI needs continuous data processing, and cloud-native architectures demand scalable, fault-tolerant solutions. Flink checks all those boxes, which is arguably why it’s making a comeback.

Mastering Flink

It’s important to appreciate that Flink isn’t just another data processing framework; it’s a whole different animal, a streaming-first engine that scales. Architecturally, it has layers, which adds complexity but also provides great latitude for optimization.

However, you don’t just wake up one day and master Flink. You level up. The deeper you go, the more you realize there’s always another nuance, another optimization, another trick that separates the rookies from the veterans. Let’s break it down, step by step.

At Level 0, you’re just getting your feet wet. You’ve fired up a Flink job using the CLI or the Flink Dashboard, and congratulations—you’re officially in the game. It might be as simple as a word count job, but that first step matters. The beauty of Flink is that it gives you an easy entry point, but don’t be fooled—this rabbit hole goes deep. The real question is, are you ready to go further?
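For the curious, that first word count job can be as small as the following sketch against the DataStream API. The socket source on localhost:9999 is just a convenient placeholder (pair it with something like nc -lk 9999), and a standard Flink distribution also ships a prebuilt WordCount example you can submit straight from the CLI.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)   // placeholder source: lines typed into a socket
            .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                @Override
                public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                    for (String word : line.toLowerCase().split("\\W+")) {
                        if (!word.isEmpty()) {
                            out.collect(Tuple2.of(word, 1));   // emit (word, 1) for every word
                        }
                    }
                }
            })
            .keyBy(t -> t.f0)   // group the stream by word
            .sum(1)             // keep a running count per word
            .print();

        env.execute("Streaming WordCount");
    }
}
```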

Level 1 is where you grasp the core execution model. Flink blurs the line between streaming and batch, and if you don’t understand that, you’re already behind. You need to get familiar with its fundamental operators—Map, Filter, KeyBy, Window, Join, Process—because they’re your bread and butter. Then there are time semantics: Event Time vs Processing Time, plus how Watermarks handle late data. And let’s not forget Checkpoints and State—because if you don’t know how Flink guarantees exactly-once semantics, you’re playing with fire.
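As a hedged illustration of how those pieces fit together, here is a compact sketch: a keyed, event-time tumbling window with a bounded-out-of-orderness watermark and exactly-once checkpointing. The SensorReading type and its fields are hypothetical, and the in-memory source is just a stand-in for something like a Kafka consumer.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SensorWindows {

    // Hypothetical event type, used only for illustration.
    public static class SensorReading {
        public String sensorId = "sensor-1";
        public long timestampMillis = System.currentTimeMillis();
        public double temperature = 20.0;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Periodic checkpoints are the mechanism behind Flink's exactly-once guarantees.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        env.fromElements(new SensorReading())   // placeholder source; swap in Kafka or similar
            // Event time: pull the timestamp out of each record and tolerate events up to 10s late.
            .assignTimestampsAndWatermarks(
                WatermarkStrategy.<SensorReading>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                    .withTimestampAssigner((reading, ts) -> reading.timestampMillis))
            .keyBy(reading -> reading.sensorId)                      // one logical stream per sensor
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))    // 1-minute event-time windows
            .max("temperature")                                      // hottest reading per window
            .print();

        env.execute("Sensor Windows");
    }
}
```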

Now comes the real fun: Level 2, optimizing Flink jobs. State management is everything, and choosing between the RocksDB, heap, or filesystem backend makes a real difference. Shuffling data is expensive, so minimize unnecessary movement through careful use of KeyBy and partitioning. Timers and Process Functions help fine-tune event processing, while Broadcast State lets you share configurations efficiently. And let's talk about Async I/O: when external system calls are slowing you down, this is how you crank up the throughput.
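Async I/O is worth seeing in code. The sketch below shows the general shape of the pattern, with a fake lookup standing in for a real asynchronous HTTP or database client; the names, timeout, and capacity are illustrative.

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncEnrichmentJob {

    // Enrich each user id without blocking the task thread while the external call is in flight.
    public static class ProfileLookup extends RichAsyncFunction<String, String> {
        @Override
        public void asyncInvoke(String userId, ResultFuture<String> resultFuture) {
            CompletableFuture
                .supplyAsync(() -> "profile-for-" + userId)   // stand-in for an async HTTP/DB client
                .thenAccept(profile -> resultFuture.complete(Collections.singletonList(profile)));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> userIds = env.fromElements("u1", "u2", "u3");   // placeholder source

        // unorderedWait trades strict output ordering for throughput;
        // the 1s timeout and 100 in-flight requests are illustrative knobs.
        AsyncDataStream
            .unorderedWait(userIds, new ProfileLookup(), 1000, TimeUnit.MILLISECONDS, 100)
            .print();

        env.execute("Async Enrichment");
    }
}
```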

Once your jobs are running smoothly, Level 3 is where you tune for performance at scale. Set Parallelism and Task Slots wisely, because poor resource allocation will crush your efficiency. Incremental Checkpoints reduce checkpointing overhead, keeping state snapshots lightweight. Heap vs Off-Heap memory? Get it wrong, and you’ll be drowning in OutOfMemoryErrors. RocksDB tuning becomes essential for handling state storage, and backpressure management prevents slow consumers from dragging down the pipeline.
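A few of those knobs can be set right on the execution environment. The following is a rough sketch, not a tuning recommendation: it assumes the flink-statebackend-rocksdb dependency is on the classpath (package names for the RocksDB backend have moved between Flink versions), and the checkpoint path is a placeholder.

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TunedEnvironment {
    public static StreamExecutionEnvironment create() {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.setParallelism(8);             // match this to the task slots you actually have
        env.enableCheckpointing(60_000);   // snapshot state every 60s

        // RocksDB keeps large state off-heap; "true" turns on incremental checkpoints,
        // so only changed files are uploaded instead of the full state every time.
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));

        // Durable checkpoint storage; the path is a placeholder.
        env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/flink/checkpoints");

        // Give the job breathing room between checkpoints so backpressure can drain.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);

        return env;
    }
}
```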

Finally, Level 4 takes you deep into cluster resource management. Flink on Kubernetes, YARN, or Standalone? Each has tradeoffs, and you need to know them. Memory tuning is critical: the JobManager and TaskManagers have distinct requirements, and settings like the taskmanager.memory.network.* options need careful configuration.
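In configuration terms, that usually means a handful of entries in flink-conf.yaml along these lines. The sizes and fractions below are placeholders for illustration, not recommendations, and exact key names have drifted slightly across Flink versions.

```yaml
jobmanager.memory.process.size: 2g
taskmanager.memory.process.size: 8g
taskmanager.numberOfTaskSlots: 4

# Managed memory feeds RocksDB and batch operators; network memory feeds shuffle buffers.
taskmanager.memory.managed.fraction: 0.4
taskmanager.memory.network.fraction: 0.1
taskmanager.memory.network.min: 64mb
taskmanager.memory.network.max: 1gb
```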

Restart strategies matter: fixed-delay or exponential-delay can be the difference between resilience and failure. And then there's state migration: Savepoints ensure you can upgrade jobs without losing progress.
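Concretely, the restart strategy is usually declared in configuration, while savepoints are driven from the CLI. Both snippets below are hedged sketches; the job ID, jar name, and paths are placeholders, and key names vary a bit across Flink versions.

```yaml
# Retry a failed job 3 times, 10 seconds apart, before giving up.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

```bash
# Take a savepoint of a running job, then resume an upgraded jar from it.
./bin/flink savepoint <jobId> s3://my-bucket/flink/savepoints
./bin/flink run -s s3://my-bucket/flink/savepoints/<savepointDir> new-job.jar
```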

Mastering Flink isn’t about learning one trick and calling it a day. It’s about understanding the layers, digging deeper, optimizing harder, and squeezing every ounce of performance from your system. If you’re just getting started, buckle up—because this ride only gets wilder the further you go.

Hats off to Adi Polak of Confluent for inspiring this article! She recently posted the layers of Flink, which I found incredibly useful. And like most data visionaries, she's excitable, writing in a separate post:

“Step into a world where data doesn’t just flow—it dances with precision! This isn’t the future; it’s our present. As Data Streaming Engineers and enthusiasts, we’re at the forefront of this revolution.”

And the revolution continues! Watch for streaming-first architectures to take hold in 2025. They solve too many business challenges too efficiently not to be seriously explored.