Streaming: Easy as E-S-B?
Twenty years ago, system or software architectures that were capable of supporting analytics in real-time against fresh data tended to be prohibitively expensive.[i] Today, an enterprise can select from among any of several open source stream-processing and messaging-middleware technologies to design a data architecture that supports real-time data ingest, integration, and analytics.
But doesn’t an enterprise already have most of what it needs for stream-processing? What about the venerable enterprise service bus (ESB), that backbone technology of service-oriented architecture (SOA) and application integration? The answer to these questions is frustratingly familiar: it depends.
Message-brokering by any other name…
Apache Kafka is the stream-processing superstar, even though it was neither the first – nor, for that matter, is it the latest – of the mainstream open source stream-processing technologies. Arguably, Kafka was initially conceived as a highly scalable message-brokering framework; over time, its maintainers[ii] developed components that support import/export connectivity to databases/data stores, file systems, and other data sources (Kafka Connect) as well as – crucially – event-driven stream processing (Kafka Streams). The combination of these features permits Kafka to ingest data of different types, perform operations on it in real-time, and distribute it to subscribers. Because Kafka is architected as a multi-threaded stream-processing platform, enterprises can deploy multi-node Kafka clusters that are capable of ingesting large volumes of data, processing it concurrently, and distributing it at scale.
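The pattern described above – producers publishing records to named topics, stream processors transforming them, and subscribers consuming the results – can be sketched in a few lines. The toy broker below is illustrative only: real Kafka deployments use the broker itself and its client libraries, and every name here is made up.

```python
from collections import defaultdict

class MiniBroker:
    """Toy illustration of Kafka's core pattern: producers append records
    to named topics; subscribers receive every record in order."""
    def __init__(self):
        self.topics = defaultdict(list)        # topic -> append-only log
        self.subscribers = defaultdict(list)   # topic -> callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, record):
        self.topics[topic].append(record)      # durable, ordered log
        for callback in self.subscribers[topic]:
            callback(record)

broker = MiniBroker()
fahrenheit = []

# A "stream processor": transforms each record on one topic and re-publishes
# the result to another, much as a Kafka Streams topology would.
broker.subscribe("readings-celsius",
                 lambda c: broker.publish("readings-fahrenheit", c * 9 / 5 + 32))
broker.subscribe("readings-fahrenheit", fahrenheit.append)

for celsius in (0, 100):
    broker.publish("readings-celsius", celsius)

print(fahrenheit)  # [32.0, 212.0]
```

The append-only topic log is the detail that matters: consumers read from it at their own pace, which is what lets Kafka decouple fast producers from slow subscribers.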
The thing is, the venerable ESB also provides many of the same functions that Kafka provides. In addition, ESBs are capable of routing, mediating, filtering, and normalizing messages, as well as of performing other types of (often complex) transformations on them. (The terms “message” and “stream” are abstractions that can describe the same thing: viz., an unbounded, constantly changing data set.) The best ESBs permit an enterprise to define highly granular rules that govern how messages are routed, as well as to orchestrate multiple (concurrent or dependent) transformations.[iii]
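The “highly granular rules” an ESB evaluates per message are typified by the content-based router pattern, sketched below. The rules, predicates, and endpoint names are hypothetical; a production ESB would express them declaratively rather than in application code.

```python
# Illustrative content-based router: each message is matched against an
# ordered list of (predicate, endpoint) rules, ESB-style.
def route(message, rules, default="dead-letter"):
    """Return the first endpoint whose predicate matches the message."""
    for predicate, endpoint in rules:
        if predicate(message):
            return endpoint
    return default

rules = [
    (lambda m: m.get("priority") == "high", "alerts-queue"),
    (lambda m: m.get("type") == "order",    "orders-service"),
]

print(route({"type": "order", "priority": "high"}, rules))  # alerts-queue
print(route({"type": "invoice"}, rules))                    # dead-letter
```

Note that the router inspects each message individually and in isolation – exactly the discrete, per-message worldview the article goes on to contrast with streaming.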
These observations invite two fairly obvious questions: first, aren’t enterprises already well-stocked with ESBs, thanks to nearly two decades of experimentation with SOA; and, second, can an ESB function as a conduit and engine for stream-processing, too?
The appropriate answer to both questions is: “Yes, but….” In spite of their similarities, ESBs and stream-processing technologies such as Kafka are not so much designed for different use cases as for wholly different worlds. True, a flow of message traffic is potentially “unbounded” – e.g., an ESB might transmit messages that encapsulate the ever-changing history of an application’s state – but each of these messages is, in effect, an artifact of a world of discrete, partitioned – i.e., atomic – moments.
“Message queues are always dealing in the discrete, but they also work very hard to not lose messages, not to lose data, to guarantee delivery, and to guarantee sequence and ordering in message transmits,” said Mark Madsen, an engineering fellow with Teradata.
Stream-processing, by contrast, correlates with a world that is in a constant state of becoming; a world in which – as the pre-Socratic philosopher Heraclitus famously put it – “everything flows.” (The Greek phrase, panta rhei, could literally be translated “everything streams.”) In other words, says Madsen, using an ESB to support stream processing is roughly analogous to using a Rube Goldberg-like assembly line of buckets – as distinct from a high-pressure feed from a hose – to fill a swimming pool.
On the one hand, with a platform such as Kafka you don’t get the atomicity, reliable message routing, guaranteed delivery, etc. that you get with an ESB. On the other hand, Madsen explains, you don’t care – or, rather, if you do care, you augment Kafka with complementary technologies (such as Apache Camel) that provide these capabilities.
“Basic Kafka provides none of those guarantees. If you want guarantees, you have to layer other things [for example, open source technologies such as Camel] on top of it,” he notes.
“The point of a stream is that it is constantly flowing. You tap into it as you are able to.”
Stream-processing is a foundational technology
Justin Reock, chief evangelist for open source software and API management with Perforce Software, describes the test he uses to help his company’s customers determine which kind of technology they should use to support stream-processing. “It really comes [down] to whether … they want to be operating on one piece of data at a time or whether they want to operate on a [large] set of data at a time, a potentially unbounded set of data at a time,” said Reock during a recent episode of DM Radio, a weekly data management-oriented program. In the former case, a lightweight technology (such as ActiveMQ) is sufficient; in the latter, Reock says, it is basically Kafka or bust. “If they’re working on an unbounded set of data, that’s a stream pattern right there,” he told DM Radio host Eric Kavanagh.
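Reock’s test – one piece of data at a time versus a potentially unbounded set – can be illustrated with a rolling window over a stream that never ends. The generator below stands in for such a stream; the reading values and window size are invented for the sketch.

```python
from itertools import count, islice

def sensor_stream():
    """Stand-in for an unbounded stream: yields readings forever."""
    for i in count():
        yield i % 10

def windowed_averages(stream, size):
    """Stream pattern: operate on a rolling set of records, not one record
    at a time, emitting an aggregate per window."""
    window = []
    for reading in stream:
        window.append(reading)
        if len(window) == size:
            yield sum(window) / size
            window.clear()

# A consumer "taps into" the stream as it is able to -- here, three windows.
averages = list(islice(windowed_averages(sensor_stream(), 5), 3))
print(averages)  # [2.0, 7.0, 2.0]
```

The point of the sketch is that the consumer never asks for “all the data”; it computes over whatever slice of the unbounded flow it has seen so far, which is the defining trait of the stream pattern Reock describes.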
There is another reason the venerable ESB is not a suitable backbone for streaming traffic: ESB design is predicated upon a centralized, as distinct from a decentralized, model. For practical and logistical reasons, decentralization is the new imperative in software architecture. This makes sense: useful data is always already distributed; rather than moving large volumes of data from the edge to the center via constricted WAN pipes, it makes sense to process data in situ – where it lives, originates, etc. “The ESB pattern is also becoming a bottleneck to systems that need to be able to [perform] distributed [data processing] in real-time,” Reock explains. “Because you want these [data processing workloads] to process as close to where the data originates as possible. In order to do that you probably have to distribute that analytic workload across multiple geographic data centers, regions, things like that.”
The upshot, Kavanagh suggested, is that “Kafka can serve as a springboard for all sorts of new things.” By implementing a Kafka-based streaming foundation, an organization can provision data for different types of machine and human consumers – e.g., data scientists, business analysts, machine learning engineers – alike. (Organizations can define Kafka topics that are designed for the needs of specific consumers.) “[This is] also opening future doors at the same time because now as you sort of architect your future systems as you want them to be, you can have [Kafka] as one of your key foundations delivering data wherever it needs to go,” Kavanagh pointed out.
In addition to supporting real- or right-time data access for non-traditional consumers, enterprises can use stream-processing to transform their businesses – for example, by implementing event-driven architectures that choreograph ensembles of automated actions (across applications, services, systems, etc.) in response to triggers. The transition to event-driven architecture, in particular, has the potential to put business on a wholly different – a more active, agile, and responsive – footing.
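A minimal sketch of the event-driven pattern described above: handlers register for an event type, and a single trigger fans out to an ensemble of automated actions across notionally separate services. The event and handler names are hypothetical.

```python
from collections import defaultdict

# Registry mapping event types to the handlers that react to them.
handlers = defaultdict(list)
actions_taken = []

def on(event_type):
    """Decorator: register a function as a handler for an event type."""
    def register(fn):
        handlers[event_type].append(fn)
        return fn
    return register

def emit(event_type, payload):
    """Trigger: fan the event out to every registered handler."""
    for fn in handlers[event_type]:
        fn(payload)

@on("order-placed")
def reserve_inventory(order):
    actions_taken.append(f"reserved {order['sku']}")

@on("order-placed")
def notify_shipping(order):
    actions_taken.append(f"shipping notified for {order['id']}")

emit("order-placed", {"id": 42, "sku": "A-100"})
print(actions_taken)  # ['reserved A-100', 'shipping notified for 42']
```

The emitter knows nothing about the handlers – it simply announces that something happened – which is what lets new automated actions be added without touching the systems that raise the events.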
Reock concurred, describing a Perforce customer that uses Kafka to stream data from relational databases running on IBM’s i midrange computing platform to the cloud. “Our company’s been helping IBM’s i folks for a while just throw Kafka up into the PASE layer of their IBM i [systems] and then easily couple … DB2 work queues … right from [System i] to [relevant] Kafka topics and then out into cloud analytics,” he told Kavanagh. “That’s a big deal. To be able to get that data in real-time out of these [systems] that are turning around and collecting the data from … [sensors on the] factory floor.”
The uncomfortable irony, Reock concluded, is that the days of the venerable ESB may, in fact, be numbered. “I think that the ESB pattern is still valuable for businesses that are highly acquisitive or that are federating a lot of … new systems or dealing with a lot of heterogeneous endpoints,” he allowed.
“If you look at what’s happening with like, say, the API gateway space combined with like the service mesh space – that I think is what you’re seeing replace the traditional SOA or ESB integration layer,” Reock argued. “So API-to-API [connectivity] with choreography, not orchestration…. [so that] service endpoints can let each other know where to go and where to find other resources, rather than pulling from something central. So I would say service mesh with API-to-API is what’s slowly replacing ESB.”
[i] For example, Teradata introduced its first “active data warehousing” (ADW) systems in ~2005. At the time, Teradata marketed ADW as a premium, high-end offering. Even prior to this, TIBCO and the former WebMethods (since acquired by Software AG) marketed premium enterprise application integration (EAI) software and connectors designed to support real-time use cases. Data integration vendors Ab Initio, the former Ascential Software, Informatica, and SAS Institute, among others, also marketed premium real-time-branded products.
[ii] It’s worth noting that many of the most prolific committers to Apache Kafka work for a single commercial vendor, Confluent Inc., that provides Kafka-related services and support. So while Kafka is nominally a “free” open source technology, the reality is that many, perhaps most, large production users tend to partner with Confluent (or with other technology providers) to obtain complementary software, services, and support.