Time-series data is inherently statistical. Each of the individual records of a time-series is indexed to a specific point in time. The “series” is the behavior that these records, in the aggregate, demonstrate over a period of time.
This is different from recording data in response to a specific event. In a time series, events are recorded at predefined intervals. Another way of putting this is to say that the interval is the event. (This interval is called the time-series “frequency.”) This is comparable to the method of sampling used in digital signal processing.
From a business perspective, collection and analysis of time-series data has the potential to yield insight into different kinds of stable or emergent trends. For example, is the time series stationary, or does it demonstrate deterministic behavior – a rising or falling trend – over time? Other time-series-related behaviors include seasonality (a behavior recurs on a predictable basis) and structural breaks, which denote sudden, usually anomalous departures from expected behavior.
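These behaviors can be made concrete with a little code. The following sketch is an invented illustration (not drawn from any product discussed here): it estimates a least-squares slope to distinguish a series with a deterministic rising trend from one that looks roughly stationary, using nothing but the standard library.

```python
# Toy illustration: telling a trending series from a flat one.
from statistics import mean

def slope(series):
    """Least-squares slope of a series indexed 0..n-1 (the 'trend')."""
    n = len(series)
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(series)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, series))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

# A rising (deterministic-trend) series vs. a flat (stationary-looking) one.
trending = [2.0 * t + 1.0 for t in range(12)]
flat = [5.0, 4.9, 5.1, 5.0, 5.05, 4.95, 5.0, 5.1, 4.9, 5.0, 5.0, 5.05]

print(round(slope(trending), 2))  # 2.0 -- a clear rising trend
print(abs(slope(flat)) < 0.05)    # True -- no meaningful trend
```

Seasonality and structural breaks call for richer tests (autocorrelation at the seasonal lag, change-point detection), but the basic idea is the same: each behavior is a statistical property of the series as a whole, not of any single record.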
All of this is to say that the difference between time-series data and the cross-sectional data associated with OLTP databases is anything but academic, contends Tim Hall, vice president of products with InfluxData, developer of InfluxDB, an open-source time-series database.
Hall says time-series data can be useful for understanding and correcting problems with business operations and their enabling “digital” services.
Appearing on an episode of DM Radio, a weekly, data management-themed program hosted by analyst Eric Kavanagh, Hall noted that managing application performance used to involve monitoring low-level indicators – such as disk I/O, CPU utilization, and network latency. The problem was that these metrics correlated only obliquely with the quality of business services. They were proximal indicators, at best. This is one of the reasons companies are building different kinds of “instrumentation” into the software that supports these services. The rich data generated by this “observable” software is useful for modeling and managing the delivery of business services.
Hall cited e-commerce giant Wayfair as an example in kind. “If you look at … Wayfair – who instruments their entire e-commerce shopping experience with Influx[DB] – they’re looking at the impact of the end-user shopping experience … [but] the way that they’re trying to measure it is by saying, ‘I want to understand whether there’s a … slowness in terms of checkout or an issue with the credit-card processor,’” he told listeners.
Hall stressed that merely capturing time-series data is insufficient, however. As always, the challenge lies in using this data: “If you can collect all of this fundamental data, and … store it as time-series, great, but then someone’s gonna have to provide their own unique insight on top of it to get to the answer.”
Time has come today
Hall’s remarks get at a concept called observability: the idea that software and the business services that depend on it should constitute an observable system. That even stuff that is not normally measured and monitored – e.g., the user experience – should be formalized and quantified, i.e., made observable. That problems with IT resources should be correlated to their real-world impact on the delivery of business services, to the point that businesses are able not only to diagnose problems with these services, but also to understand how these problems affect consumers.

Instrumenting for this is extremely complex, however – especially in the context of a modern distributed software architecture. Not only must businesses design observability instrumentation into their software, but they must collect, sequence, and analyze data generated by this instrumentation, too.
These are problems that the time-series database, in particular, is engineered to address.
So, for example, dedicated time-series database systems incorporate functions and models used to sample and aggregate data; to align two or more time-series sources; to join and correlate them – e.g., at different intervals or granularities; and to deal with other pertinent issues. “The challenge that we’ve set out to try to address is how do you deal with distributed time? How do you deal with synchronizing those clocks in those [distributed] data collections?” Hall told Kavanagh.
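To make the alignment problem concrete, here is a hypothetical sketch in plain Python: a fast-sampled series is downsampled to the granularity of a slower one so the two can be joined on shared timestamps. A real time-series database performs this with built-in functions; the names and data here are invented for the example.

```python
# Hypothetical sketch of "aligning" two series: downsample a 1-second CPU
# series into 5-second buckets, then join it against a 5-second latency
# series on the timestamps they share.
from statistics import mean

def downsample(points, bucket):
    """Average (timestamp, value) points into buckets of `bucket` seconds."""
    buckets = {}
    for ts, val in points:
        buckets.setdefault(ts - ts % bucket, []).append(val)
    return {ts: mean(vals) for ts, vals in sorted(buckets.items())}

cpu_1s = [(t, 40.0 + t % 5) for t in range(0, 15)]   # one reading per second
latency_5s = {0: 120.0, 5: 135.0, 10: 150.0}         # one reading per 5 seconds

cpu_5s = downsample(cpu_1s, 5)
joined = {ts: (cpu_5s[ts], latency_5s[ts])
          for ts in cpu_5s.keys() & latency_5s.keys()}
print(sorted(joined.items()))
# [(0, (42.0, 120.0)), (5, (42.0, 135.0)), (10, (42.0, 150.0))]
```

The hard part Hall alludes to – synchronizing clocks across distributed collectors so those timestamps are actually comparable – happens before a join like this is even meaningful.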
The business benefits have the potential to be hugely significant. After all, concomitant with the ability to observe phenomena at this level of abstraction is the ability to adjust or improve at this level of abstraction, too. Imagine, for example, selectively prioritizing certain kinds of consumers in the midst of ongoing service-provider availability issues, or proactively spinning up new cloud resources – and new instances of cloud applications and services – to meet unexpected demand for particular business services.
Observability is a formal concept in modern software architecture. Like all such concepts, however, it can be implemented more, or less, formally. A formal, doctrinaire approach to microservices architecture emphasizes the strict decomposition of an application and its constitutive capabilities into primitive functions: i.e., functions that are designed to do just one thing. Not all adopters build microservices this way, however. From the perspective of a microservices dogmatist, then, another, more pragmatic adopter’s “microservice” could look a lot like an application monolith.
This is true in the case of observability, too. A strictly observable system is ideal, to be sure, but software that achieves a kind of pragmatic observability – let’s call it improved “visibility” – is useful, too, is it not? What matters, argues Hyoun Park, principal analyst at Amalgam Insights, an AI, ML, and decision-automation strategy firm, is that businesses expect a different kind of visibility into the software they’re building and deploying to support business services. They’re no longer satisfied with the IT-centric lensing of old. Time-series data serves as a new lens through which they can view and understand their businesses.
“It’s less important to figure out the perfect model that describes exactly how something happened and much more important to figure out and prioritize the few key drivers that really mattered most, because, frankly, if you’re building a perfect model that describes exactly what happened, all you’re doing is kind of recreating the past and frankly probably overfitting,” he told Kavanagh.
The softly spoken magic spell
Quite aside from the observability angle, it is an irrefragable fact that businesses are collecting more data from more producers than ever before – and that with respect to much of this information (as with the data generated by connected devices, sensors, and other IoT signalers), the time dimension is of critical importance. “There are some folks that are doing test and measurement sort of instrumentation, so if you can imagine a turbine spinning, those things run at very rapid RPMs, and if they get out of balance, they can rip themselves apart. So, you want … to know very quickly how to shut those things off if they start to wobble,” Hall pointed out, stressing that “those kinds of environments [pose] really interesting problems at scale, [problems of] precision, and problems of distribution.”
As a thought experiment, imagine extrapolating from Hall’s preventative-maintenance use case to other sensor- or IoT-specific use cases. Then, imagine combining this data with the data produced by observability instrumentation, or with the data generated by conceivably any business process or practice area. Collecting data at this scale goes to probably the most ambitious application for time-series data: helping to construct a so-called “digital twin” of the business and its world. The idea of the digital twin presupposes an ability to model the business in ways (and at a scale) never before imagined; at its most extreme, the digital twin of the business would permit decision makers to simulate activities: e.g., to game out potential strategies before putting them into practice.
Lori Witzel, director of product marketing with TIBCO Software, described to Kavanagh how one of her company’s customers, the Mercedes-AMG Petronas Formula 1 racing team, analyzes data generated by a profusion of sensors to model and represent the conditions and constraints of Formula 1 racing. She claims that modeling at this level of realism gives Mercedes-AMG Petronas an edge on the track.
The digital twin is as much aspiration as achievement – something businesses would like to do, but that few, if any, large businesses are actually capable of doing. (Call me a pessimist, but I expect that simulation at this level will elude essentially all businesses for quite some time to come.)
However, Witzel sees interest in time-series data, in particular, as linked to a trend that is not in any sense aspirational: the commodification of the technologies used to ingest and analyze real-time data. Even if their ambitions stop short of creating digital replicas of their businesses, companies are using right-time data, along with historical data, to accurately model facets of their businesses, from the behavior of turbines in the context of preventive maintenance to the experiences of consumers as they interact with digital services. Time-series data is an important right-time data source.
“You want stuff as close to real-time as you can get it. You need not only sensor data … you need things that will cover you over a thinner slice of recent time to try to sense, predict, and prescribe for the future,” she told Kavanagh.
Sign o’ the times
DM Radio host Kavanagh agreed. In fact, he explicitly connected the explosion of new time-series products and services to mainstream demand for real-time data, as well as to an upsurge in demand for real-time analytics. This is evident in the open source space, where InfluxDB competes against Apache Druid and Prometheus, as well as OpenTSDB. Meanwhile, established open-source databases such as Cassandra, MongoDB, PostgreSQL, Redis, and Riak incorporate time-series capabilities, too.
In SQL-speak, time-series data is INSERT-only: instead of UPDATE-ing an existing record with current data, the database INSERTs a new record, typically a new row. If this sounds like a workload a relational database could also perform, it can. In fact, most commercial relational databases – DB2, Oracle, Snowflake, SQL Server, and Teradata, among others – now incorporate time-series functionality. One partner even describes Snowflake as “the best time-series database in the world.”
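The INSERT-only pattern is easy to sketch against a general-purpose relational engine. This example uses SQLite’s in-memory database via Python’s standard library; the table, host name, and values are invented for illustration:

```python
# Illustrative sketch of the INSERT-only pattern: every measurement becomes
# a new row; nothing is ever UPDATEd in place.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpu_metrics (ts INTEGER, host TEXT, usage REAL)")

for ts, usage in [(1000, 41.5), (1010, 43.2), (1020, 44.8)]:
    conn.execute("INSERT INTO cpu_metrics VALUES (?, ?, ?)",
                 (ts, "web-01", usage))

# Queries then reason over the accumulated history -- e.g., the latest reading:
latest = conn.execute(
    "SELECT ts, usage FROM cpu_metrics WHERE host = ? ORDER BY ts DESC LIMIT 1",
    ("web-01",),
).fetchone()
print(latest)  # (1020, 44.8)
```

The append-only shape is what lets both relational and dedicated time-series engines handle the workload; where they diverge is in what they can do with the data once it is stored.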
However, the value-add of a dedicated time-series database is that it incorporates pre-built functions, models, etc. designed for time-series data. So, not only is it optimized for storing and managing time-series data, but it is also optimized for ordering and performing operations on this data, as well as for using math to correct errors (e.g., gaps, inconsistencies, statistical irregularities) in the time series.
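As a toy illustration of the kind of “time-series math” at issue, here is a hypothetical gap-filling routine that linearly interpolates missing readings in a regularly sampled series – the sort of operation a dedicated engine ships as a built-in rather than leaving to the application:

```python
# Hypothetical sketch: fill gaps in a regularly sampled series by linear
# interpolation between the nearest non-missing neighbors.
def fill_gaps(series):
    """Replace None entries with interpolated values (assumes the first
    and last points are present)."""
    filled = list(series)
    for i, v in enumerate(filled):
        if v is None:
            lo = i - 1  # nearest known value to the left (filled on a prior pass)
            hi = next(j for j in range(i + 1, len(filled))
                      if filled[j] is not None)
            step = (filled[hi] - filled[lo]) / (hi - lo)
            filled[i] = filled[lo] + step * (i - lo)
    return filled

readings = [10.0, None, None, 16.0, 18.0]
print(fill_gaps(readings))  # [10.0, 12.0, 14.0, 16.0, 18.0]
```

Dedicated systems go well beyond this – smoothing, outlier correction, windowed aggregates – but the principle is the same: the math lives next to the data.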
As a source with a commercial vendor told me, the difference between the RDBMS produced by their company and a dedicated time-series database is “a question of math.” Their RDBMS “is deficient in pre-built time-series math. It has the basic functions, but not more advanced things,” this person said.
A delicious 21st-century irony is that businesses must also design software – and instrumentation – to observe the behavior of this observability instrumentation itself. This is one role for Prometheus, which has its own time-series store.