Data warehousing your time-series data: danger zone
Nowadays, data warehouses seem to be the solution for everything – one-stop shops for all your enterprise data requirements. And if you have some experience with data warehouses, you might ask yourself, why would I need anything else? Data warehouses work great!
However, apply a data warehouse to a large volume of time-series data – for example, high-frequency waveform data from your factory’s IoT sensors – and you will soon run into several critical issues: extremely high storage costs, slow queries, data lag, a performance wall, or simply the inability to ingest that much data at a constant rate.
And that’s when a time-series specialized database management system (TSDBMS) comes into play.
In this article, we’ll explore why a TSDBMS offers such drastic performance and cost benefits for time-stamped data, and why this should be one of the critical data solutions you need for your company.
Time-series specific needs
Before discussing the difference in the systems, let’s quickly go over the typical problems that arise when working with time-series data.
- Volume – time-series data is intrinsically cumulative, and the inflow is constant, with irregular spikes.
- Latency – you usually need quick answers over both fresh and historical data for monitoring or decision-making.
- Time-specific queries – transformations, joins, and aggregations that are typical of time-series data but difficult or inefficient on non-specialized systems: temporal joins, downsampling, interpolation…
What does “optimized for time-series” data mean?
Unless you’ve designed a database engine, it might not be immediately obvious why specializing a database management system for certain workloads can offer immense benefits.
Imagine managing a warehouse that must handle various packages all day, with deliveries at regular, scheduled intervals. You never know in advance what kind of package you may need to service, but you know it’s ok not to be able to serve a package right after it arrives.
With those constraints in mind, you might come up with the following organizational process:
- Optimize your process for bulk loads: terminals for the trucks, forklifts, staging areas…
- Organize packages by size and type to make it easy to access a specific item.
- Ensure your storage can handle any package.
- Make sure you can fulfill any request: “Give me five of the one-meter-wide square metal sheets that arrived a month ago from Germany.”
That’s a data warehouse: bulk loads of heterogeneous data and the ability to handle any complex, unexpected query.
Now, imagine that instead of regular deliveries, you can have deliveries at any time, often in bursts at high volume.
Additionally, you no longer have the luxury of being able to stage the deliveries, as you may be requested to serve a package immediately after it arrives.
You’d think: “Well, that sucks; I guess I will need a bigger staff and warehouse…”
But what if the packages had a standardized form? You could start to create shelves optimized for that size, allowing you to store more in the same space. That means you could load and unload faster.
It would also be easier to catalog the items because they are all identical in shape, and you could opt for a simplified organization to find an item quickly.
This new “warehouse” is your time-series-optimized database. It would probably not do a great job at solving the original problem, but it makes all the difference in solving the second one, saving time, resources, and money.
This is a tradeoff between versatility and efficiency. Those tradeoffs are all around you: in your kitchen, in your toolbox, in your car…
What do time-series databases do better?
The immediate follow-up question might then be, “Great! I understand that an optimized time-series platform will be better, but where is it better, and how does it benefit my organization or me?”
We can summarize the benefits in three categories: faster insertion, first-class time indexing, and better compression.
Time-series data typically arrive ordered (or almost ordered), and the insertion patterns are usually append-mostly, meaning data will be appended and not inserted potentially anywhere in the database.
The append-mostly nature of time-series data allows for optimized writes, as the system can predict with much greater accuracy where new data will land. The resulting sequential access pattern lets the storage layer leverage sequential I/O, caching mechanisms, and other optimizations that also enhance data retrieval speeds.
This can result in several orders of magnitude faster insertion speed.
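To build intuition for why append-mostly writes are so much cheaper, here is a minimal Python sketch – not how any real engine (Quasar included) stores data, just an illustration of the cost difference between appending to the end of a sorted structure and inserting at arbitrary positions:

```python
import bisect
import random
import time

N = 50_000

# Append-mostly: timestamps arrive (almost) in order, so each write
# goes at the end of the already-sorted structure -- O(1) per write.
t0 = time.perf_counter()
append_log = []
for ts in range(N):
    append_log.append(ts)
append_s = time.perf_counter() - t0

# Arbitrary inserts: each write may land anywhere, forcing the
# structure to shift existing entries -- O(n) per write here.
t0 = time.perf_counter()
random_log = []
for ts in random.sample(range(N), N):
    bisect.insort(random_log, ts)
random_s = time.perf_counter() - t0

print(f"append-only: {append_s:.3f}s, random insert: {random_s:.3f}s")
```

Both runs end with the same sorted data; only the write pattern differs, and the gap grows with the data volume.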
Some TSDBMS take this a little too far by limiting the engine’s capabilities for updates and removals, so before you commit to an engine, check its limitations.
Quasar has no limitations regarding updates and deletions and even supports all possible updates and deduplication scenarios.
Since time is almost always the primary lookup dimension for time-series tables, treating time indexes as first-class citizens in the system greatly improves the performance of time-based lookups.
By default, data can be sorted by time, which optimizes both insertions and queries, as it allows binary searches instead of linear scans.
The problem with indexes is that they cost extra storage. But if your data is already sorted by time, you don’t need to build an additional index, saving both time and disk space.
In Quasar, for example, we pushed the time index optimization so deep into the engine that we can localize a time-series bucket within a cluster in constant amortized time.
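The “sorted data is its own index” idea is easy to demonstrate. Below is a hedged sketch in plain Python – a toy stand-in for a storage engine – where two binary searches over sorted timestamps answer a time-range query without any auxiliary index structure:

```python
import bisect

# Timestamps stored sorted (seconds since epoch); values in a parallel list.
timestamps = list(range(0, 1_000_000, 10))   # one reading every 10 seconds
values = [t % 97 for t in timestamps]        # arbitrary sensor values

def range_query(start, end):
    """Return the values for start <= timestamp < end.

    Because the timestamps are already sorted, two binary searches
    locate the slice in O(log n) -- no extra index needed.
    """
    lo = bisect.bisect_left(timestamps, start)
    hi = bisect.bisect_left(timestamps, end)
    return values[lo:hi]

window = range_query(500_000, 500_100)   # a 100-second window
print(len(window))                        # 10 readings
```

A linear scan would touch all 100,000 rows; the binary search touches about 17.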
Knowing that data is in a time-series format provides significant advantages regarding data compression. First, time-series data often exhibit high similarity and regularity levels, which can be leveraged to achieve superior compression ratios. Time-series data often contains patterns, trends, and repetitive values, especially in sensor readings or IoT data. By recognizing and exploiting these patterns, compression algorithms can efficiently encode the data using fewer bits, leading to higher compression ratios than general-purpose compression techniques.
Second, the append-only nature of time-series data enables compression techniques that take advantage of temporal locality. As new data points are added sequentially, compression algorithms can leverage temporal compression schemes such as delta or delta-of-delta. These techniques store the differences or changes between consecutive data points instead of storing the absolute values. Since neighboring data points in a time series tend to have small differences, this approach further reduces the data size by representing the changes rather than the entire values, resulting in more efficient storage utilization.
Compression algorithms can adapt and tailor their strategies to exploit the inherent characteristics of time-series data by understanding the underlying time-series nature of the data. This domain-specific compression can achieve higher compression ratios than generic compression techniques that lack awareness of time-series patterns. Consequently, knowing that data is in a time-series format allows applying specialized compression techniques and unlocking enhanced compression capabilities, reducing storage requirements and improving overall efficiency.
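Delta-of-delta encoding, mentioned above, is simple enough to sketch in a few lines. This is an illustrative Python version of the idea only – production encoders (Quasar’s Delta4C included) work at the bit level and are far more elaborate:

```python
def delta_of_delta(timestamps):
    """Encode timestamps as (first, first_delta, delta-of-deltas).

    For regularly sampled data the deltas are constant, so the
    delta-of-deltas are almost all zero -- which a bit-level encoder
    (not shown) can then store in very few bits per point.
    """
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods

def decode(first, first_delta, dods):
    """Rebuild the original timestamps from the encoded form."""
    out = [first, first + first_delta]
    delta = first_delta
    for dod in dods:
        delta += dod
        out.append(out[-1] + delta)
    return out

# One reading per second, with a single 2-second gap.
ts = [1000, 1001, 1002, 1004, 1005]
first, d0, dods = delta_of_delta(ts)
print(dods)               # [0, 1, -1] -- mostly zeros on regular data
assert decode(first, d0, dods) == ts
```

On perfectly regular data, the delta-of-delta stream is all zeros, which compresses to almost nothing.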
Quasar, for example, has an extensive, bespoke compression suite called Delta4C, optimized for time-series built on exabytes of real data.
Efficient compression has another advantage: it improves I/O further. If you can compress data faster than you can write it, you both save disk space and make writes faster.
Time-series specific queries
Time-series database management systems (TSDBMS) excel at executing a range of time-series-specific queries, leveraging their specialized capabilities and optimized structures. Two key types of queries where TSDBMS shine are time-based range queries and complex aggregations.
Time-based range queries are a fundamental requirement in time-series analysis. TSDBMS efficiently handles queries that retrieve data within specific time intervals or periods. With their built-in indexing and optimized data structures, TSDBMS can quickly identify and retrieve relevant data points within the desired time range. This enables users to extract specific subsets of data for analysis, visualize trends over time, or identify anomalies within a specified temporal window. The ability to perform time-based range queries swiftly and accurately is essential for conducting meaningful analysis of time-series data.
In addition to time-based range queries, TSDBMS offers powerful capabilities for complex aggregations. Aggregations over time-series data involve calculating averages, sums, maximum or minimum values, and statistical metrics like standard deviation or percentiles. TSDBMS optimizes these aggregations by leveraging specialized algorithms and pre-computed summaries. These pre-computed summaries, often stored at different levels of granularity, enable rapid aggregation operations without needing to process the entire raw data set. By efficiently executing complex aggregations, TSDBMS empowers users to gain insights into their time-series data’s overall behavior, trends, and statistical characteristics.
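The pre-computed-summary idea can be sketched in a few lines of Python. This is a hand-rolled illustration, not any engine’s actual mechanism: summaries are maintained per hour at ingestion time, so a whole-day aggregation touches 24 summary rows instead of 86,400 raw rows:

```python
from collections import defaultdict

# Raw points: (epoch_second, value) -- one day of 1 Hz data.
points = [(t, float(t % 60)) for t in range(0, 86_400)]

# Maintain per-hour summaries once, at ingestion time...
hourly = defaultdict(lambda: {"count": 0, "sum": 0.0,
                              "min": float("inf"), "max": float("-inf")})
for t, v in points:
    s = hourly[t // 3600]
    s["count"] += 1
    s["sum"] += v
    s["min"] = min(s["min"], v)
    s["max"] = max(s["max"], v)

# ...so a daily average reads 24 summary rows, not 86,400 raw rows.
day_avg = sum(s["sum"] for s in hourly.values()) / \
          sum(s["count"] for s in hourly.values())
print(round(day_avg, 2))   # 29.5 -- the mean of 0..59
```

Engines typically keep such summaries at several granularities (minute, hour, day) and pick the coarsest one that answers the query.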
In Quasar, we even built in advanced, domain-specific time-series transformations such as the Fast Fourier Transform (FFT) or order book processing.
Column-oriented storage is well-suited for time-series data due to its inherent query performance and storage efficiency advantages. Time-series data often involves analyzing specific attributes or columns across many data points.
With columnar storage, each column is stored separately, allowing for highly efficient compression and improved query performance. Since queries typically involve selecting specific attributes or performing aggregations on columns rather than rows, column-oriented databases can selectively load only the necessary columns, reducing I/O operations and minimizing disk read times. This storage layout enables faster data retrieval and query execution, making column-oriented storage an optimal choice for time-series data analysis where fast access to specific attributes or columns is crucial.
If you opt for a TSDBMS, ensure it’s column-oriented both in memory and on disk.
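The row-versus-column layout difference is easy to picture with plain Python lists. This toy sketch is not how a real columnar engine lays out bytes on disk, but it shows why a single-attribute aggregation only needs one column:

```python
# Row-oriented: each record stored together, as a general-purpose
# relational engine might lay it out.
rows = [{"ts": t, "temp": 20.0 + t % 5, "pressure": 1013.0, "site": "A"}
        for t in range(100_000)]

# Column-oriented: one contiguous array per attribute.
cols = {
    "ts":       [r["ts"] for r in rows],
    "temp":     [r["temp"] for r in rows],
    "pressure": [r["pressure"] for r in rows],
    "site":     [r["site"] for r in rows],
}

# An aggregation over one attribute only touches that column; a row
# store must still walk every record (and, on disk, read every byte).
avg_temp_rowwise = sum(r["temp"] for r in rows) / len(rows)
avg_temp_colwise = sum(cols["temp"]) / len(cols["temp"])
assert avg_temp_rowwise == avg_temp_colwise
```

On disk the effect is amplified: the column store reads only the `temp` pages, and a homogeneous column of floats also compresses far better than interleaved rows.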
Specialized functions and features
Time-series data have typical queries that can be hard to express in vanilla SQL, making developers and DBAs less productive and sometimes severely impacting query performance.
We’re not going to make an exhaustive list of all possible time-series-specific functions; instead, we’ll pick three that are typical and painful to perform in a traditional relational database: downsampling, temporal joins, and multi-version management.
Downsampling is a crucial technique in time-series data analysis, and time-series database management systems (TSDBMS) excel in efficiently performing this operation. Downsampling involves reducing the granularity of time-series data by aggregating multiple data points into larger time intervals. TSDBMS leverage specialized downsampling algorithms and functions that enable users to summarize and condense vast amounts of time-series data while preserving key patterns and trends. By downsampling, users can reduce the data volume, simplify the analysis, and improve query performance, especially when dealing with high-frequency or long-term time-series data. TSDBMS makes downsampling a seamless process, allowing users to balance data granularity and storage efficiency, ultimately enhancing the usability and performance of time-series data analysis.
Downsampling is very useful when building a dashboard: there is often no benefit in plotting, say, 20,000 points per second, but you might still want to keep the high-resolution data for analysis.
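As an illustration, here is a minimal mean-downsampling sketch in Python. Real engines offer many policies (first/last/min/max, interpolation, plotting-oriented algorithms); the mean is just the simplest one to show:

```python
def downsample(timestamps, values, bucket):
    """Average values into fixed-size time buckets (mean downsampling)."""
    sums = {}
    for t, v in zip(timestamps, values):
        key = t - t % bucket          # start of this point's bucket
        s = sums.setdefault(key, [0.0, 0])
        s[0] += v
        s[1] += 1
    keys = sorted(sums)
    return keys, [sums[k][0] / sums[k][1] for k in keys]

# Five seconds of 20 kHz data (timestamps in microseconds, one sample
# every 50 us), downsampled to one point per second for a dashboard.
raw_t = [i * 50 for i in range(100_000)]
raw_v = [i % 2 for i in range(100_000)]
sec_t, sec_v = downsample(raw_t, raw_v, 1_000_000)
print(len(sec_t))   # 5 points instead of 100,000
```

The dashboard plots 5 points; the 100,000 raw points stay untouched in storage for later analysis.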
Quasar, for example, supports downsampling and interpolation to adjust data on the fly for reporting or analytical purposes.
The “ASOF JOIN” operation, short for “As-Of Join,” is a powerful feature provided by some time-series database management systems (TSDBMS).
ASOF JOIN enables users to match and align two time-series datasets based on the closest preceding timestamp. This operation is beneficial when dealing with multiple time-series data streams that may not be perfectly synchronized or have irregular update frequencies. By performing an ASOF JOIN, users can associate or combine data points from different streams by aligning them with the closest preceding timestamp.
This allows for accurate temporal analysis, correlation, and event synchronization across multiple time-series datasets, providing a comprehensive view of the data and enabling deeper insights into the relationships between disparate data sources.
ASOF JOIN is a valuable tool in time-series analysis, facilitating the integration of heterogeneous data streams and enhancing the completeness and accuracy of analytical results.
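The semantics of ASOF JOIN are easiest to grasp from a small sketch. This Python toy (with made-up trade and quote data) implements the “closest preceding timestamp” matching rule described above – not any particular engine’s join algorithm:

```python
import bisect

def asof_join(left, right):
    """For each (ts, value) in `left`, attach the value from `right`
    with the closest timestamp at or before ts (None if none exists).

    `right` must be sorted by timestamp, so a binary search finds
    each match in O(log n).
    """
    r_ts = [t for t, _ in right]
    out = []
    for t, v in left:
        i = bisect.bisect_right(r_ts, t) - 1
        out.append((t, v, right[i][1] if i >= 0 else None))
    return out

trades = [(1, "buy"), (5, "sell"), (9, "buy")]
quotes = [(0, 100.0), (4, 101.0), (8, 99.5)]   # irregular updates
print(asof_join(trades, quotes))
# [(1, 'buy', 100.0), (5, 'sell', 101.0), (9, 'buy', 99.5)]
```

Each trade is paired with the last quote known at that instant, even though the two streams tick at different, irregular rates – exactly the synchronization problem ASOF JOIN solves.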
In Quasar, we turned up the joins to 11 by allowing you to join in multiple directions and interpolate missing data.
Multiple versions of the same data
Let’s take the example of weather prediction. Weather forecasts are updated multiple times daily, but you might want to keep all the versions – for instance, if you feed forecasts into your pricing algorithms.
Keeping multiple versions of the same data is cumbersome in a traditional relational database. A time-series database, however, can offer an elegant solution, as it is not strictly bound to the SQL standard.
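Conceptually, versioned data means every point carries two timestamps: when the value is valid for, and when it was published. Here is a hypothetical Python sketch of that model (the class and method names are our own, purely for illustration):

```python
import bisect

class VersionedSeries:
    """Toy bitemporal store: each observation timestamp keeps every
    published version, so you can query the series "as known at" any
    publication cutoff."""

    def __init__(self):
        self._versions = {}   # valid_ts -> sorted [(published_ts, value), ...]

    def publish(self, valid_ts, published_ts, value):
        lst = self._versions.setdefault(valid_ts, [])
        keys = [p for p, _ in lst]
        lst.insert(bisect.bisect_right(keys, published_ts), (published_ts, value))

    def as_of(self, valid_ts, published_cutoff):
        """Latest value for valid_ts published at or before the cutoff."""
        versions = self._versions.get(valid_ts, [])
        keys = [p for p, _ in versions]
        i = bisect.bisect_right(keys, published_cutoff) - 1
        return versions[i][1] if i >= 0 else None

# Forecast for noon, revised once during the morning.
s = VersionedSeries()
s.publish(valid_ts=1200, published_ts=600, value=18.0)   # early forecast
s.publish(valid_ts=1200, published_ts=900, value=21.0)   # revision
print(s.as_of(1200, 700))    # 18.0 -- what we believed at 07:00
print(s.as_of(1200, 1000))   # 21.0 -- the latest revision
```

Being able to replay “what did we believe at the time?” is exactly what backtesting a pricing algorithm against historical forecasts requires.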
In Quasar, we solved the problem by adding a RESTRICT TO feature that combines storage and query engine optimization to allow you to fetch a specific data version.
TSDBMS can offer massive benefits both in terms of costs and performance.
Even if you’re pleased with your current data infrastructure, we hope this article helped inform you about the tools you might need in the future.