Time series in quasardb
In the beginning, there was a question. Why are we waiting for machines to give us an answer? Why can't we analyze everything in real time?
What kind of technology would be required to make it possible?
Imagine that you could analyze and process all the data in the world, in an instant. We would make better large-scale decisions. Scientific studies could run on all the data, not on samples. And we would have the answer to one of the deepest mysteries of the world: what are cats actually doing during the night?
In the last decade, there has been a clear tendency to pile up layers of technology to solve problems, without worrying about efficiency. “The hardware will take care of it!”
We forgot that computers can do billions of computations per second. Users have come to accept waiting hours (if not days!) for what should take seconds.
The focus on genericity and interoperability made us lose track of what matters: efficiency, simplicity, speed.
At quasardb we want to bring efficiency back.
We want to make it possible to analyze the world in real time.
How to make it possible to analyze the world in real time?
When thinking about the next step toward analyzing the world in real time, we quickly reached the conclusion that building another generic framework for distributed computing would lead us to the same dead end as everyone else.
With today's raw computing power, any technology, even flat files, delivers almost instant answers below a certain amount of data.
The problems start when the volume of data and the complexity of the queries increase.
At large volumes, maintaining the schema takes a lot of human and technical resources, and sometimes defining a schema is not feasible at all.
Hence the success of NoSQL solutions and Big Data platforms such as Hadoop, which promise to solve these problems with generic frameworks such as map/reduce.
Putting aside the inherent inefficiency of some of the implementations, these frameworks failed to deliver the productivity and efficiency of SQL, a model that has had more than 40 years to settle in the minds of software engineers.
We thought that making yet another NoSQL database or a NewSQL database would not help us accomplish our mission.
How do you eat a whale? One bite at a time.
When speaking with our customers – mainly in finance and transportation – we noticed that modeling their problems as time series would enable them to find a solution quickly and efficiently.
We found our next bite: time series.
Time series: unsolved problems
The simplest time series database is the CSV file, and it is a perfectly valid option for certain use cases. A relational or a document-oriented database can also "do the job" when the complexity or the frequency of the requests is low.
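To make the CSV option concrete, here is a minimal Python sketch of a range query over a CSV time series. The column names (`timestamp`, `value`), the sample data and the `query_range` and `average` helpers are assumptions made for illustration; none of them come from quasardb.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample data; a real file would be opened with open(path, newline="").
SAMPLE_CSV = """timestamp,value
2017-01-01T00:00:00,10.0
2017-01-01T00:00:01,12.5
2017-01-01T00:00:02,11.0
2017-01-01T00:00:03,13.5
"""

def query_range(csv_file, start, end):
    """Return (timestamp, value) pairs with start <= timestamp < end.

    This is a full linear scan: every row is parsed on every query.
    """
    rows = []
    for record in csv.DictReader(csv_file):
        ts = datetime.fromisoformat(record["timestamp"])
        if start <= ts < end:
            rows.append((ts, float(record["value"])))
    return rows

def average(rows):
    """Arithmetic mean of the values in a list of (timestamp, value) pairs."""
    return sum(value for _, value in rows) / len(rows)

selected = query_range(
    io.StringIO(SAMPLE_CSV),
    datetime(2017, 1, 1, 0, 0, 1),
    datetime(2017, 1, 1, 0, 0, 3),
)
print(len(selected), average(selected))
```

Each query reads and parses the whole file. That is harmless for a few thousand rows, and it is the cost that grows with both the volume of data and the frequency of requests.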
But…
If you're an analyst working on your model, having to wait several hours (if not more) for each result severely hurts your productivity.
If you are building a platform for predictive maintenance on airliners, you need both speed and mass.
If you are backtesting your financial model, the faster you access the data, the more time you can spend on risk computations and model validations.
A time series database?
Databases optimized for time series are nothing new; they are generally column-oriented databases with a time-series-friendly API.
There are solutions that enable real-time processing of time series, provided the data fits in memory.
There are solutions optimized for time series in the context of monitoring using sliding windows.
There are solutions that enable distributed processing on large amounts of data, provided you don’t need interactivity.
But there is no solution for real-time analytics on large time series. No solution that combines mass and speed.
Quasardb time series design goals
Time series are meant to be used in demanding environments where millions of events per second must be recorded and indexed with very high precision, and kept for several years.
At quasardb, when we decided to implement time series, we did it with the following goals:
- Limitless: It must be possible to record thousands of events every microsecond, for one hundred years, without ever removing a single record.
- Reliable: Writes must be durable and reads consistent. Period.
- Flexible: Although quasardb has server-side aggregations and computations, the user may manipulate the data with her own tools such as Python or R. Extracting a subset of a time series must be simple, fast, and efficient. When inserting data, the user must not have to define a complex schema, and must be able to change her mind afterwards.
- Interactive: Transfers, computations and aggregations must be so fast that analytics tools can access quasardb directly, regardless of the amount of data stored, so that the analyst can work interactively.
- Transparent: When a user wants to average the values of a column, it should not be her concern whether the time series resides on a single node or is distributed over a big cluster, or whether 10,000 users are doing the same thing at the same time. The database must solve all the distribution, concurrency and reduction problems and present a naive interface to the user.
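The reduction behind the "Transparent" goal can be illustrated with a small Python sketch. Assume each node holds a shard of the column: every node reduces its shard to a partial `(sum, count)` pair, and the caller only ever sees the final average. The shard layout and the `partial_aggregate`, `merge` and `distributed_average` names are hypothetical and do not describe quasardb's actual implementation.

```python
from functools import reduce

def partial_aggregate(values):
    """Run on each node: reduce the local shard to a (sum, count) pair."""
    return (sum(values), len(values))

def merge(left, right):
    """Combine two partial results; associative, so merge order does not matter."""
    return (left[0] + right[0], left[1] + right[1])

def distributed_average(shards):
    """Average a column split across shards without moving the raw rows."""
    partials = [partial_aggregate(shard) for shard in shards]
    total, count = reduce(merge, partials, (0.0, 0))
    if count == 0:
        raise ValueError("cannot average an empty column")
    return total / count

# Hypothetical cluster of three nodes, each holding part of one column.
shards = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
print(distributed_average(shards))
```

Shipping `(sum, count)` pairs rather than per-node averages matters: the shards above have different sizes, so averaging the three local averages would give the wrong answer, while merging the pairs gives the same result as a single node holding all six values.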
Time series are available starting with quasardb 2.1.0.
Some key facts about quasardb
Scalability
Using our own variant of the Chord algorithm, we managed to build a database that scales horizontally. Time series within quasardb are transparently distributed over the nodes of the cluster.
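For readers unfamiliar with Chord, the core idea is a hash ring: nodes and keys are hashed onto the same circular identifier space, and each key belongs to the first node at or after its position (its successor). The Python sketch below shows only that lookup rule. It is a generic illustration, not quasardb's variant, and it omits the finger tables that let Chord route a lookup in a logarithmic number of hops. The node names, the series names, the `HashRing` class and the 32-bit identifier space are all assumptions.

```python
import bisect
import hashlib

RING_BITS = 32  # assumed identifier size, kept small for readability

def ring_id(name):
    """Hash a node name or key onto the ring [0, 2**RING_BITS)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** RING_BITS)

class HashRing:
    def __init__(self, nodes):
        if not nodes:
            raise ValueError("a ring needs at least one node")
        # Sorted (position, node) pairs; assumes no two nodes share a position.
        self._ring = sorted((ring_id(node), node) for node in nodes)
        self._positions = [position for position, _ in self._ring]

    def successor(self, key):
        """Return the node responsible for key: first node at or after its id."""
        index = bisect.bisect_left(self._positions, ring_id(key))
        # Wrap around to the first node when the key hashes past the last one.
        return self._ring[index % len(self._ring)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
for series in ["stocks.apple", "sensors.engine-1", "sensors.engine-2"]:
    print(series, "->", ring.successor(series))
```

The useful property of this placement rule is that adding a node only takes over keys from its immediate neighbor on the ring; every other key stays where it was, which is what makes horizontal scaling practical.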
Last but not least, quasardb supports concurrent reads and writes through fine-grained locks.
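As a rough illustration of what fine-grained locking means, the Python sketch below gives every time series its own lock, so writers to different series do not contend with each other, whereas a single global lock would serialize all of them. The `SeriesStore` class and its methods are hypothetical; this is not a description of quasardb's actual locking scheme.

```python
import threading

class SeriesStore:
    def __init__(self):
        self._entries = {}  # series name -> (lock, list of points)
        # Short-lived lock that only protects the table of entries itself.
        self._table_lock = threading.Lock()

    def _entry(self, name):
        """Return the (lock, points) pair for a series, creating it if needed."""
        with self._table_lock:
            if name not in self._entries:
                self._entries[name] = (threading.Lock(), [])
            return self._entries[name]

    def append(self, name, timestamp, value):
        lock, points = self._entry(name)
        # Only writers and readers of the same series contend on this lock.
        with lock:
            points.append((timestamp, value))

    def snapshot(self, name):
        lock, points = self._entry(name)
        with lock:
            return list(points)

store = SeriesStore()

def writer(name, count):
    for i in range(count):
        store.append(name, i, float(i))

# Two writers per series, running concurrently.
threads = [threading.Thread(target=writer, args=(name, 1000))
           for name in ["a", "a", "b", "b"]]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(len(store.snapshot("a")), len(store.snapshot("b")))
```

The table lock is held only long enough to find or create an entry, so the time spent appending or copying points is never serialized across different series.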
Performance
Thanks to our low-level C++14 implementation, a single mid-range server can deliver south of one million requests per second. Column aggregations run north of three billion rows per second!
Flexibility
Add time series as you like and tag them for reverse lookup. Aggregate at will. No need to prepare in advance: the database will adapt!
Blobs in columns accepted.
See for yourself!
Quasardb 2.1.0 is already available for beta testing. Tell us if we are on the right track!