Tuesday, October 11, 2022
“Just insert the data!” has a very different meaning depending on your field, and it is sometimes frustrating to get the proper behavior from your Database Management System (DBMS).
For example, in investment banking, you may want to keep all the versions and have (partial) duplicates, whereas, in manufacturing, duplicates may be undesired.
Some DBMSs have only one possible behavior; others are more customizable.
This article gives an objective, technical description of the different insertion behaviors, giving you the power to pick the one you need.
Quasar supports a wide range of data insertion modes. We’ll use that as a template to study the differences. Whether or not you’re a Quasar user, we hope you find this helpful.
When inserting large volumes of data, INSERT statements are usually not the way to go.
For instance, high-performance insertion in Quasar should be done with the batch inserter, a low-overhead API dedicated to fast insertion that supports all the settings we will describe below.
Study your system’s documentation to ensure you use the most efficient API.
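To make that concrete, here is a minimal, purely conceptual Python sketch of the batching pattern a bulk API relies on: rows are buffered client-side and pushed in large chunks rather than one round trip per row. The `push_batch` function is a hypothetical placeholder, not a Quasar call.

```python
BATCH_SIZE = 10_000

def push_batch(rows):
    # Hypothetical placeholder: a real implementation would hand `rows` to the
    # DBMS bulk-insertion API (e.g. a batch inserter) in a single call.
    print(f"pushed {len(rows)} rows")

buffer = []

def insert(row):
    # Accumulate rows in memory; only pay the insertion overhead once per batch.
    buffer.append(row)
    if len(buffer) >= BATCH_SIZE:
        push_batch(buffer)
        buffer.clear()

def flush():
    # Push whatever is left at the end of the stream.
    if buffer:
        push_batch(buffer)
        buffer.clear()

# Stream 25,000 rows: two full batches are pushed, then the remainder on flush.
for i in range(25_000):
    insert((i, i * 0.5))
flush()
```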
The first question is: how long does it take for my data to become visible, and what guarantees do I get regarding integrity?
In Quasar, we have three modes: transactional, non-transactional immediate, and non-transactional asynchronous.
We made non-transactional, immediate insertion the default, as most insertion patterns don’t benefit enough from transactions to justify their extra cost.
Asynchronous is excellent when you insert small chunks of data and want Quasar to consolidate multiple minor updates into a larger one. Delayed writes improve I/O efficiency at the cost of delayed visibility.
Study your system’s documentation to understand the guarantees offered and the visibility delay.
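As a rough illustration of the asynchronous behavior (a conceptual sketch, not Quasar’s implementation), the following Python snippet queues small writes and consolidates them into one larger write on a timer, so rows only become visible after the flush:

```python
import queue
import threading
import time

FLUSH_INTERVAL = 1.0      # seconds before queued rows become visible
pending = queue.Queue()   # small writes land here immediately
table = []                # stand-in for durable, queryable storage

def writer():
    while True:
        time.sleep(FLUSH_INTERVAL)
        batch = []
        while not pending.empty():
            batch.append(pending.get())
        if batch:
            table.extend(batch)  # one consolidated write instead of many small ones

threading.Thread(target=writer, daemon=True).start()

pending.put((1, 10.0))    # returns immediately, but the row is not yet visible
pending.put((2, 11.5))
print(table)              # [] -- nothing visible yet
time.sleep(1.5)
print(table)              # [(1, 10.0), (2, 11.5)] -- visible after the flush
```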
In additive mode, the data is inserted “as is” and unprocessed. This can create duplicates, as existing rows are never deleted or modified.
Additive is the default mode of Quasar, as it matches the expected bulk-loading behavior in most cases, for example when you insert a CSV file into a table or stream data from a source.
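Conceptually (this is an illustration of the semantics, not the Quasar API), additive insertion behaves like this, with timestamps reduced to plain integers:

```python
table = [(1, 10.0), (2, 11.5)]            # (timestamp, value) rows already stored

def insert_additive(rows):
    table.extend(rows)                     # nothing is compared, deleted, or modified

insert_additive([(2, 11.5), (3, 12.0)])    # the t=2 row is a duplicate
print(table)  # [(1, 10.0), (2, 11.5), (2, 11.5), (3, 12.0)]
```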
Duplicates in timeseries are annoying but a fact of life. They happen because of human error (e.g. inserting the same file twice) or because of replays in your streaming chain (which are very hard to eliminate).
Duplicates are best handled at the database level.
You could run a DELETE query to remove them (Quasar has an optimized DELETE for that), or you could deduplicate them at query time using our query dedupe functionality.
However, if you’re willing to pay a bit of CPU time, you can deduplicate at insertion time.
The main knob when configuring insert-time deduplication is which columns two rows must match on before they are considered duplicates.
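To make the semantics concrete, here is a conceptual Python sketch (not the Quasar API) that assumes the timestamp alone is the matching column: an incoming row whose timestamp already exists is simply dropped.

```python
table = [(1, 10.0), (2, 11.5)]             # (timestamp, value) rows already stored

def insert_deduplicated(rows, key=lambda row: row[0]):
    # Skip incoming rows whose key (here: the timestamp) already exists.
    existing = {key(r) for r in table}
    for row in rows:
        if key(row) in existing:
            continue                        # duplicate: dropped instead of inserted
        table.append(row)
        existing.add(key(row))

insert_deduplicated([(2, 99.9), (3, 12.0)])
print(table)  # [(1, 10.0), (2, 11.5), (3, 12.0)] -- the incoming t=2 row was dropped
```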
Truncated insertion means “overwrite all existing data; this is the new data”: it deletes all the existing data in the provided range and replaces it with the provided rows.
Truncation is a powerful way to safely and efficiently replace chunks of data that you think may be incorrect.
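Conceptually (again a sketch of the semantics rather than the Quasar API), a truncated insert deletes the range first and then inserts:

```python
table = [(1, 10.0), (2, 11.5), (3, 12.0), (7, 15.0)]

def insert_truncated(rows, start, end):
    # Delete everything in [start, end], then insert the provided rows.
    global table
    table = [r for r in table if not (start <= r[0] <= end)]
    table.extend(rows)
    table.sort(key=lambda r: r[0])

insert_truncated([(2, 20.0), (3, 21.0)], start=1, end=3)
print(table)  # [(2, 20.0), (3, 21.0), (7, 15.0)] -- the range [1, 3] was replaced
```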
Looking at all the modes above, you may have noticed that one configuration is missing.
What if you want matching rows to be updated and new data inserted as is?
That’s upsert semantics: upsert updates existing rows with the newly provided data, based on customizable matching parameters, and inserts non-matching rows “as is”.
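A minimal conceptual sketch of upsert (not the Quasar API), using the timestamp as the only matching column:

```python
table = {1: 10.0, 2: 11.5}                 # timestamp -> value

def upsert(rows):
    for timestamp, value in rows:
        table[timestamp] = value            # update if the key exists, insert otherwise

upsert([(2, 99.9), (3, 12.0)])
print(sorted(table.items()))  # [(1, 10.0), (2, 99.9), (3, 12.0)]
```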
This quick review of insertion modes should give you a better appreciation for the specifics of your business and perhaps help you leverage your system’s features better.
Do you have an insertion need that has not been covered above? We would love to hear about it!
And why not give Quasar a try? The community edition is free of charge.