The many ways to insert data and why they matter


Business semantics

“Just insert the data!” means something very different depending on your field, and it is sometimes frustrating to get the proper behavior from your Database Management System (DBMS).

For example, in investment banking, you may want to keep all the versions and have (partial) duplicates, whereas, in manufacturing, duplicates may be undesired.

Some DBMSs have only one possible behavior; others are more customizable.

This article gives an objective, technical description of how data insertion behaves, giving you the power to pick and choose the one you need.

Quasar supports a wide range of data insertion modes. We’ll use that as a template to study the differences. Whether or not you’re a Quasar user, we hope you find this helpful.

Use the right API

When inserting large volumes of data, INSERT statements are usually not the way to go.

For instance, high-performance insertion in Quasar must be done with the batch inserter, a low-overhead API dedicated to fast insertion that supports all the settings we will describe further.

Study your system’s documentation to ensure you use the most efficient API.
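To make the difference concrete, here is a minimal illustration of the principle using Python’s built-in sqlite3 module rather than Quasar’s batch inserter: handing the driver one batch beats issuing one statement per row.

```python
# Illustration only: sqlite3 stands in for any DBMS driver here, but the
# principle (push your writes through the bulk API) carries over.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (ts TEXT, value REAL)")
rows = [("2024-01-01T00:00:00", 1.0), ("2024-01-01T00:00:01", 2.0)]

# Naive: one statement per row, paying the per-call overhead every time.
for row in rows:
    conn.execute("INSERT INTO readings VALUES (?, ?)", row)

# Better: a single bulk call lets the driver amortize that overhead.
conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
conn.commit()
```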

Data visibility and guarantees

The first question is: how long does it take for my data to become visible, and what guarantees do I have regarding integrity?

In Quasar, we have three modes:

  1. Transactional and immediate: insertions are treated as all or nothing, and data is visible as soon as the function returns. To be clear: if one write fails, all writes are rolled back.
  2. Non-transactional and immediate (default): data is inserted in a best-effort fashion; successful writes are kept and immediately visible, even if other writes fail.
  3. Non-transactional and asynchronous (delayed visibility): data is buffered by the server and committed at a later stage.

We made non-transactional and immediate the default as most insertion patterns don’t benefit from the extra cost of transactions.

Asynchronous is excellent when you insert small chunks of data and want Quasar to consolidate multiple minor updates into a larger one. Delayed writes improve I/O efficiency at the cost of delayed visibility.
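Here is a minimal sketch of that buffering idea. The class, threshold, and method names are hypothetical, not Quasar’s implementation; the point is the shape of the trade-off.

```python
# Sketch of delayed-visibility insertion: writes accumulate in a buffer and
# only become visible once the buffer is flushed as one large commit.
class AsyncBuffer:
    def __init__(self, visible_store, flush_threshold=1000):
        self.visible_store = visible_store  # what readers see
        self.pending = []                   # buffered, not yet visible
        self.flush_threshold = flush_threshold

    def insert(self, row):
        self.pending.append(row)
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Many minor updates consolidated into one I/O-efficient write.
        self.visible_store.extend(self.pending)
        self.pending.clear()
```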

Study your system’s documentation to understand the guarantees offered and the visibility delay.

Additive

In this mode, the data is inserted “as is” and unprocessed. This can create duplicates, as existing rows are never deleted or modified.

Additive is the default mode of Quasar, as it matches the expected bulk-loading behavior in most cases, for example, when you insert a CSV file into a table or stream data from a source.
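As a toy model (plain Python lists standing in for a table, not Quasar’s storage), additive insertion is a straight append, so loading the same batch twice duplicates it:

```python
table = []
batch = [("2024-01-01T00:00:00", 42.0)]

table.extend(batch)      # first load
table.extend(batch)      # accidental second load
assert len(table) == 2   # the duplicate row is kept
```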

Summary

  • Source data isn’t analyzed or compared with the destination.
  • If identical data exists at the destination, duplicate entries are created.
  • Existing data is left untouched.

Deduplicated

Duplicates in timeseries are annoying but a fact of life. They happen because of human error (e.g. inserting the same file twice) or because of replays in your streaming chain (which are very hard to eliminate).

Duplicates are best handled at the database level.

You could run a DELETE query to remove them (Quasar has an optimized DELETE for that), or you could deduplicate them at query time using our query dedupe functionality.

However, if you’re willing to pay a bit of CPU time at insertion, you can deduplicate at insertion.

Here is how you can configure deduplication (a toy sketch of these three policies follows the list):

  • Refuse any row that has a timestamp identical to an existing one. That is, only one row per timestamp is allowed.
  • Accept multiple rows per timestamp, but refuse complete matches, that is, two rows with all identical columns.
  • Accept multiple rows per timestamp, but reject partial matches. This is great if you’re interested in keeping multiple versions of the same data (our RESTRICT TO functionality helps to work on a specific version).
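Sticking with the toy-list model above (the policy names and key columns are illustrative, not Quasar’s API), the three policies look like this:

```python
def insert_deduplicated(table, batch, policy, key_columns=(0,)):
    """policy is 'timestamp', 'full', or 'partial' (match on key_columns).
    Column 0 is assumed to hold the timestamp."""
    for row in batch:
        if policy == "timestamp":
            clash = any(r[0] == row[0] for r in table)   # one row per timestamp
        elif policy == "full":
            clash = row in table                         # all columns identical
        else:
            clash = any(all(r[i] == row[i] for i in key_columns)
                        for r in table)                  # chosen columns identical
        if not clash:
            table.append(row)  # matching source rows are silently dropped
```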

Summary

  • Source data is compared with the destination.
  • If identical data exists at the destination, source data is not written.
  • What constitutes “identical data” can be customized. Full or partial matches are allowed.
  • Existing data is left untouched.

Truncated

Truncated insertion means “overwrite all existing data; this is the new data”: it deletes all the existing data in the provided range and replaces it with the provided data.

Truncation is a powerful way to safely and efficiently replace chunks of data that you think may be incorrect.
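In the same toy model, truncated insertion is a range delete followed by a plain append:

```python
def insert_truncated(table, batch, range_start, range_end):
    # Unconditionally drop existing rows in [range_start, range_end) ...
    table[:] = [r for r in table if not (range_start <= r[0] < range_end)]
    # ... then write the replacement data as-is.
    table.extend(batch)
```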

Summary

  • Source data isn’t analyzed or compared with the destination.
  • Destination data in the specified range is unconditionally deleted and replaced with the source data.
  • Existing data is deleted.

Upsert

Looking at all the modes above, you may have noticed that one configuration is missing.

What if you want matching rows to be updated and new data inserted as is?

That’s upsert semantics. Upsert will update existing data with the new provided data, based on customizable matching parameters, and insert new data “as is”.
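And in the toy model, upsert replaces rows that match on the key columns and appends the rest (again, the matching parameters here are illustrative):

```python
def upsert(table, batch, key_columns=(0,)):
    def key(row):
        return tuple(row[i] for i in key_columns)
    index = {key(r): i for i, r in enumerate(table)}
    for row in batch:
        k = key(row)
        if k in index:
            table[index[k]] = row    # matching row: update in place
        else:
            index[k] = len(table)    # new row: insert as-is
            table.append(row)
```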

Summary

  • Source data is compared with the destination.
  • If identical data exists at the destination, destination data is replaced with the source data.
  • What constitutes “identical data” can be customized. Full or partial matches are allowed.
  • Existing data may be updated with source data.

Conclusion

This quick review of insertion modes should give you a better appreciation for the specifics of your business and perhaps help you leverage your system’s features better.

Do you have an insertion need that has not been covered above? We would love to hear about it!

And why not give Quasar a try? The community edition is free of charge.

