Optimized QuasarDB, part 1: storage

Thursday August 19, 2021

Reaching your performance, durability and efficiency goals in a production setup can be a challenge: there are about a million ways to deploy complex systems, and QuasarDB is no exception. In this series of blog posts, we will take you through the various aspects of implementing a successful QuasarDB deployment in production.

In this post, we will kick off the series with a deep dive on the tradeoffs of choosing your storage layer.

Storage configuration

QuasarDB instances typically need to handle a lot of I/O throughput, and store many terabytes of data. As storage costs are typically a very important consideration, we need to be critical of the type of storage we use. AWS’ block storage service, EBS, provides us with the following volume types:

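Volume type        Category              Volume size      Max IOPS   Max throughput
gp3                General Purpose SSD   1 GiB – 16 TiB   16,000     1,000 MiB/s
gp2                General Purpose SSD   1 GiB – 16 TiB   16,000     250 MiB/s *
io2 Block Express  Provisioned IOPS SSD  4 GiB – 64 TiB   256,000    4,000 MiB/s
io2                Provisioned IOPS SSD  4 GiB – 16 TiB   64,000 †   1,000 MiB/s †
io1                Provisioned IOPS SSD  4 GiB – 16 TiB   64,000 †   1,000 MiB/s †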

(Source:  https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-volume-types.html)

From this information, we can conclude that apart from io2 Block Express, all EBS volume types have a throughput limit of either 1,000 MiB/s or 250 MiB/s, and a size limit of 16 TiB per volume.

Since the instances we would like to deploy typically need to handle much more data than this, we can follow AWS’ own recommendation and combine multiple volumes using Linux software RAID. This enables us to use General Purpose (gp2) storage, which is an order of magnitude cheaper than the provisioned (io1/io2) storage.
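As a rough sanity check, the effect of striping can be estimated with plain shell arithmetic. The sketch below assumes ideal linear scaling across identical volumes and uses a hypothetical helper name, raid0_estimate. Real-world throughput is also capped by the instance's own EBS bandwidth limit, which is not modelled here.

```shell
# Estimate the aggregate limits of N identical EBS volumes striped with RAID 0.
# $1 = number of volumes, $2 = size per volume in TiB, $3 = throughput per volume in MiB/s
# Assumption: perfect linear scaling; instance-level EBS limits are ignored.
raid0_estimate() {
    echo "capacity=$(( $1 * $2 ))TiB throughput=$(( $1 * $3 ))MiB/s"
}

# Eight 4 TiB gp2 volumes at 250 MiB/s each
raid0_estimate 8 4 250   # prints: capacity=32TiB throughput=2000MiB/s
```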

Setting up software RAID

Assuming an instance with 8 EBS volumes attached, the typical instance may look as follows:

# lsblk
NAME          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
nvme8n1       259:7    0    4T  0 disk
nvme7n1       259:0    0    4T  0 disk
nvme6n1       259:1    0    4T  0 disk
nvme5n1       259:3    0    4T  0 disk
nvme4n1       259:5    0    4T  0 disk
nvme3n1       259:6    0    4T  0 disk
nvme2n1       259:2    0    4T  0 disk
nvme1n1       259:4    0    4T  0 disk
nvme0n1       259:8    0   80G  0 disk
├─nvme0n1p1   259:9    0   80G  0 part /
└─nvme0n1p128 259:10   0    1M  0 part

This shows a small 80GB root volume mounted on /, and 8 EBS volumes of 4TB each at /dev/nvme1n1 through /dev/nvme8n1. We can combine these into a single RAID volume using mdadm as follows:

# mdadm --create --verbose /dev/md0 --level=0 --raid-devices=8 \
      /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 \
      /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1
mdadm: chunk size defaults to 512K
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

You can optionally change --level to 1, 5 or 6 if you would like to combine the volumes using RAID 1, RAID 5 or RAID 6 for additional durability, at the cost of usable capacity.
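The capacity side of that trade-off can be sketched as follows, assuming identical volumes and the standard definitions of each level (RAID 1 mirrors every device, RAID 5 and RAID 6 reserve one and two devices' worth of parity respectively). The helper raid_usable_tib is hypothetical and not part of mdadm.

```shell
# Usable capacity in TiB for a given RAID level.
# $1 = RAID level (0, 1, 5 or 6), $2 = number of devices, $3 = size per device in TiB
raid_usable_tib() {
    case "$1" in
        0) echo $(( $2 * $3 )) ;;          # striping: all capacity is usable
        1) echo "$3" ;;                    # mirroring: capacity of a single device
        5) echo $(( ($2 - 1) * $3 )) ;;    # one device's worth of parity
        6) echo $(( ($2 - 2) * $3 )) ;;    # two devices' worth of parity
        *) echo "unsupported level: $1" >&2; return 1 ;;
    esac
}

# Compare the levels for the 8 x 4 TiB layout used above
for level in 0 1 5 6; do
    echo "RAID $level: $(raid_usable_tib "$level" 8 4) TiB usable"
done
```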

Creating the filesystem

We can now create the filesystem on top of it. In the example below we use the XFS filesystem, but you can also use ext4. We recommend assigning an identifying label to the filesystem; we will use “qdb_db” below:

# mkfs.xfs -f -L qdb_db /dev/md0
# mount /dev/disk/by-label/qdb_db /var/lib/qdb

We now have our software-raid filesystem ready to be used by QuasarDB. You may wish to restart QuasarDB at this point, and add a relevant entry to /etc/fstab.
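A minimal sketch of what such an /etc/fstab entry could look like, mounting by the label assigned above. The helper name fstab_entry and the chosen mount options (defaults,nofail) are assumptions for illustration, so adjust them to your own environment.

```shell
# Print an /etc/fstab line that mounts an XFS filesystem by its label.
# $1 = filesystem label, $2 = mount point
# "nofail" lets the instance finish booting even if the array is unavailable.
fstab_entry() {
    printf 'LABEL=%s %s xfs defaults,nofail 0 0\n' "$1" "$2"
}

fstab_entry qdb_db /var/lib/qdb

# As root, append the entry and record the array so it is reassembled at boot:
#   fstab_entry qdb_db /var/lib/qdb >> /etc/fstab
#   mdadm --detail --scan >> /etc/mdadm.conf   # /etc/mdadm/mdadm.conf on Debian/Ubuntu
```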

Persistent cache & magnetic storage

In some situations with a lot of cold data, you may wish to optimize the storage costs even further by making use of a QuasarDB feature called “persistent cache”. The persistent cache allows you to put a fast SSD cache in front of slow, magnetic storage. As data is read from the magnetic storage, QuasarDB uses the SSD storage to cache these objects. As new data is always written to the magnetic storage, it is safe to use ephemeral / temporary storage for the persistent cache, as this storage area is not expected to be durable.

The persistent cache effectively allows you to have three different caching layers:

  • All “hot” data will be in-memory, typically between 128GB – 512GB per instance;
  • All “warm” data will be in the persistent cache, typically between 1TB – 8TB per instance;
  • All “cold” data will be in slow magnetic storage, up to hundreds of TB per instance.

Our own benchmarks show that this provides a good balance between cost and performance. The following are the results for storing & querying 2TB of data using various storage types:

Storage type                       Ingestion time  Query time
SSD                                11min           11min
Magnetic                           11min           45min
Magnetic + SSD (persistent_cache)  11min           15min

From these benchmarks, we can conclude:

  • At 11min for 2TB, the read/write speed of SSD is about 3GB/sec, which means we’re limited by SSD throughput only;
  • Ingestion time is entirely unaffected by the choice of magnetic vs SSD;
  • Magnetic + SSD persistent cache queries take 15min instead of 11min on pure SSD, a drop of roughly 27% in query throughput.
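These figures can be checked with simple arithmetic on the benchmark table. A minimal sketch, using awk for floating-point math, two hypothetical helper names (throughput_gbps and throughput_drop_pct), and assuming 1TB = 1000GB:

```shell
# Average throughput in GB/s when processing $1 TB in $2 minutes (1TB = 1000GB assumed).
throughput_gbps() {
    awk -v tb="$1" -v min="$2" 'BEGIN { printf "%.2f\n", tb * 1000 / (min * 60) }'
}

# Drop in throughput (in %) when the same job takes $2 minutes instead of $1 minutes.
throughput_drop_pct() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.0f\n", (1 - base / t) * 100 }'
}

throughput_gbps 2 11        # pure SSD: prints 3.03
throughput_gbps 2 15        # magnetic + SSD cache: prints 2.22
throughput_drop_pct 11 15   # prints 27
```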

Provided that magnetic storage is about 50% cheaper than General Purpose SSD storage, we can conclude that moving to this type of setup results in a ~50% cost reduction for a roughly 27% drop in query performance (15min versus 11min).
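Note that the SSD cache itself only comes for free when it lives on storage you already pay for, such as ephemeral instance storage. A back-of-the-envelope sketch with entirely hypothetical inputs (100TB of data, an 8TB SSD cache billed at the SSD rate, magnetic storage at 50% of the SSD price) and a hypothetical helper name, tiered_saving_pct:

```shell
# Storage saving (in %) of magnetic + SSD cache versus keeping all data on SSD.
# $1 = data size in TB, $2 = separately billed SSD cache size in TB,
# $3 = magnetic price as a percentage of the SSD price (hypothetical input)
tiered_saving_pct() {
    all_ssd=$(( $1 * 100 ))            # relative cost with everything on SSD
    tiered=$(( $1 * $3 + $2 * 100 ))   # relative cost of magnetic data plus SSD cache
    echo $(( (all_ssd - tiered) * 100 / all_ssd ))
}

tiered_saving_pct 100 8 50   # prints 42
tiered_saving_pct 100 0 50   # prints 50 (cache on storage that is already paid for)
```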

To set this up, we would set up a secondary raid volume with filesystem as follows:

# mdadm --create --verbose /dev/md1 --level=0 --raid-devices=8 \
      /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1 /dev/nvme12n1 \
      /dev/nvme13n1 /dev/nvme14n1 /dev/nvme15n1 /dev/nvme16n1
mdadm: chunk size defaults to 512K
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
# mkfs.xfs -f -L qdb_cache /dev/md1
# mount /dev/disk/by-label/qdb_cache /var/lib/qdb_cache

Last but not least, we must tell QuasarDB that it can use the persistent cache available. In your qdbd.conf, edit “local.depot.rocksdb” and add a persistent_cache_path such as this:

"data_cache": 134217728,

"persistent_cache_nvme_optimization": true,
"persistent_cache_path": "/var/lib/qdb_cache",
"persistent_cache_size": 0,
"root": "/var/lib/qdb",

"threads": 8

Set the “persistent_cache_nvme_optimization” variable to true if you’re using NVMe SSD, otherwise leave it at the default value of false. A value of 0 for persistent_cache_size means the cache is allowed to use the entire disk.
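If you would rather cap the cache than let it fill the disk, a small helper can produce the value to put in the configuration. This sketch assumes persistent_cache_size is expressed in bytes, as the data_cache value of 134217728 (128 MiB) suggests; that unit is not confirmed here, and the helper names are hypothetical.

```shell
# Convert MiB or GiB to bytes for use in qdbd.conf size settings.
# Assumption: size settings are plain byte counts.
mib_to_bytes() { echo $(( $1 * 1024 * 1024 )); }
gib_to_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }

mib_to_bytes 128    # prints 134217728, the data_cache value shown above
gib_to_bytes 1024   # prints 1099511627776, a 1 TiB cap for the cache
```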

Conclusion

Controlling storage costs is paramount to the success of your database deployment. QuasarDB works well with default Linux software raid, and provides a persistent caching feature which enables you to put a “warm” cache in front of cheap, magnetic storage. These features provide you with the building blocks to tune the storage costs vs performance to your needs.
