Optimized QuasarDB, part 1: storage
Reaching your performance, durability and efficiency goals in a production setup can be a challenge: there are about a million ways to deploy complex systems, and QuasarDB is no exception. In this series of blog posts, we will take you through the various aspects of implementing a successful QuasarDB deployment in production.
In this post, we will kick off the series with a deep dive on the tradeoffs of choosing your storage layer.
QuasarDB instances typically need to handle a lot of I/O throughput and store many terabytes of data. As storage costs are typically a very important consideration, we need to be critical of the type of storage we use. AWS’ block storage service, EBS, provides us with the following storage types:
| | General Purpose SSD (gp2/gp3) | Provisioned IOPS SSD (io2 Block Express) | Provisioned IOPS SSD (io1/io2) |
|---|---|---|---|
| Volume size | 1 GiB – 16 TiB | 4 GiB – 64 TiB | 4 GiB – 16 TiB |
| Max IOPS | 16,000 | 256,000 | 64,000 † |
| Max throughput | 1,000 MiB/s (gp3) / 250 MiB/s (gp2) * | 4,000 MiB/s | 1,000 MiB/s † |
From this information, we can conclude that apart from io2 Block Express, all EBS volume types have a throughput limitation of either 1,000 MiB/s or 250 MiB/s, and a size limitation of 16 TiB per volume.
Since the instances we would like to deploy are typically able to handle much more data than this, per AWS’ own recommendations, we can combine multiple volumes together using Linux software RAID. This will enable us to use General Purpose (gp2) storage, which is an order of magnitude cheaper than the provisioned (io1/io2) storage.
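As a back-of-the-envelope sketch of why striping helps: with eight gp2 volumes, each capped at 250 MiB/s per the table above, a RAID-0 stripe gives an aggregate throughput ceiling of roughly:

```shell
# Hypothetical aggregate throughput of 8 striped gp2 volumes at 250 MiB/s each
echo "$(( 8 * 250 )) MiB/s"   # prints "2000 MiB/s"
```

In practice the instance’s own EBS bandwidth limit may cap this lower, so check the instance type’s EBS-optimized throughput as well.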
Setting up software RAID
Assuming an instance with 8 EBS volumes attached, the typical instance may look as follows:
NAME          MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme8n1       259:7    0   4T  0 disk
nvme7n1       259:0    0   4T  0 disk
nvme6n1       259:1    0   4T  0 disk
nvme5n1       259:3    0   4T  0 disk
nvme4n1       259:5    0   4T  0 disk
nvme3n1       259:6    0   4T  0 disk
nvme2n1       259:2    0   4T  0 disk
nvme1n1       259:4    0   4T  0 disk
nvme0n1       259:8    0  80G  0 disk
├─nvme0n1p1   259:9    0  80G  0 part /
└─nvme0n1p128 259:10   0   1M  0 part
A small 80GB root volume on /, and 8 EBS volumes under /dev/nvme1n1 – /dev/nvme8n1. We can then combine these into a single RAID volume using mdraid as follows:
[root@quasardb:~]# mdadm --create --verbose /dev/md0 --level=0 --raid-devices=8 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1
mdadm: chunk size defaults to 512K
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
[root@quasardb:~]#
You can optionally change --level to 1, 5 or 6 if you would like to combine the volumes using RAID-1 or RAID-5/6 for additional durability.
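Optionally, you can verify the array and persist its configuration across reboots. This is a sketch; the location of mdadm.conf varies by distribution (e.g. /etc/mdadm/mdadm.conf on Debian-based systems):

```
[root@quasardb:~]# cat /proc/mdstat
[root@quasardb:~]# mdadm --detail --scan >> /etc/mdadm.conf
```

Persisting the array configuration ensures the device comes back as /dev/md0 after a reboot rather than a fallback name such as /dev/md127.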
Creating the filesystem
We can now create the filesystem on top of it. In the example below we use the XFS filesystem, but you can also use ext4. We recommend assigning an identifying label to the filesystem; we will use “qdb_db” below:
[root@quasardb:~]# mkfs.xfs -f -L qdb_db /dev/md0
[root@quasardb:~]# mount /dev/disk/by-label/qdb_db /var/lib/qdb
[root@quasardb:~]#
We now have our software-raid filesystem ready to be used by QuasarDB. You may wish to restart QuasarDB at this point, and add a relevant entry to /etc/fstab.
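Such an fstab entry could look like the following. This is a sketch: the noatime and nofail options are suggestions rather than requirements, but nofail is useful so that a degraded array does not block the boot process:

```
LABEL=qdb_db  /var/lib/qdb  xfs  defaults,noatime,nofail  0 0
```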
Persistent cache & magnetic storage
In situations with a lot of cold data, you may wish to optimize storage costs even further by making use of a QuasarDB feature called “persistent cache”. The persistent cache allows you to put a fast SSD cache in front of slow, magnetic storage: as data is read from the magnetic storage, QuasarDB caches these objects on the SSD. Because new data is always written to the magnetic storage, it is safe to use ephemeral/temporary storage for the persistent cache; this storage area is not expected to be durable.
The persistent cache effectively allows you to have three different caching layers:
- All “hot” data will be in-memory, typically between 128GB – 512GB per instance;
- All “warm” data will be in the persistent cache, typically between 1TB – 8TB per instance;
- All “cold” data will be in slow magnetic storage, up to hundreds of TB per instance.
Our own benchmarks show this provides a good balance between cost and performance. The following are the results for storing & querying 2TB of data using various storage types:
| Storage type | Ingestion time | Query time |
|---|---|---|
| Magnetic + SSD (persistent_cache) | 11min | 15min |
From these benchmarks, we can conclude:
- At 11min for 2TB, the read/write speed of the SSD is more than 2GB/sec, which means we’re limited by SSD throughput only;
- Ingestion time is entirely unaffected by magnetic vs SSD;
- Magnetic + SSD persistent cache shows about a 20% drop in query performance compared to pure SSD.
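The first bullet can be sanity-checked with some quick arithmetic (a back-of-the-envelope sketch, taking 2TB as 2 TiB and 11 minutes as 660 seconds):

```shell
# 2 TiB expressed in MiB, divided by 660 seconds of ingestion time
echo "$(( 2 * 1024 * 1024 / (11 * 60) )) MiB/s"   # prints "3177 MiB/s"
```

Roughly 3.1 GiB/s of sustained write throughput, comfortably above the 2GB/sec figure quoted above.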
Provided that magnetic storage is about 50% cheaper than SSD general purpose storage, we can conclude that moving to this type of setup results in a ~50% cost reduction for a ~20% drop in query performance.
To set this up, we create a secondary RAID volume with its own filesystem as follows:
[root@quasardb:~]# mdadm --create --verbose /dev/md1 --level=0 --raid-devices=8 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1 /dev/nvme12n1 /dev/nvme13n1 /dev/nvme14n1 /dev/nvme15n1 /dev/nvme16n1
mdadm: chunk size defaults to 512K
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
[root@quasardb:~]# mkfs.xfs -f -L qdb_cache /dev/md1
[root@quasardb:~]# mount /dev/disk/by-label/qdb_cache /var/lib/qdb_cache
[root@quasardb:~]#
Last but not least, we must tell QuasarDB to use the available persistent cache. In your qdbd.conf, edit “local.depot.rocksdb” and add a persistent_cache_path such as this:
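A sketch of what that section might look like. Only the persistent_cache_* keys are named in this post; the surrounding JSON structure is an assumption based on the “local.depot.rocksdb” path mentioned above, so compare against your own qdbd.conf:

```
"local": {
    "depot": {
        "rocksdb": {
            "persistent_cache_path": "/var/lib/qdb_cache",
            "persistent_cache_size": 0,
            "persistent_cache_nvme_optimization": false
        }
    }
}
```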
Set the “persistent_cache_nvme_optimization” variable to true if you’re using NVMe SSDs; otherwise, leave it at the default value of false. A value of “0” for “persistent_cache_size” means the cache is allowed to use the entire disk.
Controlling storage costs is paramount to the success of your database deployment. QuasarDB works well with standard Linux software RAID, and provides a persistent cache feature which enables you to put a “warm” cache in front of cheap, magnetic storage. These features provide you with the building blocks to tune storage costs vs. performance to your needs.