Skip to content

Compression

Paul Stovell edited this page Apr 16, 2020 · 19 revisions

JSON documents can be big! 📄

Nevermore historically and by default assumes the JSON is stored as text (in an nvarchar(max) column), and this makes sense for documents that are small. There are some optimizations in the Nevermore 12.0+ that make it work better for larger documents, but it's still a lot of data to shuffle back and forth.

Rather than storing documents as text, Nevermore 12.0+ supports compressing the JSON.

With compression enabled, your JSON is stored in a varbinary(max) column called JSONBlob. The data is compressed with GZIP. If you find yourself trying to query it in SQL, you'll be glad to know that SQL 2016 knows how to work with GZIP data using its COMPRESS() and [DECOMPRESS()](https://docs.microsoft.com/en-us/sql/t-sql/functions/decompress-transact-sql?view=sql-server-ver15) functions.

Why use compression?

Much of the data in JSON documents is repetitive. Property names, types and ID's are repeated over and over.

In Octofront, for example, the LicenseKeys table keeps a history of the changes made to each license key. Each history entry has a type, date, and some properties. Those strings are repeated potentially hundreds of times. Compressing this data reduces the size of the table at the time of writing from 435 MB to 118 MB.

Bigger documents with collections as properties will naturally mean repeating many of the same strings.

The benefits of compression are:

  1. Less space used in the database
  2. Less space used in data transfer over the network (in the SQL data reader)

The downside, of course, is:

  1. CPU time spent compressing and decompressing when reading data
  2. Loss of the ability to use features like JSON_VALUE

In testing with (local)\SQLEXPRESS and a 120K document with a 50% compression ratio, these benefits canceled each other out.

Format Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Text 1.560 ms 0.2761 ms 0.0151 ms 64.4531 29.2969 - 357.26 KB
Compressed 1.703 ms 0.3508 ms 0.0192 ms 66.4063 31.2500 - 356.59 KB

But SQL Server rarely runs on the same machine. In those cases, compression makes a big difference. Here's the same test running when using NetLimiter to limit bandwidth in/out of SQLEXPRESS to 100mbps:

Format Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Text 21.242 ms 1.637 ms 0.0898 ms 62.5000 31.2500 - 357.29 KB
Compressed 7.756 ms 1.099 ms 0.0602 ms 62.5000 23.4375 - 356.6 KB

100mbps is probably a real-world best case too. In Octofront, it's unlikely we get that when talking from our web apps to the SQL Azure database for example. I suspect the same will be true for Octopus Cloud. Customer networks may be different, but I'd expect it might make a considerable impact on many of them.

Enabling compression

To use compression, you need to:

  1. Add a [JSONBlob] column of type varbinary(max)
  2. In your DocumentMap, set the JsonStorageFormat property.
  3. Migrate the data somehow

Here's an example of a document map:

public class BigDocumentMap : DocumentMap<BigDocument>
{
    public BigDocumentMap()
    {
        Column(m => m.Name);
        Column(m => m.Type);
        JsonStorageFormat = JsonStorageFormat.CompressedOnly;
    }
}

There are four options for JsonStorageFormat. The first two are for when you're creating a new type of document:

  • TextOnly: The default. Use when you only have a [JSON] column.
  • CompressedOnly: Use when you only have a [JSONBlob] column.

The second two are for when you've got documents with data in them already, and you can't do a big-bang migration. Nevermore supports automatic migration for each document as you load and save it.

  • MixedPreferCompressed: Use when you have both [JSONBlob] and [JSON] columns, and want data in the [JSON] column to be migrated to the [JSONBlob] column
  • MixedPreferText: Use when you have both [JSONBlob] and [JSON] columns, and want data in the [JSONBlob] column to be migrated to the [JSON] column.

How it works

When inserting or updating data, Nevermore will use the JSON serializer to encode it as unicode, and GZipStream the result, to store in the JSONBlob column.

When reading, Nevermore will use GZipStream to read it back, again assuming the result is unicode.

Querying compressed data in SQL

Data in the columns is stored compressed using GZIP, which means you can use the COMPRESS and DECOMPRESS functions in SQL Server 2016 to interact with it.

To use functions like JSON_VALUE, just decompress the data first:

select * from dbo.Person 
    where JSON_VALUE(CAST(DECOMPRESS([JSONBlob]) as nvarchar(max)), '$.Name') = @name

Migration

If you enable compression on a table that already has data, you need to migrate the data. There are three options.

You can use the COMPRESS function in SQL Server 2016 to do this for you easily.

-- First, add a column named `JSONBlob` of type `varbinary(max)`. 
alter table dbo.Person add JSONBlob varbinary(max) null
go

update dbo.Person set JSONBlob = COMPRESS([JSON])

-- Either null the [JSON] column, or drop it
update dbo.Person set [JSON] = null

The benefit of this approach is you can drop the old column and switch to CompressedOnly in code.

This should work quickly on small tables, but could take longer on big tables. From #98:

On a table with 1GB of JSON and 64,000 rows (VariableSet). It completed in 78 seconds and compressed size was 280MB. The events table, 3.2GB, 1,700,000 rows took 10 minutes

So the second approach is to use one of the "Mixed" JSON storage modes, which will transparently do this for you as you load and write documents.

A third hybrid approach would be run code in the background that loads and writes each document back. This will slowly compress all the records, but users will still be able to interact with the application.

Why not...

We could have chosen other approaches for compression - using BSON instead of JSON, or using a better compression algorithm than GZIP. Not having to use the 2-byte encoding for Unicode would have been very nice. However, those would make it difficult to use any of the data from within SQL directly (SQL doesn't support UTF-8, and even in SQL 2019 when it does, it will be collation dependent). The compatibility and migration benefits of this setup seem to be the optimal balance.

Further reading:

Clone this wiki locally