-
Notifications
You must be signed in to change notification settings - Fork 11
Compression
JSON documents can be big! 📄
Nevermore historically and by default assumes the JSON is stored as text (in an nvarchar(max)
column), and this makes sense for documents that are small. There are some optimizations in the Nevermore 12.0+ that make it work better for larger documents, but it's still a lot of data to shuffle back and forth.
Rather than storing documents as text, Nevermore 12.0+ supports compressing the JSON.
With compression enabled, your JSON is stored in a varbinary(max)
column called JSONBlob
. The data is compressed with GZIP. If you find yourself trying to query it in SQL, you'll be glad to know that SQL 2016 knows how to work with GZIP data using its COMPRESS()
and [DECOMPRESS()](https://docs.microsoft.com/en-us/sql/t-sql/functions/decompress-transact-sql?view=sql-server-ver15)
functions.
Much of the data in JSON documents is repetitive. Property names, types and ID's are repeated over and over.
In Octofront, for example, the LicenseKeys
table keeps a history of the changes made to each license key. Each history entry has a type, date, and some properties. Those strings are repeated potentially hundreds of times. Compressing this data reduces the size of the table at the time of writing from 435 MB to 118 MB.
Bigger documents with collections as properties will naturally mean repeating many of the same strings.
The benefits of compression are:
- Less space used in the database
- Less space used in data transfer over the network (in the SQL data reader)
The downside, of course, is:
- CPU time spent compressing and decompressing when reading data
- Loss of the ability to use features like JSON_VALUE
In testing with (local)\SQLEXPRESS and a 120K document with a 50% compression ratio, these benefits canceled each other out.
Format | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|
Text | 1.560 ms | 0.2761 ms | 0.0151 ms | 64.4531 | 29.2969 | - | 357.26 KB |
Compressed | 1.703 ms | 0.3508 ms | 0.0192 ms | 66.4063 | 31.2500 | - | 356.59 KB |
But SQL Server rarely runs on the same machine. In those cases, compression makes a big difference. Here's the same test running when using NetLimiter to limit bandwidth in/out of SQLEXPRESS to 100mbps:
Format | Mean | Error | StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|
Text | 21.242 ms | 1.637 ms | 0.0898 ms | 62.5000 | 31.2500 | - | 357.29 KB |
Compressed | 7.756 ms | 1.099 ms | 0.0602 ms | 62.5000 | 23.4375 | - | 356.6 KB |
100mbps is probably a real-world best case too. In Octofront, it's unlikely we get that when talking from our web apps to the SQL Azure database for example. I suspect the same will be true for Octopus Cloud. Customer networks may be different, but I'd expect it might make a considerable impact on many of them.
To use compression, you need to:
- Add a
[JSONBlob]
column of typevarbinary(max)
- In your
DocumentMap
, set theJsonStorageFormat
property. - Migrate the data somehow
Here's an example of a document map:
public class BigDocumentMap : DocumentMap<BigDocument>
{
public BigDocumentMap()
{
Column(m => m.Name);
Column(m => m.Type);
JsonStorageFormat = JsonStorageFormat.CompressedOnly;
}
}
There are four options for JsonStorageFormat
. The first two are for when you're creating a new type of document:
-
TextOnly: The default. Use when you only have a
[JSON]
column. -
CompressedOnly: Use when you only have a
[JSONBlob]
column.
The second two are for when you've got documents with data in them already, and you can't do a big-bang migration. Nevermore supports automatic migration for each document as you load and save it.
-
MixedPreferCompressed: Use when you have both
[JSONBlob]
and[JSON]
columns, and want data in the[JSON]
column to be migrated to the[JSONBlob]
column -
MixedPreferText: Use when you have both
[JSONBlob]
and[JSON]
columns, and want data in the[JSONBlob]
column to be migrated to the[JSON]
column.
When inserting or updating data, Nevermore will use the JSON serializer to encode it as unicode, and GZipStream
the result, to store in the JSONBlob
column.
When reading, Nevermore will use GZipStream
to read it back, again assuming the result is unicode.
Data in the columns is stored compressed using GZIP, which means you can use the COMPRESS
and DECOMPRESS
functions in SQL Server 2016 to interact with it.
To use functions like JSON_VALUE
, just decompress the data first:
select * from dbo.Person
where JSON_VALUE(CAST(DECOMPRESS([JSONBlob]) as nvarchar(max)), '$.Name') = @name
If you enable compression on a table that already has data, you need to migrate the data. There are three options.
You can use the COMPRESS
function in SQL Server 2016 to do this for you easily.
-- First, add a column named `JSONBlob` of type `varbinary(max)`.
alter table dbo.Person add JSONBlob varbinary(max) null
go
update dbo.Person set JSONBlob = COMPRESS([JSON])
-- Either null the [JSON] column, or drop it
update dbo.Person set [JSON] = null
The benefit of this approach is you can drop the old column and switch to CompressedOnly
in code.
This should work quickly on small tables, but could take longer on big tables. From #98:
On a table with 1GB of JSON and 64,000 rows (VariableSet). It completed in 78 seconds and compressed size was 280MB. The events table, 3.2GB, 1,700,000 rows took 10 minutes
So the second approach is to use one of the "Mixed" JSON storage modes, which will transparently do this for you as you load and write documents.
A third hybrid approach would be run code in the background that loads and writes each document back. This will slowly compress all the records, but users will still be able to interact with the application.
We could have chosen other approaches for compression - using BSON instead of JSON, or using a better compression algorithm than GZIP. Not having to use the 2-byte encoding for Unicode would have been very nice. However, those would make it difficult to use any of the data from within SQL directly (SQL doesn't support UTF-8, and even in SQL 2019 when it does, it will be collation dependent). The compatibility and migration benefits of this setup seem to be the optimal balance.
Further reading:
Overview
Getting started
Extensibility
Misc