Alternate JSON object storage for clickhouse - Arrays evaluation #7874
Labels: clickhouse, enhancement, performance, team/product-analytics
Background
We currently store JSON properties as a string in ClickHouse and parse/query them at runtime.
ClickHouse has better support for semi-structured data like this in the works, but that doesn't mean we can't consider other solutions in the interim.
This issue doesn't have any tasks associated - documenting for posterity.
Tested solution
Uber's article on logging with ClickHouse proposed a schema where property keys and values are stored as arrays on each row.
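As a rough illustration of that layout (a sketch only, not the code from the article or from our benchmark), the Python below splits one JSON properties string into two parallel arrays and reads a value back by key. The helper names `to_parallel_arrays`, `index_of`, and `lookup` are hypothetical. The sketch assumes values are stored as strings and that a missing key yields an empty string; `index_of` follows the documented behaviour of ClickHouse's `indexOf` (1-based position, 0 when the element is absent).

```python
import json
from typing import List, Tuple


def to_parallel_arrays(properties_json: str) -> Tuple[List[str], List[str]]:
    """Split a JSON object string into parallel key and value arrays."""
    props = json.loads(properties_json)
    keys = list(props.keys())
    # Assumption: every value is stored as a string, so non-string
    # values are re-serialized as JSON text.
    values = [v if isinstance(v, str) else json.dumps(v) for v in props.values()]
    return keys, values


def index_of(arr: List[str], item: str) -> int:
    """Return the 1-based position of item, or 0 if it is absent."""
    try:
        return arr.index(item) + 1
    except ValueError:
        return 0


def lookup(keys: List[str], values: List[str], key: str, default: str = "") -> str:
    """Find the key's position, then read the value at the same position."""
    pos = index_of(keys, key)
    # Assumption: a missing key falls back to an empty string.
    return values[pos - 1] if pos > 0 else default


keys, values = to_parallel_arrays('{"$browser": "Chrome", "plan": "free", "seats": 3}')
plan = lookup(keys, values, "plan")
missing = lookup(keys, values, "not_a_property")
```

The point of the design is that a lookup never parses JSON at query time: it is a linear scan over the keys array followed by a positional read from the values array.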
In our case, the following schema was added to the benchmarking server.
To query these columns, we extract values via
property_values[indexOf(property_keys, 'some_property')]
in SQL.
Benchmark (timing) results
Comparison to current schema shown in brackets.
Full benchmark results can be found under #7863.
Conclusion: This alternative schema would speed up queries that use event properties by ~5-20%. Other queries are unaffected.
Storage
On the benchmarking server, storage looks like this:
Conclusion: Arrays seem to compress better than raw JSON strings, probably because repeated keys and values are handled better. The saving is up to 60% on the benchmarking server, but might be smaller on real datasets.
Should we implement this?
I don't think so right now unless storage becomes an issue. This is because:
That said, we should consider this if ClickHouse's own support for semi-structured data stalls for too long.