
Alternate JSON object storage for clickhouse - Arrays evaluation #7874

Open
macobo opened this issue Jan 4, 2022 · 3 comments
Labels
  • clickhouse - Related to ClickHouse specific bugs or improvements
  • enhancement - New feature or request
  • performance - Has to do with performance. For PRs, runs the clickhouse query performance suite
  • team/product-analytics

Comments

@macobo (Contributor) commented Jan 4, 2022

Background

We currently store JSON properties as a string in ClickHouse and parse/query them at runtime.

ClickHouse has better support for semi-structured data like this in the works, but that doesn't mean we can't consider other solutions in the interim.

This issue doesn't have any tasks associated - it documents the evaluation for posterity.
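For reference, with the current schema a property filter parses the JSON blob at query time using ClickHouse's JSONExtract* functions, roughly like this ('$browser' is just an illustrative property key, not tied to this benchmark):

```sql
-- Current schema: properties is a raw JSON String column,
-- parsed on every query. '$browser' is an illustrative key.
SELECT count()
FROM events
WHERE JSONExtractString(properties, '$browser') = 'Chrome'
```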

Tested solution

Uber's Logging with ClickHouse article proposed a schema where property keys and values are stored as two parallel arrays on each row.

In our case, the following schema was added to the benchmarking server:

ALTER TABLE events ADD COLUMN property_keys Array(String) DEFAULT arrayMap(x -> x.1, JSONExtractKeysAndValues(properties, 'String'))
ALTER TABLE events ADD COLUMN property_values Array(String) DEFAULT arrayMap(x -> x.2, JSONExtractKeysAndValues(properties, 'String'))
ALTER TABLE events UPDATE property_values = property_values, property_keys = property_keys WHERE 1=1

ALTER TABLE person ADD COLUMN property_keys Array(String) DEFAULT arrayMap(x -> x.1, JSONExtractKeysAndValues(properties, 'String'))
ALTER TABLE person ADD COLUMN property_values Array(String) DEFAULT arrayMap(x -> x.2, JSONExtractKeysAndValues(properties, 'String'))
ALTER TABLE person UPDATE property_values = property_values, property_keys = property_keys WHERE 1=1
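For reference, JSONExtractKeysAndValues returns the object's (key, value) pairs as an Array(Tuple(String, String)), which the arrayMap calls above split into the two parallel arrays. A quick sketch with an illustrative input (not from the real dataset):

```sql
-- Illustrative input; '$browser' and 'plan' are made-up keys.
SELECT
    JSONExtractKeysAndValues('{"$browser": "Chrome", "plan": "free"}', 'String') AS pairs,
    arrayMap(x -> x.1, pairs) AS property_keys,
    arrayMap(x -> x.2, pairs) AS property_values
-- pairs           = [('$browser','Chrome'), ('plan','free')]
-- property_keys   = ['$browser', 'plan']
-- property_values = ['Chrome', 'free']
```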

To query these columns, we extract values via property_values[indexOf(property_keys, 'some_property')] in SQL.
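As a concrete sketch (again with a hypothetical '$browser' key), a property filter against the array schema looks like:

```sql
-- indexOf returns 0 when the key is absent, and out-of-range
-- array access returns the element type's default (''), so
-- events without the key simply fail the equality check.
SELECT count()
FROM events
WHERE property_values[indexOf(property_keys, '$browser')] = 'Chrome'
```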

Benchmark (timing) results

| Benchmark name | Current schema | Arrays | Materialized columns |
| --- | --- | --- | --- |
| track_trends_event_property_filter | 20359ms | 18810ms (0.92) | 4208ms (0.21) |
| track_trends_filter_by_action_current_url | 41310ms | 32029ms (0.78) | 4305ms (0.10) |
| track_trends_person_property_filter | 5607ms | 5258ms (0.94) | 4872ms (0.87) |
| track_correlations_by_event_properties | 37481ms | 30301ms (0.81) | 12499ms (0.33) |
| track_retention_filter_by_person_property | 18254ms | 16888ms (0.93) | 16493ms (0.90) |

Ratios relative to the current schema are shown in parentheses.

Full benchmark results can be found under #7863.

Conclusion: This alternative schema would speed up queries that use event properties by roughly 5-20%. Other queries are unaffected.

Storage

On the benchmarking server, storage looks like this:

Row 1:
──────
column:                          properties
type:                            String  
parts:                           15
on_disk:                         5.09 GiB
compressed:                      5.09 GiB
uncompressed:                    83.35 GiB
marks_size:                      232.88 KiB
percentage_of_properties_column: 100%

Row 2:
──────
column:                          property_values
type:                            Array(String) DEFAULT arrayMap(x -> (x.2), JSONExtractKeysAndValues(properties, 'String'))
parts:                           15
on_disk:                         1.64 GiB
compressed:                      1.64 GiB
uncompressed:                    19.60 GiB
marks_size:                      465.75 KiB
percentage_of_properties_column: 32.1058%

Row 5:
──────
column:                          property_keys
type:                            Array(String) DEFAULT arrayMap(x -> (x.1), JSONExtractKeysAndValues(properties, 'String'))
parts:                           15
on_disk:                         361.16 MiB
compressed:                      360.70 MiB
uncompressed:                    16.37 GiB
marks_size:                      465.75 KiB
percentage_of_properties_column: 6.9233%

Conclusion: Arrays seem to compress better than raw JSON strings, probably because repeated keys/values are handled better. The savings are upwards of 60% here, but might be smaller on real datasets.

Query used to produce this analysis:
WITH (
    SELECT sum(column_bytes_on_disk)
    FROM system.parts_columns
    WHERE table = 'events' AND column = 'properties'
) AS properties_column_on_disk
SELECT
    column,
    concat(any(type), ' ', any(default_kind), ' ', any(default_expression)) AS type,
    count(*) AS parts,
    sum(rows) AS rows,
    formatReadableSize(sum(column_bytes_on_disk)) AS on_disk,
    formatReadableSize(sum(column_data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
    formatReadableSize(sum(column_marks_bytes)) AS marks_size,
    concat(toString(round(100 * sum(column_bytes_on_disk) / properties_column_on_disk, 4)), '%') AS percentage_of_properties_column
FROM system.parts_columns
WHERE table = 'events'
GROUP BY column
ORDER BY sum(column_bytes_on_disk) DESC

Should we implement this?

I don't think so right now, unless storage becomes an issue. This is because:

  • ClickHouse is actively working on native support for semi-structured data, which would blow this out of the water
  • It's another really costly migration on both self-hosted and cloud, with a large opportunity cost: it would divert resources from ClickHouse upgradability and general deployments
  • It would significantly increase backup complexity, since we would have two events table schemas

That said, we should consider this if clickhouse development stalls too long.

@macobo macobo added enhancement New feature or request clickhouse performance Has to do with performance. For PRs, runs the clickhouse query performance suite labels Jan 4, 2022
@hazzadous hazzadous mentioned this issue Jan 4, 2022
@guidoiaquinti (Contributor) commented:

Stellar analysis (as always) @macobo 👏

I confirm what you've already written, and I'm likewise inclined to wait and see ClickHouse's development of the JSON type while keeping this migration in our back pocket in case we need it (or in case upstream feature development slows down).

@fuziontech (Member) commented:

lol paging @EDsCODE this will cause some flashbacks!

I echo what @guidoiaquinti said as well - this is a fantastic analysis, well done. Forever ago, Eric and I tested this too while migrating from PG to CH and found similar results (not quite as refined or documented). On our very small dataset at the time it wasn't obviously faster, so we decided to stick with JSON-serialized properties. We also tried an event_id, key, value lookup table, which was definitely not a winner.

I also second waiting for more solid JSON support in CH. We are not yet hurting for space (and have a plan for that) and I'd prefer not to do another large migration followed by another when they ship JSON features.

@macobo macobo added the clickhouse Related to ClickHouse specific bugs or improvements label Mar 10, 2022
@posthog-bot (Contributor) commented:

This issue hasn't seen activity in two years! If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in two weeks.
