
Position of SETTINGS matters for configuration and batching #1152

Closed
ivenhov opened this issue Dec 15, 2022 · 3 comments


ivenhov commented Dec 15, 2022

Hi

I'm using the JDBC Java client v0.3.2-patch11 in a Spring Boot application.
I'm trying to enable async_insert, with mixed results so far.

I'm testing with single inserts at the moment, inserting as quickly as possible.
If during my insert I do

        StringBuilder sb = new StringBuilder(1024);
        sb.append("INSERT INTO ").append(_tableDescriptor.getFqn());
        sb.append(" SETTINGS async_insert=1,wait_for_async_insert=0");
        sb.append(" VALUES ");
        sb.append("(?,?,?)");
        String sql = sb.toString();

I can see that disk space on the server does not grow rapidly and the number of parts created stays under control, considering that I'm not batching. This means the setting is working. It also matches the documentation example:
https://clickhouse.com/docs/en/cloud/bestpractices/asynchronous-inserts/

However, if I use

        StringBuilder sb = new StringBuilder(1024);
        sb.append("INSERT INTO ").append(_tableDescriptor.getFqn());
        sb.append(" VALUES ");
        sb.append("(?,?,?)");
        sb.append(" SETTINGS async_insert=1,wait_for_async_insert=0");
        String sql = sb.toString();

then I can see disk usage growing quickly, eventually leading to running out of disk space.
This means the setting is not working.
Why is the position of SETTINGS relevant?
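Spelled out, the two statement shapes being compared look like this (a minimal sketch; "events" is a placeholder table name, not taken from the report):

```java
// Minimal sketch of the two statement shapes compared above.
// "events" is a placeholder table name, not from the original report.
public class InsertSqlShapes {

    // 1st version: SETTINGS before VALUES.
    static String settingsBeforeValues(String table) {
        return "INSERT INTO " + table
                + " SETTINGS async_insert=1,wait_for_async_insert=0"
                + " VALUES (?,?,?)";
    }

    // 2nd version: SETTINGS after VALUES.
    static String settingsAfterValues(String table) {
        return "INSERT INTO " + table
                + " VALUES (?,?,?)"
                + " SETTINGS async_insert=1,wait_for_async_insert=0";
    }
}
```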

I discovered the non-working 2nd version when I was trying to batch 1000 values per insert.
With that version batching is extremely quick, much faster than with the 1st version, so I would like to use it if possible.
I think batching does not work at all in the 1st version.
It may have something to do with parsing breaking due to the inserted SETTINGS clause.
The constructor of SqlBasedPreparedStatement seems to bypass
https://github.com/ClickHouse/clickhouse-jdbc/blob/f285b73d45597e29cf36dff4daf4cb4f9f53453a/clickhouse-jdbc/src/main/java/com/clickhouse/jdbc/internal/SqlBasedPreparedStatement.java#L78
due to parsedStmt.hasValues() returning false.

The same happens in ClickHouseConnectionImpl:
https://github.com/ClickHouse/clickhouse-jdbc/blob/f285b73d45597e29cf36dff4daf4cb4f9f53453a/clickhouse-jdbc/src/main/java/com/clickhouse/jdbc/internal/ClickHouseConnectionImpl.java#L705
where it skips that section because parsedStmt.hasValues() returns false.

As it stands I cannot reliably configure inserts to get decent performance, avoid rapid parts growth, and at the same time benefit from bulk inserts.

Any help with this would be appreciated.
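[Editor's note: one avenue worth checking, an unverified assumption on my part rather than anything confirmed in this thread, is that the clickhouse-jdbc driver can often forward server settings supplied as JDBC URL query parameters, which would sidestep the statement-parsing question entirely. Verify this against the driver docs for v0.3.2 before relying on it:]

```
jdbc:clickhouse://host:8123/mydb?async_insert=1&wait_for_async_insert=0
```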


ivenhov commented Dec 15, 2022

Further testing shows the following behaviour for 100k records inserted, with 1000 records per batch.

1st version of the code

        StringBuilder sb = new StringBuilder(1024);
        sb.append("INSERT INTO ").append("tablename");
        sb.append(" SETTINGS async_insert=1,wait_for_async_insert=0");
        sb.append(" VALUES ");
        sb.append("(?,?,?)");
        String sql = sb.toString();

inserts them in 1m 22s.

2nd version of the code

        StringBuilder sb = new StringBuilder(1024);
        sb.append("INSERT INTO ").append("tablename");
        sb.append(" VALUES ");
        sb.append("(?,?,?)");
        sb.append(" SETTINGS async_insert=1,wait_for_async_insert=0");
        String sql = sb.toString();

inserts them in 1s 600ms, which is so much faster!
The second version performs exactly the same when the SETTINGS line is commented out.
This means that line is irrelevant and not used/parsed at all.
This can be proven by putting in bogus SETTINGS values, which are still tolerated.

This means the 1st version of the code is correct, but the question is:

  • why is there such a performance drop with batch inserts when SETTINGS is used in the correct place?

Also, the parser should probably warn/stop if SETTINGS is used in the wrong place. Even better if the position did not matter at all.

For reference, the call executed is:

List<Object[]> batchArgs...
int[] columnTypes = ...
_jdbcTemplate.batchUpdate(sql, batchArgs, columnTypes);
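[Editor's note: a plain-JDBC equivalent of that call might look like the sketch below. This is an illustration only: the helper names are hypothetical and only java.sql APIs are used; the 1000-per-batch chunking matches the setup described above.]

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the same batched insert without Spring's JdbcTemplate.
// Helper names are hypothetical, not part of the ClickHouse driver.
public class BatchInsert {

    // Split rows into fixed-size batches (e.g. 1000 per batch, as above).
    static <T> List<List<T>> chunk(List<T> rows, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += batchSize) {
            batches.add(rows.subList(i, Math.min(i + batchSize, rows.size())));
        }
        return batches;
    }

    // Bind each row's values, then send the whole batch in one executeBatch call.
    static int[] insertBatch(Connection conn, String sql, List<Object[]> rows)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (Object[] row : rows) {
                for (int i = 0; i < row.length; i++) {
                    ps.setObject(i + 1, row[i]);
                }
                ps.addBatch();
            }
            return ps.executeBatch();
        }
    }
}
```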

@zhicwu zhicwu added the bug label Dec 16, 2022

zhicwu commented Dec 16, 2022

Hi @ivenhov, thanks for the report, and sorry for the confusing results. It's caused by the outdated SQL parser, which expects the SETTINGS expression at the end of the insert statement, but it should really come before the VALUES expression :<

A few more comments for you to consider:

  1. the second query is invalid
    If you try it using the ClickHouse native CLI client, you'll end up with a parsing error.

  2. async insert
    This is just my opinion, for your reference. Async insert is convenient for small writes from many clients. In the past we used a Buffer table or an MQ to achieve a similar result. However, depending on your use case, it's often more elastic and flexible to use an MQ.

  3. batch insert
    It is expected that batch insert is fast, because ClickHouse is fast :) If you find the JDBC driver too slow for inserting large amounts of data, you may want to switch to the Java client, which is async and stream-based. Performance [WIP] #928 has some more numbers for comparison.

Update: the parsing issue was caused by breaking change on server side ClickHouse/ClickHouse#35883.


ivenhov commented Dec 16, 2022

Thanks for the prompt reply.

  1. You are right about the CLI, the 2nd version is indeed incorrect. I don't know why it didn't occur to me to check that via the CLI.

  2. In my case it may be single messages, but with bursts of 100s per second, at which point I would like to do the batching.
    Messages are coming from RabbitMQ, which I need to pre-process and persist in CH. Because of the nature of the queue I think I will receive only 1 message at a time.
    This is different in Kafka, as far as I know, where you can get multiple messages and advance the pointer.
    I need to look into it.
    Is using async_insert preferred over Buffer tables?

  3. Thank you for the links. I was considering the Java client, but having pure JDBC has its advantages.
    It's definitely on the table if performance suffers.
