
clickhouse-jdbc driver using a lot of object allocations/memory when inserting data #1115

Closed
lavanyachennupati opened this issue Oct 20, 2022 · 6 comments


When using the clickhouse-jdbc driver to insert data into ClickHouse, we are noticing a lot of object allocations from the StringBuilder.append() and .toString() calls used in several places in the insert path.

Context on how we insert data:

1. We initialize a ClickHouse connection that is reused:

```scala
val properties = new Properties()
properties.setProperty(ClickHouseClientOption.ASYNC.getKey, "true")
properties.setProperty(ClickHouseClientOption.COMPRESS.getKey, "true")
properties.setProperty(ClickHouseClientOption.DECOMPRESS.getKey, "true")
properties.setProperty(ClickHouseClientOption.SOCKET_TIMEOUT.getKey, "300000")
val dataSource = new ClickHouseDataSource(s"jdbc:clickhouse://$ip:8123", properties)
val connection = dataSource.getConnection
```
2. We create a prepared statement from this connection, add an event/row to it, and call `statement.addBatch()` for ~1000 events before doing a `statement.executeBatch()`:

```scala
val statement = connection.prepareStatement(
  s"INSERT INTO $tableName ${ClickHouseEvent.columnNames} VALUES ${ClickHouseEvent.columnValues}"
)
```

```scala
// for each of the ~1000 events:
statement.setString(1, clickHouseEvent.event_id)
statement.setString(2, clickHouseEvent.shortDesc)
statement.setString(3, clickHouseEvent.level)
statement.setString(4, clickHouseEvent.platform)
statement.setLong(5, clickHouseEvent.timestamp)
statement.setInt(6, clickHouseEvent.hour)
statement.setString(7, clickHouseEvent.version)
statement.setString(8, clickHouseEvent.`type`)
statement.setString(9, clickHouseEvent.logger)
statement.setString(10, clickHouseEvent.message)
statement.setString(11, clickHouseEvent.formattedMessage)
............
statement.setString(42, Json.encode(clickHouseEvent.exception))
statement.setString(43, clickHouseEvent.exceptionType)
statement.setString(44, clickHouseEvent.exceptionModule)
statement.setString(45, clickHouseEvent.exceptionValue)
statement.setLong(46, clickHouseEvent.epoch)
statement.addBatch()

// after adding 1000 events to the statement:
statement.executeBatch()
```

There are several places in `statement.addBatch()` where a StringBuilder with an initial capacity of 10 is created and constantly copied/resized, forcing a lot of object allocations and significant memory usage for each of the 46 column values in each of the 1000 rows added.
`statement.executeBatch()` later allocates again through a toString() implementation.
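As an illustration of why this hurts (a standalone model, not the driver's actual code): OpenJDK's AbstractStringBuilder grows its backing array to roughly `oldCapacity * 2 + 2` and copies the old contents each time an append overflows it, so a builder that starts at capacity 10 but ends up holding a few KB of row data gets reallocated many times:

```java
public class SbGrowth {
    // Model how many times a StringBuilder's backing array is reallocated
    // while appending n single characters, starting from initialCapacity.
    // Uses the growth rule of OpenJDK's AbstractStringBuilder (old * 2 + 2).
    static int reallocations(int initialCapacity, int n) {
        int capacity = initialCapacity;
        int reallocs = 0;
        for (int length = 0; length < n; length++) {
            if (length + 1 > capacity) { // next append would overflow
                capacity = capacity * 2 + 2; // grow and copy the array
                reallocs++;
            }
        }
        return reallocs;
    }

    public static void main(String[] args) {
        // A builder sized for 10 chars that grows to hold ~4 KB of row data
        // is copied 9 times (10 -> 22 -> 46 -> ... -> 6142).
        System.out.println(reallocations(10, 4096));   // prints 9
        // Pre-sizing to the expected row size avoids every copy.
        System.out.println(reallocations(4096, 4096)); // prints 0
    }
}
```

With 46 columns per row and 1000 rows per batch, this copy-on-grow pattern repeats per value, which matches the allocation profile above.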

Attaching flame graphs of the allocation profiles, which show about 70% of allocations coming from the JDBC driver.

[Screenshots: three allocation-profile flame graphs, Oct 20, 2022]

Is there a more memory-efficient client that ClickHouse offers? Or any alternative, more memory-efficient ways that ClickHouse supports for inserting data?

cc: @e-mars

zhicwu (Contributor) commented Oct 20, 2022

Hi @lavanyachennupati, what's the exact String passed to PreparedStatement? As ClickHouse does not support prepared statements, the JDBC driver has to figure out the table structure by analyzing the SQL query. Based on the information you provided, it looks like the JDBC driver was trying to build a large SQL statement, which means it didn't get enough information about the target table.

Could you check the Java examples here? You may want to start with the simplified form `insert into <table name> except (<columns not needed>)` and see if it stops the madness.

If you must use a JVM language, you may try the Java client for memory-efficient inserts, for example:

https://github.com/ClickHouse/clickhouse-jdbc/blob/3e462a3cc90707189424b3137738abe59f606c85/clickhouse-client/src/test/java/com/clickhouse/client/ClientIntegrationTest.java#L1335-L1346

#928 contains some test results (QUERY TIME -> WRITE) that you may want to check as well.

e-mars commented Oct 20, 2022

@gingerwizard for visibility

lavanyachennupati (Author) commented Oct 21, 2022

> what's the exact String passed to PreparedStatement?

This is the exact statement:

```scala
val statement = connection.prepareStatement(
  s"""INSERT INTO $tableName
     (event_id,
      shortDesc,
      level,
      platform,
      timestamp,
      hour,
      version,
      type,
      logger,
      message,
      formattedMessage,
      .........
      exceptionType,
      exceptionModule,
      exceptionValue,
      dateTime)
     VALUES
     (?, ?, ?, ?, ?, ......, ?, ?, ?)"""
)
```

We switched to clickhouse-jdbc batch inserts with the approach in the description after we ran into very expensive SQL parsing from javacc, similar to this issue.

As a first step after looking at the examples, we'll try the JDBC batch inserts where the schema might not have to be interpreted.

However, looking at the write-performance benchmarks here, it looks like the Java client for inserting data might perform worse in terms of CPU and memory consumption than JDBC with mixed values (which is our use case, as the majority of our fields are strings, but a few are other data types). It seems using clickhouse-client directly might be the best option performance-wise for us, compared to the Java client and clickhouse-jdbc.

Thanks for taking a look at this super fast. We'll share the results once we try out the above; in the meantime, any suggestions on this approach are very much appreciated.

zhicwu (Contributor) commented Oct 21, 2022

Thanks @lavanyachennupati. Was $tableName substituted with the real table name before you passed the query to PreparedStatement? In your case, since you're not using the input function, the JDBC driver will issue the query `select <columns you specified> from $tableName where 0` to get the column types. Also, if you have a complex expression in the VALUES part, like VALUES(? + 1, ? || 'suffix', ...), the driver will simply give up trying to get the table structure. Anyway, you can try removing the VALUES expression, using the real table name, and trying again. If query logging is not disabled, you can check the query issued before the insert in system.query_log.

As to memory usage: since I didn't specify a heap size in my test, the number is higher than it should be (due to less GC involved). Taking the Mixed query as an example, the test result shows the driver took ~210 MB of memory for the query, but if you re-run the test with -Xms24m -Xmx24m, the number drops to ~77 MB (including non-heap memory usage) while performance is not impacted. You can verify that by using jmap to inspect a histogram of Java objects in the heap: there will be lots of byte[] and String instances, but most of them are not considered live.
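If attaching jmap is inconvenient, the same "are these objects actually live?" question can be spot-checked from inside the JVM with the standard java.lang.management API. This is a rough sketch, not a precise measurement (GC timing makes the numbers approximate), and the allocation loop below only imitates the kind of short-lived builder garbage discussed in this issue:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class HeapSpotCheck {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();

        // Produce short-lived char[]/String garbage: many appends into an
        // under-sized builder, then toString(), discarding each result.
        String kept = null;
        for (int i = 0; i < 1_000; i++) {
            StringBuilder sb = new StringBuilder(10);
            for (int j = 0; j < 4_096; j++) sb.append('x');
            kept = sb.toString();
        }

        long before = mem.getHeapMemoryUsage().getUsed();
        System.gc(); // hint: collect the dead copies
        long after = mem.getHeapMemoryUsage().getUsed();

        // Most of the allocated array copies were not live, so used heap
        // typically shrinks sharply after the collection.
        System.out.printf("used before GC: %d KB, after GC: %d KB (kept: %d chars)%n",
                before >> 10, after >> 10, kept.length());
    }
}
```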

Lastly, on achieving better performance: although clickhouse-client looks faster, please keep in mind that the test was using clickhouse-client to load a file in Native/RowBinary format, meaning you'd have to spend extra time serializing Java objects into a supported format and then handing it over to the client, which is pretty much what the Java client provides :p

lavanyachennupati (Author) commented Oct 21, 2022

> Was $tableName substituted to the real table name before you pass the query to PreparedStatement?

Yes, it is substituted with the real table name. We don't use any complex expressions in the VALUES part like VALUES(? + 1, ? || 'suffix', ...).

Thanks for the additional info on the test setup. We'll try some of the alternatives mentioned and see how the live objects/heap usage look for our use case.

lavanyachennupati (Author) commented:

Changing the prepared statement as suggested above to the following reduced the object allocations.

```scala
connection.prepareStatement(
  s"INSERT INTO $tableName select ${ClickHouseEvent.columnNames} from input ${ClickHouseEvent.columnValues}"
)
```

ClickHouseSqlParser seems to interpret the above SQL as an insert statement, which results in using the InputBasedPreparedStatement implementation of ClickHousePreparedStatement. That implementation writes data in ClickHouse RowBinary format to a java.nio.ByteBuffer-backed input stream via a java.util.concurrent.BlockingQueue, which reduces object allocations a lot compared to the SqlBasedPreparedStatement implementation.
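That streaming idea can be sketched in plain Java. The class below is illustrative only (it is not the driver's actual code): a producer hands serialized row chunks through a bounded BlockingQueue of ByteBuffers to an InputStream consumer, so no giant per-batch String is ever built:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// An InputStream fed by a producer through a bounded queue of ByteBuffers,
// similar in spirit to how InputBasedPreparedStatement streams RowBinary data.
public class QueueBackedInputStream extends InputStream {
    private static final ByteBuffer EOF = ByteBuffer.allocate(0); // end marker
    private final BlockingQueue<ByteBuffer> queue = new ArrayBlockingQueue<>(16);
    private ByteBuffer current = ByteBuffer.allocate(0);

    // Producer side: hand over one serialized chunk of rows.
    public void offerChunk(byte[] bytes) {
        putBuffer(ByteBuffer.wrap(bytes));
    }

    @Override
    public void close() {
        putBuffer(EOF); // signal end-of-stream to the consumer
    }

    private void putBuffer(ByteBuffer b) {
        try {
            queue.put(b); // blocks when the queue is full (bounded memory)
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    @Override
    public int read() throws IOException {
        while (!current.hasRemaining()) {
            if (current == EOF) return -1; // producer closed the stream
            try {
                current = queue.take(); // wait for the next chunk
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException(e);
            }
        }
        return current.get() & 0xFF;
    }

    public static void main(String[] args) throws Exception {
        QueueBackedInputStream in = new QueueBackedInputStream();
        Thread producer = new Thread(() -> {
            in.offerChunk("row1\n".getBytes(StandardCharsets.UTF_8));
            in.offerChunk("row2\n".getBytes(StandardCharsets.UTF_8));
            in.close();
        });
        producer.start();
        byte[] all = in.readAllBytes(); // consumer side, e.g. an HTTP request body
        producer.join();
        System.out.print(new String(all, StandardCharsets.UTF_8));
    }
}
```

The bounded queue is the key design choice: the producer blocks when the consumer falls behind, so memory usage stays capped regardless of batch size.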
