Support Avro as an insertion format #769
If you don't want to go via a struct, you'll need to use https://clickhouse.com/docs/en/sql-reference/formats/#data-format-avro - the driver currently only supports Native, so this is similar to #737, i.e. expanding format support. It's planned for v3. We can bring it forward if there is demand.
Thank you for your reply @gingerwizard. We tried to implement it but we got "unknown code family code 101". I see that you can send the data this way via the HTTP client: so probably we can send the Avro file the same way. But this is only available for the HTTP client (port 8123), and we use port 9000 for the connection. My question is: if we send the file to the HTTP client (8123), does it deserialize the Avro file into Native format and send it to port 9000? Or does the HTTP client send the file directly to 9000? I'm sorry if my question is hard to understand :) Thanks.
Or, to ask this question another way: Thanks.
Yes, I think to do this properly, we shouldn't do any conversion. The client supports communication over HTTP or native. For HTTP it's probably fairly simple for inserts (harder for selects) - if we provide an I don't know if Avro is supported over the native interface. Would need to check with core.
We just tried the HTTP client and it worked with our Avro file. I think until we have support in this driver, we will ping the ClickHouse servers via this driver and send the file to that server via the HTTP client. FORMAT support is also available in "clickhouse-client" on the server; that one probably uses native, right? I saw the Avro implementation in clickhouse-server and it all happens on the server side. So I assume clickhouse-client sends the binary in Avro format over native, and native uses that FORMAT to write data to the database.
Curious how you achieved these results. Even ch-go, the low-level lib on which this is based (it's simpler and avoids reflection), takes longer than JDBC, i.e. it takes 90s on my machine.
OK, ignore the 90s - ARM issues.
The first test is 50m using clickhouse-go, the second with ch-go.
Oh, I'm sorry if I caused any confusion. This wasn't our test; I just saw it on the internet while I was searching for similar things. We will use ch-go for reading (via our internal load balancer). But we will implement a new client for inserting Avro files directly into ClickHouse using the HTTP client. I am hoping to use streaming from our messaging queue, so our client will use really low CPU and memory to ship data to ClickHouse. We will see after benchmarks.
Okay, these are my first test results on a small machine on DigitalOcean: 100,000 records with the HTTP client in Avro format (the file is around 5.5 MB) versus clickhouse-go. So clickhouse-go native gives better performance, but I will check their resource usage later. Then I tried with smaller record sizes: 1,000 records and 10,000 records. So the HTTP client is very slow. But on the other hand, we are not able to compare the two clients fairly, because we are sending an Avro file to the HTTP client; maybe the Avro deserializer is slow. If you add Avro support to native, then comparing them would be more logical.
@gingerwizard What I have planned so far (this is my plan; I haven't done any of it yet, and I'm not sure everything will go as planned):
So maybe for v3: As a closing note, the current performance of ch-go is insanely good, but it uses a lot of memory (I mean memory that I could actually avoid) due to the reasons here. We would like to invest more into that. I would highly appreciate it if you could take this feedback as input for v3. I'm sure it will make the driver more flexible for CH customers in the future.
Adding (1) to the driver would be welcome. HTTP client usage is important for users who need to use proxies etc. It's less performant but essential for some users.
Not sure what you mean. This client only uses native. I think ch-go could do column-by-column compression, which might help re memory (need to check whether this optimization is there).
@yusufozturk https://github.com/ClickHouse/clickhouse-go/pull/776/files potentially doubles read speed.
@gingerwizard In the end, we decided to go with clickhouse-go. We used the following method in ch-go:
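The original snippet did not survive in this thread, but the ch-go pattern being described (handing the client columnar data as `proto.Input`, skipping per-row reflection) looks roughly like this sketch. Table and column names (`logs`, `seq`, `message`) and the server address are assumptions for illustration:

```go
package main

import (
	"context"

	"github.com/ClickHouse/ch-go"
	"github.com/ClickHouse/ch-go/proto"
)

func main() {
	ctx := context.Background()
	c, err := ch.Dial(ctx, ch.Options{Address: "127.0.0.1:9000"})
	if err != nil {
		panic(err)
	}
	defer c.Close()

	// Build the columns directly in Native (columnar) form.
	var (
		seq proto.ColUInt64
		msg proto.ColStr
	)
	for i := uint64(0); i < 100; i++ {
		seq.Append(i)
		msg.Append("log line")
	}

	// proto.Input pairs column names with columnar data; it is written to
	// the wire as-is, with no per-row reflection or struct conversion.
	input := proto.Input{
		{Name: "seq", Data: &seq},
		{Name: "message", Data: &msg},
	}
	if err := c.Do(ctx, ch.Query{
		Body:  "INSERT INTO logs VALUES",
		Input: input,
	}); err != nil {
		panic(err)
	}
}
```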
So we are preparing logs in Native format and giving it directly. In clickhouse-go, we used AppendStruct. So the difference is (100K logs): 2022/10/04 17:04:27 insert ch-go proto took 996.0782ms. It's not that much of a difference. Actually this is only one insert; if there were thousands of inserts, we would probably see some difference. But for now, I am pretty okay with clickhouse-go performance, because with ch-go I need to manage many things. What I would kindly ask is: maybe accept Native format (proto.Input) in clickhouse-go as well. In ch-go, we can give it via the Input parameter. As far as I know, clickhouse-go does not accept proto.Input, right? If that were possible, we could have skipped some data conversion here. Thanks.
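For comparison, the clickhouse-go v2 side of that timing uses `PrepareBatch` and `AppendStruct`, which reflects over a struct once per row; that reflection is the conversion cost the ch-go `proto.Input` path avoids. A minimal sketch (the `Log` struct, table name, and address are assumptions):

```go
package main

import (
	"context"
	"time"

	"github.com/ClickHouse/clickhouse-go/v2"
)

// Log maps struct fields to columns via `ch` tags; names are illustrative.
type Log struct {
	Timestamp time.Time `ch:"timestamp"`
	Message   string    `ch:"message"`
}

func main() {
	ctx := context.Background()
	conn, err := clickhouse.Open(&clickhouse.Options{Addr: []string{"127.0.0.1:9000"}})
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	batch, err := conn.PrepareBatch(ctx, "INSERT INTO logs")
	if err != nil {
		panic(err)
	}
	for i := 0; i < 100; i++ {
		// AppendStruct converts each row from struct form to columns.
		if err := batch.AppendStruct(&Log{Timestamp: time.Now(), Message: "log line"}); err != nil {
			panic(err)
		}
	}
	if err := batch.Send(); err != nil {
		panic(err)
	}
}
```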
Also, I was thinking of encoding and compressing the Native format and giving it directly to clickhouse-go. That would also save us some conversion latency. But I don't know how I can pull this off, because there is additional stuff in the buffer like column names, protocol framing, etc. So actually I came back to my previous idea: accepting []byte as an input. I think I will stick with keeping data in Native format, since there is a huge performance difference compared with Avro. But I just don't know how to encode and compress proto.Input. I am sorry if I don't explain this clearly; English is not my native language 😅
@yusufozturk No problem at all. Thanks for all the feedback and testing! So to summarize, we want to support an
I think we want (1)?
@gingerwizard Yes, (1) would be perfect! Actually, if you bring in (1), nothing stops you at Avro anymore: basically any format supported by ClickHouse would be supported, right? So why not accept []byte and make the format a parameter, so people can send their data in any supported format, like they do with the HTTP client? That would be a perfect feature for this library. Although (1) would be a perfect feature, after my tests I was wondering whether the performance of the ClickHouse Avro deserializer is good enough. Maybe its Avro performance is worse than our serializer in Go; we use github.com/hamba/avro as the Avro serializer and it's one of the fastest. So if Avro performance is bad, I would like to see (2) as well, so I can keep data in Native format in my messaging queue, pull it directly, and ship it via clickhouse-go. Thank you a lot for your help. You guys are doing great with this library. These changes will really save us some compute power and bring the cost down dramatically.
@gingerwizard It would be perfect if the library could support streams, so instead of loading the entire Avro file into memory, we could send it to ClickHouse via a reader. (I'm also going to ask ClickHouse for a new Avro table engine.)
It depends. The Native data format is really fast for fixed-length data structures, but probably not the best when Strings are involved. I updated ClickHouse/clickhouse-java#928 with the latest test results. RowBinary performs better for String and Mixed (Int8 + UInt64 + String) queries, at the cost of higher CPU usage on the server side (perhaps related to both the HTTP protocol and the format).
@gingerwizard Hi Dale! Hope all is going well with you. Do you think this could be possible in v2? Or is this planned for v3? If it's in v3, I'm hoping to see it in the 3.0.0 release 😄 Thanks a lot.
@jkaflik is taking on maintenance now - his decision here. |
@yusufozturk Supporting data formats other than native will require some redesign of the client internals, hence I want to tackle this in v3.
Hi,
We have an Avro file which is consumed directly from Kafka. We don't want to deserialize the Avro into a Go struct and then hand it to the ClickHouse client to serialize into the ClickHouse Native protocol format, because that's extra work in the API.
What we imagine is supporting Avro as an insertion format: we just pass the Avro file content as []byte and the ClickHouse client sends it, and we define our insert query as "INSERT INTO table FORMAT Avro".
Do you think this is something easy to implement?
PS: We are also working on this driver to provide a PR, but at the moment we are still learning how it works.