
Support Avro as an insertion format #769

Open
yusufozturk opened this issue Sep 29, 2022 · 23 comments
@yusufozturk

Hi,

We have Avro files consumed directly from Kafka. We don't want to deserialize the Avro into a Go struct and then hand it to the ClickHouse client so it can re-serialize the data in ClickHouse's native format, because that's extra work in the API.

What we imagine is support for Avro as an insertion format: we pass the Avro file content as a []byte and the ClickHouse client sends it as-is, with the insert query defined as "INSERT INTO table FORMAT Avro".

Do you think this is something easy to implement?

PS: We are also working on this driver to provide a PR but at the moment we are learning how it works..

@gingerwizard
Collaborator

If you don't want to go via a struct you'll need to use https://clickhouse.com/docs/en/sql-reference/formats/#data-format-avro. The driver currently only supports Native, so this is similar to #737, i.e. expanding format support. It's planned for v3; we can bring it forward if there is demand.

@yusufozturk
Author

Thank you for your reply @gingerwizard

We tried to implement it, but we got "unknown code family code 101".

I see that data can be sent this way via the HTTP interface:
curl 'http://localhost:8123/?query=INSERT%20INTO%20TABLE%20FORMAT%20CSV' --data-binary @source.csv

We could probably send the Avro file the same way. But this is only available on the HTTP interface (port 8123), while we connect over port 9000. My question is: if we send the file to the HTTP interface (8123), does it deserialize the Avro file into Native format and send it on to port 9000, or does the HTTP interface pass the file straight through?

I'm sorry if my question is hard to understand :)

Thanks.

@yusufozturk
Author

yusufozturk commented Sep 30, 2022

Or, to put the question another way:
If you implement Avro support in v3, you won't do any data conversion in the driver, right? You will just send the binary to ClickHouse. In that case, why not add the ability to send a []byte with "FORMAT X" support in this driver, so that anybody can send a file in any supported format in the future?

Thanks.

@gingerwizard
Collaborator

Yes, I think to do this properly we shouldn't do any conversion. The client supports communication over HTTP or native. For HTTP it's probably fairly simple for inserts (harder for selects) if we provide an AppendAvro method. The challenge is that we hard-code FORMAT Native, and we'd need an elegant way to tell the client not to do that, i.e. in prepareBatch. We could maybe allow FORMAT to be specified there and respect it; Append would then route to an Avro method.

I don't know if avro is supported over the native interface. Would need to check with core.

@yusufozturk
Author

We just tried the HTTP interface and it worked with our Avro file. Until this driver has support, we will discover the ClickHouse servers via this driver and send the file to them via the HTTP interface.

FORMAT support is also available in clickhouse-client on the server. That one probably uses native, right? I looked at the Avro implementation in clickhouse-server and it all happens server-side, so I assume clickhouse-client sends the binary in Avro format over the native protocol and the server uses that FORMAT to write the data to the database.

@yusufozturk
Author

An interesting benchmark:

[image: benchmark comparison]

@gingerwizard
Collaborator

gingerwizard commented Sep 30, 2022

Curious how you achieved these results. Even ch-go, the low-level lib this driver is based on (it's simpler and avoids reflection), takes longer than JDBC,

i.e.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/ClickHouse/ch-go"
	"github.com/ClickHouse/ch-go/proto"
)

func main() {
	ctx := context.Background()
	c, err := ch.Dial(ctx, ch.Options{Address: "localhost:9000", Password: ""})
	if err != nil {
		panic(err)
	}
	var (
		numbers int
		data    proto.ColUInt64
	)
	start := time.Now()
	if err := c.Do(ctx, ch.Query{
		Body: "SELECT number FROM system.numbers LIMIT 500000000",
		Result: proto.Results{
			{Name: "number", Data: &data},
		},
		// OnResult will be called on next received data block.
		OnResult: func(ctx context.Context, b proto.Block) error {
			numbers += len(data)
			return nil
		},
	}); err != nil {
		panic(err)
	}
	fmt.Println("numbers:", numbers)
	elapsed := time.Since(start)
	log.Printf("Read took %s", elapsed)
}

takes 90s on my machine

@gingerwizard
Collaborator

OK, ignore the 90s figure - ARM issues:

=== RUN   TestRead
2022/09/30 18:40:32 Read took 38.102172657s
--- PASS: TestRead (38.10s)
=== RUN   TestChRead
numbers: 500000000
2022/09/30 18:40:33 Read took 1.285767595s
--- PASS: TestChRead (1.29s)
PASS
ok  	command-line-arguments	39.398s

The first test (reading 500M rows) uses clickhouse-go, the second ch-go.

@gingerwizard
Collaborator

Makes me wonder if clickhouse-go would be faster using RowBinary...offloading the work to the server. @mshustov @ernado

@yusufozturk
Author

Oh, I'm sorry if I caused confusion. This wasn't our test; I just saw it on the internet while searching for similar things.
We are preparing our own test scenarios and I will share the benchmark results soon.

We will use ch-go for reading (behind our internal load balancer), but we will implement a new client for inserting Avro files directly into ClickHouse over HTTP. I'm hoping to stream from our messaging queue, so our client will use very little CPU and memory to ship data to ClickHouse. We'll see after benchmarks.

@yusufozturk
Author

Okay, here are my first test results on a small DigitalOcean machine:

100,000 records

HTTP interface with Avro format (file is around 5.5 MB)
Write took 16.0331926s

clickhouse-go
Write took 9.0664694s

So clickhouse-go over native gives better performance. I will check their resource usage later.

Then I tried smaller record counts:

1,000 records
Write took 1.3065944s for HTTP (file is 56 KB)
Write took 777.0141ms for native

10,000 records
Write took 3.2503491s for HTTP (file is 555 KB)
Write took 1.6198537s for native

So the HTTP client is much slower. On the other hand, the two clients aren't directly comparable, because we are sending an Avro file to the HTTP interface; maybe the Avro deserializer is the slow part.

If you add Avro support to the native client, the comparison would be more meaningful.

@yusufozturk
Author

yusufozturk commented Oct 3, 2022

@gingerwizard What I've planned so far (just a plan; I haven't done any of it yet, and I'm not sure everything will go as planned):

  1. Write a serializer/deserializer for the ClickHouse native format and use that as the main serializer. Instead of writing data in Avro format, data will travel in native format through the entire lifecycle: the messaging queue will carry native format and we will send the data directly to ClickHouse. (The worst bottleneck is the data ingester at the moment, so I can afford to move away from Avro.)

  2. Fork ch-go and change the library to support sending native binary, removing the extra layers of latency, so that data is sent to CH directly.

  3. Remove all reflection from the columnar conversion. We will have a single struct, so there's no need for reflection. This should also shave off some latency.

So maybe for v3:
a. If you can support RowBinary, this will eliminate HTTP client usage in the future.
b. If you can accept Native-format binary, we can migrate to v3 in the future.
c. Add support for returning native binary, so I can use my own reflection-free deserializer to convert data into my struct slices and migrate to v3 in the future.

As a closing note: the current performance of ch-go is insanely good, but it uses a lot of memory (I mean memory that I could actually avoid) due to the reasons here. We would like to invest more in that. I would highly appreciate it if you take this feedback as input for v3; I'm sure it will make the client more flexible for CH customers in the future.

@gingerwizard
Collaborator

Adding (1) to the driver would be welcome. An AppendAvro would be interesting in that it would be faster than structs (no reflection) and reusable by others.

HTTP client support is important for users who need to go through proxies etc. It's less performant but essential for some users.

b. If you can accept Native-format binary, we can migrate to v3 in the future.

Not sure what you mean. This client only uses native.

I think ch-go could do column-by-column compression, which might help with memory (need to check whether this optimization is there).

@yusufozturk
Author

@gingerwizard in the end, we decided to go with clickhouse-go. We used the following method in ch-go:

err := c.Do(ctx, ch.Query{
	Body:  sql.GetInsertQuery() + " VALUES",
	Input: input,
})

So we prepare logs in native format and hand them over directly.

With clickhouse-go we used AppendStruct. The difference (100K logs):

2022/10/04 17:04:27 insert ch-go proto took 996.0782ms
2022/10/04 17:04:30 insert native struct took 2.2107911s

That's not a huge difference. This is only a single insert; with thousands of inserts we would probably see more of a gap. But for now I'm pretty happy with clickhouse-go's performance, because with ch-go I need to manage many things myself.

What I would kindly ask is whether clickhouse-go could also accept the native format (proto.Input). In ch-go we can pass it via the Input parameter. As far as I know, clickhouse-go does not accept proto.Input, right? If it did, we could skip some data conversion here.

Thanks.

@yusufozturk
Author

I was also thinking of encoding and compressing the native format myself and giving it directly to clickhouse-go. That would save us some conversion latency too, but I don't know how to pull it off, because there is additional framing in the buffer such as column names, protocol headers, etc.

So I came back to my earlier idea of accepting []byte as input. I think I will stick with keeping the data in native format, since there is a huge performance difference versus Avro. I just don't know how to encode and compress proto.Input.

I'm sorry if I'm not explaining clearly. English is not my native language 😅

@gingerwizard
Collaborator

@yusufozturk no problem at all. Thanks for all the feedback and testing!

So, to summarize, we want to support an AppendAvro method on batch. Will this:

  1. Use Format=Avro when sending and accept a []byte, or
  2. Use native format and perform the conversion to native in the client?

I think we want (1)?

@yusufozturk
Author

@gingerwizard yes, (1) would be perfect! Actually, if you bring in (1), nothing stops you at Avro anymore: basically any format supported by ClickHouse would work, right? So why not accept []byte and make the format a parameter, so people can send their data in any supported format, like they do with the HTTP interface? That would be a perfect feature for this library.

Although (1) would be a perfect feature, after my tests I was wondering whether the performance of ClickHouse's Avro deserializer is good enough. Maybe its Avro performance is worse than our serializer in Go. We use github.com/hamba/avro as the Avro serializer and it's one of the fastest. So if the server-side Avro performance is bad, I would like to see (2) as well, so I can keep data in native format in my messaging queue, pull it directly, and ship it via clickhouse-go.

Thank you a lot for your help. You guys are doing great with this library. These changes will really save us some compute power and bring our costs down dramatically.

@yusufozturk
Author

@gingerwizard it would be perfect if the library could support streams, so that instead of loading the entire Avro file into memory we could send it to ClickHouse via a reader. (I'm also going to ask for a new Avro table engine from ClickHouse.)

@zhicwu

zhicwu commented Oct 18, 2022

Makes me wonder if clickhouse-go would be faster using RowBinary...offloading the work to the server.

It depends. The Native data format is really fast for fixed-length data structures, but probably not the best when there are Strings. I updated ClickHouse/clickhouse-java#928 with the latest test results. RowBinary performs better for String and mixed (Int8 + UInt64 + String) queries, at the cost of higher CPU usage on the server side (perhaps related to both the HTTP protocol and the format).

@yusufozturk
Author

@gingerwizard Hi Dale! Hope all is going well with you.

Do you think this can be possible in v2? Or is this planned for v3? If it's in v3, hoping to see it in 3.0.0 release 😄

Thanks a lot.

@gingerwizard
Collaborator

@jkaflik is taking on maintenance now - his decision here.

@jkaflik jkaflik added this to the v3.0.0 milestone Dec 20, 2022
@jkaflik
Contributor

jkaflik commented Dec 20, 2022

@yusufozturk supporting data formats other than Native will require some redesign of the client internals, hence I want to tackle this in v3.
