This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

support kafka 2.0.0 #1053

Closed
woodsaj opened this issue Sep 18, 2018 · 13 comments

woodsaj (Member) commented Sep 18, 2018

We have seen a few bugs in production that have been fixed in kafka 2.0.0.
Kafka 2.0.0 is supported in Sarama from v1.18.0; we are currently using v1.10.0.

Dieterbe (Contributor) commented:

@woodsaj in order to avoid duplicate work, please clarify: are you interested in testing the new sarama and the new kafka yourself (i believe that's what you said in today's meeting), or do you want someone else to take this on (which i believe we concluded in yesterday's meeting)?

woodsaj (Member Author) commented Sep 18, 2018

I am building new raintank/kafka docker images, so once someone else updates metrictank, it can be tested easily against the new kafka version using our docker stacks.

woodsaj (Member Author) commented Sep 18, 2018

@Dieterbe i have pushed 2 new docker images to dockerhub

raintank/kafka:v1.1.1
raintank/kafka:v2.0.0

Dieterbe (Contributor) commented Sep 19, 2018

> we are currently using v1.10.0

no, we use sarama v1.16.0 as of #906.
it is a bit confusing because Gopkg.toml specifies version "1.10.1", but by default that means "1.10.1 or any higher 1.x". you can confirm by checking Gopkg.lock, which shows the exact version used, or by diffing the code (which i did, and saw that it's v1.16.0). from now on, let's pin more precisely to an exact version.
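
for illustration, here is roughly what the two forms look like in dep's Gopkg.toml (a sketch; the exact versions are illustrative, and dep accepts only one constraint stanza per dependency, so these are alternatives rather than both at once):

```toml
# loose: dep treats a bare version as a caret range, so this constraint
# means ">=1.10.1, <2.0.0" and can silently float up to e.g. v1.16.0
[[constraint]]
  name = "github.com/Shopify/sarama"
  version = "1.10.1"

# pinned: an explicit "=" locks the dependency to one exact release
[[constraint]]
  name = "github.com/Shopify/sarama"
  version = "=1.19.0"
```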

i see that in all our prod and ops clusters we still use kafka 0.10.2.1; sarama deprecated support for it between v1.16 and v1.17, and kafka 1.1.0 support was introduced between those same two releases.

i see that a kafka cluster can be upgraded from 0.10 straight to 2.0 (http://kafka.apache.org/20/documentation.html#upgrade_2_0_0), but whichever order we do it in, one of the steps is not officially supported by sarama:

  1. either we first upgrade sarama to v1.18 in MT, then upgrade kafka (but v1.18 doesn't support 0.10)
  2. or we first upgrade kafka, then sarama to v1.18 (but sarama v1.16 doesn't support 2.0)

2 seems super risky, 1 seems somewhat risky.
so either we:
A) confirm whether 1 should work, despite not being officially supported (IBM/sarama#1171)
B) do a 2-step upgrade: first upgrade kafka to a version between 0.11 (the lowest supported by sarama v1.18) and 1.0 (the highest supported by v1.16), then upgrade sarama to v1.18, then upgrade kafka to 2.0

note: i want to redo some of the tests from IBM/sarama#1101

Dieterbe (Contributor) commented:

> We have seen a few bugs in production that have been fixed in kafka 2.0.0

for the record, can you describe these bugs? or link to them?

woodsaj (Member Author) commented Sep 24, 2018

Dieterbe added this to the 1.1 milestone Oct 14, 2018
Dieterbe (Contributor) commented Oct 15, 2018

in a team meeting we decided to try solution 1. note that in the meantime sarama 1.19 has come out, bringing some minor changes (https://github.com/Shopify/sarama/releases/tag/v1.19.0)

so we should verify that:

  • MT + sarama 1.19 indeed works with our current kafka version, 0.10.2.1 (not officially supported)
  • MT + sarama 1.19 works with kafka 2.0
  • we can do a live upgrade from kafka 0.10.2.1 to 2.0, with an active publisher sending data - eg fakemetrics - and MT cluster consuming data (using sarama 1.19)

in addition we have to measure performance of:

  • MT + sarama 1.16 vs MT + sarama 1.19, with both kafka versions
  • if the results are noisy, we may want to modify MT to not actually process incoming data (e.g. don't send into the index and into tank/aggmetrics), which is what i ended up doing in IBM/sarama#1101 (regression in consumer throughput when upgrading 1.10 to 1.16). but a full "take 2" of those tests (testing all sarama versions) is out of scope for this task.

robert-milan (Contributor) commented Oct 23, 2018

raintank/kafka:v2.0.0 with Sarama v1.19 using client config V2_0_0_0 seems to be the best overall so far, offering consistent and reduced memory and CPU overhead while only losing out slightly on ingest rate.

I tried to run the benchmarks from #1032. For CPU they worked quite well, but the -benchmem alloc results grew with each run. I didn't have time to investigate further. I will look into it soon and post results here.

Setup:
Through testing I finally settled on using `./fakemetrics backfill --kafka-mdm-addr localhost:9092 --kafka-mdm-v2=true --offset $((2*366*24))h --period 900s --speedup 180000 --orgs 2 --mpo 2000` (the offset expands to 17568 hours, roughly two years of backfill) to achieve a long enough ingest processing period to compare results.

Kafka: raintank/kafka
Sarama: 1.16 (specified in Gopkg.toml as the constraint "1.10.1")
Client Version: V0_10_0_0
MT - https://snapshot.raintank.io/dashboard/snapshot/LwryH5IF7bXcnyN6G2yhNn46SK47PbQY
containers - https://snapshot.raintank.io/dashboard/snapshot/ZN84kh6yxDvoA0H8CN4PfIeowtOjx1Kg
host - https://snapshot.raintank.io/dashboard/snapshot/RBKvBUc6MI6L6MHzRiFYTGyRiQru2sOs

Kafka: raintank/kafka
Sarama: 1.19
Client Version: V0_10_0_0
MT - https://snapshot.raintank.io/dashboard/snapshot/A620A9s5LAGPj6diGwh1xLIj11dUedTp
containers - https://snapshot.raintank.io/dashboard/snapshot/vH4Ki6uW62l9wJUJQDbp7ayq2PIX00oT
host - https://snapshot.raintank.io/dashboard/snapshot/4gzXiD5Zaui5ihruicVYdZ31taFcY2uE

Kafka: raintank/kafka:v2.0.0
Sarama: 1.16
Client Version: V0_11_0_0
(NOTE: V0_10_0_0 with Sarama v1.16 does NOT work with raintank/kafka:v2.0.0. Please see IBM/sarama#1144)
Sarama 1.16 DOES support client version V0_11_0_0 so I changed it for this test.

```diff
diff --git a/input/kafkamdm/kafkamdm.go b/input/kafkamdm/kafkamdm.go
index 0e6a3752..fe543df6 100644
--- a/input/kafkamdm/kafkamdm.go
+++ b/input/kafkamdm/kafkamdm.go
@@ -8,17 +8,15 @@ import (
        "sync"
        "time"
 
-       "github.com/raintank/schema"
-       "github.com/raintank/schema/msg"
-
        "github.com/Shopify/sarama"
-       "github.com/rakyll/globalconf"
-       log "github.com/sirupsen/logrus"
-
        "github.com/grafana/metrictank/cluster"
        "github.com/grafana/metrictank/input"
        "github.com/grafana/metrictank/kafka"
        "github.com/grafana/metrictank/stats"
+       "github.com/raintank/schema"
+       "github.com/raintank/schema/msg"
+       "github.com/rakyll/globalconf"
+       log "github.com/sirupsen/logrus"
 )
 
 // metric input.kafka-mdm.metrics_per_message is how many metrics per message were seen.
@@ -129,7 +127,7 @@ func ConfigProcess(instance string) {
        config.Consumer.MaxWaitTime = consumerMaxWaitTime
        config.Consumer.MaxProcessingTime = consumerMaxProcessingTime
        config.Net.MaxOpenRequests = netMaxOpenRequests
-       config.Version = sarama.V0_10_0_0
+       config.Version = sarama.V0_11_0_0
        err = config.Validate()
        if err != nil {
                log.Fatalf("kafkamdm: invalid config: %s", err)
diff --git a/mdata/notifierKafka/cfg.go b/mdata/notifierKafka/cfg.go
index 3e54e45f..23579e3e 100644
--- a/mdata/notifierKafka/cfg.go
+++ b/mdata/notifierKafka/cfg.go
@@ -70,7 +70,7 @@ func ConfigProcess(instance string) {
 
        config = sarama.NewConfig()
        config.ClientID = instance + "-cluster"
-       config.Version = sarama.V0_10_0_0
+       config.Version = sarama.V0_11_0_0
        config.Producer.RequiredAcks = sarama.WaitForAll // Wait for all in-sync replicas to ack the message
        config.Producer.Retry.Max = 10                   // Retry up to 10 times to produce the message
       config.Producer.Compression = sarama.CompressionSnappy
```

MT - https://snapshot.raintank.io/dashboard/snapshot/viCbVVoQZB9IysKcf8oMksnZ28Igay46
containers - https://snapshot.raintank.io/dashboard/snapshot/Iq4GQI9BFcAMrlTkND7NGY8v3yKRTWiD
host - https://snapshot.raintank.io/dashboard/snapshot/a0Dqcluoyr8EG6BiuNeQh7zsxJKLHE35

Kafka: raintank/kafka:v2.0.0
Sarama: 1.19
Client Version: V0_10_0_0
MT - https://snapshot.raintank.io/dashboard/snapshot/3JLIMMvGS9DimEiXn8m1rDGmhvOLp7qo
containers - https://snapshot.raintank.io/dashboard/snapshot/grrbLd3WndhFpBjBd4JI6Z9V14IuFQFP
host - https://snapshot.raintank.io/dashboard/snapshot/HytAHZoGQ2hye9yAjM8NTq5396TTf4Uv

Kafka: raintank/kafka:v2.0.0
Sarama: 1.16
Client Version: V2_0_0_0
Sarama 1.16 does not support client version V2_0_0_0 so this was skipped

Kafka: raintank/kafka:v2.0.0
Sarama: 1.19
Client Version: V2_0_0_0
MT - https://snapshot.raintank.io/dashboard/snapshot/RdFXLTfmsaxvKCoiRy9kpW9P1lT467aJ
containers - https://snapshot.raintank.io/dashboard/snapshot/jMCYc2tq9CKL25wD3MF374EO1uPdwUVu
host - https://snapshot.raintank.io/dashboard/snapshot/4m3W5smhuUfOJ8yaq4yhn0tpRnvPYPvv

I should have results verifying a live upgrade on a local docker cluster tomorrow. I will post them here.


EDIT with new information for kafka 0.10.2.1

Testing with kafka 0.10.2.1 yielded basically the same results as kafka 0.10.0.1

Kafka: raintank/kafka:v0.10.2.1 (locally built)
Sarama: 1.16
Client Version: V0_10_0_0
MT - https://snapshot.raintank.io/dashboard/snapshot/wnLqEXT4556cCjQ5JXZAij6ar6IiBsSO
containers - https://snapshot.raintank.io/dashboard/snapshot/G7IrR34vf6KwhsRRQyNuEkJNHTKa4Zq6
host - https://snapshot.raintank.io/dashboard/snapshot/TeINQRqhgLkfG02Gf6R37NzqHfHw66sY

Kafka: raintank/kafka:v0.10.2.1 (locally built)
Sarama: 1.19
Client Version: V0_10_0_0
MT - https://snapshot.raintank.io/dashboard/snapshot/6VBEYHi7KV6uiyS7QytMFZ96a4xA2mYg
containers - https://snapshot.raintank.io/dashboard/snapshot/4XTB6SHfYS7wPwZO5owdedi6NAsqfsjW
host - https://snapshot.raintank.io/dashboard/snapshot/JY8AcucjxgEs4NZXCa4w9Q1HayFw9i9F

woodsaj (Member Author) commented Oct 23, 2018

This is great @robert-milan.
To summarise (these numbers are just eyeballed from the snapshots):

| Kafka Ver | Sarama Ver | Client Ver | Ingest | MT RSS | MT heap | MT CPU% | Kafka CPU% |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.10.0.1 | 1.16 | V0_10_0_0 | 890k | 220MiB | 55MiB | 250 | 245 |
| 0.10.0.1 | 1.19 | V0_10_0_0 | 860k | 235MiB | 62MiB | 240 | 240 |
| 2.0.0 | 1.16 | V0_11_0_0 | 815k | 302MiB | 45MiB | 170 | 170 |
| 2.0.0 | 1.19 | V0_10_0_0 | 614k | 222MiB | 50MiB | 170 | 170 |
| 2.0.0 | 1.19 | V2_0_0_0 | 805k | 282MiB | 40MiB | 170 | 170 |

Key takeaways for me are:

  • memory used for ingestion represents a fraction of overall memory usage for MT. Doing 800k+/second using only 50MB of heap seems exceptional. Given MT instances typically use many GB of memory (clearly driven by active_series), variation between kafka/sarama/client versions seems inconsequential.
  • The reduced CPU usage of Kafka is going to be a big win for us. Our kafka cluster is mostly CPU constrained.
  • In production, we don't ever see individual MT instances do more than 500k of ingest. So the drop in ingest rate with kafka 2.0 is not an issue.

One question: how many partitions was MT consuming from? While we are at it, i would like to get an idea of how the number of partitions affects the performance numbers.

woodsaj pushed a commit that referenced this issue Oct 23, 2018
see issue: #1053

from  https://github.com/Shopify/sarama/blob/v1.19.0/config.go#L324-L330

```
The version of Kafka that Sarama will assume it is running against.
Defaults to the oldest supported stable version. Since Kafka provides
backwards-compatibility, setting it to a version older than you have
will not break anything, although it may prevent you from using the
latest features. Setting it to a version greater than you are actually
running may lead to random breakage.
```
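
For illustration, a minimal sketch (not metrictank's actual wiring) of what that doc means when constructing a sarama client: pick a config.Version no newer than the oldest broker in the cluster. The broker address and the specific version constant below are assumptions for the example.

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	config := sarama.NewConfig()
	// Per the quoted doc: use the oldest Kafka version present in the fleet.
	// During a rolling 0.10 -> 2.0 broker upgrade that means staying on
	// V0_10_0_0, and only raising it (e.g. to V2_0_0_0) once every broker
	// runs 2.0; setting it higher than a running broker risks breakage.
	config.Version = sarama.V0_10_0_0

	if err := config.Validate(); err != nil {
		log.Fatalf("invalid sarama config: %s", err)
	}

	// "localhost:9092" is only a placeholder broker address for this sketch.
	client, err := sarama.NewClient([]string{"localhost:9092"}, config)
	if err != nil {
		log.Fatalf("failed to connect to kafka: %s", err)
	}
	defer client.Close()
	log.Printf("connected, client configured for kafka %v", config.Version)
}
```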
Dieterbe (Contributor) commented Oct 23, 2018

> Sarama: 1.16 (1.10.1)

what is this 1.10.1?

re kafka version, @woodsaj how do you know 0.10.0.1 was used? that would be our raintank/kafka:v1 image. @robert-milan can you clarify the kafka version for the tests where it is currently not mentioned?
regardless, 0.10.0.1 doesn't seem useful, as in prod and ops we use 0.10.2.1 (but we haven't published an image for that version on dockerhub, it seems)

> NOTE: V0_10_0_0 with Sarama v1.16 does NOT work with raintank/kafka:v2.0.0

oof. and it looks like it doesn't work with V0_10_2_1 either, which is the version we need. does it work when using an older kafka version? because you mentioned "with kafka v2".

another interesting observation is that, going from the first to the last tests, GC frequency and CPU spent on GC go down.

@robert-milan some more benchmarking tips:

  • when the data is noisy (e.g. heap used), i recommend you adjust the query to show a movingAverage over a minute or so.
  • run the MT instance with cluster.primary_mode = false, to rule out any interference from saving chunks to cassandra
  • likewise, use the memory idx plugin and disable the cassandra one
  • time your grafana requests to be after the benchmark workload (seems you already figured this out)

woodsaj (Member Author) commented Oct 23, 2018

> how do you know 0.10.0.1 was used? that would be our raintank/kafka:v1 image

yes, the raintank/kafka:v1 image uses kafka version 0.10.0.1.

> can you clarify the kafka version for those where it is currently not mentioned.

The "raintank/kafka" image will use "raintank/kafka:latest" which is "raintank/kafka:v1"

> oof. and it looks like it doesn't work with V0_10_2_1 either which is the version we need

this doesn't matter. Just update sarama to v1.19; it works correctly with kafka 2.0 and kafka 0.10 with clientVersion set to V0_10_0_0

robert-milan (Contributor) commented:

@Dieterbe I updated my comment to clarify that the 1.10.1 refers to the constraint in Gopkg.toml, and thank you for the benchmarking tips. @woodsaj I will get back to you on the partitions.

Dieterbe modified the milestones: 1.1, 1.0 Oct 24, 2018
robert-milan added a commit that referenced this issue Nov 1, 2018
This is required as part of the workflow to upgrade Kafka to v2.0.0

See also: #1053
Dieterbe (Contributor) commented:

@robert-milan seems to me this ticket can be closed now?
