
Lambda SnapStart Kafka connection errors #42286

Open
hamburml opened this issue Aug 2, 2024 · 4 comments
@hamburml
Contributor

hamburml commented Aug 2, 2024

Describe the bug

Hi,

we use SnapStart on our Quarkus Lambdas. Some of them use smallrye-messaging to write messages to or receive messages from a Kafka cluster. This works as expected, but unfortunately our logs contain warnings that the connection to a Kafka node was lost, either due to an authentication error or to a firewall blocking the traffic.

    "loggerClassName": "org.apache.kafka.common.utils.LogContext$LocationAwareKafkaLogger",
    "loggerName": "org.apache.kafka.clients.NetworkClient",
    "level": "WARN",
    "message": "[Producer clientId=kafka-producer-event-xxxx] Connection to node xx (hxxxx.amazonaws.com/xxx:9096) terminated during authentication. This may happen due to any of the following reasons: (1) Authentication failed due to invalid credentials with brokers older than 1.0.0, (2) Firewall blocking Kafka TLS traffic (eg it may only allow HTTPS traffic), (3) Transient network issue.",

AFAIK, during the init phase the whole memory of the started Quarkus Lambda is snapshotted, and when the Lambda is invoked again the snapshot is loaded back into memory to skip the init phase. That also means that pooled connections are "stored" in the snapshot but in reality are already closed.

So I thought I simply need to close all open Kafka connections before the snapshot is created. I did this with an org.crac.Resource and its beforeCheckpoint method. The warnings in the log are gone now, but it looks like no new connections are initiated afterwards, and therefore all messages sent via a channel fail. I also tried KafkaProducer::flush, but that didn't help.

Any ideas?

// imports added for completeness (assuming Quarkus 3.x / jakarta namespace)
import java.time.Duration;

import jakarta.enterprise.context.ApplicationScoped;
import jakarta.enterprise.event.Observes;
import jakarta.inject.Inject;

import org.crac.Core;
import org.crac.Resource;

import io.quarkus.runtime.StartupEvent;
import io.smallrye.mutiny.Uni;
import io.smallrye.reactive.messaging.kafka.KafkaClientService;
import io.smallrye.reactive.messaging.kafka.KafkaProducer;

import lombok.extern.slf4j.Slf4j;

@ApplicationScoped
@Slf4j
public class KafkaHelper implements Resource {

    @Inject
    KafkaClientService kafkaClientService;

    void onStart(@Observes StartupEvent ev) {
        Core.getGlobalContext().register(this);
    }

    @Override
    public void beforeCheckpoint(org.crac.Context<? extends Resource> context)
            throws Exception {
        log.info("kafkaproducer {}", kafkaClientService.getProducerChannels());
        log.info("kafkaconsumer {}", kafkaClientService.getConsumerChannels());

        log.info("going to sleep");
        var listOfProducer = kafkaClientService.getProducerChannels().stream()
                .map(kafkaClientService::getProducer)
                .map(KafkaProducer::flush) // with KafkaProducer::close log warnings are gone but all future messages fail
                .toList();

        Uni.combine().all().unis(listOfProducer)
                .combinedWith(unused -> null)
                .await().atMost(Duration.ofSeconds(10));
        log.info("going to sleep 2");
    }
    @Override
    public void afterRestore(org.crac.Context<? extends Resource> context)
            throws Exception {

        // is there a 'init connection' method?
        log.info("i am back");

    }
}

I found #31401, which is the same issue but with database connections.
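For completeness, a rough (untested) sketch of the opposite direction: leave the connections untouched at checkpoint time and instead force a metadata round-trip in afterRestore, so that the stale sockets are detected and replaced during restore rather than on the first real message. This assumes KafkaProducer::runOnSendingThread is available on the SmallRye wrapper and that Producer::partitionsFor is enough to trigger the reconnect; "my-topic" is only a placeholder topic name.

@Override
public void afterRestore(org.crac.Context<? extends Resource> context) throws Exception {
    // Sketch only: touch each producer on its sending thread so a metadata
    // fetch re-establishes the broker connections right after restore.
    var warmups = kafkaClientService.getProducerChannels().stream()
            .map(channel -> kafkaClientService.getProducer(channel)
                    .runOnSendingThread(producer -> {
                        producer.partitionsFor("my-topic"); // placeholder topic, forces a metadata round-trip
                    }))
            .toList();

    Uni.combine().all().unis(warmups)
            .combinedWith(unused -> null)
            .await().atMost(Duration.ofSeconds(10));
}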

Expected behavior

No response

Actual behavior

No response

How to Reproduce?

No response

Output of uname -a or ver

No response

Output of java -version

No response

Quarkus version or git rev

No response

Build tool (ie. output of mvnw --version or gradlew --version)

No response

Additional information

No response

hamburml added the kind/bug label Aug 2, 2024

quarkus-bot commented Aug 2, 2024

/cc @alesj (kafka), @cescoffier (kafka), @matejvasek (amazon-lambda), @ozangunalp (kafka), @patriot1burke (amazon-lambda)

@ozangunalp
Contributor

AFAIK, Kafka clients have no way to suspend/resume to support CRaC. There is KIP-921 for that.
In theory, suspend should close connections (not the client itself) and resume would need to reconnect to a node.

@hamburml
Contributor Author

hamburml commented Aug 13, 2024

AFAIK, Kafka clients have no way to suspend/resume to support CRaC

That's sad, but not quite what I mean. When my restored Lambda sends a message via Kafka, I simply get an exception in the log; the Kafka client reconnects and retries the same message, so nothing is lost. What is more annoying is that the Lambda is often shut down, so it starts with SnapStart again, loads the snapshot into memory, and then we get the exception once more. It feels dirty and unclean. I only need a simple method that tells the Kafka client to close all connections; I would call it in beforeCheckpoint, and maybe in afterRestore tell Kafka to connect again.
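Purely as an illustration of what I mean, with hypothetical methods (nothing like suspend()/resume() exists on the Kafka clients today, which is what KIP-921 would cover):

// Hypothetical API, for illustration only -- suspend()/resume() do not exist on the Kafka clients today.
@Override
public void beforeCheckpoint(org.crac.Context<? extends Resource> context) throws Exception {
    kafkaClientService.getProducerChannels()
            .forEach(channel -> kafkaClientService.getProducer(channel).suspend()); // close sockets, keep the client
}

@Override
public void afterRestore(org.crac.Context<? extends Resource> context) throws Exception {
    kafkaClientService.getProducerChannels()
            .forEach(channel -> kafkaClientService.getProducer(channel).resume()); // reopen connections, refresh metadata
}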

Btw, AWS SnapStart only uses the CRaC interfaces but is not a CRaC implementation; it is just very similar.

edit

In short:

In theory, suspend should close connections (not the client itself) and resume would need to reconnect to a node.

would be helpful :D

@cescoffier
Member

@ozangunalp What's the status here? Did we document the pause/resume mechanism we discussed?
