-
Notifications
You must be signed in to change notification settings - Fork 420
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Event listener restarts when under minimal stress #1618
Comments
I believe I have found the problem here. I started of trying to debug by sending a single message to my broker. Surprisingly, this let to 101 events being sent to my EventListener. After turning on trigger debugging, I reran my test and again check the logs. After analyzing the messages, it broke down like this:
Further investigation reveals that this message is sent in sink.go (https://github.com/tektoncd/triggers/blob/main/pkg/sink/sink.go#L236-L255). If the message encoding is unknown, a simple HTTP response is sent with an "Accepted" status. However, if the encoding is known, the CloudEvent message "dev.tekton.event.triggers.accepted.v1" is sent. In my case it appears the encoding is known. However, this appears to cause a furious loop. As the accepted event is sent, the broker accepts it and responds by forwarding it back to the EventListener. The EventListener receives the event, processes it, and since it has a known encoding it again responds by sending an accepted evetnt. There appears to be an infinite loop, and I'm not sure why it ever ends, but at some point it does. It could be because the EL gets so overwhelmed the eventually it can't respond in time and breaks the cycle. To me this appears to be a flaw in the logic of sending an event in response to another event. There's probably a good reason to do so, but it's nothing that I am aware of. But my setup does not seem unusual so it seems like this could be a common problem for anyone who is sending events with a known encoding. To test my theory, I commented out the sending of the accepted event and ran my curl test. The test ran for an hour, and it send 1.5 million events to the EL without a single restart. A check of the EventListener log reveals no timeout errors and no errors (other than the expected FailedPrecondition errors). I'm not sure what the fix should be here. To me it would make sense for |
@afrittoli What do you think about this request? Is there a flaw in the logic of sending cloud-event responses to cloud-event requests? |
I actually think the EL should contain logic to not respond with another event to an event it generated. I think that's what's missing. |
Thanks @ggallen for your analysis of this. Producing CloudEventsTekton can produce CloudEvents to notify consumers about things that happen in Tekton, like a trigger was received, a pipeline was executed, etc. This functionality has existed in Tekton Pipelines for some time, and has recently been introduced into Tekton triggers as well. Consuming CloudEventsAn event producer creates a CloudEvent, which can be sent directly to an event listener sink or sent to a CloudEvent broker and delivered to the event listener sink via a subscription. In Tekton's own infra cluster, we use a Knative Eventing channel to route CloudEvents. The Knative subscription object sends the events to the event listener sink and expects a response. If the response is missing, or if the response is not a valid CloudEvent, it considers the message delivery to be failed and it attempts again. Tekton triggers added a CloudEvent response to CloudEvents triggers to address this issue. Tekton must respond to an incoming CloudEvents, and that response must be in valid CloudEvents format according to the CloudEvents spec. Something is not working either in the cloudevent response or in the broker. With that context, I would like to make sure I understand where the loop happens: sequenceDiagram
participant broker
participant event-listener
participant el-sink
broker->>event-listener: sends a CloudEvent `com.redhat.agogos`
event-listener ->>el-sink: produces a new CloudEvent `dev.tekton.event.triggers.started.v1`
event-listener->>broker: sends a CloudEvent response `dev.tekton.event.triggers.accepted.v1`
broker ->>event-listener: sends a CloudEvent response to `dev.tekton.event.triggers.accepted.v1`
event-listener->>broker: sends a CloudEvent response `dev.tekton.event.triggers.accepted.v1`
broker ->>event-listener: sends a CloudEvent response to `dev.tekton.event.triggers.accepted.v1`
event-listener->>broker: sends a CloudEvent response `dev.tekton.event.triggers.accepted.v1`
If this is correct, it looks like the event broker does not recognise the response produced by the event listener as a response, instead, it interprets it as a new event to which it replies with a response to the event listener, which didn't expect a response and treats the incoming message as a new event and so on. I wonder how that is possible :S Do you have a trace of the content of the event listener response to the initial event? |
@afrittoli, I think the loop happens between the broker and the event listener as you show in your diagram. The UIDs are changing, so I also think the broker must be interpreting the response as a new event. But I don't know how the broker is supposed to determine that the new event is in fact a response and not a new event. Is there something missing in the response that is supposed to indicate that to the broker? If you tell me how to get what you want for debugging information, I can get it for you. But this should be easy to reproduce. |
@afrittoli, if it makes any difference, we are using the Knative MT broker. |
@afrittoli I think what @ggallen want is a receiving without a cloud event. Sending a cloud event in response to a cloud event is a valid response as we can see from this receiver: What @gallen want is just a Result response (which consists of a status code and message string) to CloudEvent. Can't we just filter events by the source on the broker side? |
Well, I don't really care what I get for a response, I just don't want all the messages. But I think there is a problem here. The question is where, and I see a couple of possibilities:
I think the problem, wherever it is, should be fixed. It's possible that it's #2, and the options are to either get a fix to the MT broker or change to a different broker that handles things correctly. I think the question is where is the actual problem? I don't have the knowledge to answer that question. |
If we can't filter events by the source on the broker side, then we can probably add an annotation to EventListener to determine what to do for CloudEvents request - default - send Events or just status-code and msg. |
@afrittoli If you are free, please clarify on how to proceed with this. |
Hello @khrm and @ggallen - sorry about the late reply. To summarise the issues as I understand them today:
In both issues (2) and (3), the response sent is not identified as a response, but rather as a new event, which triggers a response. In both those cases, the issue can be:
Note that in (2) and (3) producer and consumers are inverted, so we could have two separate issues:
The next steps would be to have examples of each of the responses produced by both systems, compare them with the specification and figure out whether the issue is the producer or consumer of the response. I hope this helps move forward towards a solution. |
I have gotten the Broker and EL messages. I chose three message/response pair as I think that's enough to get a picture of what's going on. I also looked at the CDEvents documentation at https://github.com/cdevents/spec, but it wasn't helpful to someone with my level of knowledge about this topic. But I am wondering if it makes any difference that the initial event being sent to the EL from the Broker is not one of the predefined CD events. Is that possibly the problem? Here are the messages. Hopefully, this information can shed some light on what might be the problem. Broker: |
That's fine. EL can receive any CE-Events. It will respond with Accepted Event which shouldn't come back. |
@ggallen Can you test the broker with this sample? https://github.com/cloudevents/sdk-go/blob/7f5ef3992769f96b40a54c4d59291be62acd36da/samples/http/responder/main.go If your broker while receiving the CE event sends back the event again, then it's broken. Otherwise, EL has a problem. |
I don't understand what you want me to do. I can't run that code. Do you want me to send that message? You can see from my example that the Broker ends up sending back the accepted message to the EL. |
@ggallen , can you try setting filters like given here: I wanted you to run something similar and send a message via a broker. Let me create a program. |
@ggallen Please use this deployment file: https://raw.githubusercontent.com/khrm/ce-test/master/deployment.yaml First, send a direct request after port-forwarding the service:
Afterward, you can use the below commands to send a request:
Afterward, you can port-forward the MT broker endpoint (expose as port 8080) and send the request using the above command. |
@khrm, something is not quite right, but here are the results. Sending to
Sending to the MT broker:
|
@ggallen Was message received at CE-Test service port? I don't think so. |
Well, I can't get anything to run now. I see this in the events:
|
@khrm, the ce-test pod keeps restarting.
In the output above, I haven't done anything but let time pass. I am seeing the liveness probe fail as in the previous comment. |
@khrm, I was able to finally send a message directly to the broker:
And what the EL shows in the log:
Sending to ce-test:
And the EL log after that:
Nothing there, although maybe that is expected. I think this is what you asked for. If you need anything else please let me know. |
@ggallen Actually, I wanted you to use CE-Test in place of EventLisener. Put CE-Test in broker in place of EventLisener and send event via broker. I assume at the moment when we send event to Broker, it goes to EL. Now we want to send event to Broker, it should go to CE-Test. |
@khrm, I will try that. I'm afraid the problem is going to be that I can't get the log for CE-test, unless I get lucky, since it keeps restarting. |
@khrm, I believe this is what you are asking for. CE-Test OUTPUT:
CE-Test LOG:
BROKER OUTPUT:
BROKER LOG:
|
No. I meant instead of broker sending request to EL, it should send request to CE-Test service. In the above, I can see log from EL. |
@khrm, yes. I replaced the regular EL with the CE-test image. |
@ggallen But the logs that you shared are from EL. Can you use the updated deployment file from here: I removed probes and updated the image with one with logs. Now, you it shouldn't crash. |
After this change,
|
@khrm, what exactly are you asking me to do? Does this branch have a fix? |
This branch changes the response code from 202 to 200, which is the only difference in behaviour that we could observe between the test app and the event listener. At this stage, it's still a bit unclear to me whether the issue lies in the event listener or in the broker implementation, so it's difficult to build a reproducer test. |
But I thought I already provided enough information so that the problem could be reproduced easily. Is that not the case? |
@ggallen Unless we set up the broker, we can't debug this easily. Can you try using the above branch? |
Or you can use the below release yaml:
It's based on the above branch. |
@ggallen If the issue is the status code, that branch release yaml might fix it. That's the only difference between ce-test and EventListener. Can you please test it? |
@khrm, I applied your patches to my development environment and ran an Agogos build as I did when I initially found this problem. I see thousands of messages in the log still, notably ones like this:
I think the fact that I am still seeing messages of type |
I would publish a branch cloud events sdk. Wait. |
@ggallen This doesn't look like the issue with EL. I tested with ce-test and it has looping too. The only thing is that the CE-test gives 400 responses after a while and this breaks the loop. |
A workaround for this:
Enable the new filtering in the configmap.
In the Trigger, I put the following:
|
@khrm, are you saying there's not a problem here somewhere? While I have no doubt your workaround will work, it's just masking the problem. Somewhere, something still has to process all those messages to filter out what isn't wanted. And whatever it is is still going to get thousands of messages it shouldn't get. Actually, you're cutting alot of that off with the filter. But it's still masking what I think is a real problem, and will just happen with the next EL. |
@ggallen There won't be thousand of messages if it's filter first time. Even without EventLisener, if I don't use filter, it start looping. |
Agreed, but it will still mean I'm going to get double the number of messages. Every time an accept message is sent, I get a message back that has to be filtered. And since I'm using CloudEvents, I have to put this workaround on every EL. Any users of our software also have to also put it on any ELs that they may create. It may not be an EL problem, but it still seems like a problem. It seems like it is actually a serious problem, regardless of where the problem lies. Don't you agree? |
Yes. That's true but that's how cloudevents work. You send an event and you get an event back as response. Now you filter that response. @afrittoli What could be the solution for this? Should we not send back the event for Cloudevents request? Or have a flag or annotation somewhere to disable sending back cloudevent as response? |
I can't believe this is how it's supposed to work. At least it's not how any other messaging system I have worked with works. Sending an ack should not result in an ack of the ack.
If this is, in fact, true, even for an ack, then the EL should automatically filter all of those messages. |
@ggallen It's not an ack only. It also contains information regarding what was event id generated during the processing.
|
@ggallen Can you use cloud event extension? A header with something like NO_ACK? Then, EventListener won't send back a response. Just a basic accepted response. @afrittoli Is this the correct way to go about this? |
It would be like a cloudevent extension. |
So after looking into cloud event spec and SDK. And EventListener can GetExtension Based on the extension value, EventListener can determine whether to send an event or not. We can name the extension as |
@afrittoli Can you please confirm whether the extension route is the way to go here as the correct pattern to use and that the EL is intentionally sending 2 ACKS. One for the message and another ACK for the ACK. |
@khrm @gvarsamis Reading through the thread, my impression is that the event listener is behaving according to specification (perhaps with the exception of the status code 202 instead of 200). The broker and even listener communicate through the CloudEvents HTTP Binding - which means communication follows a request/response pattern. The CloudEvent binding requires the HTTP client to produce a response that is valid according to the CloudEvents protocol - that means, if a payload is provided in the response, it must be a valid CloudEvent. In the past, the event listener use to respond with a JSON payload - not in CloudEvents format - which caused issues with the broker - see #1439 for more context. The fix for that has been to format the previous JSON payload as a CloudEvent. At this stage my perception is that the issue is no the broker side - it should not reply to an HTTP response with a new request. Once that HTTP request is sent to the event listener, the event listener replies to it as it should with an HTTP response. From an event listener point of view there is no connection between the first message from the broker and the second one. What we could do is to reach out to the knative/cloudevents communities to clarify this behaviour with them.
|
@afrittoli, thanks for the information. But do you agree that this is not expected behavior? There is a bug here somewhere? I agree it must the the broker if the EL is responding as it should. So it seems the issue has been narrowed to the broker. I would like to get the bug fixed as it seems like this could be an issue for anyone using CloudEvents. We are currently using the in-memory channel. Would changing to Apache Kafka potentially solve this problem? Or is it likely more of a problem with the broker and not the channel? I'm OK with the workaround, but would really like to get a solution. The problem with the workaround is that it will work for our one @khrm , could you package up the small test case, create an issue for the broker, and see what they say? Please include me on it so I can track its progress. |
Sure. I would create a test case to reproduce this without EL. |
@khrm, did you create an issue for this yet? |
@ggallen No. I didn't get time. We released Operator this week and are having f2f. |
@ggallen I have created the issue here: knative/eventing#7356 |
@ggallen I think you can contact people from the Red Hat OpenShift Serverless team regarding Knative Eventing. Or some other upstream maintainers. |
Expected Behavior
When under minimal stress, the event listener restarts. I was able to get the EL to restart by sending < 3K message to it using curl in 20 seconds.
Actual Behavior
An event listener should be able to handle thousands of messages per second.
Steps to Reproduce the Problem
./send-cloudevent-forever.sh cloudevent5.json
send-cloudevent-forever.sh.txt
cloudevent5.json.txt
Additional Info
Kubernetes version:
Output of
kubectl version
:Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.1", GitCommit:"8f94681cd294aa8cfd3407b8191f6c70214973a4", GitTreeState:"clean", BuildDate:"2023-01-18T15:58:16Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.1", GitCommit:"8f94681cd294aa8cfd3407b8191f6c70214973a4", GitTreeState:"clean", BuildDate:"2023-01-18T15:51:25Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
The text was updated successfully, but these errors were encountered: