Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send pipeline logs to OpenTelemetry endpoint #304

Merged
merged 36 commits into from
Feb 20, 2022

Conversation

cyrille-leclerc
Copy link
Contributor

@cyrille-leclerc cyrille-leclerc commented Jan 25, 2022

  • Support sending the Pipeline logs to an OpenTelemetry endpoint.
    • When pipeline logs are sent to Elastic through the OpenTelemetry instrumentation, support visualization of logs through Jenkins pipeline logs screens in addition to supporting visualization of logs through Elastic / Kibana
  • Add instrumentation of Jenkins Stapler
  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests - that demonstrates feature works or fixes the issue

image

kuisathaverat and others added 10 commits February 6, 2022 11:37
* Move Elasticsearch code under backend package

* Move Elasticsearch code under backend package
…nt (#315)

* Implement logs retriever pagination using new Elasticsearch Java client

* WIP

* WIP

* extract integration tests configuration from code

* WIP

* WIP

* WIP

* WIP

* fix findbugs

* better docs

* WIP
@cyrille-leclerc cyrille-leclerc marked this pull request as ready for review February 19, 2022 16:22
@cyrille-leclerc cyrille-leclerc merged commit 0397b31 into master Feb 20, 2022
@cyrille-leclerc cyrille-leclerc deleted the send-build-logs-to-otel branch February 20, 2022 15:26
Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any general feedback on how well JEP-210 worked for your use case?

if (rootPath.equals("job")) {
// e.g /job/my-war/job/master/lastBuild/console
// e.g /job/my-war/job/master/2/console
ParsedJobUrl parsedJobUrl = parseJobUrl(pathInfoTokens);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all seems like a maintenance burden. Did you consider using a supported API like https://javadoc.jenkins.io/component/stapler/org/kohsuke/stapler/StaplerRequest.html#getAncestors()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks I got trapped. I started adding HTTP request tracing to understand how the progressive rendering of logs worked in Jenkins and ended up this insane if then else.

I'll add a comment in the code to use this Stapler API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding HTTP request tracing to understand how the progressive rendering of logs worked in Jenkins

BTW -Dstapler.trace, automatic with mvn hpi:run.


/**
* Use the old `org.elasticsearch.client:elasticsearch-rest-high-level-client` waiting for
* `co.elastic.clients:elasticsearch-java` to fix https://github.com/elastic/elasticsearch-java/issues/163
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code has been refactored and cleaned up since then.

* Retrieve the logs from Elasticsearch.
* FIXME graceful shutdown
*/
public class ElasticsearchLogStorageRetriever implements LogStorageRetriever<ElasticsearchLogsQueryContext>, Closeable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deliberately committed as a *.txt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was struggling understanding the Jenkins progressive logs rendering APIs and kept temporarily some code as .txt, it has been removed since then.


import java.util.Objects;

public class ElasticsearchLogsQueryContext implements LogsQueryContext {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, removed since then


/**
* Utilities for extracting and reinserting {@link ConsoleNote}s.
* copied from https://github.com/jenkinsci/pipeline-cloudwatch-logs-plugin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

…and then modified from there. Candidate for a convenience API in workflow-api?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could be an idea. The helper in the Jenkins Otel Plugin changed since then but it can probably be useful for other. implementations of JEP 210

ConsoleNotes.TextAndAnnotations textAndAnnotations = ConsoleNotes.parse(bytes, len);
String plainLogLine = textAndAnnotations.text;
if (plainLogLine == null || plainLogLine.isEmpty()) {
LOGGER.log(Level.FINER, () -> buildInfo + " - skip empty log line");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be confusing if you actually had some tool run as part of a sh step which deliberately emitted blank lines as a visual separator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid point, I should have added a comment to explain why I ended removing these line breaks, it was probably caused by challenges I faced understanding Jenkins progressive logs rendering APIs

this.clock = Clock.getDefault();
} else {
long instantInNanosOnJenkinsAgentAtDeserialization = Clock.getDefault().now();
long offsetInNanosOnJenkinsAgent = instantInNanosOnJenkinsControllerBeforeSerialization - instantInNanosOnJenkinsAgentAtDeserialization;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this assume that the serialization, communication over Remoting, and deserialization are nearly instantaneous?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not instantaneous, I wanted monotonicity of the timestamps even when the clocks of the controller and of the build agents are different.
Don't you do something similar with the CloudWatch storage?

See https://github.com/jenkinsci/opentelemetry-plugin/blob/opentelemetry-2.9.0-rc1/src/main/java/io/jenkins/plugins/opentelemetry/job/log/OtelLogSenderBuildListener.java#L50

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted monotonicity of the timestamps

My point: let us say the two hardware clocks are actually perfectly synchronized (up to relativistic constraints, whatever). But the Remoting channel is extremely sluggish today for whatever reason, and it takes ten seconds to send a Command. Then offsetInNanosOnJenkinsAgent will be 10_000_000_000 or so. But in fact the correct behavior in this case would be to use zero offset, if you are trying to ensure that messages delivered to the external storage are accurately threaded.

In other words, if you want to calculate clock skew, you need to send a round-trip packet and adjust by half the time required to do so, or something a bit more complex like that. I think https://github.com/jenkinsci/jenkins/blob/38d4e86a80d123d50c966f4584a7e795b6bcb227/core/src/main/java/hudson/model/Slave.java#L732 does this.

Don't you do something similar with the CloudWatch storage?

No, io.jenkins.plugins.pipeline_cloudwatch_logs.TimestampTracker solves different problems:

  • ensuring that we do not mark the build log as complete until all messages known to have been sent from agents have actually arrived in the cloud and are available for reading (otherwise you would often lose the last few lines of a build, which tend to be the most critical if there was a failure)
  • making sure output from multiple steps running on a given agent is logically ordered (no attempt is made to order messages between the controller and an agent, or between two agents, in the face of clock skew)

import java.util.Properties;
import java.util.Random;

public class OpenTelemetryLogToElasticsearchIT {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*.txt rather than just @Ignore like the rest of the ITs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to cleanup all the test container based ITs.


import static org.junit.Assume.assumeTrue;

@Ignore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be the only IT of the JEP-210 implementation, if enabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, all tests are manual for the moment. It's not great but I didn't have the bandwidth to fix the integrated tests.

private Object writeReplace() throws IOException {
logger.log(Level.FINEST, () -> "writeReplace()");
JenkinsJVM.checkJenkinsJVM();
return new OtelLogSenderBuildListenerOnAgent(buildInfo, otelConfigProperties, otelResourceAttributes);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any need to pass minimal credentials to the agent replacement as the CloudWatch Logs implementation does, or does the telemetry endpoint somewhat automatically verify the origin of a message so that a malicious build would not be able to spoof output from unrelated jobs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otelConfigProperties contain the credentials of the OTLP endpoint. Does it answer your question?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It answers my question by affirming that this implementation is indeed insecure for some configurations. https://github.com/jenkinsci/pipeline-cloudwatch-logs-plugin#security

@cyrille-leclerc
Copy link
Contributor Author

THANK YOU very much for your code review @jglick .

Any general feedback on how well JEP-210 worked for your use case?

JEP-210 was quite hard to implement but to our knowledge, it works well.

The most difficult things were to understand:

  • The role of the org.kohsuke.stapler.framework.io.ByteBuffer in the hudson.console.AnnotatedLargeText produced by org.jenkinsci.plugins.workflow.log.LogStorage#overallLog . I struggled to not create an OOM
  • The way the progressive rendering of logs worked in Jenkins

@jglick
Copy link
Member

jglick commented Jul 19, 2022

AnnotatedLargeText is indeed a low-level API, and it is not well suited to the needs of JEP-210; workflow-api and/or pipeline-cloudwatch-logs wound up working around some of its limitations. I cannot recall details at this point but it would be better to make some changes in Stapler to better support non-local storage implementations.

Rendering of logs for running builds is not complete in JEP-210. Currently Stapler (and by extension Jenkins core, and also Blue Ocean) use long as a byte offset, which can only be efficiently implemented in a file-based implementation. Thus CloudWatchRetriever has to stream the complete text to a buffer, before potentially throwing away 99% of it (whatever long start says); since rendering is stateless, each HTTP request has to repeat the process, making for O(n²) work rendering a running build (unless you cache the log in a temporary file or something). A proper API would allow the implementation to use a String offset, which for CWL would be a FilterLogEventsResult.nextToken.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants