Whitepaper on cloud-native observability #16

Closed
AloisReitbauer opened this issue May 28, 2020 · 18 comments
Labels
documentation Improvements or additions to documentation

Comments

@AloisReitbauer

Goal: Support users in implementing observability and monitoring for their cloud native workloads.

Target: End users building cloud native applications

Scope: Define the basic concepts of data collection and analysis and how CNCF projects can be used for them. Maybe add 1-3 real-world reference examples.

Details:

  • Data collection: Logs, Metrics, Traces
    • Which data source to use for what
    • Examples with CNCF projects: Prometheus, Jaeger, OpenTelemetry, ...
  • Make your Kubernetes cluster - and the apps running on it - observable
  • Storage backends for data
    • Options based on CNCF projects
    • Enterprise requirements: HA, RBAC, ...
  • Data analysis patterns:
    • Alerting, anomaly detection, trace analytics, log analytics
@mhausenblas
Collaborator

That sounds like a really good idea to me and I'm happy to contribute to this.

@Vlaaaaaaad

+1 to this.

A good first step may be a landscape, just to have all "players", from open-source and cloud-native, all the way to hosted SaaS options.

@AloisReitbauer
Author

We would need to define the categories for the landscape first. Observability is too broad and abstract a term.

@fktkrt

fktkrt commented Aug 19, 2020

I am also happy to contribute to this.
Is there any reason to use categories other than those defined in the official CNCF landscape's Observability & Analysis category?
Monitoring, logging and tracing cover most aspects of the term. Chaos Engineering is a bit of an outlier, but it could also be included if this SIG wants to cover it.

@danielkhan

@mhausenblas, @sferlin, @ArthurSens, and @danielkhan met today to agree on a rough outline and schedule.

We plan to have the draft ready for review by January 19th and a final version in February.
@ArthurSens already kicked off a Google Doc, and we will continue to work in it.

@ArthurSens
Contributor

I've finished reorganizing the sections. I added several comments to the doc that briefly explain the newly added sections.
@mhausenblas @sferlin @danielkhan (or anyone else interested), feel free to review and to ping me on Slack if anything is unclear.

@ArthurSens
Contributor

It's worth mentioning that the organization of the sections was inspired by the whitepaper that SIG-Security is working on.

@ArthurSens
Contributor

"Goals" and "Target Audience" ready for review :)

@ArthurSens
Contributor

"Introduction" ready for review 🙂

@rakyll

rakyll commented Mar 12, 2021

A couple of thoughts...

  • The current structure puts a lot of emphasis on the different signals (metrics, traces, logs and more) but does little to explain how observability goals change the way you collect these signals. For example, you collect things consistently with the same labels (dimensions) so that you can later narrow down the data by filtering on those labels.
  • Context propagation is a critical component of observability, from the perspective of both propagated labels and trace/request context. The paper may need a section dedicated to context (both in-process and on the wire).
  • The paper should include a high-level section on what observability enables. For example, response time to an incident can be improved from hours to minutes. Observability can help onboard new engineers in relatively large systems. It can contribute to the efficiency of your infrastructure and help you optimize (not just performance-wise but also cost-wise). It can help identify malicious activity and security attacks. A section listing the various things observability can contribute to would be great.
  • The paper could also have a more compelling section on the software lifecycle and how observability fits into it. From development, to deployment, to production, observability makes unique contributions. A high-level overview would be great to capture.
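
To illustrate the first point about consistent labels, here is a minimal sketch in plain Python. It is not taken from the whitepaper or from any CNCF project; the `Signal` class, the `emit` and `select` helpers, and all label names and values are hypothetical. It only shows the idea that every signal carries the same dimensions, so a single filter narrows down metrics, logs, and traces together.

```python
from dataclasses import dataclass, field

# Hypothetical label set that every signal from this service carries.
COMMON_LABELS = {"service": "checkout", "region": "eu-west-1"}


@dataclass
class Signal:
    kind: str  # "metric", "log", or "trace"
    name: str
    labels: dict = field(default_factory=dict)


def emit(kind, name, **extra):
    """Create a signal that always includes the common labels."""
    return Signal(kind, name, {**COMMON_LABELS, **extra})


def select(signals, **wanted):
    """Return the signals whose labels match every wanted key/value pair."""
    return [
        s for s in signals
        if all(s.labels.get(k) == v for k, v in wanted.items())
    ]


signals = [
    emit("metric", "http_requests_total", route="/pay"),
    emit("log", "payment declined", route="/pay"),
    emit("trace", "POST /pay", route="/pay"),
    emit("metric", "http_requests_total", route="/cart"),
]

# One filter on shared labels returns related data across all signal types.
pay_signals = select(signals, service="checkout", route="/pay")
```

Because the labels are identical across signal types, the same `route="/pay"` filter that selects the metric also finds the matching log line and trace.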

@danielkhan

Thank you for the feedback, Jaana.

  • Context propagation is a critical component of observability, from the perspective of both propagated labels and trace/request context. The paper may need a section dedicated to context (both in-process and on the wire).

I could not participate as things got unexpectedly busy, but I can take a stab at this within the next two weeks if no one else is already working on it.
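
As a rough sketch of what the quoted point means by context "in-process and on the wire", the following Python example keeps a trace ID in a `contextvars.ContextVar` (in-process) and carries it between services in a `traceparent` header laid out as in the W3C Trace Context format (version, trace ID, parent span ID, flags). The `inject` and `extract` helpers are hypothetical names chosen for this illustration; real instrumentation libraries such as OpenTelemetry provide their own propagators.

```python
import contextvars
import secrets

# In-process context: the trace ID of the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def inject(headers):
    """Write the current trace context into outgoing headers (on the wire)."""
    trace_id = current_trace_id.get() or secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    # Layout: version - trace ID (32 hex) - parent span ID (16 hex) - flags.
    headers["traceparent"] = f"00-{trace_id}-{span_id}-01"
    return headers


def extract(headers):
    """Read the trace context from incoming headers and make it current."""
    _version, trace_id, parent_id, _flags = headers["traceparent"].split("-")
    current_trace_id.set(trace_id)
    return trace_id, parent_id


# Service A starts a trace and prepares a call to service B.
current_trace_id.set(secrets.token_hex(16))
sent_trace_id = current_trace_id.get()
outgoing = inject({})

# Service B (simulated in the same process) picks up the propagated context.
received_trace_id, parent_span_id = extract(outgoing)
```

Both services now attach the same trace ID to whatever they record, which is what allows their signals to be correlated later.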

@halcyondude added the documentation (Improvements or additions to documentation) label May 12, 2021
@halcyondude
Collaborator

Coordination and collaboration are happening in a Google Doc and on Slack; this issue is used for high-level tracking.

@dominick-blue
Contributor

Hi everyone,

I would love to contribute to this. I see the last actions were taken in April 2021. Where is the whitepaper in the process?

@bwplotka
Collaborator

Hi all, hope you are all great!

Thanks everybody for the help and feedback. It's finally time to claim the v1.0 version of the whitepaper we started a long time ago!

To do so, we have to apply the final touches and review it. For that, we created this document, which is now open for review. The aim is to have the final version of the whitepaper, with TODOs addressed, done by 1st August. After that we will share it more widely, save it as the official v1 and open a WIP version for v1.1, so we can continuously evolve it with the content and updates the community brings over the next months! (:

The final review document holds the latest whitepaper version (copied from this version), available for collaborative review and for addressing TODOs. Feel free to add comments & suggestions. Bigger suggestions (e.g. further details or new sections) we might move into new GH issues for whitepaper v1.1. The doc also outlines all the feedback items we have received so far. Feel free to help with any of those TODOs and, generally, with the final review of this paper 🤗

Thanks! 💪🏽

@larryck

larryck commented Jul 13, 2023

Is "Make your Kubernetes cluster - and the apps running on it - observable", mentioned above, still a goal of the observability whitepaper?

@bwplotka
Collaborator

bwplotka commented Aug 1, 2023

Is "Make your Kubernetes cluster - and the apps running on it - observable", mentioned above, still a goal of the observability whitepaper?

Not sure; it feels like a separate tutorial might be useful here (we could then link to it from the whitepaper). Happy to be told otherwise here (:

Added #131 to track that particular idea.

@bwplotka
Collaborator

bwplotka commented Aug 1, 2023

All the points here were either addressed in the final 1.0 review period or added as TODOs (help wanted!) and tracked in individual issues with the cn-o11y-whitepaper-v1.1 label.

A PR with the changes from the review period to the main branch will follow. Closing for now; feel free to keep the discussion going, ideally in a separate issue so it's easier to track and address 💪🏽

Thanks everybody for the epic work on reviewing, suggesting and contributing!

@bwplotka closed this as completed Aug 1, 2023
@bwplotka
Collaborator

bwplotka commented Aug 1, 2023

See: #132

bwplotka referenced this issue Oct 17, 2023
…ion period. (#132)

* whitepaper: Syncing changes from the 1.0 community review & contribution period.

Thanks everyone for the amazing feedback! Apologies for the rather short period,
but the paper had been sitting for 2y without changes, so it made sense to
time-box 1.0 and allow structured work on further iterations.

Still, within [this community review & contribution period](https://docs.google.com/document/d/19am_KCYWU28ebLiIXv_P3ji96edxCTscVb4CzemXV5A/edit)
I counted 67 individual contributions (count of comment/suggestion bubbles, excluding my own and not counting individual discussion comments)
from 7 new contributors.

High level changes (suggested by community, but also clean up by me):

* Note on aggregatability and volume of metrics.
* Added non goals
* Changing example from temperature to memory gauge
* Add reasons for metric efficiency
* Added info about cardinality (new section)
* Mentioned metric data models
* Added info about types
* Metric time series vs count
* Addressing feedback on logs, traces
* Adding profile screenshot
* Cleaning up, simplifying correlation section
* Removing how to set up Prometheus with exemplars
* Transitions
* Box based monitoring refactor - changing "closed box" traditional
* Clarified SLO/SLA
* Added image and figure captions
* More automatic and non-intrusive instrumentation solutions in OSS
* Linking eBay paper
* Added gap around streaming API, not enough DBs and standardized query language
* Did a Grammarly pass for typos.

...and more.

As mentioned in [the doc](https://docs.google.com/document/d/19am_KCYWU28ebLiIXv_P3ji96edxCTscVb4CzemXV5A/edit) I went through
all additional and old feedback. It's now either addressed in this PR or added as
todos in [separate issues](https://github.com/cncf/tag-observability/issues?q=is%3Aissue+is%3Aopen+label%3Acn-o11y-whitepaper-v1.1)

I admit, it was fun to process that doc! It reminds me of a year ago when I was, fully focused, writing my [book](https://www.oreilly.com/library/view/efficient-go/9781098105709/) on a slightly different topic.


Signed-off-by: bwplotka <[email protected]>

* Added Jaana and Alois as contributors.

Thanks for your ideas and feedback in https://github.com/cncf/tag-observability/issues/16

Signed-off-by: bwplotka <[email protected]>

* Apply suggestions from Richi's code review

Co-authored-by: RichiH-travel <[email protected]>
Signed-off-by: Bartlomiej Plotka <[email protected]>

* Apply suggestions from code review

Co-authored-by: RichiH-travel <[email protected]>
Signed-off-by: Bartlomiej Plotka <[email protected]>

* Fixed references, added tip version.

Signed-off-by: bwplotka <[email protected]>

---------

Signed-off-by: bwplotka <[email protected]>
Signed-off-by: Bartlomiej Plotka <[email protected]>
Co-authored-by: RichiH-travel <[email protected]>
nikolaev-rd referenced this issue in nikolaev-rd/tag-observability Nov 13, 2023
…ion period. (cncf#132)

Signed-off-by: Roman Nikolaev <[email protected]>