Failure Store: Update go-docappender to respect failure store status #228

Open
wants to merge 21 commits into
base: main

1pkg
Member

@1pkg 1pkg commented Jan 29, 2025

Description

This PR updates go-docappender to respect the new failure store response status and emit corresponding new metrics. The old "indexed" metrics stay intact; instead, a new, separate set of failure store labels is exposed.

This PR depends on changes made for BulkIndexerResponseItem in elastic/go-elasticsearch#948 and should only be merged afterwards.

How to test:

Use an Elasticsearch instance that has the failure store feature enabled, and enable the failure store for a data stream via a component template:

POST _component_template/$name
{
  "template": {
    "data_stream_options": {
      "failure_store": {
        "enabled": true
      }
    }
  }
}

Set up a custom "fail" ingest pipeline:

PUT _ingest/pipeline/$name
{
  "processors": [
    {
      "fail": {
        "message": "fail"
      }
    }
  ]
}

Ingest some data into the corresponding data stream, then check that the failure store metrics are reported correctly.

To emulate the "failed" status, set the backing data stream index to read-only:

PUT /$index/_settings
{
  "index.blocks.read_only": true
}
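
As a rough sketch, the three setup calls above could be assembled in Go as follows. The base URL and the template, pipeline, and index names are placeholders standing in for `$name` and `$index`; the helper names (`esRequest`, `setupRequests`, `build`) are hypothetical and not part of go-docappender. The requests are only built and printed here, not sent.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// esRequest describes one setup call from the test steps (hypothetical helper type).
type esRequest struct {
	Method, Path, Body string
}

// setupRequests returns the three calls from the "How to test" steps.
// The template, pipeline, and index arguments are placeholder names.
func setupRequests(template, pipeline, index string) []esRequest {
	return []esRequest{
		// Enable the failure store through a component template.
		{http.MethodPost, "/_component_template/" + template,
			`{"template":{"data_stream_options":{"failure_store":{"enabled":true}}}}`},
		// Ingest pipeline whose only processor always fails.
		{http.MethodPut, "/_ingest/pipeline/" + pipeline,
			`{"processors":[{"fail":{"message":"fail"}}]}`},
		// Make the index read-only to emulate the "failed" status.
		{http.MethodPut, "/" + index + "/_settings",
			`{"index.blocks.read_only":true}`},
	}
}

// build turns a description into an *http.Request against baseURL.
func build(baseURL string, r esRequest) (*http.Request, error) {
	req, err := http.NewRequest(r.Method, baseURL+r.Path, strings.NewReader(r.Body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// Assumed local Elasticsearch address; adjust for a real cluster.
	const baseURL = "http://localhost:9200"
	for _, r := range setupRequests("my-template", "my-fail-pipeline", "my-index") {
		req, err := build(baseURL, r)
		if err != nil {
			panic(err)
		}
		fmt.Println(req.Method, req.URL.String())
	}
}
```

Sending each request with an `http.Client` (plus whatever authentication the cluster requires) would complete the setup.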

@1pkg 1pkg self-assigned this Jan 29, 2025
@1pkg 1pkg added the enhancement New feature or request label Jan 29, 2025
@elastic-observability-automation elastic-observability-automation bot added the safe-to-test Automated label for running bench-diff on forked PRs label Jan 29, 2025
@1pkg 1pkg marked this pull request as ready for review January 31, 2025 01:17
@1pkg 1pkg requested a review from a team as a code owner January 31, 2025 01:17
Member

@axw axw left a comment

Looks good! Just one question

@axw
Member

axw commented Jan 31, 2025

One more question, sorry: should we be measuring "not_enabled" too?

@1pkg
Member Author

1pkg commented Jan 31, 2025

> One more question, sorry: should we be measuring "not_enabled" too?

Agree, this metric could be useful too. Updated to expose it in the last commit.

@1pkg 1pkg requested review from axw and a team January 31, 2025 19:46
@1pkg 1pkg requested a review from marclop February 12, 2025 18:12
@1pkg 1pkg requested a review from kruskall February 13, 2025 22:34
Contributor

@marclop marclop left a comment

I think this looks good overall, just one small question

Comment on lines +117 to +126
// FailureStore contains failure store specific stats.
FailureStore struct {
	// Used contains the total number of documents indexed to failure store.
	Used int64
	// Failed contains the total number of documents which failed when indexed to failure store.
	Failed int64
	// NotEnabled contains the total number of documents which could have been indexed to failure store
	// if it was enabled.
	NotEnabled int64
}
Contributor

Is it worth adding the FailureStore struct in the legacy stats metrics?

Member Author

Correct me if I'm wrong, but as I understand it, the legacy stats metrics reside in https://github.com/elastic/go-docappender/blob/main/appender.go#L65, which I already reverted in 48f9d85.

This struct, by contrast, is just a generic bulk response container used to pass data to the appender for reporting OTel metrics, etc.
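
To illustrate the "generic bulk response container" role, here is a minimal sketch of how per-item failure store statuses could be tallied into the struct quoted above. It assumes each bulk response item carries a `failure_store` string with the values `used`, `failed`, or `not_enabled` (the values discussed in this thread); the `bulkItem` type, the `tally` function, and the sample status codes are hypothetical and do not reproduce the actual go-docappender code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FailureStore mirrors the struct quoted in the review comment.
type FailureStore struct {
	Used       int64 // documents indexed to the failure store
	Failed     int64 // documents that failed when indexed to the failure store
	NotEnabled int64 // documents that could have used the failure store if enabled
}

// bulkItem is a hypothetical, trimmed-down bulk response item.
// The "failure_store" field name is an assumption for this sketch.
type bulkItem struct {
	Status       int    `json:"status"`
	FailureStore string `json:"failure_store"`
}

// tally counts items per failure store status; items without a
// recognized status are left out of all three counters.
func tally(items []bulkItem) FailureStore {
	var fs FailureStore
	for _, it := range items {
		switch it.FailureStore {
		case "used":
			fs.Used++
		case "failed":
			fs.Failed++
		case "not_enabled":
			fs.NotEnabled++
		}
	}
	return fs
}

func main() {
	// Made-up sample response items, for illustration only.
	raw := `[
		{"status": 201, "failure_store": "used"},
		{"status": 400, "failure_store": "not_enabled"},
		{"status": 201},
		{"status": 500, "failure_store": "failed"},
		{"status": 201, "failure_store": "used"}
	]`
	var items []bulkItem
	if err := json.Unmarshal([]byte(raw), &items); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", tally(items))
}
```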

@1pkg 1pkg requested a review from marclop February 14, 2025 03:10
@1pkg 1pkg enabled auto-merge (squash) February 14, 2025 03:10
a.addCount(failureStore.Used, nil,
	a.metrics.docsIndexed,
	metric.WithAttributes(
		attribute.String("status", "FailureStore"),

I'm not sure how the status: "FailureStore" is supposed to be used, e.g. when calculating SLOs and being interested in successful vs. failed documents.
If I understand it correctly, the failure_store: used documents would be counted as successes, whereas failure_store: failed and failure_store: not_enabled would be counted as failures?

Member Author

My idea is that the existing SLOs stay untouched; with the failure store enabled they should, in theory, always show 100% good events. Separately, a new set of SLOs can be created to track the failure store error rate and the number of documents indexed to the failure store vs. all documents.
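
A rough sketch of the two additional ratios mentioned here, under stated assumptions: the failure store error rate is taken as failed / (used + failed), and the failure store share as used / total documents. Neither definition is specified in this thread, and the `docCounts` type and its fields are hypothetical.

```go
package main

import "fmt"

// docCounts holds hypothetical document counts for one reporting interval.
type docCounts struct {
	Total              int64 // all documents sent (assumed to be tracked elsewhere)
	FailureStoreUsed   int64 // documents redirected to the failure store
	FailureStoreFailed int64 // documents that also failed in the failure store
}

// ratio returns num/den, or 0 when the denominator is zero.
func ratio(num, den int64) float64 {
	if den == 0 {
		return 0
	}
	return float64(num) / float64(den)
}

// failureStoreErrorRate is the assumed definition failed / (used + failed).
func (c docCounts) failureStoreErrorRate() float64 {
	return ratio(c.FailureStoreFailed, c.FailureStoreUsed+c.FailureStoreFailed)
}

// failureStoreShare is the assumed definition used / total.
func (c docCounts) failureStoreShare() float64 {
	return ratio(c.FailureStoreUsed, c.Total)
}

func main() {
	// Made-up counts, for illustration only.
	c := docCounts{Total: 1000, FailureStoreUsed: 45, FailureStoreFailed: 5}
	fmt.Printf("failure store error rate: %.3f\n", c.failureStoreErrorRate())
	fmt.Printf("failure store share:      %.3f\n", c.failureStoreShare())
}
```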


sounds good, thanks

Failed int64
// NotEnabled contains the total number of documents which could have been indexed to failure store
// if it was enabled.
NotEnabled int64

The already reported stats are not very consistent (some of them contain the term docs, others don't - e.g. Indexed vs RetriedDocs, FailedDocs).
However, when reading FailureStore.Used, FailureStore.Failed and FailureStore.NotEnabled, it's not necessarily clear from the naming whether this would count the number of documents or batch requests. Is it too verbose to add Docs?

Member Author

I see no problem adding docs to the names. Which makes more sense: FailureStoreDocs.Used, FailureStoreDocs.Failed, FailureStoreDocs.NotEnabled, or alternatively FailureStore.UsedDocs, FailureStore.FailedDocs, FailureStore.NotEnabledDocs?


I'm slightly in favour of FailureStoreDocs.Used, but no strong opinion.
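
For comparison, a small sketch of how the two naming options from this thread would read at a call site. The wrapper types `statsA` and `statsB` are hypothetical and exist only to show both spellings side by side; they are not the types in go-docappender.

```go
package main

import "fmt"

// statsA uses option A: "Docs" lives in the nested struct name.
type statsA struct {
	FailureStoreDocs struct {
		Used, Failed, NotEnabled int64
	}
}

// statsB uses option B: "Docs" is a suffix on each field.
type statsB struct {
	FailureStore struct {
		UsedDocs, FailedDocs, NotEnabledDocs int64
	}
}

func main() {
	var a statsA
	var b statsB

	// The same counter under each naming scheme.
	a.FailureStoreDocs.Used = 3
	b.FailureStore.UsedDocs = 3

	fmt.Println(a.FailureStoreDocs.Used, b.FailureStore.UsedDocs)
}
```

Both spellings make it explicit that the counters track documents rather than bulk requests; option A states the unit once, option B repeats it per field.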

5 participants