Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Data frame GET _stats response is confusing #43767

Closed
sophiec20 opened this issue Jun 28, 2019 · 5 comments · Fixed by #44350
Closed

[ML] Data frame GET _stats response is confusing #43767

sophiec20 opened this issue Jun 28, 2019 · 5 comments · Fixed by #44350
Assignees
Labels

Comments

@sophiec20
Copy link
Contributor

Found in 7.3.0-SNAPSHOT

Continuous data frame GET _stats returns the following. Hopefully we can make this response a little less confusing:

  • indexer_state - a value of started means it is idle whereas indexing means it is either searching or indexing. This is not precise, but is inherited from rollups so might be best to leave as is. Also having two states is confusing.
  • checkpoint - this is the last known completed (current) checkpoint. This could be confused with the currently underway checkpoint.
  • progress - this is the progress of the currently underway checkpoint. By calling it progress and leaving at the top level, it gives a false impression that it is somehow indicative of the whole transform. percent_complete indicates the progress of a single checkpoint. With batch data frames there is only ever 1 checkpoint, so the other values make sense. However for continuous the total_docs and doc_remaining should ideally be reset. progress could be renamed to checkpoint_progress or combined with the checkpointing info to keep together.
  • current_position - this is the position of the cursor for the composite agg and will only be visible whilst the composite agg search is scrolling. This is context for the currently underway checkpoint. If a composite agg is not in progress, then this entire object is missing. A small nit, but its sporadic existence is weird.
  • checkpointing
    • we show both timestamp_millis and time_upper_bound_millis where the latter is timestamp_millis - sync.delay. Do we need both?
    • current refers to the current completed checkpoint whereas in_progress refers to the currently underway checkpoint. This is confusing in conjunction with progress. Perhaps we could just keep upper and lower bound?
    • in_progress sometimes does not exist.
    • Should progress and current_position and maybe indexer_state sit here?
  • Many of these stats refer to the next checkpoint checkpoint: 101 - this is not clear

To summarise the priority points,

  • progress.total_docs progress.docs_remaining and is incorrect for continuous. This is checkpoint_progress for the next checkpoint.
  • there is slightly confusing usage of the terms current* and *progress which may lead to confusion when trying to operationally manage and/or troubleshoot.
{
  "count" : 1,
  "transforms" : [
    {
      "id" : "sycn1844",
      "state" : {
        "task_state" : "started",
        "indexer_state" : "indexing",
        "current_position" : {
          "hashtag" : "abcd1234"
        },
        "checkpoint" : 100,
        "progress" : {
          "total_docs" : 1900883,
          "docs_remaining" : 1722762,
          "percent_complete" : 9.370434687458408
        }
      },
      "stats" : {
        "pages_processed" : 559,
        "documents_processed" : 4207753,
        "documents_indexed" : 278783,
        "trigger_count" : 2,
        "index_time_in_ms" : 14467,
        "index_total" : 558,
        "index_failures" : 0,
        "search_time_in_ms" : 284161,
        "search_total" : 559,
        "search_failures" : 0
      },
      "checkpointing" : {
        "current" : {
          "timestamp" : "2019-06-28T16:44:12.497Z",
          "timestamp_millis" : 1561740252497,
          "time_upper_bound" : "2019-06-28T16:43:12.497Z",
          "time_upper_bound_millis" : 1561740192497
        },
        "in_progress" : {
          "timestamp" : "2019-06-28T16:50:29.172Z",
          "timestamp_millis" : 1561740629172,
          "time_upper_bound" : "2019-06-28T16:49:29.172Z",
          "time_upper_bound_millis" : 1561740569172
        },
        "operations_behind" : 27000
      }
    }
  ]
}
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

@jpountz jpountz added v7.4.0 and removed v7.3.0 labels Jul 3, 2019
@hendrikmuhs
Copy link

We discussed the format and came up with the following new design:

{
  "count": 1,
  "transforms": [ {
      "id": "sycn1844",
       # removed the state object 
      "task_state": "started",
      "node" : {
          "id" : "zMfxBopWSkOIdU0EVdIYVg",
          "name" : "xyz",
          "ephemeral_id" : "BijLWKBHQ-iIWH5sFlvLzA",
          "transport_address" : "127.0.0.1:9300",
          "attributes" : { }
        },
      "stats": {
          "pages_processed": 559,
          "documents_processed": 4207753,
          "documents_indexed": 278783,
          "trigger_count": 2,
          "index_time_in_ms": 14467,
          "index_total": 558,
          "index_failures": 0,
          "search_time_in_ms": 284161,
          "search_total": 559,
          "search_failures": 0
      },
      "checkpointing": {
          # renamed current -> “last”
          # renamed in_progress -> “next”
          "last": {
              # moving the checkpoint here so it’s better understood
              "checkpoint": 100,
              "timestamp": "2019-06-28T16:44:12.497Z",
              "timestamp_millis": 1561740252497,
              "time_upper_bound": "2019-06-28T16:43:12.497Z",
              "time_upper_bound_millis": 1561740192497
          },
          "next": {
              # not surprising what the next cp is, but users hopefully better 
              # understand the concept of checkpointing
              "checkpoint": 101,
              # moved indexer state here
              "indexer_state": "indexing",
              "current_position": {
                  "hashtag": "abcd1234"
              },
              # progress is only about the next checkpoint
              "checkpoint_progress": {
                  "total_docs": 1900883,
                  "docs_remaining": 1722762,
                  "percent_complete": 9.370434687458408
              },
              "timestamp": "2019-06-28T16:50:29.172Z",
              "timestamp_millis": 1561740629172,
              "time_upper_bound": "2019-06-28T16:49:29.172Z",
              "time_upper_bound_millis": 1561740569172
          },
          "operations_behind": 27000
      }
    }
  ]
}

@sophiec20 sophiec20 added v7.3.0 and removed v7.4.0 labels Jul 4, 2019
@sophiec20
Copy link
Contributor Author

elastic/kibana#40378 Note that if we want to show progress to the user in the UI, then checkpointing.last.time_upper_bound should always be present and not just whilst "indexer_state": "indexing" (with possible exception for checkpoint zero).

@droberts195
Copy link
Contributor

droberts195 commented Jul 18, 2019

# progress is only about the next checkpoint

This turns out to be quite a far-reaching change. For checkpoints that are processed quickly it means that progress is only available for a very short time before that "next" checkpoint is completed and moves to "last".

One effect is that integration tests cannot really assert on progress any more, because depending on OS scheduling there might not be any progress field in the stats by the time the test asks for it - search for TODO progress is now checkpoint progress and it may be that no checkpoint is in progress here in the 8th commit of #44350 to see the effect on our tests. Given that there is a need to change progress again due to #44557 I'm inclined to leave these TODOs in #44350 and they can be revisited in the fix for #44557.

@droberts195
Copy link
Contributor

# moved indexer state here

In terms of test assertions this also suffers from the problem mentioned in the previous comment of:

checkpoints that are processed quickly

In the 10th commit of #44350 I had to remove all the YAML test assertions on indexer_state for this reason.

droberts195 added a commit that referenced this issue Jul 23, 2019
This change adjusts the data frame transforms stats
endpoint to return a structure that is easier to
understand.

This is a breaking change for clients of the data frame
transforms stats endpoint, but the feature is in beta so
stability is not guaranteed.

Closes #43767
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants