[ML] Flawed logic for when autodetect writes state/quantiles on graceful close #393

droberts195 · 2019-02-13T12:39:49Z

When the autodetect process is shut down gracefully it does the following:

Writes the latest quantiles to its output unconditionally.
Persists state if at least one input record was observed while it was running.

Both these conditions have problems:

If a job is opened and then closed with nothing in between, it unnecessarily writes quantiles on close, causing a renormalization on the Java side. This leads to problems like elastic/elasticsearch#30300 ("[CI] Had to resort to force-closing job, something went wrong?").
If a job is opened, time is advanced so that results are generated, and the job is then closed, it does not persist state, and so loses the knowledge that time was advanced over a period in which there was no input data.

Both quantiles and state should be output on graceful close if and only if one or more of the following conditions is true:

An input record has been processed.
Time has been advanced.

Once this is fixed, state will be output more eagerly than it is now and quantiles less eagerly.
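The difference between the current and the proposed rule can be sketched as a pair of predicates. This is only an illustration of the conditions listed above, not code from autodetect: the `SessionActivity` struct and all function names are hypothetical.

```cpp
// Hypothetical record of what happened while the process was running.
// None of these names come from the ml-cpp code base.
struct SessionActivity {
    bool recordProcessed = false; // at least one input record was processed
    bool timeAdvanced = false;    // time was advanced at least once
};

// Behaviour described in this issue before the fix:
// quantiles are written unconditionally on graceful close.
bool writesQuantilesBeforeFix(const SessionActivity&) {
    return true;
}

// Behaviour described in this issue before the fix:
// state is persisted only if an input record was seen.
bool persistsStateBeforeFix(const SessionActivity& activity) {
    return activity.recordProcessed;
}

// Proposed rule: both state and quantiles are output on graceful close
// if and only if a record was processed or time was advanced.
bool shouldPersistOnClose(const SessionActivity& activity) {
    return activity.recordProcessed || activity.timeAdvanced;
}
```

For a job that is opened and immediately closed, `shouldPersistOnClose` is false while `writesQuantilesBeforeFix` is true (quantiles become less eager). For a job where only time was advanced, `shouldPersistOnClose` is true while `persistsStateBeforeFix` is false (state becomes more eager).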


Changed the logic surrounding persistence of both state and quantiles on graceful shutdown so that persistence occurs if and only if at least one input record has been processed or time has been advanced. closes elastic#393

Additional checks have been added to exercise the behaviour of persistence on graceful close of an anomaly job. In particular:
- check that persistence does not occur for a job that is opened and then immediately closed, with nothing else having happened.
- check that persistence occurs on graceful close of a job if it has processed data.
- check that persistence occurs subsequent to time being manually advanced, even if no additional data has been seen by the job.
- check the edge case where persistence occurs if a job is opened, time is manually advanced and then the job is closed, having seen no data.

Related to elastic/ml-cpp#393

Additional checks to exercise the behaviour of persistence on graceful close of an anomaly job. Related to elastic/ml-cpp#393 Backports #40272

The collection of static program counters is cached prior to persistence. This gives the background persistence thread access to a consistent set of counters while they are being written. Because the program counters should be persisted only once for each model state snapshot, their persistence, and the clearing of the cache, are coupled to the persistence of the simple count detector, which is assumed to always exist. However, there is a scenario where persistence operates on an empty collection of detectors. This occurs when no data has been seen but time has advanced (see elastic#393 for more details). In this situation the program counter cache is populated but not cleared, and a subsequent persistence operation leads to a warning that the counter cache is being overwritten. To avoid the warning message, this PR ensures that the program counter cache is always cleared at the end of the persistence operation, whether or not the operation succeeds.
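One way to guarantee that a cache is cleared on every exit path is a small RAII guard. The following is a minimal sketch of that general technique under stated assumptions, not the change made in the PR: `CounterCache`, `CacheClearer` and `persistSnapshot` are hypothetical names, and the detector loop is a placeholder.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Counters = std::map<std::string, std::uint64_t>;

// Hypothetical stand-in for the program counter cache.
class CounterCache {
public:
    // Copies the live counters into the cache. An overwrite of a non-empty
    // cache is counted, mimicking the warning described above.
    void populate(const Counters& live) {
        if (!m_Cache.empty()) {
            ++m_OverwriteWarnings;
        }
        m_Cache = live;
    }
    void clear() { m_Cache.clear(); }
    bool empty() const { return m_Cache.empty(); }
    std::size_t overwriteWarnings() const { return m_OverwriteWarnings; }

private:
    Counters m_Cache;
    std::size_t m_OverwriteWarnings = 0;
};

// RAII guard: clears the cache when it goes out of scope, whichever way
// the enclosing function exits.
class CacheClearer {
public:
    explicit CacheClearer(CounterCache& cache) : m_Cache(cache) {}
    ~CacheClearer() { m_Cache.clear(); }
    CacheClearer(const CacheClearer&) = delete;
    CacheClearer& operator=(const CacheClearer&) = delete;

private:
    CounterCache& m_Cache;
};

// Placeholder persistence routine. Returns the number of detectors visited.
// The cache is cleared even when the detector collection is empty.
std::size_t persistSnapshot(CounterCache& cache,
                            const Counters& live,
                            const std::vector<std::string>& detectors) {
    cache.populate(live);
    CacheClearer clearer(cache);
    std::size_t visited = 0;
    for (const auto& detector : detectors) {
        (void)detector; // a real implementation would write state here
        ++visited;
    }
    return visited;
}
```

With this arrangement, two consecutive calls to `persistSnapshot` with an empty detector list leave the cache empty each time, so the second call does not register an overwrite.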

Backport of the change above. Backports #1774

droberts195 added >bug v7.0.0 :ml v6.7.0 v8.0.0 v7.2.0 labels Feb 13, 2019

droberts195 mentioned this issue Feb 13, 2019

[CI] Had to resort to force-closing job, something went wrong? elastic/elasticsearch#30300

Closed

edsavage self-assigned this Feb 25, 2019

edsavage mentioned this issue Mar 8, 2019

[ML] Improve autodetect logic for persistence #437

Merged

edsavage closed this as completed in #437 Mar 18, 2019

edsavage mentioned this issue Mar 18, 2019

[6.7][ML] Improve autodetect logic for persistence (#437) #440

Merged

edsavage mentioned this issue Mar 20, 2019

[ML] Add integration tests to check persistence elastic/elasticsearch#40272

Merged

edsavage mentioned this issue Mar 21, 2019

[ML][TEST] Add integration tests to check persistence (#40272) elastic/elasticsearch#40315

Merged

This was referenced Mar 21, 2019

[ML][TEST] Add integration tests to check persistence (#40272) elastic/elasticsearch#40316

Merged

[ML][TEST] Add integration tests to check persistence (#40272) elastic/elasticsearch#40317

Merged

droberts195 mentioned this issue Jun 26, 2019

[ML] Model should defend against times earlier than epoch 0 #394

Closed

edsavage mentioned this issue Feb 25, 2021

[ML] Ensure program counters cache always cleared #1774

Merged

edsavage mentioned this issue Mar 1, 2021

[7.x][ML] Ensure program counters cache always cleared (#1774) #1778

Merged
