[ML] Possible race condition between multiple memoryTracker.isRecentlyRefreshed() calls #69289

Closed
droberts195 opened this issue Feb 19, 2021 · 1 comment · Fixed by #69290

While triaging #69276 I noticed a different problem that could theoretically be affecting production use of ML.

The master node log contains this section:

[2021-02-19T13:09:20,538][TRACE][o.e.x.m.j.p.JobResultsProvider] [v7.12.0-0] [ml-mappings-upgrade-job] Insufficient history to calculate established memory use
[2021-02-19T13:09:20,540][TRACE][o.e.x.m.j.p.JobResultsProvider] [v7.12.0-0] ES API CALL: search latest model_size_stats for job ml-snapshots-upgrade-job
[2021-02-19T13:09:20,543][DEBUG][o.e.c.c.PublicationTransportHandler] [v7.12.0-0] received diff cluster state version [846] with uuid [5cD6NFvOS0idmNd4tB96hw], diff size [560]
[2021-02-19T13:09:20,557][INFO ][o.e.x.m.a.TransportUpgradeJobModelSnapshotAction] [v7.12.0-0] [ml-snapshots-upgrade-job] [1613739961] sending start upgrade request
[2021-02-19T13:09:20,589][DEBUG][o.e.c.c.C.CoordinatorPublication] [v7.12.0-0] publication ended successfully: Publication{term=6, version=846}
[2021-02-19T13:09:20,592][DEBUG][o.e.x.m.j.JobNodeSelector] [v7.12.0-0] Falling back to allocating job [ml-snapshots-upgrade-job] by job counts because its memory requirement was not available

The log implies that two calls to memoryTracker.isRecentlyRefreshed() in different parts of the code, which are assumed to return the same value, actually returned different values: false in SnapshotUpgradeTaskExecutor.getAssignment and true in AbstractJobPersistentTasksExecutor.checkMemoryFreshness.

Unnecessarily falling back to assigning jobs by count rather than by memory is harmful: if memory is constrained, it could lead to ML native processes suffering OOM errors.
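
To make the suspected interleaving concrete, here is a minimal, deterministic sketch of the double-read pattern. It is not the Elasticsearch code: FakeMemoryTracker, RacyAssignment and the returned strings are hypothetical stand-ins, and the Runnable parameter simply simulates a refresh completing between the two reads.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the memory tracker; not the Elasticsearch class.
class FakeMemoryTracker {
    private final AtomicBoolean recentlyRefreshed = new AtomicBoolean(false);

    boolean isRecentlyRefreshed() {
        return recentlyRefreshed.get();
    }

    // In a real cluster a refresh would complete on another thread.
    void completeRefresh() {
        recentlyRefreshed.set(true);
    }
}

// Hypothetical assignment logic that reads the tracker state twice.
class RacyAssignment {
    static String assign(FakeMemoryTracker tracker, Runnable betweenCalls) {
        // First read: decides whether memory requirements can be used.
        boolean memoryKnown = tracker.isRecentlyRefreshed();

        // Simulates whatever happens between the two reads,
        // for example an asynchronous refresh finishing.
        betweenCalls.run();

        // Second read: assumed, wrongly, to match the first.
        if (tracker.isRecentlyRefreshed() == false) {
            return "deferred: waiting for memory tracker refresh";
        }
        return memoryKnown ? "assigned by memory" : "assigned by job count";
    }
}
```

If the refresh completes between the two reads, for example via `RacyAssignment.assign(tracker, tracker::completeRefresh)`, the first read sees false and the second sees true, so the sketch neither defers the assignment nor uses memory and ends up on the job-count path, which matches the pattern in the log above.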

droberts195 added >bug :ml Machine learning labels Feb 19, 2021
droberts195 self-assigned this Feb 19, 2021
elasticmachine added the Team:ML Meta label for the ML team label Feb 19, 2021
@elasticmachine

Pinging @elastic/ml-core (Team:ML)

droberts195 added a commit that referenced this issue Feb 19, 2021
This change fixes a race condition that can occur if the
return value of memoryTracker.isRecentlyRefreshed() changes
between two calls that are assumed to return the same value.
The solution is to just call the method once and pass that
value to the other place where it is needed.  Then all related
code makes decisions based on the same view of whether the
memory tracker has been recently refreshed or not.

Fixes #69289
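
The approach described in the commit message can be sketched roughly as follows. This is an illustration only, not the code from the actual change: StubMemoryTracker, SingleReadAssignment, checkFreshness, selectNode and the returned strings are all hypothetical names. The point is that the tracker is read once and the resulting boolean is passed to every step that needs it.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the memory tracker; not the Elasticsearch class.
class StubMemoryTracker {
    private final AtomicBoolean recentlyRefreshed = new AtomicBoolean(false);

    boolean isRecentlyRefreshed() {
        return recentlyRefreshed.get();
    }

    void completeRefresh() {
        recentlyRefreshed.set(true);
    }
}

// Hypothetical assignment logic that reads the tracker state exactly once.
class SingleReadAssignment {
    static String assign(StubMemoryTracker tracker, Runnable betweenSteps) {
        // Single read: every later decision uses this snapshot.
        final boolean recentlyRefreshed = tracker.isRecentlyRefreshed();

        // A refresh completing here can no longer split the decisions.
        betweenSteps.run();

        Optional<String> deferred = checkFreshness(recentlyRefreshed);
        if (deferred.isPresent()) {
            return deferred.get();
        }
        return selectNode(recentlyRefreshed);
    }

    // Takes the snapshot as a parameter instead of asking the tracker again.
    static Optional<String> checkFreshness(boolean recentlyRefreshed) {
        return recentlyRefreshed
            ? Optional.empty()
            : Optional.of("deferred: waiting for memory tracker refresh");
    }

    static String selectNode(boolean recentlyRefreshed) {
        return recentlyRefreshed ? "assigned by memory" : "assigned by job count";
    }
}
```

With a single snapshot, a refresh that completes mid-way leads to a deferred assignment rather than a job-count fallback, and the next attempt sees the refreshed state and can assign by memory.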
droberts195 added three more commits that referenced this issue Feb 22, 2021, each with the same commit message as above.