fix: time out of range #409

Merged Jul 11, 2023 (3 commits)

Contributor

@fudongyingluck fudongyingluck commented Jun 30, 2023

We found that when the current time is later than nextExecutionTime, constructing the TimeValue passed to threadPool.schedule throws an IllegalArgumentException, as shown below:

java.lang.IllegalArgumentException: duration cannot be negative, was given [-2965077933106]
        at org.elasticsearch.common.unit.TimeValue.<init>(TimeValue.java:52) ~[elasticsearch-core-7.10.2.jar:7.10.2]
        at com.amazon.opendistroforelasticsearch.jobscheduler.scheduler.JobScheduler.reschedule(JobScheduler.java:190) ~[?:?]
        at com.amazon.opendistroforelasticsearch.jobscheduler.scheduler.JobScheduler.lambda$reschedule$0(JobScheduler.java:177) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) ~[elasticsearch-7.10.2.jar:7.10.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]

After that, the job is never scheduled again.
This change fixes the problem by setting nextExecutionTime to the current time.
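
As a rough illustration of the idea, the sketch below computes the delay until nextExecutionTime and clamps it to zero when that time has already passed, which has the same effect as treating the current time as the next execution time. The class name `RescheduleDelay` and the method `delayUntil` are hypothetical and are not taken from the plugin; this is only a minimal sketch using `java.time`, not the code merged in this PR.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration only; not the plugin's actual code.
final class RescheduleDelay {
    private RescheduleDelay() {}

    // Returns the delay from now until nextExecutionTime,
    // or Duration.ZERO if nextExecutionTime is already in the past.
    static Duration delayUntil(Instant now, Instant nextExecutionTime) {
        Duration delay = Duration.between(now, nextExecutionTime);
        return delay.isNegative() ? Duration.ZERO : delay;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2023-06-30T12:00:00Z");
        // A run that was missed 30 minutes ago is executed immediately.
        System.out.println(delayUntil(now, now.minus(Duration.ofMinutes(30)))); // prints PT0S
        // A run that is still in the future keeps its normal delay.
        System.out.println(delayUntil(now, now.plusSeconds(60)));               // prints PT1M
    }
}
```

A non-negative delay of this kind can then be handed to whatever scheduling call is in use without triggering the "duration cannot be negative" check.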

Thanks to my colleague @kkewwei for figuring this out.

Signed-off-by: fudongying <[email protected]>
Signed-off-by: kewei.11 <[email protected]>

Signed-off-by: fudongying <[email protected]>
@joshpalis
Member

Thanks for raising this PR @fudongyingluck. Checks are failing due to stale artifacts, will wait to re-run checks until the next 2.9.0 build is successful

* What went wrong:
Could not determine the dependencies of task ':opensearch-job-scheduler-sample-extension:jobSchedulerBwcCluster#fullRestartClusterTask'.
> Server returned HTTP response code: 403 for URL: https://ci.opensearch.org/ci/dbc/distribution-build-opensearch/2.9.0/8039/linux/x64/tar/builds/opensearch/plugins/opensearch-job-scheduler-2.9.0.0.zip

@fudongyingluck
Contributor Author

fudongyingluck commented Jul 3, 2023

@joshpalis We found that stale data remains in memory after the exception. If the index migrates back to this node after we move it to another node, the job is lost again. We added a second commit to handle this case.
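
One way this kind of stale in-memory state can be avoided is sketched below: if the scheduling step throws, the job's tracking entry is removed so that a later attempt on the same node is not skipped. Everything here is hypothetical (the class `ScheduledJobRegistry`, its methods, and the use of a plain set of job IDs); the conversation does not show the actual change in the second commit, so this is only an assumption about the general pattern.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical in-memory registry for illustration only; not the plugin's actual code.
final class ScheduledJobRegistry {
    private final Set<String> tracked = ConcurrentHashMap.newKeySet();

    // Returns true if the job was scheduled, false if it is already tracked on this node.
    boolean schedule(String jobId, Consumer<String> scheduler) {
        if (!tracked.add(jobId)) {
            return false; // already tracked, nothing to do
        }
        try {
            scheduler.accept(jobId);
            return true;
        } catch (RuntimeException e) {
            // Drop the stale entry so a later schedule attempt can succeed.
            tracked.remove(jobId);
            throw e;
        }
    }

    boolean isTracked(String jobId) {
        return tracked.contains(jobId);
    }

    public static void main(String[] args) {
        ScheduledJobRegistry registry = new ScheduledJobRegistry();
        try {
            registry.schedule("job-1", id -> {
                throw new IllegalArgumentException("duration cannot be negative");
            });
        } catch (IllegalArgumentException e) {
            System.out.println("tracked after failure: " + registry.isTracked("job-1"));
        }
        System.out.println("rescheduled: " + registry.schedule("job-1", id -> {}));
    }
}
```

Without the cleanup in the `catch` branch, the failed job would stay in the set and every later call for the same ID would return false, which matches the symptom of a job being lost again after the index moves back.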

@codecov

codecov bot commented Jul 6, 2023

Codecov Report

Merging #409 (6defabc) into main (0132436) will increase coverage by 0.42%.
The diff coverage is 91.66%.

@@             Coverage Diff              @@
##               main     #409      +/-   ##
============================================
+ Coverage     28.77%   29.19%   +0.42%     
- Complexity       97       98       +1     
============================================
  Files            22       22              
  Lines          1178     1185       +7     
  Branches        109      109              
============================================
+ Hits            339      346       +7     
  Misses          818      818              
  Partials         21       21              
Impacted Files Coverage Δ
...pensearch/jobscheduler/scheduler/JobScheduler.java 74.73% <91.66%> (+2.00%) ⬆️

@fudongyingluck
Contributor Author

@joshpalis The checks seem to be passing now. Would you be able to review the code when convenient? Thank you very much for your time.

Member

@dbwiddis dbwiddis left a comment

Thanks for this fix!

In the future, it would be helpful to create an issue reporting the details of the bug and then reference that issue in the PR. If there's a need to discuss appropriate fixes for the bug that the PR might not handle, that keeps the discussions separate.

Fix LGTM with a few nits.

Signed-off-by: fudongying <[email protected]>
@fudongyingluck
Contributor Author

@dbwiddis Thank you very much for your time and advice. I'll create a bug issue next time. The code has been updated per your comments in the latest commit.

Member

@cwperks cwperks left a comment

@fudongyingluck This PR looks good to me, but I'd be curious to know how to reproduce the scenario. Would you be able to provide steps for how to recreate the scenario where duration can be negative? I can see in the code it's possible if the current instant is after the nextExecutionTime, but does that mean that a job had previously failed to run and the nextExecutionTime was not updated?

How can the situation arise where nextExecutionTime is in the past? Thank you.

@fudongyingluck
Contributor Author

@cwperks Good question. We were also puzzled when the job disappeared, until we found the logs. The cloud service that hosts the ES k8s instance was unavailable for some time, and the ES instance apparently did not run during that period; I don't know how this happened. After about 30 minutes the ES instance resumed running, and the exception occurred.
I know we should fix the stalled ES instance problem, but that seems more complex, and the online ES instance can't be restarted for stability reasons. I raised this PR to avoid the problem next time.

@cwperks
Member

cwperks commented Jul 11, 2023

@fudongyingluck Thank you for the context!

@joshpalis joshpalis merged commit 9f4ec67 into opensearch-project:main Jul 11, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jul 11, 2023
* fix: time out of range

Signed-off-by: fudongying <[email protected]>

* fix: deschedule failed after schedule exception

Signed-off-by: fudongying <[email protected]>

* chore: dbwiddis's comments

Signed-off-by: fudongying <[email protected]>

---------

Signed-off-by: fudongying <[email protected]>
(cherry picked from commit 9f4ec67)
joshpalis pushed a commit that referenced this pull request Jul 11, 2023
* fix: time out of range

Signed-off-by: fudongying <[email protected]>

* fix: deschedule failed after schedule exception

Signed-off-by: fudongying <[email protected]>

* chore: dbwiddis's comments

Signed-off-by: fudongying <[email protected]>

---------

Signed-off-by: fudongying <[email protected]>
(cherry picked from commit 9f4ec67)
Signed-off-by: Joshua Palis <[email protected]>
joshpalis pushed a commit that referenced this pull request Jul 11, 2023
* fix: time out of range

* fix: deschedule failed after schedule exception

* chore: dbwiddis's comments

---------

(cherry picked from commit 9f4ec67)

Signed-off-by: fudongying <[email protected]>
Signed-off-by: Joshua Palis <[email protected]>
Co-authored-by: fudongying <[email protected]>
4 participants