Enable completion time-to-live to be set on all jobs #407

objectiser · 2019-05-10T16:00:07Z

Fixes #406

@jkandasa Could you test this to make sure it fixes the issue? Note that it requires Kubernetes 1.12 or higher.

Signed-off-by: Gary Brown [email protected]

objectiser · 2019-05-10T16:02:30Z

Codeclimate issues are related to the autogenerated code.

codecov · 2019-05-10T16:17:59Z

Codecov Report

Merging #407 into master will increase coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #407      +/-   ##
==========================================
+ Coverage   91.59%   91.62%   +0.02%     
==========================================
  Files          64       64              
  Lines        3142     3153      +11     
==========================================
+ Hits         2878     2889      +11     
  Misses        184      184              
  Partials       80       80

| Impacted Files | Coverage Δ | |
|---|---|---|
| pkg/apis/jaegertracing/v1/jaeger_types.go | `100% <ø> (ø)` | ⬆️ |
| pkg/strategy/controller.go | `97.69% <100%> (+0.19%)` | ⬆️ |
| pkg/cronjob/spark_dependencies.go | `97.8% <100%> (+0.02%)` | ⬆️ |
| pkg/cronjob/es_rollover.go | `95.53% <100%> (-0.08%)` | ⬇️ |
| pkg/storage/cassandra_dependencies.go | `100% <100%> (ø)` | ⬆️ |
| pkg/cronjob/es_index_cleaner.go | `100% <100%> (ø)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2f67fd4...174230e. Read the comment docs.

jkandasa · 2019-05-14T14:49:44Z

@objectiser I do not have Kubernetes 1.12.

I have:

OpenShift Master: v3.11.98
Kubernetes Master: v1.11.0+d4cacc0
OpenShift Web Console: v3.11.98

Test status:

- es-index-cleaner: works well (job status is success every time)
- spark-dependencies: inconsistent (job status is error every time; see "spark-dependencies job is failing" #405)
  - sets of old spark-dependencies jobs are deleted at random intervals (could not predict when)

CR:

  storage:
    type: elasticsearch
    esIndexCleaner:
      enabled: true
      schedule: "*/1 * * * *"
      completedTTL: 180
    dependencies:
      enabled: true
      schedule: "*/1 * * * *"
      completedTTL: 300
    elasticsearch:
      nodeCount: 3
      resources:

objectiser · 2019-05-14T15:55:09Z

@jkandasa Ok thanks.

@jpkrohling @pavolloffay At present the 'completed time-to-live' for all of the jobs defaults to 10 minutes. I am wondering whether each job type (e.g. rollover, index cleaner) should have its own default that relates more closely to its schedule (when one is used). For example, if a schedule runs a job only once a day, maybe the TTL should be two days so that any failures stay around for a while?

pavolloffay · 2019-05-14T16:03:44Z

Maybe we could always make it double the schedule interval?
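
A minimal sketch of the "double the schedule" idea, assuming the schedule interval is already known in seconds (deriving it from the cron expression is left out). The function name `ttlForInterval` is hypothetical and is not part of the operator.

```go
package main

import "fmt"

// ttlForInterval is a hypothetical helper: it returns a TTL, in seconds,
// equal to twice the interval between scheduled runs.
func ttlForInterval(intervalSeconds int32) int32 {
	if intervalSeconds < 0 {
		// A negative interval makes no sense; treat it as "no retention".
		return 0
	}
	return 2 * intervalSeconds
}

func main() {
	// A job scheduled once a day would keep finished jobs for two days.
	fmt.Println(ttlForInterval(24 * 60 * 60))
	// A job scheduled every minute would keep finished jobs for two minutes.
	fmt.Println(ttlForInterval(60))
}
```

With a daily schedule this gives the two-day retention suggested earlier in the thread; with a one-minute schedule, finished jobs would be removed after two minutes.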

objectiser · 2019-05-17T09:58:52Z

@pavolloffay On further thought, how long a job remains shouldn't really depend on the schedule: if a failure/retry occurs, it can leave a large number of jobs lying around for a long time.

On the other hand, we need to give someone time to detect the failure and capture any relevant information. So I have updated the default to 1 hour.

If we want to try to come up with a more complex scheme, then we could do it in a separate PR?

pavolloffay · 2019-05-17T11:58:24Z

> On the other hand we need to give time for someone to detect the failure and capture any relevant information. So have updated default to 1 hour.

The jobs we have here run once per day at midnight. I am not sure one hour is a good default in this case.

objectiser · 2019-05-22T10:40:19Z

If there is a logging framework, then any failures would be captured centrally - so having the job hanging around would not be as relevant.

pavolloffay · 2019-05-22T14:38:25Z

The question is whether those logs would get attention and whether they would indicate a problem.

objectiser · 2019-05-29T08:25:17Z

@jpkrohling Any thoughts on this?

jpkrohling · 2019-05-29T08:35:14Z

If we run each job on a daily basis, it's safe to assume that they shouldn't take more than one day to complete.

objectiser · 2019-06-03T13:20:31Z

@jpkrohling Could you take a look at this? The codeclimate errors are not relevant, and the travis job has finished, but its status has not been updated here for some reason.

jpkrohling · 2019-06-03T14:44:01Z

This is all green now.

objectiser · 2019-06-03T16:09:15Z

@jpkrohling Are the changes ok to merge?

jpkrohling · 2019-06-03T18:34:20Z

LGTM, but I would prefer TTLSecondsAfterFinished as the name for the new property, to match the one from the batch objects, which is where it's used after all.
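
For illustration, a minimal sketch of how such a property might be applied to a job, with a fallback default. In the Kubernetes batch/v1 API, `JobSpec.TTLSecondsAfterFinished` is a `*int32`. The `jobSpec` and `jobOptions` types below are local stand-ins rather than the real API or operator types, and `applyTTL` and `defaultTTLSeconds` are hypothetical names; the one-day default follows the commit message in this PR.

```go
package main

import "fmt"

// jobSpec is a local stand-in for the batch/v1 JobSpec, reduced to the
// single field discussed here.
type jobSpec struct {
	TTLSecondsAfterFinished *int32
}

// jobOptions is a hypothetical stand-in for the per-job options in the CR.
type jobOptions struct {
	TTLSecondsAfterFinished *int32
}

// defaultTTLSeconds is one day, expressed in seconds.
const defaultTTLSeconds int32 = 24 * 60 * 60

// applyTTL copies the user-supplied TTL onto the job spec, or falls back
// to the default when the option is unset.
func applyTTL(opts jobOptions, spec *jobSpec) {
	ttl := defaultTTLSeconds
	if opts.TTLSecondsAfterFinished != nil {
		ttl = *opts.TTLSecondsAfterFinished
	}
	// Store a pointer to a fresh copy so the spec does not alias the options.
	spec.TTLSecondsAfterFinished = &ttl
}

func main() {
	var spec jobSpec
	applyTTL(jobOptions{}, &spec)
	fmt.Println(*spec.TTLSecondsAfterFinished) // default applied

	custom := int32(300)
	applyTTL(jobOptions{TTLSecondsAfterFinished: &custom}, &spec)
	fmt.Println(*spec.TTLSecondsAfterFinished) // user value applied
}
```

Using a pointer in the options keeps "unset" distinguishable from an explicit zero, which matters because a TTL of 0 is a meaningful value (delete as soon as the job finishes).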

objectiser · 2019-06-04T08:01:18Z

@jpkrohling No problem, can change.

objectiser · 2019-06-04T10:20:28Z

go.sum

@@ -43,6 +43,7 @@ github.com/bradfitz/go-smtpd v0.0.0-20170404230938-deb6d6237625/go.mod h1:HYsPBT
 github.com/census-instrumentation/opencensus-proto v0.2.0 h1:LzQXZOgg4CQfE6bFvXGM30YZL1WW/M337pXml+GrcZ4=
 github.com/census-instrumentation/opencensus-proto v0.2.0/go.mod h1:f6KPmirojxKA12rnyqOA5BBL4O983OfeGPqjHWSTneU=
 github.com/client9/misspell v0.3.4/go.mod h1:qj6jICC3Q7zFZvVWo7KLAzC3yx5G7kyvSDkc90ppPyw=
+github.com/codahale/hdrhistogram v0.0.0-20161010025455-3a0bb77429bd h1:qMd81Ts1T2OTKmB4acZcyKaMtRnY5Y44NuXGX2GFJ1w=


@jpkrohling Would you expect these changes to the go.sum file? The two lines seem similar, except that the next one has /go.mod.

jpkrohling · 2019-06-04

Not really, but the version seems strange. Perhaps it's pointing to master and they had a new commit recently?

objectiser · 2019-06-04

I would have expected it to just update the existing line rather than add a new one.

jpkrohling · 2019-06-04

Oh, good point. I need to understand go modules better in order to answer that. I have seen duplicated entries in different sections in the go.mod, but not sure I would expect the same for go.sum...

jkandasa mentioned this pull request May 24, 2019

Add a mechanism to clean completed/failed jobs triggered by the jaeger-operator #406

Closed

objectiser added 4 commits June 4, 2019 09:16

Enable completion time-to-live to be set on all jobs

3dbbf5b

Signed-off-by: Gary Brown <[email protected]>

Changed default to 1 hour

e34f80c

Signed-off-by: Gary Brown <[email protected]>

Rename completedTTL to afterCompletionTTL to hopefully make clearer, …

f5c3e05

…and change default TTL to 1 day Signed-off-by: Gary Brown <[email protected]>

Change ttl field to k8s field name

174230e

Signed-off-by: Gary Brown <[email protected]>

jpkrohling approved these changes Jun 6, 2019

jpkrohling merged commit eaa4d52 into jaegertracing:master Jun 6, 2019

headcr4sh mentioned this pull request Jul 3, 2019

Error applying changes in ttlSecondsAfterFinished with 1.13 #494

Closed
