Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retain generated corpus and pretranslation files for a build #468

Closed
ddaspit opened this issue Aug 27, 2024 · 30 comments
Closed

Retain generated corpus and pretranslation files for a build #468

ddaspit opened this issue Aug 27, 2024 · 30 comments
Assignees
Labels
enhancement New feature or request

Comments

@ddaspit
Copy link
Contributor

ddaspit commented Aug 27, 2024

Currently, Serval deletes corpus and pretranslation files once a build has finished. This makes it difficult to debug issues and to perform testing. Instead, Serval should retain the files after a build has finished. The files should be deleted after a predetermined amount of time (maybe 30 days).

@ddaspit ddaspit added the enhancement New feature or request label Aug 27, 2024
@ddaspit ddaspit added this to Serval Aug 27, 2024
@github-project-automation github-project-automation bot moved this to 🆕 New in Serval Aug 27, 2024
@ddaspit ddaspit moved this from 🆕 New to 🔖 Ready in Serval Aug 27, 2024
@ddaspit
Copy link
Contributor Author

ddaspit commented Aug 27, 2024

This is needed to support gray box testing by the test team.

@johnml1135
Copy link
Collaborator

Do we want this for QA only? Do we want this controllable by a flag?

@ddaspit
Copy link
Contributor Author

ddaspit commented Aug 28, 2024

I think it would be good to do this on production as well, since it will be helpful for debugging.

@johnml1135 johnml1135 self-assigned this Aug 29, 2024
johnml1135 added a commit that referenced this issue Sep 9, 2024
johnml1135 added a commit that referenced this issue Sep 10, 2024
johnml1135 added a commit that referenced this issue Sep 10, 2024
johnml1135 added a commit that referenced this issue Sep 10, 2024
* Fix #464 - add lock lifetime for all
* Add HTTP timeout
* Make adjustable through options
* Will need to delete all locks from MongoDB - otherwise will endlessly loop for startup
* Fix some ci-e2e issues

Only use locking when accessing SMT model

Fix unit tests

Update to latest version of Machine

Fix bug where wrong id is used when starting a build

Remove reference to Serval.Shared in Serval.Machine.Shared

* preserve fix fro #468
@johnml1135 johnml1135 moved this from 🔖 Ready to ✅ Done in Serval Sep 10, 2024
@ddaspit
Copy link
Contributor Author

ddaspit commented Sep 10, 2024

We still need to implement a method for cleaning up the files.

@ddaspit ddaspit reopened this Sep 10, 2024
@johnml1135 johnml1135 moved this from ✅ Done to 🔖 Ready in Serval Sep 10, 2024
@johnml1135 johnml1135 assigned Enkidu93 and unassigned johnml1135 Oct 7, 2024
@Enkidu93
Copy link
Collaborator

I know you were asking about an approach in the PR above, @ddaspit: I'm thinking I'll just add a DateFinished to the machine build model which we can set in BuildJobService.BuildJobFinishedAsync and then add another RecurrentTask similar to the ModelCleanupService to delete the build artifacts after a certain amount of time (is 30 days long enough?). Does that sound like a good approach? We could wait until 30 days after the code has been pushed to production and then push another change to delete build artifacts for translation engines which don't have a DateFinished property set to avoid indefinitely storing the build artifacts from builds build recently or we could set the existing builds to have the current date at the time of the push (probable simpler).

@johnml1135
Copy link
Collaborator

The longest that I could imagine that a job could take would be 1 day. If we set the expiration for 30 days and key off of DateCreated (or whatever the field is), I think it could be simpler to implement and not have an old vs. new data issue.

@Enkidu93
Copy link
Collaborator

The longest that I could imagine that a job could take would be 1 day. If we set the expiration for 30 days and key off of DateCreated (or whatever the field is), I think it could be simpler to implement and not have an old vs. new data issue.

I'm not sure what you mean by 'set the expiration', @johnml1135. Are you referring to an s3 object expiration?

@johnml1135
Copy link
Collaborator

Set the expiration -> "delete the build artifacts after a certain amount of time." Sorry for the bad English.

@Enkidu93
Copy link
Collaborator

Set the expiration -> "delete the build artifacts after a certain amount of time." Sorry for the bad English.

No, no, I just wanted to make sure I understood 😁 . So you're suggesting we use a property of the files themselves?

@johnml1135
Copy link
Collaborator

Yes. Any system should at least have a created on and last modified on date.

@ddaspit
Copy link
Contributor Author

ddaspit commented Oct 14, 2024

The files are stored in the S3 bucket. Is that correct? Does S3 have a date modified/created field?

@Enkidu93
Copy link
Collaborator

Yes. Any system should at least have a created on and last modified on date.

OK, so basically just the equivalent of the clean_s3 script in silnlp. I guess that makes sense.

@ddaspit
Copy link
Contributor Author

ddaspit commented Oct 14, 2024

A script like that could be run using a cron job. I think that is how Matthew is doing it.

@Enkidu93
Copy link
Collaborator

A script like that could be run using a cron job. I think that is how Matthew is doing it.

Right. Might it be appropriate just to hijack that script and tag on an additional pattern that covers the Serval builds?

@ddaspit
Copy link
Contributor Author

ddaspit commented Oct 14, 2024

That makes sense to me. Although, we are planning on moving silnlp data to the NAS, so production and research data will be stored in separate locations.

@johnml1135
Copy link
Collaborator

I am assuming that the job would be run directly within serval as a recurring task. If it were an actual cron job, the deployment would be non-obvious.

@ddaspit
Copy link
Contributor Author

ddaspit commented Oct 15, 2024

If possible, I would prefer to run this as a cron job rather than use up a thread in the Serval server. Does Kubernetes have some way to run scheduled jobs?

@johnml1135
Copy link
Collaborator

Another option is just to have it completely separate from Serval - running the cron job on with the SILNLP cron job for the S3 bucket data. I would make sure that there was a flag to turn on and off the serval auto-deleting at the end of jobs though. It would be a bit manual, but really only effects S3 bucket cost if the cron job does not execute.

@johnml1135 johnml1135 moved this from 🔖 Ready to 🏗 In progress in Serval Nov 1, 2024
@Enkidu93
Copy link
Collaborator

Enkidu93 commented Dec 2, 2024

What's the conclusion here in regards to a strategy?

@johnml1135
Copy link
Collaborator

Let's talk about it at Wednesday's meeting. I am still inclined to do the "roll it into SILNLP cleanup" thing.

@johnml1135
Copy link
Collaborator

@mshannon-sil - If I am correct, there is an SILNLP job that runs that cleans up old data. Is that correct?

@johnml1135 johnml1135 assigned mshannon-sil and johnml1135 and unassigned Enkidu93 Dec 5, 2024
@mshannon-sil
Copy link
Collaborator

Yes that's right, the clean_s3 script in the SILNLP repo currently runs as a cron job on the AQuA server every Sunday.

@johnml1135
Copy link
Collaborator

@mshannon-sil - how hard would it be to extend that script to also clean production data?

@mshannon-sil
Copy link
Collaborator

Should be pretty straightforward. It would just need to check if the files are in the production folder and match the pattern for corpus and pretranslation files, and then delete any files older than 30 days. If we just want to update the clean_s3 script for this, I'd be happy to do it.

@johnml1135
Copy link
Collaborator

@mshannon-sil, what is the status of this issue?

@mshannon-sil
Copy link
Collaborator

The status is ready to be worked on. I have some time today, so I'll work on it and keep you updated.

@mshannon-sil
Copy link
Collaborator

@johnml1135 I just submitted a PR for this issue in SILNLP.

@johnml1135
Copy link
Collaborator

@mshannon-sil - can you respond to the comments?

@mshannon-sil
Copy link
Collaborator

Yes sorry for the delay, I was tackling other assignments when I came back last week and this got pushed back. I have time today to review.

@mshannon-sil
Copy link
Collaborator

The PR has been merged, and I just verified that the cron job is deleting 2 month old production builds in addition to the 1 month old research checkpoints.

@github-project-automation github-project-automation bot moved this from 🏗 In progress to ✅ Done in Serval Jan 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Status: ✅ Done
Development

No branches or pull requests

4 participants