
feat(robot-server): implement data files auto-deletion #15879

Merged
4 commits merged into edge on Aug 2, 2024

Conversation

@sanni-t (Member) commented Aug 2, 2024

Closes AUTH-467

Overview

Adds auto-deletion of old data files. If the data files stored on disk exceed 50 files, we start auto-deleting files as needed: the oldest file that is not referenced by any analysis or run in the database is deleted from disk and removed from the database.

Test Plan and Hands-on Testing

  • Add 50 data files to the robot, use some of them in a run
  • Check that 50 files still exist
  • Add another file
  • Check that the oldest unused file is deleted
    (pro-tip: you can change the file limit in robot settings for testing)

Changelog

  • added DataFileAutoDeleter and DataFileDeletionPlanner
  • updated data file router to use auto-deletion before adding a new file
  • added new remove() method to DataFilesStore
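The deletion-planning logic described above could be sketched roughly as follows. The class and method names mirror the changelog, but the signatures and internals here are illustrative assumptions, not the PR's actual API:

```python
from typing import List, Set


class DataFileDeletionPlanner:
    """Plan which files to delete so the total stays within the maximum."""

    def __init__(self, maximum_files: int) -> None:
        self._maximum_files = maximum_files

    def plan_for_new_file(
        self,
        existing_file_ids: List[str],  # ordered oldest-first
        used_file_ids: Set[str],  # files referenced by an analysis or run
    ) -> Set[str]:
        """Return IDs of unused files to delete before adding one new file."""
        deletions: Set[str] = set()
        # The +1 accounts for the file about to be added.
        excess = len(existing_file_ids) + 1 - self._maximum_files
        for file_id in existing_file_ids:
            if excess <= 0:
                break
            if file_id not in used_file_ids:
                deletions.add(file_id)
                excess -= 1
        return deletions
```

Note that if every existing file is in use, this returns an empty set and the limit is allowed to be exceeded.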

Review requests

  • Usual code review

Risk assessment

Medium. Doesn't change existing infrastructure, but if implemented incorrectly it could delete data files unexpectedly.

@sanni-t sanni-t requested a review from a team as a code owner August 2, 2024 17:57
@SyntaxColoring (Contributor) left a comment

Looks great, thanks. Some minor comments.

default=50,
gt=0,
description=(
"The maximum number of data files to allow before auto-deleting old ones."
Contributor commented:

Should this say:

Suggested change
"The maximum number of data files to allow before auto-deleting old ones."
"The maximum number of unused data files to allow before auto-deleting old ones."

And be named maximum_unused_data_files?

@SyntaxColoring (Contributor) commented Aug 2, 2024

Oh, and would you also mind running pipenv run python scripts/settings_schema.py settings_schema.json and then a top-level make format-js?

I'm not sure if settings_schema.json is really helpful, but given that it exists, we should keep it up to date. I'd also be down to just delete it.

@sanni-t (Member, Author) commented

It's actually not the max number of unused data files, though. That is, it starts deleting unused files when [used + unused] exceeds 50.

@SyntaxColoring (Contributor) commented Aug 2, 2024

Does that imply that if you have 10 runs that each use 5 CSV files that are all distinct from each other, it's impossible to upload any additional CSV files? Because you're already at the limit of 50 total, and the server will not autoremove any of those 50?

@sanni-t (Member, Author) commented

No, in that case we'll exceed the 50-file limit and still allow adding new ones. Each time we try to add a new file, we re-check for any newly unreferenced files and delete them until the count is below 50 again.
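That retry-on-upload behavior could be sketched like this (the helper names and signature are hypothetical, not the PR's code):

```python
from typing import Callable, List, Set


def make_room_for_new_file(
    files_oldest_first: List[str],
    used_ids: Set[str],
    delete_file: Callable[[str], None],
    maximum_files: int = 50,
) -> None:
    """Delete unused files, oldest first, until the count drops below the limit.

    If every remaining file is still referenced, the limit is exceeded and
    deletion is simply retried on the next upload, when some files may have
    become unreferenced.
    """
    for file_id in list(files_oldest_first):
        if len(files_oldest_first) < maximum_files:
            break
        if file_id not in used_ids:
            delete_file(file_id)  # remove from disk and database
            files_oldest_first.remove(file_id)
```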

transaction.execute(delete_statement)

file_dir = self._data_files_directory.joinpath(file_id)
if file_dir:
Contributor commented:

What's this if file_dir intended to do? It looks like it will only check if the stringification of the path is != "", not that the path actually exists on the filesystem?

Comment on lines +24 to +25
# It feels wasteful to collect usage info of up to 50 files
# even when there's no need for deletion
Contributor commented:

Yeah. Another thing that we should keep in mind is that this AutoDeleter pattern is not safe in the face of concurrent requests. I think in the future we should look for ways to internalize this logic within the stores, which are in a better position to do things transactionally and which already have some of this information available.

raise FileInUseError(
data_file_id=file_id,
message=f"Cannot remove file {file_id} as it is being used in"
f" existing{analysis_usage_text or ''}{conjunction}{runs_usage_text or ''}.",
Contributor commented:

Stylistically, it might be cleaner to do this kind of stringification inside FileInUseError()'s __init__().
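The suggestion, sketched out (the keyword arguments are assumptions inferred from the snippet above, and this assumes at least one usage text is non-empty):

```python
class FileInUseError(Exception):
    """Raised when a data file cannot be removed because it is referenced."""

    def __init__(
        self,
        data_file_id: str,
        analysis_usage_text: str = "",
        runs_usage_text: str = "",
    ) -> None:
        self.data_file_id = data_file_id
        # Join the two usage descriptions only when both are present.
        conjunction = " and " if analysis_usage_text and runs_usage_text else ""
        super().__init__(
            f"Cannot remove file {data_file_id} as it is being used in"
            f" existing {analysis_usage_text}{conjunction}{runs_usage_text}."
        )
```

The call site then shrinks to `raise FileInUseError(data_file_id=file_id, ...)`, keeping message formatting in one place.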

@jbleon95 (Contributor) left a comment

Straightforward and looking good.

async def make_room_for_new_file(self) -> None:
"""Delete old data files to make room for a new one."""
# It feels wasteful to collect usage info of up to 50 files
# even when there's no need for deletion
Contributor commented:

Maybe we can just check how many actual data files there are rather than the more complicated usage request? Not necessary for now and I don't know how much more efficient it'll be (especially once you're at 50 files and we're gonna have to do this every time), but a thought.

@sanni-t (Member, Author) commented

The usage check is necessary because deleting a used file would cause a cascade of errors starting from the foreign keys in the analysis/runs tables. A solution would probably involve internalizing the file deletion in the store itself, as @SyntaxColoring suggested, so that we can keep track of files as they are created/deleted and handle auto-deletion much more efficiently.

@sanni-t sanni-t merged commit dceee85 into edge Aug 2, 2024
7 checks passed
@sanni-t sanni-t deleted the AUTH-467-auto-delete-old-csv-files branch August 7, 2024 15:44