feat(robot-server): implement data files auto-deletion #15879
Conversation
Looks great, thanks. Some minor comments.
default=50,
gt=0,
description=(
    "The maximum number of data files to allow before auto-deleting old ones."
Should this say:
"The maximum number of data files to allow before auto-deleting old ones." | |
"The maximum number of unused data files to allow before auto-deleting old ones." |
And be named maximum_unused_data_files
?
Oh, and would you also mind running `pipenv run python scripts/settings_schema.py settings_schema.json` and then a top-level `make format-js`? I'm not sure if `settings_schema.json` is really helpful, but given that it exists, we should keep it up to date. I'd also be down to just delete it.
It's actually not the max number of unused data files, though. As in, it will start deleting unused files when [used + unused] exceeds 50.
Does that imply that if you have 10 runs that each use 5 CSV files that are all distinct from each other, it's impossible to upload any additional CSV files? Because you're already at the limit of 50 total, and the server will not autoremove any of those 50?
No, in that case we'll exceed the 50 file limit and allow adding new ones. Each time we try to add a new one, we will re-check to find any newly un-referenced files and delete them until files < 50 again.
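In other words, the deletion pass runs each time a new file is added and only ever removes files that nothing references. A minimal sketch of that flow (the names MAXIMUM_FILES, get_usage_info, and plan_deletions are illustrative assumptions, not the actual identifiers in this PR):

```python
# Illustrative sketch only; real names and signatures in the PR may differ.
MAXIMUM_FILES = 50  # configurable through robot settings


async def make_room_for_new_file(store, planner) -> None:
    """Delete unreferenced data files until the total count is back under the limit."""
    usage_info = store.get_usage_info()  # hypothetical: (file_id, created_at, is_referenced)
    to_delete = planner.plan_deletions(usage_info, maximum_files=MAXIMUM_FILES)
    for file_id in to_delete:
        # Only files not referenced by any analysis or run are planned for
        # deletion, oldest first; referenced files are left alone even if
        # that keeps the total above the limit.
        store.remove(file_id)
```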
transaction.execute(delete_statement)

file_dir = self._data_files_directory.joinpath(file_id)
if file_dir:
What's this `if file_dir` intended to do? It looks like it will only check if the stringification of the path is != "", not that the path actually exists on the filesystem?
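For reference, pathlib.Path does not define __bool__, so any Path instance is truthy whether or not anything exists at that location; an actual filesystem check needs something like exists() or is_dir(). A quick illustration (the path here is made up):

```python
from pathlib import Path

file_dir = Path("/data/user_files").joinpath("some-file-id")  # made-up path

bool(file_dir)     # always True: Path has no __bool__, so default object truthiness applies
file_dir.exists()  # actually checks the filesystem
file_dir.is_dir()  # checks that it exists and is a directory
```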
# It feels wasteful to collect usage info of up to 50 files
# even when there's no need for deletion
Yeah. Another thing that we should keep in mind is that this `AutoDeleter` pattern is not safe in the face of concurrent requests. I think in the future we should look for ways to internalize this logic within the stores, which are in a better position to do things transactionally and which already have some of this information available.
raise FileInUseError(
    data_file_id=file_id,
    message=f"Cannot remove file {file_id} as it is being used in"
    f" existing{analysis_usage_text or ''}{conjunction}{runs_usage_text or ''}.",
Stylistically, it might be cleaner to do this kind of stringification inside `FileInUseError()`'s `__init__()`.
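For illustration, that could look roughly like the following (the parameter names and usage flags are assumptions for the sketch, not the PR's actual fields):

```python
class FileInUseError(Exception):
    """Raised when trying to remove a data file still referenced by analyses or runs."""

    def __init__(
        self,
        data_file_id: str,
        used_in_analyses: bool,
        used_in_runs: bool,
    ) -> None:
        # Build the human-readable message here instead of at every raise site.
        analyses_text = " analyses" if used_in_analyses else ""
        runs_text = " runs" if used_in_runs else ""
        conjunction = " and" if used_in_analyses and used_in_runs else ""
        super().__init__(
            f"Cannot remove file {data_file_id} as it is being used in"
            f" existing{analyses_text}{conjunction}{runs_text}."
        )
        self.data_file_id = data_file_id
```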
Straightforward and looking good.
async def make_room_for_new_file(self) -> None:
    """Delete old data files to make room for a new one."""
    # It feels wasteful to collect usage info of up to 50 files
    # even when there's no need for deletion
Maybe we can just check how many actual data files there are rather than the more complicated usage request? Not necessary for now and I don't know how much more efficient it'll be (especially once you're at 50 files and we're gonna have to do this every time), but a thought.
The usage check is necessary because deleting a used file will result in a cascade of errors starting from foreign keys in analysis/runs tables. A solution would probably involve internalizing the file deletion in the store itself, as @SyntaxColoring suggested, so that we can keep track of the files as they are created/deleted and handle auto-deletion much more efficiently.
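One way to get some of that saving while keeping the usage check would be to gate it behind a cheap file count, along these lines (count_files and the other helper names here are hypothetical):

```python
async def make_room_for_new_file(self) -> None:
    """Delete old, unreferenced data files to make room for a new one."""
    # Hypothetical sketch: skip the usage query entirely while under the limit.
    if self._data_files_store.count_files() < self._maximum_files:
        return
    # At or over the limit: fall back to the full usage check so we never
    # delete a file that an analysis or run still references.
    usage_info = self._data_files_store.get_usage_info()
    for file_id in self._deletion_planner.plan_deletions(usage_info, self._maximum_files):
        self._data_files_store.remove(file_id)
```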
Closes AUTH-467
Overview
Adds auto-deletion of old data files. If the data files stored on disk exceed 50 files, then we start auto-deleting files as needed. The oldest file that is not referenced by any analysis or run in the database is deleted from disk and removed from the database.
Test Plan and Hands-on Testing
(pro-tip: you can change the file limit in robot settings for testing)
Changelog
- Added DataFileAutoDeleter and DataFileDeletionPlanner
- Added remove() method to DataFilesStore
Review requests
Risk assessment
Medium. Doesn't change existing infrastructure, but if implemented incorrectly, it could delete data files unexpectedly.