-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Escape database, table and udf names for S3 paths #168
Conversation
ch_backup/backup/layout.py
Outdated
def _try_get_escaped_path( | ||
self, path_function: Callable, *args: Any, **kwargs: Any | ||
) -> str: | ||
""" | ||
Return escaped path if it exists. Otherwise return regular path. | ||
""" | ||
path = path_function(*args, escape_names=True, **kwargs) | ||
if self._storage_loader.path_exists(path, is_dir=True): | ||
return path | ||
return path_function(*args, escape_names=False, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it would be simpler to just try to use escaped path and retry with not escaped in case file is not found
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do this way. It seems it is safe to escape the whole path at once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I think adding retries to several methods will cause unnecessary complications of the old code. There are also some minor issues which will need to be addressed like list_dir
used in check_data_part
returning empty list even in case directory is not existing or the need to recollect all part paths in delete_data_parts
in case of retry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about do in such a way instead of passing a callable and kwargs. For me it is more clear.
# In downloading methods
remote_path = _udf_data_path(backup_meta.path, filename, escape=False) # For backward compatibility
remote_path = _get_escaped_if_exists(remote_path)
def _get_escaped_if_exists(self, unescaped_path: str) -> str:
"""
Return escaped path if it exists. Otherwise return regular path.
"""
escaped_path = escape(unescaped_path)
if self._storage_loader.path_exists(escaped_path, is_dir=True):
return escaped_path
return unescaped_path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Escaping the whole path looks like this ch_backup%2F7c543608%2D1af5%2D4eaa%2D9974%2D93b58e3f699c%2Fdata%2Funusual_char%2Ftab%07l%2E%2F%2D_e%2Fall_1_1_0
- '/' and '-' are escaped. We can ignore '/', but '-' was escaped in table names for things like metadata before, so ignoring it can lead to some troubles
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, we use safe=""
for urllib.parse.quote
. The escaping the whole string is a bad idea.
Ok let's return to the previous approach, it is more correct. But just rename function to _get_escaped_if_exists
Unescaped paths may cause some errors with s3 client.
Check if escaped path exists is added for backward compatibility with old backups.
path_exists
was working only for files, so I added another method for directories.