Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escape database, table and udf names for S3 paths #168

Merged
merged 5 commits into from
Jul 22, 2024

Conversation

kirillgarbar
Copy link
Contributor

Unescaped paths may cause some errors with s3 client.
Check if escaped path exists is added for backward compatibility with old backups. path_exists was working only for files, so I added another method for directories.

Comment on lines 615 to 624
def _try_get_escaped_path(
self, path_function: Callable, *args: Any, **kwargs: Any
) -> str:
"""
Return escaped path if it exists. Otherwise return regular path.
"""
path = path_function(*args, escape_names=True, **kwargs)
if self._storage_loader.path_exists(path, is_dir=True):
return path
return path_function(*args, escape_names=False, **kwargs)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be simpler to just try to use escaped path and retry with not escaped in case file is not found

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do this way. It seems it is safe to escape the whole path at once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I think adding retries to several methods will cause unnecessary complications of the old code. There are also some minor issues which will need to be addressed like list_dir used in check_data_part returning empty list even in case directory is not existing or the need to recollect all part paths in delete_data_parts in case of retry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about do in such a way instead of passing a callable and kwargs. For me it is more clear.

    # In downloading methods
    remote_path = _udf_data_path(backup_meta.path, filename, escape=False)  # For backward compatibility
    remote_path = _get_escaped_if_exists(remote_path)

    def _get_escaped_if_exists(self, unescaped_path: str) -> str:
        """
        Return escaped path if it exists. Otherwise return regular path.
        """
        escaped_path = escape(unescaped_path)
        if self._storage_loader.path_exists(escaped_path, is_dir=True):
            return escaped_path
        return unescaped_path

Copy link
Contributor Author

@kirillgarbar kirillgarbar Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Escaping the whole path looks like this ch_backup%2F7c543608%2D1af5%2D4eaa%2D9974%2D93b58e3f699c%2Fdata%2Funusual_char%2Ftab%07l%2E%2F%2D_e%2Fall_1_1_0 - '/' and '-' are escaped. We can ignore '/', but '-' was escaped in table names for things like metadata before, so ignoring it can lead to some troubles

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, we use safe="" for urllib.parse.quote. The escaping the whole string is a bad idea.
Ok let's return to the previous approach, it is more correct. But just rename function to _get_escaped_if_exists

@aalexfvk aalexfvk merged commit 23e9954 into yandex:main Jul 22, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants