Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to skip password when serializing filesystem #1625

Merged
merged 4 commits into from
Jun 17, 2024

Conversation

DarkLight1337
Copy link
Contributor

@DarkLight1337 DarkLight1337 commented Jun 14, 2024

Currently, AbstractFileSystem.to_dict (and by extension AbstractFileSystem.to_json) store all arguments used to construct the object. However, there are some cases where the user may provide a password, such as in FTPFileSystem, SMBFileSystem, and WebHDFS. This may cause security issues, especially considering that the password is stored in plain text.

This PR disables the aforementioned behaviour by default. Users now have to pass include_password=True to the relevant methods if they wish to store the password anyway. After some discussion, this feature is now opt-in. I have added a warning to the documentation to remind users to be careful about this.

@DarkLight1337 DarkLight1337 changed the title Avoid storing password when serializing filesystem Do not store password by default when serializing filesystem Jun 14, 2024
@martindurant
Copy link
Member

I should start by saying that including passwords directly in code/scripts is not generally a good idea. Many or the backends provide alternative ways to get credentials at runtime, from certain files, environment variables or other means.

Am I right in thinking that you are simply looking for dictionary keys that happen to be called "password"? That would not include other possibly sensitive values like keys and tokens, but I don't suppose there's any way to know whether some value is sensitive or not.

I am mildly against removing any values by default, as it would break the principal function of dict/json serialization, to faithfully recreate the original instance. How would a user re-apply the password when they need it, or how would they know in the first place that it was missing?

However, it might be suitable to provide a "safe ser" mode, but perhaps each backend implementation would declare which instance kwargs are to be considered sensitive?

@DarkLight1337
Copy link
Contributor Author

Am I right in thinking that you are simply looking for dictionary keys that happen to be called "password"? That would not include other possibly sensitive values like keys and tokens, but I don't suppose there's any way to know whether some value is sensitive or not.

Yes, it's that straightforward.

but perhaps each backend implementation would declare which instance kwargs are to be considered sensitive?

This would be a more comprehensive way to filter out sensitive data, but would require more time to implement. I think we should release a simple/partial patch first.

I am mildly against removing any values by default, as it would break the principal function of dict/json serialization, to faithfully recreate the original instance. How would a user re-apply the password when they need it, or how would they know in the first place that it was missing?

It should not take much digging into an authentication error to realize that the password is missing. Do many people expect that serialization includes sensitive data by default? The requirement of passing include_password=True explicitly would get people to think twice about security.

@martindurant
Copy link
Member

Let's make removing "password" opt-in for now and rettain the same default behaviour for now.

It should not take much digging into an authentication error to realize that the password is missing.

Unfortunately, you commonly get a "permission error" or something more arcane without further details. Most security systems will not tell you specifically that a password is missing. Of course, the FS repr doesn't include whether a password is set or not. I don't know, but you may even get an attribute-error for some backends if you simply don't set the value (should it be None by default, or ""?).

Do many people expect that serialization includes sensitive data by default?

It depends! If the purpose is to pass instances within a secured network, then yes. Pickle may well do that in a worker cluster, after all. Where instances need to be (re)created from a set of arguments plus additional information from the user or environment, you may well need a helper like Intake to encode this intent.

@DarkLight1337
Copy link
Contributor Author

Let's make removing "password" opt-in for now and rettain the same default behaviour for now.

Alright, I have updated the PR accordingly.

@DarkLight1337 DarkLight1337 changed the title Do not store password by default when serializing filesystem Add option to skip password when serializing filesystem Jun 17, 2024
@martindurant martindurant merged commit 527728d into fsspec:master Jun 17, 2024
11 checks passed
@DarkLight1337 DarkLight1337 deleted the no-dump-password branch June 17, 2024 14:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants