[Feature Request]: Enhance Error Handling in FileSystem Imports to Improve Troubleshooting #31218

Closed
1 of 16 tasks
RyuSA opened this issue May 8, 2024 · 4 comments · Fixed by #31219

Comments

@RyuSA
Contributor

RyuSA commented May 8, 2024

What would you like to happen?

I would like to make the FileSystem imports performed in the top-level code of apache_beam.io.filesystems easier to troubleshoot.

https://github.com/apache/beam/blob/v2.56.0/sdks/python/apache_beam/io/filesystems.py#L36-L59

AS-IS:

try:
  from apache_beam.io.hadoopfilesystem import HadoopFileSystem
except ImportError:
  pass

PROPOSAL:

try:
  from apache_beam.io.hadoopfilesystem import HadoopFileSystem
except ModuleNotFoundError:
  pass
except ImportError as e:
  # Pass the exception through a %s placeholder so logging can format it.
  _LOGGER.warning("Failed to import HadoopFileSystem; loading of this filesystem will be skipped. Error details: %s", e)

For context, I encountered a problem when launching a Beam job on CentOS 7 with apache-beam[gcp]==2.55.0 installed. The error occurs when the job is launched, not during job execution.

$ python3 -m apache_beam.examples.wordcount \
     --input INPUT \
     --output OUTPUT \
     --runner DataflowRunner 
Traceback (most recent call last):
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
...
  File "/home/ryusa/venv/lib64/python3.8/site-packages/apache_beam/io/filesystems.py", line 103, in get_filesystem
    raise ValueError(
ValueError: Unable to get filesystem from specified path, please use the correct path or ensure the required dependency is installed, e.g., pip install apache-beam[gcp]. Path specified: ...

The error itself occurs on this line and is caused by GCSFileSystem failing to load at module initialization. GCSFileSystem in turn relies on the requests package, and its urllib3 dependency requires OpenSSL 1.1.1 or later from v2 onwards (see the traceback below). CentOS 7 ships OpenSSL 1.0.2, so the behavior changed with Beam 2.55.0 and later. (This is not essential to the proposal, so I have not investigated in detail.)

$ python3
>>> from apache_beam.io.gcp.gcsfilesystem import GCSFileSystem
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ryusa/venv/lib64/python3.8/site-packages/apache_beam/io/gcp/gcsfilesystem.py", line 36, in <module>
...
    import urllib3
  File "/home/ryusa/venv/lib64/python3.8/site-packages/urllib3/__init__.py", line 42, in <module>
    raise ImportError(
ImportError: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'OpenSSL 1.0.2k-fips  26 Jan 2017'. See: https://github.com/urllib3/urllib3/issues/2168
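
To check whether an environment is affected, one can inspect the OpenSSL version that Python's ssl module was compiled against. The following is a minimal sketch: the helper name openssl_meets_minimum is hypothetical, the 1.1.1 threshold is taken from the urllib3 error message above, and the real check inside urllib3 may cover more cases than this comparison does.

```python
import ssl


def openssl_meets_minimum(version_info=None, minimum=(1, 1, 1)):
    """Return True if the OpenSSL version is at least `minimum`.

    `version_info` defaults to ssl.OPENSSL_VERSION_INFO, the version tuple
    of the OpenSSL library the interpreter's ssl module was built with.
    Only the (major, minor, fix) part is compared.
    """
    if version_info is None:
        version_info = ssl.OPENSSL_VERSION_INFO
    return tuple(version_info[:3]) >= tuple(minimum)


if __name__ == "__main__":
    # Human-readable version string, e.g. the one quoted in the traceback.
    print(ssl.OPENSSL_VERSION)
    print("Meets 1.1.1 minimum:", openssl_meets_minimum())
```

On the CentOS 7 setup described here (OpenSSL 1.0.2k), this comparison would be expected to report that the minimum is not met.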

I was able to resolve this quickly only because I happened to know about these circumstances. For future users, it seems better to log a warning when an ImportError occurs rather than silently suppressing it.
I can send a Pull Request. However, since the change touches a core area, I have raised an Issue first.

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@liferoad
Collaborator

liferoad commented May 8, 2024

I like your idea. Thanks for opening this issue.

@Abacn
Contributor

Abacn commented May 8, 2024

Filesystem extensions ([gcp], [s3], etc.) are optional dependencies of Beam. #31219 will raise excessive warnings for users who do not intend to install these dependencies. Can we instead improve the error message you referred to in the description?

Unable to get filesystem from specified path, please use the correct path or ensure the required dependency is installed, e.g., pip install apache-beam[gcp]. Path specified: ...

to, e.g.

Unable to get filesystem of scheme "s3://" from specified path

@Abacn
Contributor

Abacn commented May 8, 2024

Even better: currently it hints 'e.g., pip install apache-beam[gcp]'. If the failed scheme is gs://, we can hint the user to run pip install apache-beam[gcp]; if the failed scheme is s3://, we can hint the user to run pip install apache-beam[aws]; and so on.
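
A scheme-aware message along these lines could be sketched as follows. This is not Beam's implementation: the function name missing_filesystem_message and the lookup table are hypothetical, and only the gs:// to [gcp] and s3:// to [aws] pairs come from this thread.

```python
import re

# Hypothetical lookup table. Only these two pairs are mentioned in the
# discussion; further schemes would have to be added by the maintainers.
_SCHEME_TO_EXTRA = {"gs": "gcp", "s3": "aws"}

_SCHEME_PATTERN = re.compile(r"^([a-zA-Z][a-zA-Z0-9+.\-]*)://")


def missing_filesystem_message(path):
    """Build an error message that names the scheme and, if known, the extra."""
    match = _SCHEME_PATTERN.match(path)
    if match is None:
        # No scheme could be parsed, so fall back to a generic message.
        return "Unable to get filesystem from specified path. Path specified: %s" % path
    scheme = match.group(1).lower()
    message = 'Unable to get filesystem of scheme "%s://" from specified path.' % scheme
    extra = _SCHEME_TO_EXTRA.get(scheme)
    if extra is not None:
        message += (
            " Ensure the required dependency is installed,"
            " e.g., pip install apache-beam[%s]." % extra
        )
    return message + " Path specified: %s" % path


if __name__ == "__main__":
    print(missing_filesystem_message("gs://bucket/object"))
    print(missing_filesystem_message("s3://bucket/key"))
```

Schemes without an entry in the table get the scheme-specific message but no install hint, which avoids suggesting the wrong extra.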

@RyuSA
Contributor Author

RyuSA commented May 9, 2024

@Abacn Thank you for reviewing my proposal. 👍

#31219 will raise excessive warnings for users who do not intend to install these dependencies...

In my PR, there should not be any warning logs for modules ([gcp], [aws], etc.) that the user did not intend to install. The import statement for a module that is not installed should raise ModuleNotFoundError, so it is caught before the ImportError handler is reached. (Am I right?)

$ python3
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> try:
...   import this_is_module_not_found
... except ModuleNotFoundError:
...   print("ModuleNotFound!")
... except ImportError:
...   print("ImportError!")
... 
ModuleNotFound!
>>>
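
The opposite case, a module that is found but fails while it is being imported, can be checked the same way. The sketch below is only an illustration: try_import, demo, and both module names are hypothetical, and the broken module is simulated with a temporary file that raises ImportError at import time.

```python
import importlib
import logging
import sys
import tempfile
from pathlib import Path

_LOGGER = logging.getLogger(__name__)


def try_import(module_name):
    """Return 'ok', 'not_installed', or 'broken' for the given module."""
    try:
        importlib.import_module(module_name)
    except ModuleNotFoundError:
        # The module (or something it imports) cannot be found: stay silent.
        return "not_installed"
    except ImportError as e:
        # The module was found but failed while initializing: log a warning.
        _LOGGER.warning(
            "Failed to import %s; it will be skipped. Error details: %s",
            module_name, e)
        return "broken"
    return "ok"


def demo():
    """Try one missing module and one module that raises ImportError."""
    with tempfile.TemporaryDirectory() as tmp:
        # Simulate an installed module that fails at import time.
        Path(tmp, "hypothetical_broken_fs.py").write_text(
            'raise ImportError("simulated: unsupported OpenSSL version")\n')
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            return (try_import("hypothetical_missing_fs"),
                    try_import("hypothetical_broken_fs"))
        finally:
            sys.path.remove(tmp)


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(demo())
```

One caveat: ModuleNotFoundError is also raised when an installed module imports a dependency that is itself missing, so that case stays silent under this pattern as well.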

However, on the other hand:

Can we improve the error message you referred to in the description instead?

Even better, currently it hints ...

I think these ideas are excellent!

I was considering this issue under the scope of "a Filesystem module (gcp/aws/azure) that the user has installed fails to initialize for some reason (such as OpenSSL)."
I think the enhancement you suggested (for the case where the user has not installed the Filesystem) could either be a separate issue or be included in this one. (I would be happy to submit a PR! 👀)

@github-actions github-actions bot added this to the 2.57.0 Release milestone May 21, 2024