Not respecting NLTK_DATA environment variable #3125

Closed
TaylorN15 opened this issue May 31, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@TaylorN15

Describe the bug
I'm not sure if this is an issue with unstructured or nltk...

I am running on Azure Functions in an App Service Environment that sits inside an internal network, where all outbound traffic is blocked and allowed by exception only. I have downloaded the required NLTK packages, stored them with the functions code, and set an NLTK_DATA environment variable in the app config. But it still tries to download the NLTK packages and times out (fails). If I manually copy the nltk_data folder to ~/nltk_data/ it works, but this is not viable because that directory is volatile.

To Reproduce
Block access to nltk.org, run any partition function that requires NLTK packages.

Expected behavior
The code should check environment variable NLTK_DATA

After timing out, I assume it's trying to unzip the downloaded NLTK data, which doesn't exist. The stack trace indicates an issue with python-docx, but I couldn't find NLTK referenced in there.

  File "/home/site/wwwroot/.python_packages/lib/site-packages/unstructured/partition/docx.py", line 423, in _document
    return docx.Document(file)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/docx/api.py", line 27, in Document
    document_part = cast("DocumentPart", Package.open(docx).main_document_part)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/docx/opc/package.py", line 127, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/docx/opc/pkgreader.py", line 22, in from_file
    phys_reader = PhysPkgReader(pkg_file)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/docx/opc/phys_pkg.py", line 76, in __init__
    self._zipf = ZipFile(pkg_file, "r")
  File "/usr/local/lib/python3.10/zipfile.py", line 1271, in __init__
    self._RealGetContents()
  File "/usr/local/lib/python3.10/zipfile.py", line 1338, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
@TaylorN15 TaylorN15 added the bug Something isn't working label May 31, 2024
@MthwRobinson
Contributor

@scanny - Any thoughts on the python-docx stack traces? We don't do anything special with the NLTK_DATA environment variable; that all gets handled by nltk.
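
For anyone checking whether NLTK_DATA is actually visible to the process, a stdlib-only diagnostic along these lines may help. It is a minimal sketch, not nltk's own lookup code: the function names are hypothetical, and it assumes NLTK_DATA holds one or more directories separated by the platform path separator and that resources live at relative paths such as tokenizers/punkt (optionally as a .zip).

```python
import os


def nltk_data_dirs(environ=None):
    """Return the directories listed in NLTK_DATA, in order, skipping empty entries.

    Hypothetical helper: assumes entries are separated by os.pathsep.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("NLTK_DATA", "")
    return [d for d in raw.split(os.pathsep) if d]


def locate_resource(resource, dirs):
    """Return the first existing path for `resource` under `dirs`, or None.

    `resource` is a slash-separated relative path, e.g. "tokenizers/punkt".
    A sibling "<resource>.zip" is accepted as well.
    """
    for directory in dirs:
        candidate = os.path.join(directory, *resource.split("/"))
        for path in (candidate, candidate + ".zip"):
            if os.path.exists(path):
                return path
    return None
```

Logging `nltk_data_dirs()` and `locate_resource("tokenizers/punkt", nltk_data_dirs())` at function start-up would show whether the app setting reached the worker and whether the bundled data sits where it points. The resource name here is only an example.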

@scanny
Collaborator

scanny commented May 31, 2024

I'd be looking for the original partitioning call, especially for the file-type (ODT, DOC, DOCX), to get any real insight.

This error is the one you get when a file-path is provided to python-docx and either:

  • There is no file at that path.
  • There is a file but it is not a DOCX file.

At the outermost level, a DOCX file is a Zip archive. So if the file isn't a Zip archive it's definitely not a DOCX file.
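
The two cases above can be told apart before the file ever reaches python-docx. The following is a minimal sketch using only the standard library; the function name and return labels are hypothetical, and the `[Content_Types].xml` check relies on the fact that DOCX packages carry that part at the archive root.

```python
import os
import zipfile


def diagnose_docx_candidate(path):
    """Classify `path` before handing it to a DOCX parser.

    Returns one of: "missing", "not-zip", "zip-but-not-opc", "plausible-docx".
    """
    if not os.path.isfile(path):
        return "missing"  # no file at that path
    if not zipfile.is_zipfile(path):
        return "not-zip"  # e.g. a legacy DOC file, or an HTML error page saved as .docx
    with zipfile.ZipFile(path) as archive:
        if "[Content_Types].xml" not in archive.namelist():
            return "zip-but-not-opc"  # a Zip archive, but not an Office package
    return "plausible-docx"
```

A "not-zip" result on a file that was fetched over the network is a hint to look at what was actually downloaded, since a blocked or failed request can leave an error body where the document was expected.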

On the python-docx issues list this most frequently occurs when someone tries to use python-docx for a DOC file (pre-2007 Word file), but there are any number of ways it can happen.

The rest of the stack trace might also narrow it down.

@TaylorN15
Author

Thanks for the quick responses. I think I was wrong about the NLTK data: once we added a rule to our firewall to allow access to GitHub for downloads, I got the error again. I then realised the error was caused by the attempt to download the telemetry package from your servers.

Initially I suspected an issue with NLTK because it worked when I copied the nltk_data folder to /home/site (on the app service), but I think my networking guys were troubleshooting at the same time, so it was a false positive.

@TaylorN15 TaylorN15 closed this as not planned Won't fix, can't repro, duplicate, stale May 31, 2024
@MthwRobinson
Contributor

Thanks for following up @TaylorN15 !
