Not respecting NLTK_DATA environment variable #3125
Comments
@scanny - Any thoughts on the
I'd be looking for the original partitioning call, especially for the file-type (ODT, DOC, DOCX), to get any real insight. This error is the one you get when a file-path is provided to
At the outermost level, a DOCX file is a Zip archive. So if the file isn't a Zip archive it's definitely not a DOCX file. On the The rest of the stack trace might also narrow it down.
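To illustrate the Zip check described above, here is a minimal stdlib-only sketch. The function name `looks_like_ooxml_package` is hypothetical, not part of unstructured or python-docx. It assumes that a valid DOCX, like any Open Packaging Conventions file, carries a `[Content_Types].xml` entry; passing the check only shows that the file could be a DOCX (XLSX and PPTX pass too), while failing it rules DOCX out.

```python
import zipfile


def looks_like_ooxml_package(path):
    """Return True if `path` is a Zip archive with an OPC content-types part.

    Hypothetical helper for triage only: False means the file cannot be a
    DOCX; True means it is at least an OPC package (DOCX, XLSX, PPTX, ...).
    """
    # is_zipfile returns False for non-Zip files and for unreadable paths.
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        return "[Content_Types].xml" in archive.namelist()
```

Running this on the file handed to the partitioning call would show quickly whether the input is something other than a DOCX, for example an HTML error page saved under a .docx name.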
Thanks for the quick responses. I think I may have been incorrect about the NLTK data: once we added a firewall rule to allow access to GitHub for downloads, I got the error again. I then realised the error was caused by trying to download the telemetry package from your servers. Initially I suspected NLTK because it worked when I copied the NLTK_data to /home/site (on the app service), but I think my networking guys were troubleshooting at the same time, so it was a false positive.
Thanks for following up, @TaylorN15!
Describe the bug
I'm not sure if this is an issue with unstructured or nltk... I am running on Azure Functions in an App Service Environment which is within an internal network, where all outbound traffic is blocked and allowed by exception only. I have downloaded the required NLTK packages, stored them with the functions code, and set an environment variable for NLTK_DATA in the app config. But it still tries to download the NLTK packages and times out (fails). If I (manually) copy the nltk_data folder to ~/nltk_data/ it works, but this is not viable as this directory is volatile.
To Reproduce
Block access to nltk.org, run any partition function that requires NLTK packages.
Expected behavior
The code should check the NLTK_DATA environment variable.
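For anyone debugging a similar setup, a quick sanity check is to confirm that the process actually sees NLTK_DATA and that each entry points at a real directory. NLTK itself splits this variable on the platform path separator and adds each entry to its search path (`nltk.data.path`). The sketch below uses only the standard library; the function name `nltk_data_dirs_from_env` is hypothetical, and it says nothing about whether unstructured consults the variable before attempting a download.

```python
import os


def nltk_data_dirs_from_env(environ=None):
    """Return (path, is_directory) pairs for each entry in NLTK_DATA.

    Hypothetical diagnostic helper: mirrors how the variable is split on
    os.pathsep, but does not import or call NLTK.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("NLTK_DATA", "")
    # Drop empty entries produced by an unset variable or stray separators.
    entries = [entry for entry in raw.split(os.pathsep) if entry]
    return [(entry, os.path.isdir(entry)) for entry in entries]
```

An empty result means the variable is not visible to the function host at all; a `False` flag means the path is set but not mounted or deployed where the code expects it.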
After timing out, I assume it's trying to unzip the downloaded NLTK data, which doesn't exist. The stack trace indicates an issue with python-docx, but I couldn't find NLTK referenced in there.