read_pdf from a URL #91
Comments
Is this still something that needs to be done?
Hey, I'm interested in this issue. Can you explain a little more?
Hey @rra94, you can check out the links in my comment above. Just like pandas.read_html reads HTML from a URL, camelot.read_pdf would read a PDF from a URL.
Hey @vinayak-mehta I can take this. I have two ideas:
What do you recommend?
1 sounds better than 2, since we don't have to worry about cleaning up the downloaded file afterwards. (It will fill up memory for very large files, though; we could give a warning about that. Did you take a look at pandas.read_html? How is this implemented there?) We might also need to think about differentiating between filepaths and URLs, maybe using regexes.
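To make the in-memory option concrete, here is a minimal sketch using only the standard library. The helper name download_to_memory, the warn_above parameter, and the 50 MB threshold are all hypothetical and not part of Camelot; the warning mirrors the concern about large files raised above.

```python
import io
import warnings
from urllib.request import urlopen

# Arbitrary threshold chosen for illustration (an assumption, not a Camelot setting).
LARGE_FILE_BYTES = 50 * 1024 * 1024


def download_to_memory(url, warn_above=LARGE_FILE_BYTES):
    """Fetch `url` and return its content as an in-memory binary buffer.

    Hypothetical helper: nothing is written to disk, so there is no
    temporary file to clean up afterwards.
    """
    with urlopen(url) as response:
        data = response.read()  # the whole payload is held in memory
    if len(data) > warn_above:
        warnings.warn(
            f"Downloaded {len(data)} bytes into memory; "
            "large PDFs may exhaust available RAM."
        )
    return io.BytesIO(data)
```

The returned buffer behaves like an open binary file, so it could be handed to any PDF library that accepts file-like objects. The trade-off is the one noted above: no cleanup is needed, but the entire file lives in memory.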
In pandas it's the same thing AFAIK: if _is_url(obj): ... We can use a regex, or simply check whether the string starts with ftp or http.
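As a rough illustration of that check, the sketch below parses the string with urllib.parse instead of a regex or a raw prefix comparison. The function name is_url and the accepted scheme set are assumptions for this example; this is not the pandas implementation.

```python
from urllib.parse import urlparse

# Schemes treated as remote sources (an assumption for this sketch).
_URL_SCHEMES = {"http", "https", "ftp"}


def is_url(path):
    """Return True if `path` looks like a URL with a supported scheme."""
    if not isinstance(path, str):
        return False
    parsed = urlparse(path)
    # Requiring a network location keeps Windows paths such as
    # "C:\\docs\\foo.pdf" (parsed scheme "c") from being mistaken for URLs.
    return parsed.scheme in _URL_SCHEMES and bool(parsed.netloc)
```

Compared with a plain prefix test, this also rejects local filenames that merely begin with "http", such as http_report.pdf.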
Regex sounds good, please open a PR, we can continue the discussion there! Do check out the contribution guidelines here: https://camelot-py.readthedocs.io/en/master/dev/contributing.html#pull-requests
We can also download the file in |
read_pdf won't be reused since it's the top-level interface. A step could be added to the lower-level PDFHandler which would differentiate if the input is a URL or file-like object or filepath, and then download into |
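The differentiation step described above could be sketched as a small classifier that a lower-level handler calls before deciding how to open the input. The name classify_source and its return labels are hypothetical; this is not Camelot's actual PDFHandler code.

```python
import os
from urllib.parse import urlparse


def classify_source(source):
    """Return 'file-like', 'url', or 'filepath' for the given input.

    Hypothetical helper showing the three-way split; raises TypeError
    for inputs that are neither readable objects nor path-like.
    """
    # Anything with a read() method is treated as an open file-like object.
    if hasattr(source, "read"):
        return "file-like"
    # os.fspath accepts str, bytes, and os.PathLike (e.g. pathlib.Path).
    text = os.fspath(source)
    if isinstance(text, bytes):
        text = os.fsdecode(text)
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https", "ftp") and parsed.netloc:
        return "url"
    return "filepath"
```

Checking for file-like objects first means an already-open stream is never passed through the URL or path logic.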
Leaving this here to check out later. https://stackoverflow.com/questions/22800100/parsing-a-pdf-via-url-with-python-using-pdfminer |
[MRG] Update how-it-works.rst
No description provided.