read_pdf from a URL #91

Closed
vinayak-mehta opened this issue Sep 2, 2018 · 11 comments

@vinayak-mehta
Contributor

No description provided.

@pecey
Contributor

pecey commented Oct 3, 2018

Is this still something that needs to be done?

@vinayak-mehta
Contributor Author

vinayak-mehta commented Oct 3, 2018

Hey @pecey, it would be good to have this as a feature. A good starting point would be pandas' read_html. Do post here about how you're planning to do this if you take this up 😄

EDIT: You can check out read_csv too which says "Valid URL schemes include http, ftp, s3, and file."

@rra94

rra94 commented Oct 14, 2018

Hey, I'm interested in this issue. Can you explain a little more?

@vinayak-mehta
Contributor Author

Hey @rra94, you can check out the links in my comment above. Just like pandas.read_html reads HTML from a URL, camelot.read_pdf would read a PDF from a URL.

@rra94

rra94 commented Oct 15, 2018

Hey @vinayak-mehta, I can take this. I have two ideas:

  1. Use urllib2 to get the PDF as a StringIO object and pass it to the PdfFileReader function in the PDF handler (a rough sketch follows after the list).
  2. Download a local copy using urllib and then use the local copy with PdfFileReader.
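
A minimal sketch of option 1, assuming Python 3's urllib.request and PyPDF2; since a PDF is binary data, the in-memory buffer would need to be a BytesIO rather than a StringIO (the function name below is illustrative):

    from io import BytesIO
    from urllib.request import urlopen

    from PyPDF2 import PdfFileReader


    def pdf_reader_from_url(url):
        # Fetch the PDF into an in-memory buffer; PDFs are binary, so BytesIO.
        with urlopen(url) as response:
            buffer = BytesIO(response.read())
        # PdfFileReader accepts any file-like object, so no temp file is needed.
        return PdfFileReader(buffer)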

What do you recommend?

@vinayak-mehta
Contributor Author

Option 1 sounds better than 2, since we don't have to worry about cleaning up the downloaded file afterwards. (Though it will fill up memory for very large files; we could raise a warning in that case. Did you take a look at pandas.read_html? How is it implemented there?)

We might also need to think about differentiating between filepaths and URLs, maybe using regexes.

@rra94

rra94 commented Oct 15, 2018

In pandas it's the same thing, AFAIK:

    if _is_url(obj):
        with urlopen(obj) as url:
            text = url.read()

We can use a regex, or simply check whether the first few characters are ftp or http.
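
For reference, a sketch of such a check along the lines of pandas' _is_url, using urllib.parse; the helper name and scheme list here are illustrative assumptions, not camelot's actual code:

    from urllib.parse import urlparse

    # Assumed scheme list, echoing the "http, ftp, s3, and file" mention above.
    _VALID_SCHEMES = ("http", "https", "ftp", "file")

    def is_url(filepath_or_url):
        try:
            return urlparse(filepath_or_url).scheme in _VALID_SCHEMES
        except (TypeError, AttributeError):
            return False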

@vinayak-mehta
Contributor Author

Regex sounds good; please open a PR and we can continue the discussion there! Do check out the contribution guidelines here: https://camelot-py.readthedocs.io/en/master/dev/contributing.html#pull-requests.

@pecey
Contributor

pecey commented Oct 29, 2018

We can also download the file to tmp. Then we would be able to reuse the read_pdf method in io.py. Cleanup shouldn't be much of a task: on many systems, tmp is cleaned up pretty frequently, so unless someone is downloading huge files back-to-back, this shouldn't cause an issue.

@vinayak-mehta
Contributor Author

vinayak-mehta commented Oct 31, 2018

> Then we would be able to reuse the read_pdf method in io.py.

read_pdf won't be reused, since it's the top-level interface. A step could be added to the lower-level PDFHandler to differentiate whether the input is a URL, a file-like object, or a filepath, and then download it into tmp, as it already does with the TemporaryDirectory context manager.
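
A rough sketch of what such a step could look like, assuming a hypothetical _download_url helper inside PDFHandler and the is_url check sketched earlier; this is an illustration, not the actual implementation:

    import os
    from tempfile import TemporaryDirectory
    from urllib.request import urlopen

    def _download_url(url, tmpdir):
        # Save the remote PDF into the temporary directory that PDFHandler
        # already uses for its page-wise processing.
        filepath = os.path.join(tmpdir, "downloaded.pdf")
        with urlopen(url) as response, open(filepath, "wb") as f:
            f.write(response.read())
        return filepath

    # Inside the handler (sketch):
    # with TemporaryDirectory() as tmpdir:
    #     if is_url(filepath_or_url):
    #         filepath_or_url = _download_url(filepath_or_url, tmpdir)
    #     ...continue with the existing filepath handling...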

@vinayak-mehta vinayak-mehta added this to the v0.6.0 milestone Dec 2, 2018
kirbs- pushed a commit to kirbs-/camelot that referenced this issue Jul 31, 2020