Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversion script for LA Metro attachments #193

Merged
merged 11 commits into from
Apr 9, 2018

Conversation

reginafcompton
Copy link
Contributor

@reginafcompton reginafcompton commented Mar 22, 2018

This PR handles Metro issue Metro-Records/la-metro-councilmatic#266

The script should (1) convert .doc, .docx, and PDF attachments into plain text, and (2) save that text in the full_text field on the BillDocument model.

Unoconv can convert doc and .docx files. However, it cannot convert PDFs into plain text. See what LibreOffice can and cannot export.

For PDFs, we could use something like PyPDF2 in combination with the requests library.

Note: Since we need to use something in addition to unoconv, we might think about if we should use something other than unoconv for the doc and docx conversions (unoconv is a heavy dependency). The original scope mentions Excel, however - and unoconv could be good for those files....

@reginafcompton reginafcompton changed the title Initial pass at conversion script for LA Metro attachments [WIP] Initial pass at conversion script for LA Metro attachments Mar 22, 2018
@hancush
Copy link
Member

hancush commented Mar 26, 2018

for posterity: piping in2csv output to txt could work for excel files!

@reginafcompton
Copy link
Contributor Author

reginafcompton commented Mar 30, 2018

PyPDF2 is easy to install and use. However, it comes with a couple drawbacks:

(1) I noticed several instances of plain text omitted spaces. This seems to be a known issue with PyPDF2: py-pdf/pypdf#17

(2) PyPDF2 can only convert one page at a time (using extractText()).


I think pdfminer six might be the best option - it comes with a nice pdf2txt.py script, although the pip install does not work as expected. The documentation suggests downloading it from source.

@reginafcompton
Copy link
Contributor Author

I ultimately landed on textract, since it converts pdf, doc, and docx files to plaintext, without the heft of unoconv or a second library. I did not have much difficulty installing it on MacOS, but I'd like to try it out on the Councilmatic server before confirming this solution: http://textract.readthedocs.io/en/stable/installation.html#ubuntu-debian

@reginafcompton
Copy link
Contributor Author

reginafcompton commented Apr 4, 2018

Clarification regarding the abandonment of unoconv

unoconv struggles with converting PDF to txt. I tried these conversions both locally and on the Councilmatic server. For both, unoconv errors with "Unable to store document..." when calling storeToURL - a function in OpenOffice.

Server

# Command run 
# I also tried this with "text"
unoconv -f txt 8e6281f1-8342-42ae-b5a2-271ca6902d99.pdf

# Error
File "/usr/bin/unoconv", line 1118, in convert
    document.storeToURL(outputurl, tuple(outputprops) )
uno.IOException: SfxBaseModel::impl_store <file:///tmp/8e6281f1-8342-42ae-b5a2-271ca6902d99.txt> failed: 0xc10

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/unoconv", line 1389, in <module>
    main()
  File "/usr/bin/unoconv", line 1305, in main
    convertor.convert(inputfn)
  File "/usr/bin/unoconv", line 1120, in convert
    raise UnoException("Unable to store document to %s (ErrCode %d)\n\nProperties: %s" % (outputurl, e.ErrCode, outputprops), None)
  File "/usr/lib/python3/dist-packages/uno.py", line 507, in _uno_struct__getattr__
    return getattr(self.__dict__["value"], name)
AttributeError: ErrCode

It may be that we need a different version of OpenOffice, but I would rather not go down that path, considering that we are happily using Unoconv for the RTF converter.

@reginafcompton
Copy link
Contributor Author

reginafcompton commented Apr 4, 2018

Installing textract on an Ubuntu server

I installed textract on the staging server (i.e., the server with all our staging sites, not the Councilmatic server), and it worked well. It requires several lightweight dependencies and one small install hack.

First, install all the dependencies.

Second, textract fails when installing pocketsphinx, which is not actually necessary for converting PDFs or word documents to txt. This issue provides a clever work around: deanmalmgren/textract#178.
Third, run pip install textract.

@reginafcompton
Copy link
Contributor Author

reginafcompton commented Apr 4, 2018

@hancush - I have a working solution for the text conversion!

The one point that requires further thought entails the use of a NamedTemporaryFile in convert_document.

Ideally, we could do this without a temporary file, i.e., with a subprocess, since textract can be used as a CLI tool:

p = subprocess.Popen(['textract', '--stdin', '--stdout'], preexec_fn=os.setsid, stdout=subprocess.PIPE, stdin=subprocess.PIPE, stderr=subprocess.DEVNULL)
plain_text, stderr_data = p.communicate(input=RESPONSE_IN_SOME_FORMAT, timeout=15)

However, textract uses a variety of dependencies to open and convert files: docx, doc, pdf

Is there a way to pass pdf, doc, and docx files to textract with a subprocess - and I am not seeing it?


@fgregg - could I bring you into this conversation as a consultant/reviewer? The main parts of this are (1) why I used textract, rather than unoconv (see comments above), and (2) the question of using a subprocess with --stdin rather than a TempFile (see notes directly above).

@reginafcompton reginafcompton requested a review from hancush April 4, 2018 22:18
@reginafcompton reginafcompton changed the title [WIP] Initial pass at conversion script for LA Metro attachments Conversion script for LA Metro attachments Apr 4, 2018
@fgregg
Copy link
Member

fgregg commented Apr 5, 2018

Since unoconv remains a dependency, as we use it for converting the rtf files to html, what about just install pdftotext and just using that for pdfs and unoconv for the doc files.

That would seem to be a much smaller footprint?

@reginafcompton
Copy link
Contributor Author

@fgregg, yes, indeed. One of my first implementations did just that (i.e., use Unoconv for doc and docx, then another tool for PDF) - see commented code. I am happy to revert back to something along these lines with pdftotext.

With that said...

  • the two-tool solution seems unnecessarily verbose (when textract can do everything for us)
  • it does not seem terribly efficient to open an unoconv listener every session, even if we do not need to convert doc and docx files
  • ideally, we are going to isolate each Councilmatic instance on unique servers: do we want to have unoconv on the Metro server for this little task? in that case, it seems like textract would be a smaller footprint, no?

I am not married to either solution! Just getting my thoughts out there.

@fgregg
Copy link
Member

fgregg commented Apr 6, 2018

ideally, we are going to isolate each Councilmatic instance on unique servers: do we want to have unoconv on the Metro server for this little task? in that case, it seems like textract would be a smaller footprint, no?

This is a compelling argument, as unoconv is a big dependency. However textract has a lot of binary dependencies, it is very far from lightweight.

Okay, I'm okay with textrac.

Copy link
Member

@hancush hancush left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for finding the right tool! :-) A few comments inline...

self.add_plain_text()

def get_document_url(self):
self.connection.execute("SET local timezone to '{}'".format(settings.TIME_ZONE))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do this here, instead of in the transactional context below?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do this at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - this is unnecessary: we are not adding any timestamps to the instances of BillDocument. We can safely remove it.


def handle(self, *args, **options):
self.update_all = options['update_all']
self.connection = engine.connect()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opening a connection in this way, without subsequently closing it, relies on garbage collection to close it out for you. The SQLAlchemy docs caution against this approach, notably because it's unreliable and can lead to orphaned connections to your database hanging out forever until they ultimately cause a mysterious "You have too many connections open, sorry!!" error from Postgres.

Is there a reason you assigned connection as a class attribute rather than using the with engine.begin() context manager when you require a connection?


chunk = []

for doc_dict in plain_text_results:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be a bit more readable here to do for doc_dict in self.convert_document() rather than assigning the generator function call to a variable, I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you be a bit more descriptive with the convert_document name, perhaps convert_document_to_plaintext?

def add_plain_text(self):
plain_text_results = self.convert_document()

self.connection.execute("SET local timezone to '{}'".format(settings.TIME_ZONE))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have to do this twice? Could we do it once, at the top of the handle method?


logger.info('Converting document to plain text...')

for document_data in documents:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto for x in self.generator_method() rather than assigning to variable comment.


for doc_dict in plain_text_results:
chunk.append(doc_dict)
if len(chunk) == 20:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get that you want to do these updates in batches of 20 or less, but this control flow feels a little bit clunky. @fgregg, is there a more streamlined way to do this? If no, @reginafcompton, maybe a comment stating intention would make this a little easier to grok. :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure 90% that

 self.connection.execute(sa.text(update_statement), plain_texts)

will do the right thing if plain_texts is a generator.

Copy link
Member

@hancush hancush left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems legit to me! This is an awesome revision! One comment inline.

plaintexts = self.convert_document_to_plaintext()

while True:
plaintexts_fetched_from_generator = list(itertools.islice(plaintexts, 20))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If @fgregg wants to backread this change just in case I've missed a subtlety, I'd be much obliged!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good @hancush ! I'll wait for final comments from @fgregg before merging.

@reginafcompton reginafcompton merged commit 8aa07f6 into master Apr 9, 2018
@reginafcompton reginafcompton deleted the attachment-conversion branch April 9, 2018 18:01
@hancush hancush mentioned this pull request Jun 18, 2019
11 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants