-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversion script for LA Metro attachments #193
Conversation
for posterity: piping |
(1) I noticed several instances of plain text omitted spaces. This seems to be a known issue with PyPDF2: py-pdf/pypdf#17 (2) PyPDF2 can only convert one page at a time (using I think pdfminer six might be the best option - it comes with a nice |
I ultimately landed on |
Clarification regarding the abandonment of unoconv struggles with converting PDF to txt. I tried these conversions both locally and on the Councilmatic server. For both, unoconv errors with "Unable to store document..." when calling storeToURL - a function in OpenOffice. Server
It may be that we need a different version of OpenOffice, but I would rather not go down that path, considering that we are happily using Unoconv for the RTF converter. |
Installing textract on an Ubuntu server I installed textract on the staging server (i.e., the server with all our staging sites, not the Councilmatic server), and it worked well. It requires several lightweight dependencies and one small install hack. First, install all the dependencies. Second, textract fails when installing |
@hancush - I have a working solution for the text conversion! The one point that requires further thought entails the use of a Ideally, we could do this without a temporary file, i.e., with a subprocess, since textract can be used as a CLI tool:
However, textract uses a variety of dependencies to open and convert files: docx, doc, pdf Is there a way to pass pdf, doc, and docx files to textract with a subprocess - and I am not seeing it? @fgregg - could I bring you into this conversation as a consultant/reviewer? The main parts of this are (1) why I used |
Since unoconv remains a dependency, as we use it for converting the rtf files to html, what about just install pdftotext and just using that for pdfs and unoconv for the doc files. That would seem to be a much smaller footprint? |
@fgregg, yes, indeed. One of my first implementations did just that (i.e., use Unoconv for doc and docx, then another tool for PDF) - see commented code. I am happy to revert back to something along these lines with pdftotext. With that said...
I am not married to either solution! Just getting my thoughts out there. |
This is a compelling argument, as unoconv is a big dependency. However textract has a lot of binary dependencies, it is very far from lightweight. Okay, I'm okay with textrac. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks very much for finding the right tool! :-) A few comments inline...
self.add_plain_text() | ||
|
||
def get_document_url(self): | ||
self.connection.execute("SET local timezone to '{}'".format(settings.TIME_ZONE)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do this here, instead of in the transactional context below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do this at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - this is unnecessary: we are not adding any timestamps to the instances of BillDocument. We can safely remove it.
|
||
def handle(self, *args, **options): | ||
self.update_all = options['update_all'] | ||
self.connection = engine.connect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opening a connection in this way, without subsequently closing it, relies on garbage collection to close it out for you. The SQLAlchemy docs caution against this approach, notably because it's unreliable and can lead to orphaned connections to your database hanging out forever until they ultimately cause a mysterious "You have too many connections open, sorry!!" error from Postgres.
Is there a reason you assigned connection
as a class attribute rather than using the with engine.begin()
context manager when you require a connection?
|
||
chunk = [] | ||
|
||
for doc_dict in plain_text_results: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be a bit more readable here to do for doc_dict in self.convert_document()
rather than assigning the generator function call to a variable, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you be a bit more descriptive with the convert_document
name, perhaps convert_document_to_plaintext
?
def add_plain_text(self): | ||
plain_text_results = self.convert_document() | ||
|
||
self.connection.execute("SET local timezone to '{}'".format(settings.TIME_ZONE)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have to do this twice? Could we do it once, at the top of the handle
method?
|
||
logger.info('Converting document to plain text...') | ||
|
||
for document_data in documents: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto for x in self.generator_method()
rather than assigning to variable comment.
|
||
for doc_dict in plain_text_results: | ||
chunk.append(doc_dict) | ||
if len(chunk) == 20: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get that you want to do these updates in batches of 20 or less, but this control flow feels a little bit clunky. @fgregg, is there a more streamlined way to do this? If no, @reginafcompton, maybe a comment stating intention would make this a little easier to grok. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure 90% that
self.connection.execute(sa.text(update_statement), plain_texts)
will do the right thing if plain_texts is a generator.
…ating and saving in chunks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems legit to me! This is an awesome revision! One comment inline.
plaintexts = self.convert_document_to_plaintext() | ||
|
||
while True: | ||
plaintexts_fetched_from_generator = list(itertools.islice(plaintexts, 20)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If @fgregg wants to backread this change just in case I've missed a subtlety, I'd be much obliged!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR handles Metro issue Metro-Records/la-metro-councilmatic#266
The script should (1) convert .doc, .docx, and PDF attachments into plain text, and (2) save that text in the
full_text
field on the BillDocument model.Unoconv can convert doc and .docx files. However, it cannot convert PDFs into plain text. See what LibreOffice can and cannot export.
For PDFs, we could use something like PyPDF2 in combination with the requests library.
Note: Since we need to use something in addition to unoconv, we might think about if we should use something other than unoconv for the
doc
anddocx
conversions (unoconv is a heavy dependency). The original scope mentions Excel, however - and unoconv could be good for those files....