Adapting parserator to handle an entire document #16

Closed
AbeHandler opened this issue Mar 24, 2015 · 25 comments

@AbeHandler

I am currently using Parserator to parse short strings, like this:

s of 1110 of an hour. The maximum amount to be paid under this contract is $20,000.00. No amount of work is guaranteed under this agreement; payments wil

and this

General Liability insurance will be purchased and maintained with limits of $1,000,000 per occurrence an

I extract these strings using a loose regular expression `.{75}\$[0-9]+.{75}` on documents that are usually 5 to 10 pages long. I'm most interested in tagging and categorizing the dollar values. Often, the ~150 characters of context captured by the regex are enough to categorize the dollar value. But in some cases I need input from other parts of the document to do the tagging (e.g. earlier in the document it might mention that the document is a lease).
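
For concreteness, here is roughly what that extraction step looks like (a minimal sketch; the input filename is hypothetical):

```python
import re

# ~75 characters of context on either side of a dollar amount.
# re.DOTALL lets the context windows span line breaks.
pattern = re.compile(r'.{75}\$[0-9]+.{75}', re.DOTALL)

with open('contract.txt') as f:  # hypothetical input file
    text = f.read()

snippets = pattern.findall(text)
```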

@fgregg has pointed me to this example of how you could do this with crfsuite: http://nbviewer.ipython.org/github/tpeng/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb. But I am wondering if it might be possible with the parserator wrapper: all uninteresting tokens would get a generic null label, and the interesting ones would be tagged with their proper values.
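
For reference, a minimal python-crfsuite training loop in the spirit of that notebook (the feature function, tag names, and training example here are hypothetical placeholders):

```python
import pycrfsuite

def token_features(tokens, i):
    # Toy per-token features; a real model would add surrounding context.
    token = tokens[i]
    return {
        'lower': token.lower(),
        'has_digit': any(c.isdigit() for c in token),
        'has_dollar': '$' in token,
    }

# Each training pair is (tokens, tags); uninteresting tokens carry a
# generic 'Null' tag, interesting ones a real label.
train_data = [
    (['maximum', 'amount', 'is', '$20,000.00'],
     ['Null', 'Null', 'Null', 'ContractAmount']),
]

trainer = pycrfsuite.Trainer(verbose=False)
for tokens, tags in train_data:
    xseq = [token_features(tokens, i) for i in range(len(tokens))]
    trainer.append(xseq, tags)
trainer.train('contracts.crfsuite')

tagger = pycrfsuite.Tagger()
tagger.open('contracts.crfsuite')
print(tagger.tag([token_features(['$1,000,000'], 0)]))
```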

I wanted to see what you all thought about (1) using parserator in this way and (2) adapting parserator to cover such cases. The biggest holdup to using parserator in this way is tagging documents with hundreds and hundreds of tokens. It seems like you would want a small document-annotation GUI to generate the XML to train parserator. Do you think that such a GUI should be part of the library? Do you think this would work? Would you be open to a pull request?

@fgregg

fgregg commented Mar 25, 2015

Very interested. I think the main difficulty is the interface. We probably don't have the bandwidth to build the interface, but can definitely advise. @derekeder, @cathydeng, @evz, any thoughts on this one? Flask app?

@AbeHandler
Author

I know Flask and could take on building a GUI. A GUI would also help in cases where you are tagging only 10 or 15 tokens. I found the command line tagging fairly cumbersome. I also need the GUI anyway to complete this contracts project.

@fgregg

fgregg commented Mar 25, 2015

Sounds good. Hopefully, the current command line interface is decoupled from the rest of parserator. If it isn't, we will make it so.

@AbeHandler
Author

It seems like it might be worth it to use a front-end framework for the GUI. I think the GUI will start getting somewhat complicated and we will end up wanting things like models. Does datamade have allegiances to a particular framework? Do you all have thoughts about whether that's necessary?

@fgregg

fgregg commented Mar 25, 2015

We like flask.

@AbeHandler
Author

Sounds good. And no allegiances on a front-end javascript framework? Backbone? Ember? Angular?

@derekeder
Member

@AbeHandler we've done Backbone in the past, but I kind of hate fighting it and all the complexity it introduces. We've been going 'no framework' (well, jQuery) for our recent projects, but go with what you're comfortable with.

@AbeHandler
Author

Hi everyone,

After a bit of thought, I realized that I should just modify the DocumentCloud viewer to tag a document -- rather than building a whole new UI.

I'm working on that now: https://github.com/AbeHandler/DocumentParserator

Let me know what you all think. I would like to put in a pull request down the line once the code is cleaned up and working properly.

@fgregg

fgregg commented Apr 2, 2015

Very cool! @knowtheory, you should peep this.

@AbeHandler
Author

One issue that is coming up: parserator is going quite slowly (maybe ~6 min per contract) when parsing an entire document. I can work around this -- but others might not be so patient. For now, I'm using the parallel command and working on other stuff while I wait. I can provide clearer benchmarks down the line, but I wanted to make a note of it on this thread.

@cathydeng

@AbeHandler hmm...my initial hunch is that it's taking a long time b/c a document gets broken out into way more tokens than the typical name or address. what happens if you run your contract parser on a string w/, say, 10 tokens? or on an input approximately half the length of your typical contract that takes ~6 mins?

@AbeHandler
Author

@cathydeng Yah. It runs fine on short strings. It would make sense that it is taking so long because there are so many tokens. But I can't really think of a way to cut down on the tokens for any given contract. Right?

@cathydeng

what about configuring the tokenizer to be smarter, so that tokens are less granular than a single word where applicable?

I'm not sure what your data looks like, but if there are certain sequences of words that you know belong together under the same label, it could def cut down on the number of tokens to evaluate. For example, if you have, say, the string 'one (two three four five six) seven', instead of splitting it into 7 tokens, you could configure the tokenizer to split it into 3: 'one', '(two three four five six)', 'seven'. Thoughts?
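
A rough sketch of that idea (the phrase patterns here are hypothetical): try known multi-word sequences first, and fall back to splitting on whitespace.

```python
import re

# Hypothetical multi-word sequences that should stay together as one token.
PHRASES = [
    r'\(two three four five six\)',
    r'General Liability insurance',
]

def tokenize(text):
    # Alternation is ordered, so known phrases win over single words.
    pattern = re.compile('|'.join(PHRASES) + r'|\S+')
    return pattern.findall(text)

print(tokenize('one (two three four five six) seven'))
# ['one', '(two three four five six)', 'seven']
```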

@AbeHandler
Author

That is a good idea -- particularly since there is a lot of boilerplate language in these contracts. It might be possible to take entire paragraphs as individual tokens.

@samzhang111

Eavesdropping on this thread because this project is really cool. How many tags do you have, btw? In general, CRFs are linear in tokens and quadratic in tags, right?

@AbeHandler
Author

@samzhang111 I have about 10 tags. "In general, CRFs are linear in tokens and quadratic in tags, right?" Not sure. Can you say a bit more about what you mean? I'd be interested to hear more.

@cathydeng that is a great idea about tokens. Any ideas about how I can integrate that boilerplate recognition into the tokenizer without writing a huge regex? Some of the boilerplate is 4 or 5 lines long. This has an example of the boilerplate (page 20): https://www.documentcloud.org/documents/326455-12-14-11-eustis-engineering-services-st-bernard
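
One possible approach, assuming the boilerplate passages are known ahead of time: instead of one huge regex, normalize and hash each paragraph and check it against a set of fingerprints of known boilerplate.

```python
import hashlib
import re

def fingerprint(paragraph):
    # Collapse whitespace and case so formatting noise doesn't matter.
    normalized = re.sub(r'\s+', ' ', paragraph).strip().lower()
    return hashlib.md5(normalized.encode('utf-8')).hexdigest()

# Hypothetical: fingerprints of paragraphs known to be boilerplate.
KNOWN_BOILERPLATE = {
    fingerprint('No amount of work is guaranteed under this agreement.'),
}

def is_boilerplate(paragraph):
    return fingerprint(paragraph) in KNOWN_BOILERPLATE
```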

@samzhang111

Hi @AbeHandler, the NYT released an article today describing how they did something similar to extract structured data out of recipes. Maybe @cathydeng can clarify how similar parserator's engine is? Ctrl-f quadratic: http://open.blogs.nytimes.com/2015/04/09/extracting-structured-data-from-recipes-using-conditional-random-fields/?_r=0

@knowtheory

Okay, general comments:

@AbeHandler (I know we've already talked, but for the sake of having this in a public place) this is exactly the sort of thing I want to have w/ DocumentCloud. The way you're approaching this from a UX POV makes me really want to better unify document presentation & text presentation w/ the way DC currently displays docs.

If anyone ever wants to talk about Backbone practices, happy to oblige (there are things I don't like about it too, but I can at least give pointers).

Re @samzhang111 & @cathydeng's comments, @AbeHandler: the longer the input string you're training on and trying to recognize, the larger the computational space that your CRF needs to account for. CRFs, insofar as I understand their inner workings, are essentially about aligning a list of possible tags on top of a list of input tokens. So the longer the list of input tokens, the more possible alignments have to be accounted for. Ditto if there are a lot of possible tag arrangements (and combining lots of possible tag alignments with a lot of tokens could make things slow). (One of the reasons I'm so impressed with DataMade is that they've found well defined problems to which CRFs could be applied, e.g. cleaning up names and chunking addresses into expected components.)
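
To put a formula on @samzhang111's point (this is standard for linear-chain CRFs, not specific to parserator): Viterbi decoding considers every pair of adjacent tags at each position, so inference over n tokens with tag set T costs roughly O(n · |T|²):

```latex
% Viterbi recurrence for a linear-chain CRF: each of the n positions
% maximizes over all |T| predecessor tags for each of |T| current tags.
\delta_t(y) = \max_{y' \in T}\left[\delta_{t-1}(y') + \psi(y', y, x, t)\right],
\qquad \text{total cost } O(n \cdot |T|^2)
```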

For the contracts, the question is whether you can break a full document or page up into sentence-level chunks, which'll be easier to classify and align tags on top of. If there's a lot of boilerplate, and you can reliably and predictably detect and throw that junk out, just do that before you train or try to recognize the meaningful body content of each document.

Final note: Ruby4evaaaaaaaaa! (More seriously, I'm thinking about whether data trained through parserator produces data files that I could then use via CRFsuite's Ruby bindings too, so that I can integrate stuff like this into DocumentCloud and we could share trained data.)

@AbeHandler
Author

Thanks for the interest everyone. I ran Parserator overnight and checked the labels in the morning. Things are looking good!

[image: proof_of_concept]

@fgregg

fgregg commented Apr 10, 2015

That's awesome!

@fgregg

fgregg commented Apr 11, 2015

@AbeHandler think about paragraphs, as a whole, as tokens (assuming you can reliably split into paragraphs). Then paragraphs can have features like:

  • includes a number
  • includes the word 'estimate'
  • etc.

Make sense?
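
A minimal sketch of that feature scheme (feature names are illustrative):

```python
import re

def paragraph_features(paragraph):
    # Coarse paragraph-level features of the kind listed above.
    return {
        'includes_number': bool(re.search(r'\d', paragraph)),
        'includes_estimate': 'estimate' in paragraph.lower(),
        'includes_dollar': '$' in paragraph,
        'n_words': float(len(paragraph.split())),
    }
```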

@AbeHandler
Author

Hi everyone, this project is pretty far along and (if you all are interested) I'd like to put in a pull request. Would you be open to a directory /documentparserator containing the webapp that lets people run parserator on a whole DocumentCloud document? Any advice on how to proceed? https://github.com/AbeHandler/DocumentParserator

@fgregg

fgregg commented Apr 27, 2015

Hmm... I think this would probably be best as a separate repo that depended upon parserator.

@knowtheory

❤️

In an ideal case we'd integrate this into DocumentCloud. We're not there yet... but that seems like the natural place for a thing like this to live (or, failing that, outside DocumentCloud as its own app).

We've got a processing cluster lying around, so it should be feasible to do this kind of thing.

@AbeHandler
Author

@knowtheory it would be awesome if DocumentCloud supported this. Maybe a first step towards that would be the ability to tokenize a document? You could pass a regex and a DocumentCloud ID and get back the tokens? Tokenization will be a necessary step for lots of kinds of analysis, so maybe start there?
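
A sketch of what that could look like (the .txt endpoint pattern here is an assumption; the real DocumentCloud URL scheme may differ):

```python
import re
import requests

def tokenize_document(doc_id, token_regex):
    # Assumption: DocumentCloud serves a document's plain text at a
    # URL like this; check the actual API for the real endpoint.
    url = 'https://www.documentcloud.org/documents/{}.txt'.format(doc_id)
    text = requests.get(url).text
    return re.findall(token_regex, text)

tokens = tokenize_document(
    '326455-12-14-11-eustis-engineering-services-st-bernard', r'\S+')
```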

@fgregg fgregg closed this as completed Feb 12, 2018