Adapting parserator to handle an entire document #16

Closed
AbeHandler opened this issue Mar 24, 2015 · 25 comments

@AbeHandler

I am currently using Parserator to parse short strings, like this:

s of 1110 of an hour. The maximum amount to be paid under this contract is $20,000.00. No amount of work is guaranteed under this agreement; payments wil

and this

General Liability insurance will be purchased and maintained with limits of $1,000,000 per occurrence an

I extract these strings using a loose regular expression `.{75}\$[0-9]+.{75}` on documents that are usually 5 to 10 pages long. I'm most interested in tagging and categorizing the dollar values. Often, the ~150 characters of context captured by the regex are enough to categorize the dollar value. But in some cases I need input from other parts of the document to do the tagging (e.g. earlier in the document it might mention that the document is a lease).
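
For concreteness, here is roughly what that extraction step looks like (a minimal sketch; the input filename is hypothetical):

```python
import re

# ~75 characters of context on either side of a dollar amount.
# re.DOTALL lets the context windows span line breaks.
pattern = re.compile(r'.{75}\$[0-9]+.{75}', re.DOTALL)

with open('contract.txt') as f:  # hypothetical input file
    text = f.read()

snippets = pattern.findall(text)
```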

@fgregg has pointed me to this example of how you could do this with crfsuite: http://nbviewer.ipython.org/github/tpeng/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb. But I am wondering if it might be possible with the parserator wrapper: all uninteresting tokens would get a generic null label, and the interesting ones would be tagged with their proper values.
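
For reference, a minimal python-crfsuite training loop in the spirit of that notebook (the feature function, tag names, and training example here are hypothetical placeholders):

```python
import pycrfsuite

def token_features(tokens, i):
    # Toy per-token features; a real model would add surrounding context.
    token = tokens[i]
    return {
        'lower': token.lower(),
        'has_digit': any(c.isdigit() for c in token),
        'has_dollar': '$' in token,
    }

# Each training pair is (tokens, tags); uninteresting tokens carry a
# generic 'Null' tag, interesting ones a real label.
train_data = [
    (['maximum', 'amount', 'is', '$20,000.00'],
     ['Null', 'Null', 'Null', 'ContractAmount']),
]

trainer = pycrfsuite.Trainer(verbose=False)
for tokens, tags in train_data:
    xseq = [token_features(tokens, i) for i in range(len(tokens))]
    trainer.append(xseq, tags)
trainer.train('contracts.crfsuite')

tagger = pycrfsuite.Tagger()
tagger.open('contracts.crfsuite')
print(tagger.tag([token_features(['$1,000,000'], 0)]))
```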

I wanted to see what you all thought about (1) using parserator in this way and (2) adapting parserator to cover such cases. The biggest holdup to using parserator in this way is tagging documents with hundreds and hundreds of tokens. It seems like you would want a small document-annotation GUI to generate the XML to train parserator. Do you think that such a GUI should be part of the library? Do you think this would work? Would you be open to a pull request?

@fgregg

fgregg commented Mar 25, 2015

Very interested. I think the main difficulty is the interface. We probably don't have the bandwidth to build the interface, but can definitely advise. @derekeder, @cathydeng, @evz, any thoughts on this one? Flask app?

@AbeHandler
Author

I know Flask and could take on building a GUI. A GUI would also help in cases where you are tagging only 10 or 15 tokens. I found the command line tagging fairly cumbersome. I also need the GUI anyway to complete this contracts project.

@fgregg

fgregg commented Mar 25, 2015

Sounds good. Hopefully, the current command line interface is decoupled from the rest of parserator. If it isn't, we will make it so.

@AbeHandler
Author

It seems like it might be worth it to use a front-end framework for the GUI. I think the GUI will start getting somewhat complicated and we will end up wanting things like models. Does datamade have allegiances to a particular framework? Do you all have thoughts about whether that's necessary?

@fgregg

fgregg commented Mar 25, 2015

We like flask.

@AbeHandler
Author

Sounds good. And no allegiances on a front-end javascript framework? Backbone? Ember? Angular?

@derekeder
Member

@AbeHandler we've done Backbone in the past, but I kind of hate fighting it and all the complexity it introduces. We've been going 'no framework' (well, jQuery) for our recent projects, but go with what you're comfortable with.

@AbeHandler
Author

Hi everyone,

After a bit of thought, I realized that I should just modify the DocumentCloud viewer to tag a document -- rather than building a whole new UI.

I'm working on that now: https://github.com/AbeHandler/DocumentParserator

Let me know what you all think. I would like to put in a pull request down the line once the code is cleaned up and working properly.

@fgregg

fgregg commented Apr 2, 2015

Very cool! @knowtheory, you should peep this.

@AbeHandler
Author

One issue that is coming up: parserator is going quite slowly (maybe ~6 min per contract) when parsing an entire document. I can work around this -- but others might not be so patient. For now, I'm using the parallel command and working on other stuff while I wait. I can provide clearer benchmarks down the line, but I wanted to make a note of it on this thread.

@cathydeng

@AbeHandler hmm...my initial hunch is that it's taking a long time b/c a document gets broken out into way more tokens than the typical name or address. what happens if you run your contract parser on a string w/, say, 10 tokens? or on an input approximately half the length of your typical contract that takes ~6 mins?

@AbeHandler
Author

@cathydeng Yah. It runs fine on short strings. It would make sense that it is taking so long because there are so many tokens. But I can't really think of a way to cut down on the tokens for any given contract. Right?

@cathydeng

what about configuring the tokenizer to be smarter, so that tokens are less granular than a single word where applicable?

I'm not sure what your data looks like, but if there are certain sequences of words that you know belong together under the same label, it could def cut down on the number of tokens to evaluate. For example, if you have, say, the string 'one (two three four five six) seven', instead of splitting it into 7 tokens, you could configure the tokenizer to split it into 3: 'one', '(two three four five six)', 'seven'. Thoughts?
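
A rough sketch of that idea (the phrase patterns here are hypothetical): try known multi-word sequences first, and fall back to splitting on whitespace.

```python
import re

# Hypothetical multi-word sequences that should stay together as one token.
PHRASES = [
    r'\(two three four five six\)',
    r'General Liability insurance',
]

def tokenize(text):
    # Alternation is ordered, so known phrases win over single words.
    pattern = re.compile('|'.join(PHRASES) + r'|\S+')
    return pattern.findall(text)

print(tokenize('one (two three four five six) seven'))
# ['one', '(two three four five six)', 'seven']
```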

@AbeHandler
Author

That is a good idea -- particularly since there is a lot of boilerplate language in these contracts. It might be possible to take entire paragraphs as individual tokens.

@samzhang111

Eavesdropping on this thread because this project is really cool. How many tags do you have, btw? In general, CRFs are linear in tokens and quadratic in tags, right?

@AbeHandler
Author

@samzhang111 I have about 10 tags. "In general, CRFs are linear in tokens and quadratic in tags, right?" Not sure. Can you say a bit more about what you mean? I'd be interested to hear more.

@cathydeng that is a great idea about tokens. Any ideas about how I can integrate that boilerplate recognition into the tokenizer without writing a huge regex? Some of the boilerplate is 4 or 5 lines long. This has an example of the boilerplate (page 20): https://www.documentcloud.org/documents/326455-12-14-11-eustis-engineering-services-st-bernard
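
One possible approach, assuming the boilerplate passages are known ahead of time: instead of one huge regex, normalize and hash each paragraph and check it against a set of fingerprints of known boilerplate.

```python
import hashlib
import re

def fingerprint(paragraph):
    # Collapse whitespace and case so formatting noise doesn't matter.
    normalized = re.sub(r'\s+', ' ', paragraph).strip().lower()
    return hashlib.md5(normalized.encode('utf-8')).hexdigest()

# Hypothetical: fingerprints of paragraphs known to be boilerplate.
KNOWN_BOILERPLATE = {
    fingerprint('No amount of work is guaranteed under this agreement.'),
}

def is_boilerplate(paragraph):
    return fingerprint(paragraph) in KNOWN_BOILERPLATE
```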

@samzhang111

Hi @AbeHandler, the NYT released an article today describing how they did something similar to extract structured data out of recipes. Maybe @cathydeng can clarify how similar parserator's engine is? Ctrl-f quadratic: http://open.blogs.nytimes.com/2015/04/09/extracting-structured-data-from-recipes-using-conditional-random-fields/?_r=0

@knowtheory

Okay, general comments:

@AbeHandler (I know we've already talked, but for the sake of having this in a public place) this is exactly the sort of thing I want to have w/ DocumentCloud. The way you're approaching this from a UX POV makes me really want to better unify document presentation & text presentation w/ the way DC currently displays docs.

If anyone ever wants to talk about Backbone practices, happy to oblige (there are things I don't like about it too, but I can at least give pointers).

Re @samzhang111 & @cathydeng's comments, @AbeHandler: the longer the input string you're training on and trying to recognize, the larger the computational space that your CRF needs to account for. CRFs, insofar as I understand their inner workings, are essentially about aligning a list of possible tags on top of a list of input tokens. So the longer the list of input tokens, the more possible alignments have to be accounted for. Ditto if there are a lot of possible tag arrangements (and combining lots of possible tag alignments with a lot of tokens could make things slow). (One of the reasons I'm so impressed with DataMade is that they've found well defined problems to which CRFs could be applied, e.g. cleaning up names and chunking addresses into expected components.)
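
To put a formula on @samzhang111's point (this is standard for linear-chain CRFs, not specific to parserator): Viterbi decoding considers every pair of adjacent tags at each position, so inference over n tokens with tag set T costs roughly O(n · |T|²):

```latex
% Viterbi recurrence for a linear-chain CRF: each of the n positions
% maximizes over all |T| predecessor tags for each of |T| current tags.
\delta_t(y) = \max_{y' \in T}\left[\delta_{t-1}(y') + \psi(y', y, x, t)\right],
\qquad \text{total cost } O(n \cdot |T|^2)
```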

For the contracts, the question is whether you can break a full document or page up into sentence-level chunks, which'll be easier to classify and align tags on top of. If there's a lot of boilerplate, and you can reliably and predictably detect and throw that junk out, just do that before you train or try to recognize the meaningful body content of each document.

Final note: Ruby4evaaaaaaaaa! (More seriously, I'm thinking about whether data trained through parserator produces data files that I could then use via CRFsuite's Ruby bindings too, so that I can integrate stuff like this into DocumentCloud and we could share trained data.)

@AbeHandler
Author

Thanks for the interest everyone. I ran Parserator overnight and checked the labels in the morning. Things are looking good!

[image: proof_of_concept]

@fgregg

fgregg commented Apr 10, 2015

That's awesome!

@fgregg

fgregg commented Apr 11, 2015

@AbeHandler think about paragraphs, as a whole, as tokens (assuming you can reliably split into paragraphs). Then paragraphs can have features like:

  • includes a number
  • includes the word 'estimate'
  • etc.

Make sense?
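
A minimal sketch of that feature scheme (feature names are illustrative):

```python
import re

def paragraph_features(paragraph):
    # Coarse paragraph-level features of the kind listed above.
    return {
        'includes_number': bool(re.search(r'\d', paragraph)),
        'includes_estimate': 'estimate' in paragraph.lower(),
        'includes_dollar': '$' in paragraph,
        'n_words': float(len(paragraph.split())),
    }
```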

@AbeHandler
Author

Hi everyone, this project is pretty far along and (if you all are interested) I'd like to put in a pull request. Would you be open to a directory /documentparserator containing the webapp that lets people run parserator on a whole DocumentCloud document? Any advice on how to proceed? https://github.com/AbeHandler/DocumentParserator

@fgregg

fgregg commented Apr 27, 2015

Hmm... I think this would probably be best as a separate repo that depended upon parserator.

@knowtheory

❤️

In an ideal case we'd integrate this into DocumentCloud. We're not there yet... but that seems like the natural place for a thing like this to live (or, failing that, outside DocumentCloud as its own app).

We've got a processing cluster lying around, so it should be feasible to do this kind of thing.

@AbeHandler
Author

@knowtheory it would be awesome if DocumentCloud supported this. Maybe a first step towards that would be the ability to tokenize a document? You could pass a regex and a DocumentCloud ID and get back the tokens? Tokenization will be a necessary step for lots of kinds of analysis, so maybe start there?
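
A sketch of what that could look like (the .txt endpoint pattern here is an assumption; the real DocumentCloud URL scheme may differ):

```python
import re
import requests

def tokenize_document(doc_id, token_regex):
    # Assumption: DocumentCloud serves a document's plain text at a
    # URL like this; check the actual API for the real endpoint.
    url = 'https://www.documentcloud.org/documents/{}.txt'.format(doc_id)
    text = requests.get(url).text
    return re.findall(token_regex, text)

tokens = tokenize_document(
    '326455-12-14-11-eustis-engineering-services-st-bernard', r'\S+')
```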

@fgregg fgregg closed this as completed Feb 12, 2018