Adapting parserator to handle an entire document #16
Comments
Very interested. I think the main difficulty is the interface. We probably don't have the bandwidth to build the interface, but can definitely advise. @derekeder, @cathydeng, @evz, any thoughts on this one? A Flask app?
I know Flask and could take on building a GUI. A GUI would also help in cases where you are tagging only 10 or 15 tokens. I found the command line tagging fairly cumbersome. I also need the GUI anyway to complete this contracts project.
Sounds good. Hopefully, the current command line interface is decoupled from the rest of parserator. If it isn't, we will make it so.
It seems like it might be worth it to use a front-end framework for the GUI. I think the GUI will start getting somewhat complicated and we will end up wanting things like models. Does DataMade have allegiances to a particular framework? Do you all have thoughts about whether that's necessary?
We like Flask.
Sounds good. And no allegiances on a front-end JavaScript framework? Backbone? Ember? Angular?
@AbeHandler we've done Backbone in the past, but I kind of hate fighting it and all the complexity it introduces. We've been going 'no framework' (well, jQuery) for our recent projects, but go with what you're comfortable with.
Hi everyone. After a bit of thought, I realized that I should just modify the DocumentCloud viewer to tag a document, rather than building a whole new UI. I'm working on that now: https://github.com/AbeHandler/DocumentParserator. Let me know what you all think. I would like to put in a pull request down the line once the code is cleaned up and working properly.
Very cool! @knowtheory, you should peep this.
One issue that is coming up: parserator is running quite slowly (roughly 6 minutes per contract) when parsing an entire document. I can work around this, but others might not be so patient. For now, I'm using the parallel command and working on other stuff while I wait. I can provide clearer benchmarks down the line, but wanted to make a note of this on this thread.
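For reference, one way to keep working while a slow batch of documents parses is to fan the files out across processes. A minimal sketch along those lines, where contract_parser stands in for a hypothetical trained parserator module and contracts/ for a directory of extracted text:

```python
# Sketch: parse many contracts in parallel while each one still takes minutes.
# `contract_parser` is a hypothetical parserator-generated module and
# `contracts/` a hypothetical directory of extracted document text.
import glob
from concurrent.futures import ProcessPoolExecutor

import contract_parser  # hypothetical module name


def parse_file(path):
    with open(path) as f:
        # parserator-style parse() returns a list of (token, label) pairs
        return path, contract_parser.parse(f.read())


if __name__ == "__main__":
    paths = glob.glob("contracts/*.txt")
    with ProcessPoolExecutor() as pool:
        for path, tagged_tokens in pool.map(parse_file, paths):
            print(path, tagged_tokens[:5])
```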
@AbeHandler hmm... my initial hunch is that it's taking a long time because a document gets broken out into way more tokens than the typical name or address. What happens if you run your contract parser on a string with, say, 10 tokens? Or on an input approximately half the length of your typical contract that takes ~6 minutes?
@cathydeng Yah. It runs fine on short strings. It would make sense that it is taking so long because there are so many tokens. But I can't really think of a way to cut down on the tokens for any given contract. Right?
What about configuring the tokenizer to be smarter, so that tokens are less granular than a single word where applicable? I'm not sure what your data looks like, but if there are certain sequences of words that you know belong together under the same label, it could definitely cut down on the number of tokens to evaluate. For example, if you have, say, the string
That is a good idea -- particularly since there is a lot of boilerplate language in these contracts. It might be possible to take entire paragraphs as individual tokens.
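A minimal sketch of what a coarser, paragraph-level tokenize function could look like, assuming paragraphs are separated by blank lines in the extracted text. This is a proposed variant, not the library's current word-level behavior:

```python
# Sketch: split a contract into paragraph-level tokens instead of words,
# so the CRF sees far fewer sequence positions per document.
# Assumption: paragraphs in the extracted text are separated by blank lines.
import re


def tokenize(raw_string):
    paragraphs = re.split(r'\n\s*\n', raw_string)
    return [p.strip() for p in paragraphs if p.strip()]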
Eavesdropping on this thread because this project is really cool. How many tags do you have, btw? In general, CRFs are linear in tokens and quadratic in tags, right?
@samzhang111 I have about 10 tags. "In general, CRFs are linear in tokens and quadratic in tags, right?" Not sure. Can you say a bit more about what you mean? I'd be interested to hear more. @cathydeng that is a great idea about tokens. Any ideas about how I can integrate that boilerplate recognition into the tokenizer without writing a huge regex? Some of the boilerplate is 4 or 5 lines long. This has an example of the boilerplate (page 20): https://www.documentcloud.org/documents/326455-12-14-11-eustis-engineering-services-st-bernard
Hi @AbeHandler, the NYT released an article today describing how they did something similar to extract structured data out of recipes. Maybe @cathydeng can clarify how similar parserator's engine is? Ctrl-f "quadratic": http://open.blogs.nytimes.com/2015/04/09/extracting-structured-data-from-recipes-using-conditional-random-fields/?_r=0
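For context on the cost question above: inference in a linear-chain CRF is typically linear in the number of tokens and quadratic in the number of labels. A rough back-of-the-envelope comparison, where the token counts are illustrative guesses rather than measurements:

```python
# Back-of-the-envelope: Viterbi decoding for a linear-chain CRF is roughly
# O(n * L^2) -- linear in n tokens, quadratic in L labels -- so a whole
# contract costs far more per sequence than a short address string.
labels = 10                 # tag count mentioned in this thread
address_tokens = 10         # illustrative
contract_tokens = 5000      # illustrative guess for a 5-10 page contract

address_cost = address_tokens * labels ** 2
contract_cost = contract_tokens * labels ** 2
print(contract_cost / address_cost)   # ~500x more work per document
```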
Okay, general comments:

@AbeHandler (I know we've already talked, but for the sake of having this in a public place): this is exactly the sort of thing I want to have with DocumentCloud. The way you're approaching this from a UX point of view makes me really want to better unify document presentation and text presentation with the way DC currently displays docs. If anyone ever wants to talk about Backbone practices, happy to oblige (there are things I don't like about it too, but I can at least give pointers).

Re @samzhang111 and @cathydeng's comments, @AbeHandler: the longer the input string you're training on and trying to recognize, the larger the computational space your CRF needs to account for. CRFs, insofar as I understand their inner workings, are essentially about aligning a list of possible tags on top of a list of input tokens. So the longer the list of input tokens, the more possible alignments that have to be accounted for. Ditto if there are a lot of possible tag arrangements (and combining lots of possible tag alignments with a lot of tokens could make things slow). (One of the reasons I'm so impressed with DataMade is that they've found well-defined problems to which CRFs could be applied, e.g. cleaning up names and chunking addresses into expected components.)

For the contracts, the question is whether you can break a full document or page up into sentence-level chunks that will be easier to classify and align tags on top of. If there's a lot of boilerplate, and you can reliably and predictably detect and throw that junk out, just do that before you train or try to recognize the meaningful body content of each document.

Final note: Ruby4evaaaaaaaaa! (More seriously, I'm thinking about whether data trained through parserator produces data files that I could then use via CRFsuite's Ruby bindings too, so that I can integrate stuff like this into DocumentCloud and we could share trained data.)
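A minimal sketch of that pre-filtering idea, assuming the contracts can be split on blank lines and that the recurring boilerplate openings have already been identified by hand; both the snippets and the paragraph split are assumptions about data not shown in this thread:

```python
# Sketch: throw out known boilerplate paragraphs before training or parsing.
# KNOWN_BOILERPLATE holds normalized openings of paragraphs identified as junk.
import re

KNOWN_BOILERPLATE = {
    "in witness whereof the parties hereto",      # illustrative examples
    "this agreement shall be governed by",
}


def normalize(paragraph):
    return re.sub(r'\s+', ' ', paragraph.lower()).strip()


def strip_boilerplate(raw_string):
    paragraphs = re.split(r'\n\s*\n', raw_string)
    return [
        p for p in paragraphs
        if not any(normalize(p).startswith(b) for b in KNOWN_BOILERPLATE)
    ]
```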
That's awesome!
@AbeHandler think about paragraphs, as a whole, as tokens (assuming you can reliably split into paragraphs). Then paragraphs can have features like
Make sense?
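A sketch of what paragraph-level features might look like in a parserator-style tokens2features function, where each "token" is a whole paragraph. The feature names and patterns below are illustrative, not taken from an existing module:

```python
# Sketch: one feature dict per paragraph, in the shape parserator modules use
# (a list of feature dicts, one per sequence position).
import re


def paragraph_features(paragraph):
    words = paragraph.split()
    return {
        'length': len(words),
        'has.dollar.amount': bool(re.search(r'\$[0-9][0-9,]*(\.[0-9]{2})?', paragraph)),
        'has.year': bool(re.search(r'\b(19|20)\d{2}\b', paragraph)),
        'first.word': words[0].lower() if words else '',
        'all.caps': paragraph.isupper(),
    }


def tokens2features(paragraphs):
    return [paragraph_features(p) for p in paragraphs]
```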
Hi everyone, this project is pretty far along and (if you all are interested) I'd like to put in a pull request. Would you be open to a directory /documentparserator that contains the webapp allowing people to run parserator on a whole DocumentCloud document? Any advice on how to proceed? https://github.com/AbeHandler/DocumentParserator
Hmm.. I think this would probably be best as a separate repo that depended on parserator.
❤️ In an ideal case we'd integrate this into DocumentCloud. We're not there yet... but that seems like the natural place for a thing like this to live (or outside the app as its own app). We've got a processing cluster laying around so... like it should be feasible to do this kind of thing. |
@knowtheory it would be awesome if DocumentCloud supported this. Maybe a first step towards that would be the ability to tokenize a document? You could pass a regex and a doc cloud id and get back the tokens? Tokenization will be a necessary step for lots of kinds of analysis, so maybe start there? |
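A very rough sketch of what that tokenization endpoint could look like as a small Flask route. The route name, the payload shape, and the idea of posting the document text directly (rather than fetching it by DocumentCloud id) are all assumptions for illustration:

```python
# Sketch: POST a regex and raw document text, get back the matching tokens.
import re

from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route('/tokenize', methods=['POST'])
def tokenize():
    pattern = request.json['regex']     # hypothetical payload fields
    text = request.json['text']
    tokens = [m.group(0) for m in re.finditer(pattern, text, re.DOTALL)]
    return jsonify(tokens=tokens)


if __name__ == '__main__':
    app.run(debug=True)
```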
I am currently using Parserator to parse short strings, like this:
and this
I extract these strings using a loose regular expression ".{75}$[0-9]+.{75}" on documents that are usually 5 to 10 pages long. I'm most interested in tagging and categorizing the dollar values. Often, the 100 characters around the string are enough to categorize the dollar value. But in some cases I need input from other parts of the document to do the tagging (e.g., earlier in the document it might mention that the document is a lease).
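As an illustration of that windowed extraction, a sketch using Python's re module. Note the dollar sign is escaped here so it matches literally rather than anchoring a line; the window sizes and flags are assumptions, not the exact expression quoted above:

```python
# Sketch: grab up to 75 characters of context on either side of each dollar
# amount in a document's extracted text.
import re

DOLLAR_CONTEXT = re.compile(r'.{0,75}\$[0-9][0-9,]*(?:\.[0-9]{2})?.{0,75}', re.DOTALL)


def extract_dollar_windows(document_text):
    return [m.group(0) for m in DOLLAR_CONTEXT.finditer(document_text)]
```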
@fgregg has pointed me here to show how you could do this with crfsuite http://nbviewer.ipython.org/github/tpeng/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb but I am wondering if it might be possible with the parserator wrapper. All uninteresting tokens would be tagged as and the interesting ones would be tagged with their proper values.
I wanted to see what you all thought about (1) using parserator in this way and (2) adapting parserator to cover such cases. The biggest hold-up to using parserator in this way is tagging documents with hundreds and hundreds of tokens. It seems like you would want a small document annotation GUI to generate the XML to train parserator. Do you think that such a GUI should be part of the library? Do you think this would work? Would you be open to a pull request?
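For reference, parserator trains from labeled XML in which each token is wrapped in its label, which is what such a GUI would ultimately emit. A sketch of what labeled contract data might look like; the wrapper element and tag names here are purely illustrative (the real ones come from a module's label configuration):

```xml
<Collection>
  <TokenSequence>
    <ContractorName>Eustis</ContractorName>
    <ContractorName>Engineering</ContractorName>
    <Other>shall</Other>
    <Other>be</Other>
    <Other>paid</Other>
    <Amount>$12,500.00</Amount>
  </TokenSequence>
</Collection>
```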