Michael Schmitz edited this page Nov 21, 2013 · 10 revisions

Goal

To provide an easy-to-use, well-engineered NLP stack that will enable researchers to engage in higher-level research.

Open Source (e.g. Apache 2).

Benefits:

  1. Collaboration from companies as well as institutions. In my experience, engineering collaboration with institutions is usually quite weak.
  2. Can be used on grants/programs with license restrictions. For example, the IARPA grant at the UW required a very permissive license (GPL was not acceptable).
  3. Compatible with existing university projects (e.g. Stanford CoreNLP, ClearNLP). A research-only license would mean that Stanford could not incorporate our work into their software (a research-only license conflicts with the GPL).
  4. It limits monetization of components, but monetization may not be our goal. Even if it is, an open-source base system would help gain attention and allow us to market higher-level applications. Many organizations take this approach, providing their basic software for free (GitHub, Typesafe, Travis).

Competition.

There are a growing number of NLP stacks.

  • OpenNLP. The models are not intended to work out of the box, and the tools are not thread-safe.
  • Stanford.
  • ClearNLP. The tools make large compromises to showcase research contributions (e.g. 6 GB of memory for a 0.5% gain in F-measure).
  • Breeze. Breeze is being split up; Chalk is the NLP portion. The number of covered tools is quite limited.
  • ClearTK. Not a "Swiss Army knife" but a UIMA solution.
  • Gate.
  • Factorie. Provides some basic NLP tools, but is mostly focused on providing a DSL for probabilistic modeling.

What would our NLP stack include?

  1. POS tagger for web text with open-license model annotations (OpenNLP).
  2. Chunker (shallow parser) for web text with open-license model annotations (OpenNLP).
  3. Taggers platform for quickly writing extractors.
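To make the intended composition concrete, here is a minimal sketch of how the components above might chain together: a POS tagger feeding a chunker. All names here (`PostaggedToken`, `pos_tag`, `chunk`) and the stub logic are illustrative assumptions, not the real stack's API; an actual implementation would load openly licensed models rather than hard-code tags.

```python
# Hypothetical sketch of the stack's pipeline: tokens -> POS tags -> chunks.
# Every name and rule below is illustrative, not a real API.

class PostaggedToken:
    """A token paired with its part-of-speech tag."""
    def __init__(self, token, tag):
        self.token = token
        self.tag = tag

    def __repr__(self):
        return f"{self.token}/{self.tag}"

def pos_tag(tokens):
    """Stub POS tagger: a real one would consult a trained, open-license model."""
    lexicon = {"the": "DT", "dog": "NN", "runs": "VBZ"}
    return [PostaggedToken(t, lexicon.get(t, "NN")) for t in tokens]

def chunk(tagged):
    """Stub chunker: groups determiner + noun sequences into NP chunks."""
    chunks, current = [], []
    for tok in tagged:
        if tok.tag in ("DT", "NN"):
            current.append(tok)
        else:
            if current:
                chunks.append(("NP", current))
                current = []
            chunks.append((tok.tag, [tok]))
    if current:
        chunks.append(("NP", current))
    return chunks

tagged = pos_tag("the dog runs".split())
chunks = chunk(tagged)
```

The point of the sketch is the composition: each tool consumes the previous tool's output, so higher-level tools (e.g. extractors built on the taggers platform) can be layered on without re-implementing the lower levels.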

Machine Learning / Distributed Computing

  1. GraphLab
  2. Spark
  3. Boom