Projects

✓ Fix multithreaded Joshua: sentences should be placed in a queue that worker threads pop from; each thread deposits its result somewhere, and the deposits are then assembled in sentence order. Currently on the fix_threads branch.
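
A minimal sketch of the queue-and-assemble idea, not the code on the fix_threads branch. It assumes a fixed thread pool whose internal work queue plays the role of the sentence queue, with each `Future` acting as the "deposit" for one sentence. `OrderedParallelDecoder`, `decodeAll`, and `decodeOne` are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of Joshua.
class OrderedParallelDecoder {

    // Translates every sentence on a pool of worker threads and returns
    // the results in the same order as the input sentences.
    static List<String> decodeAll(List<String> sentences, int numThreads,
                                  Function<String, String> decodeOne)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            // Enqueue one task per sentence; idle threads pop tasks
            // from the pool's internal queue.
            List<Future<String>> pending = new ArrayList<>();
            for (String sentence : sentences) {
                Callable<String> task = () -> decodeOne.apply(sentence);
                pending.add(pool.submit(task));
            }
            // Assemble sequentially: waiting on the futures in submission
            // order preserves sentence order even if tasks finish out of order.
            List<String> output = new ArrayList<>();
            for (Future<String> result : pending) {
                output.add(result.get());
            }
            return output;
        } finally {
            pool.shutdown();
        }
    }
}
```

Collecting futures in submission order avoids a separate reordering step; the cost is that all finished translations stay in memory until the earlier ones complete.
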
✓ Clean up the input handling routines (HackishSegmentParser, SAXSegmentParser, PlainSegmentParser)
✓ Configuration parameters should be overridable from the command line. This is especially true of runtime-related parameters such as the number of threads.
- ✓ Rudimentary support has been added for -threads...
- ✓ ...but it should be rewritten in a more general fashion: (1) load the configuration file, then (2) process command line arguments and let anything be overridden.
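
The two-step scheme above (load the configuration file, then let command-line arguments override anything) could look roughly like the following sketch. It assumes a `key = value` file format with `#` comments and `-key value` argument pairs; both are illustrative assumptions, as are the class `ConfigOverrides`, its method names, and the `top_n` key. Only `-threads` comes from the item above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Joshua.
class ConfigOverrides {

    // Step 1: read "key = value" lines, skipping blanks and '#' comments.
    static Map<String, String> parseConfig(Iterable<String> lines) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) continue;
            int eq = trimmed.indexOf('=');
            if (eq < 0) continue; // not a key/value line
            params.put(trimmed.substring(0, eq).trim(),
                       trimmed.substring(eq + 1).trim());
        }
        return params;
    }

    // Step 2: apply "-key value" pairs on top, so the command line wins.
    static void applyArgs(Map<String, String> params, String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("arguments must be -key value pairs");
        }
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].startsWith("-")) {
                throw new IllegalArgumentException("expected -key, got: " + args[i]);
            }
            params.put(args[i].substring(1), args[i + 1]);
        }
    }
}
```

Because every parameter goes through the same map, no option needs special-case handling the way a hard-coded -threads check does.
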
Fix KenLM integration
- KenLM typically scores between 0.5 and 1.0 BLEU points lower than SRILM when using the same model
- fix the vocabulary mapping
- use a proper UNK
- get rid of SRILM
Pruning: we should be able to prune simply by specifying a pop-limit on cube growing
OOVs: better handling of OOV words.
- Allow users to specify the behavior for OOVs (pass through as-is, delete, transliterate)
- Integrate a transliteration module into the code
- Include instructions on how to give OOVs a good non-terminal
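
One way the user-selectable OOV behavior might be modeled is sketched below. The three policies mirror the options listed above (pass through as-is, delete, transliterate); `OovPolicy`, `OovHandler`, and the pluggable `transliterator` function are hypothetical names, and the transliteration module itself is left as a stand-in since none exists yet.

```java
import java.util.function.UnaryOperator;

// Hypothetical names; Joshua does not define these types.
enum OovPolicy { PASS_THROUGH, DELETE, TRANSLITERATE }

class OovHandler {
    private final OovPolicy policy;
    // Stand-in for a future transliteration module.
    private final UnaryOperator<String> transliterator;

    OovHandler(OovPolicy policy, UnaryOperator<String> transliterator) {
        this.policy = policy;
        this.transliterator = transliterator;
    }

    // Returns the target-side text for an OOV source word.
    // An empty string means the word is dropped from the output.
    String handle(String word) {
        switch (policy) {
            case DELETE:
                return "";
            case TRANSLITERATE:
                return transliterator.apply(word);
            case PASS_THROUGH:
            default:
                return word;
        }
    }
}
```
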
Integrate Thrax into the Joshua codebase
- If possible, re-use portions of the code that should be shared, like a Rule class.
Implement PRO for parameter tuning
- Make it modular so that any evaluation metric can be used, similar to our Z-MERT implementation
Data distributed with code
- Look over the example folders and see if any of them are worth keeping
- Include some good sample data with the distribution that people can use for an initial run of the system
Change LICENSE to BSD
- Apply the change across all of the files
General code clean-up
- Delete packages that are no longer used
  - all of the suffix-array-based grammar extraction code
  - aligner
  - prefix_tree
  - bloomfilter_lm
  - distributed_lm
  - buildin_lm
- Fold together redundant code
  - Should the lattice package be at joshua.lattice or somewhere else?
  - Where should oracle be located? Not the top-level, presumably.
- Better high-level organization of code
  - Rename the packages to have functional names? Decode, Tune, Preprocess?
Subsampling
- Experiment with the subsampler to make sure it doesn't change translation performance too much
- Reuse the joshua.corpus classes instead of having redundant ones for the subsampler.
