-
clean up the chart code
- ✓ control pruning with a cell-level cube pruning pop limit instead of the various thresholds
- ✓ clean up code in CubePruneCombiner.java
- ✓ cell signatures on CubePruneState should be something quicker to compute than a string (more could be done on this; in particular, it hasn't been implemented for beam and threshold pruning)
- Hash cells instead of creating a complete 2d grid in Chart.java
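The cell-hashing item above could look roughly like the following. This is a minimal sketch, not the code in Chart.java: `SparseChart`, `Cell`, and the key-packing scheme are hypothetical names chosen for illustration. Cells are created lazily and keyed by their span, so spans that never receive an item take up no memory.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a chart cell; a real cell would hold hypergraph nodes.
class Cell {
    final int start;
    final int end;

    Cell(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

// Sparse chart: cells live in a hash map keyed by span instead of a full 2D grid.
class SparseChart {
    private final Map<Long, Cell> cells = new HashMap<>();

    // Pack the span (i, j) into one long so it can serve as a map key.
    private static long key(int i, int j) {
        return ((long) i << 32) | (j & 0xffffffffL);
    }

    // Return the cell for span (i, j), creating it on first use.
    Cell getOrCreate(int i, int j) {
        return cells.computeIfAbsent(key(i, j), k -> new Cell(i, j));
    }

    // Return the cell for span (i, j), or null if nothing was ever stored there.
    Cell get(int i, int j) {
        return cells.get(key(i, j));
    }

    // Number of cells actually allocated.
    int size() {
        return cells.size();
    }
}
```

With a full grid, a sentence of length n allocates on the order of n² cells up front; here only the spans that are touched are allocated.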
-
clean up joshua's logging output
-
simplify joshua invocation
-
OOVs: better handling of OOV words.
- Allow users to specify the behavior for OOVs (pass through as-is, delete, transliterate)
- Integrate a transliteration module into the code
- Include instructions on how to give OOVs a good non-terminal
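One way the user-selectable OOV behavior could be structured is sketched below. Everything here is an assumption made for illustration: `OovPolicy`, `OovHandler`, and the `transliterator` hook are hypothetical names, and the transliteration module itself is only a placeholder function since none has been integrated yet.

```java
import java.util.function.UnaryOperator;

// Hypothetical policy names mirroring the three behaviors listed above.
enum OovPolicy { PASS_THROUGH, DELETE, TRANSLITERATE }

class OovHandler {
    private final OovPolicy policy;
    // Placeholder for a real transliteration module (an assumption of this sketch).
    private final UnaryOperator<String> transliterator;

    OovHandler(OovPolicy policy, UnaryOperator<String> transliterator) {
        this.policy = policy;
        this.transliterator = transliterator;
    }

    // Return the target-side string for an unknown source word ("" means deleted).
    String handle(String word) {
        switch (policy) {
            case DELETE:
                return "";
            case TRANSLITERATE:
                return transliterator.apply(word);
            case PASS_THROUGH:
            default:
                return word;
        }
    }
}
```

Keeping the policy in one place means a new behavior can be added without touching the code that detects OOVs.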
-
Integrate Thrax into the Joshua codebase
- If possible, re-use portions of the code that should be shared, like a Rule class.
-
Implement PRO for parameter tuning
- Make it modular so that any evaluation metric can be used, as in our Z-MERT implementation
-
Data distributed with code
- Look over the example folders and see if any of them are worth keeping
- Include some good sample data with the distribution that people can use for a first run of the system
-
Change LICENSE to BSD
- Fix across all of the files
-
General code clean-up
- Delete packages that are no longer used
- all of the suffix-array-based grammar extraction code
- aligner
- prefix_tree
- bloomfilter_lm
- ✓ distributed_lm
- buildin_lm
- Fold together redundant code
- Should the lattice package be at joshua.lattice or somewhere else?
- Where should oracle be located? Not the top-level, presumably.
- Better high-level organization of code
- Rename the packages to have functional names? Decode, Tune, Preprocess?
-
Subsampling
- Experiment with the subsampler to make sure it doesn't change translation performance too much
- Reuse the joshua.corpus classes instead of having redundant ones for the subsampler.
-
✓ Fix multithreaded Joshua: sentences should be placed in a queue that worker threads pop from; each thread deposits its output somewhere, and the deposits are then assembled in sentence order. Currently on the fix_threads branch.
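The queue-and-deposit design described above can be sketched as follows. This is an illustration under stated assumptions, not the code on the fix_threads branch: `ParallelTranslator`, `translateAll`, and the `translate` function standing in for the decoder are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.UnaryOperator;

class ParallelTranslator {
    // Translate all sentences with numThreads workers; the output keeps input order.
    static List<String> translateAll(List<String> sentences, int numThreads,
                                     UnaryOperator<String> translate) {
        // Queue of sentence indices that worker threads pop from.
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < sentences.size(); i++) {
            queue.add(i);
        }

        // Each worker deposits its result in the slot of the sentence it popped.
        String[] results = new String[sentences.size()];

        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < numThreads; t++) {
            Thread worker = new Thread(() -> {
                Integer id;
                // poll() returns null once the queue is drained, which ends the worker.
                while ((id = queue.poll()) != null) {
                    results[id] = translate.apply(sentences.get(id));
                }
            });
            workers.add(worker);
            worker.start();
        }

        try {
            // join() waits for each worker and makes its writes visible here.
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("interrupted while waiting for workers", e);
        }

        // Assemble the deposits sequentially, in input order.
        return new ArrayList<>(Arrays.asList(results));
    }
}
```

Because results are indexed by sentence number, the output order does not depend on which thread finishes first.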
-
✓ Clean up the input handling routines (HackishSegmentParser, SAXSegmentParser, PlainSegmentParser)
-
✓ Configuration parameters should be overridable from the command line. This is especially true of runtime-related parameters such as the number of threads.
- ✓ Rudimentary support has been added for -threads...
- ✓ ...but it should be rewritten in a more general fashion: (1) load the configuration file, then (2) process command line arguments and let anything be overridden.
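The two-step scheme above (load the configuration file, then let command-line arguments override anything) might be sketched like this. The `key = value` file format, the strict `-key value` argument pairing, the `ConfigLoader` class, and the assumption that the config key shares its name with the `-threads` flag are all simplifications made for this example, not a description of Joshua's actual parser.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ConfigLoader {
    // Step 1: parse "key = value" lines; blank lines and '#' comments are skipped.
    static Map<String, String> parseConfig(List<String> lines) {
        Map<String, String> config = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            int eq = trimmed.indexOf('=');
            if (eq < 0) {
                continue; // not a key=value line; ignored in this sketch
            }
            config.put(trimmed.substring(0, eq).trim(),
                       trimmed.substring(eq + 1).trim());
        }
        return config;
    }

    // Step 2: apply "-key value" pairs from the command line on top of the config.
    // Assumes arguments come strictly in pairs; the input map is left unchanged.
    static Map<String, String> applyOverrides(Map<String, String> config, String[] args) {
        Map<String, String> merged = new LinkedHashMap<>(config);
        for (int i = 0; i + 1 < args.length; i += 2) {
            if (args[i].startsWith("-")) {
                merged.put(args[i].substring(1), args[i + 1]);
            }
        }
        return merged;
    }
}
```

Since the override step is generic, no parameter needs special-case handling the way -threads currently does.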
-
✓ Fix KenLM integration
- ✓ KenLM typically scores between 0.5 and 1.0 BLEU points lower than SRILM using the same model
- ✓ fix the vocabulary mapping
- ✓ use a proper UNK
-
✓ get rid of SRILM