diff --git a/NEWS b/NEWS
index 388c3875c..210f56af5 100644
--- a/NEWS
+++ b/NEWS
@@ -3,6 +3,12 @@
 This file lists noteworthy changes between releases, for full list of changes,
 see git log and then `ChangeLog.old`.
 
+## Significant changes in 20190511
+
+* Universal dependencies version 2.4 is a reference for recall tests
+* 2879 new words
+* No other big changes and no API changes
+
 ## Significant changes in 20180111
 
 * Universal dependencies version 2.3 is a reference for recall tests
diff --git a/NEWS.markdown b/NEWS.markdown
deleted file mode 100644
index 058e2c1e2..000000000
--- a/NEWS.markdown
+++ /dev/null
@@ -1,244 +0,0 @@
-# NEWS
-
-This file lists noteworthy changes between releases, for full list of changes,
-see git log and then `ChangeLog.old`.
-
-## Significant changes in 20180511
-
-* Universal dependencies version 2.2 is now used as target
-* At least 226 new words: 239 additions and 13 deletions in lexeme database
-* Most changes are in development infra, so not visible to end users...
-* Started rewriting CG from the scratch
-* The APIs for programming language deprecate load(filename) and load(dir) forms
- of filename guessing functions in favour of forthcoming loadAnalyser(file),
- loadLemmatiser(file), loadUDPipe(file) etc. etc. functions
-* Working towards more general tokenise-analyse-disambiguate pipelines maybe,
- or just refactoring
-* lots more automated tests -> lots less human errors
-* By popular request: there are two analysers now, one with small dictionary
- and one with full, use the smaller one when you do not want to see birds or
- languages or tribes analysed. The smaller one replaces the old default, but
- the new tools will require you to select one explicitly anyways
-* fixes and workarounds: java and c++ can now be disabled partially or totally
-* adopted SG0 as possible verb form analysis from UD data
-* The end users are now provided with bash-scripts wrappers for all
- functionalities, whereas the typically python versions allow more control
- of parametres
-
-
-## Significant changes in 20170515
-
-* Universal Dependencies version 2 is now used, still mainly lemma, UPOS,
- features fields are analysed
-* At least 2,336 new words (based on diffstat: 38886 additions, 3655 deletions)
-* Preliminary support for various guessing models: python-based, finite-state
- and UDPipe. This means that it is possibly to get analyses for all tokens,
- albeit quality of guesses varies.
-* A minimal C++ library version has been made to match java and python bindings.
- C++-11 and libhfst are required.
-* The dix version can now be compiled with lttoolbox with a lot of memory
-* A restricted "gold" dictionary mode has been added. This is good for both end
- users with limited memory and end users who require higher quality lexemes
- (i.e., only research institute approved, no wiktionary words or other weird
- stuffs)
-* Documentations and automatic testing much reworked with the new modern toys
- from github: travis-ci, jekyll
-* Started weeding the ADP/ADV jungle...
-* Fixed a horrible bug in the corpus coverage testing that terribly
- under-estimated our coverage for corpora where hapax legomena etc. were
- ignored
-* Lot of documentation has been semi-automated, therefore many changes can be
- viewed at the new gh-pages site: https://flammie.github.io/omorfi/
-
-## Significant changes in 20161115
-
-* Started drafting more blacklists and *known good* lexemes subsets for people
- who struggle with rare words and productive compounding, derivation
-* Updated to Universal Dependencies version 1.4
-* A lot of new derivations by the way
-* Preliminary guessers
-* More loopy guessery things for punctuation and digit combos
-* Minor fixes to UD feature sorting
-* Homonym numbers used in some applications
-* Added timeouts where downstream tools support them, so tools don't seem like
- they are freezing at random
-* moved old documentations to github-pages
-* added preliminary hfst-pmatch-based tokeniser
-
-## Significant changes in 20160515
-
-* Universal Dependencies for Finnish is the new standard format we now follow:
- * POS is now UPOS and classes were changed accordingly (new classes: AUX,
- PROPN, DET, CONJ, SCONJ, PUNCT, SYM, and VERB, NOUN, ADP, ADV as before)
- * other features mostly match the feature field in UD documentation
- * release cycle aims to be same six month cycle as with UD
- * the automatic tests verify compatibility with UD; 92 % of lemmas, primary
- POS tags and morphological features are the same as Finnish UD corpus,
- 75 % same as Finnish FTB UD corpus
- * analyser for reading and writing CONLL-U format
-* tokenisation as script and more hacks to token stripping in corner cases
-* continuous integration with travis-ci, currently only testing basic script
- programming conventions
-* added a lot of high coverage words and forms by hand
-* by popular request, some of the words can now be blacklisted, when you don't
- want that guy named Mutta to ambiguate your conjunction analyses or the odd
- new guinean bird to clash with some common verb
-* the "database" is now only keyed on lemma + homonym number; paradigm is extra
- information like anything else
-* a lot of work on morphological segmentation towards statistical machine
- translation; check proceedings of WMT shared tasks 2015 and 2016 to see why
-* started refactoring some python code into classes
-
-## Significant changes in 20150904
-
-* allomorphy can be tagged again to distinguish e.g. *-iden* and *-itten* when
- generating
-* FinnTreeBank-1 format provided by Miikka Silfverberg is available but not
- built by default since it lacks a test set
-* lexicalised inflections can have separate tag, e.g. *kännissä* can be lexical
- inessive distinguished from regular inessive
-* preliminary VISL CG-3 support, with original grammar by Fred Karlsson;
- convenience bash scripts available for disambiguated parsing
-* preliminary support for conllu and conllx analysis formats
-* paradigm categorisation is now verified by regular expressions
-* lots of paradigm fixes and some added words
-
-## Significant changes in 20150326
-
-* speed is up to >20,000 tokens per second from ~500
-* coverages are up to:
- europarl (99 %) gutenberg (97 %), JRC Acquis (94 %) and fiwiki (93 %)
-* moses factored model format supported
-* segmentation supported
-* Java API
-* Python hacks packaged to API and module
-* Rest of hand-written Xerox legacy data removed; all is script-generated
-* github migration since google code is EOL'd
-* file naming for automata changed to include omorfi prefix for all file
- names in case they are distributed separately.
-
-## Significant changes in 20141014
-
-* The regressions are also set on coverage over popular corpora:
- Europarl (98 %), FTB 3.1 (97 %), gutenberg (96 %), JRC Acquis (93 %) and
- fiwiki (90 %)
-* sti derivation tentatively added
-* number of new paradigms and paradigm moves, esp. in old and archaic styles
-* some new words manually added
-* apertium formats updated totally
-* interjection chaining
-* rest of hand-written lexc removed: everything in db and python code now
-* more strict building and testing altogether (no more dangling references or
- missing tags allowed)
-* morphological segmentation should be usable now
-* lots of other classifications and attributes added
-
-## Significant changes in 20130829
-
-* Default tag format is now FTB3.1. Recall is 90 % and the format is stable and
- easy to read by humans, which is now the main target for computational
- morphologies.
-* The omor tagsets are now permanently unstable and subject to change any day.
- To use them, python scripts have been provided.
-* Lots of proper nouns and semantics from Uni Hel projects
-* speller build support for new voikko versions
-* New regression tests for stuffs
-* Most of legacy lexc sources removed; they are now generated from TSV
- "databases".
-* The morphological classes now follow 3 main classes with some subclasses that
- are less morphological
-* Twol rules and flag diacritics have been eliminated
-* Lots of support scripts to verify and extend classifications
-* Lots of new word-forms, inflections and changes to derivations
-* Some python support scripts for omor formats
-
-## Significant changes in 20121226
-
-* Added fi.wiktionary.org as lexical source (much thanks to students of my unix
- tools course for scripting)
-* Added first batch of new proper nouns from a project in Univ. Helsinki
-* Lexc data is now rebuild from lexical sources as standard processing;
- * requiring python3
-* Minor bug fixes to man pages, special boundaries (e.g. in arkki_tehti)
-
-## Significant changes in 20120401
-
-* Fixed some twol rules w.r.t. new features that blocked compiling
-* Autogenerate lexicons from csv data all the time
-* Moved to git and googlecode -> chopped most of the documentation and such
-* Fixed scripts a bit, added man pages
-* Made very crude tests to have at least something back in.
-
-## Significant changes in 20110505
-
-* whole new finntreebank tagset for forthcoming finntreebank work
-* uppercasing is noted in the analysis level
-* the word boundaries of lexicalised compounds may be available for more cases
- (depending on the tagset)
-* whole new lemmatizer tagset is available
-* some dozens of new words added and fixed
-* combine corpus analysis script with apertium's preprocessors
-* causative derivation chain added
-* bbreviations, adpositions, prefixes and suffixes are no longer pos but subcat
- analyses
-
-
-## Significant changes since 20100401
-
-* Include deverbal nouns in compounding system
-
-* Start marking compound and strong morpheme boundaries
-
-* New lexical data handling systems
-
-* Implement generator from analyser
-
-* Subcategorize lots of classes for CG and apertium
-
-* Write documentation in booklet format
-
-* New URI and digit string guessers
-
-* New tagging style colorterm for interactive use
-
-* Include weighting scheme in default build
-
-* Demote SUFFIX from POS reading to SUBCAT
-
-## Significant changes since 20100111
-
-* Added marginal enclitics kA, kAs
-
-* Added LEMMA= structure
-
-* re-organized source code to modules
-
-* Added tagging schemes, weighting schemes and suggestion algorithms
-
-## Significant changes since 0.5
-
-* completely new morphology built on traditional lexc-twolc model
-
-* easier route to add new lexical data via simple CSV format
-
-* lots of new lexical data from Joukahainen project as well as extended
- from kotus-sanalista semi-automatically and *by hand*.
-
-* titlecasing filter for regular words
-
-* š filter for old orthography variants
-
-* compounding much less haphazard concoction
-
-* parts of speech classified and included
-
-* pronouns, interjections, numerals, proper nouns
-
-* much closer to real full fledged morphology
-
-* movement from SFST to HFST toolset with lots of new cool toys (SFST support
- is retained in HFST)
-
-* towards full-scale automatic test suite
-
-
diff --git a/docs/NEWS.markdown b/docs/NEWS.markdown
new file mode 120000
index 000000000..0fae0f802
--- /dev/null
+++ b/docs/NEWS.markdown
@@ -0,0 +1 @@
+../NEWS
\ No newline at end of file
diff --git a/lexemes.markdown b/docs/lexemes.markdown
similarity index 100%
rename from lexemes.markdown
rename to docs/lexemes.markdown
diff --git a/stuff.markdown b/docs/stuff.markdown
similarity index 100%
rename from stuff.markdown
rename to docs/stuff.markdown