-
Notifications
You must be signed in to change notification settings - Fork 44
/
CHANGES.txt
627 lines (389 loc) · 22 KB
/
CHANGES.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
Morfologik, Change Log
======================
For an up-to-date CHANGES file see
https://github.com/morfologik/morfologik-stemming/blob/master/CHANGES
======================= morfologik-stemming 2.1.9 =======================
Other Changes
* PR #114: improve run-on suggestions for camel case words (Jaume Ortolà)
======================= morfologik-stemming 2.1.8 =======================
Other Changes
* GH-112: Add automatic module name to all JARs.
* Upgrade selected build dependencies.
======================= morfologik-stemming 2.1.7 =======================
Bug Fixes
* PR #103: fix distance value in the result of `Speller.findReplacementCandidates`
(Daniel Naber).
* GH-102: upgrade jcommander to newest version. (Dawid Weiss)
Other Changes
* PR #103: introduce `Speller.replaceRunOnWordCandidates()` which returns
`CandidateData` (Daniel Naber).
======================= morfologik-stemming 2.1.6 =======================
Other Changes
* PR #101: fix replaceRunOnWords() not working for words that are uppercase at
sentence start (Daniel Naber).
======================= morfologik-stemming 2.1.5 =======================
Bug Fixes
* PR #96: incorrect logic in runOnWords (Jaume Ortolà).
* PR #97: micro performance optimization (Daniel Naber).
Other Changes
* GH-95: Speller: findReplacementCandidates returns full CandidateData. This
commit also refactors the Speller to use a stateless returned array
list rather than reuse an internal field. Should not make a
practical difference. (Dawid Weiss)
======================= morfologik-stemming 2.1.4 =======================
Bug Fixes
* PR #93: Case-changed words are always good suggestions (Jaume Ortolà).
* GH-92: FSATraversal may return NOT_FOUND instead of AUTOMATON_HAS_PREFIX
(stevendolg via Dawid Weiss)
Other Changes
* Updated build and test plugins to newer versions.
======================= morfologik-stemming 2.1.3 =======================
Bug Fixes
* GH-86: Speller: words containing the dictionary separator are not handled
properly (Jaume Ortolà via Dawid Weiss).
======================= morfologik-stemming 2.1.2 =======================
Bug Fixes
* GH-85: Encoded sequences can clash with separator byte and cause assertion
errors. (Daniel Naber, Dawid Weiss).
======================= morfologik-stemming 2.1.1 =======================
Bug Fixes
* PR #78: Fix dependency issue in morfologik-speller (Alden Quimby).
* GH-84: Dictionary resources not found with security manager.
(Uwe Schindler)
Other Changes
* GH-79: Corrected a corner case in DictCompileTest. (Dawid Weiss)
* GH-77: Trailing spaces in encoder name can lead to illegal argument exception.
(Jaume Ortolà, Dawid Weiss)
======================= morfologik-stemming 2.1.0 =======================
New Features
* GH-74: Add dict_apply tool to apply a dictionary to a file or stdin.
(Dawid Weiss)
* GH-73: Update Polish stemming dictionaries to polimorfologik 2.1. (Dawid Weiss)
Bug Fixes
* GH-76: Consolidate and fix character encoding and decoding. (Dawid Weiss)
Other Changes
* GH-63: BufferUtils.ensureCapacity now clears the input buffer. This also
affects WordData methods that accept a reusable byte buffer -- it is now
always cleared prior to being flipped and returned. (Dawid Weiss)
======================= morfologik-stemming 2.0.2 =======================
Bug Fixes
* GH-68: WordData.clone() should be public. (Dawid Weiss)
Other Changes
* GH-64: reverted back OSGi annotations (bundle packaging). (Dawid Weiss)
* GH-72: Rename tools: fsa_dump to fsa_decompile and fsa_build to fsa_compile.
Existing names remain as aliases but will be removed in 2.1.0. (Dawid Weiss)
======================= morfologik-stemming 2.0.1 =======================
Bug Fixes
* GH-65: Dictionary.read(URL) ends in NPE when reading from a JAR resource
(Dawid Weiss)
======================= morfologik-stemming 2.0.0 =======================
This release comes with a cleanup of the API for Java 1.7. There are
several aspects of the code that have been dropped (or added):
- NIO is used extensively, mostly for better error reporting.
- There is a simplified lookup of resources, no class-relative loading
of dictionaries for example. The caller is in charge of looking
up either an URL to the dictionary or providing an InputStream to it.
- Removed internal caching of dictionaries from Dictionary. The
Polish stemmer is initialized lazily and reuses its dictionary
internally.
- Numerous minor tweaks of parameters. JavaDocs.
- A complete rewrite of the tools to compile (and decompile) FSA automata
and complete stemming dictionaries. The tools now assert the validity
of input data files and ensure no corrupt dictionaries can be produced.
Changes in backwards compatibility policy
* GH-64: Removed OSGi support because of Maven issues (forks build
phases, tests, etc.).
* GH-62: Recompress Polish dictionary to use ';' as the separator.
(Dawid Weiss)
* GH-59: Moved Dictionary.convertText utility to
DictionaryLookup.applyReplacements and fixed current reliance on map
ordering. (Dawid Weiss)
* GH-55: Removed the "distribution" module entirely. The tools module
should be self-organizing. Complete overhaul of all the tools.
Examples. Simplified syntax, options and assumptions.
Input sanity checks and validation. (Dawid Weiss)
* GH-57: Restructured the project into FSA traversal/ reading (only)
and FSA Builders (construction). This cleans up dependency
structure as well (HPPC is not required for FSA traversals).
(Dawid Weiss)
* GH-54: Make Java 1.7 the minimum required version. Certain methods
that relied on File as arguments have been removed or changed to
accept Path. (Dawid Weiss)
New Features
* GH-53: Review library dependencies and bring them up to date.
(Dawid Weiss)
* Added OSGi support (Michal Hlavac)
* GH-51: Remove and fail on deprecated metadata (fsa.dict.uses-*).
(Dawid Weiss)
Optimizations
* GH-61: Refactored the code to use one encoding/ decoding routine
and ByteBuffers. Removed dependency on Guava.
Bug Fixes
* GH-32: make replaceRunOnWords return "a lot" for "alot", etc.
(Daniel Naber)
* GH-34: ArrayIndexOutOfBoundsException with replacement-pairs.
(Jaume Ortolà, Daniel Naber)
======================= morfologik-stemming 1.10.0 =======================
Changes in backwards compatibility policy
New Features
* Added OSGi support (Michal Hlavac)
Bug Fixes
* GH-32: make replaceRunOnWords return "a lot" for "alot", etc.
(Daniel Naber)
* GH-34: ArrayIndexOutOfBoundsException with replacement-pairs.
(Jaume Ortolà, Daniel Naber)
======================= morfologik-stemming 1.9.1 =======================
Changes in backwards compatibility policy
New Features
Bug Fixes
* Now only the longest replacement key is selected when using replacement
pairs (thanks to Jaume Ortolà). This fixes a subtle regression
introduced in 1.9.0.
Optimizations
======================= morfologik-stemming 1.9.0 =======================
Changes in backwards compatibility policy
New Features
* Added capability to normalize input and output strings for dictionaries.
This is useful for dictionaries that do not support ligatures, for example.
To specify input conversion, use the property 'fsa.dict.input-conversion'
in the .info file. The output conversion (for example, to use ligatures)
is specified by 'fsa.dict.output-conversion'. Note that lengthy
conversion tables may negatively affect performance.
Bug Fixes
Optimizations
* The suggestion search for the speller is now performed directly by traversing
the dictionary automaton, which makes it much more time-efficient (thanks
to Jaume Ortolà).
* Suggestions are generated faster by avoiding unnecessary case conversions.
======================= morfologik-stemming 1.8.3 =======================
Bug Fixes
* Fixed a bug for spelling dictionaries in non-UTF encodings with
separators: strings with non-encodable characters might have been
accepted as spelled correctly even if they were missing in the
dictionary.
======================= morfologik-stemming 1.8.2 =======================
New Features
* Added the option of using frequencies of words for sorting spelling
replacements. It can be used in both spelling and tagging dictionaries.
'fsa.dict.frequency-included=true' must be added to the .info file.
For building the dictionary, add at the end of each entry a separator and
a character between A and Z (A: less frequently used words;
Z: more frequently used words). (Jaume Ortolà)
======================= morfologik-stemming 1.8.1 =======================
Changes in backwards compatibility policy
* MorphEncodingTool will *fail* if it detects data/lines that contain the
separator annotation byte. This is because such lines get encoded into
something that the decoder cannot process. You can use \u0000 as the
annotation byte to avoid clashes with any existing data.
======================= morfologik-stemming 1.8.0 =======================
Changes in backwards compatibility policy
* Command-line option changes to MorphEncodingTool - it now accepts an explicit
name of the sequence encoder, not infix/suffix/prefix booleans.
* Updating dependencies to their newest versions.
New Features
* Dictionary .info files can specify the sequence decoder explicitly:
suffix, prefix, infix, none are supported. For backwards compatibility,
fsa.dict.uses-prefixes, fsa.dict.uses-infixes and fsa.dict.uses-suffixes
are still supported, but will be removed in the next major version.
* Command-line option changes to MorphEncodingTool - it now accepts an explicit
name of the sequence encoder, not infix/suffix/prefix booleans.
* Rewritten implementation of tab-separated data files (tab2morph tool).
The output should yield smaller files, especially for prefix encoding
and infix encoding. This does *not* necessarily mean smaller automata
but we're working on getting these as well.
Example output before and after refactoring:
Prefix coder:
postmodernizm|modernizm|xyz => [before] postmodernizm+ANmodernizm+xyz
=> [after ] postmodernizm+EA+xyz
Infix coder:
laquelle|lequel|D f s => [before] laquelle+AAHequel+D f s
=> [after ] laquelle+AGAquel+D f s
* Changed the default format of the Polish dictionary from infix
encoded to prefix encoded (smaller output size).
Optimizations
* A number of internal implementation cleanups and refactorings.
======================= morfologik-stemming 1.7.2 =======================
* A quick fix for incorrect decoding of certain suffixes (long suffixes).
* Increased max. recursion level in Speller to 6 from 4. (Jaume Ortolà)
======================= morfologik-stemming 1.7.1 =======================
* Fixed a couple of bugs in morfologik-speller (Jaume Ortolà).
======================= morfologik-stemming 1.7.0 =======================
* Changed DictionaryMetadata API (access methods for encoder/decoder).
* Initial version of morfologik-speller component.
* Minor changes to the FSADumpTool: the header block is always UTF-8
encoded, the default platform encoding does not matter. This is done to
always support certain attributes that may be unicode (and would be
incorrectly dumped otherwise).
* Metadata *.info files can now be encoded in UTF-8 to support text
attributes that otherwise would require text2ascii conversion.
======================= morfologik-stemming 1.6.0 =======================
* Update morfologik-polish data to Morfologik 2.0 PoliMorf (08.03.2013).
Deprecated DICTIONARY constants (unified dictionary only).
* Important! The format of encoding tags has changed and is now
multiple-tags-per-lemma. The value returned from WordData#getTag
may be a number of tags concatenated with a "+" character. Previously
the same lamma/stem would be returned multiple times, each time with
a different tag.
* Moving code from SourceForge to github.
======================= morfologik-stemming 1.5.5 =======================
* Made hppc an optional component of morfologik-fsa. It is required
for constructing FSA automata only and causes problems with javac.
http://stackoverflow.com/questions/3800462/can-i-prevent-javac-accessing-the-class-path-from-the-manifests-of-our-third-par
======================= morfologik-stemming 1.5.4 =======================
* Replaced byte-based speller with CharBasedSpeller.
* Warn about UTF-8 files with BOM.
* Fixed a typo in package name (speller).
======================= morfologik-stemming 1.5.3 =======================
* Initial release of spelling correction submodule.
* Updated morfologik-polish data to morfologik 1.9 [12.06.2012]
* Updated morfologik-polish licensing info to BSD (yay).
======================= morfologik-stemming 1.5.2 =======================
* An alternative Polish dictionary added (BSD licensed): SGJP (Morfeusz).
PolishStemmer can now take an enum switching between the dictionary to
be used or combine both.
* Project split into modules. A single jar version (no external
dependencies) added by transforming via proguard.
* Enabled use of escaped special characters in the tab2morph tool.
* Added guards against the input term having separator character
somewhere (this will now return an empty list of matches). Added
getSeparatorChar to DictionaryLookup so that one can check for this
condition manually, if needed.
======================= morfologik-stemming 1.5.1 =======================
* Build system switch to Maven (tested with Maven2).
======================= morfologik-stemming 1.5.0 =======================
* Major size saving improvements in CFSA2. Built in Polish dictionary
size decreased from 2,811,345 to 1,806,661 (CFSA2 format).
* FSABuilder returns a ready-to-be-used FSA (ConstantArcSizeFSA).
Construction overhead for this automaton is a round zero (it is
immediately serialized in-memory).
* Polish dictionary updated to Morfologik 1.7. [19.11.2010]
* Added an option to serialize automaton to CFSA2 or FSA5 directly from
fsa_build.
* CFSA is now deprecated for serialization (the code still reads CFSA
automata, but will no be able to serialize them). Use CFSA2.
* Added immediate state interning. Speedup in automaton construction by
about 30%, memory use decreased significantly (did not perform exact
measurements, but incremental construction from presorted data should
consume way less memory).
* Added an option to build FSA from already sorted data (--sorted).
Avoids in-memory sorting. Pipe the input through shell sort if
building FSA from large data.
* Changed the default ordering from Java signed-byte to C-like unsigned
byte value. This lets one use GNU sort to sort the input using
'export LC_ALL=C; sort input'.
* Added traversal routines to calculate perfect hashing based on
FSA with NUMBERS.
* Changed the order of serialized arcs in the binary serializer for FSA5
to lexicographic (consistent with the input). Depth-first traversal
recreates the input, in other words.
* Removed character-based automata.
* Incompatible API changes to FSA builders (moved to morfologik.fsa).
* Incompatible API changes to FSATraversalHelper. Cleaned up match
types, added unit tests.
* An external dependency HPPC (high performance primitive collections)
is now required
======================= morfologik-stemming 1.4.1 =======================
* Upgrade of the built-in Morfologik dictionary for Polish (in CFSA
format).
* Added options to define custom FILLER and ANNOT_SEPARATOR bytes in the
fsa_build tool.
* Corrected an inconsistency with the C fsa package -- FILLER and
ANNOT_SEPARATOR characters are now identical with the C version.
* Cleanups to the tools' launcher -- will complain about missing JARs,
if any.
======================= morfologik-stemming 1.4.0 =======================
* Added FSA5 construction in Java (on byte sequences). Added preliminary
support for character sequences. Added a command line tool for FSA5
construction from unsorted data (sorting is done in-memory).
* Added a tool to encode tab-delimited dictionaries to the format
accepted by fsa_build and FSA5 construction tool.
* Added a new version of Morfologik dictionary for Polish (in CFSA format).
======================= morfologik-stemming 1.3.0 =======================
* Added runtime checking for tools availability so that unavailable tools
don't show up in the list.
* Recompressed the built-in Polish dictionary to CFSA.
* Cleaned up FSA/Dictionary separation. FSAs don't store encoding any more
(because it does not make sense for them to do so). The FSA is a purely
abstract class pushing functionality to sub-classes. Input stream
reading cleaned up.
* Added initial code for CFSA (compressed FSA). Reduces automata size
about 10%.
* Changes in the public API. Implementation classes renamed (FSAVer5Impl
into FSA5). Major tweaks and tunes to the API.
* Added support for version 5 automata built with NUMBERS flag (an extra
field stored for each node).
======================= morfologik-stemming 1.2.2 =======================
* License switch to plain BSD (removed the patent clause which did not
make much sense anyway).
* The build ZIP now includes licenses for individual JARs (prevents
confusion).
======================= morfologik-stemming 1.2.1 =======================
* Fixed tool launching routines.
======================= morfologik-stemming 1.2.0 =======================
* Package hierarchy reorganized.
* Removed stempel (heuristic stemmer for Polish).
* Code updated to Java 1.5.
* The API has changed in many places (enums instead of constants,
generics, iterables, removed explicit Arc and Node classes and replaced
by int pointers).
* FSA traversal in version 1.2 is implemented on top of primitive data
structures (int pointers) to keep memory usage minimal. The speed
boost gained from this is enormous and justifies less readable code. We
strongly advise to use the provided iterators and helper functions
for matching state sequences in the FSA.
* Tools updated. Dumping existing FSAs is much, much faster now.
======================= morfologik-stemming 1.1.4 =======================
* Fixed a bug that caused UTF-8 dictionaries to be garbled. Now it
should be relatively safe to use UTF-8 dictionaries (note: separators
cannot be multibyte UTF-8 characters, yet this is probably a very
rare case).
======================= morfologik-stemming 1.1.3 =======================
* Fixed a bug causing NPE when the library is called with null context
class loader (happens when JVM is invoked from an JNI-attached
thread). Thanks to Patrick Luby for report and detailed analysis.
* Updated the built-in dictionary to the newest version available.
======================= morfologik-stemming 1.1.2 =======================
* Fixed a bug causing JAR file locking (by implementing a workaround).
* Fixed the build script (manifest file was broken).
======================= morfologik-stemming 1.1.1 =======================
* Distribution script fixes. The final JAR does not contain test classes
and resources. Size trimmed almost twice compared to release 1.1.
* Updated the dump tool to accept dictionary metadata files.
======================= morfologik-stemming 1.1 =========================
* Introduced an auxiliary "meta" information files about compressed
dictionaries. Such information include delimiter symbol, encoding
and infix/prefix/postfix decoding info.
* The API has changed (repackaging). Some deprecated methods have been
removed. This is a major redesign/ upgrade, you will have to adjust
your source code.
* Cleaned up APIs and interfaces.
* Added infrastructure for command-line tool launching.
* Cleaned up tests.
* Changed project name to morfologik-stemmers and ownership to
(c) Morfologik.
======================= morfologik-stemming 1.0.7 =======================
* Removed one bug in fsa 'compression' decoding.
======================= morfologik-stemming 1.0.6 =======================
* Customized version of stempel replaced with a standard distribution.
* Removed deprecated methods and classes.
* Added infix and prefix encoding support for fsa dictionaries.
======================= morfologik-stemming 1.0.5 =======================
* Added filler and separator char dumps to FSADump.
* A major bug in automaton traversal corrected. Upgrade when possible.
* Certain API changes were introduced; older methods are now deprecated
and will be removed in the future.
======================= morfologik-stemming 1.0.4 =======================
* Licenses for full and no-dict versions.
======================= morfologik-stemming 1.0.3 =======================
* Project code moved to SourceForge (subproject of Morfologik).
LICENSE CHANGED FROM PUBLIC DOMAIN TO BSD (doesn't change much, but
clarifies legal issues).
======================= morfologik-stemming 1.0.2 =======================
* Added a Lametyzator constructor which allows custom dictionary stream,
field delimiters and encoding. Added an option for building stand-alone
JAR that does not include the default polish dictionary.
======================= morfologik-stemming 1.0.1 =======================
* Code cleanups. Added a method that returns the third automaton's column
(form).
======================= morfologik-stemming 1.0 =========================
* Initial release