Upgrading Between Versions
Refer to What's New in Baleen 2.7.0 for a detailed description of the changes in that release.
In particular, review the section on Content Extractors for information on how to upgrade to Baleen 2.7.0.
For a full list of changes in Baleen 2.4.0, see What's New in Baleen 2.4.0.
As of Baleen 2.4.0, annotators and consumers in a pipeline will (by default) attempt to self-order. This removes the need for pipeline developers to have in-depth knowledge of the various annotators, but a self-ordered pipeline may not perform as well as an expert-configured one.
You can disable this feature by adding the following to the pipeline configuration:
orderer: uk.gov.dstl.baleen.core.pipelines.orderers.NoOpOrderer
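As a rough sketch of where this setting sits, the fragment below shows the `orderer` line at the top level of a small pipeline configuration. The reader, annotator and consumer named here are placeholders chosen for illustration, not a recommended pipeline, and the comment about ordering assumes that `NoOpOrderer` performs no reordering, as its name suggests.

```yaml
# Assumption: with NoOpOrderer, components are left in the order listed here
orderer: uk.gov.dstl.baleen.core.pipelines.orderers.NoOpOrderer

# Placeholder components for illustration only
collectionreader:
  class: FolderReader

annotators:
  - regex.Email

consumers:
  - Mongo
```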
All annotators are now required to implement the getAction() method. If you only use the core annotators (i.e. the ones that come bundled with Baleen) then these have already been updated. If you use third party annotators, then you will need to upgrade to use a version compatible with Baleen 2.4.
Job configuration files no longer require the top-level job block. So what was previously:
job:
  schedule: Once
  tasks:
  - MongoStats
Should now be written as:
schedule: Once
tasks:
- MongoStats
Baleen 2.3.0 introduces some fairly large changes to the TypeSystem, in particular to the Temporal aspects. The following classes have been replaced with a new Temporal type:
DateTime
DateType
Time
TimeSpan
This affects both the outputs of Baleen (which may impact downstream tools that consume them) and the required configuration. The following annotators need to be removed from existing configurations and, where indicated, replaced with new annotators.
- cleaners.AddTimeSpans - no longer required as temporal entities inherently support spans
- cleaners.CleanDates - replace with cleaners.CleanTemporal
- cleaners.NormalizeDates - replace with cleaners.NormalizeTemporal
- cleaners.NormalizeTimes - replace with cleaners.NormalizeTemporal
- cleaners.RemoveNestedDateTimes - no longer required as cleaners.RemoveNestedEntities will now work correctly with temporal entities
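The replacements above might look like this in a pipeline configuration. This is a minimal sketch that assumes an existing pipeline listed all five of the old cleaners. The commented-out lines show what is removed, and a real pipeline will contain other annotators alongside these.

```yaml
annotators:
  # Removed in 2.3.0, no replacement needed:
  # - cleaners.AddTimeSpans
  # - cleaners.RemoveNestedDateTimes

  # Replaced in 2.3.0:
  # - cleaners.CleanDates       (now cleaners.CleanTemporal)
  # - cleaners.NormalizeDates   (now cleaners.NormalizeTemporal)
  # - cleaners.NormalizeTimes   (now cleaners.NormalizeTemporal)
  - cleaners.CleanTemporal
  - cleaners.NormalizeTemporal
```

Because both cleaners.NormalizeDates and cleaners.NormalizeTimes map to the same replacement, a single cleaners.NormalizeTemporal entry is shown here in place of the two old entries.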
There are also some new annotators that improve extraction of temporal entities - see the list of new annotators below.
The Temporal type has the following properties:
- precision - EXACT, RELATIVE or UNQUALIFIED depending on the known precision of the temporal instance
- scope - SINGLE or RANGE depending on whether the entity represents a single temporal instance (e.g. 2nd Feb 2017), or a range of temporal instances (e.g. 2-12 Feb 2017)
- temporalType - DATE, TIME or DATETIME depending on the type of the temporal instance
- timestampStart - the Unix timestamp (inclusive) in seconds of the start of the temporal period being represented
- timestampStop - the Unix timestamp (exclusive) in seconds of the end of the temporal period being represented
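To make the inclusive/exclusive timestamps concrete, here is a worked example for the range 2-12 Feb 2017. It is only an illustrative listing of property values, not Baleen's actual output format, and it assumes timestamps are calculated in UTC (the time zone is not stated above). The EXACT precision value is likewise an assumption based on the date being fully specified.

```yaml
# Illustrative property values for the text "2-12 Feb 2017" (assumes UTC)
precision: EXACT            # assumed: day, month and year are all known
scope: RANGE                # the text covers more than one day
temporalType: DATE          # no time-of-day component
timestampStart: 1485993600  # 2017-02-02T00:00:00Z, inclusive
timestampStop: 1486944000   # 2017-02-13T00:00:00Z, exclusive, so 12 Feb is fully covered
```

Under these assumptions the difference between the two timestamps is 950400 seconds, i.e. exactly 11 days (2 Feb to 12 Feb inclusive). A downstream tool can test whether an instant t falls in the period with timestampStart <= t < timestampStop.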
In addition, a new Weapon type has been added to the type system.
In addition to the changes detailed above, the following components have been removed:
- LegacyMongo (Consumer)
The following components have now been added to the standard Baleen build, and you may wish to include these in your configuration. For more information, view the relevant Javadoc.
- ActiveMQReader (Collection Reader) - read documents from an ActiveMQ topic
- ActiveMQ (Consumer) - publish outputs onto an ActiveMQ topic
- cleaners.AddGenderToPerson - add gender information to Person entities
- cleaners.EntityInitials - identify initials following an entity and associate these initials with the entity (including other occurrences)
- cleaners.SplitBrackets - identify entities that include brackets and split the brackets into a separate coreferenced entity
- misc.AddSourceToMetadata - add source information to the document as a Metadata annotation
- regex.RelativeDate - identify temporal entities such as 'last Thursday', and resolve them where possible
- regex.UnqualifiedDate - identify incomplete dates that can't be explicitly resolved (e.g. 2nd February)
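As a sketch of how some of these components could be enabled, the fragment below adds the new annotators to a pipeline's annotators list using their default settings. The selection is illustrative only, and any component-specific parameters should be taken from the Javadoc rather than from this example.

```yaml
annotators:
  # New temporal extractors
  - regex.RelativeDate
  - regex.UnqualifiedDate

  # New cleaners and miscellaneous annotators
  - cleaners.AddGenderToPerson
  - cleaners.EntityInitials
  - cleaners.SplitBrackets
  - misc.AddSourceToMetadata
```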
The following components have been improved, and may now have additional functionality that you wish to use.
- All gazetteers now support the subtype parameter, allowing you to assign subtype information to any entity from a gazetteer
- MoveSource (Consumer) - source files can now be optionally moved to folders based on the document type
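For the gazetteer change, a configuration along the following lines may be a useful starting point. Only the subtype parameter comes from the notes above. The gazetteer.File annotator and its fileName and type parameters are recalled from the standard Baleen build and should be checked against the Javadoc, and the file name and subtype value are hypothetical.

```yaml
annotators:
  - class: gazetteer.File
    fileName: weapons.txt   # hypothetical gazetteer file, one term per line
    type: Weapon            # entity type assigned to each match
    subtype: firearm        # hypothetical subtype value applied to every entity from this gazetteer
```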