User Dictionary
Jos Denys edited this page Oct 19, 2020
·
37 revisions
The User Dictionary currently serves two purposes: suppressing or forcing sentence end conditions, and providing extra semantic information.
- Sentence end condition: iKnow uses simple heuristics to detect sentence endings. A list of generic acronyms (English acronyms) is part of the language model to prevent unnatural sentence splitting. For finer user control, specific terms can be added to the user dictionary.
- User-defined semantics: iKnow tags lexreps using labels (English labels). Next to these language-specific labels, a vast set of language-independent labels is used; a subset of these are the User Dictionary (UD*) labels, which can be used to assign extra user-defined semantics. User dictionary labels are assigned before lexrep lookup and override the lexrep labels (English lexreps). However, the language rules need to pick up the UD labels to make them effective. If the language model does not support a specific label, it will not be taken into account. For an overview of the current state of UD label support, see the following table.
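The sentence end behavior described in the first bullet can be pictured with a toy splitter. This is only a minimal sketch, not iKnow's actual heuristics: the function `split_sentences`, the `GENERIC_ACRONYMS` set, and the `user_conditions` mapping are all hypothetical names invented for the illustration. It assumes a token ends a sentence when it ends in `.`, `!` or `?`, unless a built-in acronym list or a user entry says otherwise.

```python
# Hypothetical stand-in for the acronym list shipped with a language model.
GENERIC_ACRONYMS = {"Dr.", "Mr.", "e.g."}


def split_sentences(text, user_conditions=None):
    """Split text into sentences, letting user entries override the defaults.

    user_conditions maps a literal token to True (force a sentence end)
    or False (suppress a sentence end). This mirrors the idea of a user
    dictionary entry, not the real iKnow implementation.
    """
    user_conditions = user_conditions or {}
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token in user_conditions:
            is_end = user_conditions[token]   # the user entry always wins
        elif token in GENERIC_ACRONYMS:
            is_end = False                    # known acronym: never split
        else:
            is_end = token[-1] in ".!?"       # default heuristic
        if is_end:
            sentences.append(" ".join(current))
            current = []
    if current:                               # trailing text without a terminator
        sentences.append(" ".join(current))
    return sentences


default_split = split_sentences("some text Fr. and following.")
suppressed = split_sentences("some text Fr. and following.", {"Fr.": False})
forced = split_sentences("first part STOP second part.", {"STOP": True})
```

In this sketch the default heuristic splits after `Fr.`, the `False` entry keeps the text as one sentence, and the `True` entry forces a split after a token that would otherwise not end a sentence.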
The user dictionary is supported as of version 1.0.
Each of these two functions has a corresponding method:
- influence the sentence boundary detection by defining abbreviations and sentence-ending strings
import iknowpy

engine = iknowpy.iKnowEngine()
user_dictionary = iknowpy.UserDictionary()
user_dictionary.add_sent_end_condition("Fr.", False) # suppress 'Fr.' as a sentence terminator
ret = engine.load_user_dictionary(user_dictionary) # make the engine use this dictionary
engine.index("some text Fr. and following.", "en")
# Normally 'Fr.' would split the sentence, but because add_sent_end_condition() was called with False, the text remains one sentence.
- use a user dictionary label to tag a specific term
user_dictionary = iknowpy.UserDictionary()
ret = user_dictionary.add_label("some text", "UDUnit") # "some text" will be labeled "UDUnit", before lexrep lookup
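The phrase "before lexrep lookup" can be illustrated with a small lookup-order sketch. It is an assumption-laden toy, not the engine's code: `assign_label`, `user_labels`, `model_lexreps`, and the label name `ModelLabel` are hypothetical, and only `UDUnit` comes from the example above. The sketch shows the user dictionary being consulted first, so the language model's own entry is reached only when the user dictionary has no match.

```python
def assign_label(term, user_labels, model_lexreps):
    """Return (label, source) for a term, consulting the user dictionary first.

    Both arguments are plain dicts standing in for the real data structures.
    """
    if term in user_labels:
        # User dictionary entries are applied before the lexrep lookup,
        # so they take precedence over the language model's label.
        return user_labels[term], "user dictionary"
    # Fall through to the (hypothetical) language model table.
    return model_lexreps.get(term), "language model"


user_labels = {"some text": "UDUnit"}
model_lexreps = {"some text": "ModelLabel", "other text": "ModelLabel"}

overridden = assign_label("some text", user_labels, model_lexreps)
untouched = assign_label("other text", user_labels, model_lexreps)
unknown = assign_label("missing", user_labels, model_lexreps)
```

Whether an assigned UD label then has any effect still depends on the language rules picking it up, which this sketch does not model.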
To simplify manual labeling, every available user label has a corresponding shortcut method, which makes code more readable and prevents typos in label names:
- force words or sequences of words to take a specified role (Concept, Relation, PathRelevant, NonRelevant)
user_dictionary.add_concept("one concept") # mark as a concept
user_dictionary.add_relation("one relation") # mark as a relation
user_dictionary.add_non_relevant("crap") # mark as non relevant
- define additional Negation markers
user_dictionary.add_negation("w/o") # mark w/o as a negation
- define Sentiment markers
user_dictionary.add_positive_sentiment("great") # mark as a positive sentiment
user_dictionary.add_negative_sentiment("awful") # mark as a negative sentiment
- define Time markers
user_dictionary.add_time("future") # mark as a time attribute
- define units and numbers for Measurements
user_dictionary.add_unit("Hg") # mark as a unit
user_dictionary.add_number("magic number") # mark as a number
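When many terms need labeling, the shortcut calls above can be driven from a table. The following is a minimal sketch under stated assumptions: `populate`, `SHORTCUT_KINDS`, `ENTRIES`, and `RecordingDictionary` are hypothetical helpers written for this illustration, and only the `add_*` method names come from the examples above. The recording class stands in for `iknowpy.UserDictionary` so the sketch runs without iknowpy installed.

```python
# Shortcut suffixes taken from the add_* methods shown above.
SHORTCUT_KINDS = {
    "concept", "relation", "non_relevant", "negation",
    "positive_sentiment", "negative_sentiment", "time", "unit", "number",
}


def populate(user_dictionary, entries):
    """Call user_dictionary.add_<kind>(literal) for every (kind, literal) pair."""
    for kind, literal in entries:
        if kind not in SHORTCUT_KINDS:
            # Catch misspelled kinds early instead of failing inside the engine.
            raise ValueError(f"unknown shortcut kind: {kind!r}")
        getattr(user_dictionary, "add_" + kind)(literal)


class RecordingDictionary:
    """Stand-in for a user dictionary that only records which calls were made."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, i.e. the add_* methods.
        if not name.startswith("add_"):
            raise AttributeError(name)
        return lambda literal: self.calls.append((name, literal))


ENTRIES = [
    ("concept", "one concept"),
    ("relation", "one relation"),
    ("negation", "w/o"),
    ("positive_sentiment", "great"),
    ("negative_sentiment", "awful"),
    ("time", "future"),
    ("unit", "Hg"),
    ("number", "magic number"),
]

recorder = RecordingDictionary()
populate(recorder, ENTRIES)
```

With iknowpy available, passing an `iknowpy.UserDictionary()` instance instead of `recorder` and then calling `engine.load_user_dictionary(...)` should apply the same entries to a real engine.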
For some extra information on sentiment analysis, see this very interesting IRIS article 👍