Client library for Text Analysis
npm install --save @datafire/tisane_ai
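let tisane_ai = require('@datafire/tisane_ai').create({
  apiKeyHeader: "",  // API key sent as a request header
  apiKeyQuery: ""    // API key sent as a query parameter
});

// Any action listed below can be invoked here; extract_text serves as an example.
tisane_ai.helper.extract_text.post({body: "<p>Hello Tisane API!</p>"})
  .then(data => {
    console.log(data);
  });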
Detect abusive content, obtain sentiment analysis, extract entities, detect topics, automatically correct spelling errors, and more.
Compares two compound named entities and outputs the differences found.
tisane_ai.compare.entities.post({}, context)
- input (object)
Output schema unknown
A service method to remove JavaScript, CSS tags, JSON, and other markup, returning pure decoded text.
tisane_ai.helper.extract_text.post({}, context)
- input (object)
  - body (string)
Output schema unknown
Obtain a list of available languages. No parameters.
tisane_ai["5a4c8182a3511b120c2e80bd"](null, context)
This action has no parameters
Output schema unknown
The method analyzes the input, returning high-level and low-level metadata.
The request body is a JSON structure made of three elements:
- language (string) - a standard IETF tag for the language to analyze
- content (string) - the content to analyze
- settings (structure) - the settings to apply when analyzing
Example:
{"language": "en", "content":"Hello Tisane API!", "settings": {}}
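The request structure can be assembled programmatically. The following is a minimal sketch; `buildRequest` is a hypothetical helper and not part of `@datafire/tisane_ai`, and the validation rules are assumptions made for illustration.

```javascript
// Hypothetical helper (not part of @datafire/tisane_ai): assembles the
// three-element request body described above.
function buildRequest(language, content, settings = {}) {
  if (typeof language !== 'string' || language.length === 0) {
    throw new TypeError('language must be a non-empty IETF tag such as "en"');
  }
  if (typeof content !== 'string') {
    throw new TypeError('content must be a string');
  }
  // Copy the settings so later changes by the caller do not leak into the request.
  return { language, content, settings: { ...settings } };
}

const request = buildRequest('en', 'Hello Tisane API!');
console.log(JSON.stringify(request));
// prints {"language":"en","content":"Hello Tisane API!","settings":{}}
```

Settings named later in this document, such as `snippets` or `explain`, would go into the third argument.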
- Abusive Content
- Sentiment Analysis
- Entities
- Topics
- Advanced Low-Level Data: Sentences, Phrases, and Words
The response contains several sections which are displayed or hidden according to the settings.
The common attributes are:
- text (string) - the original input
- reduced_output (boolean) - if the input is too big, and verbose information like the lexical chunk was requested, the verbose information will not be generated, and this flag will be set to true and returned as part of the response
- sentiment (floating-point number) - a number in range -1 to 1 indicating the document-level sentiment. Only shown when the document_sentiment setting is set to true.
- signal2noise (floating-point number) - a signal to noise ranking of the text, in relation to the array of concepts specified in the relevant setting. Only shown when the relevant setting exists.
The abuse section is an array of detected instances of content that may violate some terms of use. NOTE: the terms of use in online communities may vary, and so it is up to the administrators to determine whether the content is indeed abusive. For instance, it makes no sense to restrict sexual advances in a dating community, or censor profanities when it's accepted in the bulk of the community.
The section exists if instances of abuse are detected and the abuse setting is either omitted or set to true.
Every instance contains the following attributes:
- offset (unsigned integer) - zero-based offset where the instance starts
- length (unsigned integer) - length of the content
- sentence_index (unsigned integer) - zero-based index of the sentence containing the instance
- text (string) - fragment of text containing the instance (only included if the snippets setting is set to true)
- tags (array of strings) - when present, provides additional detail about the abuse. For instance, if the fragment is classified as an attempt to sell hard drugs, one of the tags will be hard_drug.
- type (string) - the type of the abuse
- severity (string) - how severe the abuse is. The levels of severity are low, medium, high, and extreme
- explanation (string) - when available, provides rationale for the annotation; set the explain setting to true to enable.
The currently supported types are:
- personal_attack - an insult / attack on the addressee, e.g. an instance of cyberbullying. Please note that an attack on a post or a point, or just negative sentiment is not the same as an insult. The line may be blurred at times. See our Knowledge Base for more information.
- bigotry - hate speech aimed at one of the protected classes. The hate speech detected is not just racial slurs, but, generally, hostile statements aimed at the group as a whole
- profanity - profane language, regardless of the intent
- sexual_advances - welcome or unwelcome attempts to gain some sort of sexual favor or gratification
- criminal_activity - attempts to sell or procure restricted items, criminal services, issuing death threats, and so on
- external_contact - attempts to establish contact or payment via external means of communication, e.g. phone, email, instant messaging (may violate the rules in certain communities, e.g. gig economy portals, e-commerce portals)
- adult_only - activities restricted for minors (e.g. consumption of alcohol)
- mental_issues - content indicative of suicidal thoughts or depression (LIMITED)
- spam - (RESERVED) spam content
- generic - undefined
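Because each community decides what counts as abusive, a caller will typically filter the abuse section itself. A minimal sketch, assuming the severity levels are ordered as listed above; `filterAbuse` and the sample response are hypothetical and only illustrate the shape of the data.

```javascript
// Severity levels in ascending order, as listed in the attribute description.
const SEVERITY_ORDER = ['low', 'medium', 'high', 'extreme'];

// Hypothetical helper: returns the abuse instances at or above minSeverity.
function filterAbuse(response, minSeverity) {
  const threshold = SEVERITY_ORDER.indexOf(minSeverity);
  if (threshold === -1) {
    throw new RangeError(`unknown severity: ${minSeverity}`);
  }
  // The abuse section is absent when nothing was detected.
  const instances = response.abuse || [];
  return instances.filter(
    (instance) => SEVERITY_ORDER.indexOf(instance.severity) >= threshold
  );
}

// Made-up response fragment, for illustration only.
const sampleResponse = {
  abuse: [
    { offset: 0, length: 12, sentence_index: 0, type: 'profanity', severity: 'low' },
    { offset: 14, length: 20, sentence_index: 1, type: 'personal_attack', severity: 'high' },
  ],
};

console.log(filterAbuse(sampleResponse, 'medium').map((a) => a.type));
// prints [ 'personal_attack' ]
```

A community that tolerates profanity could additionally drop instances by `type` before applying the severity threshold.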
The sentiment_expressions section is an array of detected fragments indicating the attitude towards aspects or entities.
The section exists if sentiment is detected and the sentiment setting is either omitted or set to true.
Every instance contains the following attributes:
- offset (unsigned integer) - zero-based offset where the instance starts
- length (unsigned integer) - length of the content
- sentence_index (unsigned integer) - zero-based index of the sentence containing the instance
- text (string) - fragment of text containing the instance (only included if the snippets setting is set to true)
- polarity (string) - whether the attitude is positive, negative, or mixed. Additionally, there is a default sentiment used for cases when the entire snippet has been pre-classified. For instance, if a review is split into two portions, "What did you like?" and "What did you not like?", and the reviewer replies briefly, e.g. "The quiet. The service", the utterance itself has no sentiment value. When the calling application is aware of the intended sentiment, the default sentiment simply provides the targets / aspects, to which the application then adds the sentiment externally.
- targets (array of strings) - when available, provides the set of aspects and/or entities which are the targets of the sentiment. For instance, when the utterance is "The breakfast was yummy but the staff is unfriendly", the targets for the two sentiment expressions are meal and staff. Named entities may also be targets of the sentiment.
- reasons (array of strings) - when available, provides reasons for the sentiment. In the example utterance above ("The breakfast was yummy but the staff is unfriendly"), the reasons array for staff is ["unfriendly"], while the reasons array for meal is ["tasty"].
- explanation (string) - when available, provides rationale for the sentiment; set the explain setting to true to enable.
Example:
"sentiment_expressions": [
{
"sentence_index": 0,
"offset": 0,
"length": 32,
"polarity": "positive",
"reasons": ["close"],
"targets": ["location"]
},
{
"sentence_index": 0,
"offset": 38,
"length": 29,
"polarity": "negative",
"reasons": ["disrespectful"],
"targets": ["staff"]
}
]
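To aggregate such a section per aspect, the expressions can be grouped by target. This is a sketch only; `sentimentByTarget` is a hypothetical helper, and the data is the example array shown above.

```javascript
// Hypothetical helper: maps every target to the polarities and reasons reported for it.
function sentimentByTarget(expressions) {
  const byTarget = new Map();
  for (const expression of expressions || []) {
    // targets and reasons are only present "when available".
    for (const target of expression.targets || []) {
      if (!byTarget.has(target)) {
        byTarget.set(target, []);
      }
      byTarget.get(target).push({
        polarity: expression.polarity,
        reasons: expression.reasons || [],
      });
    }
  }
  return byTarget;
}

// The example array shown above.
const sentimentExpressions = [
  { sentence_index: 0, offset: 0, length: 32, polarity: 'positive', reasons: ['close'], targets: ['location'] },
  { sentence_index: 0, offset: 38, length: 29, polarity: 'negative', reasons: ['disrespectful'], targets: ['staff'] },
];

for (const [target, entries] of sentimentByTarget(sentimentExpressions)) {
  console.log(target, entries.map((e) => e.polarity).join(','));
}
// prints:
// location positive
// staff negative
```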
The entities_summary section is an array of named entity objects detected in the text.
The section exists if named entities are detected and the entities setting is either omitted or set to true.
Every entity contains the following attributes:
- name (string) - the most complete name of the entity in the text of all the mentions
- ref_lemma (string) - when available, the dictionary form of the entity in the reference language (English) regardless of the input language
- type (string) - a string or an array of strings specifying the type of the entity, such as person, organization, numeric, amount_of_money, place. Certain entities, like countries, may have several types (because a country is both a place and an organization).
- subtype (string) - a string indicating the subtype of the entity
- mentions (array of objects) - a set of instances where the entity was mentioned in the text
Every mention contains the following attributes:
- offset (unsigned integer) - zero-based offset where the instance starts
- length (unsigned integer) - length of the content
- sentence_index (unsigned integer) - zero-based index of the sentence containing the instance
- text (string) - fragment of text containing the instance (only included if the snippets setting is set to true)
Example:
"entities_summary": [
  {
    "type": "person",
    "name": "John Smith",
    "ref_lemma": "John Smith",
    "mentions": [
      { "sentence_index": 0, "offset": 0, "length": 10 }
    ]
  },
  {
    "type": ["organization", "place"],
    "name": "UK",
    "ref_lemma": "U.K.",
    "mentions": [
      { "sentence_index": 0, "offset": 40, "length": 2 }
    ]
  }
]
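Two details of this structure are easy to trip over: `type` may be a string or an array, and the mention text is only returned when `snippets` is enabled. The sketch below handles both. The helpers and the input sentence are hypothetical (the sentence was made up so that the offsets of the example line up), and it assumes offsets count JavaScript string units, which this document does not specify.

```javascript
// Hypothetical helpers, not part of the client library.

// The type attribute may be a string or an array of strings; always return an array.
function entityTypes(entity) {
  return Array.isArray(entity.type) ? entity.type : [entity.type];
}

// Recovers the mention text from the original input.
// Assumption: offset and length count JavaScript string (UTF-16) units.
function mentionText(input, mention) {
  return input.slice(mention.offset, mention.offset + mention.length);
}

// Made-up input chosen so that the offsets of the example above line up.
const input = 'John Smith flew from Paris to the rainy UK.';
const entitiesSummary = [
  {
    type: 'person',
    name: 'John Smith',
    ref_lemma: 'John Smith',
    mentions: [{ sentence_index: 0, offset: 0, length: 10 }],
  },
  {
    type: ['organization', 'place'],
    name: 'UK',
    ref_lemma: 'U.K.',
    mentions: [{ sentence_index: 0, offset: 40, length: 2 }],
  },
];

for (const entity of entitiesSummary) {
  const texts = entity.mentions.map((m) => mentionText(input, m));
  console.log(entity.name, entityTypes(entity).join('+'), texts.join(','));
}
// prints:
// John Smith person John Smith
// UK organization+place UK
```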
The currently supported types are:
- person, with optional subtypes: fictional_character, important_person, spiritual_being
- organization (note that a country is both an organization and a place)
- place
- time_range
- date
- time
- hashtag
- email
- amount_of_money
- phone - phone number, either domestic or international, in a variety of formats
- role (a social role, e.g. position in an organization)
- software
- website (URL), with an optional subtype: tor for Onion links; note that web services may also have the software type assigned
- weight
- bank_account - only IBAN format is supported; subtypes: iban
- credit_card, with optional subtypes: visa, mastercard, american_express, diners_club, discovery, jcb, unionpay
- coordinates (GPS coordinates)
- credential, with optional subtypes: md5, sha-1
- crypto, with optional subtypes: bitcoin, ethereum, monero, monero_payment_id, litecoin, dash
- event
- file - only Windows pathnames are supported; subtypes: windows, facebook (for images downloaded from Facebook)
- flight_code
- identifier
- ip_address, subtypes: v4, v6
- mac_address
- numeric (an unclassified numeric entity)
- username
The topics section is an array of topics (subjects, domains, themes in other terms) detected in the text.
The section exists if topics are detected and the topics setting is either omitted or set to true.
By default, a topic is a string. If the topic_stats setting is set to true, then every entry in the array contains:
- topic (string) - the topic itself
- coverage (floating-point number) - a number between 0 and 1, indicating the ratio between the number of sentences where the topic is detected to the total number of sentences
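Since the entries are plain strings by default and objects when topic_stats is enabled, consuming code may want a single shape. A minimal sketch; `normalizeTopics` is a hypothetical helper, the sample topics are made up, and using `null` for a missing coverage is an arbitrary choice.

```javascript
// Hypothetical helper: converts both forms of the topics section into
// { topic, coverage } objects; coverage is null when topic_stats was not requested.
function normalizeTopics(topics) {
  return (topics || []).map((entry) =>
    typeof entry === 'string'
      ? { topic: entry, coverage: null }
      : { topic: entry.topic, coverage: entry.coverage }
  );
}

// Made-up sample data in both forms.
console.log(normalizeTopics(['hotel', 'food']));
console.log(normalizeTopics([{ topic: 'hotel', coverage: 0.5 }]));
```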
Tisane allows obtaining more in-depth data, specifically:
- sentences and their corrected form, if a misspelling was detected
- lexical chunks and their grammatical and stylistic features
- parse trees and phrases
The sentence_list section is generated if the words or the parses setting is set to true.
Every sentence structure in the list contains:
- offset (unsigned integer) - zero-based offset where the sentence starts
- length (unsigned integer) - length of the sentence
- text (string) - the sentence itself
- corrected_text (string) - if a misspelling was detected and the spellchecking is active, contains the automatically corrected text
- words (array of structures) - if the words setting is set to true, generates extended information about every lexical chunk. (The term "word" is used for the sake of simplicity, however, it may not be linguistically correct to equate lexical chunks with words.)
- parse_tree (object) - if the parses setting is set to true, generates information about the parse tree and the phrases detected in the sentence.
- nbest_parses (array of parse objects) - if the parses setting is set to true and the deterministic setting is set to false, generates information about the parse trees that were deemed close enough to the best one but not the best.
Every lexical chunk ("word") structure in the words array contains:
- type (string) - the type of the element: punctuation for punctuation marks, numeral for numerals, or word for everything else
- text (string) - the text
- offset (unsigned integer) - zero-based offset where the element starts
- length (unsigned integer) - length of the element
- corrected_text (string) - if a misspelling is detected, the corrected form
- lettercase (string) - the original letter case: upper, capitalized, or mixed. If lowercase or no case, the attribute is omitted.
- stopword (boolean) - determines whether the word is a stopword
- grammar (array of strings or structures) - generates the list of grammar features associated with the word. If the feature_standard setting is defined as native, then every feature is an object containing a numeral (index) and a string (value). Otherwise, every feature is a plain string
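The attributes above are enough to pull the meaningful words out of a sentence. A minimal sketch under stated assumptions: `contentWords` is a hypothetical helper, the sample sentence structure is made up, and it assumes the stopword attribute may simply be absent for non-stopwords.

```javascript
// Hypothetical helper: the text of every non-stopword lexical chunk,
// preferring the corrected form when a misspelling was detected.
function contentWords(sentence) {
  return (sentence.words || [])
    .filter((w) => w.type === 'word' && !w.stopword)
    .map((w) => w.corrected_text || w.text);
}

// Made-up sentence structure, for illustration only.
const sentence = {
  offset: 0,
  length: 18,
  text: 'The car was drivn.',
  words: [
    { type: 'word', text: 'The', offset: 0, length: 3, lettercase: 'capitalized', stopword: true },
    { type: 'word', text: 'car', offset: 4, length: 3 },
    { type: 'word', text: 'was', offset: 8, length: 3, stopword: true },
    { type: 'word', text: 'drivn', offset: 12, length: 5, corrected_text: 'driven' },
    { type: 'punctuation', text: '.', offset: 17, length: 1 },
  ],
};

console.log(contentWords(sentence));
// prints [ 'car', 'driven' ]
```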
For lexical chunks only:
- role (string) - semantic role, like agent or patient. Note that in passive voice, the semantic roles are reverse to the syntactic roles. E.g. in a sentence like "The car was driven by David", car is the patient, and David is the agent.
- numeric_value (floating-point number) - the numeric value, if the chunk has a value associated with it
- family (integer number) - the ID of the family associated with the disambiguated word-sense of the lexical chunk
- definition (string) - the definition of the family, if the fetch_definitions setting is set to true
- lexeme (integer number) - the ID of the lexeme entry associated with the disambiguated word-sense of the lexical chunk
- nondictionary_pattern (integer number) - the ID of a non-dictionary pattern that matched, if the word was not in the language model but was classified by the nondictionary heuristics
- style (array of strings or structures) - generates the list of style features associated with the word. Only if the feature_standard setting is set to native or description
- semantics (array of strings or structures) - generates the list of semantic features associated with the word. Only if the feature_standard setting is set to native or description
- segmentation (structure) - generates info about the selected segmentation, if there are several possibilities to segment the current lexical chunk and the deterministic setting is set to false. A segmentation is simply an array of word structures.
- other_segmentations (array of structures) - generates info about the segmentations deemed incorrect during the disambiguation process. Every entry has the same structure as the segmentation structure.
- nbest_senses (array of structures) - when the deterministic setting is set to false, generates a set of hypotheses that were deemed incorrect by the disambiguation process. Every hypothesis contains the following attributes: grammar, style, and semantics, identical in structure to their counterparts above; and senses, an array of word-senses associated with every hypothesis. Every sense has a family, which is an ID of the associated family; and, if the fetch_definitions setting is set to true, definition and ref_lemma of that family.
For punctuation marks only:
- id (integer number) - the ID of the punctuation mark
- behavior (string) - the behavior code of the punctuation mark. Values: sentenceTerminator, genericComma, bracketStart, bracketEnd, scopeDelimiter, hyphen, quoteStart, quoteEnd, listComma (for East-Asian enumeration commas like 、)
Every parse tree, or more accurately, parse forest, is a collection of phrases, hierarchically linked to each other.
At the top level of the parse, there is an array of root phrases under the phrases element and the numeric id associated with it. Every phrase may have child phrases. Every phrase has the following attributes:
- type (string) - a Penn treebank phrase tag denoting the type of the phrase, e.g. S, VP, NP, etc.
- family (integer number) - an ID of the phrase family
- offset (unsigned integer) - a zero-based offset where the phrase starts
- length (unsigned integer) - the span of the phrase
- role (string) - the semantic role of the phrase, if any, analogous to that of the words
- text (string) - the phrase text, where the phrase members are delimited by the vertical bar character. Child phrases are enclosed in brackets. E.g., driven|by|David or (The|car)|was|(driven|by|David).
Example:
"parse_tree": {
  "id": 4,
  "phrases": [
    {
      "type": "S",
      "family": 1451,
      "offset": 0,
      "length": 27,
      "text": "(The|car)|was|(driven|by|David)",
      "children": [
        {
          "type": "NP",
          "family": 1081,
          "offset": 0,
          "length": 7,
          "text": "The|car",
          "role": "patient"
        },
        {
          "type": "VP",
          "family": 1172,
          "offset": 12,
          "length": 15,
          "text": "driven|by|David",
          "role": "verb"
        }
      ]
    }
  ]
}
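Because phrases nest through the children attribute, a recursive walk is the natural way to consume a parse. The following sketch prints an indented outline of the example phrases above; `outlinePhrases` is a hypothetical helper, not something the library provides.

```javascript
// Hypothetical helper: walks the phrase hierarchy depth-first and returns one
// indented line per phrase, including the semantic role when there is one.
function outlinePhrases(phrases, depth = 0, lines = []) {
  for (const phrase of phrases || []) {
    const role = phrase.role ? ` [${phrase.role}]` : '';
    lines.push(`${'  '.repeat(depth)}${phrase.type}${role}: ${phrase.text}`);
    outlinePhrases(phrase.children, depth + 1, lines);
  }
  return lines;
}

// The parse tree from the example above.
const parseTree = {
  id: 4,
  phrases: [
    {
      type: 'S', family: 1451, offset: 0, length: 27,
      text: '(The|car)|was|(driven|by|David)',
      children: [
        { type: 'NP', family: 1081, offset: 0, length: 7, text: 'The|car', role: 'patient' },
        { type: 'VP', family: 1172, offset: 12, length: 15, text: 'driven|by|David', role: 'verb' },
      ],
    },
  ],
};

console.log(outlinePhrases(parseTree.phrases).join('\n'));
// prints:
// S: (The|car)|was|(driven|by|David)
//   NP [patient]: The|car
//   VP [verb]: driven|by|David
```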
Tisane supports automatic, context-aware spelling correction. Whether it's a misspelling or a deliberate obfuscation, Tisane attempts to deduce the intended meaning if the language model does not recognize the word.
When a correction is found, Tisane adds the corrected_text attribute to the word (if the words / lexical chunks are returned) and the sentence (if the sentence text is generated). Sentence-level corrected_text is displayed if words or parses are set to true.
Note that as Tisane works with large dictionaries, you may need to exclude more esoteric terms by using the min_generic_frequency setting.
Note that the invocation of spell-checking does not depend on whether the sentences and the words sections are generated in the output. The spellchecking can be disabled by setting disable_spellcheck to true. Another option is to enable the spellchecking for lowercase words only, thus excluding potential proper nouns in languages that support capitalization; to avoid spell-checking capitalized and uppercase words, set lowercase_spellcheck_only to true.
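As a rough illustration of the settings and the sentence-level corrected_text described above, the sketch below shows two settings objects and a hypothetical helper, `correctedSentences`, that falls back to the original text when no correction was made. The sample response fragment is made up.

```javascript
// Settings named in the text above; which combination to use is up to the caller.
const spellcheckOff = { disable_spellcheck: true };
// words: true is included so that the sentence list, and with it the
// sentence-level corrected_text, is returned.
const lowercaseOnly = { lowercase_spellcheck_only: true, words: true };

// Hypothetical helper: the corrected text of every sentence, or the original
// text when no correction was made.
function correctedSentences(response) {
  return (response.sentence_list || []).map((s) => s.corrected_text || s.text);
}

// Made-up response fragment, for illustration only.
const sample = {
  sentence_list: [
    { offset: 0, length: 18, text: 'The car was drivn.', corrected_text: 'The car was driven.' },
    { offset: 19, length: 12, text: 'It was fast.' },
  ],
};

console.log(correctedSentences(sample));
// prints [ 'The car was driven.', 'It was fast.' ]
```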
tisane_ai["5a3b7177a3511b11cc29265c"]({}, context)
- input (object)
Output schema unknown
Calculate semantic similarity between two text fragments, in the same language or in two different languages.
tisane_ai.similarity.post({}, context)
- input (object)
Output schema unknown
Finds a URL of an image (Creative Commons) best describing the text.
WARNING: may be slow, as Wikimedia servers are queried.
tisane_ai.text2picture.post({}, context)
- input (object)
Output schema unknown
- settings (object)
  - abuse (boolean)