Commit
Showing 9 changed files with 179 additions and 17 deletions.
@@ -0,0 +1,22 @@
Dependency Distance
--------------------

The *dependency_distance* component adds measures of dependency distance to :code:`Doc`, :code:`Span`, and :code:`Token` objects under the :code:`._.dependency_distance` attribute.
Dependency distance can be used as a measure of syntactic complexity (the greater the distance, the more complex the text).

For :code:`Doc` objects, the mean and standard deviation of dependency distance at the sentence level are returned, along with the mean and standard deviation of the proportion of adjacent dependency relations at the sentence level.

For :code:`Span` objects, the mean dependency distance and the mean proportion of adjacent dependency relations in the span are returned.

For :code:`Token` objects, the dependency distance and whether the token's head is an adjacent token are returned.

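The measures above can be illustrated without spaCy. The following is a minimal sketch, not the component's actual implementation. It assumes that a sentence is given as a list of head indices (one per token), that the root token points to itself and therefore gets a distance of 0, and that a relation counts as adjacent when the distance is exactly 1. The function names :code:`token_distances` and :code:`summarize` and the dictionary keys are hypothetical.

.. code-block:: python

   def token_distances(heads):
       """Distance between each token and its head, measured in token positions."""
       return [abs(index - head) for index, head in enumerate(heads)]


   def summarize(heads):
       """Span-level summary: mean distance and proportion of adjacent relations."""
       distances = token_distances(heads)
       n_adjacent = sum(1 for distance in distances if distance == 1)
       return {
           "mean_distance": sum(distances) / len(distances),
           "prop_adjacent": n_adjacent / len(distances),
       }


   # "She reads long books": She -> reads, reads is the root,
   # long -> books, books -> reads (assumed parse, for illustration only)
   heads = [1, 1, 3, 1]
   summary = summarize(heads)

A :code:`Doc`-level value would then be the mean and standard deviation of such per-sentence summaries.
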

textdescriptives.components.dependency_distance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: textdescriptives.components.dependency_distance
   :members:
   :undoc-members:
   :show-inheritance:

.. :exclude-members: function
.. for functions you wish to exclude
@@ -0,0 +1,38 @@
Descriptive Statistics
----------------------

The *descriptive_stats* component extracts a number of descriptive statistics.
The following attributes are added:

* :code:`._.counts` (:code:`Doc` & :code:`Span`)

  * Number of tokens.
  * Number of unique tokens.
  * Proportion of unique tokens.
  * Number of characters.

* :code:`._.sentence_length` (:code:`Doc`)

  * Mean sentence length.
  * Median sentence length.
  * Std of sentence length.

* :code:`._.syllables` (:code:`Doc`)

  * Mean number of syllables per token.
  * Median number of syllables per token.
  * Std of number of syllables per token.

* :code:`._.token_length` (:code:`Doc` & :code:`Span`)

  * Mean token length.
  * Median token length.
  * Std of token length.

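To make the listed statistics concrete, here is a small sketch using only the standard library. It is not the component's code: whitespace splitting stands in for spaCy's tokenizer, the number of characters is taken to be the sum of token lengths, the standard deviation is the population standard deviation, and the function names and dictionary keys are hypothetical.

.. code-block:: python

   from statistics import mean, median, pstdev


   def counts(tokens):
       """Token counts in the spirit of the ._.counts attribute."""
       n_unique = len(set(tokens))
       return {
           "n_tokens": len(tokens),
           "n_unique_tokens": n_unique,
           "proportion_unique_tokens": n_unique / len(tokens),
           "n_characters": sum(len(token) for token in tokens),
       }


   def token_length(tokens):
       """Mean, median, and population std of token length."""
       lengths = [len(token) for token in tokens]
       return {
           "mean": mean(lengths),
           "median": median(lengths),
           "std": pstdev(lengths),
       }


   # whitespace splitting is an assumption made for this illustration
   tokens = "I feel it in the earth I feel".split()
   token_counts = counts(tokens)
   length_stats = token_length(tokens)

The same pattern applies to sentence length and syllables per token, given a list of per-sentence or per-token values.
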

textdescriptives.components.descriptive_stats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: textdescriptives.components.descriptive_stats
   :members:
   :undoc-members:
   :show-inheritance:

.. :exclude-members: function
.. for functions you wish to exclude
@@ -1,13 +1,27 @@
Readability
--------------------

The *readability* component adds the following readability metrics to :code:`Doc` objects under the :code:`._.readability` attribute:

* Gunning-Fog
* SMOG
* Flesch reading ease
* Flesch-Kincaid grade
* Automated readability index
* Coleman-Liau index
* LIX
* RIX

For specifics of the implementation, refer to the source. The equations are largely derived from the `textstat <https://github.com/shivam5992/textstat>`_ library.

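As an illustration of what such metrics compute, the sketch below implements the standard published formulas for three of them (Flesch reading ease, LIX, and RIX) from raw counts. It is an assumption that the component follows these textbook definitions exactly; how words, sentences, and syllables are counted may differ in the source. The function names are hypothetical, and "long words" means words of more than six characters.

.. code-block:: python

   def flesch_reading_ease(n_words, n_sentences, n_syllables):
       """Higher scores indicate text that is easier to read."""
       words_per_sentence = n_words / n_sentences
       syllables_per_word = n_syllables / n_words
       return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


   def lix(n_words, n_sentences, n_long_words):
       """Mean sentence length plus the percentage of long words."""
       return n_words / n_sentences + 100 * n_long_words / n_words


   def rix(n_sentences, n_long_words):
       """Number of long words per sentence."""
       return n_long_words / n_sentences


   # counts chosen arbitrarily for illustration, not taken from a real text
   fre_score = flesch_reading_ease(n_words=100, n_sentences=5, n_syllables=150)
   lix_score = lix(n_words=100, n_sentences=5, n_long_words=20)
   rix_score = rix(n_sentences=5, n_long_words=20)
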

textdescriptives.components.readability
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: textdescriptives.components.readability
   :members:
   :undoc-members:
   :show-inheritance:
   :private-members:

.. :exclude-members: function
.. for functions you wish to exclude
@@ -1,13 +1,88 @@
Using TextDescriptives
=======================

Import the library and add the component to your pipeline using the string name of the "textdescriptives" component factory:

.. code-block:: python

   import spacy
   import textdescriptives as td

   # load a spaCy model and add the textdescriptives component to the pipeline
   nlp = spacy.load("en_core_web_sm")
   nlp.add_pipe("textdescriptives")
   doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")

   # access some of the values
   doc._.readability
   doc._.token_length

The calculated metrics can be conveniently extracted to a pandas DataFrame using the :code:`extract_df` function.

.. code-block:: python

   td.extract_df(doc)

You can control which measures to extract with the *metrics* argument.

.. code-block:: python

   td.extract_df(doc, metrics=["descriptive_stats", "readability", "dependency_distance"])

By default, :code:`extract_df` adds a column containing the text. You can change this behaviour by setting :code:`include_text=False`.

:code:`extract_df` also works on objects created by :code:`nlp.pipe`. The output will be formatted with one row for each document and one column for each metric.

.. code-block:: python

   docs = nlp.pipe([
       "The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.",
       "He felt that his whole life was some kind of dream and he sometimes wondered whose it was and whether they were enjoying it.",
   ])
   td.extract_df(docs, metrics="dependency_distance")
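
The "one row per document, one column per metric" layout can be pictured with plain dictionaries. This is only a sketch of the output shape, not how :code:`extract_df` is implemented: the helper :code:`to_row`, the column names, and the numbers are all made up for illustration.

.. code-block:: python

   def to_row(text, attributes, include_text=True):
       """Flatten per-attribute metric dicts into a single flat row."""
       row = {"text": text} if include_text else {}
       for values in attributes.values():
           row.update(values)
       return row


   # hypothetical per-document metrics, keyed by attribute name
   documents = [
       ("First text.", {"counts": {"n_tokens": 3}, "token_length": {"token_length_mean": 3.0}}),
       ("A second text.", {"counts": {"n_tokens": 4}, "token_length": {"token_length_mean": 2.75}}),
   ]

   rows = [to_row(text, attributes) for text, attributes in documents]
   rows_without_text = [to_row(text, attributes, include_text=False) for text, attributes in documents]

A list of such rows is the kind of structure that maps directly onto a DataFrame, with the text column dropped when :code:`include_text` is false.
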

Using specific components
=========================

TextDescriptives includes three components that can be used individually: *descriptive_stats*, *readability*, and *dependency_distance*.
This can be helpful if you're only interested in, e.g., readability metrics or descriptive statistics and don't want to run a dependency parser.
If you have imported the TextDescriptives package, you can add them to a pipeline using the standard spaCy syntax.

.. code-block:: python

   import spacy
   import textdescriptives as td

   nlp = spacy.blank("da")
   nlp.add_pipe("descriptive_stats")
   docs = nlp.pipe([
       "Da jeg var atten, tog jeg patent på ild. Det skulle senere vise sig at blive en meget indbringende forretning",
       "Spis skovsneglen, Mulle. Du vil jo gerne være med i hulen, ikk'?",
   ])

   # extract_df is clever enough to only extract metrics that are in the Doc
   td.extract_df(docs, include_text=False)

If you don't want to import the entire TextDescriptives library (although it is very lightweight), you can import only the components you need.

.. code-block:: python

   from textdescriptives import (
       DescriptiveStatistics,
       Readability,
       DependencyDistance,
       TextDescriptives,
   )

Available attributes
====================

The table below shows the metrics included in TextDescriptives and the attributes they set on spaCy's :code:`Doc`, :code:`Span`, and :code:`Token` objects.
For more details on each metric, see the following sections in the documentation.

.. csv-table::
   :header: "Attribute", "Component", "Description"
   :widths: 30, 30, 40

   ":code:`Doc._.token_length`","`descriptive_stats`","Dict containing mean, median, and std of token length."
   ":code:`Doc._.sentence_length`","`descriptive_stats`","Dict containing mean, median, and std of sentence length."
   ":code:`Doc._.syllables`","`descriptive_stats`","Dict containing mean, median, and std of number of syllables per token."
   ":code:`Doc._.counts`","`descriptive_stats`","Dict containing the number of tokens, number of unique tokens, proportion of unique tokens, and number of characters in the Doc."
   ":code:`Doc._.readability`","`readability`","Dict containing Flesch Reading Ease, Flesch-Kincaid Grade, SMOG, Gunning-Fog, Automated Readability Index, Coleman-Liau Index, LIX, and RIX readability metrics for the Doc."
   ":code:`Doc._.dependency_distance`","`dependency_distance`","Dict containing the mean and standard deviation of the dependency distance and of the proportion of adjacent dependency relations in the Doc."
   ":code:`Span._.token_length`","`descriptive_stats`","Dict containing mean, median, and std of token length in the span."
   ":code:`Span._.counts`","`descriptive_stats`","Dict containing the number of tokens, number of unique tokens, proportion of unique tokens, and number of characters in the span."
   ":code:`Span._.dependency_distance`","`dependency_distance`","Dict containing the mean dependency distance and the proportion of adjacent dependency relations in the span."
   ":code:`Token._.dependency_distance`","`dependency_distance`","Dict containing the dependency distance and whether the head word is adjacent for a Token."