Commit
update docs
HLasse committed Jul 28, 2021
1 parent de8a92f commit 98a8aa8
Showing 9 changed files with 179 additions and 17 deletions.
22 changes: 22 additions & 0 deletions docs/dependencydistance.rst
@@ -0,0 +1,22 @@
Dependency Distance
--------------------

The *dependency_distance* component adds measures of dependency distance to :code:`Doc`, :code:`Span`, and :code:`Token` objects under the :code:`._.dependency_distance` attribute.
Dependency distance can be used as a measure of syntactic complexity: the greater the distance, the more complex the sentence.

For :code:`Doc` objects, the mean and standard deviation of dependency distance at the sentence level are returned, along with the mean and standard deviation of the proportion of adjacent dependency relations per sentence.

For :code:`Span` objects, the mean dependency distance and the mean proportion of adjacent dependency relations in the span are returned.

For :code:`Token` objects, the dependency distance and whether the dependency relation is to an adjacent token are returned.
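To make the token-level measure concrete: a token's dependency distance is the absolute difference between its index and its head's index, and the relation is adjacent when that distance is 1. A minimal sketch of the aggregation in plain Python, assuming sentences are given as hypothetical `(index, head_index)` pairs rather than spaCy tokens:

```python
from statistics import mean

def token_dependency_distance(i, head):
    # Absolute distance between a token and its head (0 for the root).
    return abs(i - head)

def sentence_dependency_stats(tokens):
    # Mean dependency distance and proportion of adjacent relations.
    dists = [token_dependency_distance(i, h) for i, h in tokens]
    adjacent = [d == 1 for d in dists]
    return mean(dists), mean(adjacent)

# "The dog barks": "The" -> "dog", "dog" -> "barks", "barks" is the root.
mean_dist, prop_adjacent = sentence_dependency_stats([(0, 1), (1, 2), (2, 2)])
```

Averaging these per-sentence values (and taking their standard deviation) yields the Doc-level statistics described above.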

textdescriptives.components.dependency_distance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: textdescriptives.components.dependency_distance
:members:
:undoc-members:
:show-inheritance:

.. :exclude-members: function
.. for functions you wish to exclude
38 changes: 38 additions & 0 deletions docs/descriptivestats.rst
@@ -0,0 +1,38 @@
Descriptive Statistics
----------------------

The *descriptive_stats* component extracts a number of descriptive statistics.
The following attributes are added:

* ._.counts (:code:`Doc` & :code:`Span`)

* Number of tokens.
* Number of unique tokens.
* Proportion unique tokens.
* Number of characters.
* ._.sentence_length (:code:`Doc`)

* Mean sentence length.
* Median sentence length.
* Std of sentence length.
* ._.syllables (:code:`Doc`)

* Mean number of syllables per token.
* Median number of syllables per token.
* Std of number of syllables per token.
* ._.token_length (:code:`Doc` & :code:`Span`)

* Mean token length.
* Median token length.
* Std of token length.
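As an illustration of what these aggregates contain, the token-length statistics amount to the following (a sketch using the standard library; the dictionary keys here are illustrative, not necessarily the exact keys the package uses):

```python
from statistics import mean, median, pstdev

tokens = ["The", "world", "is", "changed"]
lengths = [len(t) for t in tokens]  # [3, 5, 2, 7]

token_length = {
    "token_length_mean": mean(lengths),      # 4.25
    "token_length_median": median(lengths),  # 4.0
    "token_length_std": pstdev(lengths),     # population standard deviation
}
```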

textdescriptives.components.descriptive_stats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: textdescriptives.components.descriptive_stats
:members:
:undoc-members:
:show-inheritance:

.. :exclude-members: function
.. for functions you wish to exclude
9 changes: 6 additions & 3 deletions docs/index.rst
@@ -4,12 +4,14 @@ TextDescriptives
.. image:: https://img.shields.io/github/stars/hlasse/textdescriptives.svg?style=social&label=Star&maxAge=2592000
:target: https://github.com/hlasse/textdescriptives

TextDescriptives is a Python library for calculating a large variety of statistics from texts using spaCy v3 pipeline components and extensions.
TextDescriptives can be used to calculate several descriptive statistics, readability metrics, and metrics related to dependency distance.
The components are implemented using getters, which means they will only be calculated when accessed.
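The getter mechanism can be illustrated with plain Python properties: the value is computed when the attribute is read, not when the pipeline runs (a schematic analogy, not the library's actual code):

```python
class LazyMetrics:
    """Illustration: the metric is computed when accessed, not at construction."""

    def __init__(self, text):
        self.text = text

    @property
    def n_tokens(self):
        # Runs on attribute access, like a spaCy getter extension.
        return len(self.text.split())

doc = LazyMetrics("The world is changed")
n = doc.n_tokens  # computed here, on first access
```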

📰 News
---------------------------------

* TextDescriptives has been completely re-implemented using spaCy v.3.0. The stanza implementation can be found in the `stanza_version branch <https://github.com/HLasse/TextDescriptives/tree/stanza_version>`_ and will no longer be maintained.


Contents
@@ -33,7 +35,9 @@ The documentation is organized in three parts:
:maxdepth: 3
:caption: Package References

descriptivestats
readability
dependencydistance

.. add more references here
@@ -44,7 +48,6 @@ The documentation is organized in three parts:




Indices and search
==================

2 changes: 1 addition & 1 deletion docs/installation.rst
@@ -1,6 +1,6 @@
Installation
==================
To get started using TextDescriptives you can install it using pip by running the following line in your terminal:

.. code-block::
18 changes: 16 additions & 2 deletions docs/readability.rst
@@ -1,13 +1,27 @@
Readability
--------------------

The *readability* component adds the following readability metrics to :code:`Doc` objects under the :code:`._.readability` attribute.

* Gunning-Fog
* SMOG
* Flesch reading ease
* Flesch-Kincaid grade
* Automated readability index
* Coleman-Liau index
* Lix
* Rix

For specifics of the implementation, refer to the source. The equations are largely derived from the `textstat <https://github.com/shivam5992/textstat>`_ library.
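Two of the simpler formulas, LIX and RIX, illustrate the kind of calculation involved: LIX is the average sentence length plus the percentage of long words (more than six characters), and RIX is the number of long words per sentence. A sketch of the standard formulas (not the package's exact code):

```python
def lix(n_words, n_sentences, n_long_words):
    # LIX = words per sentence + percentage of long words (> 6 characters).
    return n_words / n_sentences + 100 * n_long_words / n_words

def rix(n_long_words, n_sentences):
    # RIX = long words per sentence.
    return n_long_words / n_sentences

score = lix(n_words=20, n_sentences=2, n_long_words=4)  # 10 + 20 = 30.0
```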

textdescriptives.components.readability
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: textdescriptives.components.readability
:members:
:undoc-members:
:show-inheritance:
:private-members:

.. :exclude-members: function
.. for functions you wish to exclude
81 changes: 78 additions & 3 deletions docs/usingthepackage.rst
@@ -1,13 +1,88 @@
Using TextDescriptives
=======================

Import the library and add the component to your pipeline using the string name of the "textdescriptives" component factory:

.. code-block:: python

   import spacy
   import textdescriptives as td
   nlp = spacy.load("en_core_web_sm")
   nlp.add_pipe("textdescriptives")
   doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")
   # access some of the values
   doc._.readability
   doc._.token_length

The calculated metrics can be conveniently extracted to a Pandas DataFrame using the :code:`extract_df` function.


.. code-block:: python

   td.extract_df(doc)

You can control which measures to extract with the *metrics* argument.

.. code-block:: python

   td.extract_df(doc, metrics = ["descriptive_stats", "readability", "dependency_distance"])

.. note::
   By default, :code:`extract_df` adds a column containing the text. You can change this behaviour by setting :code:`include_text = False`.

:code:`extract_df` also works on objects created by :code:`nlp.pipe`. The output will be formatted with 1 row for each document and a column for each metric.

.. code-block:: python

   docs = nlp.pipe(['The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.',
                    'He felt that his whole life was some kind of dream and he sometimes wondered whose it was and whether they were enjoying it.'])
   td.extract_df(docs, metrics = "dependency_distance")

Using specific components
=========================

TextDescriptives includes 3 components that can be used individually: *descriptive_stats*, *readability*, and *dependency_distance*.
This can be helpful if you're only interested in e.g. readability metrics or descriptive statistics and don't want to run a dependency parser.
If you have imported the TextDescriptives package, you can add the components to a pipe using the standard spaCy syntax.

.. code-block:: python

   nlp = spacy.blank("da")
   nlp.add_pipe("descriptive_stats")
   docs = nlp.pipe(['Da jeg var atten, tog jeg patent på ild. Det skulle senere vise sig at blive en meget indbringende forretning',
                    "Spis skovsneglen, Mulle. Du vil jo gerne være med i hulen, ikk'?"])
   # extract_df is clever enough to only extract metrics that are in the Doc
   td.extract_df(docs, include_text = False)

If you don't want to import the entire TextDescriptives library (although it is very lightweight), you can import only the components you need.

.. code-block:: python

   from textdescriptives import (DescriptiveStatistics,
                                 Readability,
                                 DependencyDistance,
                                 TextDescriptives)

Available attributes
====================
The table below shows the metrics included in TextDescriptives and the attributes they set on spaCy's :code:`Doc`, :code:`Span`, and :code:`Token` objects.
For more details on each metric, see the following sections of the documentation.

.. csv-table::
:header: "Attribute", "Component", "Description"
:widths: 30, 30, 40

":code:`Doc._.token_length`", "`descriptive_stats`","Dict containing mean, median, and std of token length."
":code:`Doc._.sentence_length`","`descriptive_stats`","Dict containing mean, median, and std of sentence length."
":code:`Doc._.syllables`","`descriptive_stats`","Dict containing mean, median, and std of number of syllables per token."
":code:`Doc._.counts`","`descriptive_stats`","Dict containing the number of tokens, number of unique tokens, proportion unique tokens, and number of characters in the Doc."
":code:`Doc._.readability`","`readability`","Dict containing Flesch Reading Ease, Flesch-Kincaid Grade, SMOG, Gunning-Fog, Automated Readability Index, Coleman-Liau Index, LIX, and RIX readability metrics for the Doc."
":code:`Doc._.dependency_distance`","`dependency_distance`","Dict containing the mean and standard deviation of the dependency distance and proportion adjacent dependency relations in the Doc."
":code:`Span._.token_length`","`descriptive_stats`","Dict containing mean, median, and std of token length in the span."
":code:`Span._.counts`","`descriptive_stats`","Dict containing the number of tokens, number of unique tokens, proportion unique tokens, and number of characters in the span."
":code:`Span._.dependency_distance`","`dependency_distance`","Dict containing the mean dependency distance and proportion adjacent dependency relations in the Doc."
":code:`Token._.dependency_distance`","`dependency_distance`","Dict containing the dependency distance and whether the head word is adjacent for a Token."
7 changes: 4 additions & 3 deletions textdescriptives/components/dependency_distance.py
@@ -7,6 +7,7 @@

@Language.factory("dependency_distance")
def create_dependency_distance_component(nlp: Language, name: str):
"""Create spaCy language factory that allows DependencyDistance attributes to be added to a pipe using nlp.add_pipe("dependency_distance")"""
return DependencyDistance(nlp)


@@ -28,7 +29,7 @@ def __call__(self, doc: Doc):
"""Run the pipeline component"""
return doc

def token_dependency(self, token: Token) -> dict:
"""Token-level dependency distance"""
dep_dist = 0
ajd_dep = False
@@ -38,7 +39,7 @@ def token_dependency(self, token: Token):
ajd_dep = True
return {"dependency_distance": dep_dist, "adjacent_dependency": ajd_dep}

def span_dependency(self, span: Span) -> dict:
"""Span-level aggregated dependency distance"""
dep_dists, adj_deps = zip(
*[token._.dependency_distance.values() for token in span]
@@ -48,7 +49,7 @@ def span_dependency(self, span: Span):
"prop_adjacent_dependency_relation": np.mean(adj_deps),
}

def doc_dependency(self, doc: Doc) -> dict:
"""Doc-level dependency distance aggregated on sentence level"""
if len(doc) == 0:
return {
4 changes: 3 additions & 1 deletion textdescriptives/components/descriptive_stats.py
@@ -1,4 +1,4 @@
"""Calculation of descriptive statistics"""
"""Calculation of descriptive statistics."""
from spacy.tokens import Doc, Span
from spacy.language import Language
from typing import Union
@@ -9,6 +9,8 @@

@Language.factory("descriptive_stats")
def create_descriptive_stats_component(nlp: Language, name: str):
"""Allows DescriptiveStatistics to be added to a spaCy pipe using nlp.add_pipe("descriptive_stats").
If the pipe does not contain a parser or sentencizer, the sentencizer component is silently added."""
sentencizers = set(["sentencizer", "parser"])
if not sentencizers.intersection(set(nlp.pipe_names)):
nlp.add_pipe("sentencizer") # add a sentencizer if not one in pipe
15 changes: 11 additions & 4 deletions textdescriptives/components/readability.py
@@ -1,4 +1,4 @@
"""Calculation of various readability metrics"""
"""Calculation of various readability metrics."""
from textdescriptives.components.utils import n_sentences
from spacy.tokens import Doc
from spacy.language import Language
@@ -10,6 +10,9 @@

@Language.factory("readability")
def create_readability_component(nlp: Language, name: str):
"""Allows Readability to added to a spaCy pipe using nlp.add_pipe("readability").
Readability requires attributes from DescriptiveStatistics and adds it to the
pipe if it not already loaded."""
if "descriptive_stats" not in nlp.pipe_names:
print(
"'descriptive_stats' component is required for 'readability'. Adding to pipe."
@@ -33,7 +36,7 @@ def __call__(self, doc: Doc):
return doc

def readability(self, doc: Doc) -> dict[str, float]:
"""Create output"""
"""Apply readability functions and return a dict of the results."""
hard_words = len([syllable for syllable in doc._._n_syllables if syllable >= 3])
long_words = len([t for t in doc._._filtered_tokens if len(t) > 6])

@@ -51,7 +54,9 @@ def readability(self, doc: Doc) -> dict[str, float]:
def _flesch_reading_ease(self, doc: Doc):
"""
206.835 - (1.015 * avg sent len) - (84.6 * avg_syl_per_word)
Higher = easier to read
Works best for English
"""
avg_sentence_length = doc._.sentence_length["sentence_length_mean"]
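The Flesch reading-ease formula in the docstring above can be sketched on its own (an illustration with the same constants, not the component's exact code, which reads its inputs from the Doc extensions):

```python
def flesch_reading_ease(avg_sentence_length, avg_syllables_per_word):
    # 206.835 - (1.015 * avg sentence length) - (84.6 * avg syllables per word)
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word

score = flesch_reading_ease(5.0, 1.2)  # short, simple sentences score high
```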
@@ -75,6 +80,7 @@ def _flesch_kincaid_grade(self, doc: Doc):
def _smog(self, doc: Doc, hard_words: int):
"""
grade level = 1.043 * sqrt(30 * (hard words / n sentences)) + 3.1291
Preferably need 30+ sentences. Will not work with less than 4
"""
n_sentences = doc._._n_sentences
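The SMOG formula above, as a standalone sketch (same constants; the hard-word and sentence counts would come from the Doc in the actual component):

```python
import math

def smog(n_hard_words, n_sentences):
    # grade = 1.043 * sqrt(30 * hard_words / sentences) + 3.1291
    return 1.043 * math.sqrt(30 * n_hard_words / n_sentences) + 3.1291

grade = smog(n_hard_words=10, n_sentences=30)
```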
@@ -87,6 +93,7 @@
def _gunning_fog(self, doc, hard_words: int):
"""
Grade level = 0.4 * ((avg_sentence_length) + (percentage hard words))
hard words = 3+ syllables
"""
n_tokens = doc._._n_tokens
@@ -111,8 +118,8 @@ def _automated_readability_index(self, doc: Doc):

def _coleman_liau_index(self, doc: Doc):
"""
score = 0.0588 * avg number of chars per 100 words - 0.296 * avg num of sents per 100 words - 15.8
Score = grade required to read the text
"""
n_tokens = doc._._n_tokens
