Commit
update docs
HLasse committed Jul 28, 2021
1 parent de8a92f commit 98a8aa8
Showing 9 changed files with 179 additions and 17 deletions.
22 changes: 22 additions & 0 deletions docs/dependencydistance.rst
@@ -0,0 +1,22 @@
Dependency Distance
--------------------

The *dependency_distance* component adds measures of dependency distance to :code:`Doc`, :code:`Span`, and :code:`Token` objects under the :code:`._.dependency_distance` attribute.
Dependency distance can be used as a measure of syntactic complexity: the greater the distance, the more complex the sentence.

For :code:`Doc` objects, the mean and standard deviation of dependency distance at the sentence level are returned, along with the mean and standard deviation of the proportion of adjacent dependency relations per sentence.

For :code:`Span` objects, the mean dependency distance and the mean proportion of adjacent dependency relations in the span are returned.

For :code:`Token` objects, the dependency distance and whether the dependency relation is to an adjacent token are returned.
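To make the token-level measure concrete: a token's dependency distance is the absolute difference between its index and its head's index, and the relation is adjacent when that distance is 1. A minimal sketch of the aggregation in plain Python, assuming sentences are given as hypothetical `(index, head_index)` pairs rather than spaCy tokens:

```python
from statistics import mean

def token_dependency_distance(i, head):
    # Absolute distance between a token and its head (0 for the root).
    return abs(i - head)

def sentence_dependency_stats(tokens):
    # Mean dependency distance and proportion of adjacent relations.
    dists = [token_dependency_distance(i, h) for i, h in tokens]
    adjacent = [d == 1 for d in dists]
    return mean(dists), mean(adjacent)

# "The dog barks": "The" -> "dog", "dog" -> "barks", "barks" is the root.
mean_dist, prop_adjacent = sentence_dependency_stats([(0, 1), (1, 2), (2, 2)])
```

Averaging these per-sentence values (and taking their standard deviation) yields the Doc-level statistics described above.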

textdescriptives.components.dependency_distance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: textdescriptives.components.dependency_distance
:members:
:undoc-members:
:show-inheritance:

.. :exclude-members: function
.. for functions you wish to exclude
38 changes: 38 additions & 0 deletions docs/descriptivestats.rst
@@ -0,0 +1,38 @@
Descriptive Statistics
----------------------

The *descriptive_stats* component extracts a number of descriptive statistics.
The following attributes are added:

* ._.counts (:code:`Doc` & :code:`Span`)

* Number of tokens.
* Number of unique tokens.
* Proportion unique tokens.
* Number of characters.
* ._.sentence_length (:code:`Doc`)

* Mean sentence length.
* Median sentence length.
* Std of sentence length.
* ._.syllables (:code:`Doc`)

* Mean number of syllables per token.
* Median number of syllables per token.
* Std of number of syllables per token.
* ._.token_length (:code:`Doc` & :code:`Span`)

* Mean token length.
* Median token length.
* Std of token length.
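As an illustration of what these aggregates contain, the token-length statistics amount to the following (a sketch using the standard library; the dictionary keys here are illustrative, not necessarily the exact keys the package uses):

```python
from statistics import mean, median, pstdev

tokens = ["The", "world", "is", "changed"]
lengths = [len(t) for t in tokens]  # [3, 5, 2, 7]

token_length = {
    "token_length_mean": mean(lengths),      # 4.25
    "token_length_median": median(lengths),  # 4.0
    "token_length_std": pstdev(lengths),     # population standard deviation
}
```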

textdescriptives.components.descriptive_stats
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: textdescriptives.components.descriptive_stats
:members:
:undoc-members:
:show-inheritance:

.. :exclude-members: function
.. for functions you wish to exclude
9 changes: 6 additions & 3 deletions docs/index.rst
@@ -4,12 +4,14 @@ TextDescriptives
.. image:: https://img.shields.io/github/stars/hlasse/textdescriptives.svg?style=social&label=Star&maxAge=2592000
:target: https://github.com/hlasse/textdescriptives

TextDescriptives is a Python library for calculating a large variety of statistics from texts using spaCy v3 pipeline components and extensions.
TextDescriptives can be used to calculate several descriptive statistics, readability metrics, and metrics related to dependency distance.
The components are implemented using getters, which means they will only be calculated when accessed.
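The getter mechanism can be illustrated with plain Python properties: the value is computed when the attribute is read, not when the pipeline runs (a schematic analogy, not the library's actual code):

```python
class LazyMetrics:
    """Illustration: the metric is computed when accessed, not at construction."""

    def __init__(self, text):
        self.text = text

    @property
    def n_tokens(self):
        # Runs on attribute access, like a spaCy getter extension.
        return len(self.text.split())

doc = LazyMetrics("The world is changed")
n = doc.n_tokens  # computed here, on first access
```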

📰 News
---------------------------------

* TextDescriptives has been completely re-implemented using spaCy v.3.0. The stanza implementation can be found in the `stanza_version branch <https://github.com/HLasse/TextDescriptives/tree/stanza_version>`_ and will no longer be maintained.


Contents
@@ -33,7 +35,9 @@ The documentation is organized in three parts:
:maxdepth: 3
:caption: Package References

descriptivestats
readability
dependencydistance

.. add more references here
@@ -44,7 +48,6 @@ The documentation is organized in three parts:




Indices and search
==================

2 changes: 1 addition & 1 deletion docs/installation.rst
@@ -1,6 +1,6 @@
Installation
==================
To get started using TextDescriptives you can install it using pip by running the following line in your terminal:

.. code-block::
18 changes: 16 additions & 2 deletions docs/readability.rst
@@ -1,13 +1,27 @@
Readability
--------------------

The *readability* component adds the following readability metrics to :code:`Doc` objects under the :code:`._.readability` attribute.

* Gunning-Fog
* SMOG
* Flesch reading ease
* Flesch-Kincaid grade
* Automated readability index
* Coleman-Liau index
* Lix
* Rix

For specifics of the implementation, refer to the source. The equations are largely derived from the `textstat <https://github.com/shivam5992/textstat>`_ library.
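Two of the simpler formulas, LIX and RIX, illustrate the kind of calculation involved: LIX is the average sentence length plus the percentage of long words (more than six characters), and RIX is the number of long words per sentence. A sketch of the standard formulas (not the package's exact code):

```python
def lix(n_words, n_sentences, n_long_words):
    # LIX = words per sentence + percentage of long words (> 6 characters).
    return n_words / n_sentences + 100 * n_long_words / n_words

def rix(n_long_words, n_sentences):
    # RIX = long words per sentence.
    return n_long_words / n_sentences

score = lix(n_words=20, n_sentences=2, n_long_words=4)  # 10 + 20 = 30.0
```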

textdescriptives.components.readability
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: textdescriptives.components.readability
:members:
:undoc-members:
:show-inheritance:
:private-members:

.. :exclude-members: function
.. for functions you wish to exclude
81 changes: 78 additions & 3 deletions docs/usingthepackage.rst
@@ -1,13 +1,88 @@
Using TextDescriptives
=======================

Import the library and add the component to your pipeline using the string name of the "textdescriptives" component factory:

.. code-block:: python

   import spacy
   import textdescriptives as td
   nlp = spacy.load("en_core_web_sm")
   nlp.add_pipe("textdescriptives")
   doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")
   # access some of the values
   doc._.readability
   doc._.token_length

The calculated metrics can be conveniently extracted to a Pandas DataFrame using the :code:`extract_df` function.


.. code-block:: python

   td.extract_df(doc)

You can control which measures to extract with the *metrics* argument.

.. code-block:: python

   td.extract_df(doc, metrics = ["descriptive_stats", "readability", "dependency_distance"])

.. note::
   By default, :code:`extract_df` adds a column containing the text. You can change this behaviour by setting :code:`include_text = False`.

:code:`extract_df` also works on objects created by :code:`nlp.pipe`. The output will be formatted with 1 row for each document and a column for each metric.

.. code-block:: python

   docs = nlp.pipe(['The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.',
                    'He felt that his whole life was some kind of dream and he sometimes wondered whose it was and whether they were enjoying it.'])
   td.extract_df(docs, metrics = "dependency_distance")

Using specific components
=========================

TextDescriptives includes 3 components that can be used individually: *descriptive_stats*, *readability*, and *dependency_distance*.
This can be helpful if you're only interested in e.g. readability metrics or descriptive statistics and don't want to run a dependency parser.
If you have imported the TextDescriptives package, you can add the components to a pipe using the standard spaCy syntax.

.. code-block:: python

   nlp = spacy.blank("da")
   nlp.add_pipe("descriptive_stats")
   docs = nlp.pipe(['Da jeg var atten, tog jeg patent på ild. Det skulle senere vise sig at blive en meget indbringende forretning',
                    "Spis skovsneglen, Mulle. Du vil jo gerne være med i hulen, ikk'?"])
   # extract_df is clever enough to only extract metrics that are in the Doc
   td.extract_df(docs, include_text = False)

If you don't want to import the entire TextDescriptives library (although it is very lightweight), you can import only the components you need.

.. code-block:: python

   from textdescriptives import (DescriptiveStatistics,
                                 Readability,
                                 DependencyDistance,
                                 TextDescriptives)

Available attributes
====================
The table below shows the metrics included in TextDescriptives and the attributes they set on spaCy's :code:`Doc`, :code:`Span`, and :code:`Token` objects.
For more details on each metric, see the following sections of the documentation.

.. csv-table::
:header: "Attribute", "Component", "Description"
:widths: 30, 30, 40

":code:`Doc._.token_length`", "`descriptive_stats`","Dict containing mean, median, and std of token length."
":code:`Doc._.sentence_length`","`descriptive_stats`","Dict containing mean, median, and std of sentence length."
":code:`Doc._.syllables`","`descriptive_stats`","Dict containing mean, median, and std of number of syllables per token."
":code:`Doc._.counts`","`descriptive_stats`","Dict containing the number of tokens, number of unique tokens, proportion unique tokens, and number of characters in the Doc."
":code:`Doc._.readability`","`readability`","Dict containing Flesch Reading Ease, Flesch-Kincaid Grade, SMOG, Gunning-Fog, Automated Readability Index, Coleman-Liau Index, LIX, and RIX readability metrics for the Doc."
":code:`Doc._.dependency_distance`","`dependency_distance`","Dict containing the mean and standard deviation of the dependency distance and proportion adjacent dependency relations in the Doc."
":code:`Span._.token_length`","`descriptive_stats`","Dict containing mean, median, and std of token length in the span."
":code:`Span._.counts`","`descriptive_stats`","Dict containing the number of tokens, number of unique tokens, proportion unique tokens, and number of characters in the span."
":code:`Span._.dependency_distance`","`dependency_distance`","Dict containing the mean dependency distance and proportion adjacent dependency relations in the Doc."
":code:`Token._.dependency_distance`","`dependency_distance`","Dict containing the dependency distance and whether the head word is adjacent for a Token."
7 changes: 4 additions & 3 deletions textdescriptives/components/dependency_distance.py
@@ -7,6 +7,7 @@

@Language.factory("dependency_distance")
def create_dependency_distance_component(nlp: Language, name: str):
"""Create spaCy language factory that allows DependencyDistance attributes to be added to a pipe using nlp.add_pipe("dependency_distance")"""
return DependencyDistance(nlp)


@@ -28,7 +29,7 @@ def __call__(self, doc: Doc):
"""Run the pipeline component"""
return doc

def token_dependency(self, token: Token) -> dict:
"""Token-level dependency distance"""
dep_dist = 0
ajd_dep = False
@@ -38,7 +39,7 @@ def token_dependency(self, token: Token):
ajd_dep = True
return {"dependency_distance": dep_dist, "adjacent_dependency": ajd_dep}

def span_dependency(self, span: Span) -> dict:
"""Span-level aggregated dependency distance"""
dep_dists, adj_deps = zip(
*[token._.dependency_distance.values() for token in span]
@@ -48,7 +49,7 @@ def span_dependency(self, span: Span):
"prop_adjacent_dependency_relation": np.mean(adj_deps),
}

def doc_dependency(self, doc: Doc) -> dict:
"""Doc-level dependency distance aggregated on sentence level"""
if len(doc) == 0:
return {
4 changes: 3 additions & 1 deletion textdescriptives/components/descriptive_stats.py
@@ -1,4 +1,4 @@
"""Calculation of descriptive statistics"""
"""Calculation of descriptive statistics."""
from spacy.tokens import Doc, Span
from spacy.language import Language
from typing import Union
@@ -9,6 +9,8 @@

@Language.factory("descriptive_stats")
def create_descriptive_stats_component(nlp: Language, name: str):
"""Allows DescriptiveStatistics to be added to a spaCy pipe using nlp.add_pipe("descriptive_stats").
If the pipe does not contain a parser or sentencizer, the sentencizer component is silently added."""
sentencizers = set(["sentencizer", "parser"])
if not sentencizers.intersection(set(nlp.pipe_names)):
nlp.add_pipe("sentencizer") # add a sentencizer if not one in pipe
15 changes: 11 additions & 4 deletions textdescriptives/components/readability.py
@@ -1,4 +1,4 @@
"""Calculation of various readability metrics"""
"""Calculation of various readability metrics."""
from textdescriptives.components.utils import n_sentences
from spacy.tokens import Doc
from spacy.language import Language
@@ -10,6 +10,9 @@

@Language.factory("readability")
def create_readability_component(nlp: Language, name: str):
"""Allows Readability to added to a spaCy pipe using nlp.add_pipe("readability").
Readability requires attributes from DescriptiveStatistics and adds it to the
pipe if it not already loaded."""
if "descriptive_stats" not in nlp.pipe_names:
print(
"'descriptive_stats' component is required for 'readability'. Adding to pipe."
@@ -33,7 +36,7 @@ def __call__(self, doc: Doc):
return doc

def readability(self, doc: Doc) -> dict[str, float]:
"""Create output"""
"""Apply readability functions and return a dict of the results."""
hard_words = len([syllable for syllable in doc._._n_syllables if syllable >= 3])
long_words = len([t for t in doc._._filtered_tokens if len(t) > 6])

@@ -51,7 +54,9 @@ def readability(self, doc: Doc) -> dict[str, float]:
def _flesch_reading_ease(self, doc: Doc):
"""
206.835 - (1.015 * avg sent len) - (84.6 * avg_syl_per_word)
Higher = easier to read
Works best for English
"""
avg_sentence_length = doc._.sentence_length["sentence_length_mean"]
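The Flesch reading-ease formula in the docstring above can be sketched on its own (an illustration with the same constants, not the component's exact code, which reads its inputs from the Doc extensions):

```python
def flesch_reading_ease(avg_sentence_length, avg_syllables_per_word):
    # 206.835 - (1.015 * avg sentence length) - (84.6 * avg syllables per word)
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word

score = flesch_reading_ease(5.0, 1.2)  # short, simple sentences score high
```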
@@ -75,6 +80,7 @@ def _flesch_kincaid_grade(self, doc: Doc):
def _smog(self, doc: Doc, hard_words: int):
"""
grade level = 1.043 * sqrt(30 * (hard words / n sentences)) + 3.1291
Preferably need 30+ sentences. Will not work with less than 4
"""
n_sentences = doc._._n_sentences
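The SMOG formula above, as a standalone sketch (same constants; the hard-word and sentence counts would come from the Doc in the actual component):

```python
import math

def smog(n_hard_words, n_sentences):
    # grade = 1.043 * sqrt(30 * hard_words / sentences) + 3.1291
    return 1.043 * math.sqrt(30 * n_hard_words / n_sentences) + 3.1291

grade = smog(n_hard_words=10, n_sentences=30)
```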
@@ -87,6 +93,7 @@
def _gunning_fog(self, doc, hard_words: int):
"""
Grade level = 0.4 * ((avg_sentence_length) + (percentage hard words))
hard words = 3+ syllables
"""
n_tokens = doc._._n_tokens
@@ -111,8 +118,8 @@ def _automated_readability_index(self, doc: Doc):

def _coleman_liau_index(self, doc: Doc):
"""
score = 0.0588 * avg number of chars per 100 words - 0.296 * avg num of sents per 100 words - 15.8
Score = grade required to read the text
"""
n_tokens = doc._._n_tokens
