update readme to spacy version

HLasse · Jul 24, 2021 · b11223b
1 parent 92a7b93
commit b11223b
Showing 12 changed files with 123 additions and 8 deletions.
diff --git a/README.md b/README.md
@@ -1,2 +1,117 @@
-# spacy-textdescriptives
-A spaCy implementation of Textdescriptives
+<!-- 
+[![PyPI version](https://badge.fury.io/py/tomsup.svg)](https://pypi.org/project/tomsup/)
+[![Code style: flake8](https://img.shields.io/badge/Code%20Style-flake8-blue)](https://pypi.org/project/flake8/)
+[![pip downloads](https://img.shields.io/pypi/dm/textdescriptives.svg)](https://crate.io/packages/textdescriptives)
+[![python versions](https://img.shields.io/pypi/pyversions/textdescriptives?colorB=blue)](https://pypi.org/project/textdescriptives/)
+-->
+
+
+
+
+# TextDescriptives
+
+A Python package for calculating a large variety of statistics from text(s).
+
+## Installation
+`python -m pip install git+https://github.com/HLasse/TextDescriptives.git`
+
+## News
+
+* TextDescriptives has been completely re-implemented using `spaCy`. The old `stanza` implementation can be found in the `stanza_version` branch but will no longer be maintained. 
+* Now uses `stanza` for dependency parsing. `stanfordnlp` is no longer a dependency.
+
+## Usage
+
+TextDescriptives adds components to your spaCy pipeline that calculate descriptive statistics, readability metrics, and metrics related to dependency distance. The metrics are implemented as getters, so each one is computed only when you access it.
+
+```py
+import spacy
+import textdescriptives as td
+
+nlp = spacy.load("en_core_web_sm")
+nlp = td.add_components(nlp) # adds the components to the pipeline
+doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")
+
+# access some of the values
+doc._.readability
+doc._.token_length
+```
+
+TODO: add output
+
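Annotation on the getter behaviour described in the Usage paragraph: the sketch below is a plain-Python analogy, not spaCy's extension API and not this package's code. `LazyDoc`, `token_length_mean`, and the `calls` counter are hypothetical names used only to show that a getter-style attribute does no work until it is read.

```python
class LazyDoc:
    """Toy stand-in for a document object (hypothetical, not spaCy's Doc)."""

    def __init__(self, text):
        self.text = text
        self.calls = 0  # how many times the metric has been computed

    @property
    def token_length_mean(self):
        # Runs only when the attribute is read, like a getter-backed metric.
        self.calls += 1
        words = self.text.split()
        return sum(len(w) for w in words) / len(words)


doc = LazyDoc("I feel it in the water")
print(doc.calls)              # nothing has been computed yet
print(doc.token_length_mean)  # computed now, on access
print(doc.calls)
```

In this sketch the value is recomputed on every access; caching it would be a separate design choice.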
+TextDescriptives includes a convenience function for converting the metrics to a pandas DataFrame:
+
+```py
+td.extract_df(doc)
+```
+
+TODO: add output
+
+Use the `metrics` parameter to choose which group(s) of metrics to extract: one or more of `readability`, `dependency_distance`, and `descriptive_stats`. The default is `all`.
+```py
+td.extract_df(doc, metrics="readability")
+```
+
+When `extract_df` is called on the output of `nlp.pipe`, it returns one row per document and one column per metric.
+```py
+docs = nlp.pipe(['The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.',
+            'He felt that his whole life was some kind of dream and he sometimes wondered whose it was and whether they were enjoying it.'])
+
+td.extract_df(docs, metrics="dependency_distance")
+```
+
+TODO: add output
+
+If you don't want to include the `text` column, set `include_text` to `False`.
+
+
+TextDescriptives works for any language that has a spaCy model, for example Danish:
+```py
+nlp = spacy.load("da_core_news_sm")
+docs = nlp.pipe(['Da jeg var atten, tog jeg patent på ild. Det skulle senere vise sig at blive en meget indbringende forretning',
+            "Spis skovsneglen, Mulle. Du vil jo gerne være med i hulen, ikk'?"])
+
+td.extract_df(docs, include_text=False)
+```
+
+### Readability
+
+The readability measures are largely derived from the [textstat](https://github.com/shivam5992/textstat) library and are thoroughly defined there.
+
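Annotation on the readability metrics: as a rough illustration of two of the listed measures, the sketch below computes Lix (words per sentence plus the percentage of words longer than six letters) and Rix (long words per sentence). The function name `lix_rix` and the regex-based sentence and word splitting are assumptions made for this example; the package itself works on spaCy tokens and may count differently.

```python
import re


def lix_rix(text):
    """Return (lix, rix) for a text using naive regex splitting."""
    # Assumption: sentences end at '.', '!' or '?'; words are runs of letters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    long_words = [w for w in words if len(w) > 6]
    lix = len(words) / len(sentences) + 100 * len(long_words) / len(words)
    rix = len(long_words) / len(sentences)
    return lix, rix


lix, rix = lix_rix("The world is changed. I feel it in the water.")
print(lix, rix)
```

The example text has 10 words, 2 sentences, and one long word ("changed"), so Lix is 5 + 10 and Rix is 1/2.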
+### Dependency Distance
+Mean dependency distance can be used to measure the average syntactic complexity of a text.
+
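Annotation on dependency distance: a minimal sketch of the idea for one sentence, assuming the parse is given as a list of head indices in which the root points to itself. The function name `dependency_distance` and the hand-written parse are illustrative assumptions, and excluding the root from the averages is a choice made here that may differ from what the package does.

```python
def dependency_distance(heads):
    """heads[i] is the index of token i's head; the root has heads[i] == i.

    Returns (mean dependency distance, proportion of adjacent dependencies).
    """
    # Distance is the number of positions between a token and its head.
    dists = [abs(i - h) for i, h in enumerate(heads) if i != h]
    if not dists:
        return 0.0, 0.0  # single-token sentence: no dependencies to measure
    mean_dist = sum(dists) / len(dists)
    prop_adjacent = sum(d == 1 for d in dists) / len(dists)
    return mean_dist, prop_adjacent


# Hand-written (assumed) parse of "I feel it in the water":
# I->feel, feel=root, it->feel, in->feel, the->water, water->in
heads = [1, 1, 1, 1, 5, 3]
print(dependency_distance(heads))
```

For a whole document, the sentence-level values would then be summarised by their mean and standard deviation, matching the metrics listed in this README.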
+## Metrics
+Metrics currently implemented:
+
+1. Descriptive statistics - mean, median, standard deviation of the following:
+  * Word length
+  * Sentence length, words
+  * Syllables per word
+  * Number of characters
+  * Number of sentences
+  * Number of types (unique words)
+  * Number of tokens (total words)
+  * Type/token ratio
+
+2. Readability metrics:
+  * Gunning-Fog
+  * SMOG
+  * Flesch reading ease
+  * Flesch-Kincaid grade
+  * Automated readability index
+  * Coleman-Liau index
+  * Lix
+  * Rix
+
+3. Dependency distance metrics:
+  * Mean dependency distance, sentence level (mean, standard deviation)
+  * Mean proportion adjacent dependency relations, sentence level (mean, standard deviation)
+
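Annotation on the descriptive statistics listed above: a small standard-library sketch of how a few of them could be computed. The function name `describe`, the dictionary keys, lowercasing before counting types, and the use of the population standard deviation are all assumptions for illustration; the package's own tokenization and output names may differ.

```python
import re
from statistics import mean, median, pstdev


def describe(text):
    """Return a few descriptive statistics for a text (illustrative only)."""
    # Assumption: tokens are lowercased runs of letters, not spaCy tokens.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")
    lengths = [len(t) for t in tokens]
    types = set(tokens)  # unique words
    return {
        "token_length_mean": mean(lengths),
        "token_length_median": median(lengths),
        "token_length_std": pstdev(lengths),  # population standard deviation
        "n_tokens": len(tokens),
        "n_types": len(types),
        "type_token_ratio": len(types) / len(tokens),
    }


stats = describe("I feel it in the water. I feel it in the earth.")
print(stats["n_tokens"], stats["n_types"], stats["type_token_ratio"])
```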
+## Authors
+
+Developed by Lasse Hansen at the [Center for Humanities Computing Aarhus](https://chcaa.io)
+
+Collaborators:
+
+* Ludvig Renbo Olsen ([@ludvigolsen](https://github.com/ludvigolsen), [ludvigolsen.dk](http://ludvigolsen.dk))
diff --git a/setup.py b/setup.py
@@ -1,6 +1,6 @@
 import setuptools
 
-with open("spacy-textdescriptives/about.py") as f:
+with open("textdescriptives/about.py") as f:
     v = f.read()
     for l in v.split("\n"):
         if l.startswith("__version__"):
@@ -13,15 +13,15 @@
     requirements = f.read().split("\n")
 
 setuptools.setup(
-    name="spacy-textdescriptives",
+    name="textdescriptives",
     version=__version__,
     description="A library for calculating a variety of features from text using spaCy",
     license="Apache License 2.0",
     long_description=long_description,
     long_description_content_type="text/markdown",
     author="Lasse Hansen",
     author_email="[email protected]",
-    url="https://github.com/HLasse/spacy-textdescriptives",
+    url="https://github.com/HLasse/textdescriptives",
     packages=["spacy-textdescriptives"],
     install_requires=requirements,
     # See https://pypi.python.org/pypi?%3Aaction=list_classifiers

diff --git a/spacy-textdescriptives/about.py b/spacy-textdescriptives/about.py
diff --git a/spacy-textdescriptives/__init__.py → textdescriptives/__init__.py b/spacy-textdescriptives/__init__.py → textdescriptives/__init__.py
diff --git a/textdescriptives/about.py b/textdescriptives/about.py
@@ -0,0 +1,3 @@
+__title__ = "textdescriptives"
+__version__ = "0.1.0"  # the ONLY source of version ID
+__download_url__ = "https://github.com/HLasse/textdescriptives"
diff --git a/...iptives/components/dependency_distance.py → ...iptives/components/dependency_distance.py b/...iptives/components/dependency_distance.py → ...iptives/components/dependency_distance.py
diff --git a/...criptives/components/descriptive_stats.py → ...criptives/components/descriptive_stats.py b/...criptives/components/descriptive_stats.py → ...criptives/components/descriptive_stats.py
diff --git a/...extdescriptives/components/readability.py → textdescriptives/components/readability.py b/...extdescriptives/components/readability.py → textdescriptives/components/readability.py
diff --git a/spacy-textdescriptives/components/utils.py → textdescriptives/components/utils.py b/spacy-textdescriptives/components/utils.py → textdescriptives/components/utils.py
diff --git a/spacy-textdescriptives/extractor.py → textdescriptives/extractor.py b/spacy-textdescriptives/extractor.py → textdescriptives/extractor.py
diff --git a/spacy-textdescriptives/load_components.py → textdescriptives/load_components.py b/spacy-textdescriptives/load_components.py → textdescriptives/load_components.py
diff --git a/spacy-textdescriptives/subsetters.py → textdescriptives/subsetters.py b/spacy-textdescriptives/subsetters.py → textdescriptives/subsetters.py