update readme to spacy version
HLasse committed Jul 24, 2021
1 parent 92a7b93 commit b11223b
Showing 12 changed files with 123 additions and 8 deletions.
119 changes: 117 additions & 2 deletions README.md
@@ -1,2 +1,117 @@
# spacy-textdescriptives
A spaCy implementation of Textdescriptives
<!--
[![PyPI version](https://badge.fury.io/py/tomsup.svg)](https://pypi.org/project/tomsup/)
[![Code style: flake8](https://img.shields.io/badge/Code%20Style-flake8-blue)](https://pypi.org/project/flake8/)
[![pip downloads](https://img.shields.io/pypi/dm/textdescriptives.svg)](https://crate.io/packages/textdescriptives)
[![python versions](https://img.shields.io/pypi/pyversions/textdescriptives?colorB=blue)](https://pypi.org/project/textdescriptives/)
-->




# TextDescriptives

A Python package for calculating a large variety of statistics from text(s).

## Installation
`python -m pip install git+https://github.com/HLasse/TextDescriptives.git`

## News

* TextDescriptives has been completely re-implemented using `spaCy`. The old `stanza` implementation can be found in the `stanza_version` branch but will no longer be maintained.
* Now uses `stanza` for dependency parsing. `stanfordnlp` is no longer a dependency.

## Usage

TextDescriptives adds components to your spaCy pipelines to calculate descriptive statistics, readability metrics, and metrics related to dependency distance. The components are implemented using getters, which means they will only be calculated if you try to access them.

```py
import spacy
import textdescriptives as td

nlp = spacy.load("en_core_web_sm")
nlp = td.add_components(nlp) # adds the components to the pipeline
doc = nlp("The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.")

# access some of the values
doc._.readability
doc._.token_length
```
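
The lazy, getter-based behaviour described above can be illustrated with a minimal plain-Python sketch. It uses a `property` on a toy class rather than spaCy's extension mechanism, and `LazyDoc`, `token_length_mean`, and `calls` are hypothetical names for illustration only, not part of TextDescriptives.

```py
class LazyDoc:
    """Toy stand-in for a spaCy Doc (hypothetical, for illustration only)."""

    def __init__(self, text):
        self.text = text
        self.calls = 0  # counts how often the metric has been computed

    @property
    def token_length_mean(self):
        # Runs only when the attribute is accessed, like a getter-based extension.
        self.calls += 1
        words = self.text.split()
        return sum(len(w) for w in words) / len(words)


doc = LazyDoc("I feel it in the water")
print(doc.calls)              # nothing has been computed yet
print(doc.token_length_mean)  # the metric is computed on access
print(doc.calls)              # one computation so far
```

In this sketch the value is recomputed on every access, so store the result in a variable if you need it more than once.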

TODO: add output

TextDescriptives includes a convenience function for converting the metrics to a pandas DataFrame.

```py
td.extract_df(doc)
```

TODO: add output

Use the `metrics` parameter to choose which group(s) of metrics to extract: one or more of `readability`, `dependency_distance`, and `descriptive_stats`. The default is `all`.
```py
td.extract_df(doc, metrics="readability")
```

If `extract_df` is called on an object created using `nlp.pipe`, the output has one row for each document and one column for each metric.
```py
docs = nlp.pipe(['The world is changed. I feel it in the water. I feel it in the earth. I smell it in the air. Much that once was is lost, for none now live who remember it.',
'He felt that his whole life was some kind of dream and he sometimes wondered whose it was and whether they were enjoying it.'])

td.extract_df(docs, metrics="dependency_distance")
```

TODO: add output

If you don't want to include the `text` column, set `include_text` to `False`.


TextDescriptives works for any language that has a spaCy model.
```py
nlp = spacy.load("da_core_news_sm")
docs = nlp.pipe(['Da jeg var atten, tog jeg patent på ild. Det skulle senere vise sig at blive en meget indbringende forretning',
"Spis skovsneglen, Mulle. Du vil jo gerne være med i hulen, ikk'?"])

td.extract_df(docs, include_text=False)
```

### Readability

The readability measures are largely derived from the [textstat](https://github.com/shivam5992/textstat) library and are thoroughly defined there.
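
As a rough illustration of what such measures compute, the sketch below implements the standard textbook formulas for Flesch reading ease, Lix, and Rix. It is not the package's implementation: the function names are hypothetical, the regex-based sentence and word splitting is a simplifying assumption, and the values TextDescriptives reports may differ because spaCy tokenizes differently.

```py
import re


def flesch_reading_ease(n_words, n_sentences, n_syllables):
    # Standard formula; the counts are supplied by the caller.
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def lix_rix(text):
    # Naive splitting (an assumption): sentences end at ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    # Long words are those with more than six characters.
    long_words = [w for w in words if len(w) > 6]
    lix = len(words) / len(sentences) + 100 * len(long_words) / len(words)
    rix = len(long_words) / len(sentences)
    return lix, rix


print(flesch_reading_ease(n_words=100, n_sentences=5, n_syllables=150))
print(lix_rix("The cat sat. Elephants remember everything."))
```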

### Dependency Distance
Mean dependency distance can be used as a way of measuring the average syntactic complexity of a text.
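
A minimal sketch of the idea, assuming a sentence is given as a list of head indices in which the root points to itself (as in spaCy). The function name is hypothetical, the example parse is written by hand, and excluding the root from the average is an assumption; the package may treat the root differently.

```py
def dependency_distance(heads):
    """heads[i] is the index of token i's head; the root points to itself."""
    # Distance between each non-root token and its head, in token positions.
    distances = [abs(h - i) for i, h in enumerate(heads) if h != i]
    if not distances:  # e.g. a one-token sentence
        return 0.0, 0.0
    mean_distance = sum(distances) / len(distances)
    # Share of dependency relations that link neighbouring tokens.
    prop_adjacent = sum(d == 1 for d in distances) / len(distances)
    return mean_distance, prop_adjacent


# Hand-written parse of "I feel it in the water" (illustrative only):
# I -> feel, feel = root, it -> feel, in -> feel, the -> water, water -> in
heads = [1, 1, 1, 1, 5, 3]
print(dependency_distance(heads))
```

Longer average distances mean that related words tend to sit further apart, which is why the measure is used as a proxy for syntactic complexity.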

## Metrics
Metrics currently implemented:

1. Descriptive statistics - mean, median, standard deviation of the following:
* Word length
* Sentence length, words
* Syllables per word
* Number of characters
* Number of sentences
* Number of types (unique words)
* Number of tokens (total words)
* Type/token ratio

2. Readability metrics:
* Gunning-Fog
* SMOG
* Flesch reading ease
* Flesch-Kincaid grade
* Automated readability index
* Coleman-Liau index
* Lix
* Rix

3. Dependency distance metrics:
* Mean dependency distance, sentence level (mean, standard deviation)
* Mean proportion adjacent dependency relations, sentence level (mean, standard deviation)
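
To make the descriptive statistics in the first group concrete, here is a small standard-library sketch. The function and dictionary key names are hypothetical, and lowercasing tokens before counting types and using the population standard deviation are assumptions rather than documented behaviour of the package.

```py
import statistics


def describe(tokens):
    lengths = [len(t) for t in tokens]
    # Types are unique tokens; case-folding is an assumption.
    types = {t.lower() for t in tokens}
    return {
        "token_length_mean": statistics.mean(lengths),
        "token_length_median": statistics.median(lengths),
        "token_length_std": statistics.pstdev(lengths),
        "n_tokens": len(tokens),
        "n_types": len(types),
        "type_token_ratio": len(types) / len(tokens),
    }


print(describe("I feel it in the water I feel it in the earth".split()))
```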

## Authors

Developed by Lasse Hansen at the [Center for Humanities Computing Aarhus](https://chcaa.io)

Collaborators:

* Ludvig Renbo Olsen ([@ludvigolsen](https://github.com/ludvigolsen), [ludvigolsen.dk](http://ludvigolsen.dk))
6 changes: 3 additions & 3 deletions setup.py
@@ -1,6 +1,6 @@
import setuptools

with open("spacy-textdescriptives/about.py") as f:
with open("textdescriptives/about.py") as f:
v = f.read()
for l in v.split("\n"):
if l.startswith("__version__"):
@@ -13,15 +13,15 @@
requirements = f.read().split("\n")

setuptools.setup(
name="spacy-textdescriptives",
name="textdescriptives",
version=__version__,
description="A library for calculating a variety of features from text using spaCy",
license="Apache License 2.0",
long_description=long_description,
long_description_content_type="text/markdown",
author="Lasse Hansen",
author_email="[email protected]",
url="https://github.com/HLasse/spacy-textdescriptives",
url="https://github.com/HLasse/textdescriptives",
packages=["spacy-textdescriptives"],
install_requires=requirements,
# See https://pypi.python.org/pypi?%3Aaction=list_classifiers
3 changes: 0 additions & 3 deletions spacy-textdescriptives/about.py

This file was deleted.

File renamed without changes.
3 changes: 3 additions & 0 deletions textdescriptives/about.py
@@ -0,0 +1,3 @@
__title__ = "textdescriptives"
__version__ = "0.1.0" # the ONLY source of version ID
__download_url__ = "https://github.com/HLasse/textdescriptives"
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
