Commit

Merge pull request #63 from HLasse/feature-add-quality
Feature: Add quality descriptives
HLasse authored Sep 26, 2022
2 parents 4b9179f + 1578599 commit b317cf0
Showing 13 changed files with 1,001 additions and 79 deletions.
7 changes: 5 additions & 2 deletions NEWS.md
@@ -1,7 +1,10 @@
 # News

-## V1.0.7 - May 4, 2022
+## v1.1.0 - 21st of September, 2022
+- Added the new pipe; "quality". This pipe implements a series of metrics related to text quality, some of which were used by Rae et al. (2021) and Raffel et al. (2020) to filter large text corpora. See the documentation for examples.
+
+## v1.0.7 - 4th May, 2022
 - Some minor fixes and bells and whistles.

-## V1.0.5 - October 4, 2021
+## v1.0.5 - 4th October, 2021
 - POS proportions now use `pos_` instead of `tag_` by default. This behavior can be changed by setting `use_tag` to `False` when initialising the `pos_stats` module.
51 changes: 26 additions & 25 deletions README.md

Large diffs are not rendered by default.

39 changes: 39 additions & 0 deletions docs/quality.rst
@@ -0,0 +1,39 @@
Quality
--------------------

The :code:`quality` component adds the following quality metrics to
:code:`Doc` and :code:`Span` objects under the :code:`._.quality` attribute.

Heuristic quality metrics:

* Number of stop words (:code:`n_stop_words`): The number of stop words in the document.
* Alpha ratio (:code:`alpha_ratio`): Ratio of words containing at least one alphabetic character.
* Mean word length (:code:`mean_word_length`): Mean/average word length.
* Proportion of ellipsis (:code:`proportion_ellipsis`): Proportion of lines in a document that end with an ellipsis.
* Proportion of bullet points (:code:`proportion_bullet_points`): Proportion of lines in a document that start with a bullet point.
* Symbol to word ratio (:code:`symbol_{symbol}_2_word_ratio`): Ratio of specified symbols to words, e.g. the ratio of hashtags or curly brackets to words.
* Contains string (:code:`contains_{string}`): Whether the document contains a specified string, for instance "lorem ipsum".
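
As a rough illustration of how a few of these heuristics can be computed, the following pure-Python sketch works on whitespace-split words and newline-split lines. It is not the component's implementation: the function names only mirror the metric names for readability, and the tokenization, the treatment of empty lines, and the set of bullet characters are all assumptions.

```python
from typing import List

# Assumed set of characters that count as a bullet point (hypothetical choice).
BULLETS = ("-", "*", "•")


def alpha_ratio(words: List[str]) -> float:
    """Share of words containing at least one alphabetic character."""
    if not words:
        return 0.0
    return sum(any(ch.isalpha() for ch in w) for w in words) / len(words)


def mean_word_length(words: List[str]) -> float:
    """Average number of characters per word."""
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


def _non_empty_lines(text: str) -> List[str]:
    # Assumption: empty lines are ignored when computing line proportions.
    return [ln.strip() for ln in text.split("\n") if ln.strip()]


def proportion_ellipsis(text: str) -> float:
    """Share of lines ending with an ellipsis ("..." or the single character)."""
    lines = _non_empty_lines(text)
    if not lines:
        return 0.0
    return sum(ln.endswith(("...", "…")) for ln in lines) / len(lines)


def proportion_bullet_points(text: str) -> float:
    """Share of lines starting with one of the assumed bullet characters."""
    lines = _non_empty_lines(text)
    if not lines:
        return 0.0
    return sum(ln.startswith(BULLETS) for ln in lines) / len(lines)
```

For a three-line text in which two lines start with :code:`-` and one of them ends with :code:`...`, the sketch gives a bullet proportion of 2/3 and an ellipsis proportion of 1/3.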

Repetitious text metrics:

* Duplicate lines character fraction (:code:`duplicate_lines_chr_fraction`): Fraction of characters in a document which are contained within duplicate lines.
* Duplicate paragraphs character fraction (:code:`duplicate_paragraphs_chr_fraction`): Fraction of characters in a document which are contained within duplicate paragraphs.
* Duplicate n-gram character fraction (:code:`duplicate_{n}_gram_chr_fraction`): Fraction of characters in a document which are contained within duplicate n-grams, computed for a specified n-gram range.
* Top n-gram character fraction (:code:`top_{n}_gram_chr_fraction`): Fraction of characters in a document which are contained within the top n-grams, computed for a specified n-gram range.
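
To make the character-fraction idea concrete, here is a minimal pure-Python sketch of two of these metrics under one possible reading of the definitions. It is an illustration, not the component's code: counting only repeats of an earlier line as duplicates, splitting on whitespace, and capping the n-gram fraction at 1.0 are assumptions.

```python
from collections import Counter
from typing import List


def duplicate_lines_chr_fraction(text: str) -> float:
    """Fraction of characters that sit in lines repeating an earlier line."""
    lines = [ln.strip() for ln in text.split("\n") if ln.strip()]
    total = sum(len(ln) for ln in lines)
    if total == 0:
        return 0.0
    seen = set()
    duplicate_chars = 0
    for ln in lines:
        if ln in seen:
            # Assumption: the first occurrence is not counted as a duplicate.
            duplicate_chars += len(ln)
        else:
            seen.add(ln)
    return duplicate_chars / total


def top_ngram_chr_fraction(words: List[str], n: int) -> float:
    """Fraction of word characters covered by the most frequent n-gram."""
    total = sum(len(w) for w in words)
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if total == 0 or not ngrams:
        return 0.0
    top, count = Counter(ngrams).most_common(1)[0]
    covered = count * sum(len(w) for w in top)
    # Overlapping occurrences can be counted twice, so cap the result at 1.0.
    return min(1.0, covered / total)
```

A document dominated by one repeated line or one repeated phrase scores close to 1 on these metrics, which is why they are useful as filters for repetitious text.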


These quality metrics were used, for example, by
`Rae et al. (2021) <https://arxiv.org/abs/2112.11446>`__ and
`Raffel et al. (2020) <https://arxiv.org/abs/1910.10683>`__ to filter large text
corpora for pre-training language models.

Note: this implementation is optimized for usability, simplicity, and spaCy integration rather than for speed.
If you need to run quality filters on a large corpus, consider using the implementation from
`Danish Foundation Models <https://github.com/centre-for-humanities-computing/danish-foundation-models>`__, which also
includes a number of other quality filters and deduplication strategies.


Quality Component
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autofunction:: textdescriptives.components.quality.create_quality_component
5 changes: 5 additions & 0 deletions requirements.txt
@@ -8,3 +8,8 @@ ftfy>=6.0.3,<6.2.0
 pytest>=7.1.3,<7.2.0
 https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0.tar.gz
 en_core_web_sm==3.2.0
+
+# style
+flake8
+black
+isort
18 changes: 9 additions & 9 deletions textdescriptives/__init__.py
@@ -1,16 +1,16 @@
-from .load_components import TextDescriptives
-from .components import (
-    DescriptiveStatistics,
-    Readability,
+from .about import __title__, __version__  # noqa: F401
+from .components import (  # noqa: F401
     DependencyDistance,
+    DescriptiveStatistics,
     POSStatistics,
+    Quality,
+    Readability,
 )
-from .dataframe_extract import (
+from .dataframe_extract import (  # noqa: F401
+    dependency_cols,
+    descriptive_stats_cols,
     extract_df,
     extract_dict,
     readability_cols,
-    dependency_cols,
-    descriptive_stats_cols,
 )
-
-from .about import __version__, __title__
+from .load_components import TextDescriptives  # noqa: F401
9 changes: 5 additions & 4 deletions textdescriptives/components/__init__.py
@@ -1,4 +1,5 @@
-from .readability import Readability
-from .dependency_distance import DependencyDistance
-from .descriptive_stats import DescriptiveStatistics
-from .pos_stats import POSStatistics
+from .dependency_distance import DependencyDistance  # noqa: F401
+from .descriptive_stats import DescriptiveStatistics  # noqa: F401
+from .pos_stats import POSStatistics  # noqa: F401
+from .quality import Quality  # noqa: F401
+from .readability import Readability  # noqa: F401

1 comment on commit b317cf0

@github-actions

Coverage

Coverage Report
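| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| **textdescriptives** | | | | |
| `__init__.py` | 4 | 0 | 100% | |
| `dataframe_extract.py` | 69 | 5 | 93% | 129–133 |
| `load_components.py` | 12 | 0 | 100% | |
| **textdescriptives/components** | | | | |
| `__init__.py` | 5 | 0 | 100% | |
| `dependency_distance.py` | 32 | 0 | 100% | |
| `descriptive_stats.py` | 51 | 1 | 98% | 112 |
| `pos_stats.py` | 25 | 2 | 92% | 14, 46 |
| `quality.py` | 193 | 11 | 94% | 101, 107, 125, 130, 169, 175, 196, 202, 445, 448, 455 |
| `readability.py` | 72 | 0 | 100% | |
| `utils.py` | 24 | 3 | 88% | 59–62 |
| **TOTAL** | 487 | 22 | 95% | |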
