Commit
Merge pull request #63 from HLasse/feature-add-quality
Feature: Add quality descriptives
Showing 13 changed files with 1,001 additions and 79 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,7 +1,10 @@ | ||
# News | ||
|
||
## V1.0.7 - May 4, 2022 | ||
## v1.1.0 - 21st of September, 2022 | ||
- Added a new pipe, "quality". This pipe implements a series of metrics related to text quality, some of which were used by Rae et al. (2021) and Raffel et al. (2020) to filter large text corpora. See the documentation for examples. | ||
|
||
## v1.0.7 - 4th May, 2022 | ||
- Some minor fixes and bells and whistles. | ||
|
||
## V1.0.5 - October 4, 2021 | ||
## v1.0.5 - 4th October, 2021 | ||
- POS proportions now use `pos_` instead of `tag_` by default. This behavior can be changed by setting `use_tag` to `False` when initialising the `pos_stats` module. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
Quality | ||
-------------------- | ||
|
||
The :code:`quality` component adds the following quality metrics under the | ||
:code:`._.quality` attribute to :code:`Doc` and :code:`Span` objects. | ||
|
||
Heuristic quality metrics: | ||
|
||
* Number of stop words (:code:`n_stop_words`): The number of stop words in the document. | ||
* Alpha ratio (:code:`alpha_ratio`): Ratio of words containing at least one alphabetic character. | ||
* Mean word length (:code:`mean_word_length`): Mean/average word length. | ||
* Proportion of ellipsis (:code:`proportion_ellipsis`): Proportion of lines in a document that end with an ellipsis. | ||
* Proportion of bullet points (:code:`proportion_bullet_points`): Proportion of lines in a document that start with a bullet point. | ||
* Symbol to word ratio (:code:`symbol_{symbol}_2_word_ratio`): Ratio of a specified symbol to words, e.g. the ratio of hashtags or curly brackets to words. | ||
* Contains string (:code:`contains_{string}`): Whether the document contains a specified string, for instance "lorem ipsum". | ||
|
||
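As a rough illustration of the heuristic metrics listed above, here is a minimal, self-contained sketch. It is not the `textdescriptives` implementation: it assumes plain whitespace tokenization instead of spaCy tokens, and the function names, the default ellipsis markers, and the default bullet markers are hypothetical choices made for this example.

```python
def alpha_ratio(words):
    """Fraction of words that contain at least one alphabetic character."""
    if not words:
        return 0.0
    return sum(any(ch.isalpha() for ch in w) for w in words) / len(words)


def mean_word_length(words):
    """Average number of characters per word."""
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


def _non_empty_lines(text):
    # Assumption: blank lines are ignored when computing line proportions.
    return [ln.strip() for ln in text.split("\n") if ln.strip()]


def proportion_ellipsis(text, markers=("...", "…")):
    """Fraction of non-empty lines that end with an ellipsis marker."""
    lines = _non_empty_lines(text)
    if not lines:
        return 0.0
    return sum(ln.endswith(markers) for ln in lines) / len(lines)


def proportion_bullet_points(text, bullets=("-", "*")):
    """Fraction of non-empty lines that start with a bullet marker."""
    lines = _non_empty_lines(text)
    if not lines:
        return 0.0
    return sum(ln.startswith(bullets) for ln in lines) / len(lines)


def symbol_to_word_ratio(text, words, symbol):
    """Occurrences of `symbol` in the text divided by the number of words."""
    if not words:
        return 0.0
    return text.count(symbol) / len(words)


text = "- first item...\n- second item\nplain line 42 #tag"
words = text.split()  # whitespace tokenization (an assumption of this sketch)

metrics = {
    "alpha_ratio": alpha_ratio(words),
    "mean_word_length": mean_word_length(words),
    "proportion_ellipsis": proportion_ellipsis(text),
    "proportion_bullet_points": proportion_bullet_points(text),
    "symbol_#_2_word_ratio": symbol_to_word_ratio(text, words, "#"),
}
print(metrics)
```

A filter built on such values would typically compare each one against a threshold chosen for the corpus at hand; the thresholds themselves are outside the scope of this sketch.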
Repetitious text metrics: | ||
|
||
* Duplicate lines character fraction (:code:`duplicate_lines_chr_fraction`): Fraction of characters in a document that are contained within duplicate lines. | ||
* Duplicate paragraphs character fraction (:code:`duplicate_paragraphs_chr_fraction`): Fraction of characters in a document that are contained within duplicate paragraphs. | ||
* Duplicate n-gram character fraction (:code:`duplicate_{n}_gram_chr_fraction`): Fraction of characters in a document that are contained within duplicate n-grams, computed for a specified n-gram range. | ||
* Top n-gram character fraction (:code:`top_{n}_gram_chr_fraction`): Fraction of characters in a document that are contained within the top n-grams, computed for a specified n-gram range. | ||
|
||
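The repetition metrics above can be sketched in a similar way. The code below is an illustrative approximation, not the `textdescriptives` implementation. It assumes whitespace tokenization, counts only the second and later occurrences of a line as duplicates, and divides by the number of characters in non-empty lines or in words; these conventions and the function names are assumptions of the example.

```python
from collections import Counter


def duplicate_lines_chr_fraction(text):
    """Fraction of characters that sit in lines already seen earlier."""
    lines = [ln for ln in text.split("\n") if ln.strip()]
    total_chars = sum(len(ln) for ln in lines)
    if total_chars == 0:
        return 0.0
    seen = set()
    duplicate_chars = 0
    for ln in lines:
        if ln in seen:
            # Assumption: the first occurrence is not counted as a duplicate.
            duplicate_chars += len(ln)
        else:
            seen.add(ln)
    return duplicate_chars / total_chars


def top_ngram_chr_fraction(words, n):
    """Fraction of word characters covered by the most frequent n-gram.

    Overlapping occurrences are counted independently, so the value can
    exceed 1.0 for extremely repetitive input.
    """
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    total_chars = sum(len(w) for w in words)
    if not ngrams or total_chars == 0:
        return 0.0
    top_ngram, count = Counter(ngrams).most_common(1)[0]
    return count * sum(len(w) for w in top_ngram) / total_chars


dup_fraction = duplicate_lines_chr_fraction("buy now\nbuy now\nread more")
top_fraction = top_ngram_chr_fraction("the cat sat the cat ran".split(), n=2)
print(dup_fraction, top_fraction)
```

The paragraph-level variant would follow the same pattern as the line-level function, splitting on blank lines instead of single newlines.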
|
||
These quality metrics were used, for example, by | ||
`Rae et al. (2021) <https://arxiv.org/abs/2112.11446>`__ and | ||
`Raffel et al. (2020) <https://arxiv.org/abs/1910.10683>`__ to filter large text | ||
corpora for pre-training language models. | ||
|
||
Note: this implementation is optimized for usability, simplicity, and spaCy integration rather than for speed. | ||
If you need to run quality filters on a large corpus, consider using the implementation from | ||
`Danish Foundation Models <https://github.com/centre-for-humanities-computing/danish-foundation-models>`__, which also | ||
includes a number of other quality filters and deduplication strategies. | ||
|
||
|
||
Quality Component | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
.. autofunction:: textdescriptives.components.quality.create_quality_component |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,16 +1,16 @@ | ||
from .load_components import TextDescriptives | ||
from .components import ( | ||
DescriptiveStatistics, | ||
Readability, | ||
from .about import __title__, __version__ # noqa: F401 | ||
from .components import ( # noqa: F401 | ||
DependencyDistance, | ||
DescriptiveStatistics, | ||
POSStatistics, | ||
Quality, | ||
Readability, | ||
) | ||
from .dataframe_extract import ( | ||
from .dataframe_extract import ( # noqa: F401 | ||
dependency_cols, | ||
descriptive_stats_cols, | ||
extract_df, | ||
extract_dict, | ||
readability_cols, | ||
dependency_cols, | ||
descriptive_stats_cols, | ||
) | ||
|
||
from .about import __version__, __title__ | ||
from .load_components import TextDescriptives # noqa: F401 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,4 +1,5 @@ | ||
from .readability import Readability | ||
from .dependency_distance import DependencyDistance | ||
from .descriptive_stats import DescriptiveStatistics | ||
from .pos_stats import POSStatistics | ||
from .dependency_distance import DependencyDistance # noqa: F401 | ||
from .descriptive_stats import DescriptiveStatistics # noqa: F401 | ||
from .pos_stats import POSStatistics # noqa: F401 | ||
from .quality import Quality # noqa: F401 | ||
from .readability import Readability # noqa: F401 |
b317cf0
Coverage Report