Commit

tutorial: minor
HLasse committed Jan 11, 2023
1 parent c339782 commit c5e9eff
Showing 1 changed file with 8 additions and 8 deletions.
16 changes: 8 additions & 8 deletions docs/tutorials/filter_corpus_using_quality.ipynb
@@ -78,7 +78,7 @@
"# download the first 1 000\n",
"dataset = dataset.take(1000)\n",
"\n",
-"# extract the text and remove text which are too long\n",
+"# extract the text and remove texts which are too long\n",
"texts = [sample [\"text\"] for sample in dataset]\n"
]
},
@@ -257,7 +257,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"Naturally we realize that you might not know what all of these mean, but you can easily check it on [the documentation site](https://hlasse.github.io/TextDescriptives/quality.html). Examining these we see that this text has a high proportion of character which appear in duplicate n-grams `duplicate_10-gram_chr_fraction`. When this fraction is really high it means that the text contains a high proportion of repititions. This is often a sign of low quality text.\n",
+"Naturally, you might not know what all of these mean, but you can easily check it on [the documentation site](https://hlasse.github.io/TextDescriptives/quality.html). Examining these we see that this text has a high proportion of characters which appear in duplicate n-grams `duplicate_10-gram_chr_fraction`. When this fraction is really high it means that the text contains a high proportion of repititions. This is often a sign of low quality text.\n",
"\n",
"If we examine the quality thresholds of the pipeline we can see that the max allowed value for `duplicate_10-gram_chr_fraction` is 0.1:"
]
@@ -292,7 +292,7 @@
"metadata": {},
"source": [
"### Extracting high quality texts\n",
-"Naturally we are typically interested in text which are not of low quality. Thus we can extract these by filtering out the text which did not pass the quality check."
+"We are typically interested in text which are not of low quality. Thus we can extract these by filtering out the text which did not pass the quality check."
]
},
{
@@ -327,7 +327,7 @@
"metadata": {},
"source": [
"### Changing the filters\n",
-"Naturally, in some cases you might want to apply other filters. For instance the current filter sets a `symbol_to_word_ratio` threshold of 0.1 for hashtags `#`. This means that if a text contains a lot of hashtags will be filtered out. However if you are working on e.g. tweets this is an unreasonable filter and you might want to adjust that. You can do this by overwriting the quality_thresholds:"
+"In some cases you might want to apply other filters. For instance the current filter sets a `symbol_to_word_ratio` threshold of 0.1 for hashtags `#`. This means that if a text contains a lot of hashtags it will be filtered out. However if you are working on e.g. tweets this is an unreasonable filter and you might want to adjust that. You can do this by overwriting the quality_thresholds:"
]
},
{
@@ -391,7 +391,7 @@
"source": [
"## Comparing Domains\n",
"\n",
-"These quality metrics are heuristic based an thus, while they are reasonable for one domain, might not be reasonable for another. We will explore this a bit further in this section. These filters are specifically tuned for the web domain and this can lead to problems in applied directly to other domains.\n",
+"These quality metrics are heuristic based and need to be tuned. While the defaults are reasonable for some domains, they may not be for others. We will explore this a bit further in this section. These filters are specifically tuned for the web domain and this can lead to problems when applied directly to other domains.\n",
"\n"
]
},
@@ -1153,7 +1153,7 @@
],
"metadata": {
"kernelspec": {
-"display_name": "textdescriptives",
+"display_name": "Python 3.10.9 ('.venv': venv)",
"language": "python",
"name": "python3"
},
@@ -1167,12 +1167,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.8.15"
+"version": "3.10.9"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
-"hash": "31387647799921bb85032eec7bb02e281325ae7f8ffa6f9cd7cdead815b36c88"
+"hash": "1fec3abd59d8d4e793464ce299b69082c8b9c618d555ba6df7044c7d7b4183f8"
}
}
},