diff --git a/docs/tutorials/filter_corpus_using_quality.ipynb b/docs/tutorials/filter_corpus_using_quality.ipynb
index 51582d55..79bf2f83 100644
--- a/docs/tutorials/filter_corpus_using_quality.ipynb
+++ b/docs/tutorials/filter_corpus_using_quality.ipynb
@@ -78,7 +78,7 @@
  "# download the first 1 000\n",
  "dataset = dataset.take(1000)\n",
  "\n",
- "# extract the text and remove text which are too long\n",
+ "# extract the text and remove texts which are too long\n",
  "texts = [sample [\"text\"] for sample in dataset]\n"
  ]
  },
@@ -257,7 +257,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "Naturally we realize that you might not know what all of these mean, but you can easily check it on [the documentation site](https://hlasse.github.io/TextDescriptives/quality.html). Examining these we see that this text has a high proportion of character which appear in duplicate n-grams `duplicate_10-gram_chr_fraction`. When this fraction is really high it means that the text contains a high proportion of repititions. This is often a sign of low quality text.\n",
+ "Naturally, you might not know what all of these mean, but you can easily check it on [the documentation site](https://hlasse.github.io/TextDescriptives/quality.html). Examining these we see that this text has a high proportion of characters which appear in duplicate n-grams `duplicate_10-gram_chr_fraction`. When this fraction is really high it means that the text contains a high proportion of repetitions. This is often a sign of low quality text.\n",
  "\n",
  "If we examine the quality thresholds of the pipeline we can see that the max allowed value for `duplicate_10-gram_chr_fraction` is 0.1:"
  ]
@@ -292,7 +292,7 @@
  "metadata": {},
  "source": [
  "### Extracting high quality texts\n",
- "Naturally we are typically interested in text which are not of low quality. Thus we can extract these by filtering out the text which did not pass the quality check."
+ "We are typically interested in texts which are not of low quality. Thus we can extract these by filtering out the texts which did not pass the quality check."
  ]
  },
  {
@@ -327,7 +327,7 @@
  "metadata": {},
  "source": [
  "### Changing the filters\n",
- "Naturally, in some cases you might want to apply other filters. For instance the current filter sets a `symbol_to_word_ratio` threshold of 0.1 for hashtags `#`. This means that if a text contains a lot of hashtags will be filtered out. However if you are working on e.g. tweets this is an unreasonable filter and you might want to adjust that. You can do this by overwriting the quality_thresholds:"
+ "In some cases you might want to apply other filters. For instance the current filter sets a `symbol_to_word_ratio` threshold of 0.1 for hashtags `#`. This means that if a text contains a lot of hashtags it will be filtered out. However if you are working on e.g. tweets this is an unreasonable filter and you might want to adjust that. You can do this by overwriting the quality_thresholds:"
  ]
  },
  {
@@ -391,7 +391,7 @@
  "source": [
  "## Comparing Domains\n",
  "\n",
- "These quality metrics are heuristic based an thus, while they are reasonable for one domain, might not be reasonable for another. We will explore this a bit further in this section. These filters are specifically tuned for the web domain and this can lead to problems in applied directly to other domains.\n",
+ "These quality metrics are heuristic-based and need to be tuned. While the defaults are reasonable for some domains, they may not be for others. We will explore this a bit further in this section. These filters are specifically tuned for the web domain and this can lead to problems when applied directly to other domains.\n",
  "\n"
  ]
  },
@@ -1153,7 +1153,7 @@
  ],
  "metadata": {
  "kernelspec": {
- "display_name": "textdescriptives",
+ "display_name": "Python 3.10.9 ('.venv': venv)",
  "language": "python",
  "name": "python3"
  },
@@ -1167,12 +1167,12 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
- "version": "3.8.15"
+ "version": "3.10.9"
  },
  "orig_nbformat": 4,
  "vscode": {
  "interpreter": {
- "hash": "31387647799921bb85032eec7bb02e281325ae7f8ffa6f9cd7cdead815b36c88"
+ "hash": "1fec3abd59d8d4e793464ce299b69082c8b9c618d555ba6df7044c7d7b4183f8"
  }
  }
  },
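
Note on the `symbol_to_word_ratio` wording changed in this diff: the tutorial text says a hashtag threshold of 0.1 filters out hashtag-heavy texts such as tweets. The sketch below illustrates that idea in plain Python only. It is not the TextDescriptives implementation: the helper names `symbol_to_word_ratio` and `passes_threshold`, the whitespace-based word split, and the sample texts are all assumptions made for illustration.

```python
def symbol_to_word_ratio(text: str, symbol: str = "#") -> float:
    """Occurrences of `symbol` divided by the number of whitespace-separated words.

    Hypothetical helper: the real library may tokenize and count differently.
    """
    words = text.split()
    if not words:
        return 0.0
    return text.count(symbol) / len(words)


def passes_threshold(text: str, max_ratio: float = 0.1, symbol: str = "#") -> bool:
    """Return True if the text stays at or below the maximum allowed ratio."""
    return symbol_to_word_ratio(text, symbol) <= max_ratio


# Made-up sample texts, one web-like and one tweet-like.
web_text = "This is an ordinary sentence taken from a web page about cooking."
tweet = "Great day #sun #beach #holiday"

print(passes_threshold(web_text))              # no hashtags, ratio 0.0
print(passes_threshold(tweet))                 # 3 hashtags over 5 words, ratio 0.6
print(passes_threshold(tweet, max_ratio=1.0))  # relaxed threshold for tweets
```

With the default of 0.1 the tweet-like text is rejected, and relaxing the threshold lets it through, which is the adjustment the tutorial describes for tweet-like domains.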