Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate OpusCleaner #163

Merged
merged 49 commits into from
Sep 26, 2023
Merged

Integrate OpusCleaner #163

merged 49 commits into from
Sep 26, 2023

Conversation

eu9ene
Copy link
Collaborator

@eu9ene eu9ene commented Aug 18, 2023

The first pass on integrating OpusCleaner that's meant to replace our cleaning scripts.

Included in this PR:

  • cleaning parallel corpus
  • default config that is similar to our current config
  • integration with Snakemake
  • integration with TaskCluster
  • run in CI
  • tested with a full pipeline production config and a default filter on en-ru and Snakemake (all completed except CCMatrix)

Out of scope:

  • cleaning monolingual data
  • reviewing deeper and improving the default config
  • creating the custom filters for a language pair using OpusCleaner UI
  • running experiments to see how the new filtering and custom filters affect translation quality
  • migrating the repository of configs to a different place (we might not want to include all the custom configs in the TC artifacts and it seems resources doesn't support directories or wildcards)

fixes #160

Discovered unfixed issues:

  • opuscleaner-clean hangs on CCMatrix hplt-project/OpusCleaner#116
  • TaskCluster doesn't redirect stderr to stdout like we do in snakemake, so OpusCleaner logs are not visible by default. I had to redirect manually to a file which is ok but makes it less visible when running manually. We should probably think of a more generic solution because some tools that work with stdin/stdout write log to stderr.

@eu9ene eu9ene marked this pull request as ready for review September 22, 2023 22:18
@eu9ene eu9ene requested a review from a team as a code owner September 22, 2023 22:18
@eu9ene eu9ene requested review from hneiva, gregtatum and gabrielBusta and removed request for hneiva September 22, 2023 22:18
Copy link
Member

@gregtatum gregtatum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion (docs): Since we're adding new tools to the pipeline, it would be nice to start the practice of documenting them a bit more. It would be nice to get a high-level markdown file documenting what is going on with OpusCleaner and why we use it. It would be nice to explain how we layer the configs, and how to start writing and using your own. For this PR it would be nice to at least start this documentation process, and I think it would be fine to file a follow-up to add more docs from there.

In conclusion: I'm going to mark this as approved on my end as I don't see any reason to block it. If you add substantial changes it would be nice for me to look at it again, so feel free to re-flag me for review.

Again, I use https://conventionalcomments.org/ for my reviews, so things like "Thought" is not something that required action, while "Suggestion" is. I use "Thought" when it's nice to have a conversation or when I'm thinking about the bigger problem.

Makefile Outdated
@@ -116,6 +116,19 @@ tensorboard:
tensorboard --logdir=$(MODELS) --host=0.0.0.0 &
python utils/tb_log_parser.py --prefix=


install-opuscleaner:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion(docs): Since the Makefile is a developer-facing entry point into the code, it would be nice to provide a short high-level piece of documentation describing what each these actions are. It would also be nice to mention what OpusCleaner is for someone onboarding to the codebase.

OpusCleaner is a data cleaner for training data. https://github.com/hplt-project/OpusCleaner

@@ -0,0 +1,91 @@
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: It's a shame this is JSON as we can't add comments about the configuration and why certain defaults were chosen.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have to adhere to a specific JSON schema? We could have an optional comment/description field since we can't have comments.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created this filter template using the OpusCleaner UI tool. It should be compatible with the filters. I don't know if they have a schema somewhere to validate it. This would be a good follow-up task to validate the schema before running things.

@@ -0,0 +1,9 @@
# Starting text in a lot of rows in OpenSubtitles, not everytime in both source and targets
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: This first comment doesn't make sense to me. It would be nice to re-phrase it using a more complete sentence.

The same goes for "Text in Czech Bible translations that is not in the source"

@@ -0,0 +1,9 @@
# Starting text in a lot of rows in OpenSubtitles, not everytime in both source and targets
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: What is this file exactly?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is required for one of the filters and it is a copy-paste from opus-cleaner. https://github.com/hplt-project/OpusCleaner/blob/main/opuscleaner/filters/remove_frequent_patterns.txt

@@ -0,0 +1,78 @@
"""
Generates filter config for a dataset based on defaults to use in OpusCleaner
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Praise: Thanks for the docs!

@@ -0,0 +1,61 @@
#!/bin/bash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: This command is testable as well, but I'm not going to request a test here at this time.

@@ -0,0 +1,61 @@
#!/bin/bash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Is there a reason this is in bash rather than python?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very simple script that just calls CLI tools and does some IO with them, so I didn't see any reason to write it in Python. Also, we don't have a pattern setup of how to write such things in Python efficiently. It shouldn't be hard, but it's just way faster to write this in bash. As a follow up you can try refactoring this later in Python to compare the complexity and the benefits of such scripts.

--batch-size=50000 \
--input=- \
"${filter_path}" "${SRC}" "${TRG}" \
2> "${output_prefix}.opus_filter.log" |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: I guess this outputs the opus log to the artifacts directory.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sometimes the opuscleaner writes to stderr that we can't see in the artifacts. We should address this issue on a higher level probably.

- >-
pip install -r $VCS_PATH/pipeline/clean/requirements/clean.txt &&
if [ ${USE_OPUSCLEANER} = "true" ]; then dir="clean/opuscleaner"; else dir="clean"; fi &&
$VCS_PATH/pipeline/${dir}/clean-corpus.sh $MOZ_FETCHES_DIR/{dataset_sanitized} /builds/worker/artifacts/{dataset_sanitized} auto {dataset}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: This will fail the new yamllint rules as beeing too long. It's also not very readable as it is written.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge the linting and then fix all the issues it finds.

- >-
pip install -r $VCS_PATH/pipeline/clean/requirements/clean.txt &&
if [ ${USE_OPUSCLEANER} = "true" ]; then dir="clean/opuscleaner"; else dir="clean"; fi &&
$VCS_PATH/pipeline/${dir}/clean-corpus.sh $MOZ_FETCHES_DIR/{dataset_sanitized} /builds/worker/artifacts/{dataset_sanitized} auto {dataset}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: This is another example where the bash script vs python argparse makes it difficult to understand how everything is wired together. With named arguments this would be much easier to understand.

@eu9ene
Copy link
Collaborator Author

eu9ene commented Sep 25, 2023

Suggestion (docs): Since we're adding new tools to the pipeline, it would be nice to start the practice of documenting them a bit more. It would be nice to get a high-level markdown file documenting what is going on with OpusCleaner and why we use it. It would be nice to explain how we layer the configs, and how to start writing and using your own. For this PR it would be nice to at least start this documentation process, and I think it would be fine to file a follow-up to add more docs from there.

In conclusion: I'm going to mark this as approved on my end as I don't see any reason to block it. If you add substantial changes it would be nice for me to look at it again, so feel free to re-flag me for review.

Again, I use https://conventionalcomments.org/ for my reviews, so things like "Thought" is not something that required action, while "Suggestion" is. I use "Thought" when it's nice to have a conversation or when I'm thinking about the bigger problem.

I have refactoring of the docs on my list and filed an issue for this.

Also, this PR was meant as just an initial integration and to hand over the work as we discussed. The follow-up work that will be required to complete this is listed here. We can refactor things there and experiment with alternative approaches.

@eu9ene
Copy link
Collaborator Author

eu9ene commented Sep 26, 2023

I added the docs and comments according to the suggestions and fixed an issue with the custom config. Nothing major, so merging.

@eu9ene eu9ene merged commit e9102a3 into mozilla:main Sep 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Integrate OpusCleaner
3 participants