-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate OpusCleaner #163
Conversation
# Conflicts: # Makefile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion (docs): Since we're adding new tools to the pipeline, it would be nice to start the practice of documenting them a bit more. It would be nice to get a high-level markdown file documenting what is going on with OpusCleaner and why we use it. It would be nice to explain how we layer the configs, and how to start writing and using your own. For this PR it would be nice to at least start this documentation process, and I think it would be fine to file a follow-up to add more docs from there.
In conclusion: I'm going to mark this as approved on my end as I don't see any reason to block it. If you add substantial changes it would be nice for me to look at it again, so feel free to re-flag me for review.
Again, I use https://conventionalcomments.org/ for my reviews, so things like "Thought" is not something that required action, while "Suggestion" is. I use "Thought" when it's nice to have a conversation or when I'm thinking about the bigger problem.
Makefile
Outdated
@@ -116,6 +116,19 @@ tensorboard: | |||
tensorboard --logdir=$(MODELS) --host=0.0.0.0 & | |||
python utils/tb_log_parser.py --prefix= | |||
|
|||
|
|||
install-opuscleaner: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion(docs): Since the Makefile is a developer-facing entry point into the code, it would be nice to provide a short high-level piece of documentation describing what each these actions are. It would also be nice to mention what OpusCleaner is for someone onboarding to the codebase.
OpusCleaner is a data cleaner for training data. https://github.com/hplt-project/OpusCleaner
@@ -0,0 +1,91 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought: It's a shame this is JSON as we can't add comments about the configuration and why certain defaults were chosen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this have to adhere to a specific JSON schema? We could have an optional comment/description field since we can't have comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created this filter template using the OpusCleaner UI tool. It should be compatible with the filters. I don't know if they have a schema somewhere to validate it. This would be a good follow-up task to validate the schema before running things.
@@ -0,0 +1,9 @@ | |||
# Starting text in a lot of rows in OpenSubtitles, not everytime in both source and targets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: This first comment doesn't make sense to me. It would be nice to re-phrase it using a more complete sentence.
The same goes for "Text in Czech Bible translations that is not in the source"
@@ -0,0 +1,9 @@ | |||
# Starting text in a lot of rows in OpenSubtitles, not everytime in both source and targets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: What is this file exactly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is required for one of the filters and it is a copy-paste from opus-cleaner. https://github.com/hplt-project/OpusCleaner/blob/main/opuscleaner/filters/remove_frequent_patterns.txt
@@ -0,0 +1,78 @@ | |||
""" | |||
Generates filter config for a dataset based on defaults to use in OpusCleaner |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Praise: Thanks for the docs!
@@ -0,0 +1,61 @@ | |||
#!/bin/bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought: This command is testable as well, but I'm not going to request a test here at this time.
@@ -0,0 +1,61 @@ | |||
#!/bin/bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Is there a reason this is in bash rather than python?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very simple script that just calls CLI tools and does some IO with them, so I didn't see any reason to write it in Python. Also, we don't have a pattern setup of how to write such things in Python efficiently. It shouldn't be hard, but it's just way faster to write this in bash. As a follow up you can try refactoring this later in Python to compare the complexity and the benefits of such scripts.
--batch-size=50000 \ | ||
--input=- \ | ||
"${filter_path}" "${SRC}" "${TRG}" \ | ||
2> "${output_prefix}.opus_filter.log" | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: I guess this outputs the opus log to the artifacts directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sometimes the opuscleaner writes to stderr that we can't see in the artifacts. We should address this issue on a higher level probably.
taskcluster/ci/clean-corpus/kind.yml
Outdated
- >- | ||
pip install -r $VCS_PATH/pipeline/clean/requirements/clean.txt && | ||
if [ ${USE_OPUSCLEANER} = "true" ]; then dir="clean/opuscleaner"; else dir="clean"; fi && | ||
$VCS_PATH/pipeline/${dir}/clean-corpus.sh $MOZ_FETCHES_DIR/{dataset_sanitized} /builds/worker/artifacts/{dataset_sanitized} auto {dataset} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: This will fail the new yamllint rules as beeing too long. It's also not very readable as it is written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge the linting and then fix all the issues it finds.
taskcluster/ci/clean-corpus/kind.yml
Outdated
- >- | ||
pip install -r $VCS_PATH/pipeline/clean/requirements/clean.txt && | ||
if [ ${USE_OPUSCLEANER} = "true" ]; then dir="clean/opuscleaner"; else dir="clean"; fi && | ||
$VCS_PATH/pipeline/${dir}/clean-corpus.sh $MOZ_FETCHES_DIR/{dataset_sanitized} /builds/worker/artifacts/{dataset_sanitized} auto {dataset} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought: This is another example where the bash script vs python argparse makes it difficult to understand how everything is wired together. With named arguments this would be much easier to understand.
I have refactoring of the docs on my list and filed an issue for this. Also, this PR was meant as just an initial integration and to hand over the work as we discussed. The follow-up work that will be required to complete this is listed here. We can refactor things there and experiment with alternative approaches. |
I added the docs and comments according to the suggestions and fixed an issue with the custom config. Nothing major, so merging. |
The first pass on integrating OpusCleaner that's meant to replace our cleaning scripts.
Included in this PR:
Out of scope:
resources
doesn't support directories or wildcards)fixes #160
Discovered unfixed issues: