[Experiment] Data cleaning Apr 2024 #517

Closed
wants to merge 81 commits into from
eu9ene
Collaborator

@eu9ene eu9ene commented Apr 8, 2024

Experiment insights

closes #814

OpusCleaner

  • legacy cleaning slightly outperforms all OpusCleaner configs (likely due to num_mismatch filter in OpusCleaner)
  • large FastText model significantly reduces false positives compared to small one
  • FastText can remove a lot of useful data on cleaner datasets, especially short phrases
  • alpha ratio filter can remove useful data on cleaner datasets
  • custom OpusCleaner configs slightly outperform the default one
  • custom OpusCleaner configs + bicleaner significantly outperform the default one + bicleaner (+5M useful sentences due to removing some cleaning rules)
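
To make the two rule-based filters mentioned above more concrete, here is a minimal sketch of an alpha ratio check and a number mismatch check. It is not OpusCleaner's code: the function names, the 0.5 cutoff, and the exact definitions are assumptions chosen for illustration, and the real filters may behave differently. It shows how a valid but digit-heavy pair can be dropped by either rule.

```python
import re


def alpha_ratio(text: str) -> float:
    """Fraction of non-space characters that are alphabetic (any script)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def numbers_match(src: str, trg: str) -> bool:
    """True when both sides contain the same multiset of digit sequences."""
    find_numbers = re.compile(r"\d+").findall
    return sorted(find_numbers(src)) == sorted(find_numbers(trg))


def keep_pair(src: str, trg: str, min_alpha: float = 0.5) -> bool:
    """Keep a pair only if both rules pass (hypothetical 0.5 cutoff)."""
    return (
        alpha_ratio(src) >= min_alpha
        and alpha_ratio(trg) >= min_alpha
        and numbers_match(src, trg)
    )


# A clean pair with matching numbers passes both rules.
print(keep_pair("Version 2.0 released", "Вышла версия 2.0"))
# A correct translation that spells "24/7" out in words fails both rules.
print(keep_pair("Call 911 24/7", "Звоните 911 круглосуточно"))
```

Short, number-heavy phrases like the second pair are the kind of useful data that such rules can remove from otherwise clean datasets.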

OpusFilter:

  • an OpusFilter config similar to the OpusCleaner one, with autotuning, performs a lot worse than the OpusCleaner config (likely due to the difference in filters)
  • OpusFilter with LASER and autotuning performs better than without LASER but still worse than OpusCleaner (Helsinki folks pointed out that there's a bug in sampling with LASER)
  • Autotuning with only basic OpusCleaner-like filters (no Bicleaner or LASER) performs better than the OpusCleaner-like defaults and better than autotuning with feature selection disabled, mostly because it trained longer and had more data
  • Autotuning with LASER and Bicleaner AI enabled filters out far too much data and underperforms
  • Autotuned and defaults-based OpusCleaner-like rules do not outperform the OpusCleaner defaults baseline (likely due to a difference in the FastText implementation)
  • (TODO) tune LASER and Bicleaner separately

Bicleaner AI

  • I deployed OpusCleaner on GPU with Bicleaner AI support; it's a little slow but works
  • it's very hard to tune Bicleaner thresholds in OpusCleaner
  • Manual analysis of score distributions and examples in Jupyter shows that even with a 0.9 threshold there are plenty of incorrect translations
  • Experimented with thresholds of 0.5 vs 0.8 vs 0.9 for all datasets: 0.8 slightly outperforms 0.5; 0.9 filters out too much but is still competitive
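
As a rough illustration of the threshold comparison above, the sketch below computes how much data survives at each cutoff. The scores are made up for the example and the helper name `retained_fraction` is hypothetical; in practice the scores would come from Bicleaner AI output for each sentence pair.

```python
from typing import Iterable


def retained_fraction(scores: Iterable[float], threshold: float) -> float:
    """Share of sentence pairs whose score is at or above the threshold."""
    scores = list(scores)
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)


# Made-up scores for illustration only, not real experiment data.
scores = [0.12, 0.47, 0.55, 0.81, 0.86, 0.93, 0.97, 0.99]

for threshold in (0.5, 0.8, 0.9):
    kept = retained_fraction(scores, threshold)
    print(f"threshold {threshold}: keeps {kept:.1%} of pairs")
```

Plotting this retained fraction per dataset next to manually inspected samples is one way to judge whether a higher threshold is removing noise or useful data.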

LASER

  • also hard to tune in OpusCleaner
  • LASER 2/3 is slower than LASER 1 and requires a GPU

More questions to explore:

LASER embedding similarity filter:

  • What's the impact of LASER filter?
  • Can LASER be useful together with Bicleaner-AI?
  • Does LASER 2/3 significantly outperform LASER 1?
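
For context on these questions, an embedding similarity filter can be sketched as below. This is an assumption-laden illustration: it uses plain cosine similarity over precomputed sentence vectors, the function names and the 0.7 cutoff are hypothetical, and an actual LASER-based filter may score pairs differently.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_filter(
    pairs: List[Tuple[str, str]],
    src_vecs: List[Sequence[float]],
    trg_vecs: List[Sequence[float]],
    min_sim: float,
) -> List[Tuple[str, str]]:
    """Keep pairs whose source and target embeddings are similar enough."""
    return [
        pair
        for pair, u, v in zip(pairs, src_vecs, trg_vecs)
        if cosine(u, v) >= min_sim
    ]


# Toy 2-dimensional vectors stand in for real sentence embeddings.
pairs = [("hello", "привет"), ("hello", "до свидания")]
src_vecs = [[1.0, 0.0], [1.0, 0.0]]
trg_vecs = [[0.9, 0.1], [0.0, 1.0]]
print(similarity_filter(pairs, src_vecs, trg_vecs, min_sim=0.7))
```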

Bicleaner-AI:

  • Will customizing the thresholds for large datasets boost performance?
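
One way to picture per-dataset thresholds is a lookup with a fallback, mirroring the `default-threshold` and `dataset-thresholds` keys in the experiment config. The sketch below is only an illustration: the helper names, the 0.8 value, and the assumption that datasets are keyed by the same names as in the `train` list are not confirmed by the pipeline.

```python
from typing import Dict, List, Tuple

ScoredPair = Tuple[str, str, float]  # (source, target, score)


def threshold_for(
    dataset: str, default_threshold: float, dataset_thresholds: Dict[str, float]
) -> float:
    """Use the dataset-specific threshold if present, else the default."""
    return dataset_thresholds.get(dataset, default_threshold)


def filter_scored(pairs: List[ScoredPair], threshold: float) -> List[Tuple[str, str]]:
    """Drop pairs scoring below the threshold and strip the score."""
    return [(src, trg) for src, trg, score in pairs if score >= threshold]


# Hypothetical override: a stricter cutoff for one large, noisy dataset.
dataset_thresholds = {"opus_ParaCrawl/v9": 0.8}
default_threshold = 0.5

pairs = [("a", "а", 0.6), ("b", "б", 0.85)]
for dataset in ("opus_ParaCrawl/v9", "opus_Books/v1"):
    cutoff = threshold_for(dataset, default_threshold, dataset_thresholds)
    print(dataset, cutoff, filter_scored(pairs, cutoff))
```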

Setup

en-ru pair, all data except CCMatrix/NLLB, training backward model (ru-en)

Example config:


datasets:
  # all except ccmatrix and nllb to test filtering
  train:
    - opus_Books/v1
    - opus_CCAligned/v1
    - opus_ELRC-3075-wikipedia_health/v1
    - opus_ELRC-3855-SWPS_University_Soci/v1
    - opus_ELRC-5067-SciPar/v1
    - opus_ELRC-5183-SciPar_Ukraine/v1
    - opus_ELRC-wikipedia_health/v1
    - opus_ELRC_2922/v1
    - opus_EUbookshop/v2
    - opus_GNOME/v1
    - opus_GlobalVoices/v2018q4
    - opus_KDE4/v2
    - opus_LinguaTools-WikiTitles/v2014
    - opus_NeuLab-TedTalks/v1
    - opus_News-Commentary/v16
    - opus_OpenSubtitles/v2018
    - opus_PHP/v1
    - opus_ParaCrawl/v9
    - opus_QED/v2.0a
    - opus_TED2013/v1.1
    - opus_TED2020/v1
    - opus_Tanzil/v1
    - opus_Tatoeba/v2023-04-12
    - opus_TildeMODEL/v2018
    - opus_UNPC/v1.0
    - opus_Ubuntu/v14.10
    - opus_WikiMatrix/v1
    - opus_WikiTitles/v3
    - opus_Wikipedia/v1.0
    - opus_XLEnt/v1.2
    - opus_ada83/v1
    - opus_bible-uedin/v1
    - opus_infopankki/v1
    - opus_tico-19/v2020-10-28
    - opus_tldr-pages/v2023-08-29
    - opus_wikimedia/v20230407
    - mtdata_Statmt-commoncrawl_wmt13-1-rus-eng
    - mtdata_Statmt-news_commentary_wmt18-13-rus-eng
    - mtdata_Tilde-airbaltic-1-eng-rus
    - mtdata_Tilde-czechtourism-1-eng-rus
    - mtdata_Tilde-worldbank-1-eng-rus
    - mtdata_UN-un_dev-1-eng-rus
    - mtdata_UN-un_test-1-eng-rus
  # datasets to merge for validation while training
  devtest:
    - flores_dev
    - sacrebleu_aug-mix_wmt19
    - sacrebleu_aug-mix_wmt17
    - sacrebleu_aug-mix_wmt15
    - sacrebleu_aug-mix_wmt14
  # datasets for evaluation
  test:
    - flores_devtest
    - sacrebleu_wmt20
    - sacrebleu_wmt20
    - sacrebleu_wmt18
    - sacrebleu_wmt16
    - sacrebleu_wmt13
  # monolingual datasets (ex. paracrawl-mono_paracrawl8, commoncrawl_wmt16, news-crawl_news.2020)
  # to be translated by the teacher model
  mono-src:
    - news-crawl_news.2008
  # to be translated by the backward model to augment teacher corpus with back-translations
  # leave empty to skip augmentation step (high resource languages)
  mono-trg:
    - news-crawl_news.2008
experiment:
  src: en
  trg: ru
  name: opuscleaner_custom_laser_bicleaner
  vocab: NOT-YET-SUPPORTED
  bicleaner:
    default-threshold: 0.5
    dataset-thresholds: {}
  best-model: chrf
  split-length: 2000000
  backward-model: NOT-YET-SUPPORTED
  spm-sample-size: 10000000
  spm-vocab-size: 32000
  teacher-ensemble: 1
  mono-max-sentences-src: 500000000
  mono-max-sentences-trg: 500000000
  use-opuscleaner: 'true'
marian-args:
  decoding-teacher:
    precision: float16
    mini-batch-words: '4000'
  training-student:
    early-stopping: '20'
  decoding-backward:
    beam-size: '8'
    mini-batch-words: '2000'
  training-backward:
    after: 10e
  training-teacher:
    early-stopping: '20'
  training-student-finetuned:
    early-stopping: '20'
taskcluster:
  split-chunks: 10
target-stage: train-backwards


@eu9ene eu9ene changed the title [Experiment] OpusFilter [Experiment] Data cleaning Apr 16, 2024
@eu9ene eu9ene changed the title [Experiment] Data cleaning [Experiment] Data cleaning Apr 2024 Apr 17, 2024
Development

Successfully merging this pull request may close these issues.

Experiment with data cleaning