Update IWSLT dataset link #86

Merged (2 commits) on Feb 21, 2022

@maj0e (Contributor) commented Feb 21, 2022

Fixes: #85

As mentioned in #72 and #85, the IWSLT2016 dataset moved to Google Drive, and all language pairs are now in a single archive.

I've rewritten the post_fetch_method to extract the nested archives for the requested language pair.
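
To illustrate the nested extraction described above, here is a minimal sketch using only Julia's `Tar` standard library. It is not the code from this PR: the function name `extract_pair`, the inner archive name `<src>-<dst>.tar`, and the use of uncompressed tarballs are all assumptions made to keep the example self-contained.

```julia
using Tar  # standard library since Julia 1.6

# Hypothetical helper: unpack the outer archive into a temporary staging
# directory, then unpack only the inner archive for the requested pair.
# Assumes uncompressed tarballs and inner archives named "<src>-<dst>.tar".
function extract_pair(outer::AbstractString, src, dst, dest::AbstractString)
    staging = Tar.extract(outer)  # extracts into a fresh temporary directory
    inner = joinpath(staging, "$(src)-$(dst).tar")
    isfile(inner) || error("no archive for $(src)-$(dst) in $(outer)")
    # `dest` must not exist yet, or must be an empty directory
    return Tar.extract(inner, dest)
end
```

Only the requested pair ends up in `dest`; the other inner archives stay in the staging directory, which is why a single large download can still serve any language pair.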

I also found a bug in an error message in the tunefile function. The "fr-en" language pair was accidentally hardcoded there, which led to the "fr-en" language pair being downloaded whenever the error was triggered.
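
The shape of that fix can be sketched as follows. This is a simplified illustration, not the actual tunefile code: `FAKE_DATADEPS`, `list_files`, `find_file`, and the `"IWSLT2016 <src>-<dst>"` naming are hypothetical stand-ins for the real datadep lookup, which may download data when a name is resolved.

```julia
# Hypothetical stand-in for a datadep registry: name => files it provides.
const FAKE_DATADEPS = Dict(
    "IWSLT2016 fr-en" => ["train.fr", "train.en"],
    "IWSLT2016 de-en" => ["train.de", "train.en"],
)

# Hypothetical lookup; resolving a real datadep may trigger a download.
list_files(depname) = FAKE_DATADEPS[depname]

function find_file(src, dst, wanted)
    # Build the name from the requested pair instead of hardcoding "fr-en",
    # so the error path refers to the same datadep as the normal path.
    depname = "IWSLT2016 $(src)-$(dst)"
    files = list_files(depname)
    wanted in files && return wanted
    error("$(wanted) not found in $(depname); available files: $(join(files, ", "))")
end
```

Calling `find_file(:de, :en, "missing.txt")` raises an error that lists the files of the de-en entry only, without touching the fr-en entry.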

I tested the changes with the following script:

# test_iwslt.jl
using Transformers
using Transformers.Datasets # utilities for dataset 
using Transformers.Datasets: IWSLT # IWSLT datasets

# available languages for iwslt2016: :en, :cs, :ar, :fr, :de
src_lang = :de
dst_lang = :en

iwslt2016 = IWSLT.IWSLT2016(src_lang, dst_lang) # Create dataset

# get vocabulary from training data
vocab = get_vocab(iwslt2016)

# create dataset object
# each one is a 2-tuple of channels containing src sentence and dst sentence
training_set = dataset(Train, iwslt2016)
dev_set = dataset(Dev, iwslt2016)
test_set = dataset(Test, iwslt2016) # usually test set won't contain ground truth, but iwslt2016 somehow does

batch_size = 1
src_sent, dst_sent = get_batch(training_set, batch_size) # each one is a vector of sentences

...and it works for the language pairs I've tested ("en-de", "de-en", "fr-en" and "en-fr").

Regards,
maj0e

maj0e and others added 2 commits February 21, 2022 09:03
All IWSLT datasets are now on Google Drive. Also, the language pairs are
no longer provided as separate archives, but all in a single archive.
The IWSLT datadep was updated with the new Google Drive link and the
post_fetch_method adapted to extract the nested language pair archives.

When the requested file was not found in tunefile, an
error was thrown, which should have included the list of
available files in the datadeps. Here the wrong datadeps
string was used, which triggered the download of the fr-en
language pair.

@chengchingwen (Owner) commented

Looks great! Thanks!

@chengchingwen chengchingwen merged commit ce1dff3 into chengchingwen:master Feb 21, 2022
Linked issue: IWSLT2016 link outdated