
The configuration of data searching and downloading directories is not linked #142

Open
bhaddow opened this issue Jan 8, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@bhaddow
Contributor

bhaddow commented Jan 8, 2024

If you set DATA_PATH to something other than the default, downloads still succeed, but the downloaded data does not show up on the data listing page. This is because the download directory is hard-coded and does not follow DATA_PATH. It is also a relative path, so it resolves against whatever directory you run the server from.

Also, DATA_PATH is actually a glob, not a directory. Setting it to a directory results in no data being found, which is hard to debug.
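One way to make this easier to debug would be to validate the value at startup. The following is only a sketch: `check_data_path` is a hypothetical helper, not something that exists in the project, and the `*.*.gz` hint is an assumption based on the `$NAME.$LANG.gz` naming mentioned in this thread.

```python
import os


def check_data_path(value: str) -> str:
    """Return `value` unchanged, or raise ValueError with a hint.

    Hypothetical helper: it rejects values that cannot work as a glob
    over data files, so the mistake is reported up front instead of
    showing up as an empty data listing.
    """
    if os.path.isdir(value):
        # A plain directory was given where a glob is expected.
        hint = os.path.join(value, "*.*.gz")  # assumed naming scheme
        raise ValueError(
            f"DATA_PATH={value!r} is a directory, but a glob is expected, "
            f"for example {hint!r}"
        )
    if not any(ch in value for ch in "*?["):
        # No wildcard characters, so this can match at most one path.
        raise ValueError(
            f"DATA_PATH={value!r} contains no wildcard, but a glob is expected"
        )
    return value
```

Calling something like this once when the configuration is read would turn a silently empty listing into an explicit error message.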

@jelmervdl added the bug label on Jan 12, 2024
@jelmervdl
Collaborator

It probably shouldn't be a glob. I thought that flexibility would come in useful, but it just makes things complicated.

The pattern of the glob is not even free to choose. datasets.py specifically looks for files matching $NAME.$LANG.gz, so there have to be at least two dots in the filename to avoid issues:

datasets = [
    (name, list(files))
    for name, files in groupby(
        sorted(files, key=lambda entry: str(entry)),
        key=lambda entry: str(entry.relative_to(root)).rsplit('.', 2)[0])
]
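To illustrate what that grouping key does, here is a small standalone sketch. The function `group_datasets`, the directory `data/train-parts` as root, and the file names are made up for the example. Only the key expression is taken from the snippet above.

```python
from itertools import groupby
from pathlib import Path


def group_datasets(files, root):
    """Group files with the same key expression as the snippet above."""
    def key(entry):
        # Drop the last two dot-separated parts (expected: language and gz).
        return str(entry.relative_to(root)).rsplit('.', 2)[0]

    return [
        (name, [f.name for f in group])
        for name, group in groupby(sorted(files, key=str), key=key)
    ]


root = Path('data/train-parts')  # example root, not read from any config
files = [root / 'news.en.gz', root / 'news.de.gz', root / 'notes.gz']

groups = group_datasets(files, root)
print(groups)
# [('news', ['news.de.gz', 'news.en.gz']), ('notes', ['notes.gz'])]

# With only one dot, the split yields two pieces instead of three,
# so there is no separate language component:
print('notes.gz'.rsplit('.', 2))
# ['notes', 'gz']
```

The two `news` files end up in one dataset, while `notes.gz` forms a dataset of its own without a language part.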

Lol this was a todo all along:

# TODO: Derive this from DATA_PATH. The `train-parts` is a mtdata compatibility
# thing. I'm now used to also have a data/clean directory there, so keeping it.
DOWNLOAD_PATH = 'data/train-parts'
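A rough sketch of what deriving it from DATA_PATH could look like, as long as DATA_PATH stays a glob: take the leading path components that contain no wildcard. `download_dir_from_glob` is a hypothetical name, and the patterns in the comments are example values only, not necessarily the project's defaults.

```python
from pathlib import Path

WILDCARDS = set('*?[')


def download_dir_from_glob(pattern: str) -> Path:
    """Return the directory prefix of `pattern` that has no wildcards.

    Assumes the pattern has a wildcard somewhere. A pattern without one
    is returned whole, including its last component.
    """
    static_parts = []
    for part in Path(pattern).parts:
        if WILDCARDS & set(part):
            # First component with a wildcard: stop before it.
            break
        static_parts.append(part)
    return Path(*static_parts) if static_parts else Path('.')


# Example values only:
# download_dir_from_glob('data/train-parts/*.*.gz') gives Path('data/train-parts')
# download_dir_from_glob('/srv/corpora/*.*.gz') gives Path('/srv/corpora')
```

With an absolute DATA_PATH this also gives an absolute download directory, so it would no longer depend on where the server is started. A relative DATA_PATH would still resolve against the working directory.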
