Display corpus size in W&B #529

Closed
Tracked by #164
eu9ene opened this issue Apr 16, 2024 · 3 comments · Fixed by #720
Labels
weights and biases Integration with Weights and Biases

Comments

@eu9ene
Collaborator

eu9ene commented Apr 16, 2024

We should display the things we look at often in W&B. The size of the final merged corpus after deduplication is something I check periodically to understand how aggressive the cleaning is overall. As discussed with @gregtatum, we can also display the corpus size after each cleaning stage, which should probably be part of the analysis job.

@eu9ene eu9ene added the weights and biases Integration with Weights and Biases label Apr 16, 2024
@La0
Collaborator

La0 commented Jul 1, 2024

As discussed today, we would either

  • expose the list of TSV files to parse_tc_logs so it can count their lines and publish that number, or
  • count the number of lines directly in train.sh and pass it to parse_tc_logs so it publishes that number
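
For illustration, the first option could look roughly like the sketch below. It only covers the counting step; how parse_tc_logs would receive the file list and publish the result is not shown. The function names `count_lines` and `corpus_sizes` are hypothetical, and the sketch assumes the TSV files are either plain text or gzip-compressed.

```python
import gzip
from pathlib import Path


def count_lines(path: Path) -> int:
    """Count lines in a plain or gzip-compressed text file.

    Assumption: files ending in .gz are gzip-compressed, everything
    else is read as-is. Other compression formats are not handled.
    """
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rb") as f:
        # Iterating a binary file yields one item per line, including
        # a final line that has no trailing newline.
        return sum(1 for _ in f)


def corpus_sizes(tsv_files: list[Path]) -> dict[str, int]:
    """Map each file name to its line count (hypothetical helper)."""
    return {p.name: count_lines(p) for p in tsv_files}
```

The resulting dictionary could then be handed to whatever code publishes metrics to W&B.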

@eu9ene
Collaborator Author

eu9ene commented Jul 1, 2024

We could also provide the final OpusTrainer config to the parser. It includes paths to the training datasets.
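
A rough sketch of that idea, assuming the config has already been loaded into a dict (for example with a YAML parser) and that it has a top-level `datasets` mapping from dataset name to file path. The function name `dataset_line_counts` is hypothetical, and the sketch assumes the dataset files are uncompressed text, which may not match what the pipeline actually produces.

```python
from pathlib import Path


def dataset_line_counts(config: dict) -> dict[str, int]:
    """Count lines for every dataset listed in a parsed trainer config.

    Assumption: config["datasets"] maps a dataset name to the path of
    an uncompressed text file. A missing "datasets" key yields {}.
    """
    counts: dict[str, int] = {}
    for name, path in config.get("datasets", {}).items():
        with open(Path(path), "rb") as f:
            # One item per line, including an unterminated last line.
            counts[name] = sum(1 for _ in f)
    return counts
```

This keeps the parser independent of train.sh: it only needs the config file that training already produces.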

@eu9ene
Collaborator Author

eu9ene commented Jul 1, 2024

I think the idea of this ticket was also to display the size of the corpus after the different cleaning steps, but we can start by uploading only the size of the final corpus.
