perf(DPR-Data-Read): Improve data reading process #733

Merged: 2 commits merged on Mar 5, 2021

@voidful voidful commented Mar 4, 2021

Refer to issue #730

Improvements:

  • Support the jsonl format for loading large files.
  • Downsample data to reduce memory usage.
  • Add a passage id when none exists.
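
The three points above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the helper names (`iter_jsonl`, `reservoir_sample`, `add_missing_passage_ids`), the `max_samples` and `id_key` parameters, and the context field names (`positive_ctxs`, `negative_ctxs`, `hard_negative_ctxs`) are assumptions based on the common DPR data layout. Reservoir sampling is used here as one possible single-pass downsampling strategy; the PR may downsample differently.

```python
import json
import random
from typing import Dict, Iterable, Iterator, List, Optional


def iter_jsonl(path: str) -> Iterator[Dict]:
    """Yield one JSON object per non-empty line, so the whole file is never held in memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def reservoir_sample(items: Iterable[Dict], max_samples: Optional[int], seed: int = 42) -> List[Dict]:
    """Keep at most max_samples items in a single pass (reservoir sampling)."""
    if max_samples is None:
        return list(items)
    rng = random.Random(seed)
    kept: List[Dict] = []
    for i, item in enumerate(items):
        if len(kept) < max_samples:
            kept.append(item)
        else:
            # Replace an existing element with probability max_samples / (i + 1).
            j = rng.randint(0, i)
            if j < max_samples:
                kept[j] = item
    return kept


def add_missing_passage_ids(samples: List[Dict], id_key: str = "passage_id") -> List[Dict]:
    """Assign a running id to every passage that has none (field names are assumed)."""
    next_id = 0
    for sample in samples:
        for field in ("positive_ctxs", "negative_ctxs", "hard_negative_ctxs"):
            for passage in sample.get(field, []):
                if id_key not in passage:
                    passage[id_key] = str(next_id)
                    next_id += 1
    return samples
```

Under these assumptions, a call such as `add_missing_passage_ids(reservoir_sample(iter_jsonl(path), max_samples=1000))` would stream the file, keep at most 1000 samples, and fill in any missing ids, where `path` points to a jsonl file.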

Support jsonl format, downsampling data, add passage id when not exist.

@Timoeller Timoeller left a comment

Looks very good already. Thanks for the changes.

  • Let's keep max samples.
  • I made a comment about a failing test.

Although the downsampling now exists twice in our codebase, I vote for keeping it in our TextSimilarityProcessor as well, since we might use it for other datasets. Or do you have any strong preferences?

farm/data_handler/utils.py (resolved)
farm/data_handler/utils.py (outdated, resolved)
@Timoeller Timoeller self-requested a review March 5, 2021 08:23

@Timoeller Timoeller left a comment

Very nice, thanks for the suggested changes.

@Timoeller Timoeller merged commit fc82058 into deepset-ai:master Mar 5, 2021