Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not do rereading step in fread #2558

Merged
merged 40 commits into from
Aug 6, 2020
Merged

Do not do rereading step in fread #2558

merged 40 commits into from
Aug 6, 2020

Conversation

st-pasha
Copy link
Contributor

@st-pasha st-pasha commented Aug 5, 2020

Since the very early days, the design of fread was such that if any "type-bumps" happen in the middle of a file, then the corresponding column(s) will be marked as "require re-reading", and then at the end we would re-parse the entire file using the new column type. This approach has obvious drawbacks:

  • the amount of time necessary to read the file almost doubles (and type bumps happen more often than you'd think);
  • the logic behind type-bumping is very error-prone;
  • it is impossible to read a stream-like input, where the data cannot be arbitrarily rewound;

The new approach for handling type-bumping is the following:

  • When a type-bump occurs while reading a chunk, we enter the ordered section and temporarily suspend execution of other threads;
  • While all other threads are paused, we:
    • "archive" the column which was type-bumped;
    • update the global types array with the new column parse types;
  • After that, the parallel execution resumes from the start of the type-bumped chunk;
  • (the process above may occur multiple times with different columns or different chunks);
  • In the end, when the final frame is constructed, the columns that were comprised of multiple chunks are rbound together with automatic type promotion.

Closes #1843
Closes #1446

st-pasha added 30 commits July 28, 2020 11:16
@st-pasha st-pasha added improve Improvement of an existing functionality fread Issues related to parsing any input files via fread function labels Aug 5, 2020
@st-pasha st-pasha added this to the Release 0.11.0 milestone Aug 5, 2020
@st-pasha st-pasha self-assigned this Aug 5, 2020
ci/ext.py Outdated Show resolved Hide resolved
@@ -184,6 +208,7 @@ class ParserIterator {
value_type operator*() const;
};

// unused?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we remove ParserIterable ParserLibrary::successor_types(dt::read::PT pt) const; method (that is not used anywhere anyways) ParserIterable and ParserIterator classes could be safely removed too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we need to do more cleanup around the ParserLibrary in general, but it's better to defer to a future PR.

src/core/parallel/parallel_for_ordered.cc Outdated Show resolved Hide resolved
src/core/read/output_column.cc Outdated Show resolved Hide resolved
src/core/tests/test_parallel_for_ordered.cc Show resolved Hide resolved
@st-pasha st-pasha merged commit ca2f1e0 into master Aug 6, 2020
@st-pasha st-pasha deleted the ps-typebumps branch August 6, 2020 22:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fread Issues related to parsing any input files via fread function improve Improvement of an existing functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Eliminate "Rereading" pass in fread? Improve re-read behavior in fread
2 participants