Eliminate "Rereading" pass in fread? #1843
The easiest solution for allowing streaming seems to be:
All these options sound easy, and the choice could be left to the user.
Since the very early days, the design of fread was such that if any "type-bumps" happen in the middle of a file, the corresponding column(s) are marked as "require re-reading", and at the end we re-parse the entire file using the new column types. This approach has obvious drawbacks:

- the amount of time necessary to read the file almost doubles (and type bumps happen more often than you'd think);
- the logic behind type-bumping is very error-prone;
- it is impossible to read a stream-like input, where the data cannot be arbitrarily rewound.

The new approach for handling type-bumping is the following:

- When a type-bump occurs while reading a chunk, we enter the ordered section and temporarily suspend execution of the other threads;
- While all other threads are paused, we:
  - "archive" the column which was type-bumped;
  - update the global types array with the new column parse types;
- After that, the parallel execution resumes from the start of the type-bumped chunk;
- (the process above may occur multiple times, with different columns or different chunks);
- In the end, when the final frame is constructed, the columns that were comprised of multiple chunks are rbound together with automatic type promotion.

Closes #1843
Closes #1446
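The chunk-retry scheme described above can be sketched as follows. This is a simplified, single-threaded Python illustration (the real implementation is multi-threaded C++ inside fread); `parse_value`, `read_column`, and the single int-to-float promotion are illustrative assumptions, not datatable's actual API:

```python
def parse_value(text, typ):
    """Try to parse `text` as `typ`; a ValueError signals a type-bump."""
    return int(text) if typ is int else float(text)

def read_column(chunks):
    typ = int          # initial auto-detected type (assumed for this sketch)
    archived = []      # "archived" pieces, each with the type it was parsed under
    current = []       # values parsed under the current type
    for chunk in chunks:
        while True:
            try:
                parsed = [parse_value(v, typ) for v in chunk]
                break
            except ValueError:
                # Type bump: archive what we have, promote the type,
                # and re-parse only this chunk -- not the whole file.
                # (Only one int -> float bump is handled in this sketch.)
                archived.append((typ, current))
                current, typ = [], float
        current.extend(parsed)
    archived.append((typ, current))
    # Final "rbind" of the archived pieces, promoting to the widest type seen.
    final = max((t for t, _ in archived), key=lambda t: t is float)
    return [final(v) for _, vals in archived for v in vals]

print(read_column([["1", "2"], ["3.5", "4"], ["5"]]))
# -> [1.0, 2.0, 3.5, 4.0, 5.0]
```

Note that when the bump happens, only the current chunk is re-parsed; the already-read prefix is converted during the final concatenation rather than re-read from disk.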
The current design of fread's algorithm follows these rough steps:

1. Scan a sample of the file and auto-detect the type of each column.
2. Read the entire file in parallel chunks, parsing every column with its detected type.
3. If any column encountered a value that does not fit its detected type, re-read the whole file using the corrected types.
This algorithm was designed based on the assumption that type-detection in stage 1 would be reliable, that stage (3) would occur only rarely, and consequently its impact would be minimal. In practice, however, we see that the "rereading" pass occurs way too often, essentially doubling the dataset parse time. Moreover, the larger the input file (especially when it's > 100GB), the more likely that it would have to be re-read, which means we are very likely to double the already quite long read time.
In addition to the obvious time cost of the re-reading pass, it affects us in more subtle ways too. In particular, it prevents us from designing a single-pass algorithm, which means `fread` cannot consume data from a stream. Instead, we have to dump the content of the stream into a temporary file, fread that file, and then delete it. These are all overheads that could have been avoided if we knew that the data needs to be read only once.

So, can the "re-reading" pass be eliminated? It can, but there are certain costs associated with such a redesign.
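The temp-file workaround described above can be sketched as follows. Here `parse_file` is a hypothetical stand-in for a two-pass parser such as fread (it is free to read the file twice); only `tempfile`, `os`, and `io` are real standard-library modules:

```python
import io
import os
import tempfile

def parse_file(path):
    # Stand-in for a two-pass parser: unlike a stream, a file on disk
    # can be re-opened and re-read as many times as needed.
    with open(path) as f:
        return [line.split(",") for line in f.read().splitlines()]

def parse_stream(stream):
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(stream.read())   # dump the whole stream to disk first
        return parse_file(path)        # parse (possibly re-reading) the file
    finally:
        os.remove(path)                # delete the temporary copy

print(parse_stream(io.StringIO("a,b\n1,2\n")))
# -> [['a', 'b'], ['1', '2']]
```

The extra disk write, disk space, and cleanup are exactly the overheads the text refers to: they exist only because the parser may need a second pass.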
Remember that the need to "re-read" a column arises whenever we encounter a value that cannot be understood within the current column's type. For example, suppose a column starts out as integer, `2, 7, 11, 0, ...`, and then all of a sudden we see the value `3.4` in it. Then we say "oh well, looks like we'll need to reread this entire column" (and in practice that means rereading the entire file). But we could have approached this differently. We could have said "let's convert all the values we've read so far into doubles, and continue reading this column as type double". In practice it's slightly more complicated, because multiple threads have to maintain a consistent view of what the column's type is; but this is doable.

Cases like this are "easy", in the sense that they involve conversions that are unambiguous. There are, however, certain cases where the conversion is ambiguous, in particular when converting values into strings. For example, consider a column containing the values
`0, +0, -0, 000`. These would be detected as the integers `0, 0, 0, 0`, as they should be. However, if we later find that the column has to be converted to string, a problem arises: the integer values that we have will be converted into four strings `"0"`; the information that those zeros originally all looked different as strings is lost.

It is because of these uncertainties that the "rereading" pass was deemed an absolute necessity. Without it, an important invariant could be violated: reading data with auto-detected types could produce different results than reading data with explicitly specified types (the same types as were auto-detected).
So the question really is: is this invariant truly important? Are we willing to sacrifice speed (by a factor of 2 or even more), and sometimes disk usage (also by a factor of 2), in order to satisfy it? Or should this choice be deferred to the user?