Should we incorporate memory-mapping io into python runtime? #277
Comments
If that is stable and straightforward enough, and carries no additional dependencies, we can probably incorporate that as |
I agree that memory-mapped IO should be included: it adds no dependencies and allows parsing gigabyte-sized files. Note that this upgrades capabilities, not performance/speed. |
I'll toss in a vote in favor of this as well. (I am unable to use the visualization tool on my data, owing to its size....) |
@webbnh |
I don't think mmap has anything to do with the ability to handle gigabyte-sized files. KS by itself handles them perfectly well (see, for example, most filesystem implementations): it boils down to the format spec being lazy, not to whether mmap is used. Support for all of this in visualization tools is yet another topic. That, again, is somewhat related to mmap support, but in the hex viewer components of those tools, not in KS per se. |
Hmm, it seems I misunderstood the topic. Now that I think of it, file streams do not need to load entire files to read them. I apologise for speaking out without due understanding. |
As for the topic itself, it's probably worth supporting both mmap and regular read/write modes, if there are no major blockers, and letting the user decide. The Java runtime already incorporates it as 2 classes:
Note that there's also one not-so-obvious difference between mmap-based and read-based implementations. With mmap, you need to know the mapped region's size beforehand, so you can only deal with fixed-size files. Files that grow during processing, or files that report their size incorrectly (like the virtual files in Linux's /proc or /sys filesystems), can't be read that way, and being able to read them can be really useful sometimes (for example, when running a parser on some EEPROM dump). |
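To make the size issue concrete, here is a minimal Python sketch of a loader that tries mmap first and falls back to a plain read when the file cannot be mapped. `load_bytes` is a hypothetical helper invented for illustration and is not part of any Kaitai Struct runtime; the only library behaviour it relies on is that `mmap.mmap(fileno, 0)` maps the file at its reported size and refuses to map an empty file.

```python
import mmap
import os
import tempfile


def load_bytes(path):
    """Return (data, how), where how is "mmap" or "read".

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as f:
        try:
            # Length 0 means "map the whole file at its reported size".
            mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        except (ValueError, OSError):
            # Files that report a size of zero (an empty file, or a
            # virtual file that does not know its size) cannot be mapped.
            # read() does not need the size: it consumes bytes until EOF.
            return f.read(), "read"
        # The mapping stays valid after the file object is closed.
        return mapped, "mmap"
```

Both return values support slicing, so a caller can treat them alike for read-only access; the second element only reports which path was taken.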
That is another reason to not implement mmap. |
We're pretty comfortable with our current use of KS, so moving to a replacement is not very appealing. :-) We've coped by using KS to read our file one record at a time (with some lazy evaluation), instead of giving KS the definition of the entire file. (And, in fact, we've ended up benefiting from this choice.) As for the difference between mmap and read, it turns out that mmap is usually not a win if you lack sufficient virtual memory to map the entire file. (Knowing the file size is helpful, but it's not an absolute requirement, as the mapping can be changed as things run.) In our case, in certain deployments, we are very memory-limited, which means we have to read only the records that we are working with at the time and then reuse the underlying buffers. (It's true that you can do the equivalent using mmap, but it requires a very active management of the mapping, and at that point you might as well just read the file.) |
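The record-at-a-time approach with buffer reuse can be sketched roughly as follows in Python. The record layout here (a 4-byte little-endian length followed by the payload) and the name `iter_records` are assumptions made up for the example, not the actual format or code discussed above.

```python
import io
import struct


def iter_records(stream, max_payload=64 * 1024):
    """Yield a memoryview over each record payload, reusing one buffer.

    Assumed (hypothetical) layout: u4le length, then `length` payload bytes.
    Each yielded view is only valid until the next iteration.
    """
    buf = bytearray(max_payload)   # allocated once, reused for every record
    view = memoryview(buf)
    header = bytearray(4)
    while True:
        n = stream.readinto(header)
        if n == 0:
            return                 # clean end of stream
        if n != 4:
            raise EOFError("truncated record header")
        (length,) = struct.unpack("<I", header)
        if length > max_payload:
            raise ValueError("record larger than buffer")
        # Read the payload directly into the shared buffer.
        if stream.readinto(view[:length]) != length:
            raise EOFError("truncated record payload")
        yield view[:length]
```

Memory use stays bounded by `max_payload` regardless of file size, at the cost that a caller who wants to keep a payload must copy it (for example with `bytes(view)`) before asking for the next record.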
Perhaps this would be a good moment to go back to the topic of lazy parsing in Kaitai? Construct does have lazy Struct, Sequence, and Range classes. |
One of the challenges we have which was helped by reading a record at a time is that occasionally there are "corruptions" in the record stream. Since we are working in C++, KS seems to be limited in its ability to deal with the situation (e.g., there is no "debug mode" available, so we cannot look at the partial results if the parse fails), and so we have to be a little pro-active: we read the record first (with the payload described by an instance so that it isn't evaluated immediately); we then validate the record; and finally evaluate the payload by referencing an instance with a big So, having KS read the entire file for us, even with lazy evaluation wouldn't really address the problem of dealing with corruptions. |
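The read, validate, then evaluate sequence can be illustrated with a small Python sketch. Everything in it is hypothetical: the `Record` class, the layout (u4le length, u4le CRC-32 of the payload, then the payload), and the use of a CRC as the validation step are stand-ins chosen for the example, not the C++ code or format described above. The lazily evaluated payload plays the role of a KS instance.

```python
import struct
import zlib
from functools import cached_property


class Record:
    """Hypothetical record: u4le length, u4le CRC-32, then the payload."""

    def __init__(self, raw):
        # Eager part: only the fixed-size header is interpreted here.
        self.length, self.crc = struct.unpack_from("<II", raw, 0)
        self._raw_payload = raw[8:8 + self.length]

    def is_valid(self):
        # Validation touches the raw bytes, not the parsed payload.
        return (len(self._raw_payload) == self.length
                and zlib.crc32(self._raw_payload) == self.crc)

    @cached_property
    def payload(self):
        # Stand-in for the expensive parse; runs only on first access.
        return self._raw_payload.decode("utf-8")


def parse_valid(raws):
    """Return the payloads of records that pass validation; skip the rest."""
    return [r.payload for r in map(Record, raws) if r.is_valid()]
```

Because `payload` is never touched for a record that fails `is_valid()`, a corrupted record cannot make the payload parse blow up halfway through.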
@webbnh Technically, it's not very hard to implement that part of "debug mode" as a separate mode (like an "exception-safe mode"?) and to do it for all languages. It could be useful for many purposes, not only visualization. |
An example of how this can be done is at the link