Should we incorporate memory-mapping io into python runtime? #277

KOLANICH · 2017-10-06T11:20:32Z

An example of how this can be done is available at the link.

GreyCat · 2017-10-06T17:50:19Z

If that is stable and straightforward enough, and carries no additional dependencies — we can probably incorporate that as from_file_mmap or some helper like that, if you want?
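A minimal sketch of what such a helper might build on, using only the standard library. The function name `open_mmap_stream` is hypothetical and is not part of the runtime's API. The point is only that Python's `mmap` object already exposes `read`, `seek` and `tell`, so it could presumably be handed to a stream wrapper in place of a regular file object.

```python
import mmap
import os
import struct
import tempfile


def open_mmap_stream(path):
    """Hypothetical helper: return (file, mmap); the mmap is file-like."""
    f = open(path, "rb")
    try:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    except Exception:
        f.close()
        raise
    return f, m


# Demo: write a tiny file, then parse it through the mapping.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(struct.pack("<4sI", b"DEMO", 1234))
    demo_path = tmp.name

f, m = open_mmap_stream(demo_path)
try:
    magic = m.read(4)                          # sequential read, like a file
    (value,) = struct.unpack("<I", m.read(4))
    m.seek(0)                                  # random access also works
    position_after_seek = m.tell()
finally:
    m.close()
    f.close()
    os.unlink(demo_path)
```

Both the file and the mapping have to stay open for as long as the parsed object may still read from them, which is one detail a real helper would need to decide how to handle.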

arekbulski · 2018-01-18T07:30:09Z

I agree that memory-mapped IO should be included: no dependencies, allows parsing gigasized files. Note that this upgrades capabilities, not performance/speed.

webbnh · 2018-01-18T08:00:39Z

I'll toss in a vote in favor of this as well. (I am unable to use the visualization tool on my data, owing to its size....)

arekbulski · 2018-01-18T08:05:00Z

@webbnh
Not to advertise competitive projects on this forum, but I am Construct's developer (you might not know that) and @GreyCat will probably forgive me that...
Construct can handle gigasized files, and it has Lazy* classes for parsing. Just in case you were not aware. Kaitai may implement mmap eventually but I would not hold my breath until they do. This topic is not new.

GreyCat · 2018-01-18T09:27:25Z

I don't think that mmap has anything to do with the ability to handle gigabyte-sized files. KS by itself handles them perfectly (for example, see most filesystem implementations): it boils down to the format spec being lazy, not to the usage or lack of usage of mmap.

Support of all that stuff by visualization tools is yet another topic. It, again, is somehow related to mmap support, but for hex viewer components of these visualization tools, not for KS per se.
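To illustrate the laziness point, here is a rough sketch of a lazily evaluated field over an ordinary seekable stream. The class `LazyRecord` and its length-prefixed layout are assumptions made up for this example, not code generated by KS. Only the small header is read eagerly; the body is skipped with a seek and read on first access.

```python
import io
import struct


class LazyRecord:
    """Hypothetical parsed object: header read eagerly, body on demand."""

    def __init__(self, stream):
        self._io = stream
        # Eager part: a 4-byte little-endian body length.
        (self.body_len,) = struct.unpack("<I", stream.read(4))
        self._body_pos = stream.tell()
        self._body = None
        # Skip over the body without reading it.
        stream.seek(self.body_len, io.SEEK_CUR)

    @property
    def body(self):
        # Read the body only when first requested, then cache it.
        if self._body is None:
            saved = self._io.tell()
            self._io.seek(self._body_pos)
            self._body = self._io.read(self.body_len)
            self._io.seek(saved)  # restore the position for other readers
        return self._body


data = struct.pack("<I", 3) + b"abc" + struct.pack("<I", 2) + b"xy"
stream = io.BytesIO(data)
first = LazyRecord(stream)
second = LazyRecord(stream)
```

Neither body has been read after the two constructors return, so memory use depends on which fields are accessed rather than on the file size, with or without mmap.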

arekbulski · 2018-01-18T09:30:50Z

Hmm, it seems that I misunderstood the topic. Now that I think of it, file streams do not need to load entire files to read them. I apologise for speaking out without due understanding.

GreyCat · 2018-01-18T09:36:21Z

As for the topic itself, it's probably worth supporting both mmap and regular read/write modes, if there are no major blockers, and letting the user decide. The Java runtime already does this with two classes:

RandomAccessFileKaitaiStream is a regular file-based stream
ByteBufferKaitaiStream is a mmap-based stream

Note that there's also one not-so-obvious difference between mmap-based and read-based implementations. With mmap, you need to know the mapped region's size beforehand, so you can only deal with fixed-size files. Files that grow while being processed, or files that report their size wrong (like system virtual files in the /proc or /sys Linux filesystems), can't be read that way, and reading such files can be really useful sometimes (for example, running a parser on some EEPROM dump).
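A small sketch of how this limitation shows up in Python, and of one possible fallback. The function `read_all` is a hypothetical name, and the empty temporary file below only stands in for a file that reports a size of 0. Python's `mmap.mmap` raises `ValueError` when asked to map such a file in full, while a plain `read()` still works.

```python
import mmap
import os
import tempfile


def read_all(path):
    """Try mmap first; fall back to read() when the file cannot be mapped."""
    with open(path, "rb") as f:
        try:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                return "mmap", m[:]
        except (ValueError, OSError):
            # ValueError: the file reports size 0, so there is nothing to map.
            # OSError: the underlying file does not support mapping.
            return "read", f.read()


with tempfile.TemporaryDirectory() as d:
    empty_path = os.path.join(d, "empty.bin")
    full_path = os.path.join(d, "full.bin")
    open(empty_path, "wb").close()
    with open(full_path, "wb") as out:
        out.write(b"hello")
    empty_result = read_all(empty_path)
    full_result = read_all(full_path)
```

Copying the whole mapping with `m[:]` is done here only to keep the demo short; a real parser would read from the mapping piece by piece.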

arekbulski · 2018-01-18T09:49:48Z

That is another reason to not implement mmap.

webbnh · 2018-01-20T16:27:21Z

We're pretty comfortable with our current use of KS, so moving to a replacement is not very appealing. :-)

We've coped by using KS to read our file one record at a time (with some lazy evaluation), instead of giving KS the definition of the entire file. (And, in fact, we've ended up benefiting from this choice.)

As for the difference between mmap and read, it turns out that mmap is usually not a win if you lack sufficient virtual memory to map the entire file. (Knowing the file size is helpful, but it's not an absolute requirement, as the mapping can be changed as things run.) In our case, in certain deployments, we are very memory-limited, which means we have to read only the records that we are working with at the time and then reuse the underlying buffers. (It's true that you can do the equivalent using mmap, but it requires a very active management of the mapping, and at that point you might as well just read the file.)
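The read-and-reuse approach described above could look roughly like this in Python. It is a sketch under assumptions: the fixed `RECORD_SIZE` and the name `iter_records` are made up for illustration, and real records would more likely be variable-length. One buffer is allocated once and refilled with `readinto` for every record.

```python
import io

RECORD_SIZE = 8  # hypothetical fixed record size, for illustration only


def iter_records(stream, record_size=RECORD_SIZE):
    """Yield each full record as a memoryview into one reused buffer."""
    buf = bytearray(record_size)
    view = memoryview(buf)
    while True:
        # For buffered files and BytesIO, readinto fills the buffer
        # unless the end of the data is reached.
        n = stream.readinto(buf)
        if n < record_size:
            break  # end of data, or a trailing partial record
        yield view


stream = io.BytesIO(bytes(range(20)))
# Each view must be consumed before the next iteration overwrites the buffer.
sums = [sum(record) for record in iter_records(stream)]
```

Memory use stays at one record regardless of file size, which is the property that matters on memory-limited deployments.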

arekbulski · 2018-01-20T16:57:13Z

Perhaps this would be a good moment to go back to the topic of lazy parsing in Kaitai? Construct does have lazy Struct, Sequence and Range classes.

webbnh · 2018-01-22T21:44:21Z

One of the challenges we have which was helped by reading a record at a time is that occasionally there are "corruptions" in the record stream. Since we are working in C++, KS seems to be limited in its ability to deal with the situation (e.g., there is no "debug mode" available, so we cannot look at the partial results if the parse fails), and so we have to be a little pro-active: we read the record first (with the payload described by an instance so that it isn't evaluated immediately); we then validate the record; and finally evaluate the payload by referencing an instance with a big switch-on in it. If the record doesn't validate, we set the stream back to one byte ahead of where it was previously and try reading the record again.

So, having KS read the entire file for us, even with lazy evaluation, wouldn't really address the problem of dealing with corruptions.
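The resynchronization loop described above, sketched in Python rather than C++ and without KS. The record layout (a magic byte, a one-byte length, the payload, and a one-byte checksum) is entirely hypothetical and chosen only so that there is something to validate. When validation fails, the stream is set back to one byte past the previous starting position and the read is retried.

```python
import io

MAGIC = 0xA5  # hypothetical record marker


def read_record(stream):
    """Return the payload if a valid record starts here, else None."""
    header = stream.read(2)
    if len(header) < 2 or header[0] != MAGIC:
        return None
    length = header[1]
    body = stream.read(length + 1)  # payload plus one checksum byte
    if len(body) < length + 1:
        return None
    payload, checksum = body[:-1], body[-1]
    if sum(payload) % 256 != checksum:
        return None
    return payload


def scan_records(stream):
    """Collect valid records, advancing one byte after each failed attempt."""
    records = []
    size = stream.seek(0, io.SEEK_END)
    pos = 0
    while pos < size:
        stream.seek(pos)
        payload = read_record(stream)
        if payload is None:
            pos += 1  # resync: retry one byte past the failed start
        else:
            records.append(payload)
            pos = stream.tell()
    return records


def make_record(payload):
    """Build a record in the hypothetical layout, for the demo below."""
    return bytes([MAGIC, len(payload)]) + payload + bytes([sum(payload) % 256])


# Two good records with two garbage bytes between them.
data = make_record(b"one") + b"\x00\xff" + make_record(b"two")
records = scan_records(io.BytesIO(data))
```

Because each attempt parses a single record, a failure costs only that record, which is harder to arrange when one parser call covers the whole file.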

GreyCat · 2018-01-23T01:48:53Z

@webbnh Technically it's not very hard to implement that part of "debug mode" as a separate mode (like "exception-safe mode"?), and do it for all languages. It could be useful for many purposes, not only visualization.

GreyCat added the enhancement label Oct 6, 2017

KOLANICH mentioned this issue Jan 18, 2018

Cython runtime and compile target #311

Closed

KOLANICH mentioned this issue Mar 2, 2020

Kaitai and scientific data #711

Open

KOLANICH mentioned this issue Apr 25, 2022

reducing I/O by using buffers and memoryviews kaitai-io/kaitai_struct_python_runtime#67

Open
