pyvips.Image.new_from_array from np.mmap vs hdf5 #492
Hi @rgluskin, could you share some sample code that shows the problem? How are you saving in HDF5? Is this with …?
As far as I know, HDF5 is not a good format for large images; it needs huge amounts of memory. I would use (maybe) uncompressed pyramidal tiled TIFF, or perhaps jpeg-compressed, depending on the image size.
Are you sure you need to save intermediates? If you stick to pyvips, you can do most processing with no need for intermediates. What operation is forcing you to save? |
I'm running various processing, including classical vision and deep learning pipelines.
Disk size is not an issue; only having to reallocate the whole image to RAM is (when …). |
I think loading HDF5 from disc will need a lot of memory, won't it? And it'll use amazing amounts of disc space -- your WSI scans will probably have been through jpeg already, so there's little chance of quality loss, I'd think. You need to make a complete test program I can run that shows the problem. Otherwise I'll waste 30 minutes making a test myself, do something slightly different from your code, and not hit the same issue. Or that's what usually happens to me :( |
Sorry, here's the full code to reproduce:
|
and this snippet works.
My default course of action would be to refactor all such hdf5 usages to use np.memmap instead. |
Great! Thanks for that. I tried:

```
$ VIPS_PROGRESS=1 ./rgluskin.py
mapping ...
making vips image ...
writing ...
rgluskin.py temp-2: 192000 x 100000 pixels, 32 threads, 192000 x 1 tiles, 640 lines in buffer
rgluskin.py temp-2: 24% complete
...
```

And watched it run in `top`. It allocated 78gb of VIRT at the start, then as the save executes, RES slowly creeps up until, when the save ticker reaches 100%, it's equal to VIRT.

I think this is probably unavoidable with HDF5 files opened via numpy.

Does it have to be HDF5? TIFF (for example) should work well, eg. with your test file in TIFF format I see:

```
$ VIPS_PROGRESS=1 /usr/bin/time -f %M:%e vips copy temp.tiff x.tif[tile,pyramid,compression=jpeg]
vips temp-6: 192000 x 100000 pixels, 32 threads, 128 x 128 tiles, 640 lines in buffer
vips temp-6: done in 151s
memory: high-water mark 1.02 GB
1368380:151.39
```

1.3gb of peak memory use. Q85 jpeg should be no worse than LZW (assuming your WSI scanner outputs something like SVS), and much faster. |
No, direct numpy works fine without hdf5. It has some other flexibility limitations, such as having to store a single tensor per file, but I can refactor my code around that.
|
Hi.
I'm performing manipulations on WSIs and saving intermediate results in HDF5. The end results are being saved to a new WSI using `pyvips.Image.new_from_array`. However, when running on very large slides I get OOM exceptions.
I did a quick experiment and saw that saving the intermediate results in `np.memmap` doesn't cause this issue.
From the pyvips code it seems that the difference stems from the `__array_interface__` vs `__array__` attributes (`memmap` has both, but hdf5 only has the latter). However, the fields that are actually accessed, such as `dtype`, are present in hdf5 anyway.
What would be your recommendation? Should I rewrite my whole code to use `memmap`? Or is there a more elegant approach?
Thanks in advance.
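(The attribute difference described above is easy to verify from Python; a small sketch using throwaway temp files, nothing here is from the original report.)

```python
# Check which array protocols np.memmap and an h5py dataset expose.
import os, tempfile
import h5py
import numpy as np

d = tempfile.mkdtemp()

# np.memmap is an ndarray subclass: it has __array_interface__, so a
# consumer like pyvips can wrap its buffer without copying.
m = np.memmap(os.path.join(d, "t.dat"), dtype=np.uint8, mode="w+",
              shape=(4, 4, 3))
print(hasattr(m, "__array_interface__"))  # True
print(hasattr(m, "__array__"))            # True

# An h5py Dataset only implements __array__, which reads the whole
# dataset into memory when called.
with h5py.File(os.path.join(d, "t.h5"), "w") as f:
    ds = f.create_dataset("img", data=np.zeros((4, 4, 3), dtype=np.uint8))
    print(hasattr(ds, "__array_interface__"))  # False
    print(hasattr(ds, "__array__"))            # True
```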