Quantization: reduce memory footprint without compromising speed and quality #1350
Comments
I'm afraid I don't recall much detail of those conversations myself 😄 I did your option 1 for mine, using an octree for the histogram, palette generation, and palette lookup. Unfortunately, due to C#'s lack of first class support for discriminated unions and the JIT's insistence on inserting null checks when dereferencing a member from a struct ptr, the code ended up being quite ugly for perf reasons. But it gives really nice visual results with only about 280KiB of rented pool memory, and in my limited testing was on average an order of magnitude faster than the current ImageSharp octree implementation.

You can find my implementation here: https://github.com/saucecontrol/PhotoSauce/blob/master/src/MagicScaler/Magic/ColorQuantizer.cs

Mine differs quite a bit from any other octree implementation I've seen, which makes it a little difficult to follow, but there's method to any madness you see in there, and I'm happy to answer any questions. Unfortunately I haven't the drawing skills to do good diagrams of how it works, so the code will have to speak for itself for now. The major architectural differences:
The same ideas could be applied to support RGBA in a 'hexadecitree' with 4 bits per node index instead of 3. The alpha channel could be weighted less or quantized in advance. I was only going for GIF support initially, so I haven't yet done any experiments to figure out the best number of nodes, etc. |
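To make the 'hexadecitree' idea concrete, here is a minimal Python sketch (illustrative only, not code from either library; the function name is invented) of how a 4-bit child index could be derived from an ARGB color at a given tree depth, taking one bit per channel instead of the 3-bit index an RGB octree uses:

```python
def hexadecitree_index(a, r, g, b, level):
    """Pack one bit from each of A, R, G, B at this depth into a
    4-bit child index (0-15), the RGBA analogue of the octree's
    3-bit (0-7) child index."""
    shift = 7 - level  # level 0 examines the most significant bit
    return (((a >> shift) & 1) << 3) | \
           (((r >> shift) & 1) << 2) | \
           (((g >> shift) & 1) << 1) | \
           ((b >> shift) & 1)
```

With 16 children per node instead of 8, the tree fans out much faster, which is why the alpha channel would likely need to be pre-quantized or weighted down as suggested above.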
@saucecontrol thanks for the very detailed reply! What is the pixel format of Our big problem is that currently we need the following cache to produce good enough quality when images are transparent: ImageSharp/src/ImageSharp/Processing/Processors/Quantization/OctreeQuantizer{TPixel}.cs Lines 114 to 120 in b06cb32
@JimBobSquarePants in case of pixels having no alpha channel, we likely don't need this path. For users caring about perf we can suggest then to work with |
@antonfirsov I found that the difference in quality was noticeable with or without transparency when I experimented with removing the color map entirely. Looking back at |
My implementation works only on BGR24 input. In the final palette derivation step, when merging nodes, I convert the colors to 32bpc linear for better blending/averaging results. I also have only a single dithering algorithm (a modified Floyd-Steinberg) integrated directly into the palette matching for perf reasons.

I didn't really investigate which part of the speed differences between mine and the current ImageSharp implementation comes from the histogram vs lookup vs dithering parts of the code, but if your main concern is reducing memory, I do believe you can eliminate the cache without negative perf impact as long as your octree traversal is fast. And of course a faster octree speeds up your histogram step as well, so it may be a win-win.

The key to keeping memory usage under control while keeping accuracy is to allow the lookup octree to reach full 8 bit depth for the nodes that contain your palette entries but then restrict the depth for parts of the gamut not covered by the palette. My implementation caps that at 4 bits (a max of 4096 nodes) so I can fit everything in my 256KiB slab. You give up some accuracy on those, but since they represent colors not in the original image anyway (they're used for colors created by dithering error propagation), it's not a big deal.

Again, my use case was animated GIF, so it was very important to me to get it completely optimized for processing hundreds of RGB frames at a go. I don't know if you'd want to generalize your implementation more to allow the same code to handle RGB and RGBA, but if it were me, I'd have separate implementations for those. The RGBA path would naturally require more memory, but my guess would be that around 2MiB would do it if you do some psychovisual trickery with the alpha channel ;) |
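A toy illustration of the variable-depth idea described above, using nested Python dicts in place of the pooled struct arrays the real implementation uses (the names and dict representation are invented for this sketch): paths that hold actual palette entries descend to the full 8-bit depth, while everything else stops at 4 bits, bounding that part of the tree at 8**4 = 4096 leaf positions.

```python
FULL_DEPTH = 8    # paths holding real palette entries keep full 8-bit accuracy
CAPPED_DEPTH = 4  # everything else stops early: at most 8**4 = 4096 leaves

def octant(color, level):
    # one bit from each of R, G, B selects one of 8 children at this depth
    shift = 7 - level
    r, g, b = color
    return (((r >> shift) & 1) << 2) | (((g >> shift) & 1) << 1) | ((b >> shift) & 1)

def insert(root, color, palette_index, is_palette_entry):
    depth = FULL_DEPTH if is_palette_entry else CAPPED_DEPTH
    node = root
    for level in range(depth):
        node = node.setdefault(octant(color, level), {})
    node["index"] = palette_index  # leaf payload: which palette entry to emit
```

The shallow leaves lose accuracy, but as noted above they only serve colors introduced by dithering error propagation, so the trade-off is cheap.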
ImageSharp/src/ImageSharp/Processing/Processors/Quantization/EuclideanPixelMap{TPixel}.cs Lines 20 to 21 in b06cb32
@JimBobSquarePants Note: converting pixel spans to |
@antonfirsov I only mention the My biggest concern here is that we are focusing too much on Octree. All of the quantizers use the pixel map to improve output quality (due to naïve color distance formula) and two of the quantizers depend on it entirely for palette application. ImageSharp/src/ImageSharp/Processing/Processors/Quantization/PaletteQuantizer{TPixel}.cs Lines 63 to 64 in b06cb32
So... perhaps we can do something to allow fast linear color space conversion using LUTs, since we're working with pixel formats with byte accuracy in the quantizers (@saucecontrol I know you do this, but I can never read your code; I'm convinced you've been sent here from the future). That could improve the accuracy of the built-in palette matching mechanisms. Maybe if the quality is improved we can drop the pixel map entirely. Then we can figure out what to do for the other quantizers that depend on it. The pixel map was always a crutch. That's why I buried it far away from the API surface. |
@JimBobSquarePants I think I might be confused about how your implementation works now. I thought the purpose of the pixel map was to improve speed by eliminating the need for a full search through the palette to find the nearest palette color for each input pixel color. You're saying it has something to do with improving quality? How does that work?

You can actually use an octree for palette lookup regardless of how the palette was originally obtained, so the idea isn't necessarily that you go all octree all the time. If you needed to map an image to e.g. a standard web palette, you would build out an octree with the values from the palette to start. As you map the input image to the palette, any time you navigate the octree and the node for your pixel color doesn't exist, you do the linear search through the palette and then cache the winner by creating that node, so that next time you see that color or one like it, it's already there.

Essentially, the octree works as a sparse cache in that mode. The real win is in the fact that it can group similar colors instead of having a key per distinct input value, because you pick the max tree depth for those cache nodes. It's actually very similar to @antonfirsov's |
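A rough Python sketch of the "octree as sparse cache" mode described above. Instead of a real tree, it models a node at a fixed depth by truncating each channel to that many bits (an invented simplification, but equivalent for lookup purposes): on a miss, do the linear palette search once, then cache the winner so every similar color hits the same node.

```python
def dist2(a, b):
    # squared Euclidean distance between two RGB tuples
    return sum((x - y) ** 2 for x, y in zip(a, b))

CACHE_BITS = 6  # colors sharing their top 6 bits per channel share one cache node

def node_key(color, bits=CACHE_BITS):
    # truncating each channel is equivalent to stopping at this depth in the tree
    shift = 8 - bits
    return tuple(c >> shift for c in color)

def lookup(palette, cache, color):
    key = node_key(color)
    if key not in cache:
        # "node doesn't exist": fall back to linear search, then create the node
        cache[key] = min(range(len(palette)), key=lambda i: dist2(palette[i], color))
    return cache[key]
```

The grouping is the point: two near-identical input colors resolve through the same cache entry, so the number of entries is bounded by the tree depth rather than by the number of distinct input colors.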
@saucecontrol Using the pixel map actually results in much slower quantization than not using one. The quality improvement is achieved by finding the closest color in the resolved palette to the given pixel via Euclidean distance. I originally started using that approach because a bug in my dithering caused exceptions when trying to match the dithered result against the palette. I stick with it now that that issue is gone because I still see quite dramatic differences in quality when compared against both the Wu and Octree quantizers. The dictionary caches results since the distance calculation is horribly slow. ImageSharp/src/ImageSharp/Processing/Processors/Quantization/EuclideanPixelMap{TPixel}.cs Lines 72 to 103 in b06cb32
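For contrast with the sparse-cache idea, here is a minimal Python sketch of the approach described above (a hypothetical stand-in, not the actual EuclideanPixelMap code): brute-force Euclidean matching memoized in a dictionary keyed by the exact input color, which is why memory grows with the number of distinct colors in the image.

```python
def dist2(a, b):
    # squared Euclidean distance between two RGB tuples
    return sum((x - y) ** 2 for x, y in zip(a, b))

def get_palette_index(palette, cache, color):
    # exact-color memoization: one cache entry per distinct input color,
    # so the cache can grow without bound on colorful images
    if color not in cache:
        cache[color] = min(range(len(palette)), key=lambda i: dist2(palette[i], color))
    return cache[color]
```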
I'm intrigued by the variable bit depth approach. From our previous conversations the output of your quantizer color wise was amazing (though you were having dithering issues) |
This is the problem. We need a significantly more memory-efficient lookup mechanism. Octree comes in the picture as probably the best candidate for such a lookup in case of 3-component images, assuming that @saucecontrol 's implementation is producing decent quality, and it's possible to integrate it with our generic dithering logic. |
Ah, yeah, we're talking about the same thing. I think @antonfirsov and I were phrasing it poorly by talking about eliminating the cache. The map is a necessity as long as you're doing dithering, because even if you had a perfect histogram to start (which you can't always), error diffusion might still introduce colors that fall outside the input image's gamut. And you'll want that map to cache any new palette entries it has resolved. The real issue is that a dictionary isn't a good map/cache because its keys are opaque integers and its allocations are GC-heavy. Using an octree as the map/cache can potentially solve both of those.

I don't know if there's any part of my code you can lift directly, just because it's so tuned for speed specific to my use case. Reviewing it now, there's some ugly stuff in there. Conceptually, though, I think you could model after it, because I haven't seen anything that balances quality, speed, and memory as well as it does.

I did end up resolving my dithering issues we chatted about on gitter, by the way. My final implementation just ended up being Floyd-Steinberg with the error weighted to 7/8. I'd say the quality is generally equal or better than what ImageSharp does today, although there's not a ton of improvement to be had there -- your implementation is already very good quality-wise. |
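As a rough illustration of the scheme described above (all names invented; this is not the MagicScaler code), here is a minimal Python sketch of Floyd-Steinberg dithering with the total diffused error damped to 7/8, combined with a naive nearest-palette-color match:

```python
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

ERROR_SCALE = 7 / 8  # damp the total diffused error, as described above
# (dx, dy, weight) taps of classic Floyd-Steinberg error diffusion
FS_WEIGHTS = ((1, 0, 7 / 16), (-1, 1, 3 / 16), (0, 1, 5 / 16), (1, 1, 1 / 16))

def dither(image, palette):
    # image: rows of (r, g, b) tuples; returns rows of palette indices
    h, w = len(image), len(image[0])
    work = [[list(p) for p in row] for row in image]  # mutable float copy
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            old = tuple(min(255, max(0, round(c))) for c in work[y][x])
            idx = min(range(len(palette)), key=lambda i: dist2(palette[i], old))
            out[y][x] = idx
            # quantization error, scaled down before diffusing to neighbors
            err = [(o - n) * ERROR_SCALE for o, n in zip(old, palette[idx])]
            for dx, dy, wgt in FS_WEIGHTS:
                nx, ny = x + dx, y + dy
                if 0 <= nx < w and 0 <= ny < h:
                    for c in range(3):
                        work[ny][nx][c] += err[c] * wgt
    return out
```

Scaling the error below 1 sacrifices a little theoretical accuracy but tends to suppress the worm artifacts and runaway error that full-strength diffusion can produce.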
Yep, that's why I was going to just insert an LRU cache in there for now but if there's a better cache then I'm happy for anyone to have a go at replacing it. If there was a cheaper way to match the dithered result to the palette and retain (or improve) output quality then I'm all ears. It'll need someone else to experiment/implement though. |
I don't think a dictionary-based LRU cache is a good candidate: it's still very heavy, and at the same time likely much slower than a plain dictionary. In other words, it seems like a low-hanging fruit, but I can bet at 95% that it just won't work well enough, so I would discourage investing into it. If we want a "best effort" map as a quick workaround, let's use |
But in general, I would suggest investing into @saucecontrol 's octree instead of trying half-solutions.
What bits are specific to your use case? Is there anything else than the hard-coded dithering and pixel format?
Without doing too much reading, I would assume it works as follows in your code:

```
GetPaletteIndex(color):
    ditheredColor = dither(color)
    return octree.FindIndex(ditheredColor)  # can this add a new palette entry ??
```

So the difference between the MagicScaler and ImageSharp implementations is that the magic octree integrates the logic for If my assumptions are correct, adapting MagicScaler's octree should be easy. All we need to change is:
@saucecontrol @JimBobSquarePants anything I'm missing? |
@antonfirsov What about |
I would focus on the gif encoding use-case (= current OctreeQuantizer) first, as it seems to be the most painful. The remaining use-cases can survive with As such: |
Bear in mind that quantizer can be set to any |
Output is always BGR with gif, so we can use it as a color map with other quantizers in the gif case, if we want. But I believe most users go with defaults. Having performant defaults for the most common use-cases is what we should focus on first. This will give us time, and it will also be easier to improve the general cases based on the learnings we acquire by solving the common/default ones. |
@antonfirsov I think there are a few other things that would have to be changed to fit my implementation to your use case, but it's all manageable. There's already a non-dithering mapping (remap method) in place, but it has some behaviors that probably won't fit your needs. Off the top of my head:
That's all I can think of for now, but if you do go down that path and find anything that doesn't make sense, I can probably shed some light. |
I have similar code in Paint.NET for image quantization when saving indexed images (8-bit, 4-bit, 2-bit, 1-bit). After palette generation, a class called I researched how to better implement |
@rickbrew I ran experiments with two KD tree implementations:
Unfortunately, even with 2, this turned out to be almost as slow as the linear search with the Euclidean distances. |
Thanks @antonfirsov. I suspected that the cost of the data structure and lookup wouldn't deliver the full theoretical speedup over linear search (that is, O(256) vs O(log2 256) = O(8)). Good to know! Maybe I'll play around with your implementation at some point to see if I can squeeze some better performance out of it (I only glanced through it -- it may already be maxed out!). If I do I'll post back here with my improvements. |
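For readers following along, a textbook k-d tree over RGB palette entries looks roughly like the Python sketch below (generic, not either of the implementations discussed in this thread). The backtracking step at the end of `nearest` is what eats into the theoretical O(log n) bound in practice: whenever the splitting plane is closer than the best match so far, the far subtree must be searched too, which is consistent with the measurements reported above.

```python
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_kdtree(points, depth=0):
    # points: list of (r, g, b) palette entries
    if not points:
        return None
    axis = depth % 3  # cycle through R, G, B as the splitting axis
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, target, best=None):
    if node is None:
        return best
    if best is None or dist2(node["point"], target) < dist2(best, target):
        best = node["point"]
    axis = node["axis"]
    diff = target[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # backtrack: the far subtree can only help if the splitting plane
    # is closer than the best match found so far
    if diff * diff < dist2(best, target):
        best = nearest(far, target, best)
    return best
```

With only 256 palette entries in 3 dimensions, the per-node overhead plus backtracking can easily swallow the gap between visiting 8 nodes and scanning 256 entries, especially when the linear scan is cache-friendly or vectorized.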
This could be of interest. Appears to be faster than a KD Tree https://github.com/spotify/annoy https://stackoverflow.com/questions/37105782/performance-of-annoy-method-vs-kd-tree |
@JimBobSquarePants following our chat a couple weeks back on Discord, I did a bit more experimentation with our quantizers, and I discovered a couple of interesting things. 1) MagicScaler was doing significantly better at generating a palette for a given image but 2) ImageSharp was doing significantly better at quantizing an image to a given palette. The net result was that MagicScaler tended to edge out ImageSharp in quality because dithering hides a great many faults, and a palette with a better gamut match will do better in either case. But in running both without dithering, and especially in trying ImageSharp's quantization with a MagicScaler-created palette, I could see the shortcomings in my mapping quality. With that, I set about overhauling my implementation, and I ended up with something that improves the accuracy of the mappings while being just a bit faster than it was before (and in half the memory). I've also scrapped the code that preserved the histogram tree to use as the basis of the mapping tree, meaning it now builds out the mapping tree straight off the palette, which would make that code easier to adapt to your needs. The code is hopefully much easier to follow now as well, because I've split the octree nodes used for the histogram phase and the nodes for the mapping phase into separate structs. The new version is here: https://github.com/saucecontrol/PhotoSauce/blob/master/src/MagicScaler/Magic/OctreeQuantizer.cs |
@JimBobSquarePants @saucecontrol I've also spent a lot of time in the Paint.NET quantization code making fixes and improvements. I've talked with Jim about this a little on Twitter, and have made my code available over at https://github.com/rickbrew/PaintDotNet.Quantization

The most important parts of my findings are in sections 1 and 7 of the README over at that repo. Section 1 is just a simple fix for the Octree binning (ye ol' MSDN article bugs). It will result in more memory usage and CPU time, but should improve the quality/correctness of the palette.

Section 7 doesn't yet have a writeup as I have some other demands on my time right now. The tl;dr is that I've found a better way to map colors to palette entries that is faster and uses less memory. The code is at https://github.com/rickbrew/PaintDotNet.Quantization/blob/main/PaintDotNet/Imaging/Quantization/ProximityPaletteMap.cs and a discussion is at https://twitter.com/rickbrewPDN/status/1379238853832155136 . Pictures are really needed to properly explain the algorithm, and hopefully I'll get to all that once I'm done with taxes and some other things. It should be easy to integrate into ImageSharp and/or PhotoSauce.

Benchmark results, in the PDN codebase at least, are very strongly in favor of the new "proximity map" code. Memory use is about 1MB max, as compared to basically unbounded for the linear search + Dictionary cache approach. The code isn't even that complicated, and further experimentation -- in search of an even better result -- should be straightforward. |
Wow, great to see so much work going on in this area! I'll give your project a look for sure. When I was chatting with James on Discord, he shared the tweet where you pointed out the shift math error in that MSDN code. That's certainly contributing to the ImageSharp palette quality today. I've got my quantizer down to a max of 87KB of memory now, and I built an AVX2 implementation of nearest distance match to a palette color to make brute-force search through the whole palette fast when filling in missing nodes in my octree. I subsample large images to a max of 4 megapixels when building the histogram, so those aren't directly comparable, but here's what my numbers look like on the worst-case image (all 24 bit colors), mapped to 5 bits of accuracy on my 4 year old laptop 😄 : |
BTW, @rickbrew, since you're in the high-perf C# clan, I'd suggest you join us over in the #lowlevel channel on the C# Discord. Tons of good chat and info on there. |
Pop quiz. @saucecontrol What does your Octree output for this image look like? I notice that @rickbrew encoding this image with Paint.NET leads to blending with a white background. My local Octree Quantizer currently produces which simply drops the alpha component. While Wu can deliver far better since it can handle multiple transparent colors (which png can support). |
Thanks @saucecontrol I could, fairly easily, build background blending into the current architecture but I think I'll leave that for the future. I'm going to be pushing a PR very soon that improves our memory usage though. |
Cool, looking forward to seeing what you've come up with! For reference, here's what my quantizer does with 1/255 alpha threshold And matted against white. It looks lighter than the above browser-blended and PDN output because I do linear light blending with background when matting. Not so great in this case but avoids the red+green=brown problem. Just need to get that config API built... :) (Edit: updated first image to match ImageSharp's threshold value) |
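A small Python sketch of the linear-light matting described above (the transfer functions are the standard sRGB ones; the helper names are invented): convert foreground and background to linear light, blend by normalized alpha, then convert back to sRGB. Blending in linear light is what makes the result look lighter than a naive sRGB-space blend, and it's also what avoids the muddy red+green=brown problem mentioned above.

```python
def srgb_to_linear(v):
    # standard sRGB decode, v in [0, 255]
    v /= 255.0
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4

def linear_to_srgb(v):
    # standard sRGB encode back to [0, 255]
    v = v * 12.92 if v <= 0.0031308 else 1.055 * v ** (1 / 2.4) - 0.055
    return round(v * 255.0)

def matte(color, alpha, background=(255, 255, 255)):
    # composite foreground over an opaque background in linear light
    a = alpha / 255.0
    return tuple(linear_to_srgb(srgb_to_linear(f) * a + srgb_to_linear(b) * (1 - a))
                 for f, b in zip(color, background))
```

A naive blend would interpolate the 0-255 values directly; doing it in linear light weights the blend by actual light intensity, which is physically correct but reads as lighter against white, as noted above.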
All our current quantizers rely on a very heavy cache in order to produce acceptable quality:
ImageSharp/src/ImageSharp/Processing/Processors/Quantization/EuclideanPixelMap{TPixel}.cs
Line 21 in b06cb32
With large images this could consume up to several megabytes of memory every time a larger GIF (or palettized PNG) gets saved. We need to reduce memory usage without compromising speed and quality.
We made a couple of experiments and exchanged many ideas a couple of months ago, but unfortunately all those discussions got lost in the noise of the gitter chatroom. I suggest we recollect and rediscuss those ideas.
Improvements we may consider (according to what I remember from the gitter chat):
@saucecontrol since I remember you had really valuable ideas here, I would really appreciate your feedback on all 4 options, and apologies if I'm making you repeat yourself because of my terrible memory 😄 If I remember correctly, you also had some promising results in your repos.
Bravely assigning milestone 1.1 for now.