
secondarycachecompress #1379

Closed
antry opened this issue Apr 1, 2013 · 31 comments
Labels
Type: Feature Feature request or new feature

Comments

@antry

antry commented Apr 1, 2013

is there any plan to merge secondarycachecompress into zfsonlinux?

@ryao
Contributor

ryao commented Apr 1, 2013

It will likely be merged soon after someone finds time to port the patches.

@jimmyH
Contributor

jimmyH commented Apr 24, 2013

I had a try at porting the l2arc compression patches a while ago. I have been running with this for a couple of months and haven't had any issues - so I guess I didn't make too many mistakes porting it....
see jimmyH@c8d5f5a

@behlendorf
Contributor

@jimmyH Nice, thanks for taking a crack at this. However, since these changes haven't even been merged to Illumos in their final form I'd prefer not to rush ahead. Let's let it soak for a while. We could create a pull request with the patches applied for those who want to give this functionality some additional testing.

behlendorf pushed a commit to behlendorf/zfs that referenced this issue Aug 1, 2013
3137 L2ARC compression
Reviewed by: George Wilson <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Dan McDonald <[email protected]>

References:
  illumos/illumos-gate@aad0257
  https://www.illumos.org/issues/3137
  http://wiki.illumos.org/display/illumos/L2ARC+Compression

Notes for Linux port:

A l2arc_nocompress module option was added to prevent the
compression of l2arc buffers regardless of how a dataset's
compression property is set.  This allows the legacy behavior
to be preserved.

Ported by: James H <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#1379
@behlendorf
Contributor

@jimmyH I've refreshed the l2arc compression patch against the latest master using what was committed to Illumos. They appear to have dropped the secondarycachecompress property in favor of inferring l2arc compression from the existing compression property. I'm not sure I care for that but I want to avoid diverging from the upstream code so we'll go with it.

http://www.listbox.com/member/archive/182191/2013/05/sort/from/page/1/entry/4:118/20130507165342:31693C3A-B758-11E2-B297-C9B0912993E8/

I did add a l2arc_nocompress module option which is disabled by default so the legacy behavior can be maintained if needed.
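For reference, a minimal sketch of toggling that option at runtime on a ZoL system; the /sys/module path is the standard module-parameters location, and the modprobe.d line is just one possible way to make the setting persistent:

    # current value (0 = compress l2arc buffers)
    cat /sys/module/zfs/parameters/l2arc_nocompress
    # restore the legacy, uncompressed behavior until the module is reloaded
    echo 1 > /sys/module/zfs/parameters/l2arc_nocompress
    # or keep it across reboots
    echo "options zfs l2arc_nocompress=1" >> /etc/modprobe.d/zfs.conf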

@skiselkov Any chance you can shed some light on why the secondarycachecompress property was dropped? Also you'll want to update the Illumos documentation below which describes how to enable l2arc compression by setting the secondarycachecompress property.

http://wiki.illumos.org/display/illumos/L2ARC+Compression

@skiselkov
Contributor

@behlendorf Hi Brian. The reason we dropped the property is because ultimately we couldn't come up with a reason why you would want to avoid compressing to L2ARC while compressing to the main pool. There's almost no performance penalty in decompression and given that most deployments never ever touch the defaults (especially not the more esoteric ones, like those dealing with caching), we figured it is better to keep it simple. Thanks for pointing out the docs discrepancy, I'll go fix it right away.

@spacelama

On Thu, 1 Aug 2013, skiselkov wrote:

@behlendorf Hi Brian. The reason we dropped the property is because ultimately we couldn't come up with a reason why you would want to avoid compressing to L2ARC while compressing to the main pool. There's almost no performance penalty in decompression and given that most deployments never ever touch the defaults (especially not the more esoteric ones, like those dealing with caching), we figured it is better to keep it simple. Thanks for pointing out the docs discrepancy, I'll go fix it right away.

Metadata. I don't use compression on any of my pools, because I know that
all of my data is incompressible.

But we've read here many times before that the metadata compresses
extremely well - more than 50%. For secondarycache=metadata, it makes a
lot of sense to extend the useful size of your l2arc and the write-life of
your SSDs.

Tim Connors

@behlendorf
Contributor

I think it makes sense to always compress metadata. This is already the case for the primary pool and we should carry that behavior over for the l2arc where it makes even more sense. We can leverage the existing zfs_mdcomp_disable module option to disable this if needed.

But I have a similar use case where the data is incompressible but I'd love to cache as much metadata as possible in the l2arc.
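As a rough sketch of that use case (the pool and dataset names below are placeholders):

    # data is incompressible, so leave the dataset itself uncompressed...
    zfs set compression=off tank/backups
    # ...but only feed metadata to the cache device, where it compresses well
    zfs set secondarycache=metadata tank/backups
    # on-disk metadata compression is governed separately (0 = metadata compressed)
    cat /sys/module/zfs/parameters/zfs_mdcomp_disable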

@jimmyH
Contributor

jimmyH commented Aug 2, 2013

One issue I have noticed with this is that you do not get as good a compression ratio as with zram:

  • Using l2arc compression with secondarycache=metadata I get ~2x compression.
  • Using l2arc on zram with secondarycache=metadata I get ~4x compression with the same dataset.

@skiselkov any ideas as to why there is this discrepancy?

@skiselkov
Contributor

On 8/2/13 2:18 AM, Brian Behlendorf wrote:

I think it makes sense to always compress metadata. This is already the
case for the primary pool and we should carry that behavior over for the
l2arc where it makes even more sense. We can leverage the existing
zfs_mdcomp_disable module option to disable this if needed.

But I have a similar use case where the data is incompressible but I'd
love to cache as much metadata as possible in the l2arc.

True that, I'll add a modification which will always compress metadata
then, regardless of the os_compress setting on the dbuf's objset.

@skiselkov
Contributor

On 8/2/13 8:10 AM, jimmyH wrote:

One issue I have noticed with this is that you do not get as good a
compression ratio as with zram:

  • Using l2arc compression with secondarycache=metadata I get ~2x
    compression.

This would seem in line with general LZ4 performance.

I don't understand how you can get any more L2ARC compression when using
zRam - the L2ARC operates wholly outside of RAM. Can you elaborate on
your test case and what statistics you're using to gauge your
compression performance?

Cheers,

Saso

@jimmyH
Contributor

jimmyH commented Aug 2, 2013

Saso,

On machines without SSDs, I find it best to restrict the arc cache and use the free memory for a zram disk which uses LZO compression. I then use this as L2ARC for metadata only.
I do not have a specific test case for this. Below is an example of the compression ratio reported by zram.

    zram0 orig=1000751104 compr=170572156
    zram1 orig=991719424 compr=170423437
    zram2 orig=982605824 compr=167444387
    zram3 orig=996671488 compr=171963690
                   capacity     operations    bandwidth
    pool        alloc   free   read  write   read  write
    ----------  -----  -----  -----  -----  -----  -----
    zfs1        1.17T   525G     41     68  1.55M   915K
      vdb1      1.17T   525G     41     68  1.55M   915K
    cache           -      -      -      -      -      -
      zram0     1.99G     8M      3      0  40.7K  21.8K
      zram1     1.99G     8M      3      0  36.6K  22.1K
      zram2     1.99G     8M      3      0  47.6K  17.9K
      zram3     1.99G     8M      2      0  34.3K  17.8K
    ----------  -----  -----  -----  -----  -----  -----
    
    l2_size                         4    4694123008

This is an extreme case, where zram is reporting that the L2ARC disks are really only 50% full before compression, and we are getting ~6x compression on that data.
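For reference, a rough sketch of this kind of zram-backed setup and how the ~6x figure falls out of the counters above (the device count and sizes are illustrative, not a recommendation):

    # create compressed ram disks and add them as cache devices
    modprobe zram num_devices=4
    for i in 0 1 2 3; do echo 2G > /sys/block/zram$i/disksize; done
    zpool add zfs1 cache /dev/zram0 /dev/zram1 /dev/zram2 /dev/zram3

    # e.g. zram0 above: orig=1000751104 / compr=170572156 ~= 5.9x
    echo "1000751104 / 170572156" | bc -l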

On a different machine this is what I see from just using secondarycachecompress:

    -----------------------------------  -----  -----  -----  -----  -----  -----
    zfs2                                 1.79T   709G     43      7  1.13M   318K
      dm-name-grandadibu-zfs2            1.79T   709G     43      7  1.13M   318K
    cache                                    -      -      -      -      -      -
      nbd5                               3.89G     8M      8      0  26.3K  6.86K
      nbd6                               3.89G     8M      8      0  24.9K  9.49K
    -----------------------------------  -----  -----  -----  -----  -----  -----
    l2_size                         4    13368855040
    l2_asize                        4    8122558976
    l2_hdr_size                     4    330429216

Perhaps Linux is misreporting the zram usage (I will run some tests to see if this is true).
Perhaps the Linux port of ZFS is not fully utilising the l2arc.

One test I will run is to take a snapshot of the cache device whilst it is running with secondarycachecompress and then compress it using LZO. It would be worthwhile for someone to run the same test on illumos.
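For what it's worth, the ratio being discussed can be read straight out of the arcstats kstat; a quick sketch (path as on the Linux port):

    awk '$1 == "l2_size" || $1 == "l2_asize" { print $1, $3 }' /proc/spl/kstat/zfs/arcstats
    # ratio = l2_size / l2_asize, e.g. 13368855040 / 8122558976 ~= 1.65x
    echo "13368855040 / 8122558976" | bc -l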

@jimmyH
Contributor

jimmyH commented Aug 2, 2013

I took one of the 4GB L2ARC partitions where secondarycachecompress gives a ~1.65x compression ratio and observed:

    |  lz4 (zfs filesystem with compression=lz4) | 4GB   | 1x   |
    |  lz4c (default settings)                   | 611MB | 6.7x |
    |  lzo (default settings)                    | 596MB | 6.9x |
    |  gzip -1                                   | 516MB | 7.9x |

So I guess this is an issue with the settings we are using for lz4.
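Roughly how the test above can be reproduced (device and file names are placeholders; lz4c here is the legacy command-line tool, which takes input and output file arguments):

    # snapshot one cache partition
    dd if=/dev/sdb1 of=/tmp/l2arc.img bs=1M
    # see how much a second pass of general-purpose compression recovers
    gzip -1 -c /tmp/l2arc.img | wc -c
    lz4c /tmp/l2arc.img /tmp/l2arc.img.lz4 && ls -l /tmp/l2arc.img.lz4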

@skiselkov
Contributor

On 8/2/13 1:23 PM, jimmyH wrote:

I took one of the 4GB L2ARC partitions where secondarycachecompress
gives a ~1.65x compression ratio and observed:

|  lz4 (zfs filesystem with compression=lz4) | 4GB   | 1x   |
|  lz4c (default settings)                   | 611MB | 6.7x |
|  lzo (default settings)                    | 596MB | 6.9x |
|  gzip -1                                   | 516MB | 7.9x |

So I guess this is an issue with the settings we are using for lz4.

There can hardly be, there are almost no settings to tweak for LZ4, the
algorithm is pretty straightforward. What kind of dataset are you trying
this out on?

Cheers,

Saso

@skiselkov
Contributor

On 8/2/13 1:23 PM, jimmyH wrote:

I took one of the 4GB L2ARC partitions where secondarycachecompress
gives a ~1.65x compression ratio and observed:

|  lz4 (zfs filesystem with compression=lz4) | 4GB   | 1x   |
|  lz4c (default settings)                   | 611MB | 6.7x |
|  lzo (default settings)                    | 596MB | 6.9x |
|  gzip -1                                   | 516MB | 7.9x |

So I guess this is an issue with the settings we are using for lz4.

Btw: what was your l2_size/l2_asize for this test? I'm guessing you're
compressing empty blocks on an L2ARC device that wasn't very highly
utilized.

Cheers,

Saso

@jimmyH
Contributor

jimmyH commented Aug 2, 2013

Sorry, a moment of stupidity: this was within an ext4 disk image, which is obviously not aware of the zfs compression.

 | lz4 (zfs filesystem with compression=lz4) | 632MB | 6.5x |

So even though the l2arc is using secondarycachecompress, its contents can still be compressed further.

I wonder if the arc fragmentation issues we get in the linux port are also affecting the l2arc?

@jimmyH
Contributor

jimmyH commented Aug 2, 2013

I had 2x4GB L2ARC partitions with:
l2_size 4 13368855040
l2_asize 4 8122558976

@skiselkov
Contributor

On 8/2/13 1:54 PM, jimmyH wrote:

I had 2x4GB L2ARC partitions with:
l2_size 4 13368855040
l2_asize 4 8122558976

This doesn't make any sense to me. According to l2_asize you have
approximately 7746 MB of data, but in your other messages you note that
this compresses down to 632 MB with a compression ratio of 6.5x. But 632
MB vs. 7746 MB isn't 6.5x, it's more like 12.2x. So which is it?

Cheers,

Saso

@jimmyH
Contributor

jimmyH commented Aug 2, 2013

To clarify:
I have 2x4GB L2ARC partitions with secondarycachecompress=all, secondarycache=metadata, storing 7746MB of compressed data (12.5GB uncompressed, a compression ratio of 1.65x).
I only compressed one of these partitions - 4GB - to obtain a compression ratio of 6.5x.
Assuming both L2ARC partitions contain similar data this means that I am getting a total compression ratio of 6.5 x 1.65 = 10.7x.

The question is why we can get an additional 6.5x compression ratio on data which should already be compressed.

@skiselkov
Contributor

On 8/2/13 2:18 AM, Brian Behlendorf wrote:

I think it makes sense to always compress metadata. This is already the
case for the primary pool and we should carry that behavior over for the
l2arc where it makes even more sense. We can leverage the existing
zfs_mdcomp_disable module option to disable this if needed.

But I have a similar use case where the data is incompressible but I'd
love to cache as much metadata as possible in the l2arc.

Brian, just wanted to let you know that your suggestion has been
accepted and I've already sent in an RTI for it:
https://www.illumos.org/issues/3964

@jimmyH
Contributor

jimmyH commented Aug 4, 2013

I think I've understood the reason for the poor compression ratio:

  • With secondarycache=metadata many of the buffers sent to l2arc_write_buffers() are less than the minimum size, so do not get compressed.
  • Also, all compressed buffers get zero padded up to the cache device's block size.

If we did not align the data in the L2ARC we would get a much better compression ratio for small buffers at the expense of slower read/writes.

I did a quick hack to modify l2arc_write_buffers() and zio_compress() so that small buffers get compressed. This gave me a very good l2_size/l2_asize compression ratio. I am not sure what would need to be changed to the vdev code to allow you to write unaligned blocks (zio_write_phys() returns an error if you try this).
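To put hypothetical numbers on the second point: a small metadata buffer that compresses well still costs a whole device block once it is padded.

    # hypothetical 1536-byte metadata buffer that LZ4 shrinks to ~600 bytes;
    # on a 4096-byte-block cache device it is still written as one full block
    csize=600; blocksize=4096
    echo $(( (csize + blocksize - 1) / blocksize * blocksize ))   # -> 4096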

@skiselkov
Contributor

On 8/4/13 1:29 PM, jimmyH wrote:

I think I've understood the reason for the poor compression ratio:

  • With secondarycache=metadata many of the buffers sent to
    l2arc_write_buffers() are less than the minimum size, so do not get
    compressed.
  • Also, all compressed buffers get zero padded up to the cache device's
    block size.

If we did not align the data in the L2ARC we would get a much better
compression ratio for small buffers at the expense of slower read/writes.

I did a quick hack to modify l2arc_write_buffers() and zio_compress()
so that small buffers get compressed. This gave me a very good
l2_size/l2_asize compression ratio. I am not sure what would need to be
changed to the vdev code to allow you to write unaligned blocks
(zio_write_phys() returns an error if you try this)

How exactly did you modify l2arc_write_buffers?

Unaligned writes to block devices are not possible. I'm somewhat
surprised you didn't get a kernel panic (maybe the SPL modifies ASSERT()
not to invoke one, though that would be strange).

Cheers,

Saso

@jimmyH
Contributor

jimmyH commented Aug 4, 2013

On 04/08/2013 13:38, skiselkov wrote:

How exactly did you modify l2arc_write_buffers?

I set buf_compress_minsz=0, and modified zio_compress_data() to not
align to SPA_MINBLOCKSIZE when called from l2arc.

Unaligned writes to block devices are not possible. I'm somewhat
surprised you didn't get a kernel panic (maybe the SPL modifies ASSERT()
not to invoke one, though that would be strange).

Oh. I really should have realised that.

The l2arc code could pack the buffers when writing them to the vdev.
This would improve the effective compression ratio for small buffers,
though there would obviously be a significant overhead.
Is this a sensible option?

thanks,

Jimmy

@skiselkov
Contributor

On 8/4/13 3:38 PM, jimmyH wrote:

On 04/08/2013 13:38, skiselkov wrote:

How exactly did you modify l2arc_write_buffers?

I set buf_compress_minsz=0, and modified zio_compress_data() to not
align to SPA_MINBLOCKSIZE when called from l2arc.

Are you running the latest upstream where buf_compress_minsz is
determined from l2ad_vdev->vdev_ashift? Setting minsz to zero just means
that you'll be compressing even in cases where it makes no sense. The
upstream code takes care of correctly sizing the lower bound.

The modification of zio_compress_data() is not good. Alignment matters.
Maybe the block layer in Linux is able to fake unaligned access by
reverting to RMW, but this is bad behavior and should never be part of
the design (plus it hurts performance badly).

The l2arc code could pack the buffers when writing them to the vdev.
This would improve the effective compression ratio for small buffers,
though there would obviously be a significant overhead.
Is this a sensible option?

While it might improve the compression ratio for these corner cases, it
would result in the need to do extra large reads and decompresses to get
at a buffer that's "inside" of such a packed structure, and this hurts
performance for everybody else. Keep in mind that the L2ARC is almost
completely about speeding up small random reads, so forcing a hit to do
extra work is going to impact performance significantly.

@behlendorf
Contributor

@skiselkov Thanks for the quick turnaround on the metadata compression change.

@jimmyH beat me to it, but I was going to say the improved zram compression must be due to the vdev alignment constraints. Metadata is quite likely to compress significantly smaller than the device sector size.

The modification of zio_compress_data() is not good. Alignment matters.

Alignment absolutely matters; reverting to read-modify-write for an SSD will hurt performance badly. But keep in mind you're assuming certain performance characteristics for that block device. The backing device could be a normal ram disk, a compressed ram disk like zram, a low-latency network block device, or something slightly more exotic like phase-change memory. All of these things exist today and could be used to back an l2arc device, so we should make sure the l2arc can be tuned to take advantage of them.

While it might improve the compression ratio for these corner cases, it would result in the need to do extra large reads and decompresses to get at a buffer that's "inside" of such a packed structure, and this hurts performance for everybody else.

This becomes important when we're talking about the metadata-only use case. Here the corner case may easily become the common case, and it would be nice to handle that cleanly. Packing the buffers is one solution which I could see working pretty well. I'm a little less concerned about the extra overhead here since I think this could be implemented efficiently, but it's certainly something we'd need to keep in mind.

For me the bigger question is whether this is worth doing at all. And I think the answer really depends on what you're storing in the l2arc and what technology you're using to back it. For SSDs, which are now large, relatively cheap, but slow to write, it doesn't make sense. If you're using something like zram then what you probably want is to disable all l2arc compression. The zram device, which doesn't have the alignment concerns, will be able to do a much better job compressing the data and packing it. For the new types of non-volatile storage arriving in the next few years, which will likely be small, expensive, and fast, I suspect packing may be worthwhile.

@skiselkov
Contributor

On 8/4/13 7:00 PM, Brian Behlendorf wrote:

For me the bigger question is whether this is worth doing at all.

You hit the nail on the head. Of course one can always devise complex
workarounds for corner cases. Buffer packing, speculative reading, smart
compression algorithm selection, and so on: there's no shortage of clever
things code can do to get around certain limitations. But every new code
path needs to be developed, QA'd and supported, plus test and problem
resolution complexity increases exponentially with the number of
possible behaviors.

More to the specific case here, using L2ARC on ram disks appears to me
as just an ugly hack and kludge that should never have existed in the
first place. If we need a knob to dedicate a portion of RAM for metadata
caching, we should add a knob, not go through megatons of other code to
implement what is essentially a simple numeric comparison.

Don't get me wrong, I'm not against innovation in the ARC (there is
plenty to be done there), but before we go and hack away on a highly
performance-sensitive part of the code base for questionable
hypothetical gains, I'd like to see each idea "scientifically"
scrutinized with real data to validate our assumptions (and there will
always need to be assumptions about hardware behavior in a storage
stack, that's just a fact of life).

@jimmyH
Contributor

jimmyH commented Aug 5, 2013

@skiselkov Totally agree with your comments. I hacked a really basic patch to pack small l2arc buffers - jimmyH@243d508 - which I'll leave soaking on some real servers to see if it is a stupid idea...

One comment with regard to the compression patch: is storing b_tmp_cdata in the l2arc_buf_hdr structure sensible? I'd have thought we'd want to keep these headers as small as possible.

@skiselkov
Contributor

Can you do an analysis on your workload on how much of a performance/efficiency gain you get in running this code vs running upstream source? I'm personally not convinced that it's really worth the hassle, especially using ramdisk-backed L2ARC devices instead of just implementing more sensible ARC partitioning. (To clarify here, I mean a real-world use case that would hit this, not just synthetics.)

The reason b_tmp_cdata is in the header is, as you have no doubt discovered writing that patch, that we need to keep the pointer around until after we're done writing it, and the l2arc_hdr_t structure was just the most convenient place to put it. I could do it by allocating a separate "freelist" and stuffing it into the callback, but that would detach the compressed data portion of a buffer from the buffer header itself, which would make it harder to track where it is and to make sure it is released after a write; plus the overhead cost is rather small (an arc_hdr_t is around 200 bytes anyway). If we were to integrate the packing method you propose here, it'd probably make sense to start doing something smarter.

@jimmyH
Contributor

jimmyH commented Aug 7, 2013

Workload: a backup server running rsync, where rsync spends most of its time stat'ing millions of files before realising they are all up to date, so backups are mostly limited by random reads of metadata. This server has an old SSD for l2arc; the more metadata we can fit in the l2arc, the faster the backups.

  • The secondarycachecompress patch increased the amount of metadata we could store, and reduced backup times by ~10%.
  • Packing the metadata has reduced the backup times by a further 10% (NB: it hasn't been enabled for very long, so this is only an initial observation). But I have had to reduce the size of the cache device, since the l2_hdr_size would otherwise fill the ARC.

I now get an 8.1x compression ratio, but l2_hdr_size is approximately 1/3 of l2_asize.

On its own I'm also not sure that this is 'worth the hassle' while we have such a large header overhead.

I understand why the original zfs developers keep an arc_hdr_t (264 bytes on linux x86_64) for every block in the l2arc, but it looks like a design flaw to me.
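A back-of-the-envelope sketch of that overhead, using the l2_size figure quoted earlier in the thread, an assumed ~2KB average metadata buffer, and the 264-byte header size mentioned above:

    l2_size=13368855040; avg_buf=2048; hdr_bytes=264
    echo $(( l2_size / avg_buf * hdr_bytes ))   # ~1.7GB of headers held in RAM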

behlendorf pushed a commit to behlendorf/zfs that referenced this issue Aug 8, 2013
3964 L2ARC should always compress metadata buffers
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>

References:
  https://www.illumos.org/issues/3964

Ported-by: Brian Behlendorf <[email protected]>
Closes openzfs#1379
@ryao
Contributor

ryao commented Aug 8, 2013

I don't understand how you can get any more L2ARC compression when using
zRam - the L2ARC operates wholly outside of RAM. Can you elaborate on
your test case and what statistics you're using to gauge your
compression performance?

@skiselkov You might want to read the following thread on the lz4 mailing list where one guy managed to compress a 3.5GB file into a 750KB file by compressing it multiple times.

https://groups.google.com/forum/#!msg/lz4c/DcN5SgFywwk/AVMOPri0O3gJ

Yann Collet provided a good explanation for the observed behavior in his response.

@skiselkov
Contributor

Well, a curious case for sure, but I'm not convinced it's really applicable to filesystem compression. The cited scale in the second pass there, going from 3.5GB to 9.2MB, is already much greater (368x) than the difference between SPA_MAXBLOCKSIZE and SPA_MINBLOCKSIZE (256x). Nevertheless, there probably is a case to be made for implementing compress="lz4-2", or indeed lz4-n with n={1..9}, in a fashion similar to gzip. Even at two compression passes, lz4 would still be almost as fast as lzjb, but would probably produce much smaller outputs for certain workloads. But I fear giving too many knobs into the hands of users here, gun-shoot-foot.
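A trivial way to see the effect described in that thread, assuming the lz4c command-line tool and placeholder file names:

    # compress a highly redundant file, then compress the output again
    lz4c /tmp/sample.bin /tmp/sample.lz4
    lz4c /tmp/sample.lz4 /tmp/sample.lz4.lz4
    ls -l /tmp/sample.bin /tmp/sample.lz4 /tmp/sample.lz4.lz4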

@behlendorf
Contributor

L2ARC compression from Illumos has been merged. Thanks @skiselkov for developing this; we can always pursue further improvements, but this syncs us back up with Illumos.

4e59f47 Illumos #3964 L2ARC should always compress metadata buffers
3a17a7a Illumos #3137 L2ARC compression

unya pushed a commit to unya/zfs that referenced this issue Dec 13, 2013
3137 L2ARC compression
Reviewed by: George Wilson <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Dan McDonald <[email protected]>

References:
  illumos/illumos-gate@aad0257
  https://www.illumos.org/issues/3137
  http://wiki.illumos.org/display/illumos/L2ARC+Compression

Notes for Linux port:

A l2arc_nocompress module option was added to prevent the
compression of l2arc buffers regardless of how a dataset's
compression property is set.  This allows the legacy behavior
to be preserved.

Ported by: James H <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#1379
unya pushed a commit to unya/zfs that referenced this issue Dec 13, 2013
3964 L2ARC should always compress metadata buffers
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>

References:
  https://www.illumos.org/issues/3964

Ported-by: Brian Behlendorf <[email protected]>
Closes openzfs#1379