Choose the best recordsize #15287
-
If there are multiple top-level vdevs, at the start of each transaction group ZFS writes ~1MB to each vdev, trying to balance their usage if it is unequal, and then continues writing as fast as each vdev can handle it.

But you should not want to use such a small recordsize, especially if you are talking about HDDs (not SSDs) and torrents. Because of their random download/write order, torrents will create enormous data fragmentation on your pool, and while your logic makes some sense for writes, trying to read that data back from HDDs afterwards will be a disaster. ZFS will certainly try to use prefetch to compensate, etc., but there have been cases where people could not even play back a video they had just downloaded without constant delays. The best solution for that use case is a small separate SSD pool as temporary storage, moving completed files onto the main pool after completion. At least the Transmission client can do that automatically. And at that point you do not need to care so much about the recordsize.
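A minimal sketch of that staging-pool idea, assuming a spare SSD; the device path and the pool/dataset names (scratch, tank/media/torrents) are purely illustrative, and the Transmission keys refer to its documented settings.json options:

    # small SSD pool used only as temporary torrent storage (placeholder names)
    zpool create -o ashift=12 scratch /dev/nvme0n1
    zfs create scratch/incomplete
    zfs create -p tank/media/torrents
    # Transmission downloads to the SSD pool and moves finished files to the
    # main pool on its own; in settings.json:
    #   "incomplete-dir-enabled": true,
    #   "incomplete-dir": "/scratch/incomplete",
    #   "download-dir": "/tank/media/torrents"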
-
In the torrent datasets I use recordsize=128k and sync=disabled, have
txg_timeout=120 globally, and move the finished files afterwards. There is
certainly write amplification, but it mostly happens in memory before hitting
the disk, where it is all aggregated and written out once in a while.
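For reference, a sketch of how those settings are applied; the dataset name tank/torrents is hypothetical, and txg_timeout is a module/kernel tunable rather than a dataset property:

    zfs set recordsize=128K tank/torrents
    zfs set sync=disabled tank/torrents     # async only: in-flight data is lost on power failure
    echo 120 > /sys/module/zfs/parameters/zfs_txg_timeout   # Linux, as root
    # sysctl vfs.zfs.txg.timeout=120                        # FreeBSD equivalent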
-
txg_timeout is the maximum interval after which dirty data is written from RAM
to disk. If there is enough pressure in RAM, it can be written more
frequently. I have a raidz with 4 HDDs and an L2ARC, which is mostly
irrelevant in this case, I guess, as I think it is not in use for this
dataset.
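As a sketch of the "pressure" side on Linux: these are standard OpenZFS module parameters, and the persisted value below is only an example:

    cat /sys/module/zfs/parameters/zfs_dirty_data_max   # cap on dirty data; hitting it forces an earlier sync
    cat /sys/module/zfs/parameters/zfs_txg_timeout      # otherwise, max seconds between txg syncs
    # persist a change across reboots:
    echo "options zfs zfs_txg_timeout=120" >> /etc/modprobe.d/zfs.conf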
…On Sun, 17 Sept 2023, 13:21, christianmerges wrote:
How many disks are in the pool, and which RAID level? Only HDDs?
What does txg_timeout actually do? Is it delaying the writing of data from RAM to disk?
-
To read one byte from a 10-disk raidz2 you have to spin 8 disks (seek +
read); to read one byte from 2x raidz1 you have to spin 4 disks. Thus
your IOPS will be higher with 2x raidz1, but the redundancy will be lower. There
is also a bit more space wasted on the single raidz2: stripes smaller than
the full 8 x 4K of data will consume more space.
…On Sun, Sep 17, 2023 at 2:09 PM christianmerges wrote:
So to be clear, one block is striped across only one vdev, but the filesystem
writes to both vdevs simultaneously? So in my example, a 2M file with
recordsize=1M would be written like this: the first block of the file is split
into four pieces and written to the first raidz, and the second block is split
into four pieces and written to the second raidz vdev. So each vdev holds one
full block, right?
Doesn't it make more sense, then, to build a single raidz2 vdev with 10 disks?
-
Depends on your needs.
…On Sun, 17 Sept 2023, 14:47, christianmerges wrote:
So it's actually good practice to extend capacity in batches of five disks,
adding another raidz1 vdev to the pool?
-
It helps, as does the larger recordsize. A log device is irrelevant
with sync=disabled: it will not be used and will not contribute to
pool fragmentation. I have used two pools with these settings for more than
4 years, and the reported 'free space' fragmentation is 3% and 12%. I don't
know how to check file fragmentation in ZFS, so all of this is just my
educated(?) guess.
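The 'free space' fragmentation figure mentioned here is the pool's fragmentation property, readable with either of these (the pool name is a placeholder):

    zpool get fragmentation tank
    zpool list -o name,size,alloc,frag tank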
…On Sun, Sep 17, 2023 at 4:20 PM christianmerges wrote:
So do I get it right: setting txg_timeout higher keeps written data longer in RAM and on the log device, so it can be written to disk in bigger, sequential chunks? So the data on disk will be less fragmented and can be read back faster later?
-
I want to discuss the ideal recordsize choice.
So let's start from the bare metal: the disk has 4K physical sectors, but it presents 512B logical sectors to the OS for compatibility reasons. ashift=12 at pool creation should make sure that writes are always aligned to the 4K physical sectors (8 x 512B logical sectors per 4K physical sector).
The BitTorrent client writes in 16K blocks as far as I know. So when I create a raidz pool with recordsize=16K and five disks, it should write 4x 4K of data + 1x 4K of parity per record. Sounds like good alignment.
But how is data written within one pool if there are two raidz vdevs?
4+1
4+1
Is ZFS always writing to all disks? Or load-balancing the vdevs, or filling one vdev until it is full before moving to the next? Depending on that, I would set the recordsize of the pool to 16K or 32K.
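For concreteness, this is the layout I am describing, as a sketch with placeholder device names:

    zpool create -o ashift=12 tank \
        raidz1 sda sdb sdc sdd sde \
        raidz1 sdf sdg sdh sdi sdj
    zfs create -o recordsize=16K tank/torrents
    zpool status tank    # shows the two raidz1 top-level vdevs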