Choose the best recordsize #15287
-
If there are multiple top-level vdevs, at the start of each transaction group ZFS writes ~1MB to each vdev, trying to balance their usage if it is unequal, and then continues writing as fast as each vdev can handle it.

But you should not want to use such a small recordsize, especially if you are talking about HDDs (not SSDs) and torrents. Because of their random download/write order, torrents will create enormous data fragmentation on your pool, and while your logic makes some sense for writes, trying to read that data back from HDDs afterwards will be a disaster. ZFS will certainly try to use prefetch to compensate, etc., but there have been cases where people could not even play back a video they had just downloaded without constant delays. The best solution for that use case is a small separate SSD pool as temporary storage, moving completed files onto the main pool after completion. At least the Transmission client can do that automatically. And at that point you do not need to care so much about the recordsize.
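A minimal sketch of that staging-pool idea, assuming a spare SSD; the device path and the pool/dataset names (scratch, tank/media/torrents) are purely illustrative, and the Transmission keys refer to its documented settings.json options:

    # small SSD pool used only as temporary torrent storage (placeholder names)
    zpool create -o ashift=12 scratch /dev/nvme0n1
    zfs create scratch/incomplete
    zfs create -p tank/media/torrents
    # Transmission downloads to the SSD pool and moves finished files to the
    # main pool on its own; in settings.json:
    #   "incomplete-dir-enabled": true,
    #   "incomplete-dir": "/scratch/incomplete",
    #   "download-dir": "/tank/media/torrents"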
-
In the torrent datasets I use recordsize=128k and sync=disabled, have
txg_timeout=120 globally, and move the finished files afterwards. There is
certainly write amplification, but it mostly happens in memory before hitting
the disk, where it is all aggregated and written out once in a while.
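For reference, a sketch of how those settings are applied; the dataset name tank/torrents is hypothetical, and txg_timeout is a module/kernel tunable rather than a dataset property:

    zfs set recordsize=128K tank/torrents
    zfs set sync=disabled tank/torrents     # async only: in-flight data is lost on power failure
    echo 120 > /sys/module/zfs/parameters/zfs_txg_timeout   # Linux, as root
    # sysctl vfs.zfs.txg.timeout=120                        # FreeBSD equivalent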
-
txg_timeout is the maximum interval after which dirty data is written from RAM
to disk. If there is enough pressure in RAM, it can be written more
frequently. I have a raidz with 4 HDDs and an L2ARC, which is mostly
irrelevant in this case, I guess, as I think it is not in use for this
dataset.
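As a sketch of the "pressure" side on Linux: these are standard OpenZFS module parameters, and the persisted value below is only an example:

    cat /sys/module/zfs/parameters/zfs_dirty_data_max   # cap on dirty data; hitting it forces an earlier sync
    cat /sys/module/zfs/parameters/zfs_txg_timeout      # otherwise, max seconds between txg syncs
    # persist a change across reboots:
    echo "options zfs zfs_txg_timeout=120" >> /etc/modprobe.d/zfs.conf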
…On Sun, 17 Sept 2023, 13:21, christianmerges wrote:
How many disks are in the pool, and which RAID level? Only HDDs?
What does txg_timeout actually do? Is it delaying the writing of data from RAM to disk?
-
To read one byte from a 10-disk raidz2 you have to spin 8 disks (seek +
read); to read one byte from 2x raidz1 you have to spin 4 disks. Thus
your IOPS will be higher with 2x raidz1, but the redundancy will be lower. There
is also a bit more space wasted on the single raidz2: stripes smaller than
the full 8 x 4K of data will consume more space.
…On Sun, Sep 17, 2023 at 2:09 PM christianmerges wrote:
So to be clear, one block is striped across only one vdev, but the filesystem
writes to both vdevs simultaneously? So in my example, a 2M file with
recordsize=1M would be written like this: the first block of the file is split
into four pieces and written to the first raidz, and the second block is split
into four pieces and written to the second raidz vdev. So each vdev holds one
full block, right?
Doesn't it make more sense, then, to build a single raidz2 vdev with 10 disks?
-
Depends on your needs.
…On Sun, 17 Sept 2023, 14:47, christianmerges wrote:
So it's actually good practice to extend capacity in batches of five disks,
adding another raidz1 vdev to the pool?
-
It helps, as does the larger recordsize. A log device is irrelevant
with sync=disabled: it will not be used and will not contribute to
pool fragmentation. I have used two pools with these settings for more than
4 years, and the reported 'free space' fragmentation is 3% and 12%. I don't
know how to check file fragmentation in ZFS, so all of this is just my
educated(?) guess.
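The 'free space' fragmentation figure mentioned here is the pool's fragmentation property, readable with either of these (the pool name is a placeholder):

    zpool get fragmentation tank
    zpool list -o name,size,alloc,frag tank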
…On Sun, Sep 17, 2023 at 4:20 PM christianmerges wrote:
So do I get it right: setting txg_timeout higher keeps written data longer in RAM and on the log device, so it can be written to disk in bigger, sequential chunks? So the data on disk will be less fragmented and can be read back faster later?
-
I want to discuss the ideal recordsize choice.
So let's start from the bare metal: the disk has 4K physical sectors, but it presents 512B logical sectors to the OS for compatibility reasons. ashift=12 at pool creation should make sure that writes are always aligned to the 4K physical sectors (8 x 512B logical sectors per 4K physical sector).
The BitTorrent client writes in 16K blocks as far as I know. So when I create a raidz pool with recordsize=16K and five disks, it should write 4x 4K of data + 1x 4K of parity per record. Sounds like good alignment.
But how is data written within one pool if there are two raidz vdevs?
4+1
4+1
Is ZFS always writing to all disks? Or load-balancing the vdevs, or filling one vdev until it is full before moving to the next? Depending on that, I would set the recordsize of the pool to 16K or 32K.
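For concreteness, this is the layout I am describing, as a sketch with placeholder device names:

    zpool create -o ashift=12 tank \
        raidz1 sda sdb sdc sdd sde \
        raidz1 sdf sdg sdh sdi sdj
    zfs create -o recordsize=16K tank/torrents
    zpool status tank    # shows the two raidz1 top-level vdevs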