
zfs-0.6.0rc8 (repo) mkdir can hang on ARM 32bit system #749

Closed
lundman opened this issue May 21, 2012 · 44 comments
Labels: Type: Architecture (indicates an issue is specific to a single processor architecture)
@lundman
Contributor

lundman commented May 21, 2012

Creating this ticket after conversations with behlendorf, to attempt to track down other issues.

Initial porting work was done with zfs-0.6.0rc8 on PREEMPT kernels; I have since recompiled the kernel and am now using the GitHub repository versions. I'm tempted to go back to the old setup, as I don't remember seeing the issue then. (I could install a whole Ubuntu image, after all.)

At the moment, I/O can stall/hang (per process) on write operations (for example mkdir), and also on commands like "zfs create", which presumably also issue write requests. Interestingly, after a reboot the "zfs create" filesystem does exist.

The symptom is that the executed process never returns, but the OS itself keeps functioning.

For example:

May 21 02:49:35 cubox kernel: [  632.243007] ZFS: Loaded module v0.6.0-rc8 (DEBUG mode), ZFS pool version 28, ZFS filesystem version 5
# zpool create -f ROGER /dev/sdb1
# zpool status
  pool: ROGER
 state: ONLINE
 scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        ROGER       ONLINE       0     0     0
          sdb1      ONLINE       0     0     0
# zfs create ROGER/moore

Which does not return;

top - 03:11:25 up 3 min,  2 users,  load average: 1.58, 0.52, 0.19
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
  836 root       0 -20     0    0    0 D 70.4  0.0   0:31.73 txg_sync           
  846 root      20   0  3624 1280  936 D 21.7  0.2   0:11.71 zfs                
  787 root      39  19     0    0    0 R  3.3  0.0   0:02.97 z_null_iss/0       

# zfs list
NAME          USED  AVAIL  REFER  MOUNTPOINT
ROGER         129K  1.81G    30K  /ROGER
ROGER/moore    30K  1.81G    30K  /ROGER/moore

# ps -FAl
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN    RSS PSR STIME TTY          TIME CMD
0 R root       846   714 25  80   0 -   906 -       1280   0 03:10 pts/0    00:00:38 zfs create ROGER/moore

Strangely enough, if I strace "zfs create" it also dies in mkdir:

# strace -f -T -v -x zfs create ROGER/moore
lstat64("/ROGER/moore", 0xbe9ee210)     = -1 ENOENT (No such file or directory) <0.003260>
mkdir("/ROGER/moore", 0755

There are some compile time warnings, I will mention just in case they are relevant;

/media/usb0/zfs/module/zfs/../../module/zfs/zpl_ctldir.c:108: warning: integer constant is too large for 'long' type
/media/usb0/zfs/module/zfs/../../module/zfs/zpl_ctldir.c:116: warning: integer constant is too large for 'long' type
/media/usb0/zfs/module/zfs/../../module/zfs/zpl_ctldir.c: In function 'zpl_snapdir_readdir':
/media/usb0/zfs/module/zfs/../../module/zfs/zpl_ctldir.c:256: warning: integer constant is too large for 'long' type

/media/usb0/zfs/module/zfs/../../module/zfs/zfs_vnops.c:2004: warning: integer constant is too large for 'long' type
/media/usb0/zfs/module/zfs/../../module/zfs/zfs_vfsops.c:1341: warning: integer constant is too large for 'long' type
/media/usb0/zfs/module/zfs/../../module/zfs/zfs_vfsops.c:1341: warning: integer constant is too large for 'long' type
/media/usb0/zfs/module/zfs/../../module/zfs/zfs_vfsops.c:1344: warning: integer constant is too large for 'long' type

/media/usb0/zfs/module/zfs/../../module/zfs/zfs_byteswap.c:202: warning: the frame size of 1032 bytes is larger than 1024 bytes

Recompiling ZFS with --enable-debug did not seem to produce any additional output that I could observe.

@behlendorf
Contributor

Thanks for filing the bug. What would be helpful would be to dump the stacks for the processes which are stuck. You can do this by running the following command as root: echo t >/proc/sysrq-trigger. After it returns, all of the stacks will have been printed to dmesg.

It's possible the work @ryao has been doing to improve the memory management and support swap devices will help with this. But it's hard to say without seeing the stacks.

@lundman
Contributor Author

lundman commented May 22, 2012

git pulled the latest, just in case, but only SPL had some kmem changes.

 # zpool create ROGER /dev/sdb1
 # mkdir /ROGER/hello &
 0 D root      7356   701 17  80   0 -   343 cv_tim   468   0 00:47 pts/0    00:00:27 mkdir /ROGER/hello

May 22 00:50:04 cubox kernel: [76591.693274] mkdir         D c04e7a70     0  7356    701 0x00020000
May 22 00:50:04 cubox kernel: [76591.693303] [<c04e7a70>] (schedule+0x410/0x470) from [<c04e7e7c>] (schedule_timeout+0x1e8/0x224)
May 22 00:50:04 cubox kernel: [76591.693357] [<c04e7e7c>] (schedule_timeout+0x1e8/0x224) from [<bf1d26bc>] (__cv_timedwait_common+0x154/0x24c [spl])
May 22 00:50:04 cubox kernel: [76591.693643] [<bf1d26bc>] (__cv_timedwait_common+0x154/0x24c [spl]) from [<bf2f1c6c>] (txg_delay+0x10c/0x1b8 [zfs])
May 22 00:50:04 cubox kernel: [76591.694087] [<bf2f1c6c>] (txg_delay+0x10c/0x1b8 [zfs]) from [<bf2c14b4>] (dsl_dir_tempreserve_space+0x218/0x32c [zfs])
May 22 00:50:04 cubox kernel: [76591.694482] [<bf2c14b4>] (dsl_dir_tempreserve_space+0x218/0x32c [zfs]) from [<bf29f4f0>] (dmu_tx_assign+0x708/0x784 [zfs])
May 22 00:50:04 cubox kernel: [76591.694959] [<bf29f4f0>] (dmu_tx_assign+0x708/0x784 [zfs]) from [<bf33815c>] (zfs_mkdir+0x344/0x4e0 [zfs])
May 22 00:50:04 cubox kernel: [76591.695554] [<bf33815c>] (zfs_mkdir+0x344/0x4e0 [zfs]) from [<bf354324>] (zpl_mkdir+0x8c/0x118 [zfs])
May 22 00:50:04 cubox kernel: [76591.695875] [<bf354324>] (zpl_mkdir+0x8c/0x118 [zfs]) from [<c00f5250>] (vfs_mkdir+0x68/0xb8)
May 22 00:50:04 cubox kernel: [76591.695899] [<c00f5250>] (vfs_mkdir+0x68/0xb8) from [<c00f70e8>] (sys_mkdirat+0x84/0xd0)
May 22 00:50:04 cubox kernel: [76591.695922] [<c00f70e8>] (sys_mkdirat+0x84/0xd0) from [<c0035e40>] (ret_fast_syscall+0x0/0x2c)

(I assume you only need the mkdir stack, and not the whole thing)

@ryao
Contributor

ryao commented May 22, 2012

The entire stack trace would be more helpful.

@lundman
Contributor Author

lundman commented May 22, 2012

My apologies;

http://lundman.net/ftp/kern.log

@behlendorf
Contributor

Nothing looks horribly wrong in the stack traces. Your mkdir is waiting on the syncing txg, which appears to be waiting on a disk I/O. There's nothing abnormal about that.

However, you might be suffering from some of the other known 32-bit issues. Can you try the following:

  1. Increase the virtual address region for the kernel by setting the vmalloc=512M boot option.

  2. Try the patch in pull request #651 (OS-682: ARC minimum is set too high, potentially blocking VM creation); others have reported that limiting the ARC in this way significantly helps 32-bit machines.

@lundman
Contributor Author

lundman commented May 23, 2012

I have tried option 1) by specifying the bootarg line, changing vmalloc=384M to vmalloc=500M.

No difference.

  2. Applying the 4-line patch and rebooting;

No difference.

  3. Doing both at the same time;

No difference.

Stack trace of zfs create;

http://lundman.net/ftp/kern2.log

@lundman
Contributor Author

lundman commented May 24, 2012

Ok, so I still had zfs-0.6.0rc8 on disk, which works.

So I load all .ko's except for zfs.ko, and it still works.

# insmod ../zfs-0.6.0rc8/modules/zfs/zfs.ko
# insmod modules/zpios.ko
# zpool create -f ROGER /dev/sdb1
# zfs create ROGER/moore
# mkdir /ROGER/moore/a
# mkdir /ROGER/moore/b
# mkdir /ROGER/moore/c

# zpool export ROGER
# rmmod zpios.ko zfs.ko
# insmod modules/zfs/zfs.ko
# insmod modules/zpios/zpios.ko
# zpool create -f ROGER /dev/sdb1
# zfs create ROGER/moore 
hung

So, there is something in github HEAD that was not in 0.6.0rc8, and it is in the zfs.ko module. I can use all kernel modules from HEAD except zfs.ko (taking that one from the 0.6.0rc8 tarball), and it still works. (And it is a little pleasing to know that the ARM divmod code, which is in SPL, is not related at all.)

Does that give us a lead?

@behlendorf
Contributor

The post -rc8 change set is relatively small. You could bisect these changes to find the offending patch. However, if I were to make an educated guess I'd try reverting 302f753 first. This was the most disruptive change and while it's been seriously tested on x86_64 there may have been unexpected 32-bit side effects.

c421831 Define the needed ISA types for ARM
7101140 Revert "Disable direct reclaim on zvols"
3462fa2 Add mdadm and bc dependencies
b39d3b9 Linux 3.3 compat, iops->create()/mkdir()/mknod()
ce90208 Disable direct reclaim on zvols
518b487 Update ARC memory limits to account for SLUB internal fragmentation
302f753 Integrate ARC more tightly with Linux
afec56b Add zfs_mdcomp_disable module option
ad60af8 Illumos #2067: uninitialized variables in zfs(1M) may make snapshots und
95bcd51 Illumos #1946: incorrect formatting when listing output of multiple pool
187632d Illumos #952: separate intent logs should be obvious in 'zpool iostat' o
ebf8e3a Illumos #1909: disk sync write perf regression when slog is used post oi
409dc1a Use KM_PUSHPAGE in l2arc_write_buffers
cf81b00 ZFS list snapshot property alias
10b7549 ZFS snapshot alias
7d5cd71 Illumos #1346: zfs incremental receive may leave behind temporary clones
22cd4a4 Illumos #1475: zfs spill block hold can access invalid spill blkptr
5ffb9d1 Illumos #1951: leaking a vdev when removing an l2cache device
b129c65 OS-926: zfs panic in zfs_fill_zplprops_impl()
3adfc40 Illumos #1680: zfs vdev_file_io_start: validate vdev before using vdev_t
109491a Improve error message consistency
f4605f0 Document the zle compression algorithm
f0fd83b Export additional dsl symbols
1f0d8a5 Fixed a NULL pointer dereference bug in zfs_preumount
2ce9d0e Make Gentoo initscript use modinfo
847de12 Print human readable error message for ENOENT
fc41c64 Properly expose the mfu ghost list kstats
9fc6070 Remove hard-coded 80 column output

@lundman
Contributor Author

lundman commented May 25, 2012

First time with "git bisect", and I guess I have to reluctantly admit that to be pretty damn useful. :)

Outcome of bisect is:

# git bisect bad
302f753f1657c05a4287226eeda1f53ae431b8a7 is the first bad commit
commit 302f753f1657c05a4287226eeda1f53ae431b8a7
Author: Brian Behlendorf <[email protected]>
Date:   Tue Mar 13 14:29:16 2012 -0700

    Integrate ARC more tightly with Linux
[snip]
    Signed-off-by: Prakash Surya <[email protected]>
    Signed-off-by: Brian Behlendorf <[email protected]>

:040000 040000 ba5710ca97cd7febf6c3ff46f7d7dbbc2c4668ea 81c7a06fb482aa5ca2414b57d48166bd56ca8957 M      module

@lundman
Contributor Author

lundman commented May 25, 2012

To re-iterate what bisect already told us, I went to "master", and reverted only "302f753f1657c05a4287226eeda1f53ae431b8a7" and it does indeed work.

Doing some heavy use, I did find yet another problem, but it is separate from this one. Is it known? Should I create a new ticket?

zpool create -f rpool /dev/sdb1
zfs create -o compression=on -o dedup=on -o atime=off rpool/ROOT
zpool set bootfs=rpool/ROOT rpool
cd /rpool/ROOT/
rsync -ar --progress --exclude=rpool / ./
[snip very long rsync]
home/cubox/.xbmc/userdata/addon_data/weather.wunderground/settings.xml
         301 100%    7.94kB/s    0:00:00 (xfer#1189, to-check=1442/4159)
lib/klibc-hm10z2i-yCdl9nqO2zmVzbBKmPU.so
       32768  42%  727.27kB/s    0:00:00  
[  692.567244] BUG: soft lockup - CPU#0 stuck for 61s! [z_wr_iss/0:1056]
[  758.066863] BUG: soft lockup - CPU#0 stuck for 61s! [z_wr_iss/0:1056]

So, I got 1189 files copied!

@behlendorf
Contributor

OK, well that's progress. We'll need to dig in to exactly why those VM improvements cause trouble for 32-bit ARM systems. But frankly, until we move away from vmalloc() entirely, 32-bit systems are going to be a bit dodgy.

As for the soft lockups, those are already known and usually not fatal. They are just advisory that something is taking longer than expected. With both dedup and compression enabled, that's perhaps not too surprising.

@lundman
Contributor Author

lundman commented May 28, 2012

Actually, the soft lockups are not temporary; they are stalled for good. I left it over the weekend and it did not recover. And lockups still happen with compression and dedup off. But as you say, 32-bit has issues.

@behlendorf
Contributor

Bummer, well you should probably open a new issue for the soft lockup so we can at least track it. Please include the fact that it's a 32-bit system and the contents of dmesg which will have the full stacks associated with those soft lockup messages.

@behlendorf
Contributor

Oh, and on an unrelated note I may set up an ARM and PPC VM to ensure I don't break those builds going forward.

@lundman
Contributor Author

lundman commented May 29, 2012

One of the troubles with the lockup is that it is indeed 'locked up'. I cannot run more commands, nor do the logs after a reboot contain any messages, including on the serial console.
I've been using the CuBox and Mele A2000 so far, but a VM is probably a better idea :)

@lundman
Contributor Author

lundman commented Sep 28, 2012

Hey guys, not sure if I should add on here, or re-create a ticket. I was distracted for a bit.

Trying zfs-0.6.0rc11

It compiles 100% without modifications, very nice. Compiled with debug (removed -Werror though). Even though we know which commit I can revert to fix things, here is what happens with the sources intact; note the newer kernel.

Linux cubox 3.4.6 #8 PREEMPT Sun Jul 22 14:18:53 MDT 2012 armv7l GNU/Linux

# insmod ../spl-0.6.0-rc11/module/spl/spl.ko && insmod ./module/avl/zavl.ko && insmod ./module/nvpair/znvpair.ko && insmod ./module/unicode/zunicode.ko && insmod ./module/zcommon/zcommon.ko && insmod ./module/zfs/zfs.ko && insmod ./module/zpios/zpios.ko

Sep 28 04:23:25 cubox kernel: SPL: Loaded module v0.6.0-rc11
Sep 28 04:23:25 cubox kernel: zavl: module license 'CDDL' taints kernel.
Sep 28 04:23:25 cubox kernel: Disabling lock debugging due to kernel taint
Sep 28 04:23:25 cubox kernel: ZFS: Loaded module v0.6.0-rc11 (DEBUG mode), ZFS pool version 28, ZFS filesystem version 5

root@cubox:/media/hdd/src/zfs-0.6.0-rc11# zpool create -f testpool /media/hdd/zpool.dump 
Sep 28 04:23:57 cubox kernel: SPL: Ignoring the /etc/hostid file because it is 0 bytes; expecting 4 bytes instead.
Sep 28 04:23:57 cubox kernel: SPL: using hostid 0xffffffff
Sep 28 04:23:57 cubox kernel: failure to allocate a tage (0)
Sep 28 04:24:32 cubox kernel: last message repeated 647 times

Note that first allocate error, not entirely sure what I am doing wrong already. Everything still works though.

root@cubox:/media/hdd/src/zfs-0.6.0-rc11# mkdir /testpool/ddiiieeee
Sep 28 04:25:33 cubox kernel: last message repeated 88981 times
Sep 28 04:26:34 cubox kernel: last message repeated 221497 times
Sep 28 04:27:35 cubox kernel: last message repeated 221384 times

Process never returns.

What does ZFS_DEBUG do? I was expecting some more messages (but I suppose the tage errors are those).

"echo t >/proc/sysrq-trigger" output here:

http://lundman.net/ftp/kern3.log

(Not sure why it looks bad at the start; that is how it is in the machine's /var/log/messages, unless the write was corrupted.)

@behlendorf
Contributor

The "failure to allocate a tage (0)" errors are an issue in the SPL debug log subsystem, probably aggravated by the limited memory on your system. You can likely resolve the issue by clearing the debug mask for the SPL internal log. Longer term this should just get wired in to the kernel's internal ftrace buffer.

echo 0 >/proc/sys/kernel/spl/debug/mask

@lundman
Contributor Author

lundman commented Oct 10, 2012

Ah heh, well, I had hoped to turn on debug, so that perhaps I would get some hints as to where the problem lies. Setting debug/mask to 0 just seems to shut it up.

Anyway, I went with the old trusted 'litter everything with prints' style debugging, changing all dprintf to printk, as well as a few of my own:

 + arc_tempreserve_space: reserve 3031040
throttle(avail_mem 358219776)
. arc_tempreserve_space: throttle error 11
+ arc_tempreserve_space: reserve 3031040
throttle(avail_mem 358219776)
. arc_tempreserve_space: throttle error 11
Oct 10 06:57:07 cubox kernel: last message repeated 449 times

forever. No change in the numbers either. I guess it is stuck in a busy loop.
This error comes from

    if (available_memory <= zfs_write_limit_max) {
        ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
        DMU_TX_STAT_BUMP(dmu_tx_memory_reclaim);
        return (EAGAIN);
    }

Because I like to shoot from the hip, I changed the return to "0", and we get:

[2]+  Done                    mkdir /testpool/gaga
root@cubox:~# ls -la /testpool/
total 7
drwxr-xr-x  3 root root    3 Oct 10 07:07 .
drwxr-xr-x 31 root root 4096 Sep 27 07:40 ..
drwxr-xr-x  2 root root    2 Oct 10 07:07 gaga

I would have expected crashes, not that it pretends to work.

root@cubox:~# zfs create testpool/newfs
root@cubox:~# zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
testpool         128K   984M    31K  /testpool
testpool/newfs    30K   984M    30K  /testpool/newfs
root@cubox:~# ls -l /testpool/
total 3
drwxr-xr-x 2 root root 2 Oct 10 07:07 gaga
drwxr-xr-x 2 root root 2 Oct 10 07:09 newfs
root@cubox:~# ls -l /testpool/newfs/
total 0
root@cubox:~# mkdir /testpool/newfs/deep
root@cubox:~# 
root@cubox:~# ls -l
total 100
-rw-r--r-- 1 root root 100829 May 23 06:09 stacktrace2.log
root@cubox:~# cp stacktrace2.log /testpool/newfs/deep/
root@cubox:~# sync
root@cubox:~# ls -l  /testpool/newfs/deep
total 99
-rw-r--r-- 1 root root 100829 Oct 10 07:09 stacktrace2.log

I assume it isn't just that easy (cos it never is), so you probably have a deeper understanding of the problem.

# time rsync -a --progress linux-3.4.6 /testpool/
oh crap, bad idea, lots of debug on :)
^C ^C
# ls -lR /testpool/linux-3.4.6/ | wc -l
1249

@behlendorf
Contributor

@lundman No, it really could be just that easy. Very little of the code is architecture specific so I don't expect any breakage except in regards to memory management. ZFS likes to proactively keep tabs on the available system resources and throttle its behavior accordingly. It seems those heuristics are just going horribly wrong on the more constrained 32-bit system.

In this case it appears the logic which throttles TXG formation is causing the issue. Handily, there's a kstat file to observe this behavior; can you post the contents of /proc/spl/kstat/zfs/dmu_tx when you hit this issue? You should see dmu_tx_memory_reclaim and a few other counters increasing rapidly; notice the DMU_TX_STAT_BUMP(dmu_tx_memory_reclaim) line in the code you posted above.

Looking at arc_memory_throttle I'd normally suspect the #if defined(__i386) chunk, which looks wrong to me and would force available_memory quite low. But since you're running an ARM system, that's clearly not it in your case. Perhaps more likely, zfs_write_limit_max is being set too high by default. You can decrease this value directly by setting it as a module option. Or better yet, just increase the zfs_write_limit_shift module option. By default it's 3, which means 1/8 of memory; increase it to 4 (1/16) or 5 (1/32) and see if it helps.

I'd also be interested in adding a printk to that if conditional to print those values when it fails. It appears you already did some of this; the available memory you printed (~300MB) should be plenty, so I suspect zfs_write_limit_max should be reduced by default.

@lundman
Contributor Author

lundman commented Oct 11, 2012

Oh sorry, yes I forgot those printk's. I have rsynced the kernel, then built it all, and created swap and attached. This version of zfs is running very well (so far).

    if (available_memory <= zfs_write_limit_max) {
        printk("throttle(avail_mem %llu <= %llu write_limit_max))",
            available_memory, zfs_write_limit_max);
        ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
        DMU_TX_STAT_BUMP(dmu_tx_memory_reclaim);
        return 0;
        return (EAGAIN);
    }

Outputs:

throttle(avail_mem 337838080 <= 8646911284635238400 write_limit_max))

Ah I see, so all systems with less memory than 8646911284635238400 are considered small. You guys with your super computers...

@lundman
Contributor Author

lundman commented Oct 11, 2012

And would you know it, no more issues. No more printing from throttle(avail_mem) at least.

I feel GCC should complain like crazy here. Unless "unsigned long" means different things when compiled as _KERNEL versus when not?

Actually, it can. There's no guarantee that unsigned long means the same thing in user space and kernel space. It depends on the architecture; for ARM, I have no idea without testing it. Also remember you could have a 64-bit kernel and 32-bit user space; that's obviously not the case here, but it's another reason they could differ. This is one of the better reasons to stick to the u32/s32, u64/s64, etc. types.

Anyway, it's clear we need to make sure this is a 64-bit type. We'll need to promote it, as you did, to an 'unsigned long long' and update the module_param() chunk at the bottom... there may be other issues like this for ARM....

The extra stacks on the console are a known issue which was fixed in master. Grab the latest source and they'll go away, or you can ignore them for now since they are entirely harmless. It's just some debugging which in this case is related to dedup.

Why don't you kick the tires a little bit more and see how it hangs together. You're blazing a new trail here.

@ryao
Contributor

ryao commented Oct 11, 2012

@lundman The issue is the following:

./module/zfs/dsl_pool.c:unsigned long zfs_write_limit_max = 0;          /* max data payload per txg */
./module/zfs/arc.c:extern uint64_t zfs_write_limit_max;

zfs_write_limit_max will be 32-bits on a 32-bit system and 64-bits on a 64-bit system. However, ARC will always access it as a 64-bit value, causing it to read an adjacent value in the upper 32-bits. This also means that we are clobbering the adjacent value during ARC initialization.

This could explain the issues that people are having on 32-bit systems, which always seemed to revolve around this code. Lacking the time needed to set up a 32-bit system, the best that I could do was eyeball this, and unfortunately this is a subtle detail that is unnoticeable unless you are looking at multiple files.

Other implementations appear to make this 64-bit, but the Linux kernel neither permits us to specify 64-bit module parameters on 32-bit systems nor would they make sense, so we should just replace uint64_t with unsigned long in the extern declaration.

It would be a good idea to review all variables declared extern uint64_t to see if they are being declared unsigned long. Someone should especially review the variables introduced by c409e46, which is what introduced this bug.

With that said, I have opened pull request #1034 to address this issue. @lundman, I am confident that this fix is correct, but it still needs to be tested. After you have tested it, I would appreciate it if you would also test the following:

ryao@a6d6269

It was originally intended to improve memory requirements on 32-bit systems, although any benefit that it provided was completely negated by this issue.

@ryao
Contributor

ryao commented Oct 11, 2012

@lundman Looking through my email, it appears that you deleted the message where you said that you found the extern issue. In the future, please don't do that. I spent some time on this after reading throttle(avail_mem 337838080 <= 8646911284635238400 write_limit_max)), time that would have been avoided had I known you had already caught the cause of the issue.

@lundman
Contributor Author

lundman commented Oct 11, 2012

I didn't delete anything, but there sure is something weird going on. One comment looks like it is from me, but I didn't write it :)

But yes, it is definitely allocating space as 32-bit, while arc.c is accessing it as 64-bit. I simply trusted that GCC would warn me about that, but it turns out it can only do that at link time, and we are building kernel modules, which rely on insmod to do the linking. Or such is my understanding. Let me catch up on the proposed patches and I will try them out.

@ryao
Contributor

ryao commented Oct 11, 2012

Your understanding is correct. This is a perfect example of why global variables can be dangerous in C.

@lundman
Contributor Author

lundman commented Oct 11, 2012

I did some grep'ing; externs are few, and no other uint64s around. Swapping my fix for yours just to double-check.

@ryao
Contributor

ryao commented Oct 11, 2012

As an additional comment, it might be wise to either remove the #if defined(__i386) ... #endif preprocessor directive from arc.c or change it to a generic 32-bit check. The code it wraps should be present on ARM (and all other 32-bit architectures), but is absent due to the preprocessor directive.

@lundman
Contributor Author

lundman commented Oct 11, 2012

It seems to me that the ryao@a6d6269 patch is not entirely correct? Isn't the non-_KERNEL code path left with an empty 'if' conditional, without a statement? Or is the if conditional immediately following supposed to be its statement? The indentation goes crazy in either case.

#else
    if (zfs_arc_max > 16<<20 && zfs_arc_max < physmem * PAGESIZE)
#endif
    if (zfs_arc_min > 16<<20 && zfs_arc_min <= arc_c_max)
        arc_c_min = zfs_arc_min;

@ryao
Contributor

ryao commented Oct 11, 2012

Thanks for catching that. ryao/zfs@2074c38 should correct it.

@lundman
Contributor Author

lundman commented Oct 11, 2012

Heh ok. So I apologise for the primitive testing, but as a quick test:

# mkdir /testpool/pre-patch
# cd /testpool/pre-patch
# time tar -xf /media/hdd/src/linux-3.4.6.tar.bz2
real    2m54.726s

Then I deleted that, compiled with ryao@2074c38 and rebooted.

# mkdir /testpool/post-patch
# cd /testpool/post-patch
# time tar -xf /media/hdd/src/linux-3.4.6.tar.bz2
real    2m55.152s

The patch most likely does not affect this output, but it also did not break anything. Both versions output a large amount of:

Oct 11 06:53:16 cubox kernel: throttle(avail_mem 73011200 <= 83886080 write_limit_max)
Oct 11 06:53:30 cubox kernel: last message repeated 9068 times

So that code is triggered a bit more normally now.

Actually, I should probably revert back to returning an error again.

@ryao
Contributor

ryao commented Oct 11, 2012

How much RAM do you have? It seems like the present issue is that ZFS doesn't scale well to the amount of memory that your system has. Would you provide the contents of /proc/spl/kstat/zfs/arcstats?

Also, is the mkdir issue still present?

@ryao
Contributor

ryao commented Oct 11, 2012

The advice that FreeBSD gives to people on i386 might be helpful:

http://wiki.freebsd.org/ZFSTuningGuide#i386

The equivalent to vm.kmem_size on Linux is the vmalloc kernel commandline parameter. With that said, it would be useful to have the output of cat /proc/meminfo | grep Vm so we know what vmalloc is on your system. On i386, it is supposed to be 100MB.

@lundman
Contributor Author

lundman commented Oct 11, 2012

Most of my testing is on the CuBox, which has 1GB ram. Although, I think I boot with vmalloc=384M if that matters.

Interestingly, I changed the throttle function back to return EAGAIN instead of my "0", and:

# time tar -xf /media/hdd/src/linux-3.4.6.tar.bz2
real    5m43.259s

Which is quite a bit slower. But I suppose it has to retry after all. More importantly, with the 32/64-bit fix, all the ARM/32-bit problems are gone. So mkdir is not having issues. I would consider this issue to be closed with the ryao@4b431ae fix.

On the extern issue, I suppose compiling ZFS statically into the kernel would have shown that problem, but I have never tried myself. Can ZFS be compiled into the kernel?

@lundman
Contributor Author

lundman commented Oct 11, 2012

VmallocTotal:     385024 kB
VmallocUsed:      171672 kB
VmallocChunk:      31792 kB

I guess I should also increase the vmalloc variable. My background is Sun, so I am somewhat new to the Linux-specific tweaks.

@ryao
Copy link
Contributor

ryao commented Oct 11, 2012

I am not sure if compiling ZFS into the kernel would have exposed the extern issue. However, you can compile ZFS into your kernel. You will need to compile the SPL into the kernel too. See:

https://github.com/zfsonlinux/zfs/blob/master/README.markdown

With that said, I think you might want to try tuning the default module parameters. Here are the values that the FreeBSD wiki suggested:

zfs_arc_max=41943040
zfs_vdev_cache_size=5242880

@ryao
Contributor

ryao commented Oct 11, 2012

@lundman With regard to vmalloc, increasing it is probably a good idea. It might be worthwhile trying to set vmalloc=1G. The kernel has a 4GB address space and you only have 1GB of RAM; dedicating 1GB of the address space to virtual memory should minimize the effect of external fragmentation in that space with no side effects.

@lundman
Contributor Author

lundman commented Oct 11, 2012

Oh neat, so you can make static kernels. Thanks for the information.

I just checked my uboot, and perhaps I don't boot with vmalloc after all, either way, playing with your numbers, I get:

# time tar -xf /media/hdd/src/linux-3.4.6.tar.bz2
real    2m38.221s

Quite the difference. Possibly because there is not a single warning from throttle given.

@ryao
Contributor

ryao commented Oct 11, 2012

That is great news. It sounds like we might be able to make ARC pick better default values on systems with less memory.

Would you post the contents of /proc/spl/kstat/zfs/arcstats when the module is loaded with the default settings (and again when it is loaded with those tuned settings)?

@lundman
Contributor Author

lundman commented Oct 11, 2012

Ok, setting vmalloc=1G did not work so well; it kept shooting down my processes when it ran out of memory. So I tested 800M, which worked; result:

real    2m35.740s

A few seconds less.

Loading ZFS with default values:

Hmm, hold on, I need to figure out how to attach files.

@ryao
Contributor

ryao commented Oct 11, 2012

You can wrap the output with three tildes at the beginning and three tildes at the end to form a code block.

As for what happened with vmalloc=1GB, that suggests that my understanding of that tunable was incorrect. I will need to look into that, although that will have to wait for when I have more time.

@lundman
Contributor Author

lundman commented Oct 11, 2012

Ah they were just a bit long, so I thought attachment was better. Turns out you can't attach anyway.

Default:

4 1 0x01 77 3696 113230886357 137709963974
name                            type data
hits                            4    3199
misses                          4    559
demand_data_hits                4    3
demand_data_misses              4    1
demand_metadata_hits            4    3190
demand_metadata_misses          4    91
prefetch_data_hits              4    0
prefetch_data_misses            4    345
prefetch_metadata_hits          4    6
prefetch_metadata_misses        4    122
mru_hits                        4    1663
mru_ghost_hits                  4    0
mfu_hits                        4    1533
mfu_ghost_hits                  4    0
deleted                         4    42
recycle_miss                    4    0
mutex_miss                      4    0
evict_skip                      4    0
evict_l2_cached                 4    0
evict_l2_eligible               4    0
evict_l2_ineligible             4    6144
hash_elements                   4    519
hash_elements_max               4    519
hash_collisions                 4    34
hash_chains                     4    21
hash_chain_max                  4    1
p                               4    167772160
c                               4    335544320
c_min                           4    41943040
c_max                           4    335544320
size                            4    3617528
hdr_size                        4    177160
data_size                       4    3216896
other_size                      4    223472
anon_size                       4    16384
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    3082240
mru_evict_data                  4    2064896
mru_evict_metadata              4    765440
mru_ghost_size                  4    172032
mru_ghost_evict_data            4    8192
mru_ghost_evict_metadata        4    163840
mfu_size                        4    118272
mfu_evict_data                  4    0
mfu_evict_metadata              4    3584
mfu_ghost_size                  4    16896
mfu_ghost_evict_data            4    0
mfu_ghost_evict_metadata        4    16896
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_hdr_miss              4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_hdr_size                     4    0
memory_throttle_count           4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    1552632
arc_meta_limit                  4    83886080
arc_meta_max                    4    1706832

and with insmod ./module/zfs/zfs.ko zfs_arc_max=41943040 zfs_vdev_cache_size=5242880

4 1 0x01 77 3696 61461128665 87424519525
name                            type data
hits                            4    3113
misses                          4    121
demand_data_hits                4    3
demand_data_misses              4    1
demand_metadata_hits            4    3103
demand_metadata_misses          4    96
prefetch_data_hits              4    0
prefetch_data_misses            4    0
prefetch_metadata_hits          4    7
prefetch_metadata_misses        4    24
mru_hits                        4    1963
mru_ghost_hits                  4    0
mfu_hits                        4    1146
mfu_ghost_hits                  4    0
deleted                         4    42
recycle_miss                    4    0
mutex_miss                      4    0
evict_skip                      4    0
evict_l2_cached                 4    0
evict_l2_eligible               4    0
evict_l2_ineligible             4    6144
hash_elements                   4    77
hash_elements_max               4    77
hash_collisions                 4    6
hash_chains                     4    1
hash_chain_max                  4    1
p                               4    20971520
c                               4    41943040
c_min                           4    5242880
c_max                           4    41943040
size                            4    931576
hdr_size                        4    36472
data_size                       4    673792
other_size                      4    221312
anon_size                       4    16384
anon_evict_data                 4    0
anon_evict_metadata             4    0
mru_size                        4    654336
mru_evict_data                  4    0
mru_evict_metadata              4    304128
mru_ghost_size                  4    163840
mru_ghost_evict_data            4    8192
mru_ghost_evict_metadata        4    155648
mfu_size                        4    3072
mfu_evict_data                  4    0
mfu_evict_metadata              4    3072
mfu_ghost_size                  4    512
mfu_ghost_evict_data            4    0
mfu_ghost_evict_metadata        4    512
l2_hits                         4    0
l2_misses                       4    0
l2_feeds                        4    0
l2_rw_clash                     4    0
l2_read_bytes                   4    0
l2_write_bytes                  4    0
l2_writes_sent                  4    0
l2_writes_done                  4    0
l2_writes_error                 4    0
l2_writes_hdr_miss              4    0
l2_evict_lock_retry             4    0
l2_evict_reading                4    0
l2_free_on_write                4    0
l2_abort_lowmem                 4    0
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    0
l2_hdr_size                     4    0
memory_throttle_count           4    0
memory_direct_count             4    0
memory_indirect_count           4    0
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    931576
arc_meta_limit                  4    10485760
arc_meta_max                    4    1085776

ryao added a commit to ryao/zfs that referenced this issue Oct 11, 2012
Commit c409e46 introduced a number of module parameters. Of these, the ARC
unconditionally expected zfs_write_limit_max to be 64-bit. Unfortunately,
the largest integer type that Linux supports for module parameters is
unsigned long, which varies in size depending on the host system's native
word size. The effect was that on 32-bit systems, the ARC incorrectly
performed 64-bit operations on a 32-bit value by reading the neighboring
32 bits as the upper 32 bits of the 64-bit value. We correct that by
changing the extern declaration to use the unsigned long type. This should
make the ARC correctly treat zfs_write_limit_max as a 32-bit value on
32-bit systems.

Closes openzfs#749

Reported-by: Jorgen Lundman <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
@ryao
Contributor

ryao commented Oct 11, 2012

@lundman For what it is worth, your performance will likely become even better once pull request #1013 is merged.

@behlendorf
Contributor

@ryao Thanks for digging into this and getting to the real root cause, but I tweaked your patch slightly. This would have been caught as a conflicting-type error by gcc if the extern declarations were in a header included by both arc.c and dsl_pool.c. However, because they were just declared in arc.c, gcc couldn't catch it at compile time. This is a good example of why that shouldn't be done.

These zfs_write_limit_* tunings should really be private to dsl_pool.c and passed to the ARC as needed, but that's a bigger change than I want to make right now. So we'll just fix up the types and move them to a common header.

@lundman Thanks for getting the fix tested so promptly.

Hopefully, this gets us back to things being largely functional on 32-bit systems. I'm sure there's still lots of tuning to be done, but we can work through that in other issues.

@ryao
Contributor

ryao commented Oct 11, 2012

@behlendorf This will make ZFS functional on ARM systems, but i386 systems require some additional work. I asked @amospalla on IRC to test these patches for me on his i386 system and found an additional problem, conveniently wrapped by #if defined(__i386) ... #endif. I outlined what is happening there in issue #831.

unya pushed a commit to unya/zfs that referenced this issue Dec 13, 2013
Commit c409e46 introduced a
number of module parameters.  This required several types to be
changed to accommodate the Linux module parameter macros.

Unfortunately, arc.c contained its own extern definition of the
zfs_write_limit_max variable and its type was not updated to be
consistent with its dsl_pool.c counterpart.  If the variable had
been properly marked extern in a common header, then gcc would
have generated a warning and this would not have slipped through.

The result of this was that the ARC unconditionally expected
zfs_write_limit_max to be 64-bit. Unfortunately, the largest size
integer module parameter that Linux supports is unsigned long, which
varies in size depending on the host system's native word size. The
effect was that on 32-bit systems, ARC incorrectly performed 64-bit
operations on a 32-bit value by reading the neighboring 32 bits as
the upper 32 bits of the 64-bit value.

We correct that by changing the extern declaration to use the unsigned
long type and moving these extern definitions into the common arc.h
header. This should make the ARC correctly treat zfs_write_limit_max as a
32-bit value on 32-bit systems.

Reported-by: Jorgen Lundman <[email protected]>
Signed-off-by: Richard Yao <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#749