c-blake/batch

While adding a new syscall was one of the original marketing points of the Linux loadable module system, Linux 6.9 turned syscall dispatch into one big switch(), which defeats adding syscalls from a module. At the low cost of one fd per process, this idea could instead become a device driver (e.g. writing a batch struct to the device to run it), but that is pending work and may have very different performance characteristics.

batch: Generic Linux System Call Batching

Kernel <-> user crossings are expensive. Across any such boundary (IPC or network messages are other obvious cases), it makes sense to do as much work per crossing as possible. Batching is one approach. In the Linux user-kernel setting, privilege checks and the like are already done inside system call code. So, security implications should be minimal, and there should be no need to restrict code uploaders to privileged users or to verify code as eBPF does. All we need to do to add this to Linux is decide on a convenient API to loop over an array of system calls, storing results into an array of return values. That's what this package does.

Only minimal in-kernel control flow is provided, so work is trivially loop-free and bounded by batch size. Specifically, a batch can only jump forward by 1 or more slots in the array of system calls, branching three ways on each return value: { < 0, == 0, > 0 }. By convention, these returns usually mean { error, success/done/EOF, more work or an answer }.

Not all system call targets are allowed, since some have call-return protocols that differ from the usual, e.g. fork, exec, or batch itself (to prevent loops spelled as recursion). Blocked system calls get -ENOSYS in the return value slot.

One "fake" syscall is implemented in-line in the sub-call dispatch loop: a word copy, which allows chaining outputs of one call to inputs of subsequent calls. Any unimplemented or always-failing call can be used to skip blocks meant as error or alternate paths. This allows representing a great many multi-call programs.

This kind of interface is easy to "emulate" in pure user-space code when a deployment system has no sys_batch available. include/linux/batch.h has such an emulator activated by BATCH_EMUL being set. Such emulation is also useful to benchmark improvement due to the system call.
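The dispatch rules described above (forward-only jumps chosen by the sign of each return, and -ENOSYS for blocked calls) can be sketched in user space roughly as follows. This is not the emulator in include/linux/batch.h: the demo_call_t layout, its field names, and the demo_denied and demo_batch functions are hypothetical, chosen only to illustrate the idea, and the deny list shown is a small stand-in.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical record layout for illustration; NOT the layout in batch.h. */
typedef struct {
    long nr;         /* syscall number, e.g. SYS_getpid */
    long arg[6];     /* up to six arguments; unused ones are ignored */
    int  skip_neg;   /* extra slots to skip when the return is < 0 */
    int  skip_zero;  /* extra slots to skip when the return is == 0 */
    int  skip_pos;   /* extra slots to skip when the return is > 0 */
} demo_call_t;

/* Tiny stand-in deny list: calls whose return protocol does not fit. */
int demo_denied(long nr)
{
#ifdef SYS_fork
    if (nr == SYS_fork) return 1;
#endif
    return nr == SYS_clone || nr == SYS_execve;
}

/* Run the batch; returns how many slots were executed. Slots that are
 * skipped keep whatever the caller put in rets[] beforehand. */
long demo_batch(const demo_call_t *calls, long *rets, int n)
{
    assert(calls && rets && n >= 0);
    long executed = 0;
    int i = 0;
    while (i < n) {
        const demo_call_t *c = &calls[i];
        long r;
        if (demo_denied(c->nr)) {
            r = -ENOSYS;                  /* blocked call */
        } else {
            r = syscall(c->nr, c->arg[0], c->arg[1], c->arg[2],
                        c->arg[3], c->arg[4], c->arg[5]);
            if (r == -1) r = -errno;      /* mimic the raw kernel return */
        }
        rets[i] = r;
        int skip = r < 0 ? c->skip_neg : r == 0 ? c->skip_zero : c->skip_pos;
        if (skip < 0) skip = 0;           /* never jump backward */
        i += 1 + skip;                    /* always advance: loop-free */
        executed++;
    }
    return executed;
}
```

Because every iteration advances by at least one slot, the loop runs at most n times, which is the "bounded by batch size" property. An in-kernel version would do the same walk without returning to user space between slots.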

That is fairly abstract. A demo total.c may help. Another example is file tree walking (ftw) when user code needs file metadata (sizes, times, owners, ...). getdents64 is already a batch interface, but the stat calls are not. du is a classic example here. In personal timings, I see ~1.3x speed-ups for mdu.c over BATCH_EMUL=1 mdu (and ~1.7x speed-ups vs GNU du, since the latter probably uses an ftw more expensive than ftw.c in order to handle tree depths unbounded by open fd limits).

Another natural example is "path search" wherein a user program attempts several-to-many easily pre-computed paths, stopping at the first one which succeeds/has a syscall return 0. This would be a syscall_t array with many scall3(open, 0, -1, -1, pathX, flags, 0) entries. In reality, examples are limited only by one's imagination, but be forewarned that much system work interacting with real devices is dominated by much larger times & overheads.
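For comparison, here is a minimal sketch of the unbatched loop that such a path-search batch would replace. It makes one kernel crossing per candidate path, whereas the batch would submit all candidates in a single crossing. The function name first_openable and its signature are hypothetical and are not part of this package.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try each path in order and stop at the first open() that succeeds.
 * Returns the fd and stores the winning index in *which; on failure,
 * returns the negated errno of the last attempt and sets *which to -1. */
int first_openable(const char *const *paths, int n, int flags, int *which)
{
    assert(paths && which && n >= 0);
    int err = -ENOENT;                 /* result if n == 0 */
    *which = -1;
    for (int i = 0; i < n; i++) {
        int fd = open(paths[i], flags);   /* one crossing per candidate */
        if (fd >= 0) {
            *which = i;
            return fd;
        }
        err = -errno;
    }
    return err;
}
```

A caller might pass a list such as a per-user config path followed by a system-wide one, then read from whichever fd comes back.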

It's also not hard to imagine compilers that detect batchable situations automatically and even auto-convert them. On the source side, this is not so different from auto-vectorization. The target language is also not so far from an assembly language with no backward jumps.

Oh, and, as set up right now, it only works on Linux x86_64 for kernels in the late 4.* to present 6.* version ranges. It might work on earlier 3.x versions, but I haven't tested it on such. For my development convenience, I hacked it up as a module hijacking the afs_syscall slot. Usage should be as easy as:

git clone https://github.com/c-blake/batch $HOME/s/bat
cd $HOME/s/bat/module; ./build
as-root insmod batch.ko
cd ../examples; make
./mdu
BATCH_EMUL=1 ./mdu
du -sbl

You can, e.g., run strace ./mdu to see whether afs_syscall is being used. You may need to set CONFIG_RANDOMIZE_BASE=n in your kernel config, or at least reboot with nokaslr on the kernel command line, to get the module inserted.

At present, I would not recommend deploying this on a system with untrusted user code. The deny list hasn't been vetted for security implications or interactions with syscall auditing. Still, it seemed worth sharing and getting feedback on.