Use seccomp policy to avoid unnecessary sync operations #44

Merged: talex5 merged 2 commits into ocurrent:master from the fast-sync branch on Nov 25, 2020

@talex5 talex5 commented Nov 24, 2020

Sync operations are really slow on btrfs. They're also pointless, since if the computer crashes while we're doing a build then we'll just throw it away and start again anyway.

This commit provides a seccomp policy that causes all sync operations to "fail", with errno 0 ("success").

On my machine, this reduces the time to apt-get install -y shared-mime-info from 18.5s to 4.7s.

Based on https://bblank.thinkmo.de/using-seccomp-to-filter-sync-operations.html
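
To make the idea concrete, here is a minimal OCaml sketch of what such a policy might look like as the `seccomp` section of an OCI runtime config for runc. It is not the code from this PR: the helper names (`sync_syscalls`, `json_strings`, `seccomp_policy`), the exact syscall list, and the `SCMP_ACT_ALLOW` default are assumptions for illustration. The important part is the rule with action `SCMP_ACT_ERRNO` and `errnoRet` set to 0, so the filtered calls return immediately and the caller sees success.

```ocaml
(* Sketch only: builds the "seccomp" fragment of an OCI runtime config as a
   JSON string, using nothing beyond the standard library. *)

(* Assumed list of sync-related syscalls to short-circuit. *)
let sync_syscalls =
  ["fsync"; "fdatasync"; "msync"; "sync"; "syncfs"; "sync_file_range"]

(* Render a string list as a JSON array of strings.
   The names used here contain no characters that need escaping. *)
let json_strings xs =
  "[" ^ String.concat ", " (List.map (fun s -> "\"" ^ s ^ "\"") xs) ^ "]"

(* With [fast_sync], add one rule that makes every sync call "fail" with
   errno 0, which the caller observes as success. Without it, no rules. *)
let seccomp_policy ~fast_sync ~architectures =
  let syscalls =
    if fast_sync then
      Printf.sprintf
        {|[{"names": %s, "action": "SCMP_ACT_ERRNO", "errnoRet": 0}]|}
        (json_strings sync_syscalls)
    else "[]"
  in
  Printf.sprintf
    {|{"defaultAction": "SCMP_ACT_ALLOW", "architectures": %s, "syscalls": %s}|}
    (json_strings architectures) syscalls

let () =
  print_endline
    (seccomp_policy ~fast_sync:true ~architectures:["SCMP_ARCH_X86_64"])
```

With `~fast_sync:false` the `syscalls` list is empty, so sync calls reach the kernel as usual.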

avsm commented Nov 24, 2020

Good plan! I hadn't realised you could force syscalls to succeed using seccomp too.

talex5 commented Nov 24, 2020

Hmm, looks like the version of runc in Ubuntu 20.04 is too old for this.

Use `--fast-sync` to enable the new behaviour (requires the latest runc).
This should allow `linux32` to work.
match get_machine () with
| "x86_64" -> ["SCMP_ARCH_X86_64"; "SCMP_ARCH_X86"; "SCMP_ARCH_X32"]
| "aarch64" -> ["SCMP_ARCH_AARCH64"; "SCMP_ARCH_ARM"]
| _ -> []
Member

Could we enumerate this somehow so that it fails on an unknown arch? Otherwise we'll run into this when adding riscv32 in the future. (Or we could just make a note somewhere to update this when we get around to riscv32.)

Member

There is logic in https://github.com/avsm/osrelease/blob/master/lib/osrelease.ml that does all the arch detection (based on opam's). I could release it, if that helps.

Contributor Author

Could do. But when we add a new multi-arch platform then we'll test it and discover the problem immediately anyway.
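
For reference, the stricter variant suggested in this thread could be sketched roughly as follows. This is a hypothetical illustration, not code from the PR: the function name `seccomp_archs_of_machine` is invented, and it assumes the machine name is a string such as `uname -m` would report (the reviewed code gets it from `get_machine ()`). An unknown machine becomes an explicit `Error` rather than an empty list.

```ocaml
(* Map a machine name to the seccomp architectures to allow.
   Unknown machines are reported as an error instead of silently
   producing an empty list. *)
let seccomp_archs_of_machine = function
  | "x86_64" -> Ok ["SCMP_ARCH_X86_64"; "SCMP_ARCH_X86"; "SCMP_ARCH_X32"]
  | "aarch64" -> Ok ["SCMP_ARCH_AARCH64"; "SCMP_ARCH_ARM"]
  | other ->
    Error (Printf.sprintf "no seccomp architecture list for %S" other)

let () =
  List.iter
    (fun machine ->
       match seccomp_archs_of_machine machine with
       | Ok archs ->
         Printf.printf "%s: %s\n" machine (String.concat ", " archs)
       | Error msg -> Printf.printf "error: %s\n" msg)
    ["x86_64"; "riscv32"]
```

The trade-off is the one raised in the replies: failing early documents the gap, while the merged `| _ -> []` fallback relies on testing to reveal it when a new multi-arch platform is added.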

@talex5 talex5 merged commit f4eeaaf into ocurrent:master Nov 25, 2020
@talex5 talex5 deleted the fast-sync branch November 25, 2020 10:02

talex5 commented Nov 25, 2020

Merging now to fix cluster performance problems. Can be improved later if needed.

patricoferris pushed a commit to patricoferris/obuilder that referenced this pull request Nov 27, 2020
Might help with problems such as this:

```
[11030132.006555] INFO: task ocluster-worker:602217 blocked for more than 120 seconds.
[11030132.015596]       Not tainted 5.4.0-40-generic #44-Ubuntu
[11030132.022547] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11030132.032061] ocluster-worker D    0 602217      1 0x00004000
[11030132.032069] Call Trace:
[11030132.032092]  __schedule+0x2e3/0x740
[11030132.032106]  ? __switch_to_asm+0x40/0x70
[11030132.032116]  ? __switch_to_asm+0x34/0x70
[11030132.032126]  schedule+0x42/0xb0
[11030132.032130]  schedule_preempt_disabled+0xe/0x10
[11030132.032132]  __mutex_lock.isra.0+0x182/0x4f0
[11030132.032142]  ? try_to_del_timer_sync+0x54/0x80
[11030132.032145]  __mutex_lock_slowpath+0x13/0x20
[11030132.032148]  mutex_lock+0x2e/0x40
[11030132.032199]  btrfs_start_delalloc_roots+0x60/0x280 [btrfs]
[11030132.032238]  flush_space+0x5dd/0x740 [btrfs]
[11030132.032281]  ? lock_extent_buffer_for_io+0x370/0x370 [btrfs]
[11030132.032325]  ? __clear_extent_bit+0x201/0x4a0 [btrfs]
[11030132.032372]  priority_reclaim_metadata_space.isra.0+0x18b/0x220 [btrfs]
[11030132.032429]  ? can_overcommit.part.0+0x5f/0xc0 [btrfs]
[11030132.032466]  btrfs_reserve_metadata_bytes+0x578/0x950 [btrfs]
[11030132.032501]  ? btrfs_truncate_inode_items+0x35e/0xdb0 [btrfs]
[11030132.032505]  ? __mutex_lock.isra.0+0x429/0x4f0
[11030132.032557]  ? __btrfs_block_rsv_release+0x1c1/0x300 [btrfs]
[11030132.032595]  btrfs_block_rsv_refill+0x7d/0xa0 [btrfs]
[11030132.032628]  evict_refill_and_join+0x39/0xd0 [btrfs]
[11030132.032670]  btrfs_evict_inode+0x417/0x4c0 [btrfs]
[11030132.032689]  evict+0xd2/0x1b0
[11030132.032698]  iput+0x148/0x210
[11030132.032708]  dentry_unlink_inode+0xc6/0x110
[11030132.032720]  d_delete+0x76/0x80
[11030132.032727]  vfs_rmdir+0x179/0x1a0
[11030132.032732]  do_rmdir+0x18c/0x1c0
[11030132.032736]  __x64_sys_rmdir+0x17/0x20
[11030132.032744]  do_syscall_64+0x57/0x190
[11030132.032747]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
```
talex5 added a commit to talex5/opam-repository that referenced this pull request Dec 30, 2020
CHANGES:

- Add support for nested / multi-stage builds (@talex5 ocurrent/obuilder#48 ocurrent/obuilder#49).
  This allows you to use a large build environment to create a binary and then
  copy that into a smaller runtime environment. It's also useful to get better caching
  if two things can change independently (e.g. you want to build your software and also
  a linting tool, and be able to update either without rebuilding the other).

- Add healthcheck feature (@talex5 ocurrent/obuilder#52).
  - Checks that Docker is running.
  - Does a test build using busybox.

- Clean up left-over runc containers on restart (@talex5 ocurrent/obuilder#53).
  If btrfs crashes and makes the filesystem read-only then after rebooting there will be stale runc directories.
  New jobs with the same IDs would then fail.

- Remove dependency on dockerfile (@talex5 ocurrent/obuilder#51).
  This also allows us more control over the formatting
  (e.g. putting a blank line between stages in multi-stage builds).

- Record log output from docker pull (@talex5 ocurrent/obuilder#46).
  Otherwise, it's not obvious why we've stopped at a pull step, or what is happening.

- Improve formatting of OBuilder specs (@talex5 ocurrent/obuilder#45).

- Use seccomp policy to avoid unnecessary sync operations (@talex5 ocurrent/obuilder#44).
  Sync operations are really slow on btrfs. They're also pointless,
  since if the computer crashes while we're doing a build then we'll just throw it away and start again anyway.
  Use a seccomp policy that causes all sync operations to "fail", with errno 0 ("success").
  On my machine, this reduces the time to `apt-get install -y shared-mime-info` from 18.5s to 4.7s.
  Use `--fast-sync` to enable the new behaviour (it requires runc 1.0.0-rc92).

- Use a mutex to avoid concurrent btrfs operations (@talex5 ocurrent/obuilder#43).
  Btrfs deadlocks enough as it is. Don't stress it further by trying to do two things at once.

Internal changes:

- Improve handling of file redirections (@talex5 ocurrent/obuilder#46).
  Instead of making the caller do all the work of closing the file descriptors safely, add an `FD_move_safely` mode.

- Travis tests: ensure apt cache is up-to-date (@talex5 ocurrent/obuilder#50).