Skip to content

Commit

Permalink
OpenZFS 8909 - 8585 can cause a use-after-free kernel panic
Browse files Browse the repository at this point in the history
Authored by: Prakash Surya <[email protected]>
Reviewed by: John Kennedy <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Reviewed by: George Wilson <[email protected]>
Reviewed by: Brad Lewis <[email protected]>
Reviewed by: Igor Kozhukhov <[email protected]>
Reviewed by: Brian Behlendorf <[email protected]>
Approved by: Robert Mustacchi <[email protected]>
Ported-by: Prakash Surya <[email protected]>

PROBLEM
=======

There's a race condition that exists if `zil_free_lwb` races with either
`zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.

Here's an example panic due to this bug:

    > ::status
    debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
    operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
    image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
    panic message:
    BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in module "zfs" due to a NULL pointer dereference
    dump content: kernel pages only

    > $c
    zio_shrink+0x12()
    zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20)
    zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8)
    zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8)
    zil_commit+0x80(ffffff03dcd15cc0, 9a9)
    zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
    fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
    write+0x250(42, fffffd7ff4832000, 2000)
    sys_syscall+0x177()

If there's an outstanding lwb that's in `zil_commit_waiter_timeout`
waiting to timeout, waiting on it's waiter's CV, we must be sure not to
call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the `zil_commit_waiter_t` structure of the
thread waiting on the waiter's CV is used.

A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
`zil_free_lwb`, which will result in a use-after-free situation when the
lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.

This race condition is prevented in `zil_close` by calling `zil_commit`
before `zil_free_lwb` is called, which will ensure all outstanding (i.e.
all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states)
reach the `LWB_STATE_DONE` state before the lwb's are freed
(`zil_commit` will not return untill all the lwb's are
`LWB_STATE_DONE`).

Further, this race condition is prevented in `zil_sync` by only calling
`zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set.
All lwb's not in the `LWB_STATE_DONE` state will have a non-null value
for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.

This race *is* present in `zil_suspend`, leading to this bug.

At first glance, it would appear as though this would not be true
because `zil_suspend` will call `zil_commit`, just like `zil_close`, but
the problem is that `zil_suspend` will set the zilog's `zl_suspend`
field prior to calling `zil_commit`. Further, in `zil_commit`, if
`zl_suspend` is set, `zil_commit` will take a special branch of logic
and use `txg_wait_synced` instead of performing the normal `zil_commit`
logic.

This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will
maintain a non-null value for it's `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually timeout and be issued to disk even though the write is
unnecessary.

So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are not outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used `zil_commit_waiter_timeout`.

SOLUTION
========

The solution to this, is to ensure all outstanding lwb's complete before
calling `zil_free_lwb` via `zil_destroy` in `zil_suspend`. This patch
accomplishes this goal by forcing the normal `zil_commit` logic when
called from `zil_sync`.

Now, `zil_suspend` will call `zil_commit_impl` which will always use the
normal logic of waiting/issuing lwb's to disk before it returns. As a
result, any lwb's outstanding when `zil_commit_impl` is called will be
guaranteed to reach the `LWB_STATE_DONE` state by the time it returns.

Further, no new lwb's will be created via `zil_commit` since the zilog's
`zl_suspend` flag will be set. This will force all new callers of
`zil_commit` to use `txg_wait_synced` instead of creating and issuing
new lwb's.

Thus, all lwb's left on the zilog's lwb list when `zil_destroy` is
called will be in the `LWB_STATE_DONE` state, and we'll avoid this race
condition.

OpenZFS-issue: https://www.illumos.org/issues/8909
OpenZFS-commit: openzfs/openzfs@ece62b6f8d
Closes #6940
  • Loading branch information
Prakash Surya authored and behlendorf committed Dec 28, 2017
1 parent 823d48b commit 2fe61a7
Show file tree
Hide file tree
Showing 9 changed files with 254 additions and 110 deletions.
1 change: 1 addition & 0 deletions include/sys/zil.h
Original file line number Diff line number Diff line change
Expand Up @@ -494,6 +494,7 @@ extern void zil_itx_destroy(itx_t *itx);
extern void zil_itx_assign(zilog_t *zilog, itx_t *itx, dmu_tx_t *tx);

extern void zil_commit(zilog_t *zilog, uint64_t oid);
extern void zil_commit_impl(zilog_t *zilog, uint64_t oid);

extern int zil_vdev_offline(const char *osname, void *txarg);
extern int zil_claim(struct dsl_pool *dp,
Expand Down
39 changes: 32 additions & 7 deletions include/sys/zil_impl.h
Original file line number Diff line number Diff line change
Expand Up @@ -36,13 +36,38 @@ extern "C" {
#endif

/*
* Possbile states for a given lwb structure. An lwb will start out in
* the "closed" state, and then transition to the "opened" state via a
* call to zil_lwb_write_open(). After the lwb is "open", it can
* transition into the "issued" state via zil_lwb_write_issue(). After
* the lwb's zio completes, and the vdev's are flushed, the lwb will
* transition into the "done" state via zil_lwb_write_done(), and the
* structure eventually freed.
* Possible states for a given lwb structure.
*
* An lwb will start out in the "closed" state, and then transition to
* the "opened" state via a call to zil_lwb_write_open(). When
* transitioning from "closed" to "opened" the zilog's "zl_issuer_lock"
* must be held.
*
* After the lwb is "opened", it can transition into the "issued" state
* via zil_lwb_write_issue(). Again, the zilog's "zl_issuer_lock" must
* be held when making this transition.
*
* After the lwb's zio completes, and the vdev's are flushed, the lwb
* will transition into the "done" state via zil_lwb_write_done(). When
* transitioning from "issued" to "done", the zilog's "zl_lock" must be
* held, *not* the "zl_issuer_lock".
*
* The zilog's "zl_issuer_lock" can become heavily contended in certain
* workloads, so we specifically avoid acquiring that lock when
* transitioning an lwb from "issued" to "done". This allows us to avoid
* having to acquire the "zl_issuer_lock" for each lwb ZIO completion,
* which would have added more lock contention on an already heavily
* contended lock.
*
* Additionally, correctness when reading an lwb's state is often
* achieved by exploiting the fact that these state transitions occur in
* this specific order; i.e. "closed" to "opened" to "issued" to "done".
*
* Thus, if an lwb is in the "closed" or "opened" state, holding the
* "zl_issuer_lock" will prevent a concurrent thread from transitioning
* that lwb to the "issued" state. Likewise, if an lwb is already in the
* "issued" state, holding the "zl_lock" will prevent a concurrent
* thread from transitioning that lwb to the "done" state.
*/
typedef enum {
LWB_STATE_CLOSED,
Expand Down
1 change: 0 additions & 1 deletion include/sys/zio.h
Original file line number Diff line number Diff line change
Expand Up @@ -598,7 +598,6 @@ extern enum zio_checksum zio_checksum_dedup_select(spa_t *spa,
extern enum zio_compress zio_compress_select(spa_t *spa,
enum zio_compress child, enum zio_compress parent);

extern void zio_cancel(zio_t *zio);
extern void zio_suspend(spa_t *spa, zio_t *zio);
extern int zio_resume(spa_t *spa);
extern void zio_resume_wait(spa_t *spa);
Expand Down
17 changes: 16 additions & 1 deletion man/man5/zfs-module-parameters.5
Original file line number Diff line number Diff line change
Expand Up @@ -775,6 +775,21 @@ Disable pool import at module load by ignoring the cache file (typically \fB/etc
Use \fB1\fR for yes (default) and \fB0\fR for no.
.RE

.sp
.ne 2
.na
\fBzfs_commit_timeout_pct\fR (int)
.ad
.RS 12n
This controls the amount of time that a ZIL block (lwb) will remain "open"
when it isn't "full", and it has a thread waiting for it to be committed to
stable storage. The timeout is scaled based on a percentage of the last lwb
latency to avoid significantly impacting the latency of each individual
transaction record (itx).
.sp
Default value: \fB5\fR.
.RE

.sp
.ne 2
.na
Expand Down Expand Up @@ -2120,7 +2135,7 @@ Default value: \fB0\fR.
.ad
.RS 12n
The maximum number of taskq entries that are allowed to be cached. When this
limit is exceeded itx's will be cleaned synchronously.
limit is exceeded transaction records (itxs) will be cleaned synchronously.
.sp
Default value: \fB1048576\fR.
.RE
Expand Down
Loading

0 comments on commit 2fe61a7

Please sign in to comment.