Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix data race and deadlock in file_integrity module #8027

Merged
merged 2 commits into from
Aug 22, 2018

Conversation

adriansr
Copy link
Contributor

@adriansr adriansr commented Aug 20, 2018

This PR fixes two issues with auditbeat's file_integrity module:

  • Concurrent map write in [Auditbeat] TestIncludedExcludedFiles fails with race condition #8009
    Auditbeat's file_integrity module had a data race in its recursive watcher implementation for Linux. Was caused by file watches being installed from different goroutines.
    This patch fixes the problem by moving watch installation to the same goroutine that processes inotify's events, guaranteeing that watches are installed in order, which is necessary for consistence.

  • deadlock when file integrity monitor is started
    A deadlock is possible in auditbeat's file_integrity module under Windows: When enough events arrive while watches are being installed, the event channel can fill causing the installation of a watch to block. This patch makes sure that events are received while watches are being installed, and at the same time ensures that no event is lost.

@adriansr adriansr requested review from cwurm and andrewkroh August 20, 2018 21:17
@adriansr adriansr force-pushed the fix/ab/8009 branch 2 times, most recently from 9064961 to e9a75e8 Compare August 20, 2018 21:20
@adriansr adriansr added in progress Pull request is currently in progress. and removed review labels Aug 20, 2018
@adriansr
Copy link
Contributor Author

adriansr commented Aug 20, 2018

There seems to be a problem I didn't catch in local testing done

Auditbeat's file_integrity module had a data race in its recursive
watcher implementation for Linux. Was caused by file watches being
installed from different goroutines.

This patch fixes the problem by moving watch installation to the same
goroutine that processes inotify's events, guaranteeing that watches
are installed in order, which is necessary for consistence.

Fixes elastic#8009
@adriansr adriansr added review and removed in progress Pull request is currently in progress. labels Aug 20, 2018
// Launch a separate goroutine to fetch all events that happen while
// watches are being installed.
go func() {
queueC <- r.enqueueEvents(queueDone)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the writer is done with queueC it should be closed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

defer close(done)

// Generate a lot of events in parallel to Start() so there is a chance of
// events arriving before all watched dirs are Add()-ed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea.

A deadlock is possible in auditbeat's file_integrity module under
Windows: When enough events arrive while watches are being installed,
the event channel can fill causing the installation of a watch to
block.

This patch makes sure that events are received while watches are being
installed, and at the same time ensures that no event is lost.
@cwurm
Copy link
Contributor

cwurm commented Aug 21, 2018

Regarding the deadlock fix: I wonder if it would be possible to do the watch installation in a goroutine to prevent the deadlock?

Something like this in eventreader_fsnotify.go:

func (r *reader) Start(done <-chan struct{}) (<-chan Event, error) {
  ...
  go r.consumeEvents(done)
  go r.installWatches()
  ...
  return r.eventC, nil
}

My thinking is that that way, Start would return right away, allowing init/run in metricset.go to start consuming events, preventing a block.

I might be missing something.

@adriansr
Copy link
Contributor Author

@cwurm, that's how I implemented it at first. However, it introduces an inconsistency that will trigger failures in tests, as the test may be performing an action right after Start but before a watch is installed, causing an expected event to be lost.

@cwurm
Copy link
Contributor

cwurm commented Aug 21, 2018

@adriansr Ah, shame. LGTM then.

Quick question on the concurrent map write: Would synchronizing read/write methods of FileTree have worked as well? Do you think that's something we should do at some point?

@adriansr
Copy link
Contributor Author

Yes @cwurm, that would have fixed the problem too. I felt it made more sense to confine all the accesses in the same place.

@adriansr adriansr merged commit 51267b4 into elastic:master Aug 22, 2018
@cwurm cwurm added the v6.5.0 label Aug 30, 2018
cwurm pushed a commit that referenced this pull request Sep 5, 2018
…y module (#8169)

* Install watches in single goroutine (#8009)

Auditbeat's file_integrity module had a data race in its recursive
watcher implementation for Linux. Was caused by file watches being
installed from different goroutines.

This patch fixes the problem by moving watch installation to the same
goroutine that processes inotify's events, guaranteeing that watches
are installed in order, which is necessary for consistence.

Fixes #8009

(cherry picked from commit 698cf9c)

* Fix deadlock when file integrity monitor is started

A deadlock is possible in auditbeat's file_integrity module under
Windows: When enough events arrive while watches are being installed,
the event channel can fill causing the installation of a watch to
block.

This patch makes sure that events are received while watches are being
installed, and at the same time ensures that no event is lost.

(cherry picked from commit 51267b4)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants