
Replace batcher with S3 inventory #131

Merged 10 commits from austin-remove-batcher into master on Aug 15, 2018
Conversation

@austinbyers (Collaborator) commented Aug 15, 2018

to: @ryandeivert
cc: @airbnb/binaryalert-maintainers
size: large
resolves: #18
resolves: #46
resolves: #120

Background

The batcher function for retroactive analysis is error-prone (timeouts in particular), can run for a very long time, and can be invoked multiple times, essentially DoS-ing your BinaryAlert deployment.

Changes

Lambda Functions

  • Remove the batcher Lambda function entirely
  • Build Lambda functions as a proper Python package to remove the hacky if __package__ import logic
  • Reduce the S3 connection timeout - the tail end of binary download latencies approach the 60 second default timeout, but there's no need to wait that long before retrying the connection
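For reference, the knob being turned here is botocore's connect timeout. A minimal sketch of the idea (the exact values the PR uses are in the diff; the numbers below are illustrative):

    import boto3
    from botocore.config import Config

    # Fail fast and retry, rather than waiting out the 60-second default connect timeout
    S3 = boto3.client('s3', config=Config(connect_timeout=5, retries={'max_attempts': 5}))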

Terraform

  • Enable S3 inventory on the binary S3 bucket (Terraform support for this was only recently added)
  • Remove BatchEnqueueFailures alarm, since the batcher is gone
  • Remove the throttle alarms - throttles are more common when invoking Lambda via SQS and are automatically retried
  • Set a concurrency limit for both Lambda functions (analyzer and downloader). This prevents the whole account from running out of concurrency if there are millions of objects in the queue
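The limit itself is set in Terraform (reserved_concurrent_executions on each aws_lambda_function); for illustration, the equivalent API call would be something like the following (the function name and limit here are made-up examples):

    import boto3

    boto3.client('lambda').put_function_concurrency(
        FunctionName='binaryalert_analyzer',  # Hypothetical name; the real one is prefix-dependent
        ReservedConcurrentExecutions=100,     # Caps this function so it can't starve the account pool
    )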

CLI

  • There are 3 new CLI commands:
    • purge_queue: Purge the analyzer queue, immediately stopping any retroactive analysis
    • retro_fast: Add all objects from the latest S3 inventory manifest onto the analysis queue
    • retro_slow: Enumerate the bucket manually (like the batcher did before)
  • Retroactive scans use multiple processes in parallel to send messages to SQS (see the sketch after this list)
  • The deploy command no longer starts a retroactive scan
  • The monolithic manage.py script has been separated into different components in cli/
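The retro_fast enqueue path is roughly this shape (the queue name and message format below are illustrative assumptions, not the PR's exact schema):

    import json
    from multiprocessing import Pool

    import boto3

    QUEUE_NAME = 'binaryalert_analyzer_queue'  # Hypothetical; the real name is prefix-dependent

    def _enqueue_batch(keys):
        """Send one batch of up to 10 object keys to the analysis queue."""
        # A real implementation would reuse one SQS client per worker, not one per batch
        queue = boto3.resource('sqs').get_queue_by_name(QueueName=QUEUE_NAME)
        queue.send_messages(Entries=[
            {'Id': str(i), 'MessageBody': json.dumps({'S3Objects': [key]})}
            for i, key in enumerate(keys)
        ])

    def enqueue_all(keys, processes=32):
        """Fan object keys out across worker processes, 10 keys per SQS batch."""
        batches = [keys[i:i + 10] for i in range(0, len(keys), 10)]
        with Pool(processes) as pool:
            pool.map(_enqueue_batch, batches)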

Tests

  • The individual test commands in .travis.yml have been moved to a standalone script tests/ci_tests.sh. This makes it easier for contributors to test their changes in exactly the same way that Travis will
  • Remove tests/ from coverage measurement - counting the unit tests themselves artificially inflated the coverage number with extra lines of code

Testing

$ ./manage.py --help
usage: manage.py [-h] [--version] command

positional arguments:
  command     apply          Apply any configuration/package changes with Terraform
              build          Build Lambda packages (saves *.zip files in terraform/)
              cb_copy_all    Copy all binaries from CarbonBlack Response into BinaryAlert
              clone_rules    Clone YARA rules from other open-source projects
              compile_rules  Compile all of the YARA rules into a single binary file
              configure      Update basic configuration, including region, prefix, and downloader settings
              deploy         Deploy BinaryAlert (equivalent to unit_test + build + apply)
              destroy        Teardown all of the BinaryAlert infrastructure
              live_test      Upload test files to BinaryAlert which should trigger YARA matches
              purge_queue    Purge the analysis SQS queue (e.g. to stop a retroactive scan)
              retro_fast     Enumerate the most recent S3 inventory for fast retroactive analysis
              retro_slow     Enumerate the entire S3 bucket for slow retroactive analysis
              unit_test      Run unit tests (*_test.py)

$ ./manage.py configure

$ ./manage.py deploy

$ ./manage.py live_test

$ time ./manage.py retro_fast
Reading inventory/.../EntireBucketDaily/2018-08-13T08-00Z/manifest.json
94679: requirements_top_level.txt
Done!

real	0m20.067s

$ time ./manage.py retro_slow
94682: requirements_top_level.txt
Done!

real	1m10.056s

$ ./manage.py cb_copy_all

$ ./manage.py purge_queue

Note that reading from the inventory (retro_fast) enqueues objects many times faster than enumerating them manually. It takes about 80 seconds to enumerate a million objects (with 32 processes on my laptop). This means a multi-million-object bucket will take a few minutes to enqueue for retroactive analysis, but IMO this is much better (and cheaper) than running the batcher Lambda function for several hours.
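At that rate (roughly 12,500 objects per second), a hypothetical 5-million-object bucket would enqueue in about 400 seconds, i.e. under 7 minutes.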

Reviewers

Apologies: this change is bigger than I intended - the CLI was becoming painfully difficult to manage. Most of cli/config.py and cli/manager.py (and their unit tests) are unchanged, except for the addition of inventory / queueing logic.

@coveralls commented Aug 15, 2018

Coverage Status

Coverage increased (+0.5%) to 92.189% when pulling 12692fd on austin-remove-batcher into ca049c5 on master.

@ryandeivert (Contributor) left a comment:

a handful of comments but LGTM!

.travis.yml

- mypy lambda_functions rules *.py --disallow-untyped-defs --ignore-missing-imports --warn-unused-ignores
- bandit -r . # Configuration in .bandit
- sphinx-build -W docs/source docs/build
- tests/ci_tests.sh
@ryandeivert (Contributor): good idea!

cli/__init__.py Outdated
@@ -0,0 +1,2 @@
"""BinaryAlert release version"""
VERSION = '1.1.0'
@ryandeivert (Contributor): I think the proper way to version is to use the dunder __version__ attribute. Reference: https://www.python.org/dev/peps/pep-0396/
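For example, the file above becomes:

    """BinaryAlert release version"""
    __version__ = '1.1.0'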

@austinbyers (Collaborator, Author): Ah, good call! 📞

cli/config.py Outdated
try:
self.carbon_black_url = get_input('CarbonBlack URL', self.carbon_black_url)
break
except InvalidConfigError as error:
@ryandeivert (Contributor): It looks like there are a handful of places where you catch InvalidConfigError when calling get_input... however, get_input would never raise this exception 🤔. Am I missing something?

@austinbyers (Collaborator, Author): The property assignment self.carbon_black_url = ... goes through a setter method with validation logic
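A minimal sketch of the pattern (carbon_black_url and InvalidConfigError come from the diff; the class name and the exact validation are assumptions):

    import re

    class InvalidConfigError(Exception):
        """Raised when a config value fails validation."""

    class BinaryAlertConfig:
        def __init__(self):
            self._config = {}  # Backing store; the real class loads terraform.tfvars

        @property
        def carbon_black_url(self) -> str:
            return self._config['carbon_black_url']

        @carbon_black_url.setter
        def carbon_black_url(self, value: str) -> None:
            # Validation lives in the setter, so a plain assignment can raise
            if not re.fullmatch(r'https?://\S+', value):
                raise InvalidConfigError('carbon_black_url must be a valid URL')
            self._config['carbon_black_url'] = value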

@ryandeivert (Contributor): right, yeah, sorry - I realized that after I left this comment. Feel free to ignore :)

cli/config.py Outdated
except InvalidConfigError as error:
print('ERROR: {}'.format(error))

while True: # Get name prefix.
@ryandeivert (Contributor): couldn't the get_input function also take optional values that are acceptable (and a description of what should be entered) and do the looping? Just wondering, because it appears you repeat this same logic a bunch when calling this method

@austinbyers (Collaborator, Author): Good idea! Yeah, the looping logic always annoyed me. It's a little weird, because there are different use cases and validation paths, but I managed to get the loop into get_input like you suggested
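Something like the following (the parameter names are assumptions about the final API, not a quote from the diff):

    from typing import Optional, Set

    def get_input(prompt: str, default: str, options: Optional[Set[str]] = None) -> str:
        """Prompt the user until an acceptable value is entered, looping on bad input."""
        while True:
            value = input('{} ({}): '.format(prompt, default)).strip().lower() or default
            if options is None or value in options:
                return value
            print('ERROR: Please enter one of: {}'.format(', '.join(sorted(options))))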

InvalidConfigError: If any config variable has an invalid value.
"""
# Go through the internal setters which have the validation logic.
self.aws_account_id = self.aws_account_id
@ryandeivert (Contributor): I was super confused by this, but cool idea!

@austinbyers (Collaborator, Author): If it's confusing, we could actually remove this validation check; it isn't really necessary

cli/config.py Outdated
break
else:
print('ERROR: Please enter exactly "yes" or "no"')
self.enable_carbon_black_downloader = 1 if enable_downloader == 'yes' else 0
@ryandeivert (Contributor): I assume there's a reason for the int here instead of a boolean value? HCL limitation?

@austinbyers (Collaborator, Author): true and false aren't real HCL entities, and my IDE yells at me if I try to use them in terraform.tfvars:

[screenshot: IDE warning in terraform.tfvars, Aug 15, 2018]


class ManagerError(Exception):
"""Top-level exception for Manager errors."""
pass
@ryandeivert (Contributor): pass is unnecessary when a docstring is used in an exception (pro tip: omitting it may improve your test coverage 😉)
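That is, the docstring alone is a valid class body:

    class ManagerError(Exception):
        """Top-level exception for Manager errors."""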

cli/manager.py Outdated
inv_prefix = 'inventory/{}/EntireBucketDaily'.format(bucket.name)

# Check for each day, starting today, up to 8 days ago
for days_ago in range(0, 9):
@ryandeivert (Contributor): you don't need the 0 here, I don't think... also, is the days range something that could be configurable? I'm thinking of the case where you have a bucket that hasn't had new objects added in a while (say 2 weeks)... it could still have a valid manifest for the objects that exist, outside of this 8-day range.

@austinbyers (Collaborator, Author): Good point - range(9) is exactly the same thing, change made.

S3 inventory runs every single day, regardless of whether new objects were added (tested and confirmed). In fact, I'm going to reduce this to 3 days (there should be at most 48 hours between successive inventory reports)

@ryandeivert (Contributor): sounds good! thanks for clarifying :)
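For reference, a sketch of that lookup with the reduced 3-day window (boto3 resource API; the helper name and date formatting are assumptions):

    from datetime import datetime, timedelta, timezone

    def _most_recent_manifest(bucket):
        """Find the newest inventory manifest.json, checking today back through 2 days ago."""
        inv_prefix = 'inventory/{}/EntireBucketDaily'.format(bucket.name)
        for days_ago in range(3):
            day = (datetime.now(timezone.utc) - timedelta(days=days_ago)).strftime('%Y-%m-%d')
            keys = [obj.key
                    for obj in bucket.objects.filter(Prefix='{}/{}'.format(inv_prefix, day))
                    if obj.key.endswith('manifest.json')]
            if keys:
                return max(keys)  # Latest delivery that day wins
        return None  # No manifest found in the last 3 days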


DISPATCH_SOURCE = os.path.join(LAMBDA_DIR, 'dispatcher', 'main.py')
DISPATCH_ZIPFILE = 'lambda_dispatcher'
Libraries are installed in the package root and source code is installed to mirror the repo
@ryandeivert (Contributor): ++

@austinbyers merged commit 64807af into master on Aug 15, 2018
@austinbyers deleted the austin-remove-batcher branch on August 15, 2018 at 22:18