
Replace batcher with S3 inventory #131

Merged 10 commits from austin-remove-batcher into master on Aug 15, 2018
Conversation

@austinbyers (Collaborator) commented Aug 15, 2018

to: @ryandeivert
cc: @airbnb/binaryalert-maintainers
size: large
resolves: #18
resolves: #46
resolves: #120

Background

The batcher function for retroactive analysis is error-prone (timeouts in particular), can run for a very long time, and can be invoked multiple times, essentially DoS-ing your BinaryAlert deployment.

Changes

Lambda Functions

  • Remove the batcher Lambda function entirely
  • Build Lambda functions as a proper Python package to remove the hacky if __package__ import logic
  • Reduce the S3 connection timeout - the tail end of binary download latencies approach the 60 second default timeout, but there's no need to wait that long before retrying the connection
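For reference, the knob being turned here is botocore's connect timeout. A minimal sketch of the idea (the exact values the PR uses are in the diff; the numbers below are illustrative):

    import boto3
    from botocore.config import Config

    # Fail fast and retry, rather than waiting out the 60-second default connect timeout
    S3 = boto3.client('s3', config=Config(connect_timeout=5, retries={'max_attempts': 5}))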

Terraform

  • Enable S3 inventory on the binary S3 bucket (Terraform support for this was only recently added)
  • Remove BatchEnqueueFailures alarm, since the batcher is gone
  • Remove the throttle alarms - throttles are more common when invoking Lambda via SQS and are automatically retried
  • Set a concurrency limit for both Lambda functions (analyzer and downloader). This prevents the whole account from running out of concurrency if there are millions of objects in the queue
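The limit itself is set in Terraform (reserved_concurrent_executions on each aws_lambda_function); for illustration, the equivalent API call would be something like the following (the function name and limit here are made-up examples):

    import boto3

    boto3.client('lambda').put_function_concurrency(
        FunctionName='binaryalert_analyzer',  # Hypothetical name; the real one is prefix-dependent
        ReservedConcurrentExecutions=100,     # Caps this function so it can't starve the account pool
    )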

CLI

  • There are 3 new CLI commands:
    • purge_queue: Purge the analyzer queue, immediately stopping any retroactive analysis
    • retro_fast: Add all objects from the latest S3 inventory manifest onto the analysis queue
    • retro_slow: Enumerate the bucket manually (like the batcher did before)
  • Retroactive scans use multiple processes in parallel to send messages to SQS (see the sketch after this list)
  • The deploy command no longer starts a retroactive scan
  • The monolithic manage.py script has been separated into different components in cli/
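The retro_fast enqueue path is roughly this shape (the queue name and message format below are illustrative assumptions, not the PR's exact schema):

    import json
    from multiprocessing import Pool

    import boto3

    QUEUE_NAME = 'binaryalert_analyzer_queue'  # Hypothetical; the real name is prefix-dependent

    def _enqueue_batch(keys):
        """Send one batch of up to 10 object keys to the analysis queue."""
        # A real implementation would reuse one SQS client per worker, not one per batch
        queue = boto3.resource('sqs').get_queue_by_name(QueueName=QUEUE_NAME)
        queue.send_messages(Entries=[
            {'Id': str(i), 'MessageBody': json.dumps({'S3Objects': [key]})}
            for i, key in enumerate(keys)
        ])

    def enqueue_all(keys, processes=32):
        """Fan object keys out across worker processes, 10 keys per SQS batch."""
        batches = [keys[i:i + 10] for i in range(0, len(keys), 10)]
        with Pool(processes) as pool:
            pool.map(_enqueue_batch, batches)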

Tests

  • The individual test commands in .travis.yml have been moved to a standalone script tests/ci_tests.sh. This makes it easier for contributors to test their changes in exactly the same way that Travis will
  • Remove tests/ from coverage measurement - counting the unit tests themselves artificially inflated the coverage number with extra lines of code

Testing

$ ./manage.py --help
usage: manage.py [-h] [--version] command

positional arguments:
  command     apply          Apply any configuration/package changes with Terraform
              build          Build Lambda packages (saves *.zip files in terraform/)
              cb_copy_all    Copy all binaries from CarbonBlack Response into BinaryAlert
              clone_rules    Clone YARA rules from other open-source projects
              compile_rules  Compile all of the YARA rules into a single binary file
              configure      Update basic configuration, including region, prefix, and downloader settings
              deploy         Deploy BinaryAlert (equivalent to unit_test + build + apply)
              destroy        Teardown all of the BinaryAlert infrastructure
              live_test      Upload test files to BinaryAlert which should trigger YARA matches
              purge_queue    Purge the analysis SQS queue (e.g. to stop a retroactive scan)
              retro_fast     Enumerate the most recent S3 inventory for fast retroactive analysis
              retro_slow     Enumerate the entire S3 bucket for slow retroactive analysis
              unit_test      Run unit tests (*_test.py)

$ ./manage.py configure

$ ./manage.py deploy

$ ./manage.py live_test

$ time ./manage.py retro_fast
Reading inventory/.../EntireBucketDaily/2018-08-13T08-00Z/manifest.json
94679: requirements_top_level.txt
Done!

real	0m20.067s

$ time ./manage.py retro_slow
94682: requirements_top_level.txt
Done!

real	1m10.056s

$ ./manage.py cb_copy_all

$ ./manage.py purge_queue

Note that reading from the inventory (retro_fast) enqueues objects many times faster than enumerating them manually. It takes about 80 seconds to enumerate a million objects (with 32 processes on my laptop). This means a multi-million-object bucket will take a few minutes to enqueue for retroactive analysis, but IMO this is much better (and cheaper) than running the batcher Lambda function for several hours.
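At that rate (roughly 12,500 objects per second), a hypothetical 5-million-object bucket would enqueue in about 400 seconds, i.e. under 7 minutes.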

Reviewers

Apologies: this change is bigger than I intended - the CLI was becoming painfully difficult to manage. Most of cli/config.py and cli/manager.py (and their unit tests) are unchanged, except for the addition of inventory / queueing logic.

@coveralls commented Aug 15, 2018

Coverage Status

Coverage increased (+0.5%) to 92.189% when pulling 12692fd on austin-remove-batcher into ca049c5 on master.

@ryandeivert (Contributor) left a comment:

a handful of comments but LGTM!

.travis.yml

- mypy lambda_functions rules *.py --disallow-untyped-defs --ignore-missing-imports --warn-unused-ignores
- bandit -r . # Configuration in .bandit
- sphinx-build -W docs/source docs/build
- tests/ci_tests.sh
@ryandeivert (Contributor): good idea!

cli/__init__.py Outdated
@@ -0,0 +1,2 @@
"""BinaryAlert release version"""
VERSION = '1.1.0'
@ryandeivert (Contributor): I think the proper way to version is to use the dunder __version__ attribute. Reference: https://www.python.org/dev/peps/pep-0396/
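For example, the file above becomes:

    """BinaryAlert release version"""
    __version__ = '1.1.0'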

@austinbyers (Collaborator, Author): Ah, good call! 📞

cli/config.py Outdated
try:
self.carbon_black_url = get_input('CarbonBlack URL', self.carbon_black_url)
break
except InvalidConfigError as error:
@ryandeivert (Contributor): It looks like there are a handful of places where you catch InvalidConfigError when calling get_input... however, get_input would never raise this exception 🤔. Am I missing something?

@austinbyers (Collaborator, Author): The property assignment self.carbon_black_url = ... goes through a setter method with validation logic
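A minimal sketch of the pattern (carbon_black_url and InvalidConfigError come from the diff; the class name and the exact validation are assumptions):

    import re

    class InvalidConfigError(Exception):
        """Raised when a config value fails validation."""

    class BinaryAlertConfig:
        def __init__(self):
            self._config = {}  # Backing store; the real class loads terraform.tfvars

        @property
        def carbon_black_url(self) -> str:
            return self._config['carbon_black_url']

        @carbon_black_url.setter
        def carbon_black_url(self, value: str) -> None:
            # Validation lives in the setter, so a plain assignment can raise
            if not re.fullmatch(r'https?://\S+', value):
                raise InvalidConfigError('carbon_black_url must be a valid URL')
            self._config['carbon_black_url'] = value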

@ryandeivert (Contributor): right, yeah, sorry - I realized that after I left this comment. Feel free to ignore :)

cli/config.py Outdated
except InvalidConfigError as error:
print('ERROR: {}'.format(error))

while True: # Get name prefix.
@ryandeivert (Contributor): couldn't the get_input function also take optional values that are acceptable (and a description of what should be entered) and do the looping? Just wondering, because it appears you repeat this same logic a bunch when calling this method

@austinbyers (Collaborator, Author): Good idea! Yeah, the looping logic always annoyed me. It's a little weird, because there are different use cases and validation paths, but I managed to get the loop into get_input like you suggested
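Something like the following (the parameter names are assumptions about the final API, not a quote from the diff):

    from typing import Optional, Set

    def get_input(prompt: str, default: str, options: Optional[Set[str]] = None) -> str:
        """Prompt the user until an acceptable value is entered, looping on bad input."""
        while True:
            value = input('{} ({}): '.format(prompt, default)).strip().lower() or default
            if options is None or value in options:
                return value
            print('ERROR: Please enter one of: {}'.format(', '.join(sorted(options))))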

InvalidConfigError: If any config variable has an invalid value.
"""
# Go through the internal setters which have the validation logic.
self.aws_account_id = self.aws_account_id
@ryandeivert (Contributor): I was super confused by this, but cool idea!

@austinbyers (Collaborator, Author): If it's confusing, we could actually remove this validation check; it isn't really necessary

cli/config.py Outdated
break
else:
print('ERROR: Please enter exactly "yes" or "no"')
self.enable_carbon_black_downloader = 1 if enable_downloader == 'yes' else 0
@ryandeivert (Contributor): I assume there's a reason for the int here instead of a boolean value? HCL limitation?

@austinbyers (Collaborator, Author): true and false aren't real HCL entities, and my IDE yells at me if I try to use them in terraform.tfvars:

[screenshot: IDE warning in terraform.tfvars, Aug 15, 2018]


class ManagerError(Exception):
"""Top-level exception for Manager errors."""
pass
@ryandeivert (Contributor): pass is unnecessary when a docstring is used in an exception (pro tip: omitting it may improve your test coverage 😉)
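That is, the docstring alone is a valid class body:

    class ManagerError(Exception):
        """Top-level exception for Manager errors."""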

cli/manager.py Outdated
inv_prefix = 'inventory/{}/EntireBucketDaily'.format(bucket.name)

# Check for each day, starting today, up to 8 days ago
for days_ago in range(0, 9):
@ryandeivert (Contributor): you don't need the 0 here, I don't think... also, is the days range something that could be configurable? I'm thinking of the case where you have a bucket that hasn't had new objects added in a while (say 2 weeks)... it could still have a valid manifest for the objects that exist, outside of this 8-day range.

@austinbyers (Collaborator, Author): Good point - range(9) is exactly the same thing, change made.

S3 inventory runs every single day, regardless of whether new objects were added (tested and confirmed). In fact, I'm going to reduce this to 3 days (there should be at most 48 hours between successive inventory reports)

@ryandeivert (Contributor): sounds good! thanks for clarifying :)
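For reference, a sketch of that lookup with the reduced 3-day window (boto3 resource API; the helper name and date formatting are assumptions):

    from datetime import datetime, timedelta, timezone

    def _most_recent_manifest(bucket):
        """Find the newest inventory manifest.json, checking today back through 2 days ago."""
        inv_prefix = 'inventory/{}/EntireBucketDaily'.format(bucket.name)
        for days_ago in range(3):
            day = (datetime.now(timezone.utc) - timedelta(days=days_ago)).strftime('%Y-%m-%d')
            keys = [obj.key
                    for obj in bucket.objects.filter(Prefix='{}/{}'.format(inv_prefix, day))
                    if obj.key.endswith('manifest.json')]
            if keys:
                return max(keys)  # Latest delivery that day wins
        return None  # No manifest found in the last 3 days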


DISPATCH_SOURCE = os.path.join(LAMBDA_DIR, 'dispatcher', 'main.py')
DISPATCH_ZIPFILE = 'lambda_dispatcher'
Libraries are installed in the package root and source code is installed to mirror the repo
@ryandeivert (Contributor): ++

@austinbyers merged commit 64807af into master on Aug 15, 2018
@austinbyers deleted the austin-remove-batcher branch on August 15, 2018 at 22:18