HADOOP-17511. Add audit/telemetry logging to S3A connector #2807
Conversation
It's not merging and I've over-squashed things into the AWS metrics patch. Will need to unroll it.
Thanks for the walkthrough @steveloughran, and also for working on this.
I started to review the code, but I have some issues with my aws account now which will be resolved next week so I can't run the integration tests until then.
I found an interesting error, btw, which I haven't seen before when running the tests with mvn clean verify -Dparallel-tests -DtestsThreadCount=8
(NullMetadataStore)
[ERROR] Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 3.307 s <<< FAILURE! - in org.apache.hadoop.fs.s3a.TestListing
[ERROR] testProvidedFileStatusIteratorEnd(org.apache.hadoop.fs.s3a.TestListing) Time elapsed: 2.922 s <<< ERROR!
java.lang.NullPointerException
at org.apache.hadoop.fs.s3a.audit.impl.ActiveAuditManager.createRequestHandlers(ActiveAuditManager.java:223)
at org.apache.hadoop.fs.s3a.S3AFileSystem.bindAWSClient(S3AFileSystem.java:785)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:478)
at org.apache.hadoop.fs.s3a.AbstractS3AMockTest.setup(AbstractS3AMockTest.java:61)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:239)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
I have a lot of failures like this, and since it's not even integration tests I'm worried that maybe this is something new added by this PR.
Please check if the failures are related, these are not happening on trunk.
I'm going to say the failures are related, as it's in the auditor code. Interesting that you saw it and not me. Will look at it next week.
I'm going to do a squash of the PR and push up, as yetus has completely given up trying to build this.
💔 -1 overall
This message was automatically generated.
@steveloughran are any more changes coming in? I'm happy with the state of this CR, 👍 to get it merged.
Thanks for the review sessions and for working on this Steve, it grew a little big over time :)
We should merge it before more merge conflicts appear.
LGTM +1
+1, some minor comments, looks really good.
 * @param expected expected value
 * @return the expected value.
 */
protected long verifyCounter(final Statistic statistic,
can make this private.
leaving protected in case a subclass wants to use it
...op-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/TestAuditSpanLifecycle.java
@Test
public void testSpanActivation() throws Throwable {
  // real activation switches spans in the current thead.
typo: "thread"
span2.activate();
assertActiveSpan(span2);
span2.close();
should we close span1 here? Maybe some assertion regarding span1's lifecycle after span2 was closed?
ok, added span1 close & assert still in reset span. Did something similarish in the test case underneath (accidentally; I'd navigated to the wrong line)
}
/** | ||
* Creaate the config from {@link AuditTestSupport#loggingAuditConfig()} |
typo: "Create"
@Test
public void testFileAccessAllowed() throws Throwable {
  describe("Enable checkaccess and verify it works with expected"
      + " statitics");
typo: "statistics"
whenRaw(FILE_STATUS_FILE_PROBE));
}
private String access(final S3AFileSystem fs, final Path path)
Maybe move this to the end of the test, after all @Test.
done; added javadocs too.
import static org.apache.hadoop.test.LambdaTestUtils.intercept;
/** | ||
* Test S3A FS Access permit/deny is passed through all the way to the |
JavaDocs doesn't seem right for this test.
fs.getBucketLocation();
// which MUST have ended up calling the extension request handler
Assertions.assertThat(SimpleAWSRequestHandler.getInvocationCount())
    .describedAs("Invocatin count of plugged in request handler")
typo: "invocation"
}

/**
 * Overrride point: create the callbacks for S3AInputStream.
typo: "Override"
Thanks for the reviews, comments, votes etc.
tests aws london
💔 -1 overall
This message was automatically generated.
legit test regression. The code to determine the principal is returning null
Somehow the header test had failed on the principal. Changes
This wasn't just a yetus failure; I replicated it locally. How did my tests pass? They didn't, but I hadn't noticed because the failsafe test run was happening anyway... and the failure of the unit tests was happening in a scrolled-off test run I wasn't looking at. That's not good: I've always expected maven to fail as soon as unit tests do. Will investigate separately.
💔 -1 overall
This message was automatically generated.
Notion of AuditSpan which is created for a given operation; goal is to pass it along everywhere. It's thread local per FS-instance; store operations pick this up in their constructor from the StoreContext.

The entryPoint() method in S3A FS has been enhanced to initiate the spans. For this to work, internal code SHALL NOT call those entry points (Done) and all public API points MUST be declared as entry points. This is done, with a marker attribute @AuditEntryPoint to indicate this.

The audit span create/deactivate sequence is ~the same as the duration tracking, so the operation is generally merged: most of the metrics S3AFS collects are now durations. Part of the isolation into spans means that there's explicit operations for mkdirs() and getContentSummary().

The auditing is intended to be a plugin point; currently there is the LoggingAuditor, which:
- logs at debug
- adds an HTTP "referer" header with audit tracing
- can be set to raise an exception if the SDK is handed an AWS Request and there's no active span (skipped for the multipart upload part and complete calls, as TransferManager in the SDK does that out of span)

and the NoopAuditor, which:
- does nothing

Change-Id: If11a2c48b00db530fb6bc1ad363e24b202acb827

HADOOP-17511 Auditing: getContentSummary and dynamic evaluation (wip)
* added getContentSummary as a single span, with some minor speedups
* still trying to do best design to wire up dynamically evaluated attributes

Currently logs should include the current thread ID; we don't yet pick up and include the thread where the span was created, which is equally important.

Change-Id: Ieea88e4228da0ac4761d8c006051cd1095c5fce8

HADOOP-17511. Audit Spans to have unique IDs.
+ an interface ActiveThreadSpanSource to give current thread span. This is to allow integrations with the AWS SDK &c to query the active span for an FS and immediately be able to identify a span by ID, for logging etc.

Adding a unique ID to all audit spans and supporting active thread span (with activate/deactivate) causes major changes in the no-op code, as suddenly there's a lot more state there across manager, auditor and span. Will be testable though.

Change-Id: Id4dddcab7b735bd01f3d3b8a8236ff6da8f97671

HADOOP-17511. Audit review
* SpanIDs must now be unique over time (helps in log analysis)
* All AWS SDK events go to AuditSpan
* FS.access() check also goes to Auditor. This is used by Hive.

Change-Id: Id1ffffd928f2e274f1bac73109d16e6624ba0e9d

HADOOP-17511. Audit - timestamp, headers and handlers
- timestamp of span creation picked up (epoch millis in UTC) and passed in to referrer
- docs on referrer fields
- section on privacy implications
- referrer header can be disabled (privacy...)
- custom request handlers will be created (TODO: tests)

Change-Id: I6e94b43a209eee53748ac14270f318352d512fb8

HADOOP-17511: Unit test work
There's lots of implicit tests in the functional suites, but this adds tests for
* binding of both bundled auditors
* adding extra request handlers
* wiring up of context to span
* and to TransferManager lifecycle events
* common context static and dynamic eval
* WiP: parsing back of the http referrer header.

This gives reasonable coverage of what's going on, though another day's work would round it out.

Change-Id: I6b2d0f1dff223875268c18ded481d9e9fea2f250

HADOOP-17511. Unit and integration tests
* Tests of audit execution and stats update.
* Integration test suite for S3A.access()
* Tuning of implementation classes for ease of testing.
* Exporting auditor from S3AFS.
* More stats collected
* Move AuditCheckingException under CredentialInitializationException so that s3a translateException doesn't need changing.
* audit p1 & p2 paths moving to be key only
* auditing docs includes real s3 logs and breakdown of referrer (TODO update)

The main remaining bits of testing would be to take existing code and verify that the headers got through, especially some of the commit logic and job ID. Distcp might, if the jobID is in the config.

Change-Id: I5723db55ba189f6c400cf29a90aa5605b0d98ad0

HADOOP-17511. Improving auditor binding inside AWS SDK
Audit opname in Span callbacks; always set active span from request. This allows the active operation to always be determined, including from any other plugins (cred provider, signer...) used in the same execution. This is also why Auditor is now in StoreContext. Tests for RequestFactory.

Change-Id: I9528253cf21253e14714b838d3a8ae85d52ba8b7

HADOOP-17511. checkstyle and more testing

Change-Id: If12f8204237eb0d79f2edcff03fc45f31b7d196a

HADOOP-17511. Auditing: move AuditSpan and common context to hadoop-common
Small move of code; changes in imports more traumatic.

Change-Id: Ide158d884bd7a873e07f0ddaff8334882eb28595

HADOOP-17511. Auditing
* avoiding accidentally deactivating spans
* caching of and switching to active span in rename/delete callbacks
* consistent access/use of AuditorId for FS ID, which is printed in S3AFileSystem.toString()
* S3Guard Tools doing more for audit; also printing IOStats on -verbose
* Store Context taking/returning an AuditSpanSource, not the full OperationAuditor interface

Change-Id: Ifc5f807a2d3a329b8a1184dd1fcba63205c1f174

HADOOP-17511. Auditing - marker tool.
Marker tool always uses a span; this is created in the FS, rather than the marker.

Change-Id: I03e31dd58c76a41e8a1b73e958b130ed405a29fe

HADOOP-17511. Auditing - explicit integration tests.
Tests to deliberately create problems and so verify that things fail as expected.

Change-Id: If2e863cee54aa303c24a3d02174e466c272f24b2

HADOOP-17511. Auditing code/docs cleanup
* Move audit classes in hadoop-common into their own package
* move ActiveThreadSpanSource interface there and implement in S3AFS; that's not for public consumption, but it may be good to have there so that abfs can implement the same API

Change-Id: I1b7d924555a1294f7acb3f47dc613adc32ffb003

HADOOP-17511. Auditing: S3 Log parser Pattern with real tests
Show everything works with a full parse of the output captured from a log entry of a real test run. This is the most complicated regexp I have ever written.

Change-Id: I090b2dcefad9938bea3b95ef717a7cb2e9eea10c

HADOOP-17511 add filtering of header fields; with docs

Change-Id: I0da9487a708b5a8fd700ffddd0290d6c0621f3e2

HADOOP-17511. Audit: add o.a.h.fs.audit package with the public classes.
Fix up yetus complaints, where valid.

Change-Id: I98e4f7a9c277c993555db6d62a20f2a00515c5e8

HADOOP-17511 review
* Moved HttpReferrerAuditHeader class
* Added version string to URL
* Explained some design decisions in the architecture doc
* Added mukund's comments

Change-Id: I356e5428c51f74b25584bfb1674296ac193c81d5

Change-Id: Ic6ce36df6a3f6e12ed4ee8b6829460e8e875121b
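The "S3 Log parser Pattern" mentioned in the commit log is a regular expression over S3 server access log entries. As a deliberately tiny sketch of the approach — matching only a few leading fields against a made-up log line, nothing like the full pattern in the patch — the idea looks like:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleS3LogParser {
  // A tiny subset of the S3 server access log layout: owner, bucket,
  // and the bracketed timestamp. The real pattern in the patch covers
  // every field, including the quoted request and referrer header.
  static final Pattern LOG = Pattern.compile(
      "^(?<owner>\\S+) (?<bucket>\\S+) \\[(?<timestamp>[^\\]]+)\\] .*$");

  public static void main(String[] args) {
    // Fabricated sample line, abbreviated after the timestamp.
    String line = "79a5 mybucket [28/Mar/2021:10:12:00 +0000] 192.0.2.1 ...";
    Matcher m = LOG.matcher(line);
    if (m.matches()) {
      System.out.println(m.group("bucket") + " @ " + m.group("timestamp"));
    }
  }
}
```

Named capture groups keep the full pattern maintainable; with dozens of fields, positional groups would be unreadable.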
Rebased to trunk again after the AWS region patch from mehakmeet.
Change-Id: I10c39db39f62af5606974ff38d497b36dbfb6823
Not sure what is up with yetus there. Submitted again, with some updated docs.
💔 -1 overall
This message was automatically generated.
Yetus reports are a bit confused, but the output is good
Backporting to branch-3.3 if the tests run successfully. Merge has gone in and the first test run is happy.
The S3A connector supports "an auditor", a plugin which is invoked at the start of every filesystem API call, and whose issued "audit span" provides a context for all REST operations against the S3 object store. The standard auditor sets the HTTP Referrer header on the requests with information about the API call, such as process ID, operation name, path, and even job ID. If the S3 bucket is configured to log requests, this information will be preserved there and so can be used to analyze and troubleshoot storage IO. Contributed by Steve Loughran. Change-Id: Ic0a105c194342ed2d529833ecc42608e8ba2f258
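The standard auditor's referrer header encodes the audit context as query parameters on a synthetic URL, which is what survives into the S3 server logs. As a rough illustration — the field names `op` and `id` below are stand-ins, not necessarily the connector's exact keys — pulling those fields back out of a logged referrer might look like:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ReferrerAuditParser {
  // Split the query component of an audit referrer URL into key/value
  // pairs. Key names in the sample below are illustrative; the actual
  // fields are documented in the S3A auditing docs added by this patch.
  static Map<String, String> parseAuditFields(String referrer) {
    Map<String, String> fields = new LinkedHashMap<>();
    int q = referrer.indexOf('?');
    if (q < 0 || q == referrer.length() - 1) {
      return fields;                       // no query component present
    }
    for (String pair : referrer.substring(q + 1).split("&")) {
      int eq = pair.indexOf('=');
      if (eq > 0) {
        fields.put(pair.substring(0, eq), pair.substring(eq + 1));
      }
    }
    return fields;
  }

  public static void main(String[] args) {
    // Fabricated referrer; real ones carry more fields (principal,
    // thread, timestamp, job ID, ...).
    String ref = "https://audit.example.org/op_open?op=op_open&id=span-0001";
    Map<String, String> f = parseAuditFields(ref);
    System.out.println(f.get("op") + " " + f.get("id"));
  }
}
```

This is how the header becomes useful for log analysis: each request in the bucket's access log can be attributed back to an operation, path, and span.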
apache#2807)

The S3A connector supports "an auditor", a plugin which is invoked at the start of every filesystem API call, and whose issued "audit span" provides a context for all REST operations against the S3 object store. The standard auditor sets the HTTP Referrer header on the requests with information about the API call, such as process ID, operation name, path, and even job ID. If the S3 bucket is configured to log requests, this information will be preserved there and so can be used to analyze and troubleshoot storage IO. Contributed by Steve Loughran.

MUST be followed by:
CDPD-28457. HADOOP-17822. fs.s3a.acl.default not working after S3A Audit feature (apache#3249)
CDPD-24982. HADOOP-17801. No error message reported when bucket doesn't exist in S3AFS

Conflicts:
  hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Listing.java
  hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
  hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
  hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AUtils.java
  hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/WriteOperationHelper.java
  hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/AbstractS3ACommitter.java
  hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/AbstractStoreOperation.java
  hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/DeleteOperation.java
  hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/RenameOperation.java
  hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/StoreContext.java
  hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/s3guard/DynamoDBMetadataStore.java
  hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/impl/TestPartialDeleteFailures.java

Mostly related to shaded guava. This patch really needs CDPD-10473. HADOOP-16645. S3A Delegation Token extension point to use StoreContext; had to CP a file in, and even then the auditing may not be complete there. Will revisit, even though Knox and Ranger will both need a matching change.

Change-Id: Ic0a105c194342ed2d529833ecc42608e8ba2f258
(Squashed and rebased from #2675)
Notion of AuditSpan which is created for a given operation; goal is
to pass it along everywhere.
It's thread local per FS-instance; store operations pick this up in their
constructor from the StoreContext.
The entryPoint() method in S3A FS has been enhanced to initiate the spans.
For this to work, internal code SHALL NOT call those entry points (Done)
and all public API points MUST be declared as entry points.
This is done, with a marker attribute @AuditEntryPoint to indicate this.
The audit span create/deactivate sequence is ~the same as the duration
tracking, so the operation is generally merged: most of the metrics
S3AFS collects are now durations.
Part of the isolation into spans means that there's explicit operations
for mkdirs() and getContentSummary()
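The create/activate/deactivate lifecycle described above pairs naturally with try-with-resources. Here is a minimal stand-in for that lifecycle — the interface and the per-thread stack are illustrative only, not the real Hadoop S3A classes:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SpanLifecycleSketch {
  // Stand-in span: created for an operation, active for its duration,
  // deactivated on close(). Not the real S3A AuditSpan API.
  interface AuditSpan extends AutoCloseable {
    String id();
    @Override void close();            // deactivate; never throws
  }

  // One active-span stack per thread, mimicking "thread local per
  // FS-instance" from the description above.
  static final ThreadLocal<Deque<AuditSpan>> ACTIVE =
      ThreadLocal.withInitial(ArrayDeque::new);

  static AuditSpan createSpan(String operation, long seq) {
    final String id = operation + "-" + seq;   // unique within this sketch
    AuditSpan span = new AuditSpan() {
      @Override public String id() { return id; }
      @Override public void close() { ACTIVE.get().remove(this); }
    };
    ACTIVE.get().push(span);
    return span;
  }

  static String activeSpanId() {
    Deque<AuditSpan> stack = ACTIVE.get();
    return stack.isEmpty() ? "unaudited" : stack.peek().id();
  }

  public static void main(String[] args) {
    try (AuditSpan span = createSpan("op_mkdirs", 1)) {
      // every store call issued here would be attributed to this span
      System.out.println(activeSpanId());      // the mkdirs span
    }
    System.out.println(activeSpanId());        // back to the unaudited state
  }
}
```

With this shape, an entry point that forgets to close its span is visible immediately in tests, which is one reason the span and duration-tracking sequences merge so cleanly.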
The auditing is intended to be a plugin point; currently there is
the LoggingAuditor, which:
* logs at debug
* adds an HTTP "referer" header with audit tracing
* can be set to raise an exception if the SDK is handed an AWS Request
  and there's no active span (skipped for the multipart upload part and
  complete calls, as TransferManager in the SDK does that out of span)

and the NoopAuditor, which:
* does nothing
A recent change is that we want every span to have a spanID (string, unique across all spans of that FS instance); even the no-op span has unique IDs.
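One straightforward way to meet that uniqueness requirement — purely an illustrative sketch, since the real ID format is not specified here — is a per-FS-instance UUID combined with a monotonic counter:

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

public class SpanIdSketch {
  // One random ID per FS instance, plus a monotonically increasing
  // counter, yields span IDs unique across all spans of that instance
  // and over time. The "uuid-counter" format is an assumption.
  private final String auditorId = UUID.randomUUID().toString();
  private final AtomicLong counter = new AtomicLong();

  public String newSpanId() {
    return auditorId + "-" + String.format("%08d", counter.incrementAndGet());
  }

  public static void main(String[] args) {
    SpanIdSketch ids = new SpanIdSketch();
    System.out.println(ids.newSpanId());
    System.out.println(ids.newSpanId());  // shares the prefix, differs in suffix
  }
}
```

The shared prefix also gives log analysis a free grouping key: every span from one filesystem instance sorts together.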