HADOOP-17511. Add audit/telemetry logging to S3A connector #2675
Conversation
Testing? Not yet. It compiles, and I've done a few of the ITests.
Force-pushed from 00c3ce7 to 445754f.
Latest PR has mkdir under files working, by the look of things.
Latest revision adds operation context info to the referrer logs, so we can generate a trace mapping S3 operations to the FS ops and the S3A FS instance making the call. No tie-in to OpenTelemetry though; future work.
HADOOP-15711. Auditing: scope/coverage. Running all the tests with auditing set to fail if any call is made unaudited; covers both production and test changes.
Still a WiP. Needs some thought on how to push the telemetry prepareRequest() into the AWS SDK logic so that all SDK calls get span checking/header addition; production code would actually be slightly leaner. Needs me to understand the SDK a bit better, then have the audit manager use an appropriate plugin point in the SDK.
Example log from a bit of a test run:
Thanks for adding the audit support; I think it's really useful.
I added some comments with my review.
Review comments on:
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/AuditConstants.java
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/LoggingAuditService.java
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/audit/NoopSpan.java
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java

In S3AFileSystem.java:
Path path = qualify(f);
// The span is picked up by the returned iterator; unless that
// iterator is closed, the span won't be closed here either.
entryPoint(INVOCATION_LIST_LOCATED_STATUS, path);
You are not using try-with-resources here. Maybe it's justified; I'm just pointing it out in case it's needed.
I tried to explain it above: the iterator picks the span up. We do the same for all ops which return something that outlives the API call (listing, input stream, output stream...). The audit span goes with the returned object and so is left open. Listings are a trouble spot, as they are unlikely ever to be closed.
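To illustrate the pattern being described, a minimal sketch; `AuditSpan` here is a placeholder with `activate()`/`close()`, not the PR's exact interface:

```java
import java.io.Closeable;
import java.io.IOException;
import org.apache.hadoop.fs.RemoteIterator;

/** Sketch: the span created at the FS entry point travels with the iterator. */
class SpanWrappedIterator<T> implements RemoteIterator<T>, Closeable {

  /** Placeholder for the PR's audit span type. */
  interface AuditSpan extends Closeable {
    void activate();
  }

  private final AuditSpan span;
  private final RemoteIterator<T> source;

  SpanWrappedIterator(AuditSpan span, RemoteIterator<T> source) {
    this.span = span;
    this.source = source;
  }

  @Override
  public boolean hasNext() throws IOException {
    span.activate();   // re-enter the span on whichever thread is iterating
    return source.hasNext();
  }

  @Override
  public T next() throws IOException {
    span.activate();
    return source.next();
  }

  @Override
  public void close() throws IOException {
    span.close();      // the span stays open unless the caller closes the iterator
  }
}
```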
The latest PR is moving back to try-with-resources/some closure mechanism; close() only deactivates the span, it does not end it.
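As a usage sketch of that closure mechanism (`auditManager.createSpan()` is hypothetical; the point is only that close() restores the previous span rather than ending this one):

```java
// Usage sketch (hypothetical createSpan()): close() restores the previous
// active span on exit; the span object itself stays live for any stream or
// iterator which captured it inside the block.
try (AuditSpan span = auditManager.createSpan("op_mkdirs", path)) {
  // ... S3 requests issued here run inside the span ...
}
```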
@bgaborg: thanks for the comments, will take them on. I've done another iteration of this which won't currently compile (it pulls in the request factory from my refactoring, to do the prepareRequest() in there rather than throughout the source). Also I'm trying to make that context for the logs really useful by wiring up Spark/MR jobs and including that info. We will be able to see which job is doing the IO, rather than just the user. Sweet, eh?
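To make that concrete, a sketch of the kind of referrer value meant here; the host, path, and parameter names are invented for illustration, not the format this PR settles on:

```java
/**
 * Hypothetical format: pack the audit context into a pseudo-URL so that it
 * lands in the S3 server access logs via the Referer header, with the
 * operation, principal and job all recoverable as key=value query params.
 */
final class ReferrerExample {
  static String buildReferrerHeader(String op, String principal, String jobId) {
    // e.g. https://audit.example.org/op_rename/?op=op_rename&pr=alice&job=job_1617_0001
    return String.format(
        "https://audit.example.org/%s/?op=%s&pr=%s&job=%s",
        op, op, principal, jobId);
  }
}
```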
Force-pushed from 1a7729c to 0399008.
Testing: all the tests which talk to Landsat or Common Crawl fail; that's what happens when you switch to the new S3 client constructor. Which I didn't think I had done, not on this branch.
Force-pushed from 0399008 to 2e7e4e9.
Latest PR moves the prepareRequest() callback of the active span into the AWS SDK. HEAD^ does it in the request factory construction; that's still supported, but if we can keep the AWS SDK wiring, that will be better.
But this does mean we need to solve the problem of endpoints and regions.
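A sketch of what that SDK wiring can look like via the v1 SDK's `RequestHandler2` extension point (`RequestHandler2` and `withRequestHandlers()` are real SDK APIs; the audit types are placeholders for this PR's interfaces):

```java
import com.amazonaws.AmazonWebServiceRequest;
import com.amazonaws.Request;
import com.amazonaws.handlers.RequestHandler2;

/** Sketch: one handler registered on the S3 client sees every SDK request. */
public class AuditRequestHandler extends RequestHandler2 {

  /** Placeholder for the PR's "active span for this thread" lookup. */
  public interface SpanSource {
    Span activeSpan();
  }

  /** Placeholder span: may check/annotate a request, supplies a referrer value. */
  public interface Span {
    AmazonWebServiceRequest prepareRequest(AmazonWebServiceRequest request);
    String referrerHeader();
  }

  private final SpanSource spans;

  public AuditRequestHandler(SpanSource spans) {
    this.spans = spans;
  }

  @Override
  public AmazonWebServiceRequest beforeExecution(AmazonWebServiceRequest request) {
    // span checking happens once, here, rather than at every call site
    return spans.activeSpan().prepareRequest(request);
  }

  @Override
  public void beforeRequest(Request<?> request) {
    // wire-level hook: attach the audit context header
    request.addHeader("Referer", spans.activeSpan().referrerHeader());
  }
}
```

It would be registered when building the client, e.g. `AmazonS3ClientBuilder.standard().withRequestHandlers(new AuditRequestHandler(spans))`, which is where the endpoint/region question bites: the builder configures endpoints differently from the deprecated client constructors.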
Force-pushed from dfa4985 to 15cf403.
Force-pushed from 84162d6 to 480ec15.
This PR is (and will repeatedly be) rebased on top of #2778; as that gets revised, this will be force pushed. The latest revision pulls out getContentSummary, as it turns out some apps use it too much; the stats and logs will now make clear how much IO it uses.
💔 -1 overall
Force-pushed from da731e2 to dba6ea2.
Will have to rebase (sighs). The last version removed some of the callbacks from the AWS SDK related to copy, as I didn't think they were needed/actually worked. They are. The transfer manager initiates its multipart copy in an executor; somehow we need to add the relevant trace data there.
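One possible shape for that, sketched with placeholder audit types: capture the submitter's span when work is handed to the transfer manager's executor, then re-activate it in the worker thread.

```java
/** Sketch: wrap tasks so worker threads join the submitting thread's span. */
final class SpanPropagation {

  /** Placeholder for the PR's span activate/deactivate operations. */
  interface Span {
    void activate();
    void deactivate();
  }

  /** Capture the submitter's span now; re-activate it inside the worker. */
  static Runnable inSpan(Span span, Runnable task) {
    return () -> {
      span.activate();       // worker thread picks up the caller's span
      try {
        task.run();
      } finally {
        span.deactivate();   // restore the worker thread's previous state
      }
    };
  }
}
```

Usage would be something like `executor.submit(SpanPropagation.inSpan(currentSpan, copyTask))`; the catch, as noted above, is getting the transfer manager to run its copy parts through such a wrapper.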
Force-pushed from dba6ea2 to 8468a7f.
Moves to the builder API for AWS S3 client creation, and offers a similar-style API to the S3A FS and tests, hiding the details of which options are client options and which are AWS configuration, and doing the wiring up of S3A stats interfaces to the AWS internals. This will break HBase's HBoss test suite, but I couldn't see a way to cleanly address this; I'll need to work on that code separately.

Change-Id: I4fb05f4a54eece8fac96abf8d6af9c889392a6c1
Change-Id: I9ea831f5fec48ebcc7d08fe0e5d6918f8dfa2786
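For context, the v1 SDK builder style being adopted looks roughly like this (`AmazonS3ClientBuilder` and its setters are real SDK v1 APIs; the method and variable names are illustrative):

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.auth.AWSCredentialsProvider;
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// Sketch: the builder makes explicit which settings are client options
// (credentials, endpoint) and which live in the AWS ClientConfiguration,
// and gives one place to register audit request handlers.
final class S3Clients {
  static AmazonS3 createClient(AWSCredentialsProvider credentials,
      ClientConfiguration awsConf, String endpoint, String region) {
    return AmazonS3ClientBuilder.standard()
        .withCredentials(credentials)
        .withClientConfiguration(awsConf)
        .withEndpointConfiguration(new EndpointConfiguration(endpoint, region))
        .build();
  }
}
```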
Notion of AuditSpan which is created for a given operation; goal is to pass it along everywhere. It's thread local per FS-instance; store operations pick this up in their constructor from the StoreContext. The entryPoint() method in S3A FS has been enhanced to initiate the spans. For this to work, internal code SHALL NOT call those entry points (done) and all public API points MUST be declared as entry points. Tricky given the lack of layering in this branch. The auditing is intended to be a plugin point; currently there is one which logs at Info (yes, noisy but good for development).

Change-Id: Idddbbd3138e52b40e97f9c2383453190a02cd8ec

HADOOP-15711. Audit Log.
- compiles
- activate() is a no-op if the current thread's audit span is the one to activate.
- input streams pick up context
- some audit-driven tuning of test teardown

Running tests: the key omitted coverage is commit operations. One thing to (seriously) consider is creating a WriteOperationsHelper for each writing op, storing the context in it, then activating on each operation. Need to consider the full lifecycle.

Change-Id: I2998c4980fce3c65984fd728ec98e3299d4c329d

HADOOP-17511. Audit Log
* mkdir working again
* review changed code and tune slightly
* strip out some surplus getFileStatus calls from contract test utils.

Change-Id: I5124d6e9579abbe650fe53e2531aff50dc2fedcd

HADOOP-15711. Auditing
Listing is (probably) wired up; IOStats integration too.

Change-Id: I4293c99d7a67a05d659ec3c690daeab5c883a5fb

HADOOP-15711. Auditing
* Dynamic loading of the audit class
* IOStatistics integration is better
* Fewer (no?) activations of the active span in the current thread
* log audit uses timestamp of creation as a partial ID, and a unique counter per instance (for better multi-fs logging)
* thread pools for s3 buckets include the bucket name (gets into log4j logs)

Change-Id: Id2f977354851f3dd96c005450d46b8181dbf567e

HADOOP-15711. S3A Auditing
All FS entry points instrumented for auditing.
* WriteOperationHelper is no longer a singleton and picks up the active span when created. This lets us use it in places like commit logic without them needing changed everywhere.
* various helper classes of S3A now extend AbstractStoreOperation, so pick up their span automatically. With the WriteOperationHelper code it's *less* critical, but...
* before almost every S3 API call, the active span gets to "prepare" the request. That's where headers can be set, etc. Even without header manipulation, this is where checks for calls without a valid span can go.
* which I'm trying to do in the logging code, including incrementing the .failed counter. Doesn't seem to be live though.

Yes, this is a fairly large patch, but not through complexity; it's just that every entry point needs to become a span. Life would be much, much easier with an MVC-style structure, as it's ~impossible to be confident that various public S3AFS API calls aren't being made externally.

Change-Id: I22dfe6f5bdb7bae4c4ba14f2a77e460f12ca1f14

HADOOP-17511. Auditing
* adds referer header to AWS requests; this gets into the logs (todo: optional)
* can declare that lack of a span for an s3 request is a failure (this doesn't get picked up in copy ops, BTW)
* passes span through submitted callbacks
* moves most of the span management logic into ActiveAuditManager
* getWriteOperationHelper() now dynamically creates a helper within the active span, rather than a long-lived static instance.

Change-Id: Iecb9911a84421cec0e33011db60159c7c6ee79f3
HADOOP-15711. Auditing: scope/coverage
Running all the tests with auditing set to fail if any call is made unaudited highlights scope for improvement.

Production
* Critical: nested scopes
* Committers: scoping
* S3GuardTool spans around lower-level operations.
* Multipart listing to become a spanned entry point.

Test
* create spans before all tests doing low-level things

Change-Id: Ic3577681181942af1a0807dc78ff6393bb3d325f

HADOOP-17511. audit elements as ?k=v params; MR/job ID.
* multiple params go in as key=value
* includes UGI principal
* and the MR or Spark job which created the FS instances/auditor

For MR/Distcp, the JobID will work. For Spark, it's not enough. Assuming that the FS instances are cached across tasks (they should be, for performance reasons), the scope is incorrect. We'd really need task setup patching in a Context scope somewhere.

Change-Id: Idc4e7eec29e0137e002be4ae56efed8798b8e337

HADOOP-17511. auditing ongoing
* CommonContext for JobID from Spark/MR jobs
* RequestFactory from the HADOOP-16848 refactoring; preparation takes place there
* new s3a.api package for the new API stuff

Change-Id: Ib11e53304506897f2366672b268257e668a403ed

HADOOP-15711. Auditing: use RequestFactory everywhere and wrap invocation as appropriate

Change-Id: Ibabb5828c9db018cd9262ae814d37a1a1a15fb02

HADOOP-17511. Auditing: request preparation now happens in the AWS SDK

Change-Id: Ib2be23510db72b6152658523ae7c79117f0e4391

HADOOP-17511. Auditing; with HADOOP-13551 metrics.
* Get the new builder working, with metric wireup in the test
* AuditSpan prepareRequest() ops are called through the S3 SDK
* List iterators are cleaned up in s3a code and FS Shell; this closes the spans.
* Starting on a document

Change-Id: I728fa071f81da5dbc52fd517b537004107cdf65b

HADOOP-15711. Auditing: trying to get COPY to audit
Wiring up through the request header to try to attach a context to all requests. But the context list is *not* passed through the xfer manager. Next attempt: state change callbacks. This may work for one thread, but as many blocks are copied across different threads, it lacks complete coverage.

Plan: give up on trying to get an audit log entry for copy/rename. The only next step would be to move off the xfer manager and do it ourselves, which isn't necessarily bad, but it is extra work. Or: somehow the thread pool/executor could be made "clever", with every thread submitted through the transfer manager (created uniquely for each copy) picking up the span.

Change-Id: I424675228a0fd9b45c0b1ce8de24b3c20d9cfdfe

HADOOP-15711. Auditing.
* lots more logging, and useful stuff at that
* Input stream correctly creating requests through the system. This includes moving the input stream away from direct use of the s3 client to going through an interface; the S3AFS impl includes the audit span and sets it up.

Change-Id: Ic8969e4444333cf557247aff8c1bcaca64ed9ebf

HADOOP-15711. Move ~all setup into AuditManager
* AuditManager is a CompositeService; wraps the active audit service
* Unit tests for span lifecycle showed more work was needed there in order to have the "reset span" (naming?) always available, active...
* logging auditor moves very low level logs to trace; preparation -> debug

Change-Id: I0ef81dd3d26fe2b8eee943c73808e4791c1a1dd7

HADOOP-15711. Auditing: mkdir working; listing getting complex
All the mkdir logic is moved into MkdirOperation with callbacks for required FS calls; tests isolate this and verify cost. All within the same audit span.
Listing:
* the span to use is passed down and cached in the object listing iterator, where it is passed in to the listing callbacks.
* the close() of the span is no longer performed when the object listing completes. The complication here is that the delete and rename calls also use the same operations; they don't want their spans closed prematurely.

Change-Id: I773e98694c4403857b48ddaa529303498c8d752b

HADOOP-15711. Auditing for multipart listing; other failures
Find/fix failures triggered by ops happening outside spans, either genuine (multipart, getBucketLocation()) or from test cases which bypass the public entry points. Proposed: test suite teardown to reset any span.

Change-Id: I88aba852161320c9391541522eaba73e4a5f708d

HADOOP-15711. Auditing: redefine AuditSpan lifecycle
Declare that AuditSpans last until GC'd, moving to making sure that every use deactivates afterwards. Lets me move all of the span knowledge from S3ABlockOutputStream into WriteOperationHelper... should see if the same can be done for other bits.

Change-Id: I28627438d2ba9f0a1118f05f3a4a4d7191830db2

HADOOP-15711. Auditing: invocation coverage and enhancement
- mkdirs and getFileStatus are duration-tracked; the tracking of that is integrated with the auditing.
- StoreContext provides access to the RequestFactory for anything which wants to make AWS requests.
- ExecutingStoreOperation extends CallableRaisingIOE<T>. This allows an operation to be passed straight into a closure/executor!

Change-Id: Id24c7d3b8025a6d058d0195fea6def45bcfbe9a7

HADOOP-15711. Auditing: S3AFS integrates audit and duration invocations
Makes for cleaner code which also measures API duration.

Change-Id: I7586b39630972007491d4229d9e28cbc7e39d074
* added getContentSummary as a single span, with some minor speedups
* still trying to find the best design to wire up dynamically evaluated attributes

Currently logs should include the current thread ID; we don't yet pick up and include the thread where the span was created, which is equally important.

Change-Id: Ieea88e4228da0ac4761d8c006051cd1095c5fce8
* Select is spanned properly
* HttpReferrer adds dynamic evaluation of the current thread.
* Every entry point in S3A code which creates a span is tagged @AuditSpan.AuditEntryPoint. This is to hint at what is/isn't spanned. Public code MUST go through an AuditEntryPoint; internal code within an existing span MUST NOT, as doing so will lose the first span.
* revert test fix of ITestAssumeRole as it is covered in HADOOP-17476/apache#2777

Change-Id: I2cfdba907df0dcc76629e5af8effeed8d25d5933
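As an illustration of that marker, a sketch (the actual annotation is nested in the PR's AuditSpan-related classes, and its retention policy is the PR's choice):

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

/**
 * Marker: a method which creates a new audit span. Public callers must
 * enter through one of these; code already inside a span must not, or
 * the original span is lost.
 */
@Documented
@Retention(RetentionPolicy.SOURCE)
@Target(ElementType.METHOD)
@interface AuditEntryPoint {
}
```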
* move audit impl classes into the private package fs.s3a.audit.impl: logging + no-op auditors, ActiveAuditManager
* AuditIntegration provides indirect access to these as needed.
* HeaderProcessing has its own callbacks class; reverted the recent addition of a StoreContext getObjectMetadata call.
* fixed latest round of style errors
* updated docs with updated PNG
* tuning referrer entry: moving all args towards ? params as they are logged.
* log audit manager lifecycle events
* also: S3AFS logs stats @ debug in close

Change-Id: If3785aff94489dc2867396c0c14ebe277f1fb48b
Change-Id: Ifd1b217861c646a9c17b56614bd1f6da570e3d15
Add an interface ActiveThreadSpanSource to give the current thread's span. This is to allow integrations with the AWS SDK etc. to query the active span for an FS and immediately be able to identify the span by ID, for logging etc.

Adding a unique ID to all audit spans and supporting an active thread span (with activate/deactivate) causes major changes in the no-op code, as suddenly there's a lot more state there across manager, auditor and span. Will be testable though.

Change-Id: Id4dddcab7b735bd01f3d3b8a8236ff6da8f97671
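A sketch of the shape such an interface suggests (method names are guesses, not the PR's final API):

```java
/** Sketch: lets SDK-side code ask which span the current thread is in. */
public interface ActiveThreadSpanSource {

  /** Placeholder span carrying the unique ID described above. */
  interface Span {
    String getSpanId();
  }

  /**
   * @return the active span for the current thread; never null, returning
   * the "reset"/no-op span when no operation is in progress.
   */
  Span getActiveSpan();
}
```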
Force-pushed from af0e305 to 4c41298.
💔 -1 overall
PR no longer applies, even with rebase. Will have to close and resubmit.
See #2807 for the successor PR.
Superseded by #2807
NOTICE
Please create an issue in ASF JIRA before opening a pull request, and set the pull request title to start with the corresponding JIRA issue number (e.g. HADOOP-XXXXX. Fix a typo in YYY.)
For more details, please see https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute