Leveraging Lambda Powertools Logger for querying in Athena or other AWS services #460
Comments
Hey @rickyaws - thanks for creating this detailed feature request, much appreciated. Reading from mobile, I don't think I fully understand the main issue -- is it the timestamp field value? Or is it that plus the known limitation of Kinesis Firehose not adding a newline for JSON objects, which breaks Glue/Athena? If it's the former, have you experimented with a different date format using "Logger(datefmt=)"?

There's an open discussion in the Python repo about creating a tutorial to educate customers on structured and canonical logging, and on the log analytics tools commonly used across the board -- e.g. ElasticSearch, CloudWatch Logs Insights (cross-account limitation, I know), Honeycomb, Datadog, Loggly, etc. That's something we've been discussing with the Well-Architected team, since it spans multiple technologies and is an industry practice.

For now, I'm interested in understanding whether the timestamp is the issue, so I can work with you and your customer on fitting this into Athena. We might move this to the Python repo, as this seems more like docs than a feature to be added/enhanced. Thanks again! |
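As an aside, a minimal sketch of what overriding the date format could look like; the service name and format string below are illustrative, not taken from the original report:

```python
from aws_lambda_powertools import Logger

# Illustrative service name and strftime-style date format
logger = Logger(service="orders", datefmt="%Y-%m-%dT%H:%M:%S%z")


@logger.inject_lambda_context
def lambda_handler(event, context):
    logger.info("Processing order")
    return {"statusCode": 200}
```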
Cc @pcolazurdo for a discussion later |
It’s CloudWatch - we don’t do that. That seems to be the direct integration with Athena.
In other projects I exported CloudWatch Logs -> Lambda for pre-processing -> Kinesis Firehose -> S3 -> Glue -> Athena. The pre-processing Lambda would do two things: 1/ use only the payload, nothing else, and 2/ append a base64-encoded newline to the payload, because Kinesis Firehose doesn’t and that breaks Glue and Athena.
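For context, a rough sketch of that kind of pre-processing Lambda, assuming a CloudWatch Logs subscription as the trigger and a hypothetical Firehose delivery stream name:

```python
import base64
import gzip
import json

import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "central-logs"  # hypothetical delivery stream name


def lambda_handler(event, context):
    # CloudWatch Logs subscriptions deliver a gzipped, base64-encoded payload
    payload = json.loads(gzip.decompress(base64.b64decode(event["awslogs"]["data"])))

    records = []
    for log_event in payload.get("logEvents", []):
        # 1/ keep only the log message itself, 2/ append a newline so the
        # objects Firehose writes to S3 stay line-delimited for Glue/Athena
        records.append({"Data": (log_event["message"].rstrip("\n") + "\n").encode("utf-8")})

    # Real code should respect the 500-record batch limit and retry failed records
    if records:
        firehose.put_record_batch(DeliveryStreamName=DELIVERY_STREAM, Records=records)
```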
…On Thu, 20 May 2021 at 19:53, Michael Warkentin ***@***.***> wrote:
Is the timestamp starting the message actually coming from
lambda-powertools or is that something that cloudwatch is adding during the
export to s3?
[image: image]
<https://user-images.githubusercontent.com/67560/119026037-b8a28100-b972-11eb-96b8-67cb309e8bb7.png>
|
@heitorlessa Yeah, I think your proposal makes sense; it's more about documentation than a feature per se. I was essentially thinking that, since Powertools already structures the logs as JSON, it would reduce the pre-processing steps someone would have to do to create a centralized logging architecture at the least cost (our customer is cost conscious) through the steps you mention below. So in my mind, Athena/Glue could use their built-in JSON SerDe to quickly build the schema from the JSON structure, but the timestamp CloudWatch adds essentially messes that up. In short, I was trying to find documentation or use cases to show the customer how they can leverage this format to make log aggregation and day-to-day operational analysis easier than using the Python built-in logger and defining all the structure themselves, similar to how the Metrics module significantly reduces the overhead of creating custom metrics and then using those metrics for alarms and notifications.
@mwarkentin -> I was trying to figure out how to query in CloudWatch Logs Insights but was not having great success; I will take your example and retry. The only limitation for our use case with Insights is that you would have to go into each individual account to look at the logs, as opposed to managing them all in a central Security account. |
@rickyaws that makes sense.. makes me think that cross-account cloudwatch insights (or sharing log groups with RAM?) could be pretty powerful too (hint hint). :) |
@rickyaws querying with CloudWatch Logs Insights works great [1] and you can automate queries via APIs, create dashboards, etc. The way Logs Insights works with the CWL string format directly makes querying this very easy. Also check out Contributor Insights, as it is a very useful feature (watch out for the costs though). Now, the idea of using Athena directly to parse these logs is going to become very complex (if at all possible) due to two facts:
Trying to use Grok or regex to parse the logs will make it very hard to maintain and process, for the same reasons. As an alternative, you could use the JSON format on the subscription filter, which will properly wrap all the output from the Lambda logs (including the Powertools Logger) into a single field named "message" as an encoded string, but Athena isn't very powerful at turning this back into JSON AFAIK. i.e. [ ...
{
"id": "36168440376909863750225378866398863959406407651950067771",
"timestamp": 1621848958630,
"message": "REPORT RequestId: 4fae6371-c5ce-432d-812e-67cd621e2819\tDuration: 298.08 ms\tBilled Duration: 299 ms\tMemory Size: 128 MB\tMax Memory Used: 90 MB\t\nXRAY TraceId: 1-60ab737e-47c4b2fd43ec6df90c90c3c4\tSegmentId: 59a6cccc6f62a3d8\tSampled: true\t\n"
},
{
"id": "36168440376040134687482684563878970946773121553216831544",
"timestamp": 1621848958591,
"message": "{\"level\":\"INFO\",\"location\":\"lambda_handler:76\",\"message\":\"Completed\",\"timestamp\":\"2021-05-24 09:35:58,591+0000\",\"service\":\"crawler_frontapi_save\",\"cold_start\":false,\"function_name\":\"ServerlessCrawlerStack-CrawlerFrontAPISaveFunction-1C5FWAYTV559J\",\"function_memory_size\":\"128\",\"function_arn\":\"arn:aws:lambda:us-east-2:123456789012:function:ServerlessCrawlerStack-CrawlerFrontAPISaveFunction-1C5FWAYTV559J\",\"function_request_id\":\"4fae6371-c5ce-432d-812e-67cd621e2819\",\"parse_id\":\"#2021-05-24T09:35:58.491645\",\"url_site\":\"\",\"xray_trace_id\":\"1-60ab737e-47c4b2fd43ec6df90c90c3c4\"}\n"
},
...]
[1] As an example of the power of CloudWatch Logs Insights, this is a very useful query:
|
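To illustrate the "automate queries via APIs" point above, a minimal boto3 sketch; the log group and query string below are illustrative, not the query referenced in [1]:

```python
import time

import boto3

logs = boto3.client("logs")

# Illustrative query against the Powertools structured fields
QUERY = """
fields @timestamp, level, service, message, function_request_id
| filter level = "ERROR"
| sort @timestamp desc
| limit 20
"""


def run_insights_query(log_group: str, start_epoch: int, end_epoch: int):
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=start_epoch,
        endTime=end_epoch,
        queryString=QUERY,
    )["queryId"]

    # Poll until the query finishes
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] in ("Complete", "Failed", "Cancelled"):
            return response["results"]
        time.sleep(1)
```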
After doing some additional research, I don't think Athena will be a good alternative for querying structured logs, due to the variety of fields that can be produced in each output, and there is nothing in Lambda Powertools itself that could improve this situation. If anything, additional work should be done on Kinesis Firehose to export only specific records and, of these, only specific fields. I suggest closing this request, as it is my understanding this is not something that Lambda Powertools itself could or should handle. |
Agree, thanks a lot @pcolazurdo for taking this for a spin! @rickyaws, the suggestion moving forward is to centralise these logs into ElasticSearch, and to add a +1 to CloudWatch Logs Insights cross-account search. If ElasticSearch/ELK isn't an option here due to management overhead, you can use Kinesis to centralize CloudWatch Logs into another account - it helps with management but increases costs too. Before we close it, I'll transfer it to the Python repo and update the Logger FAQ to make this info more broadly available. Thank you all! |
Thanks for opening your first issue here! We'll come back to you as soon as we can. |
Just added to docs and will be available in the next docs release (hopefully next week). Tagging to share once it's available. |
Runtime:
Python
Is your feature request related to a problem? Please describe
I have a customer (IHAC) who was looking into the benefits of the Logger formatting, but was having a hard time getting Athena to ingest the logs from CloudWatch and create a schema out of them, and was wondering whether any consideration had been given to making the Logger format friendlier to data ingestion by AWS services outside of CloudWatch. For example, this is how it currently outputs to CloudWatch with the Powertools Logger on:
2021-05-10T15:26:57.772Z {"level":"INFO","location":"checkOrderInput:50","message":"Checking required fields ['contact', 'address', 'id', 'expedite', 'installDate', 'addressValidationOverride', 'product', 'asset']","timestamp":"2021-05-10 15:26:57,771+0000","service":"new-order","xray_trace_id":"1-609950be-18655ee7e321f53ab8b4f629"}
With this format you could extract two columns, or use the Grok SerDe in Athena with regex to try to grab a pattern; in either case you would still need to run this through some type of data processing, no different than if you used the built-in Python logger with custom configuration, but slightly more challenging due to the JSON structure.
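To make that processing step concrete, a small sketch (using an abbreviated version of the sample line above) that splits the CloudWatch-added timestamp prefix from the Powertools JSON payload:

```python
import json

# Abbreviated version of the sample line above: timestamp prefix + JSON payload
raw = '2021-05-10T15:26:57.772Z {"level":"INFO","service":"new-order","message":"Checking required fields"}'

cw_timestamp, _, payload = raw.partition(" ")
record = json.loads(payload)

print(cw_timestamp)                        # 2021-05-10T15:26:57.772Z
print(record["level"], record["service"])  # INFO new-order
```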
Describe the solution you'd like
Either for the Powertools Logger to just output each line as a JSON object (removing the timestamp at the beginning of the log) so Athena/Glue can use the built-in JSON SerDe to parse it and create columns, or some documentation or examples on how to leverage this logger formatting in queries and create Metric Filters from it.
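On the Metric Filters point, a hedged sketch of creating one on the structured JSON fields via boto3; the log group, filter, and metric names are illustrative:

```python
import boto3

logs = boto3.client("logs")

# Illustrative names; the filter pattern targets the Powertools JSON fields
logs.put_metric_filter(
    logGroupName="/aws/lambda/new-order",
    filterName="new-order-errors",
    filterPattern='{ $.level = "ERROR" && $.service = "new-order" }',
    metricTransformations=[
        {
            "metricName": "NewOrderErrors",
            "metricNamespace": "Powertools/Logs",
            "metricValue": "1",
            "defaultValue": 0.0,
        }
    ],
)
```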
Describe alternatives you've considered
Using a messy DDL statement in Athena with the Grok SerDe:
CREATE EXTERNAL TABLE `ugi` (
  `loglevel` string COMMENT 'from deserializer',
  `timestamp` string COMMENT 'from deserializer',
  `service` string COMMENT 'from deserializer',
  `traceid` string COMMENT 'from deserializer'
)
ROW FORMAT SERDE
  'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
  'input.format'='(?<loglevel>"level":"(.{4,10})),([^.]+)(?<timestamp>"timestamp":"(.{10,28}))"([^.]+)(?<service>"service":"(.{3,10}))",([^.]+)(?<traceid>xray_trace_id":"(.{30,40}))"'
)
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://ugi/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0',
  'CrawlerSchemaSerializerVersion'='1.0',
  'UPDATED_BY_CRAWLER'='powertoollogs',
  'averageRecordSize'='191',
  'classification'='powertoollogs',
  'compressionType'='none',
  'grokPattern'='(?<loglevel>"level":"(.{4,10})),([^.]+)(?<timestamp>"timestamp":"(.{10,28}))"([^.]+)(?<service>"service":"(.{3,10}))",([^.]+)(?<traceid>xray_trace_id":"(.{30,40}))"',
  'objectCount'='1',
  'recordCount'='1',
  'sizeKey'='191',
  'typeOfData'='file')
Just using the built-in Python logger and the traditional CW -> Firehose -> S3 architecture for streaming the logs into an S3 bucket for Athena, to avoid parsing the JSON structure.
If you provide guidance, is this something you'd like to contribute?
I am not the best developer but sure! I could help
Additional context
Providing some examples of how others have leveraged the Powertools Logger, or any use cases where this logger has made operational tasks easier, would be very valuable and easier to sell to customers. Right now, outside of the nice, structured, uniform formatting it creates in CloudWatch Logs, I do not see how to then use this data efficiently.