-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[serverless] add cspm metering functionality #162019
[serverless] add cspm metering functionality #162019
Conversation
f831c5c
to
36e17a9
Compare
8d7fccc
to
619ae1b
Compare
a91b3f1
to
4d00c37
Compare
e87f7b6
to
e69f8f0
Compare
cfe526d
to
23073e3
Compare
b015f35
to
3e272d4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking good, just some small things
@@ -9,4 +9,4 @@ | |||
const namespace = 'elastic-system'; | |||
const USAGE_SERVICE_BASE_API_URL = `http://usage-api.${namespace}/api`; | |||
const USAGE_SERVICE_BASE_API_URL_V1 = `${USAGE_SERVICE_BASE_API_URL}/v1`; | |||
export const USAGE_SERVICE_USAGE_URL = `${USAGE_SERVICE_BASE_API_URL_V1}/usage`; | |||
export const USAGE_SERVICE_USAGE_URL = `${USAGE_SERVICE_BASE_API_URL_V1}`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this on purpose? I believe we need the /usage
: https://github.com/elastic/usage-api/blob/5414a1c5f170b81552deb2140307e135ad93c7e0/api/user-v1-spec.yml#L50
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you right:)
meteringCallback: MeteringCallback; | ||
} | ||
|
||
export interface SecurityMetadataTaskStartContract { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's rename this to: SecurityUsageReportingTaskStartContract
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done:)
|
||
let usageReportResponse: Response | undefined; | ||
|
||
try { | ||
usageReportResponse = await usageReportingService.reportUsage(usageRecords); | ||
} catch (e) { | ||
this.logger.warn(JSON.stringify(e)); | ||
this.logger.info(`usage records report was sent successfully`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking it might make sense to log the response and some of the usage record data on success for compliance and billing debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's good idea to add the response, regarding the usage record data, we already have it in line 127
taskId: this.taskId, | ||
lastSuccessfulReport, | ||
}); | ||
this.logger.info(`received usage records: ${JSON.stringify(usageRecords)}`); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry about the extra resources/cost we would incur here, especially if it's a lot of records. maybe logger.debug
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree
aggs: { | ||
unique_resources: { | ||
cardinality: { | ||
field: 'resource.id', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
precision treshold should be added here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add this comment in the code for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I planned to add a value as part of this PR, I am going add it today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good Job Ido!
For the first iteration the most important comments are:
- Necessary Decoupling of benchmarks and bucket
- Necessary documentation of the core of the metering function
aggs: { | ||
unique_resources: { | ||
cardinality: { | ||
field: 'resource.id', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add this comment in the code for now
}; | ||
} | ||
|
||
interface BenchmarkBucket { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, It doesn't seem like there's going to be a bucket per benchmark.
For example, KSPM and CSPM belong to the same bucket. Additionally, CNVM isn't benchmark related at all.
I suggest decoupling benchmarks and buckets in terminology and implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KSPM and CSPM belong to the same bucket
wdym? We are going to charge per sulotion, so we will have a billing bucket for KSPM, CSPM and Vul mngm. This PR handling CPSM only.
I mean here to the CSPM benchmark which for now is cis_aws
(and soon we will support GCP as well) I did it to provide more context on the charge, it used in the cause
field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my understanding, we have a single bucket for KSPM + CSPM + CNVM as @MikePaquette described here https://github.com/elastic/security-team/issues/6497#issuecomment-1627719606
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds weird to me, in this case KSPM resource will be equal from price perspective as an EC2 instance on vulnerability mngm?
Anyway, this implementation method will enable us to support also the method you mentioned via the billing transform function and will enable also decouple the collectors tasks which I think will be easier to maintain it that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to disagree.
We know Kibana, Kibana tests and so on way better than we know whatever infrastructure that would be provided by the transform function. I think we should keep the transform function as lean and stupid as possible.
Additionally, Bucket is a metering specific term that isn't related to benchmarks.
We have no reason for counting separately which complicates the metering for no reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already updated the code to one single bucket for cloud security
const projectId = cloudSetup?.serverless?.projectId || 'missing project id'; | ||
logger.error('no project id found'); | ||
|
||
const response = await esClient.search<unknown, Benchmarks>(getFindingsByResourceAggQuery()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've discussed why querying isn't enough, and additional processing in the code is necessary. Please include the reasoning here as a comment with relevant links and an (very short) explanation why the alternative approach of querying directly isn't used. (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html#_counts_are_approximate)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eyalkraft I researched that and learned that behind the scenes term aggregation using the same logic as cardinality. (terms-agg-doc-count-error)
So as you suggested in our discussion, querying is indeed enough -without another iteration in Kibana.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good 👍
Is the code updated then and I didn't realize it? Why are there still iterations over the query results in the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the benchmarks...
https://github.com/elastic/kibana/pull/162019/files#r1271927221
taskTitle: 'Security Solution - CSPM Metring Periodic Tasks', | ||
meteringCallback: cspmMetringCallback, | ||
interval: '3600s', // 1 hour | ||
periodSeconds: 3600, // equal to task interval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use the actual task interval variable here? to make sure things won't brake in case it changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
? cspmBenchmarks.reduce( | ||
(accumulator, benchmarkBucket: BenchmarkBucket) => { | ||
accumulator.benchmarks.push(benchmarkBucket.key); | ||
accumulator.sumResources += benchmarkBucket.unique_resources.value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, not sure why the breaking down by benchmarks to later sum all together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of this was to provide more context about the charge, which was used in the cause field.
The total number of benchmarks that the customer may have will be in future no more than three or four (today it will be a maximum of ~2) , so it will be a very small array iteration and I believe it is worth it.
|
||
const usageRecords = { | ||
id: `${TASK_TYPE_PREFIX}:${CSPM_METRING_TASK_NAME}`, | ||
usage_timestamp: minTimestamp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this should be the minTimestamp
, and not the time when the billing query happens.
What are the implications of this value?
What happens if a user uninstalls all the benchmarks and this value is the same (the last finding time before the un-installment) for the entire 24 hours after?
What is this value used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
usage_timestamp
This represents when the usage occurred, or the beginning of the range of usage that this record captures. If the usage is based on an aggregation of other data stored in another system, it should be based on the earliest of those records so we can tie this usage record back to the source data (for troubleshooting or presentation purposes).
taken from usage record schema v2.
run: async () => { | ||
return this.runTask(taskInstance, core, meteringCallback); | ||
}, | ||
// TODO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO what?
@@ -109,29 +104,35 @@ export class SecurityUsageReportingTask { | |||
) => { | |||
// if task was not `.start()`'d yet, then exit | |||
if (!this.wasStarted) { | |||
this.logger.debug('[runTask()] Aborted. Task not started yet'); | |||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't this be worth logging?
@eyalkraft, I planned to document the work, can you elaborate more about what you mean in "the core of the metering function" |
Talking about the file |
aae097f
to
3be9a0f
Compare
270abfb
to
61b3a9a
Compare
cded412
to
903e6a2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Required changes:
- On no resources, don't send anything to the metering service (don't send an empty array). Make sure that not sending is equivalent to sending count 0.
Following PR:
- tests
f0ecd32
to
2886e68
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, please add the tests ASAP
💚 Build Succeeded
Metrics [docs]
History
To update your PR or re-run it, just comment with: |
solves
First usage of the shared task manager class for the security serverless project:
How to test it?
yarn start --serverless=security
(Make sure Elasticsearch is already running before).Look for the specified logger:
plugins.securitySolutionServerless.serverless-security:cspm-usage-reporting-task:1
Review the logs associated with the specified logger to gather the required information.