-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Azure Monitor output plugin #4089
Conversation
e4f0f68
to
aecf5bb
Compare
|
||
``` | ||
# Configuration for sending aggregate metrics to Azure Monitor | ||
[[outputs.azuremonitor]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not required but consider renaming azure_monitor
.
## specified, the plugin will attempt to retrieve the resource ID | ||
## of the VM via the instance metadata service (optional if running | ||
## on an Azure VM with MSI) | ||
#resourceId = "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Compute/virtualMachines/<vm-name>" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use snake_case for all options.
} | ||
|
||
if resp.StatusCode >= 300 || resp.StatusCode < 200 { | ||
return nil, fmt.Errorf("Post Error. HTTP response code:%d message:%s, content: %s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a "GET Error"?
} | ||
|
||
if resp.StatusCode >= 300 || resp.StatusCode < 200 { | ||
return nil, fmt.Errorf("Post Error. HTTP response code:%d message:%s reply:\n%s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GET Error, also convert to a single line error string since these will end up in the logging output.
// t.Logf("metadata is \n%v", metadata) | ||
// } | ||
|
||
//fmt.Printf("metadata is \n%v", metadata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget to clear out this code.
metadataService *AzureInstanceMetadata | ||
instanceMetadata *VirtualMachineMetadata | ||
msiToken *msiToken | ||
msiResource string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this a package const
instanceMetadata *VirtualMachineMetadata | ||
msiToken *msiToken | ||
msiResource string | ||
bearerToken string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use directly from msiToken
msiToken *msiToken | ||
msiResource string | ||
bearerToken string | ||
expiryWatermark time.Duration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this being used but where is it set?
expiryWatermark time.Duration | ||
|
||
oauthConfig *adal.OAuthConfig | ||
adalToken adal.OAuthTokenProvider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This I see being set but not used.
period time.Duration | ||
delay time.Duration | ||
periodStart time.Time | ||
periodEnd time.Time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these two times can just be locals
return nil, fmt.Errorf("Error authenticating: %v", err) | ||
} | ||
|
||
metricsEndpoint := fmt.Sprintf("https://%s.monitoring.azure.com%s/metrics", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing to be careful here -- As custom metrics is in preview on the Azure Monitor side, we don't support all regions of public Azure. We are only going to be available for a few regions as part of the reviews so not all these endpoints will exist.
return nil, fmt.Errorf("Error authenticating: %v", err) | ||
} | ||
|
||
metricsEndpoint := fmt.Sprintf("https://%s.monitoring.azure.com%s/metrics", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For any of the endpoints that the output plugin is communicating with (ex. https://.monitoring.azure.com, we should ideally move all these endpoints into single place in the code, not sprinkle them across all files. This will help future proof the plugin to more easily support other Azure clouds down the road (Azure Germany, Azure China, Azure US Government) which will all have their own endpoints.
aecf5bb
to
40c37aa
Compare
@danielnelson I've updated this branch to use the running output aggregator pattern you suggested. I moved the original PR code to the |
Looks good, like how you dealt with not having the filter for old metrics. Gives me some ideas for improving this with the normal aggregators. |
46e7131
to
4b8d0ad
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding feedback on the Azure Monitor output plugin
#resource_id = "/subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.Compute/virtualMachines/<vm_name>" | ||
## Azure region to publish metrics against. Defaults to eastus. | ||
## Leave blank to automatically query the region via MSI. | ||
#region = "useast" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use "eastus" as the default, not "useast". "useast" will not work -- it won't resolve to any monitoring endpoint.
} | ||
|
||
const ( | ||
defaultRegion string = "eastus" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't have a default region constant. This value either needs to come from instance metadata or come from user configuration. The region must match the region of the Azure resource ID and can't be guessed.
return err | ||
} | ||
|
||
req.Header.Set("Authorization", "Bearer "+a.msiToken.AccessToken) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are not using MSI, where are we setting a.adalToken?
} | ||
defer resp.Body.Close() | ||
|
||
if resp.StatusCode >= 300 || resp.StatusCode < 200 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we only doing fmt.Errorf for status codes in the [300, 200) range. We should also follow this pattern for any errors encountered in the 4xx or 5xx range
Data: &azureMonitorData{ | ||
BaseData: &azureMonitorBaseData{ | ||
Metric: m.Name(), | ||
Namespace: "default", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason we choose to set the metric namespace to be "default"? This might be good for users to override via config but as a default, it might be better to go with a value which has "Telegraf" in the namespace.
for _, m := range azmetrics { | ||
// Azure Monitor accepts new batches of points in new-line delimited | ||
// JSON, following RFC 4288 (see https://github.com/ndjson/ndjson-spec). | ||
jsonBytes, err := json.Marshal(&m) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How large can azmetrics be? I believe the Azure Monitor metric API has a max request body size of 4MB. If we exceed this limit, we should issue multiple POST requests
# timeout = "5s" | ||
|
||
## Whether or not to use managed service identity. | ||
#use_managed_service_identity = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: In the sample configuration, we should make it clear that when use_managed_service_identify is false, it's required that the user supply resource_id, region, azure_subscription, azure_tenant, azure_client_id, and azure_client_secret. These become mandatory parameters.
useMsi bool `toml:"use_managed_service_identity"` | ||
ResourceID string `toml:"resource_id"` | ||
Region string `toml:"region"` | ||
Timeout internal.Duration `toml:"Timeout"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Should this be lower case "timeout" instead of "Timeout"?
AzureTenantID string `toml:"azure_tenant"` | ||
AzureClientID string `toml:"azure_client_id"` | ||
AzureClientSecret string `toml:"azure_client_secret"` | ||
StringAsDimension bool `toml:"string_as_dimension"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is StringAsDimension? There isn't such example in the sample config?
return &AzureMonitor{ | ||
StringAsDimension: false, | ||
Timeout: internal.Duration{Duration: time.Second * 5}, | ||
Region: defaultRegion, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned earlier, we shouldn't treat region special with a default region. This should be treated as the rest of the configuration -- either it comes from instance metadata or from user supplied config.
@danielnelson - I saw the milestone changed from 1.7 to 1.8. What is the timeline for 1.8? |
I believe 1.8 will be finished around the end of August? Does that sounds right @russorat? |
continue | ||
} | ||
for id := range a.cache[tbucket] { | ||
a.cache[tbucket][id].updated = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the interval is lower than the 1m aggregation period, this can get set to false before the metric has had a chance to be returned in Push()
since you may have multiple calls to Reset()
before the metric is returned. This causes the plugin not to write any metrics unless you set the set the flush_interval to at least 1m
.
|
||
// Pull region and resource identifier | ||
err := a.GetInstanceMetadata() | ||
if err != nil && a.ResourceID == "" && a.Region == "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be ||
instead of &&
, we should always return if err != nil
, so that we are sure to see all errors and I think we need both of these set.
Also, if Region and ResourceID are set beforehand, maybe we can skip this function completely?
} | ||
|
||
// GetInstanceMetadata retrieves metadata about the current Azure VM | ||
func (a *AzureMonitor) GetInstanceMetadata() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ideally this function would return region, resource, error
. The calling function would combine these with the plugin settings and create the url. Also, consider passing in the client and making this and the function above a free function.
if err != nil && a.ResourceID == "" && a.Region == "" { | ||
return fmt.Errorf("E! No resource id specified, and Azure Instance metadata service not available. If not running on an Azure VM, provide a value for resource_id") | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we have the final discovered URL, I think we should add a log message with the final version at debug level.
3639d05
to
50854bf
Compare
Identity](https://docs.microsoft.com/en-us/azure/active-directory/msi-overview) | ||
for more details. Only available on ARM-based resources. | ||
|
||
**Note: As shown above, the last option (#5) is the preferred way to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this should be number 4
**Note: As shown above, the last option (#5) is the preferred way to | ||
authenticate when running Telegraf on Azure VMs. The VMs will need to be given | ||
access to the Azure Monitor to publish custom metrics. Instructions on how to | ||
grant access can be found [here]()** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget to add the link target. I suggest just turning the whole sentence into a link:
[Instructions on how to grant access](http://example.org).
If Telegraf is not running on a virtual machine or the VM Instance Metadata service is not available, the following variables are required for the output to function. | ||
|
||
* region | ||
* resourceId |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resource_id
|
||
This plugin will send custom metrics to Azure Monitor. | ||
Azure Monitor has a metric resolution of one minute. | ||
To handle this in Telegraf, the Azure Monitor output plugin will automatically aggregates metrics into one minute buckets, which are then sent to Azure Monitor on every flush interval. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you are trying to do the "semantic linefeeds" style, but can you make sure to wrap all the lines at no more than 78 chars. As an aside, I don't really care for this style, I find it hard to read the plain text and the diff advantage is small as changes can be handled using --word-diff.
|
||
### Configuration: | ||
|
||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this can you run telegraf --usage azure_monitor
and use the output (minus the Description text).
return err | ||
} | ||
|
||
// req, err := http.NewRequest("POST", a.url, bytes.NewBuffer(body)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove commented out code
// refresh the token if needed. | ||
req, err = autorest.CreatePreparer(a.auth.WithAuthorization()).Prepare(req) | ||
if err != nil { | ||
return fmt.Errorf("E! [outputs.azure_monitor] Unable to fetch authentication credentials: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an error, not a log message, so don't add the log level or module: unable to fetch authentication...
.
} | ||
|
||
if resp.StatusCode < 200 || resp.StatusCode > 299 { | ||
return fmt.Errorf("E! Failed to write: %v", string(rbody)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove log level, start with lowercase. I would not include the body as it could be very long.
continue | ||
} | ||
for id := range a.cache[tbucket] { | ||
a.cache[tbucket][id].updated = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should reset this in Push, this will be more resilient against reordered calls even though they shouldn't happen in the current implementation: Push -> Add -> Reset
. Also, Push feels like the more appropriate place to clear the update flag.
@@ -0,0 +1,275 @@ | |||
package azure_monitor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let us know when the tests are ready.
08d6e68
to
5de442b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Appears all that's left is to add your tests and update your branch
fields fields | ||
wantErr bool | ||
}{ | ||
// TODO: Add test cases. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add tests
7760b87
to
fb70450
Compare
Required for all PRs:
This is a new output plugin for Azure Monitor. I will be adding more unit tests soon.