Gangams/release ciprod06112021 (#581)
* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

* Gangams/cluster creation scripts (#414)

* onprem k8s script

* script updates

* scripts for creating non-aks clusters

* fix minor text update

* updates

* script updates

* fix

* script updates

* fix scripts to install docker

* fix: Pin to a particular version of ltsc2019 by SHA (#427)

* enable collecting npm metrics (optionally) (#425)

* enable collecting npm metrics (optionally)

* fix default enrichment value

* fix adx

* Saaror patch 3 (#426)

* Create README.MD

Creating content for Kubecon lab

* Update README.MD

* Update README.MD

* Gangams/add containerd support to windows agent (#428)

* wip

* wip

* wip

* wip

* bug fix related to uri

* wip

* wip

* fix bug with ignore cert validation

* logic to ignore cert validation

* minor

* fix minor debug log issue

* improve log message

* debug message

* fix bug with nullorempty check

* remove debug statements

* refactor parsers

* add debug message

* clean up

* chart updates

* fix formatting issues

* Gangams/arc k8s metrics  (#413)

* cluster identity token

* wip

* fix exception

* fix exceptions

* fix exception

* fix bug

* fix bug

* minor update

* refactor the code

* more refactoring

* fix bug

* typo fix

* fix typo

* wait for 1min after token renewal request

* add proxy support for arc k8s mdm endpoint

* avoid additional get call

* minor line ending fix

* wip

* have separate log for arc k8s cluster identity

* fix bug on creating crd resource

* remove update permission since not required

* fixed some bugs

* fix pr feedback

* remove list since its not required

* fix: Reverting back to ltsc2019 tag (#429)

* more kubelet metrics (#430)

* more kubelet metrics

* clean up new config

* fix nom issue when config is empty (#432)

* support multiple docker paths when docker root is updated thru knode (#433)

* Gangams/doc and other related updates (#434)

* bring back nodeselector changes for windows agent ds

* readme updates

* chart updates for azure cluster resourceid and region

* set cluster region during onboarding for managed clusters

* wip

* fix for onboarding script

* add sp support for the login

* update help

* add sp support for powershell

* script updates for sp login

* wip

* wip

* wip

* readme updates

* update the links to use ci_prod branch

* fix links

* fix image link

* some more readme updates

* add missing serviceprincipal in ps scripts (#435)

* fix telemetry bug (#436)

* Gangams/readmeupdates non aks 09162020 (#437)

* changes for ciprod09162020 non-aks release

* fix script to handle cross sub scenario

* fix minor comment

* fix date in version file

* fix pr comments

* Gangams/fix weird conflicts (#439)

* separate build yamls for ci_prod branch (#415) (#416)

* [Merge] dev to prod for ciprod08072020 release (#424)

* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>

* fix quote issue for the region (#441)

* fix cpucapacity/limit bug (#442)

* grwehner/pv-usage-metrics (#431)

- Send persistent volume usage and capacity metrics to LA for PVs with PVCs at the pod level; config to include or exclude kube-system namespace.
- Send PV usage percentage to MDM if over the configurable threshold.
- Add PV usage recommended alert template.

* add new custom metric regions (#444)

* add new custom metric regions

* fix commas

* add 'Terminating' state (#443)

* Gangams/sept agent release tasks (#445)

* turnoff mdm nonsupported cluster types

* enable validation of server cert for ai ruby http client

* add kubelet operations total and total error metrics

* node selector label change

* label update

* wip

* wip

* wip

* revert quotes

* grwehner/pv-collect-volume-name (#448)

Collect and send the volume name as another tag for pvUsedBytes in InsightsMetrics, so that it can be displayed in the workload workbook. Does not affect the PV MDM metric

* Changes for september agent release (#449)

Moving from v1beta1 to v1 for health CRD
Adding timer for zero filling
Adding zero filling for PV metrics

* Gangams/arc k8s related scripts, charts and doc updates (#450)

* checksum annotations

* script update for chart from mcr

* chart updates

* update chart version to match with chart release

* script updates

* latest chart updates

* version updates for chart release

* script updates

* script updates

* doc updates

* doc updates

* update comments

* fix bug in ps script

* fix bug in ps script

* minor update

* release process updates

* use consistent name across scripts

* use consistent names

* Install CA certs from wireserver (#451)

* grwehner/pv-volume-name-in-mdm (#452)

Add volume name for PV to mdm dimensions and zero fill it

* Release changes for 10052020 release (#453)

* Release changes for 10052020 release

* remove redundant kubelet metrics as part of PR feedback

* Update onboarding_instructions.md (#456)

* Update onboarding_instructions.md

Updated the documentation to reflect where to update the config map.

* Update onboarding_instructions.md

* Update onboarding_instructions.md

* Update onboarding_instructions.md

Updated the link

* chart update for sept2020 release (#457)

* add missing version update in the script (#458)

* November release fixes - activate one agent, adx schema v2, win perf issue, syslog deactivation (#459)

* activate one agent, adx schema v2, win perf issue, syslog deactivation

* update chart

* remove hyphen for params in chart (#462)

Merging as it's a simple fix (remove hyphen)

* Changes for cutting a new build for ciprod10272020 release (#460)

* using latest stable version of msys2 (#465)

* fixing the windows-perf-dups (#466)

* chart updates related to new microsoft/charts repo (#467)

* Changes for creating 11092020 release (#468)

* MDM exception aggregation (#470)

* grwehner/mdm custom metric regions (#471)

Remove custom metrics region check for public cloud

* updating rs limit to 1gb (#474)

* grwehner/pv inventory (#455)

Add fluentd plugin to request persistent volume info from the kubernetes api and send to LA

* Gangams/fix for build release pipeline issue (#476)

* use isolated cdpx acr

* correct comment

* add pv fluentd plugin config to helm rs config (#477)

* add pv fluentd plugin to helm rs config

* helm rbac permissions for pv api calls

* Gangams/fix rs ooming (#473)

* optimize kpi

* optimize kube node inventory

* add flags for events, deployments and hpa

* have separate function parseNodeLimits

* refactor code

* fix crash

* fix bug with service name

* fix bugs related to get service name

* update oom fix test agent

* debug logs

* fix service label issue

* update to latest agent and enable ephemeral annotation

* change stream size to 200 from 250

* update yaml

* adjust chunksizes

* add ruby gc env

* yaml changes for cioomtest11282020-3

* telemetry to track pods latency

* service count telemetry

* rename variables

* wip

* nodes inventory telemetry

* configmap changes

* add emit streams in configmap

* yaml updates

* fix copy and paste bug

* add todo comments

* fix node latency telemetry bug

* update yaml with latest test image

* fix bug

* upping rs memory change

* fix mdm bug with final emit stream

* update to latest image

* fix pr feedback

* fix pr feedback

* rename health config to agent config

* fix max allowed hpa chunk size

* update to use 1k pod chunk since validated on 1.18+

* remove debug logs

* minor updates

* move defaults to common place

* chart updates

* final oomfix agent

* update to use prod image so that can be validated with build pipeline

* fix typo in comment

* Gangams/enable arc onboarding to ff (#478)

* wip

* updates

* trigger login if the ctx cloud not same as specified cloud

* add missed commit

* Convert PV type dictionary to json for telemetry so it shows up in logs (#480)

* fix 2 windows tasks - 1) Dont log to termination log 2) enable ADX route for containerlogs in windows (for O365) (#482)

* fix ci envvar collection in large pods (#483)

* grwehner/jan agent tasks (#481)

- Windows agent fix to use log filtering settings in config map.
- Error handling for kubelet_utils get_node_capacity in case /metrics/cadvisor endpoint fails.
- Remove env variable for workspace key for windows agent

* updating fbit version and cpu limit (#485)

* reverting to older version (#487)

* Gangams/add fbsettings configurable via configmap (#486)

* wip

* fbit config settings

* add config warn message

* handle one config provided but not other

* fixed pr feedback

* fix copy paste error

* rename config parameter names

* fix typo

* fix fbit crash in helm path

* fix nil check

* Gangams/jan agent release tasks (#484)

* wip

* explicit amd64 affinity for hybrid workloads

* fix space issue

* wip

* revert vscode setting file

* remove per container logs in ci (#488)

* updates for ciprod01112021 release (#489)

* new yaml files (#491)

* Use cloud-specific instrumentation keys (#494)

If APPLICATIONINSIGHTS_AUTH_URL is set/non-empty then the agent will now grab a custom IKey from a URL stored in APPLICATIONINSIGHTS_AUTH_URL

* upgrade apt to latest version (#492)

* upgrade apt to latest version

* fix pr feedback

* Gangams/add support for extension msi for arc k8s cluster (#495)

* wip

* add env var for the arc k8s extension name

* chart update

* extension msi updates

* fix bug

* revert chart and image to prod version

* minor text changes

* image tag to prod

* wip

* wip

* wip

* wip

* final updates

* fix whitespaces

* simplify crd yaml

* Gangams/arm template arc k8s extension (#496)

* arm templates for arc k8s extension

* update to use official extension type name

* update

* add identity property

* add proxyendpointurl parameter

* add default values

* Gangams/aks monitoring via policy (#497)

* enable monitoring through policy

* wip

* handle tags

* wip

* add alias

* wip

* working

* updates

* working

* with deployment name

* doc updates

* doc updates

* fix typo in the docs

* revert to use operatingSystem from osImage for node os telemetry (#498)

* Container log v2 schema changes (#499)

* make pod name in mdsd definition as str for consistency. msgp has no type checking, as it has type metadata in it the message itself.

* Add priority class to the daemonsets (#500)

* Add priority class to the daemonsets

Add a priority class for omsagent and have the daemonsets use this
to be sure to schedule the pods.

Daemonset pods are constrained in scheduling to run on specific
nodes.  This is done by the daemonset controller.  When a node shows
up it will create a pod with a strong affinity to that node.  When a
node goes away, it will delete the pod with the node affinity to that
node.

Kubernetes pod scheduling does not know it is a daemonset but it does
know it is tied to a specific node.  With default scheduling, it is
possible for the pods to be "frozen out" of a node because the node
already is full.  This can happen because "normal" pods may already
exist and are looking for a node to get scheduled on when a node is
added to the cluster.  The daemonset controller will only first
create the pod for the node at around the same time.  The kubernetes
scheduler is running async from all of this and thus there can be a
race as to who gets scheduled on the node.

The pod priority class (and thus the pod priority) is a way to indicate
that the pod has a higher scheduling priority than a default pod.

By default, all pods are at priority 0.  Higher numbers are higher
priority.  Setting the priority to something greater than zero will
allow the omsagent daemonsets to win a race against "normal" pods for
scheduled resources on a node - and will also allow for graceful
eviction in the case the node is too full.

Without this, omsagent can be left out of node in clusters that are
very busy, especially in dynamic scaling situations.

I did not test the windows pod as we have no windows clusters.

* CR feedback

* fix node metric issue (#502)

* Bug fixes for Feb release (#504)

* bug fix for mdm metrics with no limits

* fix exception bug

* Gangams/feb 2021 agent bug fix (#505)

* fix npe in getKubeServiceRecords

* use image fields from spec

* fix typo

* cover all cases

* handle scenario only digest specified

* changes for release -ciprod02232021 (#506)

* Gangams/e2e test framework (#503)

* add agent e2e fw and tests

* doc and script updates

* add validation script

* doc updates

* yaml updates

* fix typo

* doc updates

* more doc updates

* add ISTEST for helm chart to use arc conf

* refactor test code

* fix pr feedback

* fix pr feedback

* fix pr feedback

* fix pr feedback

* scrape new kubelet pod count metric name (#508)

* Adding explicit json output to az commands as the script fails if az is configured with Table output #409 (#513)

* Gangams/arc proxy contract and token renewal updates (#511)

* fix issue with crd status updates

* handle renewal token delays

* add proxy contract

* updates for proxy cert for linux

* remove proxycert related changes

* fix whitespace issue

* fix whitespace issue

* remove proxy in arm template

* doc updates for microsoft charts repo release (#512)

* doc updates for microsoft charts repo release

* wip

* Update enable-monitoring.sh (#514)

Line 314 and 343 seems to have trailing spaces for some subscriptions which is exiting the script even for valid scenarios

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Prometheus scraping from sidecar and OSM changes (#515)

* add liveness timeout for exec (#518)

* chart and other updates (#519)

* Saaror osmdoc (#523)

* Create ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Add files via upload

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* telemetry bug fix (#527)

* Fix conflicting logrotate settings (#526)

The node and the omsagent container both have a cron.daily file to rotate certain logs daily. These settings are the same for some files in /var/log (mounted from the node with read/write access), causing the rotation to fail when both try to rotate at the same time. So then the /var/log/*.1 file is written to forever. Since these files are always written to and never rotated, it causes high memory usage on the node after a while.

This fix removes the container logrotate settings for /var/log, which the container does not write to.

* bug fix (#528)

* Gangams/arc ev2 deployment (#522)

* ev2 deployment for arc k8s extension

* fix charts path issue

* rename scripts tar

* add notifications

* fix line endings

* fix line endings

* update with prod repo

* fix file endings

* added liveness and telemetry for telegraf (#517)

* added liveness and telemetry for telegraf

* code transfer

* removed windows liveness probe

* done

* Windows metric fix (#530)

* changes

* about to remove container fix

* moved caching code to existing loop

* removed un-necessary changes

* removed a few more un-necessary changes

* added windows node check

* fixed a bug

* everything works confirmed

* OSM doc update (#533)

* Adding MDM metrics for threshold violation (#531)

* Rashmi/april agent 2021 (#538)

* add Read_from_Head config for all fluentbit tail plugins (#539)

See the commit message of: fluent/fluent-bit@70e33fa
for details explaining the fluentbit change and what Read_from_Head does when set to true.

* fix programdata mount issue on containerd win nodes (#542)

* Update sidecar mem limits  (#541)

* David/release 4 22 2021 (#544)

* updating image tag and agent version

* updated liveness probe

* updated release notes again

* fixed date in version file

* 1m, 1m, 1s by default (#543)

* 1m, 1m, 1s by default

* setting default through a different method

* David/aad stage 1 release (#556)

* update to latest omsagent, add eastus2 to mdsd regions

* copied oneagent bits to a CI repository release

* mdsd inmem mode

* yaml for cl scale test

* yaml for cl scale test

* reverting dockerProviderVersion version to 15.0.0

* prepping for release (updated image version, dockerProviderVersion, and release notes)

* container log scaletest yamls

* forgot to update image version in chart

* fixing windows tag in dockerfile, changing release notes wording

* missed windows tag in one more place

* forgot to change the windows dockerProviderVersion back

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Update ReleaseNotes.md (#558)

fix imagetag in the release notes

* Add wait time for telegraf and also force mdm egress to use tls 1.2 (#560)

* Add wait time for telegraf and also force mdm egress to use tls 1.2

* add wait for all telegraf dependencies across all containers (ds & rs)

* remove ssl change so we dont include as part of the other fix until we test with att nodes.

* partially disabled telegraf liveness probe check, we'll still have telemetry but the probe won't fail if telegraf isn't running (#561)

* changes for 05202021 release (#563)

* changes for 05202021 release

* fixed typos

* Rashmi/jedi wireserver (#566)

* Update ReadMe.md (#565)

* Update ReadMe.md

* Update ReadMe.md

Included feedback from OSM team and Fixed

* Gangams/aad stage2 full switch to mdsd (#559)

* full switch to mdsd, upgrade to ruby v1 & omsagent removal

* add odsdirect as fallback option

* cleanup

* cleanup

* move customRegion to stage3

* updates related to containerlog route

* make xml eventschema consistent

* add buffer settings

* address HTTPServerException deprecation in ruby 2.6

* update to official mdsd version

* fix log message issue

* fix pr feedback

* get rid of unused code from omscommon

* fix pr feedback

* fix pr feedback

* clean up

* clean up

* fix missing conf

* Send perf metrics to MDM from windows daemonset (#568)

* updating json gem to address CVE-2020-10663 (#567)

* updating json gem to address CVE-2020-10663

* updating json gem to address CVE-2020-10663

* update recommended alerts readme (#570)

@dcbrown16 pointed out that this page links to the wrong document in [this issue](#475). The content in the currently linked page is identical to the page which should be linked, so it's a simple fix.

* trying again to fix the json gem (#571)

* trying again to fix the json gem

* removing installation of newer json gem

* Addressing PR comments for - #568 (#569)

* Mem_Buf_limit  is configurable via ConfigMap (#574)

* add log rotation settings for fluentd logs (#577)

* Gangams/release 06112021 (#578)

* updates related to ciprod06112021 release

* minor update

* release note update (#579)

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>
Co-authored-by: bragi92 <[email protected]>
Co-authored-by: saaror <[email protected]>
Co-authored-by: Grace Wehner <[email protected]>
Co-authored-by: deagraw <[email protected]>
Co-authored-by: David Michelman <[email protected]>
Co-authored-by: Michael Sinz <[email protected]>
Co-authored-by: Nicolas Yuen <[email protected]>
Co-authored-by: seenu433 <[email protected]>
Co-authored-by: Tsubasa Nomura <[email protected]>
12 people authored Jun 11, 2021
1 parent de2fca1 commit 8500537
Showing 64 changed files with 3,251 additions and 2,281 deletions.
37 changes: 32 additions & 5 deletions Documentation/OSMPrivatePreview/ReadMe.md
@@ -1,15 +1,17 @@
Note - This is private preview. For any support issues, please reach out to us at [[email protected]](mailto:[email protected]). Please don't open a support ticket.

This private preview supports Open Service Mesh on [AKS](https://docs.microsoft.com/azure/aks/servicemesh-osm-about) & Azure [Arc on k8s](http://docs.microsoft.com/azure/azure-arc/kubernetes/tutorial-arc-enabled-osm).

# Azure Monitor Container Insights Open Service Mesh Monitoring

Azure Monitor container insights now supporting preview of [Open Service Mesh(OSM)](https://docs.microsoft.com/azure/aks/servicemesh-osm-about) Monitoring. As part of this support, customer can:
1. Filter & view inventory of all the services that are part of your service mesh.
2. Visualize and monitor requests between services in your service mesh, with request latency, error rate & resource utilization by services.
3. Provides connection summary for OSM infrastructure running on AKS.
3. Provides connection summary for OSM infrastructure running on AKS or Azure Arc for k8s.

## How to onboard Container Insights OSM monitoring?
OSM exposes Prometheus metrics which Container Insights can collect, for container insights agent to collect OSM metrics follow the following steps.

### AKS
1. Follow this [link](https://docs.microsoft.com/en-us/azure/aks/servicemesh-osm-about?pivots=client-operating-system-linux#register-the-aks-openservicemesh-preview-feature) as a prereq before enabling the addon.

2. Enable AKS OSM addon on your
@@ -27,9 +29,29 @@ osm metrics enable --namespace "test1, test2"
* Download the configmap from [here](https://github.com/microsoft/Docker-Provider/blob/ci_prod/kubernetes/container-azm-ms-osmconfig.yaml)
* Add the namespaces you want to monitor in configmap `monitor_namespaces = ["namespace1", "namespace2"]`
* Run the following kubectl command: kubectl apply -f<configmap_yaml_file.yaml>
* Example: `kubectl apply -f container-azm-ms-agentconfig.yaml`
* Example: `kubectl apply -f container-azm-ms-osmconfig.yaml`
4. The configuration change can take upto 15 mins to finish before taking effect, and all omsagent pods in the cluster will restart. The restart is a rolling restart for all omsagent pods, not all restart at the same time.

### Azure Arc for Kubernetes
This section assumes that you already have your kubernetes distribution connected via Azure Arc. If not learn more [here.](https://docs.microsoft.com/en-us/azure/azure-arc/kubernetes/quickstart-connect-cluster)

1. Install Arc enabled Open Service mesh on your Arc cluster. Learn more [here](http://docs.microsoft.com/azure/azure-arc/kubernetes/tutorial-arc-enabled-osm#install-arc-enabled-open-service-mesh-osm-on-an-arc-enabled-kubernetes-cluster)
2. Install Azure Monitor Container Insights on Arc. If not installed already. Learn more how to install [here](https://docs.microsoft.com/azure/azure-monitor/containers/container-insights-enable-arc-enabled-clusters)
3. Ensure that prometheus_scraping is set to true in the OSM configmap.
3. Ensure that the application namespaces that you wish to be monitored are onboarded to the mesh. Follow the guidance available [here.](http://docs.microsoft.com/azure/azure-arc/kubernetes/tutorial-arc-enabled-osm#onboard-namespaces-to-the-service-mesh)
4. To enable namespace(s), download the osm client library [here](https://docs.microsoft.com/en-us/azure/aks/servicemesh-osm-about?pivots=client-operating-system-linux#osm-service-quotas-and-limits-preview) & then enable metrics on namespaces
```bash
# With osm
osm metrics enable --namespace test
osm metrics enable --namespace "test1, test2"

```
4. On your Azure Monitor Container Insights for Arc.
* Download the configmap from [here](https://github.com/microsoft/Docker-Provider/blob/ci_prod/kubernetes/container-azm-ms-osmconfig.yaml)
* Add the namespaces you want to monitor in configmap `monitor_namespaces = ["namespace1", "namespace2"]`
* Run the following kubectl command: kubectl apply -f<configmap_yaml_file.yaml>
* Example: `kubectl apply -f container-azm-ms-osmconfig.yaml`
5. The configuration change can take upto 15 mins to finish before taking effect, and all omsagent pods in the cluster will restart. The restart is a rolling restart for all omsagent pods, not all restart at the same time.

## Validate the metrics flow
1. Query cluster's Log Analytics workspace InsightsMetrics table to see metrics are flowing or not
@@ -41,8 +63,9 @@ InsightsMetrics

## How to consume OSM monitoring dashboard?
1. Access your AKS cluster & Container Insights through this [link.](https://aka.ms/azmon/osmux)
2. Go to reports tab and access Open Service Mesh (OSM) workbook.
3. Select the time-range & namespace to scope your services. By default, we only show services deployed by customers and we exclude internal service communication. In case you want to view that you select Show All in the filter. Please note OSM is managed service mesh, we show all internal connections for transparency.
* For **Azure Arc for k8s**, access Container Insights through this [link.](https://aka.ms/azmon/osmarcux)
3. Go to reports tab and access Open Service Mesh (OSM) workbook.
4. Select the time-range & namespace to scope your services. By default, we only show services deployed by customers and we exclude internal service communication. In case you want to view that you select Show All in the filter. Please note OSM is managed service mesh, we show all internal connections for transparency.

![alt text](https://github.com/microsoft/Docker-Provider/blob/saarorOSMdoc/Documentation/OSMPrivatePreview/Image1.jpg)
### Requests Tab
@@ -51,6 +74,8 @@ InsightsMetrics
3. You can view total requests, request error rate & P90 latency.
4. You can drill-down to destination and view trends for HTTP error/success code, success rate, Pods resource utilization, latencies at different percentiles.

![image](https://user-images.githubusercontent.com/31900410/119195241-2e712000-ba39-11eb-8cb0-2d7d16e26d1b.png)

### Connections Tab
1. This tab provides you a summary of all the connections between your services in Open Service Mesh.
2. Outbound connections: Total number of connections between Source and destination services.
@@ -68,4 +93,6 @@ InsightsMetrics
2. When source or destination is osmcontroller we show no latency & for internal services we show no resource utilization.
3. When both prometheus scraping using pod annotations and OSM monitoring are enabled on the same set of namespaces, the default set of metrics (envoy_cluster_upstream_cx_total, envoy_cluster_upstream_cx_connect_fail, envoy_cluster_upstream_rq, envoy_cluster_upstream_rq_xx, envoy_cluster_upstream_rq_total, envoy_cluster_upstream_rq_time_bucket, envoy_cluster_upstream_cx_rx_bytes_total, envoy_cluster_upstream_cx_tx_bytes_total, envoy_cluster_upstream_cx_active) will be collected twice. You can follow [this](https://docs.microsoft.com/en-us/azure/azure-monitor/containers/container-insights-prometheus-integration#prometheus-scraping-settings) documentation to exclude these namespaces from pod annotation scraping using the setting monitor_kubernetes_pods_namespaces to work around this issue.

4. For monitoring on **Azure Arc on k8s** currently there is a separate link to access OSM workbook. We plan to have one single link to access workbook on both platforms by 10th June 2021.

This is private preview, the goal for us is to get feedback. Please feel free to reach out to us at [[email protected]](mailto:[email protected]) for any feedback and questions!
18 changes: 18 additions & 0 deletions ReleaseNotes.md
@@ -11,6 +11,24 @@ additional questions or comments.

Note : The agent version(s) below has dates (ciprod<mmddyyyy>), which indicate the agent build dates (not release dates)

### 06/11/2021 -
##### Version microsoft/oms:ciprod06112021 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod06112021 (linux)
##### Version microsoft/oms:win-ciprod06112021 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:win-ciprod06112021 (windows)
- Linux Agent
- Removal of base omsagent dependency
- Using MDSD version 1.10.1 as base agent for all the supported LA data types
- Ruby version upgrade to 2.6 i.e. same version as windows agent
- Upgrade FluentD gem version to 1.12.2
- All the Ruby Fluentd Plugins upgraded to v1 as per Fluentd guidance
- Fluent-bit tail plugin Mem_Buf_limit is configurable via ConfigMap
- Windows Agent
- CA cert changes for airgapped clouds
- Send perf metrics to MDM from windows daemonset
- FluentD gem version upgrade from 1.10.2 to 1.12.2 to make same version as Linux Agent
- Doc updates
- README updates related to OSM preview release for Arc K8s
- README updates related to recommended alerts

### 05/20/2021 -
##### Version microsoft/oms:ciprod05202021 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod05202021 (linux)
##### No Windows changes with this release, win-ciprod04222021 still current.
2 changes: 1 addition & 1 deletion alerts/recommended_alerts_ARM/README.md
@@ -24,7 +24,7 @@ Completed job count|Calculates number of jobs completed more than six hours ago.

### How to enable with a Resource Manager template
1. Download one or all of the available templates that describe how to create the alert.
2. Create and use a [parameters file](https://review.docs.microsoft.com/azure/azure-resource-manager/templates/parameter-files) as a JSON to set the values required to create the alert rule.
2. Create and use a [parameters file](https://docs.microsoft.com/en-us/azure/azure-resource-manager/templates/parameter-files) as a JSON to set the values required to create the alert rule.
3. Deploy the template from the Azure portal, PowerShell, or Azure CLI.

For step by step procedures on how to enable alerts via Resource manager, please go [here.](https://aka.ms/ci_alerts_arm)
@@ -6,6 +6,7 @@
@default_service_interval = "1"
@default_buffer_chunk_size = "1"
@default_buffer_max_size = "1"
@default_mem_buf_limit = "10"

def is_number?(value)
true if Integer(value) rescue false
@@ -19,6 +20,7 @@ def substituteFluentBitPlaceHolders
interval = ENV["FBIT_SERVICE_FLUSH_INTERVAL"]
bufferChunkSize = ENV["FBIT_TAIL_BUFFER_CHUNK_SIZE"]
bufferMaxSize = ENV["FBIT_TAIL_BUFFER_MAX_SIZE"]
memBufLimit = ENV["FBIT_TAIL_MEM_BUF_LIMIT"]

serviceInterval = (!interval.nil? && is_number?(interval) && interval.to_i > 0 ) ? interval : @default_service_interval
serviceIntervalSetting = "Flush " + serviceInterval
@@ -32,8 +34,12 @@ def substituteFluentBitPlaceHolders
tailBufferMaxSize = tailBufferChunkSize
end

tailMemBufLimit = (!memBufLimit.nil? && is_number?(memBufLimit) && memBufLimit.to_i > 10) ? memBufLimit : @default_mem_buf_limit
tailMemBufLimitSetting = "Mem_Buf_Limit " + tailMemBufLimit + "m"

text = File.read(@td_agent_bit_conf_path)
new_contents = text.gsub("${SERVICE_FLUSH_INTERVAL}", serviceIntervalSetting)
new_contents = new_contents.gsub("${TAIL_MEM_BUF_LIMIT}", tailMemBufLimitSetting)
if !tailBufferChunkSize.nil?
new_contents = new_contents.gsub("${TAIL_BUFFER_CHUNK_SIZE}", "Buffer_Chunk_Size " + tailBufferChunkSize + "m")
else
Expand Down
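The `Mem_Buf_Limit` handling added in this hunk can be exercised in isolation. The following is a minimal sketch, not the agent's actual script: `mem_buf_limit_setting`, the constant `DEFAULT_MEM_BUF_LIMIT`, and the sample template string are hypothetical names introduced for illustration, while the validation rule (numeric and greater than 10, otherwise fall back to "10") is taken from the diff above.

```ruby
# Hypothetical standalone version of the Mem_Buf_Limit logic from the diff.
DEFAULT_MEM_BUF_LIMIT = "10"

# Same check as in the source: true when the value parses as an Integer.
def is_number?(value)
  true if Integer(value) rescue false
end

# Returns the fluent-bit setting line for a raw environment value.
# Values that are nil, non-numeric, or not greater than 10 fall back to the default.
def mem_buf_limit_setting(raw)
  limit = (!raw.nil? && is_number?(raw) && raw.to_i > 10) ? raw : DEFAULT_MEM_BUF_LIMIT
  "Mem_Buf_Limit " + limit + "m"
end

# Assumed template fragment; the real td-agent-bit.conf is not shown in this diff.
template = "[INPUT]\n    Name tail\n    ${TAIL_MEM_BUF_LIMIT}\n"
puts template.gsub("${TAIL_MEM_BUF_LIMIT}", mem_buf_limit_setting(ENV["FBIT_TAIL_MEM_BUF_LIMIT"]))
```

Because the comparison is `> 10`, a configured value of exactly 10 also resolves to the default, which yields the same setting line.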
@@ -1,9 +1,16 @@
#!/usr/local/bin/ruby
# frozen_string_literal: true

require_relative "tomlrb"
# Use require_relative on Linux and require on Windows, where tomlrb is installed as a gem
@os_type = ENV["OS_TYPE"]
if !@os_type.nil? && !@os_type.empty? && @os_type.strip.casecmp("windows") == 0
require "tomlrb"
else
require_relative "tomlrb"
end

require_relative "/etc/fluent/plugin/constants"
require_relative "ConfigParseErrorLogger"
require_relative "microsoft/omsagent/plugin/constants"

@configMapMountPath = "/etc/config/settings/alertable-metrics-configuration-settings"
@configVersion = ""
Expand Down Expand Up @@ -124,6 +131,10 @@ def populateSettingValuesFromConfigMap(parsedConfig)
end
end

def get_command_windows(env_variable_name, env_variable_value)
return "[System.Environment]::SetEnvironmentVariable(\"#{env_variable_name}\", \"#{env_variable_value}\", \"Process\")" + "\n" + "[System.Environment]::SetEnvironmentVariable(\"#{env_variable_name}\", \"#{env_variable_value}\", \"Machine\")" + "\n"
end

@configSchemaVersion = ENV["AZMON_AGENT_CFG_SCHEMA_VERSION"]
puts "****************Start MDM Metrics Config Processing********************"
if !@configSchemaVersion.nil? && !@configSchemaVersion.empty? && @configSchemaVersion.strip.casecmp("v1") == 0 #note v1 is the only supported schema version, so hardcoding it
Expand All @@ -137,19 +148,37 @@ def populateSettingValuesFromConfigMap(parsedConfig)
end
end

# Write the settings to file, so that they can be set as environment variables
file = File.open("config_mdm_metrics_env_var", "w")
if !@os_type.nil? && !@os_type.empty? && @os_type.strip.casecmp("windows") == 0
# Write the settings to file, so that they can be set as environment variables in windows container
file = File.open("setmdmenv.ps1", "w")

if !file.nil?
file.write("export AZMON_ALERT_CONTAINER_CPU_THRESHOLD=#{@percentageCpuUsageThreshold}\n")
file.write("export AZMON_ALERT_CONTAINER_MEMORY_RSS_THRESHOLD=#{@percentageMemoryRssThreshold}\n")
file.write("export AZMON_ALERT_CONTAINER_MEMORY_WORKING_SET_THRESHOLD=\"#{@percentageMemoryWorkingSetThreshold}\"\n")
file.write("export AZMON_ALERT_PV_USAGE_THRESHOLD=#{@percentagePVUsageThreshold}\n")
file.write("export AZMON_ALERT_JOB_COMPLETION_TIME_THRESHOLD=#{@jobCompletionThresholdMinutes}\n")
# Close file after writing all MDM setting environment variables
file.close
puts "****************End MDM Metrics Config Processing********************"
if !file.nil?
commands = get_command_windows("AZMON_ALERT_CONTAINER_CPU_THRESHOLD", @percentageCpuUsageThreshold)
file.write(commands)
commands = get_command_windows("AZMON_ALERT_CONTAINER_MEMORY_WORKING_SET_THRESHOLD", @percentageMemoryWorkingSetThreshold)
file.write(commands)
# Close file after writing all environment variables
file.close
puts "****************End MDM Metrics Config Processing********************"
else
puts "Exception while opening file for writing MDM metric config environment variables"
puts "****************End MDM Metrics Config Processing********************"
end
else
puts "Exception while opening file for writing MDM metric config environment variables"
puts "****************End MDM Metrics Config Processing********************"
# Write the settings to file, so that they can be set as environment variables in linux container
file = File.open("config_mdm_metrics_env_var", "w")

if !file.nil?
file.write("export AZMON_ALERT_CONTAINER_CPU_THRESHOLD=#{@percentageCpuUsageThreshold}\n")
file.write("export AZMON_ALERT_CONTAINER_MEMORY_RSS_THRESHOLD=#{@percentageMemoryRssThreshold}\n")
file.write("export AZMON_ALERT_CONTAINER_MEMORY_WORKING_SET_THRESHOLD=\"#{@percentageMemoryWorkingSetThreshold}\"\n")
file.write("export AZMON_ALERT_PV_USAGE_THRESHOLD=#{@percentagePVUsageThreshold}\n")
file.write("export AZMON_ALERT_JOB_COMPLETION_TIME_THRESHOLD=#{@jobCompletionThresholdMinutes}\n")
# Close file after writing all MDM setting environment variables
file.close
puts "****************End MDM Metrics Config Processing********************"
else
puts "Exception while opening file for writing MDM metric config environment variables"
puts "****************End MDM Metrics Config Processing********************"
end
end
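To make the difference between the two branches concrete, the sketch below builds the text that would be written for each OS type without touching the file system. `get_command_windows` is copied from the diff; `get_command_linux` and `env_commands` are hypothetical helpers added only for this illustration, and the threshold value is made up.

```ruby
# Copied from the diff: emits PowerShell that sets the variable at both
# Process and Machine scope.
def get_command_windows(env_variable_name, env_variable_value)
  "[System.Environment]::SetEnvironmentVariable(\"#{env_variable_name}\", \"#{env_variable_value}\", \"Process\")\n" \
  "[System.Environment]::SetEnvironmentVariable(\"#{env_variable_name}\", \"#{env_variable_value}\", \"Machine\")\n"
end

# Hypothetical counterpart for the Linux branch, which writes shell exports.
def get_command_linux(env_variable_name, env_variable_value)
  "export #{env_variable_name}=#{env_variable_value}\n"
end

# Hypothetical helper: picks the command style using the same OS_TYPE test as the diff.
def env_commands(os_type, settings)
  windows = !os_type.nil? && !os_type.empty? && os_type.strip.casecmp("windows") == 0
  settings.map do |name, value|
    windows ? get_command_windows(name, value) : get_command_linux(name, value)
  end.join
end

settings = { "AZMON_ALERT_CONTAINER_CPU_THRESHOLD" => 95.0 } # illustrative value
puts env_commands("windows", settings)
puts env_commands(nil, settings)
```

Note that the Windows branch in the diff writes only the CPU and memory working set thresholds, whereas the Linux branch writes all five settings.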
16 changes: 11 additions & 5 deletions build/common/installer/scripts/tomlparser.rb
Expand Up @@ -25,8 +25,10 @@
@enrichContainerLogs = false
@containerLogSchemaVersion = ""
@collectAllKubeEvents = false
@containerLogsRoute = ""

@containerLogsRoute = "v2" # default for linux
if !@os_type.nil? && !@os_type.empty? && @os_type.strip.casecmp("windows") == 0
@containerLogsRoute = "v1" # default is v1 for windows until windows agent integrates windows ama
end
# Use parser to parse the configmap toml file to a ruby structure
def parseConfigMap
begin
Expand Down Expand Up @@ -162,8 +164,12 @@ def populateSettingValuesFromConfigMap(parsedConfig)
#Get container logs route setting
begin
if !parsedConfig[:log_collection_settings][:route_container_logs].nil? && !parsedConfig[:log_collection_settings][:route_container_logs][:version].nil?
@containerLogsRoute = parsedConfig[:log_collection_settings][:route_container_logs][:version]
puts "config::Using config map setting for container logs route"
if !parsedConfig[:log_collection_settings][:route_container_logs][:version].empty?
@containerLogsRoute = parsedConfig[:log_collection_settings][:route_container_logs][:version]
puts "config::Using config map setting for container logs route: #{@containerLogsRoute}"
else
puts "config::Ignoring config map settings and using default value since provided container logs route value is empty"
end
end
rescue => errorStr
ConfigParseErrorLogger.logError("Exception while reading config map settings for container logs route - #{errorStr}, using defaults, please check config map for errors")
Expand Down Expand Up @@ -256,7 +262,7 @@ def get_command_windows(env_variable_name, env_variable_value)
file.write(commands)
commands = get_command_windows('AZMON_CLUSTER_COLLECT_ALL_KUBE_EVENTS', @collectAllKubeEvents)
file.write(commands)
commands = get_command_windows('AZMON_CONTAINER_LOGS_EFFECTIVE_ROUTE', @containerLogsRoute)
commands = get_command_windows('AZMON_CONTAINER_LOGS_ROUTE', @containerLogsRoute)
file.write(commands)
commands = get_command_windows('AZMON_CONTAINER_LOG_SCHEMA_VERSION', @containerLogSchemaVersion)
file.write(commands)
Expand Down
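The route selection introduced in this file (an OS-dependent default, overridden only by a non-empty config map value) can be summarised in a small sketch. The function names `default_container_logs_route` and `resolve_container_logs_route` are hypothetical, and `Hash#dig` is used here in place of the source's chained lookups with a `rescue`; the defaults "v2" for Linux and "v1" for Windows come from the diff.

```ruby
# Hypothetical helper: default route per OS, as set in the diff.
def default_container_logs_route(os_type)
  windows = !os_type.nil? && !os_type.empty? && os_type.strip.casecmp("windows") == 0
  windows ? "v1" : "v2"
end

# Hypothetical helper: a config map value wins only when present and non-empty.
def resolve_container_logs_route(parsed_config, os_type)
  route = default_container_logs_route(os_type)
  version = parsed_config.dig(:log_collection_settings, :route_container_logs, :version)
  route = version unless version.nil? || version.empty?
  route
end

empty_setting = { log_collection_settings: { route_container_logs: { version: "" } } }
explicit_setting = { log_collection_settings: { route_container_logs: { version: "v2" } } }

puts resolve_container_logs_route(empty_setting, "windows")    # empty value keeps the Windows default
puts resolve_container_logs_route(explicit_setting, "windows") # explicit value overrides the default
```

Before this change an empty `version` string would have replaced the initial value; the added `empty?` check keeps the OS default in that case.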