add log rotation settings for fluentd logs #577

Merged (1 commit) on Jun 11, 2021

ganga1980
Contributor

By default, fluentd does not rotate its log file (see https://docs.fluentd.org/deployment/logging#log-rotation-setting), so fluentd.log grows without bound. This PR adds log rotation settings, 20MB per file and 5 generations, to cap that growth. On Windows the fluentd log file grows very slowly, roughly 500 bytes per hour, so rotation is not critical there; a task will be added to bring log rotation settings to fluentd in the Windows agent in the next agent release.
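
To make the numbers above concrete, here is a rough back-of-the-envelope sketch in Python. It is not part of the PR. It assumes that "20MB" means 20 MiB and that rotation keeps at most 5 files of about that size; the helper names are hypothetical and exist only for this illustration. The fluentd documentation linked above describes the actual settings (`--log-rotate-age` and `--log-rotate-size` on the command line, or `rotate_age` and `rotate_size` in the `<system>` log section).

```python
# Back-of-the-envelope check of the rotation numbers in the PR description.
# Assumption: "20MB" is read as 20 MiB; the description does not say which unit is meant.
ROTATE_SIZE_BYTES = 20 * 1024 * 1024
# "5 generations" from the description, read here as at most 5 files on disk (assumption).
ROTATE_GENERATIONS = 5
# "~500 bytes per hour" is the Windows growth rate quoted in the description.
WINDOWS_GROWTH_BYTES_PER_HOUR = 500


def max_log_footprint_bytes(rotate_size: int, generations: int) -> int:
    """Approximate upper bound on disk used by the rotated log files."""
    return rotate_size * generations


def hours_to_reach(size_bytes: int, growth_bytes_per_hour: float) -> float:
    """Hours needed for a log growing at a constant rate to reach size_bytes."""
    return size_bytes / growth_bytes_per_hour


if __name__ == "__main__":
    bound = max_log_footprint_bytes(ROTATE_SIZE_BYTES, ROTATE_GENERATIONS)
    print(f"Linux worst case: about {bound / (1024 * 1024):.0f} MiB of fluentd logs")

    hours = hours_to_reach(ROTATE_SIZE_BYTES, WINDOWS_GROWTH_BYTES_PER_HOUR)
    print(f"Windows: about {hours / 24 / 365:.1f} years to fill one 20 MiB file")
```

Under these assumptions the rotated logs stay at or below roughly 100 MiB, while an unrotated Windows log growing at 500 bytes per hour would take several years to reach the size of a single rotated file, which is consistent with deferring the Windows change.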

@ganga1980 ganga1980 requested a review from a team June 11, 2021 17:00
@daweim0 daweim0 merged commit 50b99ff into ci_dev Jun 11, 2021
ganga1980 added a commit that referenced this pull request Jun 11, 2021
* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

* Gangams/cluster creation scripts (#414)

* onprem k8s script

* script updates

* scripts for creating non-aks clusters

* fix minor text update

* updates

* script updates

* fix

* script updates

* fix scripts to install docker

* fix: Pin to a particular version of ltsc2019 by SHA (#427)

* enable collecting npm metrics (optionally) (#425)

* enable collecting npm metrics (optionally)

* fix default enrichment value

* fix adx

* Saaror patch 3 (#426)

* Create README.MD

Creating content for Kubecon lab

* Update README.MD

* Update README.MD

* Gangams/add containerd support to windows agent (#428)

* wip

* wip

* wip

* wip

* bug fix related to uri

* wip

* wip

* fix bug with ignore cert validation

* logic to ignore cert validation

* minor

* fix minor debug log issue

* improve log message

* debug message

* fix bug with nullorempty check

* remove debug statements

* refactor parsers

* add debug message

* clean up

* chart updates

* fix formatting issues

* Gangams/arc k8s metrics  (#413)

* cluster identity token

* wip

* fix exception

* fix exceptions

* fix exception

* fix bug

* fix bug

* minor update

* refactor the code

* more refactoring

* fix bug

* typo fix

* fix typo

* wait for 1min after token renewal request

* add proxy support for arc k8s mdm endpoint

* avoid additional get call

* minor line ending fix

* wip

* have separate log for arc k8s cluster identity

* fix bug on creating crd resource

* remove update permission since not required

* fixed some bugs

* fix pr feedback

* remove list since its not required

* fix: Reverting back to ltsc2019 tag (#429)

* more kubelet metrics (#430)

* more kubelet metrics

* celan up new config

* fix nom issue when config is empty (#432)

* support multiple docker paths when docker root is updated thru knode (#433)

* Gangams/doc and other related updates (#434)

* bring back nodeslector changes for windows agent ds

* readme updates

* chart updates for azure cluster resourceid and region

* set cluster region during onboarding for managed clusters

* wip

* fix for onboarding script

* add sp support for the login

* update help

* add sp support for powershell

* script updates for sp login

* wip

* wip

* wip

* readme updates

* update the links to use ci_prod branch

* fix links

* fix image link

* some more readme updates

* add missing serviceprincipal in ps scripts (#435)

* fix telemetry bug (#436)

* Gangams/readmeupdates non aks 09162020 (#437)

* changes for ciprod09162020 non-aks release

* fix script to handle cross sub scenario

* fix minor comment

* fix date in version file

* fix pr comments

* Gangams/fix weird conflicts (#439)

* separate build yamls for ci_prod branch (#415) (#416)

* [Merge] dev to prod for ciprod08072020 release (#424)

* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

Co-authored-by: Vishwanath
Co-authored-by: rashmichandrashekar

Co-authored-by: Vishwanath
Co-authored-by: rashmichandrashekar

* fix quote issue for the region (#441)

* fix cpucapacity/limit bug (#442)

* grwehner/pv-usage-metrics (#431)

- Send persistent volume usage and capacity metrics to LA for PVs with PVCs at the pod level; config to include or exclude kube-system namespace.
- Send PV usage percentage to MDM if over the configurable threshold.
- Add PV usage recommended alert template.

* add new custom metric regions (#444)

* add new custom metric regions

* fix commas

* add 'Terminating' state (#443)

* Gangams/sept agent release tasks (#445)

* turnoff mdm nonsupported cluster types

* enable validation of server cert for ai ruby http client

* add kubelet operations total and total error metrics

* node selector label change

* label update

* wip

* wip

* wip

* revert quotes

* grwehner/pv-collect-volume-name (#448)

Collect and send the volume name as another tag for pvUsedBytes in InsightsMetrics, so that it can be displayed in the workload workbook. Does not affect the PV MDM metric

* Changes for september agent release (#449)

Moving from v1beta1 to v1 for health CRD
Adding timer for zero filling
Adding zero filling for PV metrics

* Gangams/arc k8s related scripts, charts and doc updates (#450)

* checksum annotations

* script update for chart from mcr

* chart updates

* update chart version to match with chart release

* script updates

* latest chart updates

* version updates for chart release

* script updates

* script updates

* doc updates

* doc updates

* update comments

* fix bug in ps script

* fix bug in ps script

* minor update

* release process updates

* use consistent name across scripts

* use consistent names

* Install CA certs from wireserver (#451)

* grwehner/pv-volume-name-in-mdm (#452)

Add volume name for PV to mdm dimensions and zero fill it

* Release changes for 10052020 release (#453)

* Release changes for 10052020 release

* remove redundant kubelet metrics as part of PR feedback

* Update onboarding_instructions.md (#456)

* Update onboarding_instructions.md

Updated the documentation to reflect where to update the config map.

* Update onboarding_instructions.md

* Update onboarding_instructions.md

* Update onboarding_instructions.md

Updated the link

* chart update for sept2020 release (#457)

* add missing version update in the script (#458)

* November release fixes - activate one agent, adx schema v2, win perf issue, syslog deactivation (#459)

* activate one agent, adx schema v2, win perf issue, syslog deactivation

* update chart

* remove hiphen for params in chart (#462)

Merging as it's a simple fix (remove hyphen)

* Changes for cutting a new build for ciprod10272020 release (#460)

* using latest stable version of msys2 (#465)

* fixing the windows-perf-dups (#466)

* chart updates related to new microsoft/charts repo (#467)

* Changes for creating 11092020 release (#468)

* MDM exception aggregation (#470)

* grwehner/mdm custom metric regions (#471)

Remove custom metrics region check for public cloud

* updaitng rs limit to 1gb (#474)

* grwehner/pv inventory (#455)

Add fluentd plugin to request persistent volume info from the kubernetes api and send to LA

* Gangams/fix for build release pipeline issue (#476)

* use isolated cdpx acr

* correct comment

* add pv fluentd plugin config to helm rs config (#477)

* add pv fluentd plugin to helm rs config

* helm rbac permissions for pv api calls

* Gangams/fix rs ooming (#473)

* optimize kpi

* optimize kube node inventory

* add flags for events, deployments and hpa

* have separate function parseNodeLimits

* refactor code

* fix crash

* fix bug with service name

* fix bugs related to get service name

* update oom fix test agent

* debug logs

* fix service label issue

* update to latest agent and enable ephemeral annotation

* change stream size to 200 from 250

* update yaml

* adjust chunksizes

* add ruby gc env

* yaml changes for cioomtest11282020-3

* telemetry to track pods latency

* service count telemetry

* rename variables

* wip

* nodes inventory telemetry

* configmap changes

* add emit streams in configmap

* yaml updates

* fix copy and paste bug

* add todo comments

* fix node latency telemetry bug

* update yaml with latest test image

* fix bug

* upping rs memory change

* fix mdm bug with final emit stream

* update to latest image

* fix pr feedback

* fix pr feedback

* rename health config to agent config

* fix max allowed hpa chunk size

* update to use 1k pod chunk since validated on 1.18+

* remove debug logs

* minor updates

* move defaults to common place

* chart updates

* final oomfix agent

* update to use prod image so that can be validated with build pipeline

* fix typo in comment

* Gangams/enable arc onboarding to ff (#478)

* wip

* updates

* trigger login if the ctx cloud not same as specified cloud

* add missed commit

* Convert PV type dictionary to json for telemetry so it shows up in logs (#480)

* fix 2 windows tasks - 1) Dont log to termination log 2) enable ADX route for containerlogs in windows (for O365) (#482)

* fix ci envvar collection in large pods (#483)

* grwehner/jan agent tasks (#481)

- Windows agent fix to use log filtering settings in config map.
- Error handling for kubelet_utils get_node_capacity in case the /metrics/cadvisor endpoint fails.
- Remove env variable for workspace key for windows agent

* updating fbit version and cpu limit (#485)

* reverting to older version (#487)

* Gangams/add fbsettings configurable via configmap (#486)

* wip

* fbit config settings

* add config warn message

* handle one config provided but not other

* fixed pr feedback

* fix copy paste error

* rename config parameter names

* fix typo

* fix fbit crash in helm path

* fix nil check

* Gangams/jan agent release tasks (#484)

* wip

* explicit amd64 affinity for hybrid workloads

* fix space issue

* wip

* revert vscode setting file

* remove per container logs in ci (#488)

* updates for ciprod01112021 release (#489)

* new yaml files (#491)

* Use cloud-specific instrumentation keys (#494)

If APPLICATIONINSIGHTS_AUTH_URL is set/non-empty then the agent will now grab a custom IKey from a URL stored in APPLICATIONINSIGHTS_AUTH_URL

* upgrade apt to latest version (#492)

* upgrade apt to latest version

* fix pr feedback

* Gangams/add support for extension msi for arc k8s cluster (#495)

* wip

* add env var for the arc k8s extension name

* chart update

* extension msi updates

* fix bug

* revert chart and image to prod version

* minor text changes

* image tag to prod

* wip

* wip

* wip

* wip

* final updates

* fix whitespaces

* simplify crd yaml

* Gangams/arm template arc k8s extension (#496)

* arm templates for arc k8s extension

* update to use official extension type name

* update

* add identity property

* add proxyendpointurl parameter

* add default values

* Gangams/aks monitoring via policy (#497)

* enable monitoring through policy

* wip

* handle tags

* wip

* add alias

* wip

* working

* updates

* working

* with deployment name

* doc updates

* doc updates

* fix typo in the docs

* revert to use operatingSystem from osImage for node os telemety (#498)

* Container log v2 schema changes (#499)

* make pod name in mdsd definition a str for consistency. msgp has no type checking, as the type metadata is carried in the message itself.

* Add priority class to the daemonsets (#500)

* Add priority class to the daemonsets

Add a priority class for omsagent and have the daemonsets use this
to be sure to schedule the pods.

Daemonset pods are constrained in scheduling to run on specific
nodes.  This is done by the daemonset controller.  When a node shows
up it will create a pod with a strong affinity to that node.  When a
node goes away, it will delete the pod with the node affinity to that
node.

Kubernetes pod scheduling does not know it is a daemonset but it does
know it is tied to a specific node.  With default scheduling, it is
possible for the pods to be "frozen out" of a node because the node
already is full.  This can happen because "normal" pods may already
exist and are looking for a node to get scheduled on when a node is
added to the cluster.  The daemonset controller will only first
create the pod for the node at around the same time.  The kubernetes
scheduler is running async from all of this and thus there can be a
race as to who gets scheduled on the node.

The pod priority class (and thus the pod priority) is a way to indicate
that the pod has a higher scheduling priority than a default pod.

By default, all pods are at priority 0.  Higher numbers are higher
priority.  Setting the priority to something greater than zero will
allow the omsagent daemonsets to win a race against "normal" pods for
scheduled resources on a node - and will also allow for graceful
eviction in the case the node is too full.

Without this, omsagent can be left out of node in clusters that are
very busy, especially in dynamic scaling situations.

I did not test the windows pod as we have no windows clusters.

* CR feedback

* fix node metric issue (#502)

* Bug fixes for Feb release (#504)

* bug fix for mdm metrics with no limits

* fix exception bug

* Gangams/feb 2021 agent bug fix (#505)

* fix npe in getKubeServiceRecords

* use image fields from spec

* fix typo

* cover all cases

* handle scenario only digest specified

* changes for release -ciprod02232021 (#506)

* Gangams/e2e test framework (#503)

* add agent e2e fw and tests

* doc and script updates

* add validation script

* doc updates

* yaml updates

* fix typo

* doc updates

* more doc updates

* add ISTEST for helm chart to use arc conf

* refactor test code

* fix pr feedback

* fix pr feedback

* fix pr feedback

* fix pr feedback

* scrape new kubelet pod count metric name (#508)

* Adding explicit json output to az commands as the script fails if az is configured with Table output #409 (#513)

* Gangams/arc proxy contract and token renewal updates (#511)

* fix issue with crd status updates

* handle renewal token delays

* add proxy contract

* updates for proxy cert for linux

* remove proxycert related changes

* fix whitespace issue

* fix whitespace issue

* remove proxy in arm template

* doc updates for microsoft charts repo release (#512)

* doc updates for microsoft charts repo release

* wip

* Update enable-monitoring.sh (#514)

Lines 314 and 343 seem to have trailing spaces for some subscriptions, which causes the script to exit even in valid scenarios

Co-authored-by: Ganga Mahesh Siddem

* Prometheus scraping from sidecar and OSM changes (#515)

* add liveness timeout for exec (#518)

* chart and other updates (#519)

* Saaror osmdoc (#523)

* Create ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Add files via upload

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* telemetry bug fix (#527)

* Fix conflicting logrotate settings (#526)

The node and the omsagent container both have a cron.daily file to rotate certain logs daily. These settings are the same for some files in /var/log (mounted from the node with read/write access), causing the rotation to fail when both try to rotate at the same time. So then the /var/log/*.1 file is written to forever. Since these files are always written to and never rotated, it causes high memory usage on the node after a while.

This fix removes the container logrotate settings for /var/log, which the container does not write to.

* bug fix (#528)

* Gangams/arc ev2 deployment (#522)

* ev2 deployment for arc k8s extension

* fix charts path issue

* rename scripts tar

* add notifications

* fix line endings

* fix line endings

* update with prod repo

* fix file endings

* added liveness and telemetry for telegraf (#517)

* added liveness and telemetry for telegraf

* code transfer

* removed windows liveness probe

* done

* Windows metric fix (#530)

* changes

* about to remove container fix

* moved caching code to existing loop

* removed un-necessary changes

* removed a few more un-necessary changes

* added windows node check

* fixed a bug

* everything works confirmed

* OSM doc update (#533)

* Adding MDM metrics for threshold violation (#531)

* Rashmi/april agent 2021 (#538)

* add Read_from_Head config for all fluentbit tail plugins (#539)

See the commit message of: fluent/fluent-bit@70e33fa
for details explaining the fluentbit change and what Read_from_Head does when set to true.

* fix programdata mount issue on containerd win nodes (#542)

* Update sidecar mem limits  (#541)

* David/release 4 22 2021 (#544)

* updating image tag and agent version

* updated liveness probe

* updated release notes again

* fixed date in version file

* 1m, 1m, 1s by default (#543)

* 1m, 1m, 1s by default

* setting default through a different method

* David/aad stage 1 release (#556)

* update to latest omsagent, add eastus2 to mdsd regions

* copied oneagent bits to a CI repository release

* mdsd inmem mode

* yaml for cl scale test

* yaml for cl scale test

* reverting dockerProviderVersion version to 15.0.0

* prepping for release (updated image version, dockerProviderVersion, and release notes)

* container log scaletest yamls

* forgot to update image version in chart

* fixing windows tag in dockerfile, changing release notes wording

* missed windows tag in one more place

* forgot to change the windows dockerProviderVersion back

Co-authored-by: Ganga Mahesh Siddem

* Update ReleaseNotes.md (#558)

fix imagetag in the release notes

* Add wait time for telegraf and also force mdm egress to use tls 1.2 (#560)

* Add wait time for telegraf and also force mdm egress to use tls 1.2

* add wait for all telegraf dependencies across all containers (ds & rs)

* remove ssl change so we dont include as part of the other fix until we test with att nodes.

* partially disabled telegraf liveness probe check, we'll still have telemetry but the probe won't fail if telegraf isn't running (#561)

* changes for 05202021 release (#563)

* changes for 05202021 release

* fixed typos

* Rashmi/jedi wireserver (#566)

* Update ReadMe.md (#565)

* Update ReadMe.md

* Update ReadMe.md

Included feedback from OSM team and Fixed

* Gangams/aad stage2 full switch to mdsd (#559)

* full switch to mdsd, upgrade to ruby v1 & omsagent removal

* add odsdirect as fallback option

* cleanup

* cleanup

* move customRegion to stage3

* updates related to containerlog route

* make xml eventschema consistent

* add buffer settings

* address HTTPServerException deprecation in ruby 2.6

* update to official mdsd version

* fix log message issue

* fix pr feedback

* get ridoff unused code from omscommon

* fix pr feedback

* fix pr feedback

* clean up

* clean up

* fix missing conf

* Send perf metrics to MDM from windows daemonset (#568)

* updating json gem to address CVE-2020-10663 (#567)

* updating json gem to address CVE-2020-10663

* updating json gem to address CVE-2020-10663

* update recommended alerts readme (#570)

@dcbrown16 pointed out in [this issue](#475) that this page links to the wrong document. The content of the currently linked page is identical to the page that should be linked, so it's a simple fix.

* trying again to fix the json gem (#571)

* trying again to fix the json gem

* removing installation of newer json gem

* Addressing PR comments for - #568 (#569)

* Mem_Buf_limit  is configurable via ConfigMap (#574)

* add log rotation settings for fluentd logs (#577)

* Gangams/release 06112021 (#578)

* updates related to ciprod06112021 release

* minor update

* release note update (#579)

Co-authored-by: Vishwanath
Co-authored-by: rashmichandrashekar
Co-authored-by: bragi92
Co-authored-by: saaror
Co-authored-by: Grace Wehner
Co-authored-by: deagraw
Co-authored-by: David Michelman
Co-authored-by: Michael Sinz
Co-authored-by: Nicolas Yuen
Co-authored-by: seenu433
Co-authored-by: Tsubasa Nomura
ganga1980 added a commit that referenced this pull request Oct 8, 2021
* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

* Gangams/cluster creation scripts (#414)

* onprem k8s script

* script updates

* scripts for creating non-aks clusters

* fix minor text update

* updates

* script updates

* fix

* script updates

* fix scripts to install docker

* fix: Pin to a particular version of ltsc2019 by SHA (#427)

* enable collecting npm metrics (optionally) (#425)

* enable collecting npm metrics (optionally)

* fix default enrichment value

* fix adx

* Saaror patch 3 (#426)

* Create README.MD

Creating content for Kubecon lab

* Update README.MD

* Update README.MD

* Gangams/add containerd support to windows agent (#428)

* wip

* wip

* wip

* wip

* bug fix related to uri

* wip

* wip

* fix bug with ignore cert validation

* logic to ignore cert validation

* minor

* fix minor debug log issue

* improve log message

* debug message

* fix bug with nullorempty check

* remove debug statements

* refactor parsers

* add debug message

* clean up

* chart updates

* fix formatting issues

* Gangams/arc k8s metrics  (#413)

* cluster identity token

* wip

* fix exception

* fix exceptions

* fix exception

* fix bug

* fix bug

* minor update

* refactor the code

* more refactoring

* fix bug

* typo fix

* fix typo

* wait for 1min after token renewal request

* add proxy support for arc k8s mdm endpoint

* avoid additional get call

* minor line ending fix

* wip

* have separate log for arc k8s cluster identity

* fix bug on creating crd resource

* remove update permission since not required

* fixed some bugs

* fix pr feedback

* remove list since its not required

* fix: Reverting back to ltsc2019 tag (#429)

* more kubelet metrics (#430)

* more kubelet metrics

* celan up new config

* fix nom issue when config is empty (#432)

* support multiple docker paths when docker root is updated thru knode (#433)

* Gangams/doc and other related updates (#434)

* bring back nodeslector changes for windows agent ds

* readme updates

* chart updates for azure cluster resourceid and region

* set cluster region during onboarding for managed clusters

* wip

* fix for onboarding script

* add sp support for the login

* update help

* add sp support for powershell

* script updates for sp login

* wip

* wip

* wip

* readme updates

* update the links to use ci_prod branch

* fix links

* fix image link

* some more readme updates

* add missing serviceprincipal in ps scripts (#435)

* fix telemetry bug (#436)

* Gangams/readmeupdates non aks 09162020 (#437)

* changes for ciprod09162020 non-aks release

* fix script to handle cross sub scenario

* fix minor comment

* fix date in version file

* fix pr comments

* Gangams/fix weird conflicts (#439)

* separate build yamls for ci_prod branch (#415) (#416)

* [Merge] dev to prod for ciprod08072020 release (#424)

* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

Co-authored-by: Vishwanath
Co-authored-by: rashmichandrashekar

Co-authored-by: Vishwanath
Co-authored-by: rashmichandrashekar

* fix quote issue for the region (#441)

* fix cpucapacity/limit bug (#442)

* grwehner/pv-usage-metrics (#431)

- Send persistent volume usage and capacity metrics to LA for PVs with PVCs at the pod level; config to include or exclude kube-system namespace.
- Send PV usage percentage to MDM if over the configurable threshold.
- Add PV usage recommended alert template.

* add new custom metric regions (#444)

* add new custom metric regions

* fix commas

* add 'Terminating' state (#443)

* Gangams/sept agent release tasks (#445)

* turnoff mdm nonsupported cluster types

* enable validation of server cert for ai ruby http client

* add kubelet operations total and total error metrics

* node selector label change

* label update

* wip

* wip

* wip

* revert quotes

* grwehner/pv-collect-volume-name (#448)

Collect and send the volume name as another tag for pvUsedBytes in InsightsMetrics, so that it can be displayed in the workload workbook. Does not affect the PV MDM metric

* Changes for september agent release (#449)

Moving from v1beta1 to v1 for health CRD
Adding timer for zero filling
Adding zero filling for PV metrics

* Gangams/arc k8s related scripts, charts and doc updates (#450)

* checksum annotations

* script update for chart from mcr

* chart updates

* update chart version to match with chart release

* script updates

* latest chart updates

* version updates for chart release

* script updates

* script updates

* doc updates

* doc updates

* update comments

* fix bug in ps script

* fix bug in ps script

* minor update

* release process updates

* use consistent name across scripts

* use consistent names

* Install CA certs from wireserver (#451)

* grwehner/pv-volume-name-in-mdm (#452)

Add volume name for PV to mdm dimensions and zero fill it

* Release changes for 10052020 release (#453)

* Release changes for 10052020 release

* remove redundant kubelet metrics as part of PR feedback

* Update onboarding_instructions.md (#456)

* Update onboarding_instructions.md

Updated the documentation to reflect where to update the config map.

* Update onboarding_instructions.md

* Update onboarding_instructions.md

* Update onboarding_instructions.md

Updated the link

* chart update for sept2020 release (#457)

* add missing version update in the script (#458)

* November release fixes - activate one agent, adx schema v2, win perf issue, syslog deactivation (#459)

* activate one agent, adx schema v2, win perf issue, syslog deactivation

* update chart

* remove hiphen for params in chart (#462)

Merging as it's a simple fix (remove hyphen)

* Changes for cutting a new build for ciprod10272020 release (#460)

* using latest stable version of msys2 (#465)

* fixing the windows-perf-dups (#466)

* chart updates related to new microsoft/charts repo (#467)

* Changes for creating 11092020 release (#468)

* MDM exception aggregation (#470)

* grwehner/mdm custom metric regions (#471)

Remove custom metrics region check for public cloud

* updaitng rs limit to 1gb (#474)

* grwehner/pv inventory (#455)

Add fluentd plugin to request persistent volume info from the kubernetes api and send to LA

* Gangams/fix for build release pipeline issue (#476)

* use isolated cdpx acr

* correct comment

* add pv fluentd plugin config to helm rs config (#477)

* add pv fluentd plugin to helm rs config

* helm rbac permissions for pv api calls

* Gangams/fix rs ooming (#473)

* optimize kpi

* optimize kube node inventory

* add flags for events, deployments and hpa

* have separate function parseNodeLimits

* refactor code

* fix crash

* fix bug with service name

* fix bugs related to get service name

* update oom fix test agent

* debug logs

* fix service label issue

* update to latest agent and enable ephemeral annotation

* change stream size to 200 from 250

* update yaml

* adjust chunksizes

* add ruby gc env

* yaml changes for cioomtest11282020-3

* telemetry to track pods latency

* service count telemetry

* rename variables

* wip

* nodes inventory telemetry

* configmap changes

* add emit streams in configmap

* yaml updates

* fix copy and paste bug

* add todo comments

* fix node latency telemetry bug

* update yaml with latest test image

* fix bug

* upping rs memory change

* fix mdm bug with final emit stream

* update to latest image

* fix pr feedback

* fix pr feedback

* rename health config to agent config

* fix max allowed hpa chunk size

* update to use 1k pod chunk since validated on 1.18+

* remove debug logs

* minor updates

* move defaults to common place

* chart updates

* final oomfix agent

* update to use prod image so that can be validated with build pipeline

* fix typo in comment

* Gangams/enable arc onboarding to ff (#478)

* wip

* updates

* trigger login if the ctx cloud not same as specified cloud

* add missed commit

* Convert PV type dictionary to json for telemetry so it shows up in logs (#480)

* fix 2 windows tasks - 1) Dont log to termination log 2) enable ADX route for containerlogs in windows (for O365) (#482)

* fix ci envvar collection in large pods (#483)

* grwehner/jan agent tasks (#481)

- Windows agent fix to use log filtering settings in config map.
- Error handling for kubelet_utils get_node_capacity in case the /metrics/cadvisor endpoint fails.
- Remove env variable for workspace key for windows agent

* updating fbit version and cpu limit (#485)

* reverting to older version (#487)

* Gangams/add fbsettings configurable via configmap (#486)

* wip

* fbit config settings

* add config warn message

* handle one config provided but not other

* fixed pr feedback

* fix copy paste error

* rename config parameter names

* fix typo

* fix fbit crash in helm path

* fix nil check

* Gangams/jan agent release tasks (#484)

* wip

* explicit amd64 affinity for hybrid workloads

* fix space issue

* wip

* revert vscode setting file

* remove per container logs in ci (#488)

* updates for ciprod01112021 release (#489)

* new yaml files (#491)

* Use cloud-specific instrumentation keys (#494)

If APPLICATIONINSIGHTS_AUTH_URL is set/non-empty then the agent will now grab a custom IKey from a URL stored in APPLICATIONINSIGHTS_AUTH_URL

* upgrade apt to latest version (#492)

* upgrade apt to latest version

* fix pr feedback

* Gangams/add support for extension msi for arc k8s cluster (#495)

* wip

* add env var for the arc k8s extension name

* chart update

* extension msi updates

* fix bug

* revert chart and image to prod version

* minor text changes

* image tag to prod

* wip

* wip

* wip

* wip

* final updates

* fix whitespaces

* simplify crd yaml

* Gangams/arm template arc k8s extension (#496)

* arm templates for arc k8s extension

* update to use official extension type name

* update

* add identity property

* add proxyendpointurl parameter

* add default values

* Gangams/aks monitoring via policy (#497)

* enable monitoring through policy

* wip

* handle tags

* wip

* add alias

* wip

* working

* updates

* working

* with deployment name

* doc updates

* doc updates

* fix typo in the docs

* revert to use operatingSystem from osImage for node os telemetry (#498)

* Container log v2 schema changes (#499)

* make pod name in mdsd definition as str for consistency. msgp has no type checking, as it carries type metadata in the message itself.

* Add priority class to the daemonsets (#500)

* Add priority class to the daemonsets

Add a priority class for omsagent and have the daemonsets use this
to be sure to schedule the pods.

Daemonset pods are constrained in scheduling to run on specific
nodes.  This is done by the daemonset controller.  When a node shows
up it will create a pod with a strong affinity to that node.  When a
node goes away, it will delete the pod with the node affinity to that
node.

Kubernetes pod scheduling does not know it is a daemonset but it does
know it is tied to a specific node.  With default scheduling, it is
possible for the pods to be "frozen out" of a node because the node
already is full.  This can happen because "normal" pods may already
exist and are looking for a node to get scheduled on when a node is
added to the cluster.  The daemonset controller only creates the
pod for the new node at around that same time.  The kubernetes
scheduler is running async from all of this and thus there can be a
race as to who gets scheduled on the node.

The pod priority class (and thus the pod priority) is a way to indicate
that the pod has a higher scheduling priority than a default pod.

By default, all pods are at priority 0.  Higher numbers are higher
priority.  Setting the priority to something greater than zero will
allow the omsagent daemonsets to win a race against "normal" pods for
scheduled resources on a node - and will also allow for graceful
eviction in the case the node is too full.

Without this, omsagent can be left off a node in clusters that are
very busy, especially in dynamic scaling situations.

I did not test the windows pod as we have no windows clusters.
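
The race described above can be illustrated with a toy model. This is a sketch in Python under simplifying assumptions, not the Kubernetes scheduler's algorithm: it models a single node with one CPU budget and ignores preemption, affinity, and everything else the real scheduler considers. The pod names, CPU figures, and the priority value 1000 are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pod:
    name: str
    cpu: int           # requested CPU in millicores
    priority: int = 0  # pods default to priority 0


def schedule(pending: List[Pod], node_cpu: int) -> List[str]:
    """Place pending pods on one node, highest priority first.

    Toy model only: preemption, affinity, and other real scheduling
    constraints are deliberately left out.
    """
    placed = []
    free = node_cpu
    # sorted() is stable, so pods of equal priority keep their arrival order.
    for pod in sorted(pending, key=lambda p: p.priority, reverse=True):
        if pod.cpu <= free:
            placed.append(pod.name)
            free -= pod.cpu
    return placed


# Two "normal" pods were already waiting when the node joined the cluster.
queue = [Pod("app-1", 600), Pod("app-2", 400), Pod("omsagent", 300)]
print(schedule(queue, node_cpu=1000))   # omsagent is frozen out

queue[-1].priority = 1000               # hypothetical priority value
print(schedule(queue, node_cpu=1000))   # omsagent is now placed first
```

With equal priorities the agent pod loses to pods that arrived earlier. Once its priority is raised it is considered first, and a lower-priority pod is the one left waiting.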

* CR feedback

* fix node metric issue (#502)

* Bug fixes for Feb release (#504)

* bug fix for mdm metrics with no limits

* fix exception bug

* Gangams/feb 2021 agent bug fix (#505)

* fix npe in getKubeServiceRecords

* use image fields from spec

* fix typo

* cover all cases

* handle scenario only digest specified

* changes for release -ciprod02232021 (#506)

* Gangams/e2e test framework (#503)

* add agent e2e fw and tests

* doc and script updates

* add validation script

* doc updates

* yaml updates

* fix typo

* doc updates

* more doc updates

* add ISTEST for helm chart to use arc conf

* refactor test code

* fix pr feedback

* fix pr feedback

* fix pr feedback

* fix pr feedback

* scrape new kubelet pod count metric name (#508)

* Adding explicit json output to az commands as the script fails if az is configured with Table output #409 (#513)

* Gangams/arc proxy contract and token renewal updates (#511)

* fix issue with crd status updates

* handle renewal token delays

* add proxy contract

* updates for proxy cert for linux

* remove proxycert related changes

* fix whitespace issue

* fix whitespace issue

* remove proxy in arm template

* doc updates for microsoft charts repo release (#512)

* doc updates for microsoft charts repo release

* wip

* Update enable-monitoring.sh (#514)

Lines 314 and 343 seem to have trailing spaces for some subscriptions, which causes the script to exit even in valid scenarios

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Prometheus scraping from sidecar and OSM changes (#515)

* add liveness timeout for exec (#518)

* chart and other updates (#519)

* Saaror osmdoc (#523)

* Create ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Add files via upload

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* telemetry bug fix (#527)

* Fix conflicting logrotate settings (#526)

The node and the omsagent container both have a cron.daily file that rotates certain logs daily. These settings are the same for some files in /var/log (mounted from the node with read/write access), so rotation fails when both try to rotate at the same time. As a result, the /var/log/*.1 file is written to indefinitely. Because these files are always written to and never rotated, they cause high memory usage on the node after a while.

This fix removes the container logrotate settings for /var/log, which the container does not write to.

* bug fix (#528)

* Gangams/arc ev2 deployment (#522)

* ev2 deployment for arc k8s extension

* fix charts path issue

* rename scripts tar

* add notifications

* fix line endings

* fix line endings

* update with prod repo

* fix file endings

* added liveness and telemetry for telegraf (#517)

* added liveness and telemetry for telegraf

* code transfer

* removed windows liveness probe

* done

* Windows metric fix (#530)

* changes

* about to remove container fix

* moved caching code to existing loop

* removed un-necessary changes

* removed a few more un-necessary changes

* added windows node check

* fixed a bug

* everything works confirmed

* OSM doc update (#533)

* Adding MDM metrics for threshold violation (#531)

* Rashmi/april agent 2021 (#538)

* add Read_from_Head config for all fluentbit tail plugins (#539)

See the commit message of: fluent/fluent-bit@70e33fa
for details explaining the fluentbit change and what Read_from_Head does when set to true.

* fix programdata mount issue on containerd win nodes (#542)

* Update sidecar mem limits  (#541)

* David/release 4 22 2021 (#544)

* updating image tag and agent version

* updated liveness probe

* updated release notes again

* fixed date in version file

* 1m, 1m, 1s by default (#543)

* 1m, 1m, 1s by default

* setting default through a different method

* David/aad stage 1 release (#556)

* update to latest omsagent, add eastus2 to mdsd regions

* copied oneagent bits to a CI repository release

* mdsd inmem mode

* yaml for cl scale test

* yaml for cl scale test

* reverting dockerProviderVersion version to 15.0.0

* prepping for release (updated image version, dockerProviderVersion, and release notes)

* container log scaletest yamls

* forgot to update image version in chart

* fixing windows tag in dockerfile, changing release notes wording

* missed windows tag in one more place

* forgot to change the windows dockerProviderVersion back

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Update ReleaseNotes.md (#558)

fix imagetag in the release notes

* Add wait time for telegraf and also force mdm egress to use tls 1.2 (#560)

* Add wait time for telegraf and also force mdm egress to use tls 1.2

* add wait for all telegraf dependencies across all containers (ds & rs)

* remove ssl change so we dont include as part of the other fix until we test with att nodes.

* partially disabled telegraf liveness probe check, we'll still have telemetry but the probe won't fail if telegraf isn't running (#561)

* changes for 05202021 release (#563)

* changes for 05202021 release

* fixed typos

* Rashmi/jedi wireserver (#566)

* Update ReadMe.md (#565)

* Update ReadMe.md

* Update ReadMe.md

Included feedback from OSM team and Fixed

* Gangams/aad stage2 full switch to mdsd (#559)

* full switch to mdsd, upgrade to ruby v1 & omsagent removal

* add odsdirect as fallback option

* cleanup

* cleanup

* move customRegion to stage3

* updates related to containerlog route

* make xml eventschema consistent

* add buffer settings

* address HTTPServerException deprecation in ruby 2.6

* update to official mdsd version

* fix log message issue

* fix pr feedback

* get rid of unused code from omscommon

* fix pr feedback

* fix pr feedback

* clean up

* clean up

* fix missing conf

* Send perf metrics to MDM from windows daemonset (#568)

* updating json gem to address CVE-2020-10663 (#567)

* updating json gem to address CVE-2020-10663

* updating json gem to address CVE-2020-10663

* update recommended alerts readme (#570)

@dcbrown16 pointed out that this page links to the wrong document in [this issue](#475). The content in the currently linked page is identical to the page which should be linked, so it's a simple fix.

* trying again to fix the json gem (#571)

* trying again to fix the json gem

* removing installation of newer json gem

* Addressing PR comments for - #568 (#569)

* Mem_Buf_limit  is configurable via ConfigMap (#574)

* add log rotation settings for fluentd logs (#577)

* Gangams/release 06112021 (#578)

* updates related to ciprod06112021 release

* minor update

* release note update (#579)

* Make sidecar fluentbit chunk size configurable (#573)

* Fix vulnerabilities (#583)

* test

* test1

* test-2

* test-3

* 3

* 4

* test

* 2

* 3

* 4

* 5

* 6

* rename gem for windows

* fix

* fix

* Windows build optimization (#582)

* fix windows build failure due to msys2 version

* Fix telegraf startup issue when endpoint is unreachable (#587)

* revert fbit tail plugins defaults to std defaults (#586)

* fixed another bug (#593)

* feat: add new metrics to MDM for allocatable % calculation of cpu and memory usage (#584)

* feat: allocatable cpu and memory % metrics for MDM

* maybe

* linux is working

* windows....

* some more

* comment

* better

* syntax

* ruby

* revert omsagent.yaml

* comments

* pr feedback

* pr feedback

* testing msys2 version update

* better

* update adx sdk for perf issue (#601)

* remove md check

* Gangams/release notes update for hotfix (#596)

* release notes updates

* release notes updates for ciprod06112021-1

* Cherry picking hotfix changes to ci_dev (#605)

* release changes (#607)

* Gangams/aad stage3 msi auth (#585)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* refactor the windows agent ingestion token code

* code cleanup

* fix build errors

* code clean up

* code clean up

* code clean up

* code clean up

* more refactoring

* fix bug

* fix bug

* add debug logs

* add nil checks

* revert changes

* revert yaml change since this added in aks side

* fix pr feedback

* fix pr feedback

* refine retry code

* update mdsd env as per official build

* cleanup

* update env vars per mdsd

* update with mdsd official build

* skip cert gen & renewal in case of aad msi auth

* add nil check

* cherry windows agent nodeip issue

* fix merge issue

Co-authored-by: rashmichandrashekar <[email protected]>

* Gangams/remove chart version dependency (#589)

* remove chart version dependency

* remove unused code

* fix resource type

* fix

* handle weird cli chars

* update release process

* Gangams/july 2021 release tasks 3 (#613)

* use artifact and pipeline creds for image push

* minor update

* add vuln fix here so that pr can be merged

* remove un-used output plugin (#614)

* fix telegraf telemetry and improve fluentd liveness (#611)

* fix telegraf telemetry and improve fluentd liveness

* address identified vuln with libsystemd0

* fix exported image file extension

* Gangams/july 2021 release tasks 2 (#612)

* tail rs mdsd err logs

* configure mdsd log rotation

* log rotation for mdsd log files

* Fix out_oms.go dependency vulnerabilities (#623)

* revert libsystemd0 update (#616)

* updates for ci-prod release instructions (#619)

* cherry pick changes from ci_prod (#622)

* Support az login for passwords starting with dash ('-') (#626)

Co-authored-by: Vladimir Babichev <[email protected]>

* Gangams/add telemetry fbit settings (#628)

* add telemetry to track fbit settings

* add telemetry to track fbit settings

* check onboarding status (#629)

* Gangams/arc k8s conformance test updates (#617)

* conf test updates

* clean up

* wip

* update with mcr cidev image

* handle log path

* cleanup

* clean up

* wip

* working

* update for mcr image

* minor

* image update

* handle latency of connected cluster resource creation

* update conftest image

* upgrade golang version for windows in pipeline build and locally (#630)

* Updating a link in Readme.md (#632)

The link to the build pipelines now goes directly to our build pipelines (instead of to all github-private pipelines)

* Updating omsagent yaml to have parity with omsagent yaml file in AKS RP (#615)

* Unit test tooling (#625)

Added tooling and examples for unit tests

* run unit tests after a merge too (#634)

* flag stale PRs & issues

* Adding script to collect logs (for troubleshooting) (#636)

* added script for collecting logs

* added windows daemonset and prometheus sidecar, as well as some explanatory prints

* added kubectl describe and kubectl logs output

* changed message to make it more clear some errors are expected

* Sarah/ev2 (#640)

* ev2 artifacts for release pipeline

* update parameters reference

* add artifacts tar file

* changes to rollout and service model

* change agentimage path

* adding agentimage to artifact script

* removing charts from tarball

* change script to use blob storage

* change blob variables

* echo variables

* change blob uri

* use release id for blob prefix

* change to delete blob file

* add check for if blob storage file exists

* fix script errors

* update check for file in storage

* change true check

* comments and change storage account info to pipeline variables

* Changes for windows tar file

* PR changes

* documenting fbit tail plugin configmap settings. (#638)

* documenting fbit tail plugin configmap settings.

* Install unzip package on shell extension (#642)

* Changing installation in ev2 script (#644)

* Adjust release pipeline to use cdpx acr (#647)

* Adjust release pipeline to use cdpx acr

* Adjust release pipeline to use cdpx acr

* Update CDPX ACR path

* Add check for cdpx repo variable

* Sarah/ev2 prod (#649)

* Ev2 changes for prod

* CDPX repo naming change (#652)

* Sarah/ev2 update (#654)

* remove acr name from repo path

* add check to make sure tag does not exist in mcr repo

* change tag syntax for mcr repo check (#655)

* Gangams/optimize win livenessprobe (#653)

* livenessprobe optimization

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* Gangams/addon token adapter image tag to telemetry (#656)

* addon token adapter image tag

* addon token adapter image tag

* Sarah/ev2 helm (#658)

* Use MSI for Arc Release

* Use CIPROD_ACR AME subscription for shell extension

* remove extra line endings

* Sarah/ev2 pipeline (#661)

* testing build artifact dir changes

* add .pipelines directory and omsagent.yaml to build artifacts

* add charts directory to build artifacts (#662)

* Sarah/remove cdpx creds (#664)

* don't use cdpx acr creds from kv

* add e2etest.yaml to build output

* keep cdpx creds for now

* chart updates for rbac api version change (#660)

* chart updates for rbac api version change

* include windows ds for arc

* proxy support (for non-aks) (#665)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* initial proxy support

* merge?

* cleaning up some files which should've merged differently

* proxy should be working, but most tables don't have any data. About to merge, maybe whatever was wrong is now fixed

* linux AMA proxy works

* about to merge

* proxy support appears to be working, final mdsd build location will still change

* removing some unnecessary changes

* forgot to remove one last change

* redirected mdsd stderr to stdout instead of stdin

* addressing proxy password location comment

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>
Co-authored-by: bragi92 <[email protected]>
Co-authored-by: saaror <[email protected]>
Co-authored-by: Grace Wehner <[email protected]>
Co-authored-by: deagraw <[email protected]>
Co-authored-by: David Michelman <[email protected]>
Co-authored-by: Michael Sinz <[email protected]>
Co-authored-by: Nicolas Yuen <[email protected]>
Co-authored-by: seenu433 <[email protected]>
Co-authored-by: Tsubasa Nomura <[email protected]>
Co-authored-by: Vladimir <[email protected]>
Co-authored-by: Vladimir Babichev <[email protected]>
Co-authored-by: sarahpeiffer <[email protected]>
ganga1980 added a commit that referenced this pull request Feb 1, 2022
* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

* Gangams/cluster creation scripts (#414)

* onprem k8s script

* script updates

* scripts for creating non-aks clusters

* fix minor text update

* updates

* script updates

* fix

* script updates

* fix scripts to install docker

* fix: Pin to a particular version of ltsc2019 by SHA (#427)

* enable collecting npm metrics (optionally) (#425)

* enable collecting npm metrics (optionally)

* fix default enrichment value

* fix adx

* Saaror patch 3 (#426)

* Create README.MD

Creating content for Kubecon lab

* Update README.MD

* Update README.MD

* Gangams/add containerd support to windows agent (#428)

* wip

* wip

* wip

* wip

* bug fix related to uri

* wip

* wip

* fix bug with ignore cert validation

* logic to ignore cert validation

* minor

* fix minor debug log issue

* improve log message

* debug message

* fix bug with nullorempty check

* remove debug statements

* refactor parsers

* add debug message

* clean up

* chart updates

* fix formatting issues

* Gangams/arc k8s metrics  (#413)

* cluster identity token

* wip

* fix exception

* fix exceptions

* fix exception

* fix bug

* fix bug

* minor update

* refactor the code

* more refactoring

* fix bug

* typo fix

* fix typo

* wait for 1min after token renewal request

* add proxy support for arc k8s mdm endpoint

* avoid additional get call

* minor line ending fix

* wip

* have separate log for arc k8s cluster identity

* fix bug on creating crd resource

* remove update permission since not required

* fixed some bugs

* fix pr feedback

* remove list since its not required

* fix: Reverting back to ltsc2019 tag (#429)

* more kubelet metrics (#430)

* more kubelet metrics

* clean up new config

* fix nom issue when config is empty (#432)

* support multiple docker paths when docker root is updated thru knode (#433)

* Gangams/doc and other related updates (#434)

* bring back nodeslector changes for windows agent ds

* readme updates

* chart updates for azure cluster resourceid and region

* set cluster region during onboarding for managed clusters

* wip

* fix for onboarding script

* add sp support for the login

* update help

* add sp support for powershell

* script updates for sp login

* wip

* wip

* wip

* readme updates

* update the links to use ci_prod branch

* fix links

* fix image link

* some more readme updates

* add missing serviceprincipal in ps scripts (#435)

* fix telemetry bug (#436)

* Gangams/readmeupdates non aks 09162020 (#437)

* changes for ciprod09162020 non-aks release

* fix script to handle cross sub scenario

* fix minor comment

* fix date in version file

* fix pr comments

* Gangams/fix weird conflicts (#439)

* separate build yamls for ci_prod branch (#415) (#416)

* [Merge] dev to prod for ciprod08072020 release (#424)

* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>

* fix quote issue for the region (#441)

* fix cpucapacity/limit bug (#442)

* grwehner/pv-usage-metrics (#431)

- Send persistent volume usage and capacity metrics to LA for PVs with PVCs at the pod level; config to include or exclude kube-system namespace.
- Send PV usage percentage to MDM if over the configurable threshold.
- Add PV usage recommended alert template.

* add new custom metric regions (#444)

* add new custom metric regions

* fix commas

* add 'Terminating' state (#443)

* Gangams/sept agent release tasks (#445)

* turnoff mdm nonsupported cluster types

* enable validation of server cert for ai ruby http client

* add kubelet operations total and total error metrics

* node selector label change

* label update

* wip

* wip

* wip

* revert quotes

* grwehner/pv-collect-volume-name (#448)

Collect and send the volume name as another tag for pvUsedBytes in InsightsMetrics, so that it can be displayed in the workload workbook. Does not affect the PV MDM metric

* Changes for september agent release (#449)

Moving from v1beta1 to v1 for health CRD
Adding timer for zero filling
Adding zero filling for PV metrics

* Gangams/arc k8s related scripts, charts and doc updates (#450)

* checksum annotations

* script update for chart from mcr

* chart updates

* update chart version to match with chart release

* script updates

* latest chart updates

* version updates for chart release

* script updates

* script updates

* doc updates

* doc updates

* update comments

* fix bug in ps script

* fix bug in ps script

* minor update

* release process updates

* use consistent name across scripts

* use consistent names

* Install CA certs from wireserver (#451)

* grwehner/pv-volume-name-in-mdm (#452)

Add volume name for PV to mdm dimensions and zero fill it

* Release changes for 10052020 release (#453)

* Release changes for 10052020 release

* remove redundant kubelet metrics as part of PR feedback

* Update onboarding_instructions.md (#456)

* Update onboarding_instructions.md

Updated the documentation to reflect where to update the config map.

* Update onboarding_instructions.md

* Update onboarding_instructions.md

* Update onboarding_instructions.md

Updated the link

* chart update for sept2020 release (#457)

* add missing version update in the script (#458)

* November release fixes - activate one agent, adx schema v2, win perf issue, syslog deactivation (#459)

* activate one agent, adx schema v2, win perf issue, syslog deactivation

* update chart

* remove hyphen for params in chart (#462)

Merging as it's a simple fix (remove hyphen)

* Changes for cutting a new build for ciprod10272020 release (#460)

* using latest stable version of msys2 (#465)

* fixing the windows-perf-dups (#466)

* chart updates related to new microsoft/charts repo (#467)

* Changes for creating 11092020 release (#468)

* MDM exception aggregation (#470)

* grwehner/mdm custom metric regions (#471)

Remove custom metrics region check for public cloud

* updating rs limit to 1gb (#474)

* grwehner/pv inventory (#455)

Add fluentd plugin to request persistent volume info from the kubernetes api and send to LA

* Gangams/fix for build release pipeline issue (#476)

* use isolated cdpx acr

* correct comment

* add pv fluentd plugin config to helm rs config (#477)

* add pv fluentd plugin to helm rs config

* helm rbac permissions for pv api calls

* Gangams/fix rs ooming (#473)

* optimize kpi

* optimize kube node inventory

* add flags for events, deployments and hpa

* have separate function parseNodeLimits

* refactor code

* fix crash

* fix bug with service name

* fix bugs related to get service name

* update oom fix test agent

* debug logs

* fix service label issue

* update to latest agent and enable ephemeral annotation

* change stream size to 200 from 250

* update yaml

* adjust chunksizes

* add ruby gc env

* yaml changes for cioomtest11282020-3

* telemetry to track pods latency

* service count telemetry

* rename variables

* wip

* nodes inventory telemetry

* configmap changes

* add emit streams in configmap

* yaml updates

* fix copy and paste bug

* add todo comments

* fix node latency telemetry bug

* update yaml with latest test image

* fix bug

* upping rs memory change

* fix mdm bug with final emit stream

* update to latest image

* fix pr feedback

* fix pr feedback

* rename health config to agent config

* fix max allowed hpa chunk size

* update to use 1k pod chunk since validated on 1.18+

* remove debug logs

* minor updates

* move defaults to common place

* chart updates

* final oomfix agent

* update to use prod image so that can be validated with build pipeline

* fix typo in comment

* Gangams/enable arc onboarding to ff (#478)

* wip

* updates

* trigger login if the ctx cloud not same as specified cloud

* add missed commit

* Convert PV type dictionary to json for telemetry so it shows up in logs (#480)

* fix 2 windows tasks - 1) Don't log to termination log 2) enable ADX route for containerlogs in windows (for O365) (#482)

* fix ci envvar collection in large pods (#483)

* grwehner/jan agent tasks (#481)

- Windows agent fix to use log filtering settings in config map.
- Error handling for kubelet_utils get_node_capacity in case /metrics/cadvisor endpoint fails.
- Remove env variable for workspace key for windows agent

* updating fbit version and cpu limit (#485)

* reverting to older version (#487)

* Gangams/add fbsettings configurable via configmap (#486)

* wip

* fbit config settings

* add config warn message

* handle one config provided but not other

* fixed pr feedback

* fix copy paste error

* rename config parameter names

* fix typo

* fix fbit crash in helm path

* fix nil check

* Gangams/jan agent release tasks (#484)

* wip

* explicit amd64 affinity for hybrid workloads

* fix space issue

* wip

* revert vscode setting file

* remove per container logs in ci (#488)

* updates for ciprod01112021 release (#489)

* new yaml files (#491)

* Use cloud-specific instrumentation keys (#494)

If APPLICATIONINSIGHTS_AUTH_URL is set/non-empty then the agent will now grab a custom IKey from a URL stored in APPLICATIONINSIGHTS_AUTH_URL

* upgrade apt to latest version (#492)

* upgrade apt to latest version

* fix pr feedback

* Gangams/add support for extension msi for arc k8s cluster (#495)

* wip

* add env var for the arc k8s extension name

* chart update

* extension msi updates

* fix bug

* revert chart and image to prod version

* minor text changes

* image tag to prod

* wip

* wip

* wip

* wip

* final updates

* fix whitespaces

* simplify crd yaml

* Gangams/arm template arc k8s extension (#496)

* arm templates for arc k8s extension

* update to use official extension type name

* update

* add identity property

* add proxyendpointurl parameter

* add default values

* Gangams/aks monitoring via policy (#497)

* enable monitoring through policy

* wip

* handle tags

* wip

* add alias

* wip

* working

* updates

* working

* with deployment name

* doc updates

* doc updates

* fix typo in the docs

* revert to use operatingSystem from osImage for node os telemetry (#498)

* Container log v2 schema changes (#499)

* make pod name in mdsd definition as str for consistency. msgp has no type checking, as it carries type metadata in the message itself.

* Add priority class to the daemonsets (#500)

* Add priority class to the daemonsets

Add a priority class for omsagent and have the daemonsets use this
to be sure to schedule the pods.

Daemonset pods are constrained in scheduling to run on specific
nodes.  This is done by the daemonset controller.  When a node shows
up it will create a pod with a strong affinity to that node.  When a
node goes away, it will delete the pod with the node affinity to that
node.

Kubernetes pod scheduling does not know it is a daemonset but it does
know it is tied to a specific node.  With default scheduling, it is
possible for the pods to be "frozen out" of a node because the node
already is full.  This can happen because "normal" pods may already
exist and are looking for a node to get scheduled on when a node is
added to the cluster.  The daemonset controller only creates the
pod for the new node at around that same time.  The kubernetes
scheduler is running async from all of this and thus there can be a
race as to who gets scheduled on the node.

The pod priority class (and thus the pod priority) is a way to indicate
that the pod has a higher scheduling priority than a default pod.

By default, all pods are at priority 0.  Higher numbers are higher
priority.  Setting the priority to something greater than zero will
allow the omsagent daemonsets to win a race against "normal" pods for
scheduled resources on a node - and will also allow for graceful
eviction in the case the node is too full.

Without this, omsagent can be left off a node in clusters that are
very busy, especially in dynamic scaling situations.

I did not test the windows pod as we have no windows clusters.

* CR feedback

* fix node metric issue (#502)

* Bug fixes for Feb release (#504)

* bug fix for mdm metrics with no limits

* fix exception bug

* Gangams/feb 2021 agent bug fix (#505)

* fix npe in getKubeServiceRecords

* use image fields from spec

* fix typo

* cover all cases

* handle scenario only digest specified

* changes for release -ciprod02232021 (#506)

* Gangams/e2e test framework (#503)

* add agent e2e fw and tests

* doc and script updates

* add validation script

* doc updates

* yaml updates

* fix typo

* doc updates

* more doc updates

* add ISTEST for helm chart to use arc conf

* refactor test code

* fix pr feedback

* fix pr feedback

* fix pr feedback

* fix pr feedback

* scrape new kubelet pod count metric name (#508)

* Adding explicit json output to az commands as the script fails if az is configured with Table output #409 (#513)

* Gangams/arc proxy contract and token renewal updates (#511)

* fix issue with crd status updates

* handle renewal token delays

* add proxy contract

* updates for proxy cert for linux

* remove proxycert related changes

* fix whitespace issue

* fix whitespace issue

* remove proxy in arm template

* doc updates for microsoft charts repo release (#512)

* doc updates for microsoft charts repo release

* wip

* Update enable-monitoring.sh (#514)

Lines 314 and 343 seem to have trailing spaces for some subscriptions, which causes the script to exit even in valid scenarios

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Prometheus scraping from sidecar and OSM changes (#515)

* add liveness timeout for exec (#518)

* chart and other updates (#519)

* Saaror osmdoc (#523)

* Create ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Add files via upload

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* telemetry bug fix (#527)

* Fix conflicting logrotate settings (#526)

The node and the omsagent container both have a cron.daily file that rotates certain logs daily. For some files in /var/log (mounted from the node with read/write access) the settings are identical, so the rotation fails when both try to rotate at the same time. As a result, the /var/log/*.1 file is written to forever and never rotated, which causes high memory usage on the node after a while.

This fix removes the container logrotate settings for /var/log, which the container does not write to.

* bug fix (#528)

* Gangams/arc ev2 deployment (#522)

* ev2 deployment for arc k8s extension

* fix charts path issue

* rename scripts tar

* add notifications

* fix line endings

* fix line endings

* update with prod repo

* fix file endings

* added liveness and telemetry for telegraf (#517)

* added liveness and telemetry for telegraf

* code transfer

* removed windows liveness probe

* done

* Windows metric fix (#530)

* changes

* about to remove container fix

* moved caching code to existing loop

* removed un-necessary changes

* removed a few more un-necessary changes

* added windows node check

* fixed a bug

* everything works confirmed

* OSM doc update (#533)

* Adding MDM metrics for threshold violation (#531)

* Rashmi/april agent 2021 (#538)

* add Read_from_Head config for all fluentbit tail plugins (#539)

See the commit message of: fluent/fluent-bit@70e33fa
for details explaining the fluentbit change and what Read_from_Head does when set to true.

* fix programdata mount issue on containerd win nodes (#542)

* Update sidecar mem limits  (#541)

* David/release 4 22 2021 (#544)

* updating image tag and agent version

* updated liveness probe

* updated release notes again

* fixed date in version file

* 1m, 1m, 1s by default (#543)

* 1m, 1m, 1s by default

* setting default through a different method

* David/aad stage 1 release (#556)

* update to latest omsagent, add eastus2 to mdsd regions

* copied oneagent bits to a CI repository release

* mdsd inmem mode

* yaml for cl scale test

* yaml for cl scale test

* reverting dockerProviderVersion version to 15.0.0

* prepping for release (updated image version, dockerProviderVersion, and release notes)

* container log scaletest yamls

* forgot to update image version in chart

* fixing windows tag in dockerfile, changing release notes wording

* missed windows tag in one more place

* forgot to change the windows dockerProviderVersion back

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Update ReleaseNotes.md (#558)

fix imagetag in the release notes

* Add wait time for telegraf and also force mdm egress to use tls 1.2 (#560)

* Add wait time for telegraf and also force mdm egress to use tls 1.2

* add wait for all telegraf dependencies across all containers (ds & rs)

* remove ssl change so we don't include it as part of the other fix until we test with att nodes.

* partially disabled telegraf liveness probe check, we'll still have telemetry but the probe won't fail if telegraf isn't running (#561)

* changes for 05202021 release (#563)

* changes for 05202021 release

* fixed typos

* Rashmi/jedi wireserver (#566)

* Update ReadMe.md (#565)

* Update ReadMe.md

* Update ReadMe.md

Included feedback from OSM team and Fixed

* Gangams/aad stage2 full switch to mdsd (#559)

* full switch to mdsd, upgrade to ruby v1 & omsagent removal

* add odsdirect as fallback option

* cleanup

* cleanup

* move customRegion to stage3

* updates related to containerlog route

* make xml eventschema consistent

* add buffer settings

* address HTTPServerException deprecation in ruby 2.6

* update to official mdsd version

* fix log message issue

* fix pr feedback

* get rid of unused code from omscommon

* fix pr feedback

* fix pr feedback

* clean up

* clean up

* fix missing conf

* Send perf metrics to MDM from windows daemonset (#568)

* updating json gem to address CVE-2020-10663 (#567)

* updating json gem to address CVE-2020-10663

* updating json gem to address CVE-2020-10663

* update recommended alerts readme (#570)

@dcbrown16 pointed out that this page links to the wrong document in [this issue](#475). The content in the currently linked page is identical to the page which should be linked, so it's a simple fix.

* trying again to fix the json gem (#571)

* trying again to fix the json gem

* removing installation of newer json gem

* Addressing PR comments for - #568 (#569)

* Mem_Buf_limit  is configurable via ConfigMap (#574)

* add log rotation settings for fluentd logs (#577)

* Gangams/release 06112021 (#578)

* updates related to ciprod06112021 release

* minor update

* release note update (#579)

* Make sidecar fluentbit chunk size configurable (#573)

* Fix vulnerabilities (#583)

* test

* test1

* test-2

* test-3

* 3

* 4

* test

* 2

* 3

* 4

* 5

* 6

* rename gem for windows

* fix

* fix

* Windows build optimization (#582)

* fix windows build failure due to msys2 version

* Fix telegraf startup issue when endpoint is unreachable (#587)

* revert fbit tail plugins defaults to std defaults (#586)

* fixed another bug (#593)

* feat: add new metrics to MDM for allocatable % calculation of cpu and memory usage (#584)

* feat: allocatable cpu and memory % metrics for MDM

* maybe

* linux is working

* windows....

* some more

* comment

* better

* syntax

* ruby

* revert omsagent.yaml

* comments

* pr feedback

* pr feedback

* testing msys2 version update

* better

* update adx sdk for perf issue (#601)

* remove md check

* Gangams/release notes update for hotfix (#596)

* release notes updates

* release notes updates for ciprod06112021-1

* Cherry picking hotfix changes to ci_dev (#605)

* release changes (#607)

* Gangams/aad stage3 msi auth (#585)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* refactor the windows agent ingestion token code

* code cleanup

* fix build errors

* code clean up

* code clean up

* code clean up

* code clean up

* more refactoring

* fix bug

* fix bug

* add debug logs

* add nil checks

* revert changes

* revert yaml change since this was added on the aks side

* fix pr feedback

* fix pr feedback

* refine retry code

* update mdsd env as per official build

* cleanup

* update env vars per mdsd

* update with mdsd official build

* skip cert gen & renewal in case of aad msi auth

* add nil check

* cherry windows agent nodeip issue

* fix merge issue

Co-authored-by: rashmichandrashekar <[email protected]>

* Gangams/remove chart version dependency (#589)

* remove chart version dependency

* remove unused code

* fix resource type

* fix

* handle weird cli chars

* update release process

* Gangams/july 2021 release tasks 3 (#613)

* use artifact and pipeline creds for image push

* minor update

* add vuln fix here so that pr can be merged

* remove un-used output plugin (#614)

* fix telegraf telemetry and improve fluentd liveness (#611)

* fix telegraf telemetry and improve fluentd liveness

* address identified vuln with libsystemd0

* fix exported image file extension

* Gangams/july 2021 release tasks 2 (#612)

* tail rs mdsd err logs

* configure mdsd log rotation

* log rotation for mdsd log files

* Fix out_oms.go dependency vulnerabilities (#623)

* revert libsystemd0 update (#616)

* updates for ci-prod release instructions (#619)

* cherry pick changes from ci_prod (#622)

* Support az login for passwords starting with dash ('-') (#626)

Co-authored-by: Vladimir Babichev <[email protected]>

* Gangams/add telemetry fbit settings (#628)

* add telemetry to track fbit settings

* add telemetry to track fbit settings

* check onboarding status (#629)

* Gangams/arc k8s conformance test updates (#617)

* conf test updates

* clean up

* wip

* update with mcr cidev image

* handle log path

* cleanup

* clean up

* wip

* working

* update for mcr image

* minor

* image update

* handle latency of connected cluster resource creation

* update conftest image

* upgrade golang version for windows in pipeline build and locally (#630)

* Updating a link in Readme.md (#632)

The link to the build pipelines now goes directly to our build pipelines (instead of to all github-private pipelines)

* Updating omsagent yaml to have parity with omsagent yaml file in AKS RP (#615)

* Unit test tooling (#625)

Added tooling and examples for unit tests

* run unit tests after a merge too (#634)

* flag stale PRs & issues

* Adding script to collect logs (for troubleshooting) (#636)

* added script for collecting logs

* added windows daemonset and prometheus sidecar, as well as some explanatory prints

* added kubectl describe and kubectl logs output

* changed message to make it clearer that some errors are expected

* Sarah/ev2 (#640)

* ev2 artifacts for release pipeline

* update parameters reference

* add artifacts tar file

* changes to rollout and service model

* change agentimage path

* adding agentimage to artifact script

* removing charts from tarball

* change script to use blob storage

* change blob variables

* echo variables

* change blob uri

* use release id for blob prefix

* change to delete blob file

* add check for if blob storage file exists

* fix script errors

* update check for file in storage

* change true check

* comments and change storage account info to pipeline variables

* Changes for windows tar file

* PR changes

* documenting fbit tail plugin configmap settings. (#638)

* documenting fbit tail plugin configmap settings.

* Install unzip package on shell extension (#642)

* Changing installation in ev2 script (#644)

* Adjust release pipeline to use cdpx acr (#647)

* Adjust release pipeline to use cdpx acr

* Adjust release pipeline to use cdpx acr

* Update CDPX ACR path

* Add check for cdpx repo variable

* Sarah/ev2 prod (#649)

* Ev2 changes for prod

* CDPX repo naming change (#652)

* Sarah/ev2 update (#654)

* remove acr name from repo path

* add check to make sure tag does not exist in mcr repo

* change tag syntax for mcr repo check (#655)

* Gangams/optimize win livenessprobe (#653)

* livenessprobe optimization

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* Gangams/addon token adapter image tag to telemetry (#656)

* addon token adapter image tag

* addon token adapter image tag

* Sarah/ev2 helm (#658)

* Use MSI for Arc Release

* Use CIPROD_ACR AME subscription for shell extension

* remove extra line endings

* Sarah/ev2 pipeline (#661)

* testing build artifact dir changes

* add .pipelines directory and omsagent.yaml to build artifacts

* add charts directory to build artifacts (#662)

* Sarah/remove cdpx creds (#664)

* don't use cdpx acr creds from kv

* add e2etest.yaml to build output

* keep cdpx creds for now

* chart updates for rbac api version change (#660)

* chart updates for rbac api version change

* include windows ds for arc

* proxy support (for non-aks) (#665)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* initial proxy support

* merge?

* cleaning up some files which should've merged differently

* proxy should be working, but most tables don't have any data. About to merge, maybe whatever was wrong is now fixed

* linux AMA proxy works

* about to merge

* proxy support appears to be working, final mdsd build location will still change

* removing some unnecessary changes

* forgot to remove one last change

* redirected mdsd stderr to stdout instead of stdin

* addressing proxy password location comment

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Gangams/agent release ciprod10082021 & win-ciprod10082021 (#666)

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* use buildcommand for prod pipeline (#668)

* fixed merge issues. (#671) (#672)

* fix merge conflicts

* update with newimage tag

* changes related to mdsd version update (#673) (#674)

* Sarah/enable metrics (#675)

* add user assigned msi to yaml for pipeline

* update placeholders

* Gangams/chart updates oct2021 release (#676)

* chart updates for oct2021 release

* wip

* wip

* wip

* Gangams/msi mode mdsd crash fix (#677)

* update mdsd version which has fix for crash in msi mode

* image tag updates

* update to use extension GA api version (#679)

* Gangams/arm template msi onboarding (#659)

* wip

* wip

* working

* working

* working

* working

* working

* working

* shorten dcr prefix to DCR- to handle default workspace name length

* use MSCI- prefix similar to MSVMI- for dcr

* Gangams/conf test updates to handle sidecar (#681)

* wip

* test updates

* fix pr feedback

* fix pr feedback

* Fix scan break due to latest trivy changes

* Anjohans/configurable database name (#663)

* First cut at an implementation

* Reverting a change

* Moving a few lines to better align with cluster URI config

* Moving a few lines to better align with cluster URI config

* Adding an extra check that won't hurt

* Getting ADX database name from config rather than from secret

* Reverse the mangling done by editor

* Fixes to the code for reading the db name setting

* More fixes to the rb code for settings

* Tweaked and tested

* Code review

* Review follow-up

* Remove whitespace

* Gangams/troubelshooting script for arc k8s (#682)

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* doc updates

* doc updates

* wip

* wip

* update repo for issues

* fix minor one

* Sarah/remove cdpx creds (#685)

* remove download of cdpx creds

* fix: subtract number instead of string + update fluentd version 1.14.2 to fix security vulnerability (#686)

* fix: change default value to a number so that subtraction happens correctly

* update fluentd version to 1.14.2

* extra end statement

* safely set to float

* big decimal precision

* revert omsagent

* keep telemetry

* Faster Linux builds (part 1) (#687)

* moved docker image arg later on to enable docker build caching

* fixing image tag (doh)

* Sarah/fluentbit windows log (#688)

* upgrade fluentbit version for windows

* saving progress--fluent bit log tailing working for windows

* use configmap values for fluent-bit.conf where necessary and make necessary files common

* revert certificategenerator

* remove tomlparser-agent-config from linux folder

* clean up fluent.conf

* clean up fluent-bit.conf

* revert image tag

* fix agent tag

* make fluent bit flush interval configurable

* clean up unnecessary conf files

* remove unnecessary parts of fluent and fluent-bit conf

* log level back to info

* add fbit env variables for omsagent-win

* moving db files to var directory

* default to port 10250 & containerd for linux agent (#699)

* default to port 10250 & containerd

* fix pr feedback

* Updating pod annotation for latest agent version (#697)

* fix windows build failure due to msys2 version (#700)

* fix windows build failure due to msys2 version

* 20211130.0.0

* Jan agent tasks (#698)

* remove v1 fallback hidden option (#705)

* collect telemetry containerlog records with emptystamp (#703)

* collect telemetry containerlog records with emptystamp

* collect telemetry containerlog records with emptystamp

* Fixing telegraf bug for placeholder name (#706)

* Gangams/jan 2022 release tasks 3 (#702)

* add telemetry related to windows containers records

* add telemetry related to windows containers records

* containercount telemetry

* add explicit exit code in ps scripts

* node count telemetry

* telemetry for win cirecord 64KB or more

* metric to track wintelegraf metrics with tags 64kb

* metric to track wintelegraf metrics with tags 64kb

* fix pr feedback

* Gangams/jan 2022 release tasks 2 (#701)

* mdsd proc cpu and memory telemetry

* write ai logs to file and telemetry for mdsd proc

* write ai logs to file and telemetry for mdsd proc

* write ai logs to file and telemetry for mdsd proc

* fix pr feedback

* use name_prefix

* remove mdsd telemetry changes

* remove mdsd telemetry changes

* remove mdsd telemetry changes

* release updates for ciprod01312022 & win-ciprod01312022release (#707)

* release updates for ciprod01312022 release

* release updates for ciprod01312022 release

* fix pr feedback

* fix merge issue

* fix logger exception

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>
Co-authored-by: bragi92 <[email protected]>
Co-authored-by: saaror <[email protected]>
Co-authored-by: Grace Wehner <[email protected]>
Co-authored-by: deagraw <[email protected]>
Co-authored-by: David Michelman <[email protected]>
Co-authored-by: Michael Sinz <[email protected]>
Co-authored-by: Nicolas Yuen <[email protected]>
Co-authored-by: seenu433 <[email protected]>
Co-authored-by: Tsubasa Nomura <[email protected]>
Co-authored-by: Vladimir <[email protected]>
Co-authored-by: Vladimir Babichev <[email protected]>
Co-authored-by: sarahpeiffer <[email protected]>
Co-authored-by: Anders Johansen <[email protected]>
pfrcks added a commit that referenced this pull request Mar 19, 2022
* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

* Gangams/cluster creation scripts (#414)

* onprem k8s script

* script updates

* scripts for creating non-aks clusters

* fix minor text update

* updates

* script updates

* fix

* script updates

* fix scripts to install docker

* fix: Pin to a particular version of ltsc2019 by SHA (#427)

* enable collecting npm metrics (optionally) (#425)

* enable collecting npm metrics (optionally)

* fix default enrichment value

* fix adx

* Saaror patch 3 (#426)

* Create README.MD

Creating content for Kubecon lab

* Update README.MD

* Update README.MD

* Gangams/add containerd support to windows agent (#428)

* wip

* wip

* wip

* wip

* bug fix related to uri

* wip

* wip

* fix bug with ignore cert validation

* logic to ignore cert validation

* minor

* fix minor debug log issue

* improve log message

* debug message

* fix bug with nullorempty check

* remove debug statements

* refactor parsers

* add debug message

* clean up

* chart updates

* fix formatting issues

* Gangams/arc k8s metrics  (#413)

* cluster identity token

* wip

* fix exception

* fix exceptions

* fix exception

* fix bug

* fix bug

* minor update

* refactor the code

* more refactoring

* fix bug

* typo fix

* fix typo

* wait for 1min after token renewal request

* add proxy support for arc k8s mdm endpoint

* avoid additional get call

* minor line ending fix

* wip

* have separate log for arc k8s cluster identity

* fix bug on creating crd resource

* remove update permission since not required

* fixed some bugs

* fix pr feedback

* remove list since its not required

* fix: Reverting back to ltsc2019 tag (#429)

* more kubelet metrics (#430)

* more kubelet metrics

* clean up new config

* fix nom issue when config is empty (#432)

* support multiple docker paths when docker root is updated thru knode (#433)

* Gangams/doc and other related updates (#434)

* bring back nodeslector changes for windows agent ds

* readme updates

* chart updates for azure cluster resourceid and region

* set cluster region during onboarding for managed clusters

* wip

* fix for onboarding script

* add sp support for the login

* update help

* add sp support for powershell

* script updates for sp login

* wip

* wip

* wip

* readme updates

* update the links to use ci_prod branch

* fix links

* fix image link

* some more readme updates

* add missing serviceprincipal in ps scripts (#435)

* fix telemetry bug (#436)

* Gangams/readmeupdates non aks 09162020 (#437)

* changes for ciprod09162020 non-aks release

* fix script to handle cross sub scenario

* fix minor comment

* fix date in version file

* fix pr comments

* Gangams/fix weird conflicts (#439)

* separate build yamls for ci_prod branch (#415) (#416)

* [Merge] dev to prod for ciprod08072020 release (#424)

* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>

* fix quote issue for the region (#441)

* fix cpucapacity/limit bug (#442)

* grwehner/pv-usage-metrics (#431)

- Send persistent volume usage and capacity metrics to LA for PVs with PVCs at the pod level; config to include or exclude kube-system namespace.
- Send PV usage percentage to MDM if over the configurable threshold.
- Add PV usage recommended alert template.
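
The threshold check in the second bullet can be sketched roughly as follows. This is a minimal Python illustration, not the agent's own code; the function names and the 60% default threshold are assumptions made for the example.

```python
def pv_usage_percent(used_bytes: int, capacity_bytes: int) -> float:
    """Return PV usage as a percentage of capacity (0.0 when capacity is unknown)."""
    if capacity_bytes <= 0:
        return 0.0
    return 100.0 * used_bytes / capacity_bytes


def should_emit_to_mdm(used_bytes: int, capacity_bytes: int,
                       threshold_percent: float = 60.0) -> bool:
    """True when usage is over the configurable threshold.

    The 60.0 default is a hypothetical value chosen for illustration only.
    """
    return pv_usage_percent(used_bytes, capacity_bytes) > threshold_percent
```

A caller would evaluate `should_emit_to_mdm(used, capacity, threshold)` per volume and skip the MDM record when it returns `False`.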

* add new custom metric regions (#444)

* add new custom metric regions

* fix commas

* add 'Terminating' state (#443)

* Gangams/sept agent release tasks (#445)

* turnoff mdm nonsupported cluster types

* enable validation of server cert for ai ruby http client

* add kubelet operations total and total error metrics

* node selector label change

* label update

* wip

* wip

* wip

* revert quotes

* grwehner/pv-collect-volume-name (#448)

Collect and send the volume name as another tag for pvUsedBytes in InsightsMetrics, so that it can be displayed in the workload workbook. Does not affect the PV MDM metric

* Changes for september agent release (#449)

Moving from v1beta1 to v1 for health CRD
Adding timer for zero filling
Adding zero filling for PV metrics

* Gangams/arc k8s related scripts, charts and doc updates (#450)

* checksum annotations

* script update for chart from mcr

* chart updates

* update chart version to match with chart release

* script updates

* latest chart updates

* version updates for chart release

* script updates

* script updates

* doc updates

* doc updates

* update comments

* fix bug in ps script

* fix bug in ps script

* minor update

* release process updates

* use consistent name across scripts

* use consistent names

* Install CA certs from wireserver (#451)

* grwehner/pv-volume-name-in-mdm (#452)

Add volume name for PV to mdm dimensions and zero fill it

* Release changes for 10052020 release (#453)

* Release changes for 10052020 release

* remove redundant kubelet metrics as part of PR feedback

* Update onboarding_instructions.md (#456)

* Update onboarding_instructions.md

Updated the documentation to reflect where to update the config map.

* Update onboarding_instructions.md

* Update onboarding_instructions.md

* Update onboarding_instructions.md

Updated the link

* chart update for sept2020 release (#457)

* add missing version update in the script (#458)

* November release fixes - activate one agent, adx schema v2, win perf issue, syslog deactivation (#459)

* activate one agent, adx schema v2, win perf issue, syslog deactivation

* update chart

* remove hyphen for params in chart (#462)

Merging as it's a simple fix (remove hyphen)

* Changes for cutting a new build for ciprod10272020 release (#460)

* using latest stable version of msys2 (#465)

* fixing the windows-perf-dups (#466)

* chart updates related to new microsoft/charts repo (#467)

* Changes for creating 11092020 release (#468)

* MDM exception aggregation (#470)

* grwehner/mdm custom metric regions (#471)

Remove custom metrics region check for public cloud

* updating rs limit to 1gb (#474)

* grwehner/pv inventory (#455)

Add fluentd plugin to request persistent volume info from the kubernetes api and send to LA

* Gangams/fix for build release pipeline issue (#476)

* use isolated cdpx acr

* correct comment

* add pv fluentd plugin config to helm rs config (#477)

* add pv fluentd plugin to helm rs config

* helm rbac permissions for pv api calls

* Gangams/fix rs ooming (#473)

* optimize kpi

* optimize kube node inventory

* add flags for events, deployments and hpa

* have separate function parseNodeLimits

* refactor code

* fix crash

* fix bug with service name

* fix bugs related to get service name

* update oom fix test agent

* debug logs

* fix service label issue

* update to latest agent and enable ephemeral annotation

* change stream size to 200 from 250

* update yaml

* adjust chunksizes

* add ruby gc env

* yaml changes for cioomtest11282020-3

* telemetry to track pods latency

* service count telemetry

* rename variables

* wip

* nodes inventory telemetry

* configmap changes

* add emit streams in configmap

* yaml updates

* fix copy and paste bug

* add todo comments

* fix node latency telemetry bug

* update yaml with latest test image

* fix bug

* upping rs memory change

* fix mdm bug with final emit stream

* update to latest image

* fix pr feedback

* fix pr feedback

* rename health config to agent config

* fix max allowed hpa chunk size

* update to use 1k pod chunk since validated on 1.18+

* remove debug logs

* minor updates

* move defaults to common place

* chart updates

* final oomfix agent

* update to use prod image so that can be validated with build pipeline

* fix typo in comment

* Gangams/enable arc onboarding to ff (#478)

* wip

* updates

* trigger login if the ctx cloud not same as specified cloud

* add missed commit

* Convert PV type dictionary to json for telemetry so it shows up in logs (#480)

* fix 2 windows tasks - 1) Don't log to termination log 2) enable ADX route for containerlogs in windows (for O365) (#482)

* fix ci envvar collection in large pods (#483)

* grwehner/jan agent tasks (#481)

- Windows agent fix to use log filtering settings in config map.
- Error handling for kubelet_utils get_node_capacity in case the /metrics/cadvisor endpoint fails.
- Remove env variable for workspace key for windows agent

* updating fbit version and cpu limit (#485)

* reverting to older version (#487)

* Gangams/add fbsettings configurable via configmap (#486)

* wip

* fbit config settings

* add config warn message

* handle one config provided but not other

* fixed pr feedback

* fix copy paste error

* rename config parameter names

* fix typo

* fix fbit crash in helm path

* fix nil check

* Gangams/jan agent release tasks (#484)

* wip

* explicit amd64 affinity for hybrid workloads

* fix space issue

* wip

* revert vscode setting file

* remove per container logs in ci (#488)

* updates for ciprod01112021 release (#489)

* new yaml files (#491)

* Use cloud-specific instrumentation keys (#494)

If APPLICATIONINSIGHTS_AUTH_URL is set/non-empty then the agent will now grab a custom IKey from a URL stored in APPLICATIONINSIGHTS_AUTH_URL
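
The lookup described above might look something like the sketch below. It is written in Python purely for illustration; the default key value, the function names, and the assumption that the URL returns the bare key as plain text are all hypothetical and not taken from the agent.

```python
import os
from typing import Callable, Mapping
from urllib.request import urlopen

# Hypothetical placeholder, not a real instrumentation key.
DEFAULT_IKEY = "00000000-0000-0000-0000-000000000000"


def fetch_text(url: str) -> str:
    """Fetch the response body as text (assumes the body is just the key)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8").strip()


def resolve_ikey(env: Mapping[str, str] = os.environ,
                 fetch: Callable[[str], str] = fetch_text) -> str:
    """Use the custom key when APPLICATIONINSIGHTS_AUTH_URL is set and non-empty."""
    url = env.get("APPLICATIONINSIGHTS_AUTH_URL", "").strip()
    if not url:
        return DEFAULT_IKEY
    return fetch(url)
```

Passing the fetcher as a parameter keeps the environment check testable without network access.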

* upgrade apt to latest version (#492)

* upgrade apt to latest version

* fix pr feedback

* Gangams/add support for extension msi for arc k8s cluster (#495)

* wip

* add env var for the arc k8s extension name

* chart update

* extension msi updates

* fix bug

* revert chart and image to prod version

* minor text changes

* image tag to prod

* wip

* wip

* wip

* wip

* final updates

* fix whitespaces

* simplify crd yaml

* Gangams/arm template arc k8s extension (#496)

* arm templates for arc k8s extension

* update to use official extension type name

* update

* add identity property

* add proxyendpointurl parameter

* add default values

* Gangams/aks monitoring via policy (#497)

* enable monitoring through policy

* wip

* handle tags

* wip

* add alias

* wip

* working

* updates

* working

* with deployment name

* doc updates

* doc updates

* fix typo in the docs

* revert to use operatingSystem from osImage for node os telemetry (#498)

* Container log v2 schema changes (#499)

* make the pod name in the mdsd definition a str for consistency. msgp has no type checking, as it carries type metadata in the message itself.

* Add priority class to the daemonsets (#500)

* Add priority class to the daemonsets

Add a priority class for omsagent and have the daemonsets use this
to be sure to schedule the pods.

Daemonset pods are constrained in scheduling to run on specific
nodes.  This is done by the daemonset controller.  When a node shows
up it will create a pod with a strong affinity to that node.  When a
node goes away, it will delete the pod with the node affinity to that
node.

Kubernetes pod scheduling does not know it is a daemonset but it does
know it is tied to a specific node.  With default scheduling, it is
possible for the pods to be "frozen out" of a node because the node
already is full.  This can happen because "normal" pods may already
exist and are looking for a node to get scheduled on when a node is
added to the cluster.  The daemonset controller will only first
create the pod for the node at around the same time.  The kubernetes
scheduler is running async from all of this and thus there can be a
race as to who gets scheduled on the node.

The pod priority class (and thus the pod priority) is a way to indicate
that the pod has a higher scheduling priority than a default pod.

By default, all pods are at priority 0.  Higher numbers are higher
priority.  Setting the priority to something greater than zero will
allow the omsagent daemonsets to win a race against "normal" pods for
scheduled resources on a node - and will also allow for graceful
eviction in the case the node is too full.

Without this, omsagent can be left off nodes in clusters that are
very busy, especially in dynamic scaling situations.

I did not test the windows pod as we have no windows clusters.

* CR feedback
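
As a rough illustration of the priority-class change described in #500, the Python sketch below builds a `PriorityClass` manifest and attaches it to a DaemonSet pod spec via `priorityClassName`. The class name, the value of 1000, and the helper names are assumptions for the example; the actual manifests in the repository may differ.

```python
import copy

# Hypothetical class name used only for this example.
PRIORITY_CLASS_NAME = "omsagent-high-priority"


def priority_class_manifest(name: str = PRIORITY_CLASS_NAME,
                            value: int = 1000) -> dict:
    """Build a PriorityClass manifest; any value above 0 outranks default pods."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,            # assumed value, chosen only to be greater than 0
        "globalDefault": False,    # do not change the priority of other pods
        "description": "Schedule monitoring daemonset pods ahead of default pods.",
    }


def with_priority_class(daemonset: dict,
                        name: str = PRIORITY_CLASS_NAME) -> dict:
    """Return a copy of a DaemonSet manifest whose pod spec references the class."""
    updated = copy.deepcopy(daemonset)
    pod_spec = (updated.setdefault("spec", {})
                       .setdefault("template", {})
                       .setdefault("spec", {}))
    pod_spec["priorityClassName"] = name
    return updated
```

Serializing either dict with `json.dumps` gives a document that `kubectl apply -f` accepts, since Kubernetes manifests may be JSON as well as YAML.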

* fix node metric issue (#502)

* Bug fixes for Feb release (#504)

* bug fix for mdm metrics with no limits

* fix exception bug

* Gangams/feb 2021 agent bug fix (#505)

* fix npe in getKubeServiceRecords

* use image fields from spec

* fix typo

* cover all cases

* handle scenario only digest specified

* changes for release -ciprod02232021 (#506)

* Gangams/e2e test framework (#503)

* add agent e2e fw and tests

* doc and script updates

* add validation script

* doc updates

* yaml updates

* fix typo

* doc updates

* more doc updates

* add ISTEST for helm chart to use arc conf

* refactor test code

* fix pr feedback

* fix pr feedback

* fix pr feedback

* fix pr feedback

* scrape new kubelet pod count metric name (#508)

* Adding explicit json output to az commands as the script fails if az is configured with Table output #409 (#513)

* Gangams/arc proxy contract and token renewal updates (#511)

* fix issue with crd status updates

* handle renewal token delays

* add proxy contract

* updates for proxy cert for linux

* remove proxycert related changes

* fix whitespace issue

* fix whitespace issue

* remove proxy in arm template

* doc updates for microsoft charts repo release (#512)

* doc updates for microsoft charts repo release

* wip

* Update enable-monitoring.sh (#514)

Lines 314 and 343 seem to have trailing spaces for some subscriptions, which causes the script to exit even in valid scenarios.

Co-authored-by: Ganga Mahesh Siddem

* Prometheus scraping from sidecar and OSM changes (#515)

* add liveness timeout for exec (#518)

* chart and other updates (#519)

* Saaror osmdoc (#523)

* Create ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Add files via upload

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* telemetry bug fix (#527)

* Fix conflicting logrotate settings (#526)

The node and the omsagent container each have a cron.daily file that rotates certain logs daily. For some files in /var/log (mounted from the node with read/write access) the settings are identical, so rotation fails when both try to rotate at the same time. The /var/log/*.1 file is then written to indefinitely. Because these files are always written to and never rotated, they eventually cause high memory usage on the node.

This fix removes the container logrotate settings for /var/log, which the container does not write to.
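
The conflict can be pictured as two rotation configurations claiming the same files. The sketch below is illustrative only: `doubly_rotated`, the glob patterns and the `/var/opt/example-agent` path are hypothetical and are not taken from the agent's actual logrotate settings.

```python
from fnmatch import fnmatchcase


def doubly_rotated(paths, node_patterns, container_patterns):
    """Return the paths that both rotation configs would try to rotate."""
    def claimed(path, patterns):
        return any(fnmatchcase(path, pattern) for pattern in patterns)

    return [p for p in paths
            if claimed(p, node_patterns) and claimed(p, container_patterns)]


paths = [
    "/var/log/syslog",
    "/var/log/kern.log",
    "/var/opt/example-agent/log/agent.log",  # hypothetical container-owned log
]
node_patterns = ["/var/log/*"]

# Before the fix: the container config also covers /var/log.
container_before = ["/var/log/*", "/var/opt/example-agent/log/*.log"]
print(doubly_rotated(paths, node_patterns, container_before))
# → ['/var/log/syslog', '/var/log/kern.log']

# After the fix: the container only rotates logs it writes itself.
container_after = ["/var/opt/example-agent/log/*.log"]
print(doubly_rotated(paths, node_patterns, container_after))  # → []
```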

* bug fix (#528)

* Gangams/arc ev2 deployment (#522)

* ev2 deployment for arc k8s extension

* fix charts path issue

* rename scripts tar

* add notifications

* fix line endings

* fix line endings

* update with prod repo

* fix file endings

* added liveness and telemetry for telegraf (#517)

* added liveness and telemetry for telegraf

* code transfer

* removed windows liveness probe

* done

* Windows metric fix (#530)

* changes

* about to remove container fix

* moved caching code to existing loop

* removed un-necessary changes

* removed a few more un-necessary changes

* added windows node check

* fixed a bug

* everything works confirmed

* OSM doc update (#533)

* Adding MDM metrics for threshold violation (#531)

* Rashmi/april agent 2021 (#538)

* add Read_from_Head config for all fluentbit tail plugins (#539)

See the commit message of: fluent/fluent-bit@70e33fa
for details explaining the fluentbit change and what Read_from_Head does when set to true.
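
As a rough illustration of what a read-from-head option means for a tailing reader, the sketch below chooses the starting offset for a newly discovered file. It is not Fluent Bit code: `initial_offset`, `read_new_lines` and the `saved_offset` parameter (standing in for a persisted position) are hypothetical names, and the behaviour shown is an assumption based on the option's name and the linked commit.

```python
import os
import tempfile


def initial_offset(path, read_from_head, saved_offset=None):
    """Pick where to start reading a newly discovered log file."""
    if saved_offset is not None:
        return saved_offset           # resume from a remembered position
    if read_from_head:
        return 0                      # include content that already exists
    return os.path.getsize(path)      # classic tail: only new content


def read_new_lines(path, offset):
    """Return (lines, new_offset) for everything written after offset."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    return data.decode("utf-8").splitlines(), offset + len(data)


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a\nb\n")              # content present before tailing starts
    log_path = tmp.name

print(read_new_lines(log_path, initial_offset(log_path, True)))   # → (['a', 'b'], 4)
print(read_new_lines(log_path, initial_offset(log_path, False)))  # → ([], 4)
os.unlink(log_path)
```

With the option off, lines written before the reader first saw the file are skipped, which matters for short-lived containers whose logs are complete before the file is discovered.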

* fix programdata mount issue on containerd win nodes (#542)

* Update sidecar mem limits  (#541)

* David/release 4 22 2021 (#544)

* updating image tag and agent version

* updated liveness probe

* updated release notes again

* fixed date in version file

* 1m, 1m, 1s by default (#543)

* 1m, 1m, 1s by default

* setting default through a different method

* David/aad stage 1 release (#556)

* update to latest omsagent, add eastus2 to mdsd regions

* copied oneagent bits to a CI repository release

* mdsd inmem mode

* yaml for cl scale test

* yaml for cl scale test

* reverting dockerProviderVersion version to 15.0.0

* prepping for release (updated image version, dockerProviderVersion, and release notes)

* container log scaletest yamls

* forgot to update image version in chart

* fixing windows tag in dockerfile, changing release notes wording

* missed windows tag in one more place

* forgot to change the windows dockerProviderVersion back

Co-authored-by: Ganga Mahesh Siddem

* Update ReleaseNotes.md (#558)

fix imagetag in the release notes

* Add wait time for telegraf and also force mdm egress to use tls 1.2 (#560)

* Add wait time for telegraf and also force mdm egress to use tls 1.2

* add wait for all telegraf dependencies across all containers (ds & rs)

* remove ssl change so we don't include it as part of the other fix until we test with att nodes.

* partially disabled telegraf liveness probe check, we'll still have telemetry but the probe won't fail if telegraf isn't running (#561)

* changes for 05202021 release (#563)

* changes for 05202021 release

* fixed typos

* Rashmi/jedi wireserver (#566)

* Update ReadMe.md (#565)

* Update ReadMe.md

* Update ReadMe.md

Included feedback from OSM team and Fixed

* Gangams/aad stage2 full switch to mdsd (#559)

* full switch to mdsd, upgrade to ruby v1 & omsagent removal

* add odsdirect as fallback option

* cleanup

* cleanup

* move customRegion to stage3

* updates related to containerlog route

* make xml eventschema consistent

* add buffer settings

* address HTTPServerException deprecation in ruby 2.6

* update to official mdsd version

* fix log message issue

* fix pr feedback

* get rid of unused code from omscommon

* fix pr feedback

* fix pr feedback

* clean up

* clean up

* fix missing conf

* Send perf metrics to MDM from windows daemonset (#568)

* updating json gem to address CVE-2020-10663 (#567)

* updating json gem to address CVE-2020-10663

* updating json gem to address CVE-2020-10663

* update recommended alerts readme (#570)

@dcbrown16 pointed out that this page links to the wrong document in [this issue](#475). The content of the currently linked page is identical to that of the page which should be linked, so it's a simple fix.

* trying again to fix the json gem (#571)

* trying again to fix the json gem

* removing installation of newer json gem

* Addressing PR comments for - #568 (#569)

* Mem_Buf_limit  is configurable via ConfigMap (#574)

* add log rotation settings for fluentd logs (#577)

* Gangams/release 06112021 (#578)

* updates related to ciprod06112021 release

* minor update

* release note update (#579)

* Make sidecar fluentbit chunk size configurable (#573)

* Fix vulnerabilities (#583)

* test

* test1

* test-2

* test-3

* 3

* 4

* test

* 2

* 3

* 4

* 5

* 6

* rename gem for windows

* fix

* fix

* Windows build optimization (#582)

* fix windows build failure due to msys2 version

* Fix telegraf startup issue when endpoint is unreachable (#587)

* revert fbit tail plugins defaults to std defaults (#586)

* fixed another bug (#593)

* feat: add new metrics to MDM for allocatable % calculation of cpu and memory usage (#584)

* feat: allocatable cpu and memory % metrics for MDM

* maybe

* linux is working

* windows....

* some more

* comment

* better

* syntax

* ruby

* revert omsagent.yaml

* comments

* pr feedback

* pr feedback

* testing msys2 version update

* better

* update adx sdk for perf issue (#601)

* remove md check

* Gangams/release notes update for hotfix (#596)

* release notes updates

* release notes updates for ciprod06112021-1

* Cherry picking hotfix changes to ci_dev (#605)

* release changes (#607)

* Gangams/aad stage3 msi auth (#585)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* refactor the windows agent ingestion token code

* code cleanup

* fix build errors

* code clean up

* code clean up

* code clean up

* code clean up

* more refactoring

* fix bug

* fix bug

* add debug logs

* add nil checks

* revert changes

* revert yaml change since this was added on the aks side

* fix pr feedback

* fix pr feedback

* refine retry code

* update mdsd env as per official build

* cleanup

* update env vars per mdsd

* update with mdsd official build

* skip cert gen & renewal in case of aad msi auth

* add nil check

* cherry windows agent nodeip issue

* fix merge issue

Co-authored-by: rashmichandrashekar

* Gangams/remove chart version dependency (#589)

* remove chart version dependency

* remove unused code

* fix resource type

* fix

* handle weird cli chars

* update release process

* Gangams/july 2021 release tasks 3 (#613)

* use artifact and pipeline creds for image push

* minor update

* add vuln fix here so that pr can be merged

* remove un-used output plugin (#614)

* fix telegraf telemetry and improve fluentd liveness (#611)

* fix telegraf telemetry and improve fluentd liveness

* address identified vuln with libsystemd0

* fix exported image file extension

* Gangams/july 2021 release tasks 2 (#612)

* tail rs mdsd err logs

* configure mdsd log rotation

* log rotation for mdsd log files

* Fix out_oms.go dependency vulnerabilities (#623)

* revert libsystemd0 update (#616)

* updates for ci-prod release instructions (#619)

* cherry pick changes from ci_prod (#622)

* Support az login for passwords starting with dash ('-') (#626)

Co-authored-by: Vladimir Babichev

* Gangams/add telemetry fbit settings (#628)

* add telemetry to track fbit settings

* add telemetry to track fbit settings

* check onboarding status (#629)

* Gangams/arc k8s conformance test updates (#617)

* conf test updates

* clean up

* wip

* update with mcr cidev image

* handle log path

* cleanup

* clean up

* wip

* working

* update for mcr image

* minor

* image update

* handle latency of connected cluster resource creation

* update conftest image

* upgrade golang version for windows in pipeline build and locally (#630)

* Updating a link in Readme.md (#632)

The link to the build pipelines now goes directly to our build pipelines (instead of to all github-private pipelines)

* Updating omsagent yaml to have parity with omsagent yaml file in AKS RP (#615)

* Unit test tooling (#625)

Added tooling and examples for unit tests

* run unit tests after a merge too (#634)

* flag stale PRs & issues

* Adding script to collect logs (for troubleshooting) (#636)

* added script for collecting logs

* added windows daemonset and prometheus sidecar, as well as some explanatory prints

* added kubectl describe and kubectl logs output

* changed message to make it clearer that some errors are expected

* Sarah/ev2 (#640)

* ev2 artifacts for release pipeline

* update parameters reference

* add artifacts tar file

* changes to rollout and service model

* change agentimage path

* adding agentimage to artifact script

* removing charts from tarball

* change script to use blob storage

* change blob variables

* echo variables

* change blob uri

* use release id for blob prefix

* change to delete blob file

* add check for if blob storage file exists

* fix script errors

* update check for file in storage

* change true check

* comments and change storage account info to pipeline variables

* Changes for windows tar file

* PR changes

* documenting fbit tail plugin configmap settings. (#638)

* documenting fbit tail plugin configmap settings.

* Install unzip package on shell extension (#642)

* Changing installation in ev2 script (#644)

* Adjust release pipeline to use cdpx acr (#647)

* Adjust release pipeline to use cdpx acr

* Adjust release pipeline to use cdpx acr

* Update CDPX ACR path

* Add check for cdpx repo variable

* Sarah/ev2 prod (#649)

* Ev2 changes for prod

* CDPX repo naming change (#652)

* Sarah/ev2 update (#654)

* remove acr name from repo path

* add check to make sure tag does not exist in mcr repo

* change tag syntax for mcr repo check (#655)

* Gangams/optimize win livenessprobe (#653)

* livenessprobe optimization

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* Gangams/addon token adapter image tag to telemetry (#656)

* addon token adapter image tag

* addon token adapter image tag

* Sarah/ev2 helm (#658)

* Use MSI for Arc Release

* Use CIPROD_ACR AME subscription for shell extension

* remove extra line endings

* Sarah/ev2 pipeline (#661)

* testing build artifact dir changes

* add .pipelines directory and omsagent.yaml to build artifacts

* add charts directory to build artifacts (#662)

* Sarah/remove cdpx creds (#664)

* don't use cdpx acr creds from kv

* add e2etest.yaml to build output

* keep cdpx creds for now

* chart updates for rbac api version change (#660)

* chart updates for rbac api version change

* include windows ds for arc

* proxy support (for non-aks) (#665)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* initial proxy support

* merge?

* cleaning up some files which should've merged differently

* proxy should be working, but most tables don't have any data. About to merge, maybe whatever was wrong is now fixed

* linux AMA proxy works

* about to merge

* proxy support appears to be working, final mdsd build location will still change

* removing some unnecessary changes

* forgot to remove one last change

* redirected mdsd stderr to stdout instead of stdin

* addressing proxy password location comment

Co-authored-by: Ganga Mahesh Siddem

* Gangams/agent release ciprod10082021 & win-ciprod10082021 (#666)

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* use buildcommand for prod pipeline (#668)

* fixed merge issues. (#671) (#672)

* fix merge conflicts

* update with newimage tag

* changes related to mdsd version update (#673) (#674)

* Sarah/enable metrics (#675)

* add user assigned msi to yaml for pipeline

* update placeholders

* Gangams/chart updates oct2021 release (#676)

* chart updates for oct2021 release

* wip

* wip

* wip

* Gangams/msi mode mdsd crash fix (#677)

* update mdsd version which has fix for crash in msi mode

* image tag updates

* update to use extension GA api version (#679)

* Gangams/arm template msi onboarding (#659)

* wip

* wip

* working

* working

* working

* working

* working

* working

* shorten dcr prefix to DCR- to handle default workspace name length

* use MSCI- prefix similar to MSVMI- for dcr

* Gangams/conf test updates to handle sidecar (#681)

* wip

* test updates

* fix pr feedback

* fix pr feedback

* Fix scan break due to latest trivy changes

* Anjohans/configurable database name (#663)

* First cut at an implementation

* Reverting a change

* Moving a few lines to better align with cluster URI config

* Moving a few lines to better align with cluster URI config

* Adding an extra check that won't hurt

* Getting ADX database name from config rather than from secret

* Reverse the mangling done by editor

* Fixes to the code for reading the db name setting

* More fixes to the rb code for settings

* Tweaked and tested

* Code review

* Review follow-up

* Remove whitespace

* Gangams/troubleshooting script for arc k8s (#682)

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* doc updates

* doc updates

* wip

* wip

* update repo for issues

* fix minor one

* Sarah/remove cdpx creds (#685)

* remove download of cdpx creds

* fix: subtract number instead of string + update fluentd version 1.14.2 to fix security vulnerability (#686)

* fix: change default value to a number so that subtraction happens correctly

* update fluentd version to 1.14.2

* extra end statement

* safely set to float

* big decimal precision

* revert omsagent

* keep telemetry

* Faster Linux builds (part 1) (#687)

* moved docker image arg later on to enable docker build caching

* fixing image tag (doh)

* Sarah/fluentbit windows log (#688)

* upgrade fluentbit version for windows

* saving progress--fluent bit log tailing working for windows

* use configmap values for fluent-bit.conf where necessary and make necessary files common

* revert certificategenerator

* remove tomlparser-agent-config from linux folder

* clean up fluent.conf

* clean up fluent-bit.conf

* revert image tag

* fix agent tag

* make fluent bit flush interval configurable

* clean up unnecessary conf files

* remove unnecessary parts of fluent and fluent-bit conf

* log level back to info

* add fbit env variables for omsagent-win

* moving db files to var directory

* default to port 10250 & containerd for linux agent (#699)

* default to port 10250 & containerd

* fix pr feedback

* Updating pod annotation for latest agent version (#697)

* fix windows build failure due to msys2 version (#700)

* fix windows build failure due to msys2 version

* 20211130.0.0

* Jan agent tasks (#698)

* remove v1 fallback hidden option (#705)

* collect telemetry containerlog records with emptystamp (#703)

* collect telemetry containerlog records with emptystamp

* collect telemetry containerlog records with emptystamp

* Fixing telegraf bug for placeholder name (#706)

* Gangams/jan 2022 release tasks 3 (#702)

* add telemetry related to windows containers records

* add telemetry related to windows containers records

* containercount telemetry

* add explicit exit code in ps scripts

* node count telemetry

* telemetry for win cirecord 64KB or more

* metric to track wintelegraf metrics with tags 64kb

* metric to track wintelegraf metrics with tags 64kb

* fix pr feedback

* Gangams/jan 2022 release tasks 2 (#701)

* mdsd proc cpu and memory telemetry

* write ai logs to file and telemetry for mdsd proc

* write ai logs to file and telemetry for mdsd proc

* write ai logs to file and telemetry for mdsd proc

* fix pr feedback

* use name_prefix

* remove mdsd telemetry changes

* remove mdsd telemetry changes

* remove mdsd telemetry changes

* release updates for ciprod01312022 & win-ciprod01312022release (#707)

* release updates for ciprod01312022 release

* release updates for ciprod01312022 release

* fix pr feedback

* fix logger exception (#709)

* Gangams/chart version update for jan release (#710)

* chart updates for jan2022 release

* add missing agentversion annotations

* fix agentversion annotation issue in chart (#712)

* adx bug + misc (#714)

* fix golang dependencies

* fix adx bug

* exclude telegraf

* fix space

* include both

* exclude files specifically

* fix build break (#715)

* fix build break

* update all places

* Explicitly use win-2019 to unblock windows PRs builds

* Fixing telegraf vulnerability (#716)

* cherry picked changes from 03112022 release (#719)

* cherry picked changes from 03112022 release

* Gangams/http proxy support (#717)

* add proxy cert support

* add proxy cert support

* add proxy cert support

* add proxy cert support

* remove arbitrary username and pwd requirement

* remove arbitrary username and pwd requirement

* add proxy support for mdm

* mdsd dev build

* proxy changes

* fix typo

* mdsd dev build

* add libcurl specific things

* working mdsd proxy build

* mdsd official master build

* handle proxy endpoint which ends with /

* latest official mdsd build

* add telemetry to track proxy ca cert

* build multi-arch images (#704)

* build multi-arch linux images
* new pipelines to build multi-arch images

Co-authored-by: Amol Agrawal

* add missing artifacts (#720)

* add missing artifacts

Co-authored-by: Amol Agrawal

* Gangams/msi  onboarding arm template updates for AKS (#721)

* msi arm template updates

* handle space in location

* minor fixes (#722)

Co-authored-by: Amol Agrawal

* specify go patch version (#723)

* specify go minor version

Co-authored-by: Amol Agrawal

* User/amagraw/ciprod release 20220317 (#724)

* ciprod release march changes

Co-authored-by: Amol Agrawal

Co-authored-by: Ganga Mahesh Siddem
Co-authored-by: Vishwanath
Co-authored-by: rashmichandrashekar
Co-authored-by: bragi92
Co-authored-by: saaror
Co-authored-by: Grace Wehner
Co-authored-by: deagraw
Co-authored-by: David Michelman
Co-authored-by: Michael Sinz
Co-authored-by: Nicolas Yuen
Co-authored-by: seenu433
Co-authored-by: Tsubasa Nomura
Co-authored-by: Vladimir
Co-authored-by: Vladimir Babichev
Co-authored-by: sarahpeiffer
Co-authored-by: Anders Johansen
Co-authored-by: Amol Agrawal
ganga1980 added a commit that referenced this pull request May 20, 2022
* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

* Gangams/cluster creation scripts (#414)

* onprem k8s script

* script updates

* scripts for creating non-aks clusters

* fix minor text update

* updates

* script updates

* fix

* script updates

* fix scripts to install docker

* fix: Pin to a particular version of ltsc2019 by SHA (#427)

* enable collecting npm metrics (optionally) (#425)

* enable collecting npm metrics (optionally)

* fix default enrichment value

* fix adx

* Saaror patch 3 (#426)

* Create README.MD

Creating content for Kubecon lab

* Update README.MD

* Update README.MD

* Gangams/add containerd support to windows agent (#428)

* wip

* wip

* wip

* wip

* bug fix related to uri

* wip

* wip

* fix bug with ignore cert validation

* logic to ignore cert validation

* minor

* fix minor debug log issue

* improve log message

* debug message

* fix bug with nullorempty check

* remove debug statements

* refactor parsers

* add debug message

* clean up

* chart updates

* fix formatting issues

* Gangams/arc k8s metrics  (#413)

* cluster identity token

* wip

* fix exception

* fix exceptions

* fix exception

* fix bug

* fix bug

* minor update

* refactor the code

* more refactoring

* fix bug

* typo fix

* fix typo

* wait for 1min after token renewal request

* add proxy support for arc k8s mdm endpoint

* avoid additional get call

* minor line ending fix

* wip

* have separate log for arc k8s cluster identity

* fix bug on creating crd resource

* remove update permission since not required

* fixed some bugs

* fix pr feedback

* remove list since its not required

* fix: Reverting back to ltsc2019 tag (#429)

* more kubelet metrics (#430)

* more kubelet metrics

* clean up new config

* fix nom issue when config is empty (#432)

* support multiple docker paths when docker root is updated thru knode (#433)

* Gangams/doc and other related updates (#434)

* bring back nodeSelector changes for windows agent ds

* readme updates

* chart updates for azure cluster resourceid and region

* set cluster region during onboarding for managed clusters

* wip

* fix for onboarding script

* add sp support for the login

* update help

* add sp support for powershell

* script updates for sp login

* wip

* wip

* wip

* readme updates

* update the links to use ci_prod branch

* fix links

* fix image link

* some more readme updates

* add missing serviceprincipal in ps scripts (#435)

* fix telemetry bug (#436)

* Gangams/readmeupdates non aks 09162020 (#437)

* changes for ciprod09162020 non-aks release

* fix script to handle cross sub scenario

* fix minor comment

* fix date in version file

* fix pr comments

* Gangams/fix weird conflicts (#439)

* separate build yamls for ci_prod branch (#415) (#416)

* [Merge] dev to prod for ciprod08072020 release (#424)

* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

Co-authored-by: Vishwanath
Co-authored-by: rashmichandrashekar

Co-authored-by: Vishwanath
Co-authored-by: rashmichandrashekar

* fix quote issue for the region (#441)

* fix cpucapacity/limit bug (#442)

* grwehner/pv-usage-metrics (#431)

- Send persistent volume usage and capacity metrics to LA for PVs with PVCs at the pod level; config to include or exclude kube-system namespace.
- Send PV usage percentage to MDM if over the configurable threshold.
- Add PV usage recommended alert template.
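
A minimal sketch of the threshold check described in the second bullet is shown below. The function names, the byte counts and the 60% threshold are illustrative assumptions, not the agent's implementation or its defaults; "over the threshold" is read here as strictly greater than.

```python
def pv_usage_percent(used_bytes, capacity_bytes):
    """Return PV usage as a percentage, or None if capacity is unknown."""
    if capacity_bytes <= 0:
        return None
    return 100.0 * used_bytes / capacity_bytes


def exceeds_threshold(used_bytes, capacity_bytes, threshold_percent):
    """True when usage is strictly over the configured threshold."""
    percent = pv_usage_percent(used_bytes, capacity_bytes)
    return percent is not None and percent > threshold_percent


GIB = 1024 ** 3
print(pv_usage_percent(6 * GIB, 8 * GIB))         # → 75.0
print(exceeds_threshold(6 * GIB, 8 * GIB, 60.0))  # → True
print(exceeds_threshold(4 * GIB, 8 * GIB, 60.0))  # → False
```

Returning `None` for a zero or missing capacity avoids a division by zero and lets the caller skip the metric rather than report a misleading percentage.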

* add new custom metric regions (#444)

* add new custom metric regions

* fix commas

* add 'Terminating' state (#443)

* Gangams/sept agent release tasks (#445)

* turnoff mdm nonsupported cluster types

* enable validation of server cert for ai ruby http client

* add kubelet operations total and total error metrics

* node selector label change

* label update

* wip

* wip

* wip

* revert quotes

* grwehner/pv-collect-volume-name (#448)

Collect and send the volume name as another tag for pvUsedBytes in InsightsMetrics, so that it can be displayed in the workload workbook. Does not affect the PV MDM metric

* Changes for september agent release (#449)

Moving from v1beta1 to v1 for health CRD
Adding timer for zero filling
Adding zero filling for PV metrics

* Gangams/arc k8s related scripts, charts and doc updates (#450)

* checksum annotations

* script update for chart from mcr

* chart updates

* update chart version to match with chart release

* script updates

* latest chart updates

* version updates for chart release

* script updates

* script updates

* doc updates

* doc updates

* update comments

* fix bug in ps script

* fix bug in ps script

* minor update

* release process updates

* use consistent name across scripts

* use consistent names

* Install CA certs from wireserver (#451)

* grwehner/pv-volume-name-in-mdm (#452)

Add volume name for PV to mdm dimensions and zero fill it

* Release changes for 10052020 release (#453)

* Release changes for 10052020 release

* remove redundant kubelet metrics as part of PR feedback

* Update onboarding_instructions.md (#456)

* Update onboarding_instructions.md

Updated the documentation to reflect where to update the config map.

* Update onboarding_instructions.md

* Update onboarding_instructions.md

* Update onboarding_instructions.md

Updated the link

* chart update for sept2020 release (#457)

* add missing version update in the script (#458)

* November release fixes - activate one agent, adx schema v2, win perf issue, syslog deactivation (#459)

* activate one agent, adx schema v2, win perf issue, syslog deactivation

* update chart

* remove hyphen for params in chart (#462)

Merging as it's a simple fix (remove hyphen)

* Changes for cutting a new build for ciprod10272020 release (#460)

* using latest stable version of msys2 (#465)

* fixing the windows-perf-dups (#466)

* chart updates related to new microsoft/charts repo (#467)

* Changes for creating 11092020 release (#468)

* MDM exception aggregation (#470)

* grwehner/mdm custom metric regions (#471)

Remove custom metrics region check for public cloud

* updating rs limit to 1gb (#474)

* grwehner/pv inventory (#455)

Add fluentd plugin to request persistent volume info from the kubernetes api and send to LA

* Gangams/fix for build release pipeline issue (#476)

* use isolated cdpx acr

* correct comment

* add pv fluentd plugin config to helm rs config (#477)

* add pv fluentd plugin to helm rs config

* helm rbac permissions for pv api calls

* Gangams/fix rs ooming (#473)

* optimize kpi

* optimize kube node inventory

* add flags for events, deployments and hpa

* have separate function parseNodeLimits

* refactor code

* fix crash

* fix bug with service name

* fix bugs related to get service name

* update oom fix test agent

* debug logs

* fix service label issue

* update to latest agent and enable ephemeral annotation

* change stream size to 200 from 250

* update yaml

* adjust chunksizes

* add ruby gc env

* yaml changes for cioomtest11282020-3

* telemetry to track pods latency

* service count telemetry

* rename variables

* wip

* nodes inventory telemetry

* configmap changes

* add emit streams in configmap

* yaml updates

* fix copy and paste bug

* add todo comments

* fix node latency telemetry bug

* update yaml with latest test image

* fix bug

* upping rs memory change

* fix mdm bug with final emit stream

* update to latest image

* fix pr feedback

* fix pr feedback

* rename health config to agent config

* fix max allowed hpa chunk size

* update to use 1k pod chunk since validated on 1.18+

* remove debug logs

* minor updates

* move defaults to common place

* chart updates

* final oomfix agent

* update to use prod image so that can be validated with build pipeline

* fix typo in comment

* Gangams/enable arc onboarding to ff (#478)

* wip

* updates

* trigger login if the ctx cloud is not the same as the specified cloud

* add missed commit

* Convert PV type dictionary to json for telemetry so it shows up in logs (#480)

* fix 2 windows tasks - 1) Don't log to termination log 2) enable ADX route for containerlogs in windows (for O365) (#482)

* fix ci envvar collection in large pods (#483)

* grwehner/jan agent tasks (#481)

- Windows agent fix to use log filtering settings in config map.
- Error handling for kubelet_utils get_node_capacity in case /metrics/cadvisor endpoint fails.
- Remove env variable for workspace key for windows agent

* updating fbit version and cpu limit (#485)

* reverting to older version (#487)

* Gangams/add fbsettings configurable via configmap (#486)

* wip

* fbit config settings

* add config warn message

* handle one config provided but not other

* fixed pr feedback

* fix copy paste error

* rename config parameter names

* fix typo

* fix fbit crash in helm path

* fix nil check

* Gangams/jan agent release tasks (#484)

* wip

* explicit amd64 affinity for hybrid workloads

* fix space issue

* wip

* revert vscode setting file

* remove per container logs in ci (#488)

* updates for ciprod01112021 release (#489)

* new yaml files (#491)

* Use cloud-specific instrumentation keys (#494)

If APPLICATIONINSIGHTS_AUTH_URL is set/non-empty then the agent will now grab a custom IKey from a URL stored in APPLICATIONINSIGHTS_AUTH_URL
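
The selection logic described above might look like the following sketch. Only the environment variable name comes from the commit message. `resolve_instrumentation_key`, `fetch_text`, the placeholder default key, the plain-text response format and the fallback on network errors are all assumptions made for illustration.

```python
import os
import urllib.request

# Placeholder only; the real default key is not part of this sketch.
DEFAULT_IKEY = "00000000-0000-0000-0000-000000000000"


def fetch_text(url, timeout=10):
    """Fetch a URL and return its body as stripped text (assumed format)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def resolve_instrumentation_key(env=os.environ, fetch=fetch_text,
                                default=DEFAULT_IKEY):
    """Use a custom key when APPLICATIONINSIGHTS_AUTH_URL is set/non-empty."""
    url = env.get("APPLICATIONINSIGHTS_AUTH_URL", "").strip()
    if not url:
        return default
    try:
        return fetch(url) or default
    except OSError:
        # Assumption: fall back to the default key if the URL is unreachable.
        return default


# Usage with a fake fetcher so that no network access is needed.
print(resolve_instrumentation_key(env={}) == DEFAULT_IKEY)  # → True
print(resolve_instrumentation_key(
    env={"APPLICATIONINSIGHTS_AUTH_URL": "https://example.invalid/ikey"},
    fetch=lambda url: "custom-key"))                         # → custom-key
```

Passing the environment and the fetcher as parameters keeps the decision logic testable without touching the real environment or the network.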

* upgrade apt to latest version (#492)

* upgrade apt to latest version

* fix pr feedback

* Gangams/add support for extension msi for arc k8s cluster (#495)

* wip

* add env var for the arc k8s extension name

* chart update

* extension msi updates

* fix bug

* revert chart and image to prod version

* minor text changes

* image tag to prod

* wip

* wip

* wip

* wip

* final updates

* fix whitespaces

* simplify crd yaml

* Gangams/arm template arc k8s extension (#496)

* arm templates for arc k8s extension

* update to use official extension type name

* update

* add identity property

* add proxyendpointurl parameter

* add default values

* Gangams/aks monitoring via policy (#497)

* enable monitoring through policy

* wip

* handle tags

* wip

* add alias

* wip

* working

* updates

* working

* with deployment name

* doc updates

* doc updates

* fix typo in the docs

* revert to use operatingSystem from osImage for node os telemetry (#498)

* Container log v2 schema changes (#499)

* make the pod name in the mdsd definition a str for consistency. msgp does no type checking, as the type metadata is carried in the message itself.

* Add priority class to the daemonsets (#500)

* Add priority class to the daemonsets

Add a priority class for omsagent and have the daemonsets use this
to be sure to schedule the pods.

Daemonset pods are constrained in scheduling to run on specific
nodes.  This is done by the daemonset controller.  When a node shows
up it will create a pod with a strong affinity to that node.  When a
node goes away, it will delete the pod with the node affinity to that
node.

Kubernetes pod scheduling does not know it is a daemonset but it does
know it is tied to a specific node.  With default scheduling, it is
possible for the pods to be "frozen out" of a node because the node
already is full.  This can happen because "normal" pods may already
exist and are looking for a node to get scheduled on when a node is
added to the cluster.  The daemonset controller will only first
create the pod for the node at around the same time.  The kubernetes
scheduler is running async from all of this and thus there can be a
race as to who gets scheduled on the node.

The pod priority class (and thus the pod priority) is a way to indicate
that the pod has a higher scheduling priority than a default pod.

By default, all pods are at priority 0.  Higher numbers are higher
priority.  Setting the priority to something greater than zero will
allow the omsagent daemonsets to win a race against "normal" pods for
scheduled resources on a node - and will also allow for graceful
eviction in the case the node is too full.

Without this, omsagent can be left off a node in clusters that are
very busy, especially in dynamic scaling situations.
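
The race described above can be illustrated with a toy model. The sketch below is not the Kubernetes scheduler: `Pod`, `schedule`, the pod names and the CPU figures are all made up for the example, and it models only "highest priority is considered first", not preemption or eviction.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pod:
    name: str
    cpu: int           # requested CPU in millicores
    priority: int = 0  # pods default to priority 0


def schedule(pending: List[Pod], node_cpu: int) -> List[str]:
    """Place pods on a single node, highest priority first.

    Toy model only: sorted() is stable, so pods with equal priority keep
    their arrival order, and a pod that does not fit is simply skipped.
    """
    placed = []
    free = node_cpu
    for pod in sorted(pending, key=lambda p: -p.priority):
        if pod.cpu <= free:
            placed.append(pod.name)
            free -= pod.cpu
    return placed
```

With two "normal" pods requesting 600m and 400m queued ahead of a 200m omsagent pod on a 1000m node, the default-priority omsagent pod is frozen out; giving it any priority above zero places it first.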

I did not test the windows pod as we have no windows clusters.

* CR feedback

* fix node metric issue (#502)

* Bug fixes for Feb release (#504)

* bug fix for mdm metrics with no limits

* fix exception bug

* Gangams/feb 2021 agent bug fix (#505)

* fix npe in getKubeServiceRecords

* use image fields from spec

* fix typo

* cover all cases

* handle scenario only digest specified

* changes for release -ciprod02232021 (#506)

* Gangams/e2e test framework (#503)

* add agent e2e fw and tests

* doc and script updates

* add validation script

* doc updates

* yaml updates

* fix typo

* doc updates

* more doc updates

* add ISTEST for helm chart to use arc conf

* refactor test code

* fix pr feedback

* fix pr feedback

* fix pr feedback

* fix pr feedback

* scrape new kubelet pod count metric name (#508)

* Adding explicit json output to az commands as the script fails if az is configured with Table output #409 (#513)

* Gangams/arc proxy contract and token renewal updates (#511)

* fix issue with crd status updates

* handle renewal token delays

* add proxy contract

* updates for proxy cert for linux

* remove proxycert related changes

* fix whitespace issue

* fix whitespace issue

* remove proxy in arm template

* doc updates for microsoft charts repo release (#512)

* doc updates for microsoft charts repo release

* wip

* Update enable-monitoring.sh (#514)

Lines 314 and 343 seem to have trailing spaces for some subscriptions, which causes the script to exit even in valid scenarios

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Prometheus scraping from sidecar and OSM changes (#515)

* add liveness timeout for exec (#518)

* chart and other updates (#519)

* Saaror osmdoc (#523)

* Create ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Add files via upload

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* telemetry bug fix (#527)

* Fix conflicting logrotate settings (#526)

The node and the omsagent container both have a cron.daily file to rotate certain logs daily. These settings are the same for some files in /var/log (mounted from the node with read/write access), causing the rotation to fail when both try to rotate at the same time. So then the /var/log/*.1 file is written to forever. Since these files are always written to and never rotated, it causes high memory usage on the node after a while.

This fix removes the container logrotate settings for /var/log, which the container does not write to.

* bug fix (#528)

* Gangams/arc ev2 deployment (#522)

* ev2 deployment for arc k8s extension

* fix charts path issue

* rename scripts tar

* add notifications

* fix line endings

* fix line endings

* update with prod repo

* fix file endings

* added liveness and telemetry for telegraf (#517)

* added liveness and telemetry for telegraf

* code transfer

* removed windows liveness probe

* done

* Windows metric fix (#530)

* changes

* about to remove container fix

* moved caching code to existing loop

* removed un-necessary changes

* removed a few more un-necessary changes

* added windows node check

* fixed a bug

* everything works confirmed

* OSM doc update (#533)

* Adding MDM metrics for threshold violation (#531)

* Rashmi/april agent 2021 (#538)

* add Read_from_Head config for all fluentbit tail plugins (#539)

See the commit message of: fluent/fluent-bit@70e33fa
for details explaining the fluentbit change and what Read_from_Head does when set to true.

* fix programdata mount issue on containerd win nodes (#542)

* Update sidecar mem limits  (#541)

* David/release 4 22 2021 (#544)

* updating image tag and agent version

* updated liveness probe

* updated release notes again

* fixed date in version file

* 1m, 1m, 1s by default (#543)

* 1m, 1m, 1s by default

* setting default through a different method

* David/aad stage 1 release (#556)

* update to latest omsagent, add eastus2 to mdsd regions

* copied oneagent bits to a CI repository release

* mdsd inmem mode

* yaml for cl scale test

* yaml for cl scale test

* reverting dockerProviderVersion version to 15.0.0

* prepping for release (updated image version, dockerProviderVersion, and release notes)

* container log scaletest yamls

* forgot to update image version in chart

* fixing windows tag in dockerfile, changing release notes wording

* missed windows tag in one more place

* forgot to change the windows dockerProviderVersion back

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Update ReleaseNotes.md (#558)

fix imagetag in the release notes

* Add wait time for telegraf and also force mdm egress to use tls 1.2 (#560)

* Add wait time for telegraf and also force mdm egress to use tls 1.2

* add wait for all telegraf dependencies across all containers (ds & rs)

* remove ssl change so we dont include as part of the other fix until we test with att nodes.

* partially disabled telegraf liveness probe check, we'll still have telemetry but the probe won't fail if telegraf isn't running (#561)

* changes for 05202021 release (#563)

* changes for 05202021 release

* fixed typos

* Rashmi/jedi wireserver (#566)

* Update ReadMe.md (#565)

* Update ReadMe.md

* Update ReadMe.md

Included feedback from OSM team and Fixed

* Gangams/aad stage2 full switch to mdsd (#559)

* full switch to mdsd, upgrade to ruby v1 & omsagent removal

* add odsdirect as fallback option

* cleanup

* cleanup

* move customRegion to stage3

* updates related to containerlog route

* make xml eventschema consistent

* add buffer settings

* address HTTPServerException deprecation in ruby 2.6

* update to official mdsd version

* fix log message issue

* fix pr feedback

* get rid of unused code from omscommon

* fix pr feedback

* fix pr feedback

* clean up

* clean up

* fix missing conf

* Send perf metrics to MDM from windows daemonset (#568)

* updating json gem to address CVE-2020-10663 (#567)

* updating json gem to address CVE-2020-10663

* updating json gem to address CVE-2020-10663

* update recommended alerts readme (#570)

@dcbrown16 pointed out that this page links to the wrong document in [this issue](#475). The content in the currently linked page is identical to the page which should be linked, so it's a simple fix.

* trying again to fix the json gem (#571)

* trying again to fix the json gem

* removing installation of newer json gem

* Addressing PR comments for - #568 (#569)

* Mem_Buf_limit  is configurable via ConfigMap (#574)

* add log rotation settings for fluentd logs (#577)

* Gangams/release 06112021 (#578)

* updates related to ciprod06112021 release

* minor update

* release note update (#579)

* Make sidecar fluentbit chunk size configurable (#573)

* Fix vulnerabilities (#583)

* test

* test1

* test-2

* test-3

* 3

* 4

* test

* 2

* 3

* 4

* 5

* 6

* rename gem for windows

* fix

* fix

* Windows build optimization (#582)

* fix windows build failure due to msys2 version

* Fix telegraf startup issue when endpoint is unreachable (#587)

* revert fbit tail plugins defaults to std defaults (#586)

* fixed another bug (#593)

* feat: add new metrics to MDM for allocatable % calculation of cpu and memory usage (#584)

* feat: allocatable cpu and memory % metrics for MDM

* maybe

* linux is working

* windwos....

* some more

* comment

* better

* syntax

* ruby

* revert omsagent.yaml

* comments

* pr feedback

* pr feedback

* testing msys2 version update

* better

* update adx sdk for perf issue (#601)

* remove md check

* Gangams/release notes update for hotfix (#596)

* release notes updates

* release notes updates for ciprod06112021-1

* Cherry picking hotfix changes to ci_dev (#605)

* release changes (#607)

* Gangams/aad stage3 msi auth (#585)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* refactor the windows agent ingestion token code

* code cleanup

* fix build errors

* code clean up

* code clean up

* code clean up

* code clean up

* more refactoring

* fix bug

* fix bug

* add debug logs

* add nil checks

* revert changes

* revert yaml change since this added in aks side

* fix pr feedback

* fix pr feedback

* refine retry code

* update mdsd env as per official build

* cleanup

* update env vars per mdsd

* update with mdsd official build

* skip cert gen & renewal in case of aad msi auth

* add nil check

* cherry windows agent nodeip issue

* fix merge issue

Co-authored-by: rashmichandrashekar <[email protected]>

* Gangams/remove chart version dependency (#589)

* remove chart version dependency

* remove unused code

* fix resource type

* fix

* handle weird cli chars

* update release process

* Gangams/july 2021 release tasks 3 (#613)

* use artifact and pipeline creds for image push

* minor update

* add vuln fix here so that pr can be merged

* remove un-used output plugin (#614)

* fix telegraf telemetry and improve fluentd liveness (#611)

* fix telegraf telemetry and improve fluentd liveness

* address identified vuln with libsystemd0

* fix exported image file extension

* Gangams/july 2021 release tasks 2 (#612)

* tail rs mdsd err logs

* configure mdsd log rotation

* log rotation for mdsd log files

* Fix out_oms.go dependency vulnerabilities (#623)

* revert libsystemd0 update (#616)

* updates for ci-prod release instructions (#619)

* cherry pick changes from ci_prod (#622)

* Support az login for passwords starting with dash ('-') (#626)

Co-authored-by: Vladimir Babichev <[email protected]>
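
The commit title does not say how the fix was made. As a general illustration of why a leading dash is troublesome, the sketch below uses Python's standard `argparse`: a value that starts with `-` is taken for another option unless it is attached with `=`. The parser, the `login-demo` name and `parse_login` are hypothetical and stand in for any argparse-style command line; whether the onboarding script uses this exact form is an assumption.

```python
import argparse


def parse_login(argv):
    """Parse a login-style command line; return the password, or None on a parse error."""
    parser = argparse.ArgumentParser(prog="login-demo", add_help=False)
    parser.add_argument("--username")
    parser.add_argument("--password")
    try:
        return parser.parse_args(argv).password
    except SystemExit:
        # argparse reports parse errors by exiting; treat that as a failure.
        return None


# Space-separated: "-s3cret" looks like an option, so --password gets no value.
space_form = parse_login(["--password", "-s3cret"])
# Attached with "=": the value is taken verbatim.
equals_form = parse_login(["--password=-s3cret"])
```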

* Gangams/add telemetry fbit settings (#628)

* add telemetry to track fbit settings

* add telemetry to track fbit settings

* check onboarding status (#629)

* Gangams/arc k8s conformance test updates (#617)

* conf test updates

* clean up

* wip

* update with mcr cidev image

* handle log path

* cleanup

* clean up

* wip

* working

* update for mcr image

* minor

* image update

* handle latency of connected cluster resource creation

* update conftest image

* upgrade golang version for windows in pipeline build and locally (#630)

* Updating a link in Readme.md (#632)

The link to the build pipelines now goes directly to our build pipelines (instead of to all github-private pipelines)

* Updating omsagent yaml to have parity with omsagent yaml file in AKS RP (#615)

* Unit test tooling (#625)

Added tooling and examples for unit tests

* run unit tests after a merge too (#634)

* flag stale PRs & issues

* Adding script to collect logs (for troubleshooting) (#636)

* added script for collecting logs

* added windows daemonset and prometheus sidecar, as well as some explanatory prints

* added kubectl describe and kubectl logs output

* changed message to make it more clear some errors are expected

* Sarah/ev2 (#640)

* ev2 artifacts for release pipeline

* update parameters reference

* add artifacts tar file

* changes to rollout and service model

* change agentimage path

* adding agentimage to artifact script

* removing charts from tarball

* change script to use blob storage

* change blob variables

* echo variables

* change blob uri

* use release id for blob prefix

* change to delete blob file

* add check for if blob storage file exists

* fix script errors

* update check for file in storage

* change true check

* comments and change storage account info to pipeline variables

* Changes for windows tar file

* PR changes

* documenting fbit tail plugin configmap settings. (#638)

* documenting fbit tail plugin configmap settings.

* Install unzip package on shell extension (#642)

* Changing installation in ev2 script (#644)

* Adjust release pipeline to use cdpx acr (#647)

* Adjust release pipeline to use cdpx acr

* Adjust release pipeline to use cdpx acr

* Update CDPX ACR path

* Add check for cdpx repo variable

* Sarah/ev2 prod (#649)

* Ev2 changes for prod

* CDPX repo naming change (#652)

* Sarah/ev2 update (#654)

* remove acr name from repo path

* add check to make sure tag does not exist in mcr repo

* change tag syntax for mcr repo check (#655)

* Gangams/optimize win livenessprobe (#653)

* livenessprobe optimization

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* Gangams/addon token adapter image tag to telemetry (#656)

* addon token adapter image tag

* addon token adapter image tag

* Sarah/ev2 helm (#658)

* Use MSI for Arc Release

* Use CIPROD_ACR AME subscription for shell extension

* remove extra line endings

* Sarah/ev2 pipeline (#661)

* testing build artifact dir changes

* add .pipelines directory and omsagent.yaml to build artifacts

* add charts directory to build artifacts (#662)

* Sarah/remove cdpx creds (#664)

* don't use cdpx acr creds from kv

* add e2etest.yaml to build output

* keep cdpx creds for now

* chart updates for rbac api version change (#660)

* chart updates for rbac api version change

* include windows ds for arc

* proxy support (for non-aks) (#665)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* initial proxy support

* merge?

* cleaning up some files which should've merged differently

* proxy should be working, but most tables don't have any data. About to merge, maybe whatever was wrong is now fixed

* linux AMA proxy works

* about to merge

* proxy support appears to be working, final mdsd build location will still change

* removing some unnecessary changes

* forgot to remove one last change

* redirected mdsd stderr to stdout instead of stdin

* addressing proxy password location comment

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Gangams/agent release ciprod10082021 & win-ciprod10082021 (#666)

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* use buildcommand for prod pipeline (#668)

* fixed merge issues. (#671) (#672)

* fix merge conflicts

* update with newimage tag

* changes related to mdsd version update (#673) (#674)

* Sarah/enable metrics (#675)

* add user assigned msi to yaml for pipeline

* update placeholders

* Gangams/chart updates oct2021 release (#676)

* chart updates for oct2021 release

* wip

* wip

* wip

* Gangams/msi mode mdsd crash fix (#677)

* update mdsd version which has fix for crash in msi mode

* image tag updates

* update to use extension GA api version (#679)

* Gangams/arm template msi onboarding (#659)

* wip

* wip

* working

* working

* working

* working

* working

* working

* shorten dcr prefix to DCR- to handle default workspace name length

* use MSCI- prefix similar to MSVMI- for dcr

* Gangams/conf test updates to handle sidecar (#681)

* wip

* test updates

* fix pr feedback

* fix pr feedback

* Fix scan break due to latest trivy changes

* Anjohans/configurable database name (#663)

* First cut at an implementation

* Reverting a change

* Moving a few lines to better align with cluster URI config

* Moving a few lines to better align with cluster URI config

* Adding an extra check that won't hurt

* Getting ADX database name from config rather than from secret

* Reverse the mangling done by editor

* Fixes to the code for reading the db name setting

* More fixes to the rb code for settings

* Tweaked and tested

* Code review

* Review follow-up

* Remove whitespace

* Gangams/troubelshooting script for arc k8s (#682)

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* doc updates

* doc updates

* wip

* wip

* update repo for issues

* fix minor one

* Sarah/remove cdpx creds (#685)

* remove download of cdpx creds

* fix: subtract number instead of string + update fluentd version 1.14.2 to fix security vulnerability (#686)

* fix: change default value to a number so that subtraction happens correctly

* update fluentd version to 1.14.2

* extra end statement

* safely set to float

* big decimal precision

* revert omsagent

* keep telemetry

* Faster Linux builds (part 1) (#687)

* moved docker image arg later on to enable docker build caching

* fixing image tag (doh)

* Sarah/fluentbit windows log (#688)

* upgrade fluentbit version for windows

* saving progress--fluent bit log tailing working for windows

* use configmap values for fluent-bit.conf where necessary and make necessary files common

* revert certificategenerator

* remove tomlparser-agent-config from linux folder

* clean up fluent.conf

* clean up fluent-bit.conf

* revert image tag

* fix agent tag

* make fluent bit flush interval configurable

* clean up unnecessary conf files

* remove unnecessary parts of fluent and fluent-bit conf

* log level back to info

* add fbit env variables for omsagent-win

* moving db files to var directory

* default to port 10250 & containerd for linux agent (#699)

* default to port 10250 & containerd

* fix pr feedback

* Updating pod annotation for latest agent version (#697)

* fix windows build failure due to msys2 version (#700)

* fix windows build failure due to msys2 version

* 20211130.0.0

* Jan agent tasks (#698)

* remove v1 fallback hidden option (#705)

* collect telemetry containerlog records with emptystamp (#703)

* collect telemetry containerlog records with emptystamp

* collect telemetry containerlog records with emptystamp

* Fixing telegraf bug for placeholder name (#706)

* Gangams/jan 2022 release tasks 3 (#702)

* add telemetry related to windows containers records

* add telemetry related to windows containers records

* containercount telemetry

* add explicit exit code in ps scripts

* node count telemetry

* telemetry for win cirecord 64KB or more

* metric to track wintelegraf metrics with tags 64kb

* metric to track wintelegraf metrics with tags 64kb

* fix pr feedback

* Gangams/jan 2022 release tasks 2 (#701)

* mdsd proc cpu and memory telemetry

* write ai logs to file and telemetry for mdsd proc

* write ai logs to file and telemetry for mdsd proc

* write ai logs to file and telemetry for mdsd proc

* fix pr feedback

* use name_prefix

* remove mdsd telemetry changes

* remove mdsd telemetry changes

* remove mdsd telemetry changes

* release updates for ciprod01312022 & win-ciprod01312022release (#707)

* release updates for ciprod01312022 release

* release updates for ciprod01312022 release

* fix pr feedback

* fix logger exception (#709)

* Gangams/chart version update for jan release (#710)

* chart updates for jan2022 release

* add missing agentversion annotations

* fix agentversion annotation issue in chart (#712)

* adx bug + misc (#714)

* fix golang dependencies

* fix adx bug

* exclude telegraf

* fix space

* include both

* exclude files specifically

* fix build break (#715)

* fix build break

* update all places

* Explicitly use win-2019 to unblock windows PRs builds

* Fixing telegraf vulnerability (#716)

* cherry picked changes from 03112022 release (#719)

* cherry picked changes from 03112022 release

* Gangams/http proxy support (#717)

* add proxy cert support

* add proxy cert support

* add proxy cert support

* add proxy cert support

* remove arbitrary username and pwd requirement

* remove arbitrary username and pwd requirement

* add proxy support for mdm

* mdsd dev build

* proxy changes

* fix typo

* mdsd dev build

* add libcurl specific things

* working mdsd proxy build

* mdsd official master build

* handle proxy endpoint which ends with /

* latest official mdsd build

* add telemetry to track proxy ca cert

* build multi-arch images (#704)

* build multi-arch linux images
* new pipelines to build multi-arch images

Co-authored-by: Amol Agrawal <[email protected]>

* add missing artifacts (#720)

* add missing artifacts

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/msi  onboarding arm template updates for AKS (#721)

* msi arm template updates

* handle space in location

* minor fixes (#722)

Co-authored-by: Amol Agrawal <[email protected]>

* specify go patch version (#723)

* specify go minor version

Co-authored-by: Amol Agrawal <[email protected]>

* User/amagraw/ciprod release 20220317 (#724)

* ciprod release march changes

Co-authored-by: Amol Agrawal <[email protected]>

* Remove health type from DCR onboarding & add private link support for windows agent in msi mode (#727)

* add private link support for windows agent in msi auth

* remove Microsoft-KubeHealth

* add private link support for windows msi

* fix bug

* fix bug

* fix bug

* fix bug

* check platform specific tags (#730) (#731)

* PodReadyPercentage metric bug fix (#734)

* update windows to ruby 2.7 (#732)

Co-authored-by: Amol Agrawal <[email protected]>

* Improve CI/CD for multi-arch (#733)

* selective push + trivy test

* keep size down

* improve CI and PR builds

* improve checks

* remove IMAGE_TAG build_arg from prod pipeline

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/ts updates for msi (#736)

* ts updates for msi based onboarding

* ts updates for msi based onboarding

* fix typo

* fix typo

* improve log message

* Sarah/health deprecation (#735)

Removes all health feature related code

* check platform specific tags (#738)

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/msi test instructions (#739)

* instructions for msi test validation

* readme updates

* readme updates

* readme updates

* readme updates

* Add CI Windows Build to MultiArch Dev pipeline (#740)

* test image in pools

* update dev pipeline - 1

* update dev -1

* fix job names

* correct paths

* test pool name

* update pool name

* updated urls

* speed up installs

* add base build

* fix paths

* do both builds

* fix bug

* add pool for common

* fix bug

* create path

* temp remove metadata windows

* fix bug

* fix docker command

* almost there

* login to acr

* create windows metadata file

* address PR comments I

Co-authored-by: Amol Agrawal <[email protected]>

* Add Windows phase (#741)

* build and release windows for prod

Co-authored-by: Amol Agrawal <[email protected]>

* Sarah/add onboarding templates (#742)

* add onboarding templates for legacy auth

* fix download (#749)

Co-authored-by: Amol Agrawal <[email protected]>

* force run trivy stage (#745)

- scans for HIGH, MEDIUM, CRITICAL CVEs with fixes available in / and /usr/lib
- breaks build if CVEs with existing fixes found
- adds trivyignore to accommodate CVEs which are understood and should not get flagged
- adds CVEs to trivyignore to unblock builds; CVEs will be fixed and removed from trivyignore in later PRs
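
The gating rules in the list above can be sketched as a small filter. This is only an illustration of the logic, not the pipeline's implementation: the finding keys (`id`, `severity`, `fixed_version`), the function names and the ignore-file parsing (one ID per line, `#` for comments) are assumptions made for the example and are not Trivy's actual report schema.

```python
from typing import Dict, Iterable, List, Set

BLOCKING_SEVERITIES = {"MEDIUM", "HIGH", "CRITICAL"}


def load_ignore_list(text: str) -> Set[str]:
    """Parse ignore-file text: one CVE ID per line, '#' starts a comment."""
    ids = set()
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()
        if entry:
            ids.add(entry)
    return ids


def blocking_findings(findings: Iterable[Dict[str, str]], ignored: Set[str]) -> List[str]:
    """Return IDs of findings that should break the build.

    A finding blocks when its severity is MEDIUM or above, a fix is
    available (non-empty fixed_version), and it is not on the ignore list.
    """
    return [
        f["id"]
        for f in findings
        if f["severity"].upper() in BLOCKING_SEVERITIES
        and f.get("fixed_version")
        and f["id"] not in ignored
    ]
```

A build step would fail when `blocking_findings(...)` returns a non-empty list, and entries would be dropped from the ignore file as the corresponding fixes land.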

Co-authored-by: Amol Agrawal <[email protected]>

* update telegraf to 1.22.2 to fix vulns (#752)

* update telegraf to 1.22.2 to fix vulns

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/arc k8s aad msi auth  (#743)

* arc k8s msi

* wip

* extension identity role

* imds sidecar integration for arc k8s

* imds sidecar integration for arc k8s

* imds endpoint for windows

* imds endpoint for windows

* wip

* fix exception

* rename param name

* arc msi imdsd container changes

* arc msi imdsd container changes

* arc msi imdsd container changes

* arc msi imdsd container changes

* arc msi imdsd container changes

* revert unneeded yaml changes

* revert unneeded yaml changes

* wip

* wip

* working

* working

* working

* add implementation for msi token for windows mdm metrics

* fix comment

* arc k8s msi onboarding templates

* fix template bug

* fix template bug

* fix template bug

* rename flag name

* fix template bug

* make useAADAuth specific to arc k8s

* set k8sport at machine scope for windows

* fix bug

* fix bug

* update rbac for arc k8s imds

* bump chart version for conformance test run

* conf test updates for msi auth

* cli extension whl file

* add containerinsights solution in msi auth mode

* unify tags

* revert test chart and image versions

* remove test whl file and fix conf test

* conf test updates for addon-token-adapter

* remove container insights solution add for msi auth

* add missing arm template param

* Gangams/ws2022 support (#756)

* use hyperv isolation

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* doc and script updates

* add common as dependency for multi-arc job

* merge into single job for perf evaluation

* merge into single job for perf evaluation

* merge into single job for perf evaluation

* separate jobs for ltsc2019 & ltsc2022

* separate jobs for ltsc2019 & ltsc2022

* update dev image docker file & script

* remove unnecessary task

* update prod pipeline yaml for windows multi-arc image

* test yamls for ltsc2019 & ltsc2022

* fix pr checker fail

* fix repoImageWindows path in windows pipeline

* remove passing imagetag for prod

* CA Cert Fix for Mariner Hosts in Air Gap (#751)

* add cifs & fuse file systems to ignore list (#750)

* Data collection script (#759)

* Add files via upload

* Add files via upload

* Delete AKSInsightsLogCollection.sh

* Create README.md

* Add files via upload

* move script to subfolder LogCollection

* Update README.md

* Rename AKSInsightsLogCollection.sh to AgentLogCollection.sh

* Microsoft mandatory file (#763)

Co-authored-by: microsoft-github-policy-service[bot] <77245923+microsoft-github-policy-service[bot]@users.noreply.github.com>

* Adding v2 schema options (#762)

* Adding v2 schema options

Adding commented out section in log collection settings for v2 schema

* adding documentation link

* Agent release for ciprod05192022 and win-ciprod05192022  (#765)

* Making changes for the release ciprod05192022 (except release notes)

* Adding release notes

* Remove unnecessary spaces

* Updating release notes for configmap v2 and disk usage metrics

Co-authored-by: Ganga Mahesh Siddem <[email protected]>
Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>
Co-authored-by: bragi92 <[email protected]>
Co-authored-by: saaror <[email protected]>
Co-authored-by: Grace Wehner <[email protected]>
Co-authored-by: deagraw <[email protected]>
Co-authored-by: David Michelman <[email protected]>
Co-authored-by: Michael Sinz <[email protected]>
Co-authored-by: Nicolas Yuen <[email protected]>
Co-authored-by: seenu433 <[email protected]>
Co-authored-by: Tsubasa Nomura <[email protected]>
Co-authored-by: Vladimir <[email protected]>
Co-authored-by: Vladimir Babichev <[email protected]>
Co-authored-by: sarahpeiffer <[email protected]>
Co-authored-by: Anders Johansen <[email protected]>
Co-authored-by: Amol Agrawal <[email protected]>
Co-authored-by: Amol Agrawal <[email protected]>
Co-authored-by: Nina <[email protected]>
Co-authored-by: microsoft-github-policy-service[bot] <77245923+microsoft-github-policy-service[bot]@users.noreply.github.com>
Co-authored-by: Auston Li <[email protected]>
ganga1980 added a commit that referenced this pull request Sep 22, 2022
* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

* Gangams/cluster creation scripts (#414)

* onprem k8s script

* script updates

* scripts for creating non-aks clusters

* fix minor text update

* updates

* script updates

* fix

* script updates

* fix scripts to install docker

* fix: Pin to a particular version of ltsc2019 by SHA (#427)

* enable collecting npm metrics (optionally) (#425)

* enable collecting npm metrics (optionally)

* fix default enrichment value

* fix adx

* Saaror patch 3 (#426)

* Create README.MD

Creating content for Kubecon lab

* Update README.MD

* Update README.MD

* Gangams/add containerd support to windows agent (#428)

* wip

* wip

* wip

* wip

* bug fix related to uri

* wip

* wip

* fix bug with ignore cert validation

* logic to ignore cert validation

* minor

* fix minor debug log issue

* improve log message

* debug message

* fix bug with nullorempty check

* remove debug statements

* refactor parsers

* add debug message

* clean up

* chart updates

* fix formatting issues

* Gangams/arc k8s metrics  (#413)

* cluster identity token

* wip

* fix exception

* fix exceptions

* fix exception

* fix bug

* fix bug

* minor update

* refactor the code

* more refactoring

* fix bug

* typo fix

* fix typo

* wait for 1min after token renewal request

* add proxy support for arc k8s mdm endpoint

* avoid additional get call

* minor line ending fix

* wip

* have separate log for arc k8s cluster identity

* fix bug on creating crd resource

* remove update permission since not required

* fixed some bugs

* fix pr feedback

* remove list since its not required

* fix: Reverting back to ltsc2019 tag (#429)

* more kubelet metrics (#430)

* more kubelet metrics

* clean up new config

* fix nom issue when config is empty (#432)

* support multiple docker paths when docker root is updated thru knode (#433)

* Gangams/doc and other related updates (#434)

* bring back nodeselector changes for windows agent ds

* readme updates

* chart updates for azure cluster resourceid and region

* set cluster region during onboarding for managed clusters

* wip

* fix for onboarding script

* add sp support for the login

* update help

* add sp support for powershell

* script updates for sp login

* wip

* wip

* wip

* readme updates

* update the links to use ci_prod branch

* fix links

* fix image link

* some more readme updates

* add missing serviceprincipal in ps scripts (#435)

* fix telemetry bug (#436)

* Gangams/readmeupdates non aks 09162020 (#437)

* changes for ciprod09162020 non-aks release

* fix script to handle cross sub scenario

* fix minor comment

* fix date in version file

* fix pr comments

* Gangams/fix weird conflicts (#439)

* separate build yamls for ci_prod branch (#415) (#416)

* [Merge] dev to prod for ciprod08072020 release (#424)

* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>

* fix quote issue for the region (#441)

* fix cpucapacity/limit bug (#442)

* grwehner/pv-usage-metrics (#431)

- Send persistent volume usage and capacity metrics to LA for PVs with PVCs at the pod level; config to include or exclude kube-system namespace.
- Send PV usage percentage to MDM if over the configurable threshold.
- Add PV usage recommended alert template.
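A minimal sketch of the threshold check described in the bullets above, written in Python purely for illustration. The function names and the 60% default are hypothetical and are not taken from the agent's code or configuration.

```python
def pv_usage_percentage(used_bytes: int, capacity_bytes: int) -> float:
    """Return PV usage as a percentage of capacity (0.0 if capacity is unknown)."""
    if capacity_bytes <= 0:
        return 0.0
    return 100.0 * used_bytes / capacity_bytes


def should_send_to_mdm(used_bytes: int, capacity_bytes: int,
                       threshold_percent: float = 60.0) -> bool:
    """Emit the metric only when usage is strictly over the threshold.

    threshold_percent stands in for the configurable threshold; the
    default value here is an assumption made for this sketch.
    """
    return pv_usage_percentage(used_bytes, capacity_bytes) > threshold_percent
```

With these assumptions, a volume at exactly the threshold is not reported; only usage above it is.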

* add new custom metric regions (#444)

* add new custom metric regions

* fix commas

* add 'Terminating' state (#443)

* Gangams/sept agent release tasks (#445)

* turnoff mdm nonsupported cluster types

* enable validation of server cert for ai ruby http client

* add kubelet operations total and total error metrics

* node selector label change

* label update

* wip

* wip

* wip

* revert quotes

* grwehner/pv-collect-volume-name (#448)

Collect and send the volume name as another tag for pvUsedBytes in InsightsMetrics, so that it can be displayed in the workload workbook. Does not affect the PV MDM metric

* Changes for september agent release (#449)

Moving from v1beta1 to v1 for health CRD
Adding timer for zero filling
Adding zero filling for PV metrics

* Gangams/arc k8s related scripts, charts and doc updates (#450)

* checksum annotations

* script update for chart from mcr

* chart updates

* update chart version to match with chart release

* script updates

* latest chart updates

* version updates for chart release

* script updates

* script updates

* doc updates

* doc updates

* update comments

* fix bug in ps script

* fix bug in ps script

* minor update

* release process updates

* use consistent name across scripts

* use consistent names

* Install CA certs from wireserver (#451)

* grwehner/pv-volume-name-in-mdm (#452)

Add volume name for PV to mdm dimensions and zero fill it

* Release changes for 10052020 release (#453)

* Release changes for 10052020 release

* remove redundant kubelet metrics as part of PR feedback

* Update onboarding_instructions.md (#456)

* Update onboarding_instructions.md

Updated the documentation to reflect where to update the config map.

* Update onboarding_instructions.md

* Update onboarding_instructions.md

* Update onboarding_instructions.md

Updated the link

* chart update for sept2020 release (#457)

* add missing version update in the script (#458)

* November release fixes - activate one agent, adx schema v2, win perf issue, syslog deactivation (#459)

* activate one agent, adx schema v2, win perf issue, syslog deactivation

* update chart

* remove hyphen for params in chart (#462)

Merging as it's a simple fix (remove hyphen)

* Changes for cutting a new build for ciprod10272020 release (#460)

* using latest stable version of msys2 (#465)

* fixing the windows-perf-dups (#466)

* chart updates related to new microsoft/charts repo (#467)

* Changes for creating 11092020 release (#468)

* MDM exception aggregation (#470)

* grwehner/mdm custom metric regions (#471)

Remove custom metrics region check for public cloud

* updating rs limit to 1gb (#474)

* grwehner/pv inventory (#455)

Add fluentd plugin to request persistent volume info from the kubernetes api and send to LA

* Gangams/fix for build release pipeline issue (#476)

* use isolated cdpx acr

* correct comment

* add pv fluentd plugin config to helm rs config (#477)

* add pv fluentd plugin to helm rs config

* helm rbac permissions for pv api calls

* Gangams/fix rs ooming (#473)

* optimize kpi

* optimize kube node inventory

* add flags for events, deployments and hpa

* have separate function parseNodeLimits

* refactor code

* fix crash

* fix bug with service name

* fix bugs related to get service name

* update oom fix test agent

* debug logs

* fix service label issue

* update to latest agent and enable ephemeral annotation

* change stream size to 200 from 250

* update yaml

* adjust chunksizes

* add ruby gc env

* yaml changes for cioomtest11282020-3

* telemetry to track pods latency

* service count telemetry

* rename variables

* wip

* nodes inventory telemetry

* configmap changes

* add emit streams in configmap

* yaml updates

* fix copy and paste bug

* add todo comments

* fix node latency telemetry bug

* update yaml with latest test image

* fix bug

* upping rs memory change

* fix mdm bug with final emit stream

* update to latest image

* fix pr feedback

* fix pr feedback

* rename health config to agent config

* fix max allowed hpa chunk size

* update to use 1k pod chunk since validated on 1.18+

* remove debug logs

* minor updates

* move defaults to common place

* chart updates

* final oomfix agent

* update to use prod image so that can be validated with build pipeline

* fix typo in comment

* Gangams/enable arc onboarding to ff (#478)

* wip

* updates

* trigger login if the ctx cloud not same as specified cloud

* add missed commit

* Convert PV type dictionary to json for telemetry so it shows up in logs (#480)

* fix 2 windows tasks - 1) Don't log to termination log 2) enable ADX route for containerlogs in windows (for O365) (#482)

* fix ci envvar collection in large pods (#483)

* grwehner/jan agent tasks (#481)

- Windows agent fix to use log filtering settings in config map.
- Error handling for kubelet_utils get_node_capacity in case /metrics/cadvisor endpoint fails.
- Remove env variable for workspace key for windows agent

* updating fbit version and cpu limit (#485)

* reverting to older version (#487)

* Gangams/add fbsettings configurable via configmap (#486)

* wip

* fbit config settings

* add config warn message

* handle one config provided but not other

* fixed pr feedback

* fix copy paste error

* rename config parameter names

* fix typo

* fix fbit crash in helm path

* fix nil check

* Gangams/jan agent release tasks (#484)

* wip

* explicit amd64 affinity for hybrid workloads

* fix space issue

* wip

* revert vscode setting file

* remove per container logs in ci (#488)

* updates for ciprod01112021 release (#489)

* new yaml files (#491)

* Use cloud-specific instrumentation keys (#494)

If APPLICATIONINSIGHTS_AUTH_URL is set/non-empty then the agent will now grab a custom IKey from a URL stored in APPLICATIONINSIGHTS_AUTH_URL
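The lookup described above could be sketched roughly as follows. This is a Python illustration only, not the agent's implementation: `resolve_ikey`, `fetch_text`, and `DEFAULT_IKEY` are hypothetical names, and falling back to the default key when the fetch fails is an assumption of this sketch.

```python
import os
from typing import Callable, Mapping
from urllib.request import urlopen

# Hypothetical placeholder for the built-in (default) instrumentation key.
DEFAULT_IKEY = "default-instrumentation-key"


def fetch_text(url: str) -> str:
    """Download the body at `url` and return it as stripped text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8").strip()


def resolve_ikey(env: Mapping[str, str] = os.environ,
                 fetch: Callable[[str], str] = fetch_text,
                 default: str = DEFAULT_IKEY) -> str:
    """Use a custom IKey when APPLICATIONINSIGHTS_AUTH_URL is set and non-empty."""
    url = env.get("APPLICATIONINSIGHTS_AUTH_URL", "").strip()
    if not url:
        return default
    try:
        # Assumption: an empty response is treated like a missing key.
        return fetch(url) or default
    except OSError:
        # Assumption: network errors fall back to the default key.
        return default
```

Passing `env` and `fetch` as parameters keeps the sketch testable without network access.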

* upgrade apt to latest version (#492)

* upgrade apt to latest version

* fix pr feedback

* Gangams/add support for extension msi for arc k8s cluster (#495)

* wip

* add env var for the arc k8s extension name

* chart update

* extension msi updates

* fix bug

* revert chart and image to prod version

* minor text changes

* image tag to prod

* wip

* wip

* wip

* wip

* final updates

* fix whitespaces

* simplify crd yaml

* Gangams/arm template arc k8s extension (#496)

* arm templates for arc k8s extension

* update to use official extension type name

* update

* add identity property

* add proxyendpointurl parameter

* add default values

* Gangams/aks monitoring via policy (#497)

* enable monitoring through policy

* wip

* handle tags

* wip

* add alias

* wip

* working

* updates

* working

* with deployment name

* doc updates

* doc updates

* fix typo in the docs

* revert to use operatingSystem from osImage for node os telemetry (#498)

* Container log v2 schema changes (#499)

* make pod name in mdsd definition as str for consistency. msgp has no type checking, as it has type metadata in the message itself.

* Add priority class to the daemonsets (#500)

* Add priority class to the daemonsets

Add a priority class for omsagent and have the daemonsets use this
to be sure to schedule the pods.

Daemonset pods are constrained in scheduling to run on specific
nodes.  This is done by the daemonset controller.  When a node shows
up it will create a pod with a strong affinity to that node.  When a
node goes away, it will delete the pod with the node affinity to that
node.

Kubernetes pod scheduling does not know it is a daemonset but it does
know it is tied to a specific node.  With default scheduling, it is
possible for the pods to be "frozen out" of a node because the node
already is full.  This can happen because "normal" pods may already
exist and are looking for a node to get scheduled on when a node is
added to the cluster.  The daemonset controller only creates the
pod for the node at around the same time.  The kubernetes
scheduler is running async from all of this and thus there can be a
race as to who gets scheduled on the node.

The pod priority class (and thus the pod priority) is a way to indicate
that the pod has a higher scheduling priority than a default pod.

By default, all pods are at priority 0.  Higher numbers are higher
priority.  Setting the priority to something greater than zero will
allow the omsagent daemonsets to win a race against "normal" pods for
scheduled resources on a node - and will also allow for graceful
eviction in the case the node is too full.

Without this, omsagent can be left off nodes in clusters that are
very busy, especially in dynamic scaling situations.

I did not test the windows pod as we have no windows clusters.
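
The race described above can be illustrated with a toy model in Python. This is a sketch under simplifying assumptions, not the Kubernetes scheduler: it considers a single node and CPU only, it does not model preemption or eviction, and the `Pod` class, pod names, and CPU figures are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Pod:
    name: str
    cpu: int           # requested CPU, in millicores
    priority: int = 0  # pods default to priority 0


def schedule(pending: List[Pod], node_cpu: int) -> List[str]:
    """Place pending pods on one node, highest priority first."""
    placed = []
    # sorted() is stable, so pods of equal priority keep their arrival order.
    for pod in sorted(pending, key=lambda p: -p.priority):
        if pod.cpu <= node_cpu:
            node_cpu -= pod.cpu
            placed.append(pod.name)
    return placed
```

In this model, with every pod at priority 0, an agent pod that arrives after enough "normal" pods to fill the node is not placed; giving it a priority above zero moves it to the front of the queue.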

* CR feedback

* fix node metric issue (#502)

* Bug fixes for Feb release (#504)

* bug fix for mdm metrics with no limits

* fix exception bug

* Gangams/feb 2021 agent bug fix (#505)

* fix npe in getKubeServiceRecords

* use image fields from spec

* fix typo

* cover all cases

* handle scenario only digest specified

* changes for release -ciprod02232021 (#506)

* Gangams/e2e test framework (#503)

* add agent e2e fw and tests

* doc and script updates

* add validation script

* doc updates

* yaml updates

* fix typo

* doc updates

* more doc updates

* add ISTEST for helm chart to use arc conf

* refactor test code

* fix pr feedback

* fix pr feedback

* fix pr feedback

* fix pr feedback

* scrape new kubelet pod count metric name (#508)

* Adding explicit json output to az commands as the script fails if az is configured with Table output #409 (#513)

* Gangams/arc proxy contract and token renewal updates (#511)

* fix issue with crd status updates

* handle renewal token delays

* add proxy contract

* updates for proxy cert for linux

* remove proxycert related changes

* fix whitespace issue

* fix whitespace issue

* remove proxy in arm template

* doc updates for microsoft charts repo release (#512)

* doc updates for microsoft charts repo release

* wip

* Update enable-monitoring.sh (#514)

Lines 314 and 343 seem to have trailing spaces for some subscriptions, which causes the script to exit even in valid scenarios

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Prometheus scraping from sidecar and OSM changes (#515)

* add liveness timeout for exec (#518)

* chart and other updates (#519)

* Saaror osmdoc (#523)

* Create ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Add files via upload

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* telemetry bug fix (#527)

* Fix conflicting logrotate settings (#526)

The node and the omsagent container both have a cron.daily file to rotate certain logs daily. These settings are the same for some files in /var/log (mounted from the node with read/write access), causing the rotation to fail when both try to rotate at the same time. So then the /var/log/*.1 file is written to forever. Since these files are always written to and never rotated, it causes high memory usage on the node after a while.

This fix removes the container logrotate settings for /var/log, which the container does not write to.

* bug fix (#528)

* Gangams/arc ev2 deployment (#522)

* ev2 deployment for arc k8s extension

* fix charts path issue

* rename scripts tar

* add notifications

* fix line endings

* fix line endings

* update with prod repo

* fix file endings

* added liveness and telemetry for telegraf (#517)

* added liveness and telemetry for telegraf

* code transfer

* removed windows liveness probe

* done

* Windows metric fix (#530)

* changes

* about to remove container fix

* moved caching code to existing loop

* removed un-necessary changes

* removed a few more un-necessary changes

* added windows node check

* fixed a bug

* everything works confirmed

* OSM doc update (#533)

* Adding MDM metrics for threshold violation (#531)

* Rashmi/april agent 2021 (#538)

* add Read_from_Head config for all fluentbit tail plugins (#539)

See the commit message of: fluent/fluent-bit@70e33fa
for details explaining the fluentbit change and what Read_from_Head does when set to true.

* fix programdata mount issue on containerd win nodes (#542)

* Update sidecar mem limits  (#541)

* David/release 4 22 2021 (#544)

* updating image tag and agent version

* updated liveness probe

* updated release notes again

* fixed date in version file

* 1m, 1m, 1s by default (#543)

* 1m, 1m, 1s by default

* setting default through a different method

* David/aad stage 1 release (#556)

* update to latest omsagent, add eastus2 to mdsd regions

* copied oneagent bits to a CI repository release

* mdsd inmem mode

* yaml for cl scale test

* yaml for cl scale test

* reverting dockerProviderVersion version to 15.0.0

* prepping for release (updated image version, dockerProviderVersion, and release notes)

* container log scaletest yamls

* forgot to update image version in chart

* fixing windows tag in dockerfile, changing release notes wording

* missed windows tag in one more place

* forgot to change the windows dockerProviderVersion back

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Update ReleaseNotes.md (#558)

fix imagetag in the release notes

* Add wait time for telegraf and also force mdm egress to use tls 1.2 (#560)

* Add wait time for telegraf and also force mdm egress to use tls 1.2

* add wait for all telegraf dependencies across all containers (ds & rs)

* remove ssl change so we don't include it as part of the other fix until we test with att nodes.

* partially disabled telegraf liveness probe check, we'll still have telemetry but the probe won't fail if telegraf isn't running (#561)

* changes for 05202021 release (#563)

* changes for 05202021 release

* fixed typos

* Rashmi/jedi wireserver (#566)

* Update ReadMe.md (#565)

* Update ReadMe.md

* Update ReadMe.md

Included feedback from OSM team and Fixed

* Gangams/aad stage2 full switch to mdsd (#559)

* full switch to mdsd, upgrade to ruby v1 & omsagent removal

* add odsdirect as fallback option

* cleanup

* cleanup

* move customRegion to stage3

* updates related to containerlog route

* make xml eventschema consistent

* add buffer settings

* address HTTPServerException deprecation in ruby 2.6

* update to official mdsd version

* fix log message issue

* fix pr feedback

* get rid of unused code from omscommon

* fix pr feedback

* fix pr feedback

* clean up

* clean up

* fix missing conf

* Send perf metrics to MDM from windows daemonset (#568)

* updating json gem to address CVE-2020-10663 (#567)

* updating json gem to address CVE-2020-10663

* updating json gem to address CVE-2020-10663

* update recommended alerts readme (#570)

@dcbrown16 pointed out that this page links to the wrong document in [this issue](#475). The content in the currently linked page is identical to the page which should be linked, so it's a simple fix.

* trying again to fix the json gem (#571)

* trying again to fix the json gem

* removing installation of newer json gem

* Addressing PR comments for - #568 (#569)

* Mem_Buf_limit  is configurable via ConfigMap (#574)

* add log rotation settings for fluentd logs (#577)

* Gangams/release 06112021 (#578)

* updates related to ciprod06112021 release

* minor update

* release note update (#579)

* Make sidecar fluentbit chunk size configurable (#573)

* Fix vulnerabilities (#583)

* test

* test1

* test-2

* test-3

* 3

* 4

* test

* 2

* 3

* 4

* 5

* 6

* rename gem for windows

* fix

* fix

* Windows build optimization (#582)

* fix windows build failure due to msys2 version

* Fix telegraf startup issue when endpoint is unreachable (#587)

* revert fbit tail plugins defaults to std defaults (#586)

* fixed another bug (#593)

* feat: add new metrics to MDM for allocatable % calculation of cpu and memory usage (#584)

* feat: allocatable cpu and memory % metrics for MDM

* maybe

* linux is working

* windows....

* some more

* comment

* better

* syntax

* ruby

* revert omsagent.yaml

* comments

* pr feedback

* pr feedback

* testing msys2 version update

* better

* update adx sdk for perf issue (#601)

* remove md check

* Gangams/release notes update for hotfix (#596)

* release notes updates

* release notes updates for ciprod06112021-1

* Cherry picking hotfix changes to ci_dev (#605)

* release changes (#607)

* Gangams/aad stage3 msi auth (#585)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* refactor the windows agent ingestion token code

* code cleanup

* fix build errors

* code clean up

* code clean up

* code clean up

* code clean up

* more refactoring

* fix bug

* fix bug

* add debug logs

* add nil checks

* revert changes

* revert yaml change since this added in aks side

* fix pr feedback

* fix pr feedback

* refine retry code

* update mdsd env as per official build

* cleanup

* update env vars per mdsd

* update with mdsd official build

* skip cert gen & renewal incase of aad msi auth

* add nil check

* cherry windows agent nodeip issue

* fix merge issue

Co-authored-by: rashmichandrashekar <[email protected]>

* Gangams/remove chart version dependency (#589)

* remove chart version dependency

* remove unused code

* fix resource type

* fix

* handle weird cli chars

* update release process

* Gangams/july 2021 release tasks 3 (#613)

* use artifact and pipeline creds for image push

* minor update

* add vuln fix here so that pr can be merged

* remove un-used output plugin (#614)

* fix telegraf telemetry and improve fluentd liveness (#611)

* fix telegraf telemetry and improve fluentd liveness

* address identified vuln with libsystemd0

* fix exported image file extension

* Gangams/july 2021 release tasks 2 (#612)

* tail rs mdsd err logs

* configure mdsd log rotation

* log rotation for mdsd log files

* Fix out_oms.go dependency vulnerabilities (#623)

* revert libsystemd0 update (#616)

* updates for ci-prod release instructions (#619)

* cherry pick changes from ci_prod (#622)

* Support az login for passwords starting with dash ('-') (#626)

Co-authored-by: Vladimir Babichev <[email protected]>

* Gangams/add telemetry fbit settings (#628)

* add telemetry to track fbit settings

* add telemetry to track fbit settings

* check onboarding status (#629)

* Gangams/arc k8s conformance test updates (#617)

* conf test updates

* clean up

* wip

* update with mcr cidev image

* handle log path

* cleanup

* clean up

* wip

* working

* update for mcr image

* minor

* image update

* handle latency of connected cluster resource creation

* update conftest image

* upgrade golang version for windows in pipeline build and locally (#630)

* Updating a link in Readme.md (#632)

The link to the build pipelines now goes directly to our build pipelines (instead of to all github-private pipelines)

* Updating omsagent yaml to have parity with omsagent yaml file in AKS RP (#615)

* Unit test tooling (#625)

Added tooling and examples for unit tests

* run unit tests after a merge too (#634)

* flag stale PRs & issues

* Adding script to collect logs (for troubleshooting) (#636)

* added script for collecting logs

* added windows daemonset and prometheus sidecar, as well as some explanatory prints

* added kubectl describe and kubectl logs output

* changed message to make it more clear some errors are expected

* Sarah/ev2 (#640)

* ev2 artifacts for release pipeline

* update parameters reference

* add artifacts tar file

* changes to rollout and service model

* change agentimage path

* adding agentimage to artifact script

* removing charts from tarball

* change script to use blob storage

* change blob variables

* echo variables

* change blob uri

* use release id for blob prefix

* change to delete blob file

* add check for if blob storage file exists

* fix script errors

* update check for file in storage

* change true check

* comments and change storage account info to pipeline variables

* Changes for windows tar file

* PR changes

* documenting fbit tail plugin configmap settings. (#638)

* documenting fbit tail plugin configmap settings.

* Install unzip package on shell extension (#642)

* Changing installation in ev2 script (#644)

* Adjust release pipeline to use cdpx acr (#647)

* Adjust release pipeline to use cdpx acr

* Adjust release pipeline to use cdpx acr

* Update CDPX ACR path

* Add check for cdpx repo variable

* Sarah/ev2 prod (#649)

* Ev2 changes for prod

* CDPX repo naming change (#652)

* Sarah/ev2 update (#654)

* remove acr name from repo path

* add check to make sure tag does not exist in mcr repo

* change tag syntax for mcr repo check (#655)

* Gangams/optimize win livenessprobe (#653)

* livenessprobe optimization

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* Gangams/addon token adapter image tag to telemetry (#656)

* addon token adapter image tag

* addon token adapter image tag

* Sarah/ev2 helm (#658)

* Use MSI for Arc Release

* Use CIPROD_ACR AME subscription for shell extension

* remove extra line endings

* Sarah/ev2 pipeline (#661)

* testing build artifact dir changes

* add .pipelines directory and omsagent.yaml to build artifacts

* add charts directory to build artifacts (#662)

* Sarah/remove cdpx creds (#664)

* don't use cdpx acr creds from kv

* add e2etest.yaml to build output

* keep cdpx creds for now

* chart updates for rbac api version change (#660)

* chart updates for rbac api version change

* include windows ds for arc

* proxy support (for non-aks) (#665)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* initial proxy support

* merge?

* cleaning up some files which should've merged differently

* proxy should be working, but most tables don't have any data. About to merge, maybe whatever was wrong is now fixed

* linux AMA proxy works

* about to merge

* proxy support appears to be working, final mdsd build location will still change

* removing some unnecessary changes

* forgot to remove one last change

* redirected mdsd stderr to stdout instead of stdin

* addressing proxy password location comment

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Gangams/agent release ciprod10082021 & win-ciprod10082021 (#666)

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* use buildcommand for prod pipeline (#668)

* fixed merge issues. (#671) (#672)

* fix merge conflicts

* update with newimage tag

* changes related to mdsd version update (#673) (#674)

* Sarah/enable metrics (#675)

* add user assigned msi to yaml for pipeline

* update placeholders

* Gangams/chart updates oct2021 release (#676)

* chart updates for oct2021 release

* wip

* wip

* wip

* Gangams/msi mode mdsd crash fix (#677)

* update mdsd version which has fix for crash in msi mode

* image tag updates

* update to use extension GA api version (#679)

* Gangams/arm template msi onboarding (#659)

* wip

* wip

* working

* working

* working

* working

* working

* working

* shorten dcr prefix to DCR- to handle default workspace name length

* use MSCI- prefix similar to MSVMI- for dcr

* Gangams/conf test updates to handle sidecar (#681)

* wip

* test updates

* fix pr feedback

* fix pr feedback

* Fix scan break due to latest trivy changes

* Anjohans/configurable database name (#663)

* First cut at an implementation

* Reverting a change

* Moving a few lines to better align with cluster URI config

* Moving a few lines to better align with cluster URI config

* Adding an extra check that won't hurt

* Getting ADX database name from config rather than from secret

* Reverse the mangling done by editor

* Fixes to the code for reading the db name setting

* More fixes to the rb code for settings

* Tweaked and tested

* Code review

* Review follow-up

* Remove whitespace

* Gangams/troubelshooting script for arc k8s (#682)

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* doc updates

* doc updates

* wip

* wip

* update repo for issues

* fix minor one

* Sarah/remove cdpx creds (#685)

* remove download of cdpx creds

* fix: subtract number instead of string + update fluentd version 1.14.2 to fix security vulnerability (#686)

* fix: change default value to a number so that subtraction happens correctly

* update fluentd version to 1.14.2

* extra end statement

* safely set to float

* big decimal precision

* revert omsagent

* keep telemetry

* Faster Linux builds (part 1) (#687)

* moved docker image arg later on to enable docker build caching

* fixing image tag (doh)

* Sarah/fluentbit windows log (#688)

* upgrade fluentbit version for windows

* saving progress--fluent bit log tailing working for windows

* use configmap values for fluent-bit.conf where necessary and make necessary files common

* revert certificategenerator

* remove tomlparser-agent-config from linux folder

* clean up fluent.conf

* clean up fluent-bit.conf

* revert image tag

* fix agent tag

* make fluent bit flush interval configurable

* clean up unnecessary conf files

* remove unnecessary parts of fluent and fluent-bit conf

* log level back to info

* add fbit env variables for omsagent-win

* moving db files to var directory

* default to port 10250 & containerd for linux agent (#699)

* default to port 10250 & containerd

* fix pr feedback

* Updating pod annotation for latest agent version (#697)

* fix windows build failure due to msys2 version (#700)

* fix windows build failure due to msys2 version

* 20211130.0.0

* Jan agent tasks (#698)

* remove v1 fallback hidden option (#705)

* collect telemetry containerlog records with emptystamp (#703)

* collect telemetry containerlog records with emptystamp

* collect telemetry containerlog records with emptystamp

* Fixing telegraf bug for placeholder name (#706)

* Gangams/jan 2022 release tasks 3 (#702)

* add telemetry related to windows containers records

* add telemetry related to windows containers records

* containercount telemetry

* add explicit exit code in ps scripts

* node count telemetry

* telemetry for win cirecord 64KB or more

* metric to track wintelegraf metrics with tags 64kb

* metric to track wintelegraf metrics with tags 64kb

* fix pr feedback

* Gangams/jan 2022 release tasks 2 (#701)

* mdsd proc cpu and memory telemetry

* write ai logs to file and telemetry for mdsd proc

* write ai logs to file and telemetry for mdsd proc

* write ai logs to file and telemetry for mdsd proc

* fix pr feedback

* use name_prefix

* remove mdsd telemetry changes

* remove mdsd telemetry changes

* remove mdsd telemetry changes

* release updates for ciprod01312022 & win-ciprod01312022release (#707)

* release updates for ciprod01312022 release

* release updates for ciprod01312022 release

* fix pr feedback

* fix logger exception (#709)

* Gangams/chart version update for jan release (#710)

* chart updates for jan2022 release

* add missing agentversion annotations

* fix agentversion annotation issue in chart (#712)

* adx bug + misc (#714)

* fix golang dependencies

* fix adx bug

* exclude telegraf

* fix space

* include both

* exclude files specifically

* fix build break (#715)

* fix build break

* update all places

* Explicitly use win-2019 to unblock windows PRs builds

* Fixing telegraf vulnerability (#716)

* cherry picked changes from 03112022 release (#719)

* cherry picked changes from 03112022 release

* Gangams/http proxy support (#717)

* add proxy cert support

* add proxy cert support

* add proxy cert support

* add proxy cert support

* remove arbitrary username and pwd requirement

* remove arbitrary username and pwd requirement

* add proxy support for mdm

* mdsd dev build

* proxy changes

* fix typo

* mdsd dev build

* add libcurl specific things

* working mdsd proxy build

* mdsd official master build

* handle proxy endpoint which ends with /

* latest official mdsd build

* add telemetry to track proxy ca cert

* build multi-arch images (#704)

* build multi-arch linux images
* new pipelines to build multi-arch images

Co-authored-by: Amol Agrawal <[email protected]>

* add missing artifacts (#720)

* add missing artifacts

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/msi  onboarding arm template updates for AKS (#721)

* msi arm template updates

* handle space in location

* minor fixes (#722)

Co-authored-by: Amol Agrawal <[email protected]>

* specify go patch version (#723)

* specify go minor version

Co-authored-by: Amol Agrawal <[email protected]>

* User/amagraw/ciprod release 20220317 (#724)

* ciprod release march changes

Co-authored-by: Amol Agrawal <[email protected]>

* Remove health type from DCR onboarding & add private link support for windows agent in msi mode (#727)

* add private link support for windows agent in msi auth

* remove Microsoft-KubeHealth

* add private link support for windows msi

* fix bug

* fix bug

* fix bug

* fix bug

* check platform specific tags (#730) (#731)

* PodReadyPercentage metric bug fix (#734)

* update windows to ruby 2.7 (#732)

Co-authored-by: Amol Agrawal <[email protected]>

* Improve CI/CD for multi-arch (#733)

* selective push + trivy test

* keep size down

* improve CI and PR builds

* improve checks

* remove IMAGE_TAG build_arg from prod pipeline

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/ts updates for msi (#736)

* ts updates for msi based onboarding

* ts updates for msi based onboarding

* fix typo

* fix typo

* improve log message

* Sarah/health deprecation (#735)

Removes all health feature related code

* check platform specific tags (#738)

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/msi test instructions (#739)

* instructions for msi test validation

* readme updates

* readme updates

* readme updates

* readme updates

* Add CI Windows Build to MultiArch Dev pipeline (#740)

* test image in pools

* update dev pipeline - 1

* update dev -1

* fix job names

* correct paths

* test pool name

* update pool name

* updated urls

* speed up installs

* add base build

* fix paths

* do both builds

* fix bug

* add pool for common

* fix bug

* create path

* temp remove metadata windows

* fix bug

* fix docker command

* almost there

* login to acr

* create windows metadata file

* address PR comments I

Co-authored-by: Amol Agrawal <[email protected]>

* Add Windows phase (#741)

* build and release windows for prod

Co-authored-by: Amol Agrawal <[email protected]>

* Sarah/add onboarding templates (#742)

* add onboarding templates for legacy auth

* fix download (#749)

Co-authored-by: Amol Agrawal <[email protected]>

* force run trivy stage (#745)

- scans for HIGH, MEDIUM, CRITICAL CVEs with fixes available in / and /usr/lib
- breaks build if CVEs with existing fixes found
- adds trivyignore to accommodate CVEs which are understood and should not get flagged
- adds CVEs to trivyignore to unblock builds; CVEs will be fixed and removed from trivyignore in later PRs
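
The gating rule in the bullets above can be sketched roughly as follows. This is only an illustration in Python, not the pipeline's implementation: the real check is done by trivy with its ignore file, and the `Finding` record, the `blocking_findings` helper, and the placeholder CVE IDs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set

# Severities the commit message says are scanned.
BLOCKING_SEVERITIES = {"MEDIUM", "HIGH", "CRITICAL"}


@dataclass(frozen=True)
class Finding:
    """Hypothetical record for one scanner finding."""
    cve_id: str
    severity: str
    fixed_version: Optional[str]  # None when no fix is available yet


def blocking_findings(findings: Iterable[Finding], ignored: Set[str]) -> List[Finding]:
    """Return the findings that should break the build.

    A finding blocks only if its severity is scanned, a fix exists,
    and its ID is not listed in the ignore set.
    """
    return [
        f
        for f in findings
        if f.severity in BLOCKING_SEVERITIES
        and f.fixed_version is not None
        and f.cve_id not in ignored
    ]


if __name__ == "__main__":
    sample = [
        Finding("CVE-0000-0001", "HIGH", "1.2.3"),      # fix exists: blocks
        Finding("CVE-0000-0002", "CRITICAL", None),     # no fix yet: skipped
        Finding("CVE-0000-0003", "MEDIUM", "4.5.6"),    # ignored: skipped
        Finding("CVE-0000-0004", "LOW", "7.8.9"),       # severity not scanned
    ]
    failing = blocking_findings(sample, ignored={"CVE-0000-0003"})
    for f in failing:
        print(f"blocking: {f.cve_id} ({f.severity}), fixed in {f.fixed_version}")
    raise SystemExit(1 if failing else 0)
```

Keeping the ignore list as plain data mirrors the intent stated above: an entry is removed once the corresponding CVE is fixed, and the build then blocks on it again if it reappears.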

Co-authored-by: Amol Agrawal <[email protected]>

* update telegraf to 1.22.2 to fix vulns (#752)

* update telegraf to 1.22.2 to fix vulns

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/arc k8s aad msi auth  (#743)

* arc k8s msi

* wip

* extension identity role

* imds sidecar integration for arc k8s

* imds sidecar integration for arc k8s

* imds endpoint for windows

* imds endpoint for windows

* wip

* fix exception

* rename param name

* arc msi imdsd container changes

* arc msi imdsd container changes

* arc msi imdsd container changes

* arc msi imdsd container changes

* arc msi imdsd container changes

* revert unneeded yaml changes

* revert unneeded yaml changes

* wip

* wip

* working

* working

* working

* add implementation for msi token for windows mdm metrics

* fix comment

* arc k8s msi onboarding templates

* fix template bug

* fix template bug

* fix template bug

* rename flag name

* fix template bug

* make useAADAuth specific to arc k8s

* set k8sport at machine scope for windows

* fix bug

* fix bug

* update rbac for arc k8s imds

* bump chart version for conformance test run

* conf test updates for msi auth

* cli extension whl file

* add containerinsights solution in msi auth mode

* unify tags

* revert test chart and image versions

* remove test whl file and fix conf test

* conf test updates for addon-token-adapter

* remove container insights solution add for msi auth

* add missing arm template param

* Gangams/ws2022 support (#756)

* use hyperv isolation

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* doc and script updates

* add common as dependency for multi-arc job

* merge into single job for perf evaluation

* merge into single job for perf evaluation

* merge into single job for perf evaluation

* separate jobs for ltsc2019 & ltsc2022

* separate jobs for ltsc2019 & ltsc2022

* update dev image docker file & script

* remove unnecessary task

* update prod pipeline yaml for windows multi-arc image

* test yamls for ltsc2019 & ltsc2022

* fix pr checker fail

* fix repoImageWindows path in windows pipeline

* remove passing imagetag for prod

* CA Cert Fix for Mariner Hosts in Air Gap (#751)

* add cifs & fuse file systems to ignore list (#750)

* Data collection script (#759)

* Add files via upload

* Add files via upload

* Delete AKSInsightsLogCollection.sh

* Create README.md

* Add files via upload

* move script to subfolder LogCollection

* Update README.md

* Rename AKSInsightsLogCollection.sh to AgentLogCollection.sh

* Microsoft mandatory file (#763)

Co-authored-by: microsoft-github-policy-service[bot] <77245923+microsoft-github-policy-service[bot]@users.noreply.github.com>

* Adding v2 schema options (#762)

* Adding v2 schema options

Adding commented out section in log collection settings for v2 schema

* adding documentation link

* Agent release for ciprod05192022 and win-ciprod05192022  (#765)

* Making changes for the release ciprod05192022 (except release notes)

* Adding release notes

* Remove unnecessary spaces

* Updating release notes for configmap v2 and disk usage metrics

* trivy image scan (#770)

* do trivy image check in azure pipelines

* remove pr-checker github action

Co-authored-by: Amol Agrawal <[email protected]>

* Prometheus sidecar memory optimization  (#769)

Don't start telegraf, mdsd, and fluent-bit in the prometheus sidecar if it has no work to do (monitor_kubernetes_pods = false and no OSM namespaces to scrape). This part is just a resource-usage optimization.

Write the newly created environment variables to a file, because variables added to bashrc are not accessible when a script runs in a non-interactive environment. This is the case for livenessprobe.sh.
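
The start-up decision described above could look something like the sketch below. It is written in Python purely for illustration; the function names are hypothetical and the agent's own start-up logic lives in its scripts, not in this code.

```python
from typing import List, Sequence

# Processes the commit message says are skipped when there is nothing to scrape.
SIDECAR_PROCESSES = ["telegraf", "mdsd", "fluent-bit"]


def sidecar_has_work(monitor_kubernetes_pods: bool, osm_namespaces: Sequence[str]) -> bool:
    """True if the prometheus sidecar has anything to scrape."""
    return monitor_kubernetes_pods or len(osm_namespaces) > 0


def processes_to_start(monitor_kubernetes_pods: bool, osm_namespaces: Sequence[str]) -> List[str]:
    """Return the processes to launch, or an empty list when the sidecar is idle."""
    if not sidecar_has_work(monitor_kubernetes_pods, osm_namespaces):
        return []
    return list(SIDECAR_PROCESSES)


if __name__ == "__main__":
    print(processes_to_start(False, []))           # idle sidecar
    print(processes_to_start(False, ["bookstore"]))  # hypothetical OSM namespace
```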

* Gangams/fix telegraf issue (#773)

* avoid imds token call during start up

* avoid imds token call during start up

* Make metrics endpoint variable on ArcA cluster (#772)

* add integration for azure subnet ip usage (#774)

* add integration for azure cni subnet ip usage

* exclude unfixed cve & remove fixed one

* Gangams/rs hyper scale 2022 ready (#753)

* watch and multiproc implementation

* fix weird bug

* multiproc support for fluentd

* working

* fix log lines

* refactor code

* cache telemetry

* nodecount telemetry

* bug fix

* further optimize

* bugfix related typo

* node allocatable cache

* wincontainerinventory in multiproc

* disable health

* config events on different core

* add ts to logs

* move kube perf records to separate plugin

* refactor

* minor update

* remove commented code

* mdm state file

* mdm state file

* podmdm to separate plugin

* bug fixes

* bug fixes

* bug fixes

* podmdm plugin

* bug fixes

* bug fixes

* remove unneeded log lines

* more improvements

* clean up

* clean up

* add requestId header for mdm metrics

* latest mdsd and fix for threading issue in out mdm

* rs specific config for large cluster

* optimize out mdm

* bug fix

* use large queue limit for kube perf

* 5k preview rs limits

* handle resourceversion empty or 0 scenario

* handle pagination api call failures

* fix bug

* preview image for internal customer validation

* preview image

* wip

* wip

* fix trailing whitespaces

* fix bug

* remove unused envvars in yaml

* revert minor things

* telemetry tags for preview release

* revert preview image tags

* revert unintended change

* fix bug

* use same batchtime for both mdm & podinventory records

* use same batchtime for both mdm & podinventory records

* use same batchtime for both mdm & podinventory records

* use same batchtime for both mdm & podinventory records

* preview image tag with latest ci_dev changes

* change back to use prod image in docker files

* fix unit test failures

* exclude unfixed cve until this get fixed

* fix minor issue

* increase retries to handle transient errors

* changes related to june 2022 release (#778)

* Gangams/ARM Template updates for the DCR API version and stream group (#784)

* update to use stream group

* update DCR api version & stream group

* Bump Newtonsoft.Json in /build/windows/installer/certificategenerator (#785)

Bumps [Newtonsoft.Json](https://github.com/JamesNK/Newtonsoft.Json) from 12.0.3 to 13.0.1.
- [Release notes](https://github.com/JamesNK/Newtonsoft.Json/releases)
- [Commits](JamesNK/Newtonsoft.Json@12.0.3...13.0.1)

---
updated-dependencies:
- dependency-name: Newtonsoft.Json
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Gangams/fix file access exceptions (#787)

* fix file access exception

* move insights metrics conf to common

* clear file content before writing content

* add timestamp to debug logs

* release updates for linux agent

* Adhere to containers security guidance (#783)

- move away from dockerhub images to MCR images
- parameterize images in dockerfiles
- use azure pipelines variables to pass appropriate MCR images during buildtime

Co-authored-by: Amol Agrawal <[email protected]>

* update to DCR & DCR-A api version 2021-04-01 (#789)

* fix telegraf vulns (#795)

Co-authored-by: Amol Agrawal <[email protected]>

* Address vulnerabilities through package updates (#794)

- Updates to ruby 3.1.1
- Uses RVM as ruby manager instead of the brightbox ppa
- Updates fluentd to 1.14.6
- Use default JSON gem instead of yajl-json
- Consume tomlrb as a gem instead of committed source code

Co-authored-by: Amol Agrawal <[email protected]>
Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Gangams/fix log loss inode reuse (#796)

* use ignore_older fbit default and option for configurability

* fix minor comment

* fix minor comment

* merge conflict (#799)

* update vulns (#800)

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/fix permission assignments in test scripts (#802)

* restrict rw permissions to owner

* remove usage of worldwrite file permissions

* remove worldwrite file permission

* remove worldwrite file permission

* Gangams/rs vpa (#801)

* add vpa sidecar container

* add vpa sidecar container

* add vpa sidecar container

* add vpa sidecar container

* use image which has support for only scaling limits

* rename omsagent-rs-vpa to omsagent-vpa

* add vpa configmap

* use updated version of addon-resizer

* collect omsagent-rs limits telemetry if VPA enabled

* ignore new unfixed vulnerabilities

* fix bug

* fix bug

* fix bug

* bug fix

* fix bug

* fix bug

* rename env var name

* use the addon-resizer and collect requests and limits telemetry

* fix bug

* minor update

* User/amagraw/fix milli bytes bug (#805)



Co-authored-by: Amol Agrawal <[email protected]>

* update to use GA labels (#806)

* ciprod08102022 release

* bump rs memory limit

Co-authored-by: Ganga Mahesh Siddem <[email protected]>
Co-authored-by: bragi92 <[email protected]>
Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: saaror <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>
Co-authored-by: Grace Wehner <[email protected]>
Co-authored-by: deagraw <[email protected]>
Co-authored-by: David Michelman <[email protected]>
Co-authored-by: Michael Sinz <[email protected]>
Co-authored-by: Nicolas Yuen <[email protected]>
Co-authored-by: seenu433 <[email protected]>
Co-authored-by: Tsubasa Nomura <[email protected]>
Co-authored-by: Vladimir <[email protected]>
Co-authored-by: Vladimir Babichev <[email protected]>
Co-authored-by: sarahpeiffer <[email protected]>
Co-authored-by: Anders Johansen <[email protected]>
Co-authored-by: Amol Agrawal <[email protected]>
Co-authored-by: Nina <[email protected]>
Co-authored-by: microsoft-github-policy-service[bot] <77245923+microsoft-github-policy-service[bot]@users.noreply.github.com>
Co-authored-by: Auston Li <[email protected]>
Co-authored-by: Janvi Jatakia <[email protected]>
Co-authored-by: MSFTXiangyu <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: bragi92 <[email protected]>
ganga1980 added a commit that referenced this pull request Nov 28, 2022
* reverting to older version (#487)

* Gangams/add fbsettings configurable via configmap (#486)

* wip

* fbit config settings

* add config warn message

* handle one config provided but not other

* fixed pr feedback

* fix copy paste error

* rename config parameter names

* fix typo

* fix fbit crash in helm path

* fix nil check

* Gangams/jan agent release tasks (#484)

* wip

* explicit amd64 affinity for hybrid workloads

* fix space issue

* wip

* revert vscode setting file

* remove per container logs in ci (#488)

* updates for ciprod01112021 release (#489)

* new yaml files (#491)

* Use cloud-specific instrumentation keys (#494)

If APPLICATIONINSIGHTS_AUTH_URL is set/non-empty then the agent will now grab a custom IKey from a URL stored in APPLICATIONINSIGHTS_AUTH_URL
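
A rough sketch of that lookup, in Python for illustration only. The only name taken from the commit message is `APPLICATIONINSIGHTS_AUTH_URL`; the helper name, the injected `fetch` callable, and the fallback to a default key when the variable is unset are assumptions, not the agent's actual code.

```python
from typing import Callable, Mapping


def resolve_instrumentation_key(
    env: Mapping[str, str],
    default_key: str,
    fetch: Callable[[str], str],
) -> str:
    """Pick the instrumentation key to use.

    If APPLICATIONINSIGHTS_AUTH_URL is set and non-empty, the key is
    retrieved from that URL via `fetch`; otherwise the default key is
    used (assumed behaviour).
    """
    auth_url = env.get("APPLICATIONINSIGHTS_AUTH_URL", "").strip()
    if not auth_url:
        return default_key
    return fetch(auth_url).strip()


if __name__ == "__main__":
    # A stand-in fetcher so the sketch runs without network access.
    def fake_fetch(url: str) -> str:
        return "custom-ikey-for-" + url

    print(resolve_instrumentation_key({}, "default-ikey", fake_fetch))
    print(
        resolve_instrumentation_key(
            {"APPLICATIONINSIGHTS_AUTH_URL": "https://example.invalid/ikey"},
            "default-ikey",
            fake_fetch,
        )
    )
```

Passing the fetcher in as a parameter keeps the decision logic testable without a real endpoint.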

* upgrade apt to latest version (#492)

* upgrade apt to latest version

* fix pr feedback

* Gangams/add support for extension msi for arc k8s cluster (#495)

* wip

* add env var for the arc k8s extension name

* chart update

* extension msi updates

* fix bug

* revert chart and image to prod version

* minor text changes

* image tag to prod

* wip

* wip

* wip

* wip

* final updates

* fix whitespaces

* simplify crd yaml

* Gangams/arm template arc k8s extension (#496)

* arm templates for arc k8s extension

* update to use official extension type name

* update

* add identity property

* add proxyendpointurl parameter

* add default values

* Gangams/aks monitoring via policy (#497)

* enable monitoring through policy

* wip

* handle tags

* wip

* add alias

* wip

* working

* updates

* working

* with deployment name

* doc updates

* doc updates

* fix typo in the docs

* revert to use operatingSystem from osImage for node os telemetry (#498)

* Container log v2 schema changes (#499)

* make pod name in mdsd definition a str for consistency. msgp has no type checking, as it carries type metadata in the message itself.

* Add priority class to the daemonsets (#500)

* Add priority class to the daemonsets

Add a priority class for omsagent and have the daemonsets use this
to be sure to schedule the pods.

Daemonset pods are constrained in scheduling to run on specific
nodes.  This is done by the daemonset controller.  When a node shows
up it will create a pod with a strong affinity to that node.  When a
node goes away, it will delete the pod with the node affinity to that
node.

Kubernetes pod scheduling does not know it is a daemonset but it does
know it is tied to a specific node.  With default scheduling, it is
possible for the pods to be "frozen out" of a node because the node
already is full.  This can happen because "normal" pods may already
exist and are looking for a node to get scheduled on when a node is
added to the cluster.  The daemonset controller will only first
create the pod for the node at around the same time.  The kubernetes
scheduler is running async from all of this and thus there can be a
race as to who gets scheduled on the node.

The pod priority class (and thus the pod priority) is a way to indicate
that the pod has a higher scheduling priority than a default pod.

By default, all pods are at priority 0.  Higher numbers are higher
priority.  Setting the priority to something greater than zero will
allow the omsagent daemonsets to win a race against "normal" pods for
scheduled resources on a node - and will also allow for graceful
eviction in the case the node is too full.

Without this, omsagent can be left out of node in clusters that are
very busy, especially in dynamic scaling situations.

I did not test the windows pod as we have no windows clusters.
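
The race described above can be illustrated with a toy model. This is a deliberately simplified Python sketch, not how the Kubernetes scheduler is implemented: the `Pod` record, the `schedule` function, the CPU figures, and the priority value are all made up, and preemption and eviction are not modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Pod:
    name: str
    cpu_request: int   # millicores
    priority: int = 0  # pods default to priority 0


def schedule(pending: List[Pod], node_cpu: int) -> List[str]:
    """Toy scheduler: place higher-priority pods first, then by arrival order."""
    placed = []
    free = node_cpu
    # sorted() is stable, so pods with equal priority keep their arrival order.
    for pod in sorted(pending, key=lambda p: -p.priority):
        if pod.cpu_request <= free:
            placed.append(pod.name)
            free -= pod.cpu_request
    return placed


if __name__ == "__main__":
    # Two "normal" pods are already waiting when the node joins the cluster.
    waiting = [Pod("app-1", 600), Pod("app-2", 400)]

    # With default priority the agent pod arrives last and is frozen out.
    print(schedule(waiting + [Pod("omsagent", 200)], node_cpu=1000))

    # With a priority above zero it is placed first and wins the race.
    print(schedule(waiting + [Pod("omsagent", 200, priority=1000)], node_cpu=1000))
```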

* CR feedback

* fix node metric issue (#502)

* Bug fixes for Feb release (#504)

* bug fix for mdm metrics with no limits

* fix exception bug

* Gangams/feb 2021 agent bug fix (#505)

* fix npe in getKubeServiceRecords

* use image fields from spec

* fix typo

* cover all cases

* handle scenario only digest specified

* changes for release -ciprod02232021 (#506)

* Gangams/e2e test framework (#503)

* add agent e2e fw and tests

* doc and script updates

* add validation script

* doc updates

* yaml updates

* fix typo

* doc updates

* more doc updates

* add ISTEST for helm chart to use arc conf

* refactor test code

* fix pr feedback

* fix pr feedback

* fix pr feedback

* fix pr feedback

* scrape new kubelet pod count metric name (#508)

* Adding explicit json output to az commands as the script fails if az is configured with Table output #409 (#513)

* Gangams/arc proxy contract and token renewal updates (#511)

* fix issue with crd status updates

* handle renewal token delays

* add proxy contract

* updates for proxy cert for linux

* remove proxycert related changes

* fix whitespace issue

* fix whitespace issue

* remove proxy in arm template

* doc updates for microsoft charts repo release (#512)

* doc updates for microsoft charts repo release

* wip

* Update enable-monitoring.sh (#514)

Lines 314 and 343 seem to have trailing spaces for some subscriptions, which causes the script to exit even in valid scenarios

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Prometheus scraping from sidecar and OSM changes (#515)

* add liveness timeout for exec (#518)

* chart and other updates (#519)

* Saaror osmdoc (#523)

* Create ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Add files via upload

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* telemetry bug fix (#527)

* Fix conflicting logrotate settings (#526)

The node and the omsagent container both have a cron.daily file to rotate certain logs daily. These settings are the same for some files in /var/log (mounted from the node with read/write access), causing the rotation to fail when both try to rotate at the same time. As a result, the /var/log/*.1 file is written to forever. Because these files are always written to and never rotated, this causes high memory usage on the node after a while.

This fix removes the container logrotate settings for /var/log, which the container does not write to.

* bug fix (#528)

* Gangams/arc ev2 deployment (#522)

* ev2 deployment for arc k8s extension

* fix charts path issue

* rename scripts tar

* add notifications

* fix line endings

* fix line endings

* update with prod repo

* fix file endings

* added liveness and telemetry for telegraf (#517)

* added liveness and telemetry for telegraf

* code transfer

* removed windows liveness probe

* done

* Windows metric fix (#530)

* changes

* about to remove container fix

* moved caching code to existing loop

* removed unnecessary changes

* removed a few more unnecessary changes

* added windows node check

* fixed a bug

* everything works confirmed

* OSM doc update (#533)

* Adding MDM metrics for threshold violation (#531)

* Rashmi/april agent 2021 (#538)

* add Read_from_Head config for all fluentbit tail plugins (#539)

See the commit message of: fluent/fluent-bit@70e33fa
for details explaining the fluentbit change and what Read_from_Head does when set to true.

* fix programdata mount issue on containerd win nodes (#542)

* Update sidecar mem limits  (#541)

* David/release 4 22 2021 (#544)

* updating image tag and agent version

* updated liveness probe

* updated release notes again

* fixed date in version file

* 1m, 1m, 1s by default (#543)

* 1m, 1m, 1s by default

* setting default through a different method

* David/aad stage 1 release (#556)

* update to latest omsagent, add eastus2 to mdsd regions

* copied oneagent bits to a CI repository release

* mdsd inmem mode

* yaml for cl scale test

* yaml for cl scale test

* reverting dockerProviderVersion version to 15.0.0

* prepping for release (updated image version, dockerProviderVersion, and release notes)

* container log scaletest yamls

* forgot to update image version in chart

* fixing windows tag in dockerfile, changing release notes wording

* missed windows tag in one more place

* forgot to change the windows dockerProviderVersion back

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Update ReleaseNotes.md (#558)

fix imagetag in the release notes

* Add wait time for telegraf and also force mdm egress to use tls 1.2 (#560)

* Add wait time for telegraf and also force mdm egress to use tls 1.2

* add wait for all telegraf dependencies across all containers (ds & rs)

* remove ssl change so we don't include it as part of the other fix until we test with att nodes.

* partially disabled telegraf liveness probe check, we'll still have telemetry but the probe won't fail if telegraf isn't running (#561)

* changes for 05202021 release (#563)

* changes for 05202021 release

* fixed typos

* Rashmi/jedi wireserver (#566)

* Update ReadMe.md (#565)

* Update ReadMe.md

* Update ReadMe.md

Included feedback from OSM team and Fixed

* Gangams/aad stage2 full switch to mdsd (#559)

* full switch to mdsd, upgrade to ruby v1 & omsagent removal

* add odsdirect as fallback option

* cleanup

* cleanup

* move customRegion to stage3

* updates related to containerlog route

* make xml eventschema consistent

* add buffer settings

* address HTTPServerException deprecation in ruby 2.6

* update to official mdsd version

* fix log message issue

* fix pr feedback

* get rid of unused code from omscommon

* fix pr feedback

* fix pr feedback

* clean up

* clean up

* fix missing conf

* Send perf metrics to MDM from windows daemonset (#568)

* updating json gem to address CVE-2020-10663 (#567)

* updating json gem to address CVE-2020-10663

* updating json gem to address CVE-2020-10663

* update recommended alerts readme (#570)

@dcbrown16 pointed out that this page links to the wrong document in [this issue](#475). The content in the currently linked page is identical to the page which should be linked, so it's a simple fix.

* trying again to fix the json gem (#571)

* trying again to fix the json gem

* removing installation of newer json gem

* Addressing PR comments for - #568 (#569)

* Mem_Buf_limit  is configurable via ConfigMap (#574)

* add log rotation settings for fluentd logs (#577)

* Gangams/release 06112021 (#578)

* updates related to ciprod06112021 release

* minor update

* release note update (#579)

* Make sidecar fluentbit chunk size configurable (#573)

* Fix vulnerabilities (#583)

* test

* test1

* test-2

* test-3

* 3

* 4

* test

* 2

* 3

* 4

* 5

* 6

* rename gem for windows

* fix

* fix

* Windows build optimization (#582)

* fix windows build failure due to msys2 version

* Fix telegraf startup issue when endpoint is unreachable (#587)

* revert fbit tail plugins defaults to std defaults (#586)

* fixed another bug (#593)

* feat: add new metrics to MDM for allocatable % calculation of cpu and memory usage (#584)

* feat: allocatable cpu and memory % metrics for MDM

* maybe

* linux is working

* windows....

* some more

* comment

* better

* syntax

* ruby

* revert omsagent.yaml

* comments

* pr feedback

* pr feedback

* testing msys2 version update

* better

* update adx sdk for perf issue (#601)

* remove md check

* Gangams/release notes update for hotfix (#596)

* release notes updates

* release notes updates for ciprod06112021-1

* Cherry picking hotfix changes to ci_dev (#605)

* release changes (#607)

* Gangams/aad stage3 msi auth (#585)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* refactor the windows agent ingestion token code

* code cleanup

* fix build errors

* code clean up

* code clean up

* code clean up

* code clean up

* more refactoring

* fix bug

* fix bug

* add debug logs

* add nil checks

* revert changes

* revert yaml change since this added in aks side

* fix pr feedback

* fix pr feedback

* refine retry code

* update mdsd env as per official build

* cleanup

* update env vars per mdsd

* update with mdsd official build

* skip cert gen & renewal incase of aad msi auth

* add nil check

* cherry windows agent nodeip issue

* fix merge issue

Co-authored-by: rashmichandrashekar <[email protected]>

* Gangams/remove chart version dependency (#589)

* remove chart version dependency

* remove unused code

* fix resource type

* fix

* handle weird cli chars

* update release process

* Gangams/july 2021 release tasks 3 (#613)

* use artifact and pipeline creds for image push

* minor update

* add vuln fix here so that pr can be merged

* remove un-used output plugin (#614)

* fix telegraf telemetry and improve fluentd liveness (#611)

* fix telegraf telemetry and improve fluentd liveness

* address identified vuln with libsystemd0

* fix exported image file extension

* Gangams/july 2021 release tasks 2 (#612)

* tail rs mdsd err logs

* configure mdsd log rotation

* log rotation for mdsd log files

* Fix out_oms.go dependency vulnerabilities (#623)

* revert libsystemd0 update (#616)

* updates for ci-prod release instructions (#619)

* cherry pick changes from ci_prod (#622)

* Support az login for passwords starting with dash ('-') (#626)

Co-authored-by: Vladimir Babichev <[email protected]>

* Gangams/add telemetry fbit settings (#628)

* add telemetry to track fbit settings

* add telemetry to track fbit settings

* check onboarding status (#629)

* Gangams/arc k8s conformance test updates (#617)

* conf test updates

* clean up

* wip

* update with mcr cidev image

* handle log path

* cleanup

* clean up

* wip

* working

* update for mcr image

* minor

* image update

* handle latency of connected cluster resource creation

* update conftest image

* upgrade golang version for windows in pipeline build and locally (#630)

* Updating a link in Readme.md (#632)

The link to the build pipelines now goes directly to our build pipelines (instead of to all github-private pipelines)

* Updating omsagent yaml to have parity with omsagent yaml file in AKS RP (#615)

* Unit test tooling (#625)

Added tooling and examples for unit tests

* run unit tests after a merge too (#634)

* flag stale PRs & issues

* Adding script to collect logs (for troubleshooting) (#636)

* added script for collecting logs

* added windows daemonset and prometheus sidecar, as well as some explanatory prints

* added kubectl describe and kubectl logs output

* changed message to make it more clear some errors are expected

* Sarah/ev2 (#640)

* ev2 artifacts for release pipeline

* update parameters reference

* add artifacts tar file

* changes to rollout and service model

* change agentimage path

* adding agentimage to artifact script

* removing charts from tarball

* change script to use blob storage

* change blob variables

* echo variables

* change blob uri

* use release id for blob prefix

* change to delete blob file

* add check for if blob storage file exists

* fix script errors

* update check for file in storage

* change true check

* comments and change storage account info to pipeline variables

* Changes for windows tar file

* PR changes

* documenting fbit tail plugin configmap settings. (#638)

* documenting fbit tail plugin configmap settings.

* Install unzip package on shell extension (#642)

* Changing installation in ev2 script (#644)

* Adjust release pipeline to use cdpx acr (#647)

* Adjust release pipeline to use cdpx acr

* Adjust release pipeline to use cdpx acr

* Update CDPX ACR path

* Add check for cdpx repo variable

* Sarah/ev2 prod (#649)

* Ev2 changes for prod

* CDPX repo naming change (#652)

* Sarah/ev2 update (#654)

* remove acr name from repo path

* add check to make sure tag does not exist in mcr repo

* change tag syntax for mcr repo check (#655)

* Gangams/optimize win livenessprobe (#653)

* livenessprobe optimization

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* Gangams/addon token adapter image tag to telemetry (#656)

* addon token adapter image tag

* addon token adapter image tag

* Sarah/ev2 helm (#658)

* Use MSI for Arc Release

* Use CIPROD_ACR AME subscription for shell extension

* remove extra line endings

* Sarah/ev2 pipeline (#661)

* testing build artifact dir changes

* add .pipelines directory and omsagent.yaml to build artifacts

* add charts directory to build artifacts (#662)

* Sarah/remove cdpx creds (#664)

* don't use cdpx acr creds from kv

* add e2etest.yaml to build output

* keep cdpx creds for now

* chart updates for rbac api version change (#660)

* chart updates for rbac api version change

* include windows ds for arc

* proxy support (for non-aks) (#665)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* initial proxy support

* merge?

* cleaning up some files which should've merged differently

* proxy should be working, but most tables don't have any data. About to merge, maybe whatever was wrong is now fixed

* linux AMA proxy works

* about to merge

* proxy support appears to be working, final mdsd build location will still change

* removing some unnecessary changes

* forgot to remove one last change

* redirected mdsd stderr to stdout instead of stdin

* addressing proxy password location comment

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Gangams/agent release ciprod10082021 & win-ciprod10082021 (#666)

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* use buildcommand for prod pipeline (#668)

* fixed merge issues. (#671) (#672)

* fix merge conflicts

* update with newimage tag

* changes related to mdsd version update (#673) (#674)

* Sarah/enable metrics (#675)

* add user assigned msi to yaml for pipeline

* update placeholders

* Gangams/chart updates oct2021 release (#676)

* chart updates for oct2021 release

* wip

* wip

* wip

* Gangams/msi mode mdsd crash fix (#677)

* update mdsd version which has fix for crash in msi mode

* image tag updates

* update to use extension GA api version (#679)

* Gangams/arm template msi onboarding (#659)

* wip

* wip

* working

* working

* working

* working

* working

* working

* shorten dcr prefix to DCR- to handle default workspace name length

* use MSCI- prefix similar to MSVMI- for dcr

* Gangams/conf test updates to handle sidecar (#681)

* wip

* test updates

* fix pr feedback

* fix pr feedback

* Fix scan break due to latest trivy changes

* Anjohans/configurable database name (#663)

* First cut at an implementation

* Reverting a change

* Moving a few lines to better align with cluster URI config

* Moving a few lines to better align with cluster URI config

* Adding an extra check that won't hurt

* Getting ADX database name from config rather than from secret

* Reverse the mangling done by editor

* Fixes to the code for reading the db name setting

* More fixes to the rb code for settings

* Tweaked and tested

* Code review

* Review follow-up

* Remove whitespace

* Gangams/troubelshooting script for arc k8s (#682)

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* doc updates

* doc updates

* wip

* wip

* update repo for issues

* fix minor one

* Sarah/remove cdpx creds (#685)

* remove download of cdpx creds

* fix: subtract number instead of string + update fluentd version 1.14.2 to fix security vulnerability (#686)

* fix: change default value to a number so that subtraction happens correctly

* update fluentd version to 1.14.2

* extra end statement

* safely set to float

* big decimal precision

* revert omsagent

* keep telemetry

* Faster Linux builds (part 1) (#687)

* moved docker image arg later on to enable docker build caching

* fixing image tag (doh)

* Sarah/fluentbit windows log (#688)

* upgrade fluentbit version for windows

* saving progress--fluent bit log tailing working for windows

* use configmap values for fluent-bit.conf where necessary and make necessary files common

* revert certificategenerator

* remove tomlparser-agent-config from linux folder

* clean up fluent.conf

* clean up fluent-bit.conf

* revert image tag

* fix agent tag

* make fluent bit flush interval configurable

* clean up unnecessary conf files

* remove unnecessary parts of fluent and fluent-bit conf

* log level back to info

* add fbit env variables for omsagent-win

* moving db files to var directory

* default to port 10250 & containerd for linux agent (#699)

* default to port 10250 & containerd

* fix pr feedback

* Updating pod annotation for latest agent version (#697)

* fix windows build failure due to msys2 version (#700)

* fix windows build failure due to msys2 version

* 20211130.0.0

* Jan agent tasks (#698)

* remove v1 fallback hidden option (#705)

* collect telemetry containerlog records with emptystamp (#703)

* collect telemetry containerlog records with emptystamp

* collect telemetry containerlog records with emptystamp

* Fixing telegraf bug for placeholder name (#706)

* Gangams/jan 2022 release tasks 3 (#702)

* add telemetry related to windows containers records

* add telemetry related to windows containers records

* containercount telemetry

* add explicit exit code in ps scripts

* node count telemetry

* telemetry for win cirecord 64KB or more

* metric to track wintelegraf metrics with tags 64kb

* metric to track wintelegraf metrics with tags 64kb

* fix pr feedback

* Gangams/jan 2022 release tasks 2 (#701)

* mdsd proc cpu and memory telemetry

* write ai logs to file and telemetry for mdsd proc

* write ai logs to file and telemetry for mdsd proc

* write ai logs to file and telemetry for mdsd proc

* fix pr feedback

* use name_prefix

* remove mdsd telemetry changes

* remove mdsd telemetry changes

* remove mdsd telemetry changes

* release updates for ciprod01312022 & win-ciprod01312022release (#707)

* release updates for ciprod01312022 release

* release updates for ciprod01312022 release

* fix pr feedback

* fix logger exception (#709)

* Gangams/chart version update for jan release (#710)

* chart updates for jan2022 release

* add missing agentversion annotations

* fix agentversion annotation issue in chart (#712)

* adx bug + misc (#714)

* fix golang dependencies

* fix adx bug

* exclude telegraf

* fix space

* include both

* exclude files specifically

* fix build break (#715)

* fix build break

* update all places

* Explicitly use win-2019 to unblock windows PRs builds

* Fixing telegraf vulnerability (#716)

* cherry picked changes from 03112022 release (#719)

* cherry picked changes from 03112022 release

* Gangams/http proxy support (#717)

* add proxy cert support

* add proxy cert support

* add proxy cert support

* add proxy cert support

* remove arbitrary username and pwd requirement

* remove arbitrary username and pwd requirement

* add proxy support for mdm

* mdsd dev build

* proxy changes

* fix typo

* mdsd dev build

* add libcurl specific things

* working mdsd proxy build

* mdsd official master build

* handle proxy endpoint which ends with /

* latest official mdsd build

* add telemetry to track proxy ca cert

* build multi-arch images (#704)

* build multi-arch linux images
* new pipelines to build multi-arch images

Co-authored-by: Amol Agrawal <[email protected]>

* add missing artifacts (#720)

* add missing artifacts

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/msi  onboarding arm template updates for AKS (#721)

* msi arm template updates

* handle space in location

* minor fixes (#722)

Co-authored-by: Amol Agrawal <[email protected]>

* specify go patch version (#723)

* specify go minor version

Co-authored-by: Amol Agrawal <[email protected]>

* User/amagraw/ciprod release 20220317 (#724)

* ciprod release march changes

Co-authored-by: Amol Agrawal <[email protected]>

* Remove health type from DCR onboarding & add private link support for windows agent in msi mode (#727)

* add private link support for windows agent in msi auth

* remove Microsoft-KubeHealth

* add private link support for windows msi

* fix bug

* fix bug

* fix bug

* fix bug

* check platform specific tags (#730) (#731)

* PodReadyPercentage metric bug fix (#734)

* update windows to ruby 2.7 (#732)

Co-authored-by: Amol Agrawal <[email protected]>

* Improve CI/CD for multi-arch (#733)

* selective push + trivy test

* keep size down

* improve CI and PR builds

* improve checks

* remove IMAGE_TAG build_arg from prod pipeline

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/ts updates for msi (#736)

* ts updates for msi based onboarding

* ts updates for msi based onboarding

* fix typo

* fix typo

* improve log message

* Sarah/health deprecation (#735)

Removes all health feature related code

* check platform specific tags (#738)

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/msi test instructions (#739)

* instructions for msi test validation

* readme updates

* readme updates

* readme updates

* readme updates

* Add CI Windows Build to MultiArch Dev pipeline (#740)

* test image in pools

* update dev pipeline - 1

* update dev -1

* fix job names

* correct paths

* test pool name

* update pool name

* updated urls

* speed up installs

* add base build

* fix paths

* do both builds

* fix bug

* add pool for common

* fix bug

* create path

* temp remove metadata windows

* fix bug

* fix docker command

* almost there

* login to acr

* create windows metadata file

* address PR comments I

Co-authored-by: Amol Agrawal <[email protected]>

* Add Windows phase (#741)

* build and release windows for prod

Co-authored-by: Amol Agrawal <[email protected]>

* Sarah/add onboarding templates (#742)

* add onboarding templates for legacy auth

* fix download (#749)

Co-authored-by: Amol Agrawal <[email protected]>

* force run trivy stage (#745)

- scans for HIGH, MEDIUM, CRITICAL CVEs with fixes available in / and /usr/lib
- breaks build if CVEs with existing fixes found
- adds trivyignore to accommodate CVEs which are understood and should not get flagged
- adds CVEs to trivyignore to unblock builds; CVEs will be fixed and removed from trivyignore in later PRs
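
The gating rule described in the bullets above can be sketched roughly as follows. This is an illustration only, not the pipeline's actual code: the `Finding` type, both function names, and the placeholder CVE IDs are hypothetical, and the ignore-file format (one ID per line, lines starting with `#` treated as comments) is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set

# Severities the commit message says the scan stage looks for.
BLOCKING_SEVERITIES = {"HIGH", "MEDIUM", "CRITICAL"}


@dataclass(frozen=True)
class Finding:
    """Hypothetical, simplified view of one scanner finding."""
    cve_id: str
    severity: str
    fixed_version: Optional[str]  # None when no fix has been published


def parse_ignore_file(text: str) -> Set[str]:
    """Collect ignored IDs; assumes one ID per line and '#' comment lines."""
    ids = set()
    for line in text.splitlines():
        entry = line.strip()
        if entry and not entry.startswith("#"):
            ids.add(entry)
    return ids


def blocking_findings(findings: Iterable[Finding], ignored: Set[str]) -> List[Finding]:
    """Return the findings that should break the build."""
    return [
        f for f in findings
        if f.severity in BLOCKING_SEVERITIES
        and f.fixed_version          # only CVEs with a fix available
        and f.cve_id not in ignored  # skip entries from the ignore file
    ]


if __name__ == "__main__":
    sample = [
        Finding("CVE-0000-0001", "HIGH", "1.2.3"),
        Finding("CVE-0000-0002", "CRITICAL", None),
    ]
    failures = blocking_findings(sample, parse_ignore_file("# reviewed\n"))
    # A non-empty result is what would fail the build in this sketch.
    print([f.cve_id for f in failures])
```

Keeping the ignore list as data rather than code matches the stated intent that entries are removed again once the underlying CVEs are fixed.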

Co-authored-by: Amol Agrawal <[email protected]>

* update telegraf to 1.22.2 to fix vulns (#752)

* update telegraf to 1.22.2 to fix vulns

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/arc k8s aad msi auth  (#743)

* arc k8s msi

* wip

* extension identity role

* imds sidecar integration for arc k8s

* imds sidecar integration for arc k8s

* imds endpoint for windows

* imds endpoint for windows

* wip

* fix exception

* rename param name

* arc msi imdsd container changes

* arc msi imdsd container changes

* arc msi imdsd container changes

* arc msi imdsd container changes

* arc msi imdsd container changes

* revert unneeded yaml changes

* revert unneeded yaml changes

* wip

* wip

* working

* working

* working

* add implementation for msi token for windows mdm metrics

* fix comment

* arc k8s msi onboarding templates

* fix template bug

* fix template bug

* fix template bug

* rename flag name

* fix template bug

* make useAADAuth specific to arc k8s

* set k8sport at machine scope for windows

* fix bug

* fix bug

* update rbac for arc k8s imds

* bump chart version for conformance test run

* conf test updates for msi auth

* cli extension whl file

* add containerinsights solution in msi auth mode

* unify tags

* revert test chart and image versions

* remove test whl file and fix conf test

* conf test updates for addon-token-adapter

* remove container insights solution add for msi auth

* add missing arm template param

* Gangams/ws2022 support (#756)

* use hyperv isolation

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* doc and script updates

* add common as dependency for multi-arc job

* merge into single job for perf evaluation

* merge into single job for perf evaluation

* merge into single job for perf evaluation

* separate jobs for ltsc2019 & ltsc2022

* separate jobs for ltsc2019 & ltsc2022

* update dev image docker file & script

* remove unnecessary task

* update prod pipeline yaml for windows multi-arc image

* test yamls for ltsc2019 & ltsc2022

* fix pr checker fail

* fix repoImageWindows path in windows pipeline

* remove passing imagetag for prod

* CA Cert Fix for Mariner Hosts in Air Gap (#751)

* add cifs & fuse file systems to ignore list (#750)

* Data collection script (#759)

* Add files via upload

* Add files via upload

* Delete AKSInsightsLogCollection.sh

* Create README.md

* Add files via upload

* move script to subfolder LogCollection

* Update README.md

* Rename AKSInsightsLogCollection.sh to AgentLogCollection.sh

* Microsoft mandatory file (#763)

Co-authored-by: microsoft-github-policy-service[bot] <77245923+microsoft-github-policy-service[bot]@users.noreply.github.com>

* Adding v2 schema options (#762)

* Adding v2 schema options

Adding commented out section in log collection settings for v2 schema

* adding documentation link

* Agent release for ciprod05192022 and win-ciprod05192022  (#765)

* Making changes for the release ciprod05192022 (except release notes)

* Adding release notes

* Remove unnecessary spaces

* Updating release notes for configmap v2 and disk usage metrics

* trivy image scan (#770)

* do trivy image check in azure pipelines

* remove pr-checker github action

Co-authored-by: Amol Agrawal <[email protected]>

* Prometheus sidecar memory optimization  (#769)

Don't start telegraf, mdsd, and fluent-bit in the prometheus sidecar if it has no work to do (monitor_kubernetes_pods = false and no OSM namespaces to scrape). This part is just a resource-usage optimization.

The newly created environment variables are written to a file, because adding them to bashrc makes them inaccessible when a script runs in a non-interactive environment, as is the case for livenessprobe.sh.
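
A minimal sketch of the two ideas in this commit message, under stated assumptions: the function names and the `SIDECAR_IDLE` variable are hypothetical, and the agent's real startup logic may differ. The first function captures the "no work to do" condition; the second renders variables as `export` lines that a non-interactive script could source from a file.

```python
import shlex
from typing import Dict, Sequence


def sidecar_has_work(monitor_kubernetes_pods: bool, osm_namespaces: Sequence[str]) -> bool:
    """True when the sidecar has something to scrape.

    Mirrors the condition in the commit message: there is no work when
    monitor_kubernetes_pods is false and there are no OSM namespaces.
    """
    return monitor_kubernetes_pods or len(osm_namespaces) > 0


def render_env_file(variables: Dict[str, str]) -> str:
    """Render variables as shell 'export' lines with safely quoted values."""
    return "".join(
        f"export {name}={shlex.quote(value)}\n" for name, value in variables.items()
    )


if __name__ == "__main__":
    idle = not sidecar_has_work(False, [])
    # SIDECAR_IDLE is a made-up name used only for this illustration.
    print(render_env_file({"SIDECAR_IDLE": "true" if idle else "false"}), end="")
```

Writing the values to a file sidesteps the fact that a non-interactive shell does not necessarily read bashrc, which is the problem the commit message describes for the liveness probe.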

* Gangams/fix telegraf issue (#773)

* avoid imds token call during start up

* avoid imds token call during start up

* Make metrics endpoint variable on ArcA cluster (#772)

* add integration for azure subnet ip usage (#774)

* add integration for azure cni subnet ip usage

* exclude unfixed cve & remove fixed one

* Gangams/rs hyper scale 2022 ready (#753)

* watch and multiproc implementation

* fix weird bug

* multiproc support for fluentd

* working

* fix log lines

* refactor code

* cache telemetry

* nodecount telemetry

* bug fix

* further optimize

* bugfix related typo

* node allocatable cache

* wincontainerinventory in multiproc

* disable health

* config events on different core

* add ts to logs

* move kube perf records to separate plugin

* refactor

* minor update

* remove commented code

* mdm state file

* mdm state file

* podmdm to separate plugin

* bug fixes

* bug fixes

* bug fixes

* podmdm plugin

* bug fixes

* bug fixes

* remove unneeded log lines

* more improvements

* clean up

* clean up

* add requestId header for mdm metrics

* latest mdsd and fix for threading issue in out mdm

* rs specific config for large cluster

* optimize out mdm

* bug fix

* use large queue limit for kube perf

* 5k preview rs limits

* handle resourceversion empty or 0 scenario

* handle pagination api call failures

* fix bug

* preview image for internal customer validation

* preview image

* wip

* wip

* fix trailing whitespaces

* fix bug

* remove unused envvars in yaml

* revert minor things

* telemetry tags for preview release

* revert preview image tags

* revert unintended change

* fix bug

* use same batchtime for both mdm & podinventory records

* use same batchtime for both mdm & podinventory records

* use same batchtime for both mdm & podinventory records

* use same batchtime for both mdm & podinventory records

* preview image tag with latest ci_dev changes

* change back to use prod image in docker files

* fix unit test failures

* exclude unfixed cve until this get fixed

* fix minor issue

* increase retries to handle transient errors

* changes related to june 2022 release (#778)

* Gangams/ARM Template updates for the DCR API version and stream group (#784)

* update to use stream group

* update DCR api version & stream group

* Bump Newtonsoft.Json in /build/windows/installer/certificategenerator (#785)

Bumps [Newtonsoft.Json](https://github.com/JamesNK/Newtonsoft.Json) from 12.0.3 to 13.0.1.
- [Release notes](https://github.com/JamesNK/Newtonsoft.Json/releases)
- [Commits](JamesNK/Newtonsoft.Json@12.0.3...13.0.1)

---
updated-dependencies:
- dependency-name: Newtonsoft.Json
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Gangams/fix file access exceptions (#787)

* fix file access exception

* move insights metrics conf to common

* clear file content before writing content

* add timestamp to debug logs

* release updates for linux agent

* Adhere to containers security guidance (#783)

- move away from dockerhub images to MCR images
- parameterize images in dockerfiles
- use azure pipelines variables to pass appropriate MCR images during buildtime

Co-authored-by: Amol Agrawal <[email protected]>

* update to DCR & DCR-A api version 2021-04-01 (#789)

* fix telegraf vulns (#795)

Co-authored-by: Amol Agrawal <[email protected]>

* Address vulnerabilities through package updates (#794)

- Updates to ruby 3.1.1
- Uses RVM as ruby manager instead of the brightbox ppa
- Updates fluentd to 1.14.6
- Use default JSON gem instead of yajl-json
- Consume tomlrb as a gem instead of committed source code

Co-authored-by: Amol Agrawal <[email protected]>
Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Gangams/fix log loss inode reuse (#796)

* use ignore_older fbit default and option for configurability

* fix minor comment

* fix minor comment

* merge conflict (#799)

* update vulns (#800)

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/fix permission assignments in test scripts (#802)

* restrict rw permissions to owner

* remove usage of worldwrite file permissions

* remove worldwrite file permission

* remove worldwrite file permission

* wip: data collection interval

* wip:namespace filtering

* wip

* wip

* fix naming

* Gangams/rs vpa (#801)

* add vpa sidecar container

* add vpa sidecar container

* add vpa sidecar container

* add vpa sidecar container

* use image which has support for only scaling limits

* rename omsagent-rs-vpa to omsagent-vpa

* add vpa configmap

* use updated version of addon-resizer

* collect omsagent-rs limits telemetry if VPA enabled

* ignore new unfixed vulnerabilities

* fix bug

* fix bug

* fix bug

* bug fix

* fix bug

* fix bug

* rename env var name

* use the addon-resizer and collect requests and limits telemetry

* fix bug

* minor update

* rename variable names

* rename variable names

* add telemetry

* fix bugs

* fix bugs

* fix bugs

* fix bug

* fix bug

* fix bugs

* fix bug

* add known cve to ignore list

* more optimizations

* refactor extensionSettings

* fix minor issue

* arm templates for arc k8s

* use total cache for ns filtering

* use the preview images for private preview release

* fix merge issues

* fix merge issues

* create dcr in cluster rg

* update existing templates with data collection settings

* implement namespaces filtering mode

* implement namespaces filtering mode

* implement namespaces filtering mode

* implement namespaces filtering mode

* implement namespaces filtering mode

* clean up

* better naming

* better naming

* better naming

* better naming

* better naming

* refactor code

* add telemetry

* update template params

* fix pr feedback

* fix pr feedback

* make naming consistent

* make naming consistent

* make naming consistent

* make naming consistent

* make naming consistent

* fix pr feedback

* fix pr feedback

* fix renaming variable

* add known cve to the ignore list

* minor log updates

Co-authored-by: rashmichandrashekar <[email protected]>
Co-authored-by: deagraw <[email protected]>
Co-authored-by: David Michelman <[email protected]>
Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: Michael Sinz <[email protected]>
Co-authored-by: Grace Wehner <[email protected]>
Co-authored-by: Nicolas Yuen <[email protected]>
Co-authored-by: seenu433 <[email protected]>
Co-authored-by: saaror <[email protected]>
Co-authored-by: Tsubasa Nomura <[email protected]>
Co-authored-by: bragi92 <[email protected]>
Co-authored-by: Vladimir <[email protected]>
Co-authored-by: Vladimir Babichev <[email protected]>
Co-authored-by: sarahpeiffer <[email protected]>
Co-authored-by: Anders Johansen <[email protected]>
Co-authored-by: Amol Agrawal <[email protected]>
Co-authored-by: Amol Agrawal <[email protected]>
Co-authored-by: Nina <[email protected]>
Co-authored-by: microsoft-github-policy-service[bot] <77245923+microsoft-github-policy-service[bot]@users.noreply.github.com>
Co-authored-by: Auston Li <[email protected]>
Co-authored-by: Janvi Jatakia <[email protected]>
Co-authored-by: MSFTXiangyu <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: bragi92 <[email protected]>
jatakiajanvi12 pushed a commit that referenced this pull request Dec 2, 2022
* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

* Gangams/cluster creation scripts (#414)

* onprem k8s script

* script updates

* scripts for creating non-aks clusters

* fix minor text update

* updates

* script updates

* fix

* script updates

* fix scripts to install docker

* fix: Pin to a particular version of ltsc2019 by SHA (#427)

* enable collecting npm metrics (optionally) (#425)

* enable collecting npm metrics (optionally)

* fix default enrichment value

* fix adx

* Saaror patch 3 (#426)

* Create README.MD

Creating content for Kubecon lab

* Update README.MD

* Update README.MD

* Gangams/add containerd support to windows agent (#428)

* wip

* wip

* wip

* wip

* bug fix related to uri

* wip

* wip

* fix bug with ignore cert validation

* logic to ignore cert validation

* minor

* fix minor debug log issue

* improve log message

* debug message

* fix bug with nullorempty check

* remove debug statements

* refactor parsers

* add debug message

* clean up

* chart updates

* fix formatting issues

* Gangams/arc k8s metrics  (#413)

* cluster identity token

* wip

* fix exception

* fix exceptions

* fix exception

* fix bug

* fix bug

* minor update

* refactor the code

* more refactoring

* fix bug

* typo fix

* fix typo

* wait for 1min after token renewal request

* add proxy support for arc k8s mdm endpoint

* avoid additional get call

* minor line ending fix

* wip

* have separate log for arc k8s cluster identity

* fix bug on creating crd resource

* remove update permission since not required

* fixed some bugs

* fix pr feedback

* remove list since it's not required

* fix: Reverting back to ltsc2019 tag (#429)

* more kubelet metrics (#430)

* more kubelet metrics

* clean up new config

* fix nom issue when config is empty (#432)

* support multiple docker paths when docker root is updated thru knode (#433)

* Gangams/doc and other related updates (#434)

* bring back nodeslector changes for windows agent ds

* readme updates

* chart updates for azure cluster resourceid and region

* set cluster region during onboarding for managed clusters

* wip

* fix for onboarding script

* add sp support for the login

* update help

* add sp support for powershell

* script updates for sp login

* wip

* wip

* wip

* readme updates

* update the links to use ci_prod branch

* fix links

* fix image link

* some more readme updates

* add missing serviceprincipal in ps scripts (#435)

* fix telemetry bug (#436)

* Gangams/readmeupdates non aks 09162020 (#437)

* changes for ciprod09162020 non-aks release

* fix script to handle cross sub scenario

* fix minor comment

* fix date in version file

* fix pr comments

* Gangams/fix weird conflicts (#439)

* separate build yamls for ci_prod branch (#415) (#416)

* [Merge] dev to prod for ciprod08072020 release (#424)

* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>

* fix quote issue for the region (#441)

* fix cpucapacity/limit bug (#442)

* grwehner/pv-usage-metrics (#431)

- Send persistent volume usage and capacity metrics to LA for PVs with PVCs at the pod level; config to include or exclude kube-system namespace.
- Send PV usage percentage to MDM if over the configurable threshold.
- Add PV usage recommended alert template.
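
The threshold rule in the second bullet can be illustrated with a short sketch. The function names are hypothetical, and the strict "greater than" comparison is an assumption drawn from the phrase "over the configurable threshold"; the plugin may compare differently.

```python
from typing import Optional


def pv_usage_percentage(used_bytes: int, capacity_bytes: int) -> Optional[float]:
    """Return usage as a percentage, or None when capacity is unknown or zero."""
    if capacity_bytes <= 0:
        return None
    return used_bytes / capacity_bytes * 100.0


def should_send_to_mdm(used_bytes: int, capacity_bytes: int, threshold_percent: float) -> bool:
    """True when the PV usage percentage is over the configured threshold."""
    percentage = pv_usage_percentage(used_bytes, capacity_bytes)
    return percentage is not None and percentage > threshold_percent


if __name__ == "__main__":
    # 80 of 100 bytes used against a 60% threshold: the metric would be sent.
    print(should_send_to_mdm(80, 100, 60.0))
```

Returning `None` for a non-positive capacity avoids a division by zero and keeps the "unknown" case distinct from 0% usage.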

* add new custom metric regions (#444)

* add new custom metric regions

* fix commas

* add 'Terminating' state (#443)

* Gangams/sept agent release tasks (#445)

* turnoff mdm nonsupported cluster types

* enable validation of server cert for ai ruby http client

* add kubelet operations total and total error metrics

* node selector label change

* label update

* wip

* wip

* wip

* revert quotes

* grwehner/pv-collect-volume-name (#448)

Collect and send the volume name as another tag for pvUsedBytes in InsightsMetrics, so that it can be displayed in the workload workbook. Does not affect the PV MDM metric

* Changes for september agent release (#449)

Moving from v1beta1 to v1 for health CRD
Adding timer for zero filling
Adding zero filling for PV metrics

* Gangams/arc k8s related scripts, charts and doc updates (#450)

* checksum annotations

* script update for chart from mcr

* chart updates

* update chart version to match with chart release

* script updates

* latest chart updates

* version updates for chart release

* script updates

* script updates

* doc updates

* doc updates

* update comments

* fix bug in ps script

* fix bug in ps script

* minor update

* release process updates

* use consistent name across scripts

* use consistent names

* Install CA certs from wireserver (#451)

* grwehner/pv-volume-name-in-mdm (#452)

Add volume name for PV to mdm dimensions and zero fill it

* Release changes for 10052020 release (#453)

* Release changes for 10052020 release

* remove redundant kubelet metrics as part of PR feedback

* Update onboarding_instructions.md (#456)

* Update onboarding_instructions.md

Updated the documentation to reflect where to update the config map.

* Update onboarding_instructions.md

* Update onboarding_instructions.md

* Update onboarding_instructions.md

Updated the link

* chart update for sept2020 release (#457)

* add missing version update in the script (#458)

* November release fixes - activate one agent, adx schema v2, win perf issue, syslog deactivation (#459)

* activate one agent, adx schema v2, win perf issue, syslog deactivation

* update chart

* remove hyphen for params in chart (#462)

Merging as it's a simple fix (remove hyphen)

* Changes for cutting a new build for ciprod10272020 release (#460)

* using latest stable version of msys2 (#465)

* fixing the windows-perf-dups (#466)

* chart updates related to new microsoft/charts repo (#467)

* Changes for creating 11092020 release (#468)

* MDM exception aggregation (#470)

* grwehner/mdm custom metric regions (#471)

Remove custom metrics region check for public cloud

* updating rs limit to 1gb (#474)

* grwehner/pv inventory (#455)

Add fluentd plugin to request persistent volume info from the kubernetes api and send to LA

* Gangams/fix for build release pipeline issue (#476)

* use isolated cdpx acr

* correct comment

* add pv fluentd plugin config to helm rs config (#477)

* add pv fluentd plugin to helm rs config

* helm rbac permissions for pv api calls

* Gangams/fix rs ooming (#473)

* optimize kpi

* optimize kube node inventory

* add flags for events, deployments and hpa

* have separate function parseNodeLimits

* refactor code

* fix crash

* fix bug with service name

* fix bugs related to get service name

* update oom fix test agent

* debug logs

* fix service label issue

* update to latest agent and enable ephemeral annotation

* change stream size to 200 from 250

* update yaml

* adjust chunksizes

* add ruby gc env

* yaml changes for cioomtest11282020-3

* telemetry to track pods latency

* service count telemetry

* rename variables

* wip

* nodes inventory telemetry

* configmap changes

* add emit streams in configmap

* yaml updates

* fix copy and paste bug

* add todo comments

* fix node latency telemetry bug

* update yaml with latest test image

* fix bug

* upping rs memory change

* fix mdm bug with final emit stream

* update to latest image

* fix pr feedback

* fix pr feedback

* rename health config to agent config

* fix max allowed hpa chunk size

* update to use 1k pod chunk since validated on 1.18+

* remove debug logs

* minor updates

* move defaults to common place

* chart updates

* final oomfix agent

* update to use prod image so that can be validated with build pipeline

* fix typo in comment

* Gangams/enable arc onboarding to ff (#478)

* wip

* updates

* trigger login if the ctx cloud not same as specified cloud

* add missed commit

* Convert PV type dictionary to json for telemetry so it shows up in logs (#480)

* fix 2 windows tasks - 1) Don't log to termination log 2) enable ADX route for containerlogs in windows (for O365) (#482)

* fix ci envvar collection in large pods (#483)

* grwehner/jan agent tasks (#481)

- Windows agent fix to use log filtering settings in config map.
- Error handling for kubelet_utils get_node_capacity in case /metrics/cadvisor endpoint fails.
- Remove env variable for workspace key for windows agent

* updating fbit version and cpu limit (#485)

* reverting to older version (#487)

* Gangams/add fbsettings configurable via configmap (#486)

* wip

* fbit config settings

* add config warn message

* handle one config provided but not other

* fixed pr feedback

* fix copy paste error

* rename config parameter names

* fix typo

* fix fbit crash in helm path

* fix nil check

* Gangams/jan agent release tasks (#484)

* wip

* explicit amd64 affinity for hybrid workloads

* fix space issue

* wip

* revert vscode setting file

* remove per container logs in ci (#488)

* updates for ciprod01112021 release (#489)

* new yaml files (#491)

* Use cloud-specific instrumentation keys (#494)

If APPLICATIONINSIGHTS_AUTH_URL is set/non-empty then the agent will now grab a custom IKey from a URL stored in APPLICATIONINSIGHTS_AUTH_URL
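
The selection rule described above might look roughly like this. It is a sketch under assumptions, not the agent's code: `resolve_instrumentation_key` and the injected `fetch` callable are hypothetical, and how the agent actually downloads or decodes the key is not stated in the commit message.

```python
import os
from typing import Callable, Mapping


def resolve_instrumentation_key(
    env: Mapping[str, str],
    default_key: str,
    fetch: Callable[[str], str],
) -> str:
    """Pick the instrumentation key to use.

    If APPLICATIONINSIGHTS_AUTH_URL is set and non-empty, the key is read
    from that URL through the supplied fetch function; otherwise the
    built-in default is kept.
    """
    auth_url = env.get("APPLICATIONINSIGHTS_AUTH_URL", "")
    if auth_url.strip():
        return fetch(auth_url).strip()
    return default_key


if __name__ == "__main__":
    # The fetcher is stubbed here so the example needs no network access.
    key = resolve_instrumentation_key(os.environ, "default-ikey", lambda url: "custom-ikey")
    print(key)
```

Passing the fetcher in as a parameter keeps the decision logic testable without a network call.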

* upgrade apt to latest version (#492)

* upgrade apt to latest version

* fix pr feedback

* Gangams/add support for extension msi for arc k8s cluster (#495)

* wip

* add env var for the arc k8s extension name

* chart update

* extension msi updates

* fix bug

* revert chart and image to prod version

* minor text changes

* image tag to prod

* wip

* wip

* wip

* wip

* final updates

* fix whitespaces

* simplify crd yaml

* Gangams/arm template arc k8s extension (#496)

* arm templates for arc k8s extension

* update to use official extension type name

* update

* add identity property

* add proxyendpointurl parameter

* add default values

* Gangams/aks monitoring via policy (#497)

* enable monitoring through policy

* wip

* handle tags

* wip

* add alias

* wip

* working

* updates

* working

* with deployment name

* doc updates

* doc updates

* fix typo in the docs

* revert to use operatingSystem from osImage for node os telemetry (#498)

* Container log v2 schema changes (#499)

* make pod name in mdsd definition a str for consistency. msgp has no type checking, as it carries type metadata in the message itself.

* Add priority class to the daemonsets (#500)

* Add priority class to the daemonsets

Add a priority class for omsagent and have the daemonsets use this
to be sure to schedule the pods.

Daemonset pods are constrained in scheduling to run on specific
nodes.  This is done by the daemonset controller.  When a node shows
up it will create a pod with a strong affinity to that node.  When a
node goes away, it will delete the pod with the node affinity to that
node.

Kubernetes pod scheduling does not know it is a daemonset but it does
know it is tied to a specific node.  With default scheduling, it is
possible for the pods to be "frozen out" of a node because the node
already is full.  This can happen because "normal" pods may already
exist and are looking for a node to get scheduled on when a node is
added to the cluster.  The daemonset controller will only first
create the pod for the node at around the same time.  The kubernetes
scheduler is running async from all of this and thus there can be a
race as to who gets scheduled on the node.

The pod priority class (and thus the pod priority) is a way to indicate
that the pod has a higher scheduling priority than a default pod.

By default, all pods are at priority 0.  Higher numbers are higher
priority.  Setting the priority to something greater than zero will
allow the omsagent daemonsets to win a race against "normal" pods for
scheduled resources on a node - and will also allow for graceful
eviction in the case the node is too full.

Without this, omsagent can be left out of node in clusters that are
very busy, especially in dynamic scaling situations.
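
The race described above can be pictured with a toy model. This is not how the Kubernetes scheduler is implemented; `Pod`, `schedule_on_node`, and the numbers are hypothetical, and the model only shows why a higher priority lets a pod win a slot on a node that cannot fit everything.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Pod:
    """Hypothetical pending pod with a priority and a CPU request."""
    name: str
    priority: int = 0          # default pods are at priority 0
    cpu_millicores: int = 100


def schedule_on_node(pending: List[Pod], free_cpu_millicores: int) -> List[str]:
    """Place pods on one node, highest priority first, while they fit."""
    placed = []
    # sorted() is stable, so pods of equal priority keep their arrival order.
    for pod in sorted(pending, key=lambda p: -p.priority):
        if pod.cpu_millicores <= free_cpu_millicores:
            placed.append(pod.name)
            free_cpu_millicores -= pod.cpu_millicores
    return placed


if __name__ == "__main__":
    apps = [Pod("app-1", 0, 600), Pod("app-2", 0, 400)]
    # With default priority the agent arrives last and does not fit.
    print(schedule_on_node(apps + [Pod("omsagent", 0, 200)], 1000))
    # With a higher priority it is considered first and gets its slot.
    print(schedule_on_node(apps + [Pod("omsagent", 1000, 200)], 1000))
```

In the first call the node fills up with ordinary pods before the agent is considered; in the second the agent is placed first and one ordinary pod has to wait for another node.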

I did not test the windows pod as we have no windows clusters.

* CR feedback

* fix node metric issue (#502)

* Bug fixes for Feb release (#504)

* bug fix for mdm metrics with no limits

* fix exception bug

* Gangams/feb 2021 agent bug fix (#505)

* fix npe in getKubeServiceRecords

* use image fields from spec

* fix typo

* cover all cases

* handle scenario only digest specified

* changes for release -ciprod02232021 (#506)

* Gangams/e2e test framework (#503)

* add agent e2e fw and tests

* doc and script updates

* add validation script

* doc updates

* yaml updates

* fix typo

* doc updates

* more doc updates

* add ISTEST for helm chart to use arc conf

* refactor test code

* fix pr feedback

* fix pr feedback

* fix pr feedback

* fix pr feedback

* scrape new kubelet pod count metric name (#508)

* Adding explicit json output to az commands as the script fails if az is configured with Table output #409 (#513)

* Gangams/arc proxy contract and token renewal updates (#511)

* fix issue with crd status updates

* handle renewal token delays

* add proxy contract

* updates for proxy cert for linux

* remove proxycert related changes

* fix whitespace issue

* fix whitespace issue

* remove proxy in arm template

* doc updates for microsoft charts repo release (#512)

* doc updates for microsoft charts repo release

* wip

* Update enable-monitoring.sh (#514)

Lines 314 and 343 seem to have trailing spaces for some subscriptions, which causes the script to exit even in valid scenarios

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Prometheus scraping from sidecar and OSM changes (#515)

* add liveness timeout for exec (#518)

* chart and other updates (#519)

* Saaror osmdoc (#523)

* Create ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Add files via upload

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* telemetry bug fix (#527)

* Fix conflicting logrotate settings (#526)

The node and the omsagent container each have a cron.daily file that rotates certain logs daily. Both sets of settings cover some of the same files in /var/log (mounted from the node with read/write access), so rotation fails when both try to rotate at the same time. The /var/log/*.1 file is then written to forever. Because these files are always written to and never rotated, they cause high memory usage on the node after a while.

This fix removes the container logrotate settings for /var/log, which the container does not write to.
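
The conflict described above comes from two rotators claiming the same path. A minimal sketch of how such overlap could be detected is shown below; the function name, the file paths, and the patterns are hypothetical and are not taken from the agent or node configuration.

```python
from fnmatch import fnmatch


def doubly_rotated(files, node_patterns, container_patterns):
    """Return the files matched by both the node's and the container's patterns.

    Illustrative only: the patterns stand in for the paths listed in each
    side's logrotate configuration.
    """
    def matches(path, patterns):
        return any(fnmatch(path, pattern) for pattern in patterns)

    return sorted(
        path for path in files
        if matches(path, node_patterns) and matches(path, container_patterns)
    )


if __name__ == "__main__":
    # Hypothetical file list and patterns, for illustration only.
    files = ["/var/log/syslog", "/var/log/kern.log", "/opt/agent/log/agent.log"]
    node = ["/var/log/syslog", "/var/log/*.log"]
    container = ["/var/log/syslog", "/opt/agent/log/*.log"]
    print(doubly_rotated(files, node, container))
```

Any path this reports is rotated from two places at once, which is the situation the fix avoids by dropping the container-side settings for /var/log.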

* bug fix (#528)

* Gangams/arc ev2 deployment (#522)

* ev2 deployment for arc k8s extension

* fix charts path issue

* rename scripts tar

* add notifications

* fix line endings

* fix line endings

* update with prod repo

* fix file endings

* added liveness and telemetry for telegraf (#517)

* added liveness and telemetry for telegraf

* code transfer

* removed windows liveness probe

* done

* Windows metric fix (#530)

* changes

* about to remove container fix

* moved caching code to existing loop

* removed un-necessary changes

* removed a few more un-necessary changes

* added windows node check

* fixed a bug

* everything works confirmed

* OSM doc update (#533)

* Adding MDM metrics for threshold violation (#531)

* Rashmi/april agent 2021 (#538)

* add Read_from_Head config for all fluentbit tail plugins (#539)

See the commit message of: fluent/fluent-bit@70e33fa
for details explaining the fluentbit change and what Read_from_Head does when set to true.

* fix programdata mount issue on containerd win nodes (#542)

* Update sidecar mem limits  (#541)

* David/release 4 22 2021 (#544)

* updating image tag and agent version

* updated liveness probe

* updated release notes again

* fixed date in version file

* 1m, 1m, 1s by default (#543)

* 1m, 1m, 1s by default

* setting default through a different method

* David/aad stage 1 release (#556)

* update to latest omsagent, add eastus2 to mdsd regions

* copied oneagent bits to a CI repository release

* mdsd inmem mode

* yaml for cl scale test

* yaml for cl scale test

* reverting dockerProviderVersion version to 15.0.0

* prepping for release (updated image version, dockerProviderVersion, and release notes)

* container log scaletest yamls

* forgot to update image version in chart

* fixing windows tag in dockerfile, changing release notes wording

* missed windows tag in one more place

* forgot to change the windows dockerProviderVersion back

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Update ReleaseNotes.md (#558)

fix imagetag in the release notes

* Add wait time for telegraf and also force mdm egress to use tls 1.2 (#560)

* Add wait time for telegraf and also force mdm egress to use tls 1.2

* add wait for all telegraf dependencies across all containers (ds & rs)

* remove ssl change so we don't include it as part of the other fix until we test with att nodes.

* partially disabled telegraf liveness probe check, we'll still have telemetry but the probe won't fail if telegraf isn't running (#561)

* changes for 05202021 release (#563)

* changes for 05202021 release

* fixed typos

* Rashmi/jedi wireserver (#566)

* Update ReadMe.md (#565)

* Update ReadMe.md

* Update ReadMe.md

Included feedback from OSM team and Fixed

* Gangams/aad stage2 full switch to mdsd (#559)

* full switch to mdsd, upgrade to ruby v1 & omsagent removal

* add odsdirect as fallback option

* cleanup

* cleanup

* move customRegion to stage3

* updates related to containerlog route

* make xml eventschema consistent

* add buffer settings

* address HTTPServerException deprecation in ruby 2.6

* update to official mdsd version

* fix log message issue

* fix pr feedback

* get rid of unused code from omscommon

* fix pr feedback

* fix pr feedback

* clean up

* clean up

* fix missing conf

* Send perf metrics to MDM from windows daemonset (#568)

* updating json gem to address CVE-2020-10663 (#567)

* updating json gem to address CVE-2020-10663

* updating json gem to address CVE-2020-10663

* update recommended alerts readme (#570)

@dcbrown16 pointed out that this page links to the wrong document in [this issue](#475). The content in the currently linked page is identical to the page which should be linked, so it's a simple fix.

* trying again to fix the json gem (#571)

* trying again to fix the json gem

* removing installation of newer json gem

* Addressing PR comments for - #568 (#569)

* Mem_Buf_limit  is configurable via ConfigMap (#574)

* add log rotation settings for fluentd logs (#577)

* Gangams/release 06112021 (#578)

* updates related to ciprod06112021 release

* minor update

* release note update (#579)

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>
Co-authored-by: bragi92 <[email protected]>
Co-authored-by: saaror <[email protected]>
Co-authored-by: Grace Wehner <[email protected]>
Co-authored-by: deagraw <[email protected]>
Co-authored-by: David Michelman <[email protected]>
Co-authored-by: Michael Sinz <[email protected]>
Co-authored-by: Nicolas Yuen <[email protected]>
Co-authored-by: seenu433 <[email protected]>
Co-authored-by: Tsubasa Nomura <[email protected]>
jatakiajanvi12 pushed a commit that referenced this pull request Dec 2, 2022
* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

* Gangams/cluster creation scripts (#414)

* onprem k8s script

* script updates

* scripts for creating non-aks clusters

* fix minor text update

* updates

* script updates

* fix

* script updates

* fix scripts to install docker

* fix: Pin to a particular version of ltsc2019 by SHA (#427)

* enable collecting npm metrics (optionally) (#425)

* enable collecting npm metrics (optionally)

* fix default enrichment value

* fix adx

* Saaror patch 3 (#426)

* Create README.MD

Creating content for Kubecon lab

* Update README.MD

* Update README.MD

* Gangams/add containerd support to windows agent (#428)

* wip

* wip

* wip

* wip

* bug fix related to uri

* wip

* wip

* fix bug with ignore cert validation

* logic to ignore cert validation

* minor

* fix minor debug log issue

* improve log message

* debug message

* fix bug with nullorempty check

* remove debug statements

* refactor parsers

* add debug message

* clean up

* chart updates

* fix formatting issues

* Gangams/arc k8s metrics  (#413)

* cluster identity token

* wip

* fix exception

* fix exceptions

* fix exception

* fix bug

* fix bug

* minor update

* refactor the code

* more refactoring

* fix bug

* typo fix

* fix typo

* wait for 1min after token renewal request

* add proxy support for arc k8s mdm endpoint

* avoid additional get call

* minor line ending fix

* wip

* have separate log for arc k8s cluster identity

* fix bug on creating crd resource

* remove update permission since not required

* fixed some bugs

* fix pr feedback

* remove list since its not required

* fix: Reverting back to ltsc2019 tag (#429)

* more kubelet metrics (#430)

* more kubelet metrics

* clean up new config

* fix nom issue when config is empty (#432)

* support multiple docker paths when docker root is updated thru knode (#433)

* Gangams/doc and other related updates (#434)

* bring back nodeslector changes for windows agent ds

* readme updates

* chart updates for azure cluster resourceid and region

* set cluster region during onboarding for managed clusters

* wip

* fix for onboarding script

* add sp support for the login

* update help

* add sp support for powershell

* script updates for sp login

* wip

* wip

* wip

* readme updates

* update the links to use ci_prod branch

* fix links

* fix image link

* some more readme updates

* add missing serviceprincipal in ps scripts (#435)

* fix telemetry bug (#436)

* Gangams/readmeupdates non aks 09162020 (#437)

* changes for ciprod09162020 non-aks release

* fix script to handle cross sub scenario

* fix minor comment

* fix date in version file

* fix pr comments

* Gangams/fix weird conflicts (#439)

* separate build yamls for ci_prod branch (#415) (#416)

* [Merge] dev to prod for ciprod08072020 release (#424)

* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>

* fix quote issue for the region (#441)

* fix cpucapacity/limit bug (#442)

* grwehner/pv-usage-metrics (#431)

- Send persistent volume usage and capacity metrics to LA for PVs with PVCs at the pod level; config to include or exclude kube-system namespace.
- Send PV usage percentage to MDM if over the configurable threshold.
- Add PV usage recommended alert template.
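
The threshold behaviour in the list above can be sketched roughly as follows. This is an assumption-laden illustration, not the agent's code: the record field names (`namespace`, `pvc`, `used_bytes`, `capacity_bytes`), the function name, and the strict "greater than" comparison are all hypothetical.

```python
def pv_usage_over_threshold(records, threshold_percent, include_kube_system=False):
    """Return (pvc, usage_percent) pairs whose usage exceeds the threshold.

    Hypothetical record shape: dicts with namespace, pvc, used_bytes and
    capacity_bytes keys.
    """
    over = []
    for rec in records:
        # Optionally skip kube-system, mirroring the include/exclude setting.
        if rec["namespace"] == "kube-system" and not include_kube_system:
            continue
        # A PV reporting zero capacity cannot yield a meaningful percentage.
        if rec["capacity_bytes"] <= 0:
            continue
        percent = 100.0 * rec["used_bytes"] / rec["capacity_bytes"]
        if percent > threshold_percent:
            over.append((rec["pvc"], round(percent, 1)))
    return over
```

Only the entries returned here would be candidates for the MDM percentage metric; the usage and capacity metrics sent to LA are not gated by the threshold in the description above.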

* add new custom metric regions (#444)

* add new custom metric regions

* fix commas

* add 'Terminating' state (#443)

* Gangams/sept agent release tasks (#445)

* turnoff mdm nonsupported cluster types

* enable validation of server cert for ai ruby http client

* add kubelet operations total and total error metrics

* node selector label change

* label update

* wip

* wip

* wip

* revert quotes

* grwehner/pv-collect-volume-name (#448)

Collect and send the volume name as another tag for pvUsedBytes in InsightsMetrics, so that it can be displayed in the workload workbook. Does not affect the PV MDM metric

* Changes for september agent release (#449)

Moving from v1beta1 to v1 for health CRD
Adding timer for zero filling
Adding zero filling for PV metrics

* Gangams/arc k8s related scripts, charts and doc updates (#450)

* checksum annotations

* script update for chart from mcr

* chart updates

* update chart version to match with chart release

* script updates

* latest chart updates

* version updates for chart release

* script updates

* script updates

* doc updates

* doc updates

* update comments

* fix bug in ps script

* fix bug in ps script

* minor update

* release process updates

* use consistent name across scripts

* use consistent names

* Install CA certs from wireserver (#451)

* grwehner/pv-volume-name-in-mdm (#452)

Add volume name for PV to mdm dimensions and zero fill it

* Release changes for 10052020 release (#453)

* Release changes for 10052020 release

* remove redundant kubelet metrics as part of PR feedback

* Update onboarding_instructions.md (#456)

* Update onboarding_instructions.md

Updated the documentation to reflect where to update the config map.

* Update onboarding_instructions.md

* Update onboarding_instructions.md

* Update onboarding_instructions.md

Updated the link

* chart update for sept2020 release (#457)

* add missing version update in the script (#458)

* November release fixes - activate one agent, adx schema v2, win perf issue, syslog deactivation (#459)

* activate one agent, adx schema v2, win perf issue, syslog deactivation

* update chart

* remove hyphen for params in chart (#462)

Merging as it's a simple fix (remove hyphen)

* Changes for cutting a new build for ciprod10272020 release (#460)

* using latest stable version of msys2 (#465)

* fixing the windows-perf-dups (#466)

* chart updates related to new microsoft/charts repo (#467)

* Changes for creating 11092020 release (#468)

* MDM exception aggregation (#470)

* grwehner/mdm custom metric regions (#471)

Remove custom metrics region check for public cloud

* updating rs limit to 1gb (#474)

* grwehner/pv inventory (#455)

Add fluentd plugin to request persistent volume info from the kubernetes api and send to LA

* Gangams/fix for build release pipeline issue (#476)

* use isolated cdpx acr

* correct comment

* add pv fluentd plugin config to helm rs config (#477)

* add pv fluentd plugin to helm rs config

* helm rbac permissions for pv api calls

* Gangams/fix rs ooming (#473)

* optimize kpi

* optimize kube node inventory

* add flags for events, deployments and hpa

* have separate function parseNodeLimits

* refactor code

* fix crash

* fix bug with service name

* fix bugs related to get service name

* update oom fix test agent

* debug logs

* fix service label issue

* update to latest agent and enable ephemeral annotation

* change stream size to 200 from 250

* update yaml

* adjust chunksizes

* add ruby gc env

* yaml changes for cioomtest11282020-3

* telemetry to track pods latency

* service count telemetry

* rename variables

* wip

* nodes inventory telemetry

* configmap changes

* add emit streams in configmap

* yaml updates

* fix copy and paste bug

* add todo comments

* fix node latency telemetry bug

* update yaml with latest test image

* fix bug

* upping rs memory change

* fix mdm bug with final emit stream

* update to latest image

* fix pr feedback

* fix pr feedback

* rename health config to agent config

* fix max allowed hpa chunk size

* update to use 1k pod chunk since validated on 1.18+

* remove debug logs

* minor updates

* move defaults to common place

* chart updates

* final oomfix agent

* update to use prod image so that can be validated with build pipeline

* fix typo in comment

* Gangams/enable arc onboarding to ff (#478)

* wip

* updates

* trigger login if the ctx cloud not same as specified cloud

* add missed commit

* Convert PV type dictionary to json for telemetry so it shows up in logs (#480)

* fix 2 windows tasks - 1) Don't log to termination log 2) enable ADX route for containerlogs in windows (for O365) (#482)

* fix ci envvar collection in large pods (#483)

* grwehner/jan agent tasks (#481)

- Windows agent fix to use log filtering settings in config map.
- Error handling for kubelet_utils get_node_capacity in case /metrics/cadvisor endpoint fails.
- Remove env variable for workspace key for windows agent

* updating fbit version and cpu limit (#485)

* reverting to older version (#487)

* Gangams/add fbsettings configurable via configmap (#486)

* wip

* fbit config settings

* add config warn message

* handle one config provided but not other

* fixed pr feedback

* fix copy paste error

* rename config parameter names

* fix typo

* fix fbit crash in helm path

* fix nil check

* Gangams/jan agent release tasks (#484)

* wip

* explicit amd64 affinity for hybrid workloads

* fix space issue

* wip

* revert vscode setting file

* remove per container logs in ci (#488)

* updates for ciprod01112021 release (#489)

* new yaml files (#491)

* Use cloud-specific instrumentation keys (#494)

If APPLICATIONINSIGHTS_AUTH_URL is set and non-empty, the agent now fetches a custom IKey from the URL stored in that variable.
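
A minimal sketch of that lookup is shown below. Only the environment variable name comes from the commit message; the function names, the default-key parameter, and the assumption that the URL returns the key as a plain-text body are hypothetical.

```python
import os
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Fetch a URL and return its body as stripped text (assumed plain text)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def resolve_ikey(default_ikey, env=os.environ, fetch=fetch_text):
    """Return the custom IKey when the auth URL is set and non-empty.

    Falls back to default_ikey otherwise. How the real agent handles a failed
    fetch is not described in the source, so errors simply propagate here.
    """
    url = (env.get("APPLICATIONINSIGHTS_AUTH_URL") or "").strip()
    if not url:
        return default_ikey
    return fetch(url)
```

Passing `env` and `fetch` as parameters keeps the sketch testable without network access.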

* upgrade apt to latest version (#492)

* upgrade apt to latest version

* fix pr feedback

* Gangams/add support for extension msi for arc k8s cluster (#495)

* wip

* add env var for the arc k8s extension name

* chart update

* extension msi updates

* fix bug

* revert chart and image to prod version

* minor text changes

* image tag to prod

* wip

* wip

* wip

* wip

* final updates

* fix whitespaces

* simplify crd yaml

* Gangams/arm template arc k8s extension (#496)

* arm templates for arc k8s extension

* update to use official extension type name

* update

* add identity property

* add proxyendpointurl parameter

* add default values

* Gangams/aks monitoring via policy (#497)

* enable monitoring through policy

* wip

* handle tags

* wip

* add alias

* wip

* working

* updates

* working

* with deployment name

* doc updates

* doc updates

* fix typo in the docs

* revert to use operatingSystem from osImage for node os telemetry (#498)

* Container log v2 schema changes (#499)

* make pod name in mdsd definition a str for consistency. msgp has no type checking, as it carries type metadata in the message itself.

* Add priority class to the daemonsets (#500)

* Add priority class to the daemonsets

Add a priority class for omsagent and have the daemonsets use this
to be sure to schedule the pods.

Daemonset pods are constrained in scheduling to run on specific
nodes.  This is done by the daemonset controller.  When a node shows
up it will create a pod with a strong affinity to that node.  When a
node goes away, it will delete the pod with the node affinity to that
node.

Kubernetes pod scheduling does not know it is a daemonset but it does
know it is tied to a specific node.  With default scheduling, it is
possible for the pods to be "frozen out" of a node because the node
already is full.  This can happen because "normal" pods may already
exist and are looking for a node to get scheduled on when a node is
added to the cluster.  The daemonset controller only creates its
pod for the node at around the same time.  The kubernetes
scheduler is running async from all of this and thus there can be a
race as to who gets scheduled on the node.

The pod priority class (and thus the pod priority) is a way to indicate
that the pod has a higher scheduling priority than a default pod.

By default, all pods are at priority 0.  Higher numbers are higher
priority.  Setting the priority to something greater than zero will
allow the omsagent daemonsets to win a race against "normal" pods for
scheduled resources on a node - and will also allow for graceful
eviction in the case the node is too full.

Without this, omsagent can be left off a node in clusters that are
very busy, especially in dynamic scaling situations.

I did not test the windows pod as we have no windows clusters.
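
The race described in this message can be illustrated with a toy model. The sketch below is a deliberate simplification and not the Kubernetes scheduler: it assumes a single node, a single CPU resource, and pending pods considered in priority order; the pod names, CPU figures, and the priority value 1000 are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    priority: int = 0  # default priority when no priority class is set
    cpu: int = 1       # requested CPU units (illustrative)


def schedule(pending, node_cpu):
    """Place pods on one node, highest priority first.

    sorted() is stable, so pods with equal priority keep their arrival order.
    Returns (placed_names, left_out_names).
    """
    placed, left_out = [], []
    for pod in sorted(pending, key=lambda p: -p.priority):
        if pod.cpu <= node_cpu:
            node_cpu -= pod.cpu
            placed.append(pod.name)
        else:
            left_out.append(pod.name)
    return placed, left_out


def demo():
    # Three "normal" pods are already waiting when the node appears; the
    # daemonset pod is created at about the same time and arrives last.
    normal = [Pod(f"app-{i}") for i in range(3)]
    without_priority = schedule(normal + [Pod("omsagent")], node_cpu=3)
    with_priority = schedule(normal + [Pod("omsagent", priority=1000)], node_cpu=3)
    return without_priority, with_priority
```

In this model the default-priority omsagent pod loses the race and is left out, while the same pod with a higher priority is placed first and a normal pod waits instead.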

* CR feedback

* fix node metric issue (#502)

* Bug fixes for Feb release (#504)

* bug fix for mdm metrics with no limits

* fix exception bug

* Gangams/feb 2021 agent bug fix (#505)

* fix npe in getKubeServiceRecords

* use image fields from spec

* fix typo

* cover all cases

* handle scenario only digest specified

* changes for release -ciprod02232021 (#506)

* Gangams/e2e test framework (#503)

* add agent e2e fw and tests

* doc and script updates

* add validation script

* doc updates

* yaml updates

* fix typo

* doc updates

* more doc updates

* add ISTEST for helm chart to use arc conf

* refactor test code

* fix pr feedback

* fix pr feedback

* fix pr feedback

* fix pr feedback

* scrape new kubelet pod count metric name (#508)

* Adding explicit json output to az commands as the script fails if az is configured with Table output #409 (#513)

* Gangams/arc proxy contract and token renewal updates (#511)

* fix issue with crd status updates

* handle renewal token delays

* add proxy contract

* updates for proxy cert for linux

* remove proxycert related changes

* fix whitespace issue

* fix whitespace issue

* remove proxy in arm template

* doc updates for microsoft charts repo release (#512)

* doc updates for microsoft charts repo release

* wip

* Update enable-monitoring.sh (#514)

Lines 314 and 343 seem to have trailing spaces for some subscriptions, which causes the script to exit even in valid scenarios

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Prometheus scraping from sidecar and OSM changes (#515)

* add liveness timeout for exec (#518)

* chart and other updates (#519)

* Saaror osmdoc (#523)

* Create ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Add files via upload

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* telemetry bug fix (#527)

* Fix conflicting logrotate settings (#526)

The node and the omsagent container each have a cron.daily file that rotates certain logs daily. Both sets of settings cover some of the same files in /var/log (mounted from the node with read/write access), so rotation fails when both try to rotate at the same time. The /var/log/*.1 file is then written to forever. Because these files are always written to and never rotated, they cause high memory usage on the node after a while.

This fix removes the container logrotate settings for /var/log, which the container does not write to.

* bug fix (#528)

* Gangams/arc ev2 deployment (#522)

* ev2 deployment for arc k8s extension

* fix charts path issue

* rename scripts tar

* add notifications

* fix line endings

* fix line endings

* update with prod repo

* fix file endings

* added liveness and telemetry for telegraf (#517)

* added liveness and telemetry for telegraf

* code transfer

* removed windows liveness probe

* done

* Windows metric fix (#530)

* changes

* about to remove container fix

* moved caching code to existing loop

* removed un-necessary changes

* removed a few more un-necessary changes

* added windows node check

* fixed a bug

* everything works confirmed

* OSM doc update (#533)

* Adding MDM metrics for threshold violation (#531)

* Rashmi/april agent 2021 (#538)

* add Read_from_Head config for all fluentbit tail plugins (#539)

See the commit message of: fluent/fluent-bit@70e33fa
for details explaining the fluentbit change and what Read_from_Head does when set to true.

* fix programdata mount issue on containerd win nodes (#542)

* Update sidecar mem limits  (#541)

* David/release 4 22 2021 (#544)

* updating image tag and agent version

* updated liveness probe

* updated release notes again

* fixed date in version file

* 1m, 1m, 1s by default (#543)

* 1m, 1m, 1s by default

* setting default through a different method

* David/aad stage 1 release (#556)

* update to latest omsagent, add eastus2 to mdsd regions

* copied oneagent bits to a CI repository release

* mdsd inmem mode

* yaml for cl scale test

* yaml for cl scale test

* reverting dockerProviderVersion version to 15.0.0

* prepping for release (updated image version, dockerProviderVersion, and release notes)

* container log scaletest yamls

* forgot to update image version in chart

* fixing windows tag in dockerfile, changing release notes wording

* missed windows tag in one more place

* forgot to change the windows dockerProviderVersion back

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Update ReleaseNotes.md (#558)

fix imagetag in the release notes

* Add wait time for telegraf and also force mdm egress to use tls 1.2 (#560)

* Add wait time for telegraf and also force mdm egress to use tls 1.2

* add wait for all telegraf dependencies across all containers (ds & rs)

* remove ssl change so we don't include it as part of the other fix until we test with att nodes.

* partially disabled telegraf liveness probe check, we'll still have telemetry but the probe won't fail if telegraf isn't running (#561)

* changes for 05202021 release (#563)

* changes for 05202021 release

* fixed typos

* Rashmi/jedi wireserver (#566)

* Update ReadMe.md (#565)

* Update ReadMe.md

* Update ReadMe.md

Included feedback from OSM team and Fixed

* Gangams/aad stage2 full switch to mdsd (#559)

* full switch to mdsd, upgrade to ruby v1 & omsagent removal

* add odsdirect as fallback option

* cleanup

* cleanup

* move customRegion to stage3

* updates related to containerlog route

* make xml eventschema consistent

* add buffer settings

* address HTTPServerException deprecation in ruby 2.6

* update to official mdsd version

* fix log message issue

* fix pr feedback

* get rid of unused code from omscommon

* fix pr feedback

* fix pr feedback

* clean up

* clean up

* fix missing conf

* Send perf metrics to MDM from windows daemonset (#568)

* updating json gem to address CVE-2020-10663 (#567)

* updating json gem to address CVE-2020-10663

* updating json gem to address CVE-2020-10663

* update recommended alerts readme (#570)

@dcbrown16 pointed out that this page links to the wrong document in [this issue](#475). The content in the currently linked page is identical to the page which should be linked, so it's a simple fix.

* trying again to fix the json gem (#571)

* trying again to fix the json gem

* removing installation of newer json gem

* Addressing PR comments for - #568 (#569)

* Mem_Buf_limit  is configurable via ConfigMap (#574)

* add log rotation settings for fluentd logs (#577)

* Gangams/release 06112021 (#578)

* updates related to ciprod06112021 release

* minor update

* release note update (#579)

* Make sidecar fluentbit chunk size configurable (#573)

* Fix vulnerabilities (#583)

* test

* test1

* test-2

* test-3

* 3

* 4

* test

* 2

* 3

* 4

* 5

* 6

* rename gem for windows

* fix

* fix

* Windows build optimization (#582)

* fix windows build failure due to msys2 version

* Fix telegraf startup issue when endpoint is unreachable (#587)

* revert fbit tail plugins defaults to std defaults (#586)

* fixed another bug (#593)

* feat: add new metrics to MDM for allocatable % calculation of cpu and memory usage (#584)

* feat: allocatable cpu and memory % metrics for MDM

* maybe

* linux is working

* windows....

* some more

* comment

* better

* syntax

* ruby

* revert omsagent.yaml

* comments

* pr feedback

* pr feedback

* testing msys2 version update

* better

* update adx sdk for perf issue (#601)

* remove md check

* Gangams/release notes update for hotfix (#596)

* release notes updates

* release notes updates for ciprod06112021-1

* Cherry picking hotfix changes to ci_dev (#605)

* release changes (#607)

* Gangams/aad stage3 msi auth (#585)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* refactor the windows agent ingestion token code

* code cleanup

* fix build errors

* code clean up

* code clean up

* code clean up

* code clean up

* more refactoring

* fix bug

* fix bug

* add debug logs

* add nil checks

* revert changes

* revert yaml change since this added in aks side

* fix pr feedback

* fix pr feedback

* refine retry code

* update mdsd env as per official build

* cleanup

* update env vars per mdsd

* update with mdsd official build

* skip cert gen & renewal incase of aad msi auth

* add nil check

* cherry windows agent nodeip issue

* fix merge issue

Co-authored-by: rashmichandrashekar <[email protected]>

* Gangams/remove chart version dependency (#589)

* remove chart version dependency

* remove unused code

* fix resource type

* fix

* handle weird cli chars

* update release process

* Gangams/july 2021 release tasks 3 (#613)

* use artifact and pipeline creds for image push

* minor update

* add vuln fix here so that pr can be merged

* remove un-used output plugin (#614)

* fix telegraf telemetry and improve fluentd liveness (#611)

* fix telegraf telemetry and improve fluentd liveness

* address identified vuln with libsystemd0

* fix exported image file extension

* Gangams/july 2021 release tasks 2 (#612)

* tail rs mdsd err logs

* configure mdsd log rotation

* log rotation for mdsd log files

* Fix out_oms.go dependency vulnerabilities (#623)

* revert libsystemd0 update (#616)

* updates for ci-prod release instructions (#619)

* cherry pick changes from ci_prod (#622)

* Support az login for passwords starting with dash ('-') (#626)

Co-authored-by: Vladimir Babichev <[email protected]>

* Gangams/add telemetry fbit settings (#628)

* add telemetry to track fbit settings

* add telemetry to track fbit settings

* check onboarding status (#629)

* Gangams/arc k8s conformance test updates (#617)

* conf test updates

* clean up

* wip

* update with mcr cidev image

* handle log path

* cleanup

* clean up

* wip

* working

* update for mcr image

* minor

* image update

* handle latency of connected cluster resource creation

* update conftest image

* upgrade golang version for windows in pipeline build and locally (#630)

* Updating a link in Readme.md (#632)

The link to the build pipelines now goes directly to our build pipelines (instead of to all github-private pipelines)

* Updating omsagent yaml to have parity with omsagent yaml file in AKS RP (#615)

* Unit test tooling (#625)

Added tooling and examples for unit tests

* run unit tests after a merge too (#634)

* flag stale PRs & issues

* Adding script to collect logs (for troubleshooting) (#636)

* added script for collecting logs

* added windows daemonset and prometheus sidecar, as well as some explanatory prints

* added kubectl describe and kubectl logs output

* changed message to make it clearer that some errors are expected

* Sarah/ev2 (#640)

* ev2 artifacts for release pipeline

* update parameters reference

* add artifacts tar file

* changes to rollout and service model

* change agentimage path

* adding agentimage to artifact script

* removing charts from tarball

* change script to use blob storage

* change blob variables

* echo variables

* change blob uri

* use release id for blob prefix

* change to delete blob file

* add check for if blob storage file exists

* fix script errors

* update check for file in storage

* change true check

* comments and change storage account info to pipeline variables

* Changes for windows tar file

* PR changes

* documenting fbit tail plugin configmap settings. (#638)

* documenting fbit tail plugin configmap settings.

* Install unzip package on shell extension (#642)

* Changing installation in ev2 script (#644)

* Adjust release pipeline to use cdpx acr (#647)

* Adjust release pipeline to use cdpx acr

* Adjust release pipeline to use cdpx acr

* Update CDPX ACR path

* Add check for cdpx repo variable

* Sarah/ev2 prod (#649)

* Ev2 changes for prod

* CDPX repo naming change (#652)

* Sarah/ev2 update (#654)

* remove acr name from repo path

* add check to make sure tag does not exist in mcr repo

* change tag syntax for mcr repo check (#655)

* Gangams/optimize win livenessprobe (#653)

* livenessprobe optimization

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* Gangams/addon token adapter image tag to telemetry (#656)

* addon token adapter image tag

* addon token adapter image tag

* Sarah/ev2 helm (#658)

* Use MSI for Arc Release

* Use CIPROD_ACR AME subscription for shell extension

* remove extra line endings

* Sarah/ev2 pipeline (#661)

* testing build artifact dir changes

* add .pipelines directory and omsagent.yaml to build artifacts

* add charts directory to build artifacts (#662)

* Sarah/remove cdpx creds (#664)

* don't use cdpx acr creds from kv

* add e2etest.yaml to build output

* keep cdpx creds for now

* chart updates for rbac api version change (#660)

* chart updates for rbac api version change

* include windows ds for arc

* proxy support (for non-aks) (#665)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* initial proxy support

* merge?

* cleaning up some files which should've merged differently

* proxy should be working, but most tables don't have any data. About to merge, maybe whatever was wrong is now fixed

* linux AMA proxy works

* about to merge

* proxy support appears to be working, final mdsd build location will still change

* removing some unnecessary changes

* forgot to remove one last change

* redirected mdsd stderr to stdout instead of stdin

* addressing proxy password location comment

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>
Co-authored-by: bragi92 <[email protected]>
Co-authored-by: saaror <[email protected]>
Co-authored-by: Grace Wehner <[email protected]>
Co-authored-by: deagraw <[email protected]>
Co-authored-by: David Michelman <[email protected]>
Co-authored-by: Michael Sinz <[email protected]>
Co-authored-by: Nicolas Yuen <[email protected]>
Co-authored-by: seenu433 <[email protected]>
Co-authored-by: Tsubasa Nomura <[email protected]>
Co-authored-by: Vladimir <[email protected]>
Co-authored-by: Vladimir Babichev <[email protected]>
Co-authored-by: sarahpeiffer <[email protected]>
jatakiajanvi12 pushed a commit that referenced this pull request Dec 2, 2022
* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

* Gangams/cluster creation scripts (#414)

* onprem k8s script

* script updates

* scripts for creating non-aks clusters

* fix minor text update

* updates

* script updates

* fix

* script updates

* fix scripts to install docker

* fix: Pin to a particular version of ltsc2019 by SHA (#427)

* enable collecting npm metrics (optionally) (#425)

* enable collecting npm metrics (optionally)

* fix default enrichment value

* fix adx

* Saaror patch 3 (#426)

* Create README.MD

Creating content for Kubecon lab

* Update README.MD

* Update README.MD

* Gangams/add containerd support to windows agent (#428)

* wip

* wip

* wip

* wip

* bug fix related to uri

* wip

* wip

* fix bug with ignore cert validation

* logic to ignore cert validation

* minor

* fix minor debug log issue

* improve log message

* debug message

* fix bug with nullorempty check

* remove debug statements

* refactor parsers

* add debug message

* clean up

* chart updates

* fix formatting issues

* Gangams/arc k8s metrics  (#413)

* cluster identity token

* wip

* fix exception

* fix exceptions

* fix exception

* fix bug

* fix bug

* minor update

* refactor the code

* more refactoring

* fix bug

* typo fix

* fix typo

* wait for 1min after token renewal request

* add proxy support for arc k8s mdm endpoint

* avoid additional get call

* minor line ending fix

* wip

* have separate log for arc k8s cluster identity

* fix bug on creating crd resource

* remove update permission since not required

* fixed some bugs

* fix pr feedback

* remove list since its not required

* fix: Reverting back to ltsc2019 tag (#429)

* more kubelet metrics (#430)

* more kubelet metrics

* clean up new config

* fix nom issue when config is empty (#432)

* support multiple docker paths when docker root is updated thru knode (#433)

* Gangams/doc and other related updates (#434)

* bring back nodeselector changes for windows agent ds

* readme updates

* chart updates for azure cluster resourceid and region

* set cluster region during onboarding for managed clusters

* wip

* fix for onboarding script

* add sp support for the login

* update help

* add sp support for powershell

* script updates for sp login

* wip

* wip

* wip

* readme updates

* update the links to use ci_prod branch

* fix links

* fix image link

* some more readme updates

* add missing serviceprincipal in ps scripts (#435)

* fix telemetry bug (#436)

* Gangams/readmeupdates non aks 09162020 (#437)

* changes for ciprod09162020 non-aks release

* fix script to handle cross sub scenario

* fix minor comment

* fix date in version file

* fix pr comments

* Gangams/fix weird conflicts (#439)

* separate build yamls for ci_prod branch (#415) (#416)

* [Merge] dev to prod for ciprod08072020 release (#424)

* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>

* fix quote issue for the region (#441)

* fix cpucapacity/limit bug (#442)

* grwehner/pv-usage-metrics (#431)

- Send persistent volume usage and capacity metrics to LA for PVs with PVCs at the pod level; config to include or exclude kube-system namespace.
- Send PV usage percentage to MDM if over the configurable threshold.
- Add PV usage recommended alert template.

* add new custom metric regions (#444)

* add new custom metric regions

* fix commas

* add 'Terminating' state (#443)

* Gangams/sept agent release tasks (#445)

* turnoff mdm nonsupported cluster types

* enable validation of server cert for ai ruby http client

* add kubelet operations total and total error metrics

* node selector label change

* label update

* wip

* wip

* wip

* revert quotes

* grwehner/pv-collect-volume-name (#448)

Collect and send the volume name as another tag for pvUsedBytes in InsightsMetrics, so that it can be displayed in the workload workbook. Does not affect the PV MDM metric

* Changes for september agent release (#449)

Moving from v1beta1 to v1 for health CRD
Adding timer for zero filling
Adding zero filling for PV metrics

* Gangams/arc k8s related scripts, charts and doc updates (#450)

* checksum annotations

* script update for chart from mcr

* chart updates

* update chart version to match with chart release

* script updates

* latest chart updates

* version updates for chart release

* script updates

* script updates

* doc updates

* doc updates

* update comments

* fix bug in ps script

* fix bug in ps script

* minor update

* release process updates

* use consistent name across scripts

* use consistent names

* Install CA certs from wireserver (#451)

* grwehner/pv-volume-name-in-mdm (#452)

Add volume name for PV to mdm dimensions and zero fill it

* Release changes for 10052020 release (#453)

* Release changes for 10052020 release

* remove redundant kubelet metrics as part of PR feedback

* Update onboarding_instructions.md (#456)

* Update onboarding_instructions.md

Updated the documentation to reflect where to update the config map.

* Update onboarding_instructions.md

* Update onboarding_instructions.md

* Update onboarding_instructions.md

Updated the link

* chart update for sept2020 release (#457)

* add missing version update in the script (#458)

* November release fixes - activate one agent, adx schema v2, win perf issue, syslog deactivation (#459)

* activate one agent, adx schema v2, win perf issue, syslog deactivation

* update chart

* remove hyphen for params in chart (#462)

Merging as it's a simple fix (remove hyphen)

* Changes for cutting a new build for ciprod10272020 release (#460)

* using latest stable version of msys2 (#465)

* fixing the windows-perf-dups (#466)

* chart updates related to new microsoft/charts repo (#467)

* Changes for creating 11092020 release (#468)

* MDM exception aggregation (#470)

* grwehner/mdm custom metric regions (#471)

Remove custom metrics region check for public cloud

* updating rs limit to 1gb (#474)

* grwehner/pv inventory (#455)

Add fluentd plugin to request persistent volume info from the kubernetes api and send to LA

* Gangams/fix for build release pipeline issue (#476)

* use isolated cdpx acr

* correct comment

* add pv fluentd plugin config to helm rs config (#477)

* add pv fluentd plugin to helm rs config

* helm rbac permissions for pv api calls

* Gangams/fix rs ooming (#473)

* optimize kpi

* optimize kube node inventory

* add flags for events, deployments and hpa

* have separate function parseNodeLimits

* refactor code

* fix crash

* fix bug with service name

* fix bugs related to get service name

* update oom fix test agent

* debug logs

* fix service label issue

* update to latest agent and enable ephemeral annotation

* change stream size to 200 from 250

* update yaml

* adjust chunksizes

* add ruby gc env

* yaml changes for cioomtest11282020-3

* telemetry to track pods latency

* service count telemetry

* rename variables

* wip

* nodes inventory telemetry

* configmap changes

* add emit streams in configmap

* yaml updates

* fix copy and paste bug

* add todo comments

* fix node latency telemetry bug

* update yaml with latest test image

* fix bug

* upping rs memory change

* fix mdm bug with final emit stream

* update to latest image

* fix pr feedback

* fix pr feedback

* rename health config to agent config

* fix max allowed hpa chunk size

* update to use 1k pod chunk since validated on 1.18+

* remove debug logs

* minor updates

* move defaults to common place

* chart updates

* final oomfix agent

* update to use prod image so that can be validated with build pipeline

* fix typo in comment

* Gangams/enable arc onboarding to ff (#478)

* wip

* updates

* trigger login if the ctx cloud not same as specified cloud

* add missed commit

* Convert PV type dictionary to json for telemetry so it shows up in logs (#480)

* fix 2 windows tasks - 1) Dont log to termination log 2) enable ADX route for containerlogs in windows (for O365) (#482)

* fix ci envvar collection in large pods (#483)

* grwehner/jan agent tasks (#481)

- Windows agent fix to use log filtering settings in config map.
- Error handling for kubelet_utils get_node_capacity in case the /metrics/cadvisor endpoint fails.
- Remove env variable for workspace key for windows agent

* updating fbit version and cpu limit (#485)

* reverting to older version (#487)

* Gangams/add fbsettings configurable via configmap (#486)

* wip

* fbit config settings

* add config warn message

* handle one config provided but not other

* fixed pr feedback

* fix copy paste error

* rename config parameter names

* fix typo

* fix fbit crash in helm path

* fix nil check

* Gangams/jan agent release tasks (#484)

* wip

* explicit amd64 affinity for hybrid workloads

* fix space issue

* wip

* revert vscode setting file

* remove per container logs in ci (#488)

* updates for ciprod01112021 release (#489)

* new yaml files (#491)

* Use cloud-specific instrumentation keys (#494)

If APPLICATIONINSIGHTS_AUTH_URL is set/non-empty then the agent will now grab a custom IKey from a URL stored in APPLICATIONINSIGHTS_AUTH_URL
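
The lookup described above can be sketched as follows. This is an illustrative Python sketch only, not the agent's actual code: the fallback to a built-in key when the variable is unset, the plain-text response body, and the names `DEFAULT_IKEY`, `fetch_text`, and `resolve_ikey` are all assumptions made for the example.

```python
import os
import urllib.request

# Hypothetical placeholder; the real default key is not given in the source.
DEFAULT_IKEY = "default-ikey-placeholder"


def fetch_text(url):
    """Fetch the URL and return its body as stripped text (assumed format)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8").strip()


def resolve_ikey(env=os.environ, fetch=fetch_text):
    """Return a custom IKey if APPLICATIONINSIGHTS_AUTH_URL is set and non-empty.

    Falling back to DEFAULT_IKEY otherwise is an assumption of this sketch.
    """
    url = env.get("APPLICATIONINSIGHTS_AUTH_URL", "").strip()
    if not url:
        return DEFAULT_IKEY
    return fetch(url)
```

Passing `env` and `fetch` as parameters keeps the decision logic testable without network access.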

* upgrade apt to latest version (#492)

* upgrade apt to latest version

* fix pr feedback

* Gangams/add support for extension msi for arc k8s cluster (#495)

* wip

* add env var for the arc k8s extension name

* chart update

* extension msi updates

* fix bug

* revert chart and image to prod version

* minor text changes

* image tag to prod

* wip

* wip

* wip

* wip

* final updates

* fix whitespaces

* simplify crd yaml

* Gangams/arm template arc k8s extension (#496)

* arm templates for arc k8s extension

* update to use official extension type name

* update

* add identity property

* add proxyendpointurl parameter

* add default values

* Gangams/aks monitoring via policy (#497)

* enable monitoring through policy

* wip

* handle tags

* wip

* add alias

* wip

* working

* updates

* working

* with deployment name

* doc updates

* doc updates

* fix typo in the docs

* revert to use operatingSystem from osImage for node os telemetry (#498)

* Container log v2 schema changes (#499)

* make pod name in the mdsd definition a str for consistency. msgp has no type checking, as the type metadata is carried in the message itself.

* Add priority class to the daemonsets (#500)

* Add priority class to the daemonsets

Add a priority class for omsagent and have the daemonsets use this
to be sure to schedule the pods.

Daemonset pods are constrained in scheduling to run on specific
nodes.  This is done by the daemonset controller.  When a node shows
up it will create a pod with a strong affinity to that node.  When a
node goes away, it will delete the pod with the node affinity to that
node.

Kubernetes pod scheduling does not know it is a daemonset but it does
know it is tied to a specific node.  With default scheduling, it is
possible for the pods to be "frozen out" of a node because the node
already is full.  This can happen because "normal" pods may already
exist and are looking for a node to get scheduled on when a node is
added to the cluster.  The daemonset controller will only first
create the pod for the node at around the same time.  The kubernetes
scheduler is running async from all of this and thus there can be a
race as to who gets scheduled on the node.

The pod priority class (and thus the pod priority) is a way to indicate
that the pod has a higher scheduling priority than a default pod.

By default, all pods are at priority 0.  Higher numbers are higher
priority.  Setting the priority to something greater than zero will
allow the omsagent daemonsets to win a race against "normal" pods for
scheduled resources on a node - and will also allow for graceful
eviction in the case the node is too full.

Without this, omsagent can be left out of a node in clusters that are
very busy, especially in dynamic scaling situations.
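
The race described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the Kubernetes scheduler: it places pending pods on a single node, highest priority first, and only accounts for CPU. The pod names, CPU requests, and the priority value 1000 are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    cpu: int           # requested CPU in millicores
    priority: int = 0  # pods default to priority 0


def schedule(pods, node_cpu):
    """Toy scheduler: fill one node, considering higher-priority pods first."""
    placed = []
    free = node_cpu
    # sorted() is stable, so pods with equal priority keep their arrival order.
    for pod in sorted(pods, key=lambda p: -p.priority):
        if pod.cpu <= free:
            placed.append(pod.name)
            free -= pod.cpu
    return placed


pending = [Pod("app-1", 600), Pod("app-2", 400), Pod("omsagent", 300)]

# All pods at priority 0: the "normal" pods fill the node first.
print(schedule(pending, node_cpu=1000))  # ['app-1', 'app-2']

# With a higher priority, the daemonset pod wins the race for the node.
pending[2].priority = 1000
print(schedule(pending, node_cpu=1000))  # ['omsagent', 'app-1']
```

In the first run the agent pod is frozen out because the node is already full; in the second it is placed first and a default-priority pod waits instead.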

I did not test the windows pod as we have no windows clusters.

* CR feedback

* fix node metric issue (#502)

* Bug fixes for Feb release (#504)

* bug fix for mdm metrics with no limits

* fix exception bug

* Gangams/feb 2021 agent bug fix (#505)

* fix npe in getKubeServiceRecords

* use image fields from spec

* fix typo

* cover all cases

* handle scenario only digest specified

* changes for release -ciprod02232021 (#506)

* Gangams/e2e test framework (#503)

* add agent e2e fw and tests

* doc and script updates

* add validation script

* doc updates

* yaml updates

* fix typo

* doc updates

* more doc updates

* add ISTEST for helm chart to use arc conf

* refactor test code

* fix pr feedback

* fix pr feedback

* fix pr feedback

* fix pr feedback

* scrape new kubelet pod count metric name (#508)

* Adding explicit json output to az commands as the script fails if az is configured with Table output #409 (#513)

* Gangams/arc proxy contract and token renewal updates (#511)

* fix issue with crd status updates

* handle renewal token delays

* add proxy contract

* updates for proxy cert for linux

* remove proxycert related changes

* fix whitespace issue

* fix whitespace issue

* remove proxy in arm template

* doc updates for microsoft charts repo release (#512)

* doc updates for microsoft charts repo release

* wip

* Update enable-monitoring.sh (#514)

Lines 314 and 343 seem to have trailing spaces for some subscriptions, which causes the script to exit even in valid scenarios

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Prometheus scraping from sidecar and OSM changes (#515)

* add liveness timeout for exec (#518)

* chart and other updates (#519)

* Saaror osmdoc (#523)

* Create ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Add files via upload

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* telemetry bug fix (#527)

* Fix conflicting logrotate settings (#526)

The node and the omsagent container both have a cron.daily file to rotate certain logs daily. These settings are the same for some files in /var/log (mounted from the node with read/write access), causing the rotation to fail when both try to rotate at the same time. So then the /var/log/*.1 file is written to forever. Since these files are always written to and never rotated, it causes high memory usage on the node after a while.

This fix removes the container logrotate settings for /var/log, which the container does not write to.

* bug fix (#528)

* Gangams/arc ev2 deployment (#522)

* ev2 deployment for arc k8s extension

* fix charts path issue

* rename scripts tar

* add notifications

* fix line endings

* fix line endings

* update with prod repo

* fix file endings

* added liveness and telemetry for telegraf (#517)

* added liveness and telemetry for telegraf

* code transfer

* removed windows liveness probe

* done

* Windows metric fix (#530)

* changes

* about to remove container fix

* moved caching code to existing loop

* removed un-necessary changes

* removed a few more un-necessary changes

* added windows node check

* fixed a bug

* everything works confirmed

* OSM doc update (#533)

* Adding MDM metrics for threshold violation (#531)

* Rashmi/april agent 2021 (#538)

* add Read_from_Head config for all fluentbit tail plugins (#539)

See the commit message of: fluent/fluent-bit@70e33fa
for details explaining the fluentbit change and what Read_from_Head does when set to true.

* fix programdata mount issue on containerd win nodes (#542)

* Update sidecar mem limits  (#541)

* David/release 4 22 2021 (#544)

* updating image tag and agent version

* updated liveness probe

* updated release notes again

* fixed date in version file

* 1m, 1m, 1s by default (#543)

* 1m, 1m, 1s by default

* setting default through a different method

* David/aad stage 1 release (#556)

* update to latest omsagent, add eastus2 to mdsd regions

* copied oneagent bits to a CI repository release

* mdsd inmem mode

* yaml for cl scale test

* yaml for cl scale test

* reverting dockerProviderVersion version to 15.0.0

* prepping for release (updated image version, dockerProviderVersion, and release notes)

* container log scaletest yamls

* forgot to update image version in chart

* fixing windows tag in dockerfile, changing release notes wording

* missed windows tag in one more place

* forgot to change the windows dockerProviderVersion back

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Update ReleaseNotes.md (#558)

fix imagetag in the release notes

* Add wait time for telegraf and also force mdm egress to use tls 1.2 (#560)

* Add wait time for telegraf and also force mdm egress to use tls 1.2

* add wait for all telegraf dependencies across all containers (ds & rs)

* remove ssl change so we dont include as part of the other fix until we test with att nodes.

* partially disabled telegraf liveness probe check, we'll still have telemetry but the probe won't fail if telegraf isn't running (#561)

* changes for 05202021 release (#563)

* changes for 05202021 release

* fixed typos

* Rashmi/jedi wireserver (#566)

* Update ReadMe.md (#565)

* Update ReadMe.md

* Update ReadMe.md

Included feedback from OSM team and Fixed

* Gangams/aad stage2 full switch to mdsd (#559)

* full switch to mdsd, upgrade to ruby v1 & omsagent removal

* add odsdirect as fallback option

* cleanup

* cleanup

* move customRegion to stage3

* updates related to containerlog route

* make xml eventschema consistent

* add buffer settings

* address HTTPServerException deprecation in ruby 2.6

* update to official mdsd version

* fix log message issue

* fix pr feedback

* get rid of unused code from omscommon

* fix pr feedback

* fix pr feedback

* clean up

* clean up

* fix missing conf

* Send perf metrics to MDM from windows daemonset (#568)

* updating json gem to address CVE-2020-10663 (#567)

* updating json gem to address CVE-2020-10663

* updating json gem to address CVE-2020-10663

* update recommended alerts readme (#570)

@dcbrown16 pointed out that this page links to the wrong document in [this issue](#475). The content in the currently linked page is identical to the page which should be linked, so it's a simple fix.

* trying again to fix the json gem (#571)

* trying again to fix the json gem

* removing installation of newer json gem

* Addressing PR comments for - #568 (#569)

* Mem_Buf_limit  is configurable via ConfigMap (#574)

* add log rotation settings for fluentd logs (#577)

* Gangams/release 06112021 (#578)

* updates related to ciprod06112021 release

* minor update

* release note update (#579)

* Make sidecar fluentbit chunk size configurable (#573)

* Fix vulnerabilities (#583)

* test

* test1

* test-2

* test-3

* 3

* 4

* test

* 2

* 3

* 4

* 5

* 6

* rename gem for windows

* fix

* fix

* Windows build optimization (#582)

* fix windows build failure due to msys2 version

* Fix telegraf startup issue when endpoint is unreachable (#587)

* revert fbit tail plugins defaults to std defaults (#586)

* fixed another bug (#593)

* feat: add new metrics to MDM for allocatable % calculation of cpu and memory usage (#584)

* feat: allocatable cpu and memory % metrics for MDM

* maybe

* linux is working

* windows....

* some more

* comment

* better

* syntax

* ruby

* revert omsagent.yaml

* comments

* pr feedback

* pr feedback

* testing msys2 version update

* better

* update adx sdk for perf issue (#601)

* remove md check

* Gangams/release notes update for hotfix (#596)

* release notes updates

* release notes updates for ciprod06112021-1

* Cherry picking hotfix changes to ci_dev (#605)

* release changes (#607)

* Gangams/aad stage3 msi auth (#585)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* refactor the windows agent ingestion token code

* code cleanup

* fix build errors

* code clean up

* code clean up

* code clean up

* code clean up

* more refactoring

* fix bug

* fix bug

* add debug logs

* add nil checks

* revert changes

* revert yaml change since this added in aks side

* fix pr feedback

* fix pr feedback

* refine retry code

* update mdsd env as per official build

* cleanup

* update env vars per mdsd

* update with mdsd official build

* skip cert gen & renewal incase of aad msi auth

* add nil check

* cherry windows agent nodeip issue

* fix merge issue

Co-authored-by: rashmichandrashekar <[email protected]>

* Gangams/remove chart version dependency (#589)

* remove chart version dependency

* remove unused code

* fix resource type

* fix

* handle weird cli chars

* update release process

* Gangams/july 2021 release tasks 3 (#613)

* use artifact and pipeline creds for image push

* minor update

* add vuln fix here so that pr can be merged

* remove un-used output plugin (#614)

* fix telegraf telemetry and improve fluentd liveness (#611)

* fix telegraf telemetry and improve fluentd liveness

* address identified vuln with libsystemd0

* fix exported image file extension

* Gangams/july 2021 release tasks 2 (#612)

* tail rs mdsd err logs

* configure mdsd log rotation

* log rotation for mdsd log files

* Fix out_oms.go dependency vulnerabilities (#623)

* revert libsystemd0 update (#616)

* updates for ci-prod release instructions (#619)

* cherry pick changes from ci_prod (#622)

* Support az login for passwords starting with dash ('-') (#626)

Co-authored-by: Vladimir Babichev <[email protected]>

* Gangams/add telemetry fbit settings (#628)

* add telemetry to track fbit settings

* add telemetry to track fbit settings

* check onboarding status (#629)

* Gangams/arc k8s conformance test updates (#617)

* conf test updates

* clean up

* wip

* update with mcr cidev image

* handle log path

* cleanup

* clean up

* wip

* working

* update for mcr image

* minor

* image update

* handle latency of connected cluster resource creation

* update conftest image

* upgrade golang version for windows in pipeline build and locally (#630)

* Updating a link in Readme.md (#632)

The link to the build pipelines now goes directly to our build pipelines (instead of to all github-private pipelines)

* Updating omsagent yaml to have parity with omsagent yaml file in AKS RP (#615)

* Unit test tooling (#625)

Added tooling and examples for unit tests

* run unit tests after a merge too (#634)

* flag stale PRs & issues

* Adding script to collect logs (for troubleshooting) (#636)

* added script for collecting logs

* added windows daemonset and prometheus sidecar, as well as some explanatory prints

* added kubectl describe and kubectl logs output

* changed message to make it clearer that some errors are expected

* Sarah/ev2 (#640)

* ev2 artifacts for release pipeline

* update parameters reference

* add artifacts tar file

* changes to rollout and service model

* change agentimage path

* adding agentimage to artifact script

* removing charts from tarball

* change script to use blob storage

* change blob variables

* echo variables

* change blob uri

* use release id for blob prefix

* change to delete blob file

* add check for if blob storage file exists

* fix script errors

* update check for file in storage

* change true check

* comments and change storage account info to pipeline variables

* Changes for windows tar file

* PR changes

* documenting fbit tail plugin configmap settings. (#638)

* documenting fbit tail plugin configmap settings.

* Install unzip package on shell extension (#642)

* Changing installation in ev2 script (#644)

* Adjust release pipeline to use cdpx acr (#647)

* Adjust release pipeline to use cdpx acr

* Adjust release pipeline to use cdpx acr

* Update CDPX ACR path

* Add check for cdpx repo variable

* Sarah/ev2 prod (#649)

* Ev2 changes for prod

* CDPX repo naming change (#652)

* Sarah/ev2 update (#654)

* remove acr name from repo path

* add check to make sure tag does not exist in mcr repo

* change tag syntax for mcr repo check (#655)

* Gangams/optimize win livenessprobe (#653)

* livenessprobe optimization

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* Gangams/addon token adapter image tag to telemetry (#656)

* addon token adapter image tag

* addon token adapter image tag

* Sarah/ev2 helm (#658)

* Use MSI for Arc Release

* Use CIPROD_ACR AME subscription for shell extension

* remove extra line endings

* Sarah/ev2 pipeline (#661)

* testing build artifact dir changes

* add .pipelines directory and omsagent.yaml to build artifacts

* add charts directory to build artifacts (#662)

* Sarah/remove cdpx creds (#664)

* don't use cdpx acr creds from kv

* add e2etest.yaml to build output

* keep cdpx creds for now

* chart updates for rbac api version change (#660)

* chart updates for rbac api version change

* include windows ds for arc

* proxy support (for non-aks) (#665)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* initial proxy support

* merge?

* cleaning up some files which should've merged differently

* proxy should be working, but most tables don't have any data. About to merge, maybe whatever was wrong is now fixed

* linux AMA proxy works

* about to merge

* proxy support appears to be working, final mdsd build location will still change

* removing some unnecessary changes

* forgot to remove one last change

* redirected mdsd stderr to stdout instead of stdin

* addressing proxy password location comment

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Gangams/agent release ciprod10082021 & win-ciprod10082021 (#666)

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* use buildcommand for prod pipeline (#668)

* fixed merge issues. (#671) (#672)

* fix merge conflicts

* update with newimage tag

* changes related to mdsd version update (#673) (#674)

* Sarah/enable metrics (#675)

* add user assigned msi to yaml for pipeline

* update placeholders

* Gangams/chart updates oct2021 release (#676)

* chart updates for oct2021 release

* wip

* wip

* wip

* Gangams/msi mode mdsd crash fix (#677)

* update mdsd version which has fix for crash in msi mode

* image tag updates

* update to use extension GA api version (#679)

* Gangams/arm template msi onboarding (#659)

* wip

* wip

* working

* working

* working

* working

* working

* working

* shorten dcr prefix to DCR- to handle default workspace name length

* use MSCI- prefix similar to MSVMI- for dcr

* Gangams/conf test updates to handle sidecar (#681)

* wip

* test updates

* fix pr feedback

* fix pr feedback

* Fix scan break due to latest trivy changes

* Anjohans/configurable database name (#663)

* First cut at an implementation

* Reverting a change

* Moving a few lines to better align with cluster URI config

* Moving a few lines to better align with cluster URI config

* Adding an extra check that won't hurt

* Getting ADX database name from config rather than from secret

* Reverse the mangling done by editor

* Fixes to the code for reading the db name setting

* More fixes to the rb code for settings

* Tweaked and tested

* Code review

* Review follow-up

* Remove whitespace

* Gangams/troubelshooting script for arc k8s (#682)

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* doc updates

* doc updates

* wip

* wip

* update repo for issues

* fix minor one

* Sarah/remove cdpx creds (#685)

* remove download of cdpx creds

* fix: subtract number instead of string + update fluentd version 1.14.2 to fix security vulnerability (#686)

* fix: change default value to a number so that subtraction happens correctly

* update fluentd version to 1.14.2

* extra end statement

* safely set to float

* big decimal precision

* revert omsagent

* keep telemetry

* Faster Linux builds (part 1) (#687)

* moved docker image arg later on to enable docker build caching

* fixing image tag (doh)

* Sarah/fluentbit windows log (#688)

* upgrade fluentbit version for windows

* saving progress--fluent bit log tailing working for windows

* use configmap values for fluent-bit.conf where necessary and make necessary files common

* revert certificategenerator

* remove tomlparser-agent-config from linux folder

* clean up fluent.conf

* clean up fluent-bit.conf

* revert image tag

* fix agent tag

* make fluent bit flush interval configurable

* clean up unnecessary conf files

* remove unnecessary parts of fluent and fluent-bit conf

* log level back to info

* add fbit env variables for omsagent-win

* moving db files to var directory

* default to port 10250 & containerd for linux agent (#699)

* default to port 10250 & containerd

* fix pr feedback

* Updating pod annotation for latest agent version (#697)

* fix windows build failure due to msys2 version (#700)

* fix windows build failure due to msys2 version

* 20211130.0.0

* Jan agent tasks (#698)

* remove v1 fallback hidden option (#705)

* collect telemetry containerlog records with emptystamp (#703)

* collect telemetry containerlog records with emptystamp

* collect telemetry containerlog records with emptystamp

* Fixing telegraf bug for placeholder name (#706)

* Gangams/jan 2022 release tasks 3 (#702)

* add telemetry related to windows containers records

* add telemetry related to windows containers records

* containercount telemetry

* add explicit exit code in ps scripts

* node count telemetry

* telemetry for win cirecord 64KB or more

* metric to track wintelegraf metrics with tags 64kb

* metric to track wintelegraf metrics with tags 64kb

* fix pr feedback

* Gangams/jan 2022 release tasks 2 (#701)

* mdsd proc cpu and memory telemetry

* write ai logs to file and telemetry for mdsd proc

* write ai logs to file and telemetry for mdsd proc

* write ai logs to file and telemetry for mdsd proc

* fix pr feedback

* use name_prefix

* remove mdsd telemetry changes

* remove mdsd telemetry changes

* remove mdsd telemetry changes

* release updates for ciprod01312022 & win-ciprod01312022release (#707)

* release updates for ciprod01312022 release

* release updates for ciprod01312022 release

* fix pr feedback

* fix merge issue

* fix logger exception

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>
Co-authored-by: bragi92 <[email protected]>
Co-authored-by: saaror <[email protected]>
Co-authored-by: Grace Wehner <[email protected]>
Co-authored-by: deagraw <[email protected]>
Co-authored-by: David Michelman <[email protected]>
Co-authored-by: Michael Sinz <[email protected]>
Co-authored-by: Nicolas Yuen <[email protected]>
Co-authored-by: seenu433 <[email protected]>
Co-authored-by: Tsubasa Nomura <[email protected]>
Co-authored-by: Vladimir <[email protected]>
Co-authored-by: Vladimir Babichev <[email protected]>
Co-authored-by: sarahpeiffer <[email protected]>
Co-authored-by: Anders Johansen <[email protected]>
jatakiajanvi12 pushed a commit that referenced this pull request Dec 2, 2022
* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

* Gangams/cluster creation scripts (#414)

* onprem k8s script

* script updates

* scripts for creating non-aks clusters

* fix minor text update

* updates

* script updates

* fix

* script updates

* fix scripts to install docker

* fix: Pin to a particular version of ltsc2019 by SHA (#427)

* enable collecting npm metrics (optionally) (#425)

* enable collecting npm metrics (optionally)

* fix default enrichment value

* fix adx

* Saaror patch 3 (#426)

* Create README.MD

Creating content for Kubecon lab

* Update README.MD

* Update README.MD

* Gangams/add containerd support to windows agent (#428)

* wip

* wip

* wip

* wip

* bug fix related to uri

* wip

* wip

* fix bug with ignore cert validation

* logic to ignore cert validation

* minor

* fix minor debug log issue

* improve log message

* debug message

* fix bug with nullorempty check

* remove debug statements

* refactor parsers

* add debug message

* clean up

* chart updates

* fix formatting issues

* Gangams/arc k8s metrics  (#413)

* cluster identity token

* wip

* fix exception

* fix exceptions

* fix exception

* fix bug

* fix bug

* minor update

* refactor the code

* more refactoring

* fix bug

* typo fix

* fix typo

* wait for 1min after token renewal request

* add proxy support for arc k8s mdm endpoint

* avoid additional get call

* minor line ending fix

* wip

* have separate log for arc k8s cluster identity

* fix bug on creating crd resource

* remove update permission since not required

* fixed some bugs

* fix pr feedback

* remove list since its not required

* fix: Reverting back to ltsc2019 tag (#429)

* more kubelet metrics (#430)

* more kubelet metrics

* clean up new config

* fix nom issue when config is empty (#432)

* support multiple docker paths when docker root is updated thru knode (#433)

* Gangams/doc and other related updates (#434)

* bring back nodeslector changes for windows agent ds

* readme updates

* chart updates for azure cluster resourceid and region

* set cluster region during onboarding for managed clusters

* wip

* fix for onboarding script

* add sp support for the login

* update help

* add sp support for powershell

* script updates for sp login

* wip

* wip

* wip

* readme updates

* update the links to use ci_prod branch

* fix links

* fix image link

* some more readme updates

* add missing serviceprincipal in ps scripts (#435)

* fix telemetry bug (#436)

* Gangams/readmeupdates non aks 09162020 (#437)

* changes for ciprod09162020 non-aks release

* fix script to handle cross sub scenario

* fix minor comment

* fix date in version file

* fix pr comments

* Gangams/fix weird conflicts (#439)

* separate build yamls for ci_prod branch (#415) (#416)

* [Merge] dev to prod for ciprod08072020 release (#424)

* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>

* fix quote issue for the region (#441)

* fix cpucapacity/limit bug (#442)

* grwehner/pv-usage-metrics (#431)

- Send persistent volume usage and capacity metrics to LA for PVs with PVCs at the pod level; config to include or exclude kube-system namespace.
- Send PV usage percentage to MDM if over the configurable threshold.
- Add PV usage recommended alert template.
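
The threshold rule in the second bullet can be sketched as follows. This is a minimal Python illustration rather than the agent's plugin code; the function names and the default threshold of 60 percent are assumptions made for the example.

```python
def pv_usage_percent(used_bytes, capacity_bytes):
    """Return PV usage as a percentage, or None when capacity is unknown."""
    if capacity_bytes <= 0:
        return None
    return 100.0 * used_bytes / capacity_bytes


def should_send_to_mdm(used_bytes, capacity_bytes, threshold_percent=60.0):
    """True when usage is over the configurable threshold.

    The default threshold here is hypothetical; the real value comes
    from the agent's configuration.
    """
    percent = pv_usage_percent(used_bytes, capacity_bytes)
    return percent is not None and percent > threshold_percent
```

Returning `None` for an unknown capacity keeps a missing value from being reported as 0% usage.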

* add new custom metric regions (#444)

* add new custom metric regions

* fix commas

* add 'Terminating' state (#443)

* Gangams/sept agent release tasks (#445)

* turnoff mdm nonsupported cluster types

* enable validation of server cert for ai ruby http client

* add kubelet operations total and total error metrics

* node selector label change

* label update

* wip

* wip

* wip

* revert quotes

* grwehner/pv-collect-volume-name (#448)

Collect and send the volume name as another tag for pvUsedBytes in InsightsMetrics, so that it can be displayed in the workload workbook. Does not affect the PV MDM metric.

* Changes for september agent release (#449)

Moving from v1beta1 to v1 for health CRD
Adding timer for zero filling
Adding zero filling for PV metrics

* Gangams/arc k8s related scripts, charts and doc updates (#450)

* checksum annotations

* script update for chart from mcr

* chart updates

* update chart version to match with chart release

* script updates

* latest chart updates

* version updates for chart release

* script updates

* script updates

* doc updates

* doc updates

* update comments

* fix bug in ps script

* fix bug in ps script

* minor update

* release process updates

* use consistent name across scripts

* use consistent names

* Install CA certs from wireserver (#451)

* grwehner/pv-volume-name-in-mdm (#452)

Add volume name for PV to mdm dimensions and zero fill it

* Release changes for 10052020 release (#453)

* Release changes for 10052020 release

* remove redundant kubelet metrics as part of PR feedback

* Update onboarding_instructions.md (#456)

* Update onboarding_instructions.md

Updated the documentation to reflect where to update the config map.

* Update onboarding_instructions.md

* Update onboarding_instructions.md

* Update onboarding_instructions.md

Updated the link

* chart update for sept2020 release (#457)

* add missing version update in the script (#458)

* November release fixes - activate one agent, adx schema v2, win perf issue, syslog deactivation (#459)

* activate one agent, adx schema v2, win perf issue, syslog deactivation

* update chart

* remove hyphen for params in chart (#462)

Merging as it's a simple fix (remove hyphen)

* Changes for cutting a new build for ciprod10272020 release (#460)

* using latest stable version of msys2 (#465)

* fixing the windows-perf-dups (#466)

* chart updates related to new microsoft/charts repo (#467)

* Changes for creating 11092020 release (#468)

* MDM exception aggregation (#470)

* grwehner/mdm custom metric regions (#471)

Remove custom metrics region check for public cloud

* updating rs limit to 1gb (#474)

* grwehner/pv inventory (#455)

Add fluentd plugin to request persistent volume info from the kubernetes api and send to LA

* Gangams/fix for build release pipeline issue (#476)

* use isolated cdpx acr

* correct comment

* add pv fluentd plugin config to helm rs config (#477)

* add pv fluentd plugin to helm rs config

* helm rbac permissions for pv api calls

* Gangams/fix rs ooming (#473)

* optimize kpi

* optimize kube node inventory

* add flags for events, deployments and hpa

* have separate function parseNodeLimits

* refactor code

* fix crash

* fix bug with service name

* fix bugs related to get service name

* update oom fix test agent

* debug logs

* fix service label issue

* update to latest agent and enable ephemeral annotation

* change stream size to 200 from 250

* update yaml

* adjust chunksizes

* add ruby gc env

* yaml changes for cioomtest11282020-3

* telemetry to track pods latency

* service count telemetry

* rename variables

* wip

* nodes inventory telemetry

* configmap changes

* add emit streams in configmap

* yaml updates

* fix copy and paste bug

* add todo comments

* fix node latency telemetry bug

* update yaml with latest test image

* fix bug

* upping rs memory change

* fix mdm bug with final emit stream

* update to latest image

* fix pr feedback

* fix pr feedback

* rename health config to agent config

* fix max allowed hpa chunk size

* update to use 1k pod chunk since validated on 1.18+

* remove debug logs

* minor updates

* move defaults to common place

* chart updates

* final oomfix agent

* update to use prod image so that can be validated with build pipeline

* fix typo in comment

* Gangams/enable arc onboarding to ff (#478)

* wip

* updates

* trigger login if the ctx cloud not same as specified cloud

* add missed commit

* Convert PV type dictionary to json for telemetry so it shows up in logs (#480)

* fix 2 windows tasks - 1) Dont log to termination log 2) enable ADX route for containerlogs in windows (for O365) (#482)

* fix ci envvar collection in large pods (#483)

* grwehner/jan agent tasks (#481)

- Windows agent fix to use log filtering settings in config map.
- Error handling for kubelet_utils get_node_capacity in case /metrics/cadvisor endpoint fails.
- Remove env variable for workspace key for windows agent

* updating fbit version and cpu limit (#485)

* reverting to older version (#487)

* Gangams/add fbsettings configurable via configmap (#486)

* wip

* fbit config settings

* add config warn message

* handle one config provided but not other

* fixed pr feedback

* fix copy paste error

* rename config parameter names

* fix typo

* fix fbit crash in helm path

* fix nil check

* Gangams/jan agent release tasks (#484)

* wip

* explicit amd64 affinity for hybrid workloads

* fix space issue

* wip

* revert vscode setting file

* remove per container logs in ci (#488)

* updates for ciprod01112021 release (#489)

* new yaml files (#491)

* Use cloud-specific instrumentation keys (#494)

If APPLICATIONINSIGHTS_AUTH_URL is set and non-empty, the agent now fetches a custom IKey from the URL stored in that variable.
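
A minimal sketch of that selection logic, written in Python for illustration and not taken from the agent. The function name, the injected `fetch` callable, and the whitespace trimming are all assumptions; only the environment variable name comes from the commit message.

```python
import os


def resolve_instrumentation_key(default_key, fetch, environ=os.environ):
    """Pick the instrumentation key as the commit message describes.

    `fetch` is a hypothetical callable that takes a URL and returns the
    key as text; it is injected so the sketch needs no network access.
    """
    auth_url = environ.get("APPLICATIONINSIGHTS_AUTH_URL", "").strip()
    if not auth_url:
        # Variable unset or empty: keep the built-in key.
        return default_key
    return fetch(auth_url).strip()
```

Error handling for a failed fetch is left out; a real implementation would need to decide whether to fall back to the default key.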

* upgrade apt to latest version (#492)

* upgrade apt to latest version

* fix pr feedback

* Gangams/add support for extension msi for arc k8s cluster (#495)

* wip

* add env var for the arc k8s extension name

* chart update

* extension msi updates

* fix bug

* revert chart and image to prod version

* minor text changes

* image tag to prod

* wip

* wip

* wip

* wip

* final updates

* fix whitespaces

* simplify crd yaml

* Gangams/arm template arc k8s extension (#496)

* arm templates for arc k8s extension

* update to use official extension type name

* update

* add identity property

* add proxyendpointurl parameter

* add default values

* Gangams/aks monitoring via policy (#497)

* enable monitoring through policy

* wip

* handle tags

* wip

* add alias

* wip

* working

* updates

* working

* with deployment name

* doc updates

* doc updates

* fix typo in the docs

* revert to use operatingSystem from osImage for node os telemetry (#498)

* Container log v2 schema changes (#499)

* make pod name in mdsd definition a str for consistency. msgp has no type checking, as it carries type metadata in the message itself.

* Add priority class to the daemonsets (#500)

* Add priority class to the daemonsets

Add a priority class for omsagent and have the daemonsets use this
to be sure to schedule the pods.

Daemonset pods are constrained in scheduling to run on specific
nodes.  This is done by the daemonset controller.  When a node shows
up it will create a pod with a strong affinity to that node.  When a
node goes away, it will delete the pod with the node affinity to that
node.

Kubernetes pod scheduling does not know it is a daemonset but it does
know it is tied to a specific node.  With default scheduling, it is
possible for the pods to be "frozen out" of a node because the node
already is full.  This can happen because "normal" pods may already
exist and are looking for a node to get scheduled on when a node is
added to the cluster.  The daemonset controller only creates the pod
for the node at around the same time.  The kubernetes scheduler runs
async from all of this, so there can be a race as to who gets
scheduled on the node.

The pod priority class (and thus the pod priority) is a way to indicate
that the pod has a higher scheduling priority than a default pod.

By default, all pods are at priority 0.  Higher numbers are higher
priority.  Setting the priority to something greater than zero will
allow the omsagent daemonsets to win a race against "normal" pods for
scheduled resources on a node - and will also allow for graceful
eviction in the case the node is too full.

Without this, omsagent can be left off a node in clusters that are
very busy, especially in dynamic scaling situations.
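
The race described above can be shown with a toy model. This sketch is in Python and is not the Kubernetes scheduler: it places pending pods on a single node, highest priority first. The pod names, the priority value of 1000, and the slot count are made up for the example.

```python
def schedule(pending_pods, free_slots):
    """Toy model: place pending pods on one node, highest priority first.

    `pending_pods` is a list of (name, priority) tuples; `free_slots` is
    the number of pods the node can still hold. Returns the names that fit.
    """
    # sorted() is stable, so pods with equal priority keep arrival order.
    by_priority = sorted(pending_pods, key=lambda pod: pod[1], reverse=True)
    return [name for name, _ in by_priority[:free_slots]]


# With every pod at the default priority 0, the agent pod arrives last
# and is frozen out of a node with only two free slots.
all_default = [("app-1", 0), ("app-2", 0), ("omsagent", 0)]

# With a priority above zero, the agent pod wins a slot.
with_priority = [("app-1", 0), ("app-2", 0), ("omsagent", 1000)]
```

Calling `schedule(all_default, 2)` leaves out the agent pod, while `schedule(with_priority, 2)` places it first.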

I did not test the windows pod as we have no windows clusters.

* CR feedback

* fix node metric issue (#502)

* Bug fixes for Feb release (#504)

* bug fix for mdm metrics with no limits

* fix exception bug

* Gangams/feb 2021 agent bug fix (#505)

* fix npe in getKubeServiceRecords

* use image fields from spec

* fix typo

* cover all cases

* handle scenario only digest specified

* changes for release -ciprod02232021 (#506)

* Gangams/e2e test framework (#503)

* add agent e2e fw and tests

* doc and script updates

* add validation script

* doc updates

* yaml updates

* fix typo

* doc updates

* more doc updates

* add ISTEST for helm chart to use arc conf

* refactor test code

* fix pr feedback

* fix pr feedback

* fix pr feedback

* fix pr feedback

* scrape new kubelet pod count metric name (#508)

* Adding explicit json output to az commands as the script fails if az is configured with Table output #409 (#513)

* Gangams/arc proxy contract and token renewal updates (#511)

* fix issue with crd status updates

* handle renewal token delays

* add proxy contract

* updates for proxy cert for linux

* remove proxycert related changes

* fix whitespace issue

* fix whitespace issue

* remove proxy in arm template

* doc updates for microsoft charts repo release (#512)

* doc updates for microsoft charts repo release

* wip

* Update enable-monitoring.sh (#514)

Lines 314 and 343 seem to have trailing spaces for some subscriptions, which causes the script to exit even in valid scenarios

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Prometheus scraping from sidecar and OSM changes (#515)

* add liveness timeout for exec (#518)

* chart and other updates (#519)

* Saaror osmdoc (#523)

* Create ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Add files via upload

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* telemetry bug fix (#527)

* Fix conflicting logrotate settings (#526)

The node and the omsagent container both have a cron.daily file to rotate certain logs daily. These settings are the same for some files in /var/log (mounted from the node with read/write access), causing the rotation to fail when both try to rotate at the same time. As a result, the /var/log/*.1 file is written to forever. Because these files are always written to and never rotated, they eventually cause high memory usage on the node.

This fix removes the container logrotate settings for /var/log, which the container does not write to.

* bug fix (#528)

* Gangams/arc ev2 deployment (#522)

* ev2 deployment for arc k8s extension

* fix charts path issue

* rename scripts tar

* add notifications

* fix line endings

* fix line endings

* update with prod repo

* fix file endings

* added liveness and telemetry for telegraf (#517)

* added liveness and telemetry for telegraf

* code transfer

* removed windows liveness probe

* done

* Windows metric fix (#530)

* changes

* about to remove container fix

* moved caching code to existing loop

* removed un-necessary changes

* removed a few more un-necessary changes

* added windows node check

* fixed a bug

* everything works confirmed

* OSM doc update (#533)

* Adding MDM metrics for threshold violation (#531)

* Rashmi/april agent 2021 (#538)

* add Read_from_Head config for all fluentbit tail plugins (#539)

See the commit message of: fluent/fluent-bit@70e33fa
for details explaining the fluentbit change and what Read_from_Head does when set to true.

* fix programdata mount issue on containerd win nodes (#542)

* Update sidecar mem limits  (#541)

* David/release 4 22 2021 (#544)

* updating image tag and agent version

* updated liveness probe

* updated release notes again

* fixed date in version file

* 1m, 1m, 1s by default (#543)

* 1m, 1m, 1s by default

* setting default through a different method

* David/aad stage 1 release (#556)

* update to latest omsagent, add eastus2 to mdsd regions

* copied oneagent bits to a CI repository release

* mdsd inmem mode

* yaml for cl scale test

* yaml for cl scale test

* reverting dockerProviderVersion version to 15.0.0

* prepping for release (updated image version, dockerProviderVersion, and release notes)

* container log scaletest yamls

* forgot to update image version in chart

* fixing windows tag in dockerfile, changing release notes wording

* missed windows tag in one more place

* forgot to change the windows dockerProviderVersion back

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Update ReleaseNotes.md (#558)

fix imagetag in the release notes

* Add wait time for telegraf and also force mdm egress to use tls 1.2 (#560)

* Add wait time for telegraf and also force mdm egress to use tls 1.2

* add wait for all telegraf dependencies across all containers (ds & rs)

* remove ssl change so we don't include it as part of the other fix until we test with att nodes.

* partially disabled telegraf liveness probe check, we'll still have telemetry but the probe won't fail if telegraf isn't running (#561)

* changes for 05202021 release (#563)

* changes for 05202021 release

* fixed typos

* Rashmi/jedi wireserver (#566)

* Update ReadMe.md (#565)

* Update ReadMe.md

* Update ReadMe.md

Included feedback from OSM team and Fixed

* Gangams/aad stage2 full switch to mdsd (#559)

* full switch to mdsd, upgrade to ruby v1 & omsagent removal

* add odsdirect as fallback option

* cleanup

* cleanup

* move customRegion to stage3

* updates related to containerlog route

* make xml eventschema consistent

* add buffer settings

* address HTTPServerException deprecation in ruby 2.6

* update to official mdsd version

* fix log message issue

* fix pr feedback

* get rid of unused code from omscommon

* fix pr feedback

* fix pr feedback

* clean up

* clean up

* fix missing conf

* Send perf metrics to MDM from windows daemonset (#568)

* updating json gem to address CVE-2020-10663 (#567)

* updating json gem to address CVE-2020-10663

* updating json gem to address CVE-2020-10663

* update recommended alerts readme (#570)

@dcbrown16 pointed out that this page links to the wrong document in [this issue](#475). The content in the currently linked page is identical to the page which should be linked, so it's a simple fix.

* trying again to fix the json gem (#571)

* trying again to fix the json gem

* removing installation of newer json gem

* Addressing PR comments for - #568 (#569)

* Mem_Buf_limit  is configurable via ConfigMap (#574)

* add log rotation settings for fluentd logs (#577)

* Gangams/release 06112021 (#578)

* updates related to ciprod06112021 release

* minor update

* release note update (#579)

* Make sidecar fluentbit chunk size configurable (#573)

* Fix vulnerabilities (#583)

* test

* test1

* test-2

* test-3

* 3

* 4

* test

* 2

* 3

* 4

* 5

* 6

* rename gem for windows

* fix

* fix

* Windows build optimization (#582)

* fix windows build failure due to msys2 version

* Fix telegraf startup issue when endpoint is unreachable (#587)

* revert fbit tail plugins defaults to std defaults (#586)

* fixed another bug (#593)

* feat: add new metrics to MDM for allocatable % calculation of cpu and memory usage (#584)

* feat: allocatable cpu and memory % metrics for MDM

* maybe

* linux is working

* windows....

* some more

* comment

* better

* syntax

* ruby

* revert omsagent.yaml

* comments

* pr feedback

* pr feedback

* testing msys2 version update

* better

* update adx sdk for perf issue (#601)

* remove md check

* Gangams/release notes update for hotfix (#596)

* release notes updates

* release notes updates for ciprod06112021-1

* Cherry picking hotfix changes to ci_dev (#605)

* release changes (#607)

* Gangams/aad stage3 msi auth (#585)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* refactor the windows agent ingestion token code

* code cleanup

* fix build errors

* code clean up

* code clean up

* code clean up

* code clean up

* more refactoring

* fix bug

* fix bug

* add debug logs

* add nil checks

* revert changes

* revert yaml change since this added in aks side

* fix pr feedback

* fix pr feedback

* refine retry code

* update mdsd env as per official build

* cleanup

* update env vars per mdsd

* update with mdsd official build

* skip cert gen & renewal incase of aad msi auth

* add nil check

* cherry windows agent nodeip issue

* fix merge issue

Co-authored-by: rashmichandrashekar <[email protected]>

* Gangams/remove chart version dependency (#589)

* remove chart version dependency

* remove unused code

* fix resource type

* fix

* handle weird cli chars

* update release process

* Gangams/july 2021 release tasks 3 (#613)

* use artifact and pipeline creds for image push

* minor update

* add vuln fix here so that pr can be merged

* remove un-used output plugin (#614)

* fix telegraf telemetry and improve fluentd liveness (#611)

* fix telegraf telemetry and improve fluentd liveness

* address identified vuln with libsystemd0

* fix exported image file extension

* Gangams/july 2021 release tasks 2 (#612)

* tail rs mdsd err logs

* configure mdsd log rotation

* log rotation for mdsd log files

* Fix out_oms.go dependency vulnerabilities (#623)

* revert libsystemd0 update (#616)

* updates for ci-prod release instructions (#619)

* cherry pick changes from ci_prod (#622)

* Support az login for passwords starting with dash ('-') (#626)

Co-authored-by: Vladimir Babichev <[email protected]>

* Gangams/add telemetry fbit settings (#628)

* add telemetry to track fbit settings

* add telemetry to track fbit settings

* check onboarding status (#629)

* Gangams/arc k8s conformance test updates (#617)

* conf test updates

* clean up

* wip

* update with mcr cidev image

* handle log path

* cleanup

* clean up

* wip

* working

* update for mcr image

* minor

* image update

* handle latency of connected cluster resource creation

* update conftest image

* upgrade golang version for windows in pipeline build and locally (#630)

* Updating a link in Readme.md (#632)

The link to the build pipelines now goes directly to our build pipelines (instead of to all github-private pipelines)

* Updating omsagent yaml to have parity with omsagent yaml file in AKS RP (#615)

* Unit test tooling (#625)

Added tooling and examples for unit tests

* run unit tests after a merge too (#634)

* flag stale PRs & issues

* Adding script to collect logs (for troubleshooting) (#636)

* added script for collecting logs

* added windows daemonset and prometheus sidecar, as well as some explanatory prints

* added kubectl describe and kubectl logs output

* changed message to make it clearer that some errors are expected

* Sarah/ev2 (#640)

* ev2 artifacts for release pipeline

* update parameters reference

* add artifacts tar file

* changes to rollout and service model

* change agentimage path

* adding agentimage to artifact script

* removing charts from tarball

* change script to use blob storage

* change blob variables

* echo variables

* change blob uri

* use release id for blob prefix

* change to delete blob file

* add check for if blob storage file exists

* fix script errors

* update check for file in storage

* change true check

* comments and change storage account info to pipeline variables

* Changes for windows tar file

* PR changes

* documenting fbit tail plugin configmap settings. (#638)

* documenting fbit tail plugin configmap settings.

* Install unzip package on shell extension (#642)

* Changing installation in ev2 script (#644)

* Adjust release pipeline to use cdpx acr (#647)

* Adjust release pipeline to use cdpx acr

* Adjust release pipeline to use cdpx acr

* Update CDPX ACR path

* Add check for cdpx repo variable

* Sarah/ev2 prod (#649)

* Ev2 changes for prod

* CDPX repo naming change (#652)

* Sarah/ev2 update (#654)

* remove acr name from repo path

* add check to make sure tag does not exist in mcr repo

* change tag syntax for mcr repo check (#655)

* Gangams/optimize win livenessprobe (#653)

* livenessprobe optimization

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* Gangams/addon token adapter image tag to telemetry (#656)

* addon token adapter image tag

* addon token adapter image tag

* Sarah/ev2 helm (#658)

* Use MSI for Arc Release

* Use CIPROD_ACR AME subscription for shell extension

* remove extra line endings

* Sarah/ev2 pipeline (#661)

* testing build artifact dir changes

* add .pipelines directory and omsagent.yaml to build artifacts

* add charts directory to build artifacts (#662)

* Sarah/remove cdpx creds (#664)

* don't use cdpx acr creds from kv

* add e2etest.yaml to build output

* keep cdpx creds for now

* chart updates for rbac api version change (#660)

* chart updates for rbac api version change

* include windows ds for arc

* proxy support (for non-aks) (#665)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* initial proxy support

* merge?

* cleaning up some files which should've merged differently

* proxy should be working, but most tables don't have any data. About to merge, maybe whatever was wrong is now fixed

* linux AMA proxy works

* about to merge

* proxy support appears to be working, final mdsd build location will still change

* removing some unnecessary changes

* forgot to remove one last change

* redirected mdsd stderr to stdout instead of stdin

* addressing proxy password location comment

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Gangams/agent release ciprod10082021 & win-ciprod10082021 (#666)

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* use buildcommand for prod pipeline (#668)

* fixed merge issues. (#671) (#672)

* fix merge conflicts

* update with newimage tag

* changes related to mdsd version update (#673) (#674)

* Sarah/enable metrics (#675)

* add user assigned msi to yaml for pipeline

* update placeholders

* Gangams/chart updates oct2021 release (#676)

* chart updates for oct2021 release

* wip

* wip

* wip

* Gangams/msi mode mdsd crash fix (#677)

* update mdsd version which has fix for crash in msi mode

* image tag updates

* update to use extension GA api version (#679)

* Gangams/arm template msi onboarding (#659)

* wip

* wip

* working

* working

* working

* working

* working

* working

* shorten dcr prefix to DCR- to handle default workspace name length

* use MSCI- prefix similar to MSVMI- for dcr

* Gangams/conf test updates to handle sidecar (#681)

* wip

* test updates

* fix pr feedback

* fix pr feedback

* Fix scan break due to latest trivy changes

* Anjohans/configurable database name (#663)

* First cut at an implementation

* Reverting a change

* Moving a few lines to better align with cluster URI config

* Moving a few lines to better align with cluster URI config

* Adding an extra check that won't hurt

* Getting ADX database name from config rather than from secret

* Reverse the mangling done by editor

* Fixes to the code for reading the db name setting

* More fixes to the rb code for settings

* Tweaked and tested

* Code review

* Review follow-up

* Remove whitespace

* Gangams/troubleshooting script for arc k8s (#682)

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* doc updates

* doc updates

* wip

* wip

* update repo for issues

* fix minor one

* Sarah/remove cdpx creds (#685)

* remove download of cdpx creds

* fix: subtract number instead of string + update fluentd version 1.14.2 to fix security vulnerability (#686)

* fix: change default value to a number so that subtraction happens correctly

* update fluentd version to 1.14.2

* extra end statement

* safely set to float

* big decimal precision

* revert omsagent

* keep telemetry

* Faster Linux builds (part 1) (#687)

* moved docker image arg later on to enable docker build caching

* fixing image tag (doh)

* Sarah/fluentbit windows log (#688)

* upgrade fluentbit version for windows

* saving progress--fluent bit log tailing working for windows

* use configmap values for fluent-bit.conf where necessary and make necessary files common

* revert certificategenerator

* remove tomlparser-agent-config from linux folder

* clean up fluent.conf

* clean up fluent-bit.conf

* revert image tag

* fix agent tag

* make fluent bit flush interval configurable

* clean up unnecessary conf files

* remove unnecessary parts of fluent and fluent-bit conf

* log level back to info

* add fbit env variables for omsagent-win

* moving db files to var directory

* default to port 10250 & containerd for linux agent (#699)

* default to port 10250 & containerd

* fix pr feedback

* Updating pod annotation for latest agent version (#697)

* fix windows build failure due to msys2 version (#700)

* fix windows build failure due to msys2 version

* 20211130.0.0

* Jan agent tasks (#698)

* remove v1 fallback hidden option (#705)

* collect telemetry containerlog records with emptystamp (#703)

* collect telemetry containerlog records with emptystamp

* collect telemetry containerlog records with emptystamp

* Fixing telegraf bug for placeholder name (#706)

* Gangams/jan 2022 release tasks 3 (#702)

* add telemetry related to windows containers records

* add telemetry related to windows containers records

* containercount telemetry

* add explicit exit code in ps scripts

* node count telemetry

* telemetry for win cirecord 64KB or more

* metric to track wintelegraf metrics with tags 64kb

* metric to track wintelegraf metrics with tags 64kb

* fix pr feedback

* Gangams/jan 2022 release tasks 2 (#701)

* mdsd proc cpu and memory telemetry

* write ai logs to file and telemetry for mdsd proc

* write ai logs to file and telemetry for mdsd proc

* write ai logs to file and telemetry for mdsd proc

* fix pr feedback

* use name_prefix

* remove mdsd telemetry changes

* remove mdsd telemetry changes

* remove mdsd telemetry changes

* release updates for ciprod01312022 & win-ciprod01312022release (#707)

* release updates for ciprod01312022 release

* release updates for ciprod01312022 release

* fix pr feedback

* fix logger exception (#709)

* Gangams/chart version update for jan release (#710)

* chart updates for jan2022 release

* add missing agentversion annotations

* fix agentversion annotation issue in chart (#712)

* adx bug + misc (#714)

* fix golang dependencies

* fix adx bug

* exclude telegraf

* fix space

* include both

* exclude files specifically

* fix build break (#715)

* fix build break

* update all places

* Explicitly use win-2019 to unblock windows PRs builds

* Fixing telegraf vulnerability (#716)

* cherry picked changes from 03112022 release (#719)

* cherry picked changes from 03112022 release

* Gangams/http proxy support (#717)

* add proxy cert support

* add proxy cert support

* add proxy cert support

* add proxy cert support

* remove arbitrary username and pwd requirement

* remove arbitrary username and pwd requirement

* add proxy support for mdm

* mdsd dev build

* proxy changes

* fix typo

* mdsd dev build

* add libcurl specific things

* working mdsd proxy build

* mdsd official master build

* handle proxy endpoint which endswith /

* latest official mdsd build

* add telemetry to track proxy ca cert

* build multi-arch images (#704)

* build multi-arch linux images
* new pipelines to build multi-arch images

Co-authored-by: Amol Agrawal <[email protected]>

* add missing artifacts (#720)

* add missing artifacts

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/msi  onboarding arm template updates for AKS (#721)

* msi arm template updates

* handle space in location

* minor fixes (#722)

Co-authored-by: Amol Agrawal <[email protected]>

* specify go patch version (#723)

* specify go minor version

Co-authored-by: Amol Agrawal <[email protected]>

* User/amagraw/ciprod release 20220317 (#724)

* ciprod release march changes

Co-authored-by: Amol Agrawal <[email protected]>

Co-authored-by: Ganga Mahesh Siddem <[email protected]>
Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>
Co-authored-by: bragi92 <[email protected]>
Co-authored-by: saaror <[email protected]>
Co-authored-by: Grace Wehner <[email protected]>
Co-authored-by: deagraw <[email protected]>
Co-authored-by: David Michelman <[email protected]>
Co-authored-by: Michael Sinz <[email protected]>
Co-authored-by: Nicolas Yuen <[email protected]>
Co-authored-by: seenu433 <[email protected]>
Co-authored-by: Tsubasa Nomura <[email protected]>
Co-authored-by: Vladimir <[email protected]>
Co-authored-by: Vladimir Babichev <[email protected]>
Co-authored-by: sarahpeiffer <[email protected]>
Co-authored-by: Anders Johansen <[email protected]>
Co-authored-by: Amol Agrawal <[email protected]>
jatakiajanvi12 added a commit that referenced this pull request Dec 2, 2022
* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

* Gangams/cluster creation scripts (#414)

* onprem k8s script

* script updates

* scripts for creating non-aks clusters

* fix minor text update

* updates

* script updates

* fix

* script updates

* fix scripts to install docker

* fix: Pin to a particular version of ltsc2019 by SHA (#427)

* enable collecting npm metrics (optionally) (#425)

* enable collecting npm metrics (optionally)

* fix default enrichment value

* fix adx

* Saaror patch 3 (#426)

* Create README.MD

Creating content for Kubecon lab

* Update README.MD

* Update README.MD

* Gangams/add containerd support to windows agent (#428)

* wip

* wip

* wip

* wip

* bug fix related to uri

* wip

* wip

* fix bug with ignore cert validation

* logic to ignore cert validation

* minor

* fix minor debug log issue

* improve log message

* debug message

* fix bug with nullorempty check

* remove debug statements

* refactor parsers

* add debug message

* clean up

* chart updates

* fix formatting issues

* Gangams/arc k8s metrics  (#413)

* cluster identity token

* wip

* fix exception

* fix exceptions

* fix exception

* fix bug

* fix bug

* minor update

* refactor the code

* more refactoring

* fix bug

* typo fix

* fix typo

* wait for 1min after token renewal request

* add proxy support for arc k8s mdm endpoint

* avoid additional get call

* minor line ending fix

* wip

* have separate log for arc k8s cluster identity

* fix bug on creating crd resource

* remove update permission since not required

* fixed some bugs

* fix pr feedback

* remove list since its not required

* fix: Reverting back to ltsc2019 tag (#429)

* more kubelet metrics (#430)

* more kubelet metrics

* clean up new config

* fix nom issue when config is empty (#432)

* support multiple docker paths when docker root is updated thru knode (#433)

* Gangams/doc and other related updates (#434)

* bring back nodeslector changes for windows agent ds

* readme updates

* chart updates for azure cluster resourceid and region

* set cluster region during onboarding for managed clusters

* wip

* fix for onboarding script

* add sp support for the login

* update help

* add sp support for powershell

* script updates for sp login

* wip

* wip

* wip

* readme updates

* update the links to use ci_prod branch

* fix links

* fix image link

* some more readme updates

* add missing serviceprincipal in ps scripts (#435)

* fix telemetry bug (#436)

* Gangams/readmeupdates non aks 09162020 (#437)

* changes for ciprod09162020 non-aks release

* fix script to handle cross sub scenario

* fix minor comment

* fix date in version file

* fix pr comments

* Gangams/fix weird conflicts (#439)

* separate build yamls for ci_prod branch (#415) (#416)

* [Merge] dev to prod for ciprod08072020 release (#424)

* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>

* fix quote issue for the region (#441)

* fix cpucapacity/limit bug (#442)

* grwehner/pv-usage-metrics (#431)

- Send persistent volume usage and capacity metrics to LA for PVs with PVCs at the pod level; config to include or exclude kube-system namespace.
- Send PV usage percentage to MDM if over the configurable threshold.
- Add PV usage recommended alert template.
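
The threshold behaviour listed above can be sketched roughly as follows. This is a minimal Python illustration, not the agent's actual plugin code: the function names, the record fields (`name`, `namespace`, `used_bytes`, `capacity_bytes`), the 60% default, and the strict "greater than" comparison are all assumptions made for the example.

```python
from typing import Dict, List, Optional


def pv_usage_percentage(used_bytes: int, capacity_bytes: int) -> Optional[float]:
    """Return PV usage as a percentage, or None when capacity is unknown or zero."""
    if capacity_bytes <= 0:
        return None
    return 100.0 * used_bytes / capacity_bytes


def pvs_over_threshold(records: List[Dict],
                       threshold_pct: float = 60.0,      # hypothetical default
                       exclude_kube_system: bool = True  # hypothetical config flag
                       ) -> List[Dict]:
    """Keep only the PV records whose usage percentage exceeds the threshold."""
    selected = []
    for record in records:
        if exclude_kube_system and record["namespace"] == "kube-system":
            continue
        pct = pv_usage_percentage(record["used_bytes"], record["capacity_bytes"])
        if pct is not None and pct > threshold_pct:
            # Attach the computed percentage without mutating the input record.
            selected.append({**record, "usage_pct": pct})
    return selected
```

Records that pass the filter are the ones that would be forwarded as the percentage metric; everything else is dropped before emission.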

* add new custom metric regions (#444)

* add new custom metric regions

* fix commas

* add 'Terminating' state (#443)

* Gangams/sept agent release tasks (#445)

* turnoff mdm nonsupported cluster types

* enable validation of server cert for ai ruby http client

* add kubelet operations total and total error metrics

* node selector label change

* label update

* wip

* wip

* wip

* revert quotes

* grwehner/pv-collect-volume-name (#448)

Collect and send the volume name as another tag for pvUsedBytes in InsightsMetrics, so that it can be displayed in the workload workbook. Does not affect the PV MDM metric

* Changes for september agent release (#449)

Moving from v1beta1 to v1 for health CRD
Adding timer for zero filling
Adding zero filling for PV metrics

* Gangams/arc k8s related scripts, charts and doc updates (#450)

* checksum annotations

* script update for chart from mcr

* chart updates

* update chart version to match with chart release

* script updates

* latest chart updates

* version updates for chart release

* script updates

* script updates

* doc updates

* doc updates

* update comments

* fix bug in ps script

* fix bug in ps script

* minor update

* release process updates

* use consistent name across scripts

* use consistent names

* Install CA certs from wireserver (#451)

* grwehner/pv-volume-name-in-mdm (#452)

Add volume name for PV to mdm dimensions and zero fill it

* Release changes for 10052020 release (#453)

* Release changes for 10052020 release

* remove redundant kubelet metrics as part of PR feedback

* Update onboarding_instructions.md (#456)

* Update onboarding_instructions.md

Updated the documentation to reflect where to update the config map.

* Update onboarding_instructions.md

* Update onboarding_instructions.md

* Update onboarding_instructions.md

Updated the link

* chart update for sept2020 release (#457)

* add missing version update in the script (#458)

* November release fixes - activate one agent, adx schema v2, win perf issue, syslog deactivation (#459)

* activate one agent, adx schema v2, win perf issue, syslog deactivation

* update chart

* remove hyphen for params in chart (#462)

Merging as it's a simple fix (remove hyphen)

* Changes for cutting a new build for ciprod10272020 release (#460)

* using latest stable version of msys2 (#465)

* fixing the windows-perf-dups (#466)

* chart updates related to new microsoft/charts repo (#467)

* Changes for creating 11092020 release (#468)

* MDM exception aggregation (#470)

* grwehner/mdm custom metric regions (#471)

Remove custom metrics region check for public cloud

* updating rs limit to 1gb (#474)

* grwehner/pv inventory (#455)

Add fluentd plugin to request persistent volume info from the kubernetes api and send to LA

* Gangams/fix for build release pipeline issue (#476)

* use isolated cdpx acr

* correct comment

* add pv fluentd plugin config to helm rs config (#477)

* add pv fluentd plugin to helm rs config

* helm rbac permissions for pv api calls

* Gangams/fix rs ooming (#473)

* optimize kpi

* optimize kube node inventory

* add flags for events, deployments and hpa

* have separate function parseNodeLimits

* refactor code

* fix crash

* fix bug with service name

* fix bugs related to get service name

* update oom fix test agent

* debug logs

* fix service label issue

* update to latest agent and enable ephemeral annotation

* change stream size to 200 from 250

* update yaml

* adjust chunksizes

* add ruby gc env

* yaml changes for cioomtest11282020-3

* telemetry to track pods latency

* service count telemetry

* rename variables

* wip

* nodes inventory telemetry

* configmap changes

* add emit streams in configmap

* yaml updates

* fix copy and paste bug

* add todo comments

* fix node latency telemetry bug

* update yaml with latest test image

* fix bug

* upping rs memory change

* fix mdm bug with final emit stream

* update to latest image

* fix pr feedback

* fix pr feedback

* rename health config to agent config

* fix max allowed hpa chunk size

* update to use 1k pod chunk since validated on 1.18+

* remove debug logs

* minor updates

* move defaults to common place

* chart updates

* final oomfix agent

* update to use prod image so that can be validated with build pipeline

* fix typo in comment

* Gangams/enable arc onboarding to ff (#478)

* wip

* updates

* trigger login if the ctx cloud not same as specified cloud

* add missed commit

* Convert PV type dictionary to json for telemetry so it shows up in logs (#480)

* fix 2 windows tasks - 1) Dont log to termination log 2) enable ADX route for containerlogs in windows (for O365) (#482)

* fix ci envvar collection in large pods (#483)

* grwehner/jan agent tasks (#481)

- Windows agent fix to use log filtering settings in config map.
- Error handling for kubelet_utils get_node_capacity in case the /metrics/cadvisor endpoint fails.
- Remove env variable for workspace key for windows agent

* updating fbit version and cpu limit (#485)

* reverting to older version (#487)

* Gangams/add fbsettings configurable via configmap (#486)

* wip

* fbit config settings

* add config warn message

* handle one config provided but not other

* fixed pr feedback

* fix copy paste error

* rename config parameter names

* fix typo

* fix fbit crash in helm path

* fix nil check

* Gangams/jan agent release tasks (#484)

* wip

* explicit amd64 affinity for hybrid workloads

* fix space issue

* wip

* revert vscode setting file

* remove per container logs in ci (#488)

* updates for ciprod01112021 release (#489)

* new yaml files (#491)

* Use cloud-specific instrumentation keys (#494)

If APPLICATIONINSIGHTS_AUTH_URL is set/non-empty then the agent will now grab a custom IKey from a URL stored in APPLICATIONINSIGHTS_AUTH_URL
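
The selection logic described above could look roughly like this. It is a hedged Python sketch rather than the agent's real implementation: only the `APPLICATIONINSIGHTS_AUTH_URL` variable name comes from the commit message, while the function names, the injectable `fetch` callback, and the assumption that the URL returns the key as plain text are hypothetical.

```python
import os
import urllib.request
from typing import Callable, Mapping


def _http_get_text(url: str) -> str:
    """Fetch a URL and return its body as stripped text (assumed response format)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8").strip()


def resolve_instrumentation_key(default_ikey: str,
                                env: Mapping[str, str] = os.environ,
                                fetch: Callable[[str], str] = _http_get_text) -> str:
    """Use the default key unless APPLICATIONINSIGHTS_AUTH_URL is set and non-empty."""
    auth_url = env.get("APPLICATIONINSIGHTS_AUTH_URL", "").strip()
    if not auth_url:
        return default_ikey
    # Cloud-specific key: retrieved from the URL stored in the environment variable.
    return fetch(auth_url)
```

Passing `env` and `fetch` as parameters keeps the sketch testable without touching the real environment or the network.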

* upgrade apt to latest version (#492)

* upgrade apt to latest version

* fix pr feedback

* Gangams/add support for extension msi for arc k8s cluster (#495)

* wip

* add env var for the arc k8s extension name

* chart update

* extension msi updates

* fix bug

* revert chart and image to prod version

* minor text changes

* image tag to prod

* wip

* wip

* wip

* wip

* final updates

* fix whitespaces

* simplify crd yaml

* Gangams/arm template arc k8s extension (#496)

* arm templates for arc k8s extension

* update to use official extension type name

* update

* add identity property

* add proxyendpointurl parameter

* add default values

* Gangams/aks monitoring via policy (#497)

* enable monitoring through policy

* wip

* handle tags

* wip

* add alias

* wip

* working

* updates

* working

* with deployment name

* doc updates

* doc updates

* fix typo in the docs

* revert to use operatingSystem from osImage for node os telemetry (#498)

* Container log v2 schema changes (#499)

* make pod name in mdsd definition a str for consistency. msgp has no type checking, as it carries type metadata in the message itself.

* Add priority class to the daemonsets (#500)

* Add priority class to the daemonsets

Add a priority class for omsagent and have the daemonsets use this
to be sure to schedule the pods.

Daemonset pods are constrained in scheduling to run on specific
nodes.  This is done by the daemonset controller.  When a node shows
up it will create a pod with a strong affinity to that node.  When a
node goes away, it will delete the pod with the node affinity to that
node.

Kubernetes pod scheduling does not know it is a daemonset but it does
know it is tied to a specific node.  With default scheduling, it is
possible for the pods to be "frozen out" of a node because the node
already is full.  This can happen because "normal" pods may already
exist and are looking for a node to get scheduled on when a node is
added to the cluster.  The daemonset controller will only first
create the pod for the node at around the same time.  The kubernetes
scheduler is running async from all of this and thus there can be a
race as to who gets scheduled on the node.

The pod priority class (and thus the pod priority) is a way to indicate
that the pod has a higher scheduling priority than a default pod.

By default, all pods are at priority 0.  Higher numbers are higher
priority.  Setting the priority to something greater than zero will
allow the omsagent daemonsets to win a race against "normal" pods for
scheduled resources on a node - and will also allow for graceful
eviction in the case the node is too full.

Without this, omsagent can be left out of node in clusters that are
very busy, especially in dynamic scaling situations.

I did not test the windows pod as we have no windows clusters.
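
As a rough illustration of the mechanism described above, the sketch below builds a `PriorityClass` manifest and attaches it to a DaemonSet's pod template via `priorityClassName`. It is written in Python purely for illustration; the class name `omsagent-priority` and the value `1000` are hypothetical placeholders, not the values used in this change.

```python
import copy
from typing import Any, Dict


def priority_class_manifest(name: str, value: int) -> Dict[str, Any]:
    """Build a scheduling.k8s.io/v1 PriorityClass; higher values mean higher priority."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,  # other pods keep the default priority of 0
        "description": "Scheduling priority for the monitoring agent daemonsets.",
    }


def with_priority_class(daemonset: Dict[str, Any], name: str) -> Dict[str, Any]:
    """Return a copy of a DaemonSet manifest whose pods reference the priority class."""
    updated = copy.deepcopy(daemonset)
    pod_spec = (updated.setdefault("spec", {})
                       .setdefault("template", {})
                       .setdefault("spec", {}))
    pod_spec["priorityClassName"] = name
    return updated


# Hypothetical name and value, chosen only for the example.
priority_class = priority_class_manifest("omsagent-priority", 1000)
daemonset = {"kind": "DaemonSet", "metadata": {"name": "omsagent"},
             "spec": {"template": {"spec": {"containers": []}}}}
patched = with_priority_class(daemonset, priority_class["metadata"]["name"])
```

Any value greater than zero is enough for the daemonset pods to outrank default-priority pods when both compete for the same node.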

* CR feedback

* fix node metric issue (#502)

* Bug fixes for Feb release (#504)

* bug fix for mdm metrics with no limits

* fix exception bug

* Gangams/feb 2021 agent bug fix (#505)

* fix npe in getKubeServiceRecords

* use image fields from spec

* fix typo

* cover all cases

* handle scenario only digest specified

* changes for release -ciprod02232021 (#506)

* Gangams/e2e test framework (#503)

* add agent e2e fw and tests

* doc and script updates

* add validation script

* doc updates

* yaml updates

* fix typo

* doc updates

* more doc updates

* add ISTEST for helm chart to use arc conf

* refactor test code

* fix pr feedback

* fix pr feedback

* fix pr feedback

* fix pr feedback

* scrape new kubelet pod count metric name (#508)

* Adding explicit json output to az commands as the script fails if az is configured with Table output #409 (#513)

* Gangams/arc proxy contract and token renewal updates (#511)

* fix issue with crd status updates

* handle renewal token delays

* add proxy contract

* updates for proxy cert for linux

* remove proxycert related changes

* fix whitespace issue

* fix whitespace issue

* remove proxy in arm template

* doc updates for microsoft charts repo release (#512)

* doc updates for microsoft charts repo release

* wip

* Update enable-monitoring.sh (#514)

Lines 314 and 343 seem to have trailing spaces for some subscriptions, which causes the script to exit even in valid scenarios.

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Prometheus scraping from sidecar and OSM changes (#515)

* add liveness timeout for exec (#518)

* chart and other updates (#519)

* Saaror osmdoc (#523)

* Create ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Add files via upload

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* telemetry bug fix (#527)

* Fix conflicting logrotate settings (#526)

The node and the omsagent container both have a cron.daily file to rotate certain logs daily. These settings are the same for some files in /var/log (mounted from the node with read/write access), causing the rotation to fail when both try to rotate at the same time. So then the /var/log/*.1 file is written to forever. Since these files are always written to and never rotated, it causes high memory usage on the node after a while.

This fix removes the container logrotate settings for /var/log, which the container does not write to.

* bug fix (#528)

* Gangams/arc ev2 deployment (#522)

* ev2 deployment for arc k8s extension

* fix charts path issue

* rename scripts tar

* add notifications

* fix line endings

* fix line endings

* update with prod repo

* fix file endings

* added liveness and telemetry for telegraf (#517)

* added liveness and telemetry for telegraf

* code transfer

* removed windows liveness probe

* done

* Windows metric fix (#530)

* changes

* about to remove container fix

* moved caching code to existing loop

* removed un-necessary changes

* removed a few more un-necessary changes

* added windows node check

* fixed a bug

* everything works confirmed

* OSM doc update (#533)

* Adding MDM metrics for threshold violation (#531)

* Rashmi/april agent 2021 (#538)

* add Read_from_Head config for all fluentbit tail plugins (#539)

See the commit message of: fluent/fluent-bit@70e33fa
for details explaining the fluentbit change and what Read_from_Head does when set to true.

* fix programdata mount issue on containerd win nodes (#542)

* Update sidecar mem limits  (#541)

* David/release 4 22 2021 (#544)

* updating image tag and agent version

* updated liveness probe

* updated release notes again

* fixed date in version file

* 1m, 1m, 1s by default (#543)

* 1m, 1m, 1s by default

* setting default through a different method

* David/aad stage 1 release (#556)

* update to latest omsagent, add eastus2 to mdsd regions

* copied oneagent bits to a CI repository release

* mdsd inmem mode

* yaml for cl scale test

* yaml for cl scale test

* reverting dockerProviderVersion version to 15.0.0

* prepping for release (updated image version, dockerProviderVersion, and release notes)

* container log scaletest yamls

* forgot to update image version in chart

* fixing windows tag in dockerfile, changing release notes wording

* missed windows tag in one more place

* forgot to change the windows dockerProviderVersion back

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Update ReleaseNotes.md (#558)

fix imagetag in the release notes

* Add wait time for telegraf and also force mdm egress to use tls 1.2 (#560)

* Add wait time for telegraf and also force mdm egress to use tls 1.2

* add wait for all telegraf dependencies across all containers (ds & rs)

* remove ssl change so we don't include it as part of the other fix until we test with att nodes.

* partially disabled telegraf liveness probe check, we'll still have telemetry but the probe won't fail if telegraf isn't running (#561)

* changes for 05202021 release (#563)

* changes for 05202021 release

* fixed typos

* Rashmi/jedi wireserver (#566)

* Update ReadMe.md (#565)

* Update ReadMe.md

* Update ReadMe.md

Included feedback from OSM team and Fixed

* Gangams/aad stage2 full switch to mdsd (#559)

* full switch to mdsd, upgrade to ruby v1 & omsagent removal

* add odsdirect as fallback option

* cleanup

* cleanup

* move customRegion to stage3

* updates related to containerlog route

* make xml eventschema consistent

* add buffer settings

* address HTTPServerException deprecation in ruby 2.6

* update to official mdsd version

* fix log message issue

* fix pr feedback

* get rid of unused code from omscommon

* fix pr feedback

* fix pr feedback

* clean up

* clean up

* fix missing conf

* Send perf metrics to MDM from windows daemonset (#568)

* updating json gem to address CVE-2020-10663 (#567)

* updating json gem to address CVE-2020-10663

* updating json gem to address CVE-2020-10663

* update recommended alerts readme (#570)

@dcbrown16 pointed out that this page links to the wrong document in [this issue](#475). The content in the currently linked page is identical to the page which should be linked, so it's a simple fix.

* trying again to fix the json gem (#571)

* trying again to fix the json gem

* removing installation of newer json gem

* Addressing PR comments for - #568 (#569)

* Mem_Buf_limit  is configurable via ConfigMap (#574)

* add log rotation settings for fluentd logs (#577)

* Gangams/release 06112021 (#578)

* updates related to ciprod06112021 release

* minor update

* release note update (#579)

* Make sidecar fluentbit chunk size configurable (#573)

* Fix vulnerabilities (#583)

* test

* test1

* test-2

* test-3

* 3

* 4

* test

* 2

* 3

* 4

* 5

* 6

* rename gem for windows

* fix

* fix

* Windows build optimization (#582)

* fix windows build failure due to msys2 version

* Fix telegraf startup issue when endpoint is unreachable (#587)

* revert fbit tail plugins defaults to std defaults (#586)

* fixed another bug (#593)

* feat: add new metrics to MDM for allocatable % calculation of cpu and memory usage (#584)

* feat: allocatable cpu and memory % metrics for MDM

* maybe

* linux is working

* windwos....

* some more

* comment

* better

* syntax

* ruby

* revert omsagent.yaml

* comments

* pr feedback

* pr feedback

* testing msys2 version update

* better

* update adx sdk for perf issue (#601)

* remove md check

* Gangams/release notes update for hotfix (#596)

* release notes updates

* release notes updates for ciprod06112021-1

* Cherry picking hotfix changes to ci_dev (#605)

* release changes (#607)

* Gangams/aad stage3 msi auth (#585)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* refactor the windows agent ingestion token code

* code cleanup

* fix build errors

* code clean up

* code clean up

* code clean up

* code clean up

* more refactoring

* fix bug

* fix bug

* add debug logs

* add nil checks

* revert changes

* revert yaml change since this added in aks side

* fix pr feedback

* fix pr feedback

* refine retry code

* update mdsd env as per official build

* cleanup

* update env vars per mdsd

* update with mdsd official build

* skip cert gen & renewal incase of aad msi auth

* add nil check

* cherry windows agent nodeip issue

* fix merge issue

Co-authored-by: rashmichandrashekar <[email protected]>

* Gangams/remove chart version dependency (#589)

* remove chart version dependency

* remove unused code

* fix resource type

* fix

* handle weird cli chars

* update release process

* Gangams/july 2021 release tasks 3 (#613)

* use artifact and pipeline creds for image push

* minor update

* add vuln fix here so that pr can be merged

* remove un-used output plugin (#614)

* fix telegraf telemetry and improve fluentd liveness (#611)

* fix telegraf telemetry and improve fluentd liveness

* address identified vuln with libsystemd0

* fix exported image file extension

* Gangams/july 2021 release tasks 2 (#612)

* tail rs mdsd err logs

* configure mdsd log rotation

* log rotation for mdsd log files

* Fix out_oms.go dependency vulnerabilities (#623)

* revert libsystemd0 update (#616)

* updates for ci-prod release instructions (#619)

* cherry pick changes from ci_prod (#622)

* Support az login for passwords starting with dash ('-') (#626)

Co-authored-by: Vladimir Babichev <[email protected]>

* Gangams/add telemetry fbit settings (#628)

* add telemetry to track fbit settings

* add telemetry to track fbit settings

* check onboarding status (#629)

* Gangams/arc k8s conformance test updates (#617)

* conf test updates

* clean up

* wip

* update with mcr cidev image

* handle log path

* cleanup

* clean up

* wip

* working

* update for mcr image

* minor

* image update

* handle latency of connected cluster resource creation

* update conftest image

* upgrade golang version for windows in pipeline build and locally (#630)

* Updating a link in Readme.md (#632)

The link to the build pipelines now goes directly to our build pipelines (instead of to all github-private pipelines)

* Updating omsagent yaml to have parity with omsagent yaml file in AKS RP (#615)

* Unit test tooling (#625)

Added tooling and examples for unit tests

* run unit tests after a merge too (#634)

* flag stale PRs & issues

* Adding script to collect logs (for troubleshooting) (#636)

* added script for collecting logs

* added windows daemonset and prometheus sidecar, as well as some explanatory prints

* added kubectl describe and kubectl logs output

* changed message to make it clearer that some errors are expected

* Sarah/ev2 (#640)

* ev2 artifacts for release pipeline

* update parameters reference

* add artifacts tar file

* changes to rollout and service model

* change agentimage path

* adding agentimage to artifact script

* removing charts from tarball

* change script to use blob storage

* change blob variables

* echo variables

* change blob uri

* use release id for blob prefix

* change to delete blob file

* add check for if blob storage file exists

* fix script errors

* update check for file in storage

* change true check

* comments and change storage account info to pipeline variables

* Changes for windows tar file

* PR changes

* documenting fbit tail plugin configmap settings. (#638)

* documenting fbit tail plugin configmap settings.

* Install unzip package on shell extension (#642)

* Changing installation in ev2 script (#644)

* Adjust release pipeline to use cdpx acr (#647)

* Adjust release pipeline to use cdpx acr

* Adjust release pipeline to use cdpx acr

* Update CDPX ACR path

* Add check for cdpx repo variable

* Sarah/ev2 prod (#649)

* Ev2 changes for prod

* CDPX repo naming change (#652)

* Sarah/ev2 update (#654)

* remove acr name from repo path

* add check to make sure tag does not exist in mcr repo

* change tag syntax for mcr repo check (#655)

* Gangams/optimize win livenessprobe (#653)

* livenessprobe optimization

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* Gangams/addon token adapter image tag to telemetry (#656)

* addon token adapter image tag

* addon token adapter image tag

* Sarah/ev2 helm (#658)

* Use MSI for Arc Release

* Use CIPROD_ACR AME subscription for shell extension

* remove extra line endings

* Sarah/ev2 pipeline (#661)

* testing build artifact dir changes

* add .pipelines directory and omsagent.yaml to build artifacts

* add charts directory to build artifacts (#662)

* Sarah/remove cdpx creds (#664)

* don't use cdpx acr creds from kv

* add e2etest.yaml to build output

* keep cdpx creds for now

* chart updates for rbac api version change (#660)

* chart updates for rbac api version change

* include windows ds for arc

* proxy support (for non-aks) (#665)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* initial proxy support

* merge?

* cleaning up some files which should've merged differently

* proxy should be working, but most tables don't have any data. About to merge, maybe whatever was wrong is now fixed

* linux AMA proxy works

* about to merge

* proxy support appears to be working, final mdsd build location will still change

* removing some unnecessary changes

* forgot to remove one last change

* redirected mdsd stderr to stdout instead of stdin

* addressing proxy password location comment

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Gangams/agent release ciprod10082021 & win-ciprod10082021 (#666)

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* use buildcommand for prod pipeline (#668)

* fixed merge issues. (#671) (#672)

* fix merge conflicts

* update with newimage tag

* changes related to mdsd version update (#673) (#674)

* Sarah/enable metrics (#675)

* add user assigned msi to yaml for pipeline

* update placeholders

* Gangams/chart updates oct2021 release (#676)

* chart updates for oct2021 release

* wip

* wip

* wip

* Gangams/msi mode mdsd crash fix (#677)

* update mdsd version which has fix for crash in msi mode

* image tag updates

* update to use extension GA api version (#679)

* Gangams/arm template msi onboarding (#659)

* wip

* wip

* working

* working

* working

* working

* working

* working

* shorten dcr prefix to DCR- to handle default workspace name length

* use MSCI- prefix similar to MSVMI- for dcr

* Gangams/conf test updates to handle sidecar (#681)

* wip

* test updates

* fix pr feedback

* fix pr feedback

* Fix scan break due to latest trivy changes

* Anjohans/configurable database name (#663)

* First cut at an implementation

* Reverting a change

* Moving a few lines to better align with cluster URI config

* Moving a few lines to better align with cluster URI config

* Adding an extra check that won't hurt

* Getting ADX database name from config rather than from secret

* Reverse the mangling done by editor

* Fixes to the code for reading the db name setting

* More fixes to the rb code for settings

* Tweaked and tested

* Code review

* Review follow-up

* Remove whitespace

* Gangams/troubelshooting script for arc k8s (#682)

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* doc updates

* doc updates

* wip

* wip

* update repo for issues

* fix minor one

* Sarah/remove cdpx creds (#685)

* remove download of cdpx creds

* fix: subtract number instead of string + update fluentd version 1.14.2 to fix security vulnerability (#686)

* fix: change default value to a number so that subtraction happens correctly

* update fluentd version to 1.14.2

* extra end statement

* safely set to float

* big decimal precision

* revert omsagent

* keep telemetry
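
The fix above (#686) concerns a default value that was a string, so subtracting from it did not behave as arithmetic. The agent code is Ruby; what follows is only a minimal Python sketch of the same idea. `to_float` and `remaining` are hypothetical names for illustration, not functions from this repository.

```python
def to_float(value, default=0.0):
    """Convert a config-style value to float, falling back to a numeric default."""
    try:
        return float(value)
    except (TypeError, ValueError):
        # None, empty strings, or non-numeric text all land here.
        return default


def remaining(limit, used):
    """Subtract two values after coercing both to float."""
    return to_float(limit) - to_float(used)
```

Keeping the default numeric means the subtraction never mixes a string with a number, whichever side is missing.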

* Faster Linux builds (part 1) (#687)

* moved docker image arg later on to enable docker build caching

* fixing image tag (doh)

* Sarah/fluentbit windows log (#688)

* upgrade fluentbit version for windows

* saving progress--fluent bit log tailing working for windows

* use configmap values for fluent-bit.conf where necessary and make necessary files common

* revert certificategenerator

* remove tomlparser-agent-config from linux folder

* clean up fluent.conf

* clean up fluent-bit.conf

* revert image tag

* fix agent tag

* make fluent bit flush interval configurable

* clean up unnecessary conf files

* remove unnecessary parts of fluent and fluent-bit conf

* log level back to info

* add fbit env variables for omsagent-win

* moving db files to var directory

* default to port 10250 & containerd for linux agent (#699)

* default to port 10250 & containerd

* fix pr feedback

* Updating pod annotation for latest agent version (#697)

* fix windows build failure due to msys2 version (#700)

* fix windows build failure due to msys2 version

* 20211130.0.0

* Jan agent tasks (#698)

* remove v1 fallback hidden option (#705)

* collect telemetry containerlog records with emptystamp (#703)

* collect telemetry containerlog records with emptystamp

* collect telemetry containerlog records with emptystamp

* Fixing telegraf bug for placeholder name (#706)

* Gangams/jan 2022 release tasks 3 (#702)

* add telemetry related to windows containers records

* add telemetry related to windows containers records

* containercount telemetry

* add explicit exit code in ps scripts

* node count telemetry

* telemetry for win cirecord 64KB or more

* metric to track wintelegraf metrics with tags 64kb

* metric to track wintelegraf metrics with tags 64kb

* fix pr feedback

* Gangams/jan 2022 release tasks 2 (#701)

* mdsd proc cpu and memory telemetry

* write ai logs to file and telemetry for mdsd proc

* write ai logs to file and telemetry for mdsd proc

* write ai logs to file and telemetry for mdsd proc

* fix pr feedback

* use name_prefix

* remove mdsd telemetry changes

* remove mdsd telemetry changes

* remove mdsd telemetry changes

* release updates for ciprod01312022 & win-ciprod01312022release (#707)

* release updates for ciprod01312022 release

* release updates for ciprod01312022 release

* fix pr feedback

* fix logger exception (#709)

* Gangams/chart version update for jan release (#710)

* chart updates for jan2022 release

* add missing agentversion annotations

* fix agentversion annotation issue in chart (#712)

* adx bug + misc (#714)

* fix golang dependencies

* fix adx bug

* exclude telegraf

* fix space

* include both

* exclude files specifically

* fix build break (#715)

* fix build break

* update all places

* Explicitly use win-2019 to unblock windows PRs builds

* Fixing telegraf vulnerability (#716)

* cherry picked changes from 03112022 release (#719)

* cherry picked changes from 03112022 release

* Gangams/http proxy support (#717)

* add proxy cert support

* add proxy cert support

* add proxy cert support

* add proxy cert support

* remove arbitrary username and pwd requirement

* remove arbitrary username and pwd requirement

* add proxy support for mdm

* mdsd dev build

* proxy changes

* fix typo

* mdsd dev build

* add libcurl specific things

* working mdsd proxy build

* mdsd official master build

* handle proxy endpoint which ends with /

* latest official mdsd build

* add telemetry to track proxy ca cert

* build multi-arch images (#704)

* build multi-arch linux images
* new pipelines to build multi-arch images

Co-authored-by: Amol Agrawal <[email protected]>

* add missing artifacts (#720)

* add missing artifacts

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/msi  onboarding arm template updates for AKS (#721)

* msi arm template updates

* handle space in location

* minor fixes (#722)

Co-authored-by: Amol Agrawal <[email protected]>

* specify go patch version (#723)

* specify go minor version

Co-authored-by: Amol Agrawal <[email protected]>

* User/amagraw/ciprod release 20220317 (#724)

* ciprod release march changes

Co-authored-by: Amol Agrawal <[email protected]>

* Remove health type from DCR onboarding & add private link support for windows agent in msi mode (#727)

* add private link support for windows agent in msi auth

* remove Microsoft-KubeHealth

* add private link support for windows msi

* fix bug

* fix bug

* fix bug

* fix bug

* check platform specific tags (#730) (#731)

* PodReadyPercentage metric bug fix (#734)

* update windows to ruby 2.7 (#732)

Co-authored-by: Amol Agrawal <[email protected]>

* Improve CI/CD for multi-arch (#733)

* selective push + trivy test

* keep size down

* improve CI and PR builds

* improve checks

* remove IMAGE_TAG build_arg from prod pipeline

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/ts updates for msi (#736)

* ts updates for msi based onboarding

* ts updates for msi based onboarding

* fix typo

* fix typo

* improve log message

* Sarah/health deprecation (#735)

Removes all health feature related code

* check platform specific tags (#738)

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/msi test instructions (#739)

* instructions for msi test validation

* readme updates

* readme updates

* readme updates

* readme updates

* Add CI Windows Build to MultiArch Dev pipeline (#740)

* test image in pools

* update dev pipeline - 1

* update dev -1

* fix job names

* correct paths

* test pool name

* update pool name

* updated urls

* speed up installs

* add base build

* fix paths

* do both builds

* fix bug

* add pool for common

* fix bug

* create path

* temp remove metadata windows

* fix bug

* fix docker command

* almost there

* login to acr

* create windows metadata file

* address PR comments I

Co-authored-by: Amol Agrawal <[email protected]>

* Add Windows phase (#741)

* build and release windows for prod

Co-authored-by: Amol Agrawal <[email protected]>

* Sarah/add onboarding templates (#742)

* add onboarding templates for legacy auth

* fix download (#749)

Co-authored-by: Amol Agrawal <[email protected]>

* force run trivy stage (#745)

- scans for HIGH, MEDIUM, CRITICAL CVEs with fixes available in / and /usr/lib
- breaks build if CVEs with existing fixes found
- adds trivyignore to accommodate CVEs which are understood and should not get flagged
- adds CVEs to trivyignore to unblock builds; CVEs will be fixed and removed from trivyignore in later PRs
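
The gating rule described in the list above can be sketched as follows. This is a hedged Python illustration of the decision logic only, not the pipeline's actual implementation and not Trivy's API. The finding keys (`id`, `severity`, `fixed_version`) and the function names are assumptions made for the example.

```python
SCANNED_SEVERITIES = {"HIGH", "MEDIUM", "CRITICAL"}


def blocking_findings(findings, ignored_ids):
    """Return the findings that should fail the build.

    Each finding is a dict with hypothetical keys "id", "severity" and
    "fixed_version" (None or empty when no fix is available yet).
    """
    return [
        f for f in findings
        if f["severity"] in SCANNED_SEVERITIES   # only the scanned severities
        and f.get("fixed_version")               # only CVEs with a fix available
        and f["id"] not in ignored_ids           # skip entries listed as ignored
    ]


def should_break_build(findings, ignored_ids=frozenset()):
    """True when at least one fixable, non-ignored finding remains."""
    return bool(blocking_findings(findings, ignored_ids))
```

Adding an ID to the ignore set unblocks the build without hiding any other finding, which matches the stated plan to remove entries again once the CVEs are fixed.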

Co-authored-by: Amol Agrawal <[email protected]>

* update telegraf to 1.22.2 to fix vulns (#752)

* update telegraf to 1.22.2 to fix vulns

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/arc k8s aad msi auth  (#743)

* arc k8s msi

* wip

* extension identity role

* imds sidecar integration for arc k8s

* imds sidecar integration for arc k8s

* imds endpoint for windows

* imds endpoint for windows

* wip

* fix exception

* rename param name

* arc msi imdsd container changes

* arc msi imdsd container changes

* arc msi imdsd container changes

* arc msi imdsd container changes

* arc msi imdsd container changes

* revert unneeded yaml changes

* revert unneeded yaml changes

* wip

* wip

* working

* working

* working

* add implementation for msi token for windows mdm metrics

* fix comment

* arc k8s msi onboarding templates

* fix template bug

* fix template bug

* fix template bug

* rename flag name

* fix template bug

* make useAADAuth specific to arc k8s

* set k8sport at machine scope for windows

* fix bug

* fix bug

* update rbac for arc k8s imds

* bump chart version for conformance test run

* conf test updates for msi auth

* cli extension whl file

* add containerinsights solution in msi auth mode

* unify tags

* revert test chart and image versions

* remove test whl file and fix conf test

* conf test updates for addon-token-adapter

* remove container insights solution add for msi auth

* add missing arm template param

* Gangams/ws2022 support (#756)

* use hyperv isolation

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* doc and script updates

* add common as dependency for multi-arc job

* merge into single job for perf evaluation

* merge into single job for perf evaluation

* merge into single job for perf evaluation

* separate jobs for ltsc2019 & ltsc2022

* separate jobs for ltsc2019 & ltsc2022

* update dev image docker file & script

* remove unnecessary task

* update prod pipeline yaml for windows multi-arc image

* test yamls for ltsc2019 & ltsc2022

* fix pr checker fail

* fix repoImageWindows path in windows pipeline

* remove passing imagetag for prod

* CA Cert Fix for Mariner Hosts in Air Gap (#751)

* add cifs & fuse file systems to ignore list (#750)

* Data collection script (#759)

* Add files via upload

* Add files via upload

* Delete AKSInsightsLogCollection.sh

* Create README.md

* Add files via upload

* move script to subfolder LogCollection

* Update README.md

* Rename AKSInsightsLogCollection.sh to AgentLogCollection.sh

* Microsoft mandatory file (#763)

Co-authored-by: microsoft-github-policy-service[bot] <77245923+microsoft-github-policy-service[bot]@users.noreply.github.com>

* Adding v2 schema options (#762)

* Adding v2 schema options

Adding commented out section in log collection settings for v2 schema

* adding documentation link

* Agent release for ciprod05192022 and win-ciprod05192022  (#765)

* Making changes for the release ciprod05192022 (except release notes)

* Adding release notes

* Remove unnecessary spaces

* Updating release notes for configmap v2 and disk usage metrics

Co-authored-by: Ganga Mahesh Siddem <[email protected]>
Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>
Co-authored-by: bragi92 <[email protected]>
Co-authored-by: saaror <[email protected]>
Co-authored-by: Grace Wehner <[email protected]>
Co-authored-by: deagraw <[email protected]>
Co-authored-by: David Michelman <[email protected]>
Co-authored-by: Michael Sinz <[email protected]>
Co-authored-by: Nicolas Yuen <[email protected]>
Co-authored-by: seenu433 <[email protected]>
Co-authored-by: Tsubasa Nomura <[email protected]>
Co-authored-by: Vladimir <[email protected]>
Co-authored-by: Vladimir Babichev <[email protected]>
Co-authored-by: sarahpeiffer <[email protected]>
Co-authored-by: Anders Johansen <[email protected]>
Co-authored-by: Amol Agrawal <[email protected]>
Co-authored-by: Amol Agrawal <[email protected]>
Co-authored-by: Nina <[email protected]>
Co-authored-by: microsoft-github-policy-service[bot] <77245923+microsoft-github-policy-service[bot]@users.noreply.github.com>
Co-authored-by: Auston Li <[email protected]>
jatakiajanvi12 added a commit that referenced this pull request Dec 2, 2022
* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

* Gangams/cluster creation scripts (#414)

* onprem k8s script

* script updates

* scripts for creating non-aks clusters

* fix minor text update

* updates

* script updates

* fix

* script updates

* fix scripts to install docker

* fix: Pin to a particular version of ltsc2019 by SHA (#427)

* enable collecting npm metrics (optionally) (#425)

* enable collecting npm metrics (optionally)

* fix default enrichment value

* fix adx

* Saaror patch 3 (#426)

* Create README.MD

Creating content for Kubecon lab

* Update README.MD

* Update README.MD

* Gangams/add containerd support to windows agent (#428)

* wip

* wip

* wip

* wip

* bug fix related to uri

* wip

* wip

* fix bug with ignore cert validation

* logic to ignore cert validation

* minor

* fix minor debug log issue

* improve log message

* debug message

* fix bug with nullorempty check

* remove debug statements

* refactor parsers

* add debug message

* clean up

* chart updates

* fix formatting issues

* Gangams/arc k8s metrics  (#413)

* cluster identity token

* wip

* fix exception

* fix exceptions

* fix exception

* fix bug

* fix bug

* minor update

* refactor the code

* more refactoring

* fix bug

* typo fix

* fix typo

* wait for 1min after token renewal request

* add proxy support for arc k8s mdm endpoint

* avoid additional get call

* minor line ending fix

* wip

* have separate log for arc k8s cluster identity

* fix bug on creating crd resource

* remove update permission since not required

* fixed some bugs

* fix pr feedback

* remove list since it's not required

* fix: Reverting back to ltsc2019 tag (#429)

* more kubelet metrics (#430)

* more kubelet metrics

* clean up new config

* fix nom issue when config is empty (#432)

* support multiple docker paths when docker root is updated thru knode (#433)

* Gangams/doc and other related updates (#434)

* bring back nodeslector changes for windows agent ds

* readme updates

* chart updates for azure cluster resourceid and region

* set cluster region during onboarding for managed clusters

* wip

* fix for onboarding script

* add sp support for the login

* update help

* add sp support for powershell

* script updates for sp login

* wip

* wip

* wip

* readme updates

* update the links to use ci_prod branch

* fix links

* fix image link

* some more readme updates

* add missing serviceprincipal in ps scripts (#435)

* fix telemetry bug (#436)

* Gangams/readmeupdates non aks 09162020 (#437)

* changes for ciprod09162020 non-aks release

* fix script to handle cross sub scenario

* fix minor comment

* fix date in version file

* fix pr comments

* Gangams/fix weird conflicts (#439)

* separate build yamls for ci_prod branch (#415) (#416)

* [Merge] dev to prod for ciprod08072020 release (#424)

* separate build yamls for ci_prod branch (#415)

* re-enable adx path (#420)

* Gangams/release changes (#419)

* updates related to release

* updates related to release

* fix the incorrect version

* fix pr feedback

* fix some typos in the release notes

* fix for zero filled metrics (#423)

* consolidate windows agent image docker files (#422)

* consolidate windows agent image docker files

* revert docker file consolidation

* revert readme updates

* merge back windows dockerfiles

* image tag update

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>

Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>

* fix quote issue for the region (#441)

* fix cpucapacity/limit bug (#442)

* grwehner/pv-usage-metrics (#431)

- Send persistent volume usage and capacity metrics to LA for PVs with PVCs at the pod level; config to include or exclude kube-system namespace.
- Send PV usage percentage to MDM if over the configurable threshold.
- Add PV usage recommended alert template.
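
As a rough illustration of the threshold rule in the list above, the sketch below computes a usage percentage and decides whether a record would be sent. It is an assumption-laden Python model, not the plugin code. The function names, the `include_kube_system` flag name and the strict "greater than" comparison are all chosen for the example.

```python
def pv_usage_percent(used_bytes, capacity_bytes):
    """Usage as a percentage of capacity, or None when capacity is unknown."""
    if capacity_bytes <= 0:
        return None
    return used_bytes / capacity_bytes * 100.0


def should_send_usage_percent(namespace, used_bytes, capacity_bytes,
                              threshold_percent, include_kube_system=False):
    """Decide whether a PV usage percentage is over the configured threshold."""
    if namespace == "kube-system" and not include_kube_system:
        return False  # namespace excluded by configuration
    percent = pv_usage_percent(used_bytes, capacity_bytes)
    return percent is not None and percent > threshold_percent
```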

* add new custom metric regions (#444)

* add new custom metric regions

* fix commas

* add 'Terminating' state (#443)

* Gangams/sept agent release tasks (#445)

* turnoff mdm nonsupported cluster types

* enable validation of server cert for ai ruby http client

* add kubelet operations total and total error metrics

* node selector label change

* label update

* wip

* wip

* wip

* revert quotes

* grwehner/pv-collect-volume-name (#448)

Collect and send the volume name as another tag for pvUsedBytes in InsightsMetrics, so that it can be displayed in the workload workbook. Does not affect the PV MDM metric

* Changes for september agent release (#449)

Moving from v1beta1 to v1 for health CRD
Adding timer for zero filling
Adding zero filling for PV metrics

* Gangams/arc k8s related scripts, charts and doc updates (#450)

* checksum annotations

* script update for chart from mcr

* chart updates

* update chart version to match with chart release

* script updates

* latest chart updates

* version updates for chart release

* script updates

* script updates

* doc updates

* doc updates

* update comments

* fix bug in ps script

* fix bug in ps script

* minor update

* release process updates

* use consistent name across scripts

* use consistent names

* Install CA certs from wireserver (#451)

* grwehner/pv-volume-name-in-mdm (#452)

Add volume name for PV to mdm dimensions and zero fill it

* Release changes for 10052020 release (#453)

* Release changes for 10052020 release

* remove redundant kubelet metrics as part of PR feedback

* Update onboarding_instructions.md (#456)

* Update onboarding_instructions.md

Updated the documentation to reflect where to update the config map.

* Update onboarding_instructions.md

* Update onboarding_instructions.md

* Update onboarding_instructions.md

Updated the link

* chart update for sept2020 release (#457)

* add missing version update in the script (#458)

* November release fixes - activate one agent, adx schema v2, win perf issue, syslog deactivation (#459)

* activate one agent, adx schema v2, win perf issue, syslog deactivation

* update chart

* remove hyphen for params in chart (#462)

Merging as it's a simple fix (remove hyphen)

* Changes for cutting a new build for ciprod10272020 release (#460)

* using latest stable version of msys2 (#465)

* fixing the windows-perf-dups (#466)

* chart updates related to new microsoft/charts repo (#467)

* Changes for creating 11092020 release (#468)

* MDM exception aggregation (#470)

* grwehner/mdm custom metric regions (#471)

Remove custom metrics region check for public cloud

* updating rs limit to 1gb (#474)

* grwehner/pv inventory (#455)

Add fluentd plugin to request persistent volume info from the kubernetes api and send to LA

* Gangams/fix for build release pipeline issue (#476)

* use isolated cdpx acr

* correct comment

* add pv fluentd plugin config to helm rs config (#477)

* add pv fluentd plugin to helm rs config

* helm rbac permissions for pv api calls

* Gangams/fix rs ooming (#473)

* optimize kpi

* optimize kube node inventory

* add flags for events, deployments and hpa

* have separate function parseNodeLimits

* refactor code

* fix crash

* fix bug with service name

* fix bugs related to get service name

* update oom fix test agent

* debug logs

* fix service label issue

* update to latest agent and enable ephemeral annotation

* change stream size to 200 from 250

* update yaml

* adjust chunksizes

* add ruby gc env

* yaml changes for cioomtest11282020-3

* telemetry to track pods latency

* service count telemetry

* rename variables

* wip

* nodes inventory telemetry

* configmap changes

* add emit streams in configmap

* yaml updates

* fix copy and paste bug

* add todo comments

* fix node latency telemetry bug

* update yaml with latest test image

* fix bug

* upping rs memory change

* fix mdm bug with final emit stream

* update to latest image

* fix pr feedback

* fix pr feedback

* rename health config to agent config

* fix max allowed hpa chunk size

* update to use 1k pod chunk since validated on 1.18+

* remove debug logs

* minor updates

* move defaults to common place

* chart updates

* final oomfix agent

* update to use prod image so that can be validated with build pipeline

* fix typo in comment

* Gangams/enable arc onboarding to ff (#478)

* wip

* updates

* trigger login if the ctx cloud not same as specified cloud

* add missed commit

* Convert PV type dictionary to json for telemetry so it shows up in logs (#480)

* fix 2 windows tasks - 1) Don't log to termination log 2) enable ADX route for containerlogs in windows (for O365) (#482)

* fix ci envvar collection in large pods (#483)

* grwehner/jan agent tasks (#481)

- Windows agent fix to use log filtering settings in config map.
- Error handling for kubelet_utils get_node_capacity in case /metrics/cadvisor endpoint fails.
- Remove env variable for workspace key for windows agent

* updating fbit version and cpu limit (#485)

* reverting to older version (#487)

* Gangams/add fbsettings configurable via configmap (#486)

* wip

* fbit config settings

* add config warn message

* handle one config provided but not other

* fixed pr feedback

* fix copy paste error

* rename config parameter names

* fix typo

* fix fbit crash in helm path

* fix nil check

* Gangams/jan agent release tasks (#484)

* wip

* explicit amd64 affinity for hybrid workloads

* fix space issue

* wip

* revert vscode setting file

* remove per container logs in ci (#488)

* updates for ciprod01112021 release (#489)

* new yaml files (#491)

* Use cloud-specific instrumentation keys (#494)

If APPLICATIONINSIGHTS_AUTH_URL is set/non-empty then the agent will now grab a custom IKey from a URL stored in APPLICATIONINSIGHTS_AUTH_URL
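
The selection rule above can be sketched like this. It is a minimal Python illustration, not the agent's code. `resolve_ikey` and `fetch_text` are hypothetical names, and treating the response body as a plain-text key is an assumption, since the source does not describe the response format.

```python
import os
from urllib.request import urlopen


def fetch_text(url, timeout=10):
    """Download a URL and return its body as stripped text (assumed format)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def resolve_ikey(default_ikey, env=os.environ, fetch=fetch_text):
    """Pick the instrumentation key: custom one if the auth URL is set."""
    auth_url = env.get("APPLICATIONINSIGHTS_AUTH_URL", "")
    if auth_url:
        # Variable is set and non-empty: use the key served at that URL.
        return fetch(auth_url)
    return default_ikey
```

Passing `env` and `fetch` as parameters keeps the sketch testable without network access.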

* upgrade apt to latest version (#492)

* upgrade apt to latest version

* fix pr feedback

* Gangams/add support for extension msi for arc k8s cluster (#495)

* wip

* add env var for the arc k8s extension name

* chart update

* extension msi updates

* fix bug

* revert chart and image to prod version

* minor text changes

* image tag to prod

* wip

* wip

* wip

* wip

* final updates

* fix whitespaces

* simplify crd yaml

* Gangams/arm template arc k8s extension (#496)

* arm templates for arc k8s extension

* update to use official extension type name

* update

* add identity property

* add proxyendpointurl parameter

* add default values

* Gangams/aks monitoring via policy (#497)

* enable monitoring through policy

* wip

* handle tags

* wip

* add alias

* wip

* working

* updates

* working

* with deployment name

* doc updates

* doc updates

* fix typo in the docs

* revert to use operatingSystem from osImage for node os telemetry (#498)

* Container log v2 schema changes (#499)

* make pod name in mdsd definition as str for consistency. msgp has no type checking, as it carries type metadata in the message itself.

* Add priority class to the daemonsets (#500)

* Add priority class to the daemonsets

Add a priority class for omsagent and have the daemonsets use this
to be sure to schedule the pods.

Daemonset pods are constrained in scheduling to run on specific
nodes.  This is done by the daemonset controller.  When a node shows
up it will create a pod with a strong affinity to that node.  When a
node goes away, it will delete the pod with the node affinity to that
node.

Kubernetes pod scheduling does not know it is a daemonset but it does
know it is tied to a specific node.  With default scheduling, it is
possible for the pods to be "frozen out" of a node because the node
already is full.  This can happen because "normal" pods may already
exist and are looking for a node to get scheduled on when a node is
added to the cluster.  The daemonset controller only creates the
pod for the new node at around that same time.  The kubernetes
scheduler is running async from all of this and thus there can be a
race as to who gets scheduled on the node.

The pod priority class (and thus the pod priority) is a way to indicate
that the pod has a higher scheduling priority than a default pod.

By default, all pods are at priority 0.  Higher numbers are higher
priority.  Setting the priority to something greater than zero will
allow the omsagent daemonsets to win a race against "normal" pods for
scheduled resources on a node - and will also allow for graceful
eviction in the case the node is too full.

Without this, omsagent can be left off a node in clusters that are
very busy, especially in dynamic scaling situations.

I did not test the windows pod as we have no windows clusters.
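
The race described above can be illustrated with a toy model. This Python sketch is not the Kubernetes scheduler and ignores preemption and eviction. It only shows how ordering pending pods by priority changes which pods fit on a full node. The `Pod` fields, the abstract CPU units and the priority value 1000 are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    priority: int = 0      # default pods sit at priority 0
    cpu_request: int = 1   # abstract resource units


def schedule(pending, node_capacity):
    """Place pending pods on one node, highest priority first."""
    placed = []
    free = node_capacity
    # sorted() is stable, so pods of equal priority keep their arrival order.
    for pod in sorted(pending, key=lambda p: p.priority, reverse=True):
        if pod.cpu_request <= free:
            placed.append(pod.name)
            free -= pod.cpu_request
    return placed


# Two "normal" pods arrive before the daemonset pod on a node with 4 units.
apps = [Pod("app-1", cpu_request=2), Pod("app-2", cpu_request=2)]
without_priority = schedule(apps + [Pod("omsagent")], node_capacity=4)
with_priority = schedule(apps + [Pod("omsagent", priority=1000)], node_capacity=4)
```

In this model the agent pod is frozen out at priority 0 and is placed first once its priority is above zero.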

* CR feedback

* fix node metric issue (#502)

* Bug fixes for Feb release (#504)

* bug fix for mdm metrics with no limits

* fix exception bug

* Gangams/feb 2021 agent bug fix (#505)

* fix npe in getKubeServiceRecords

* use image fields from spec

* fix typo

* cover all cases

* handle scenario only digest specified

* changes for release -ciprod02232021 (#506)

* Gangams/e2e test framework (#503)

* add agent e2e fw and tests

* doc and script updates

* add validation script

* doc updates

* yaml updates

* fix typo

* doc updates

* more doc updates

* add ISTEST for helm chart to use arc conf

* refactor test code

* fix pr feedback

* fix pr feedback

* fix pr feedback

* fix pr feedback

* scrape new kubelet pod count metric name (#508)

* Adding explicit json output to az commands as the script fails if az is configured with Table output #409 (#513)

* Gangams/arc proxy contract and token renewal updates (#511)

* fix issue with crd status updates

* handle renewal token delays

* add proxy contract

* updates for proxy cert for linux

* remove proxycert related changes

* fix whitespace issue

* fix whitespace issue

* remove proxy in arm template

* doc updates for microsoft charts repo release (#512)

* doc updates for microsoft charts repo release

* wip

* Update enable-monitoring.sh (#514)

Lines 314 and 343 seem to have trailing spaces for some subscriptions, which causes the script to exit even in valid scenarios

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Prometheus scraping from sidecar and OSM changes (#515)

* add liveness timeout for exec (#518)

* chart and other updates (#519)

* Saaror osmdoc (#523)

* Create ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Add files via upload

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* Update ReadMe.md

* telemetry bug fix (#527)

* Fix conflicting logrotate settings (#526)

The node and the omsagent container both have a cron.daily file to rotate certain logs daily. These settings are the same for some files in /var/log (mounted from the node with read/write access), causing the rotation to fail when both try to rotate at the same time. So then the /var/log/*.1 file is written to forever. Since these files are always written to and never rotated, it causes high memory usage on the node after a while.

This fix removes the container logrotate settings for /var/log, which the container does not write to.

* bug fix (#528)

* Gangams/arc ev2 deployment (#522)

* ev2 deployment for arc k8s extension

* fix charts path issue

* rename scripts tar

* add notifications

* fix line endings

* fix line endings

* update with prod repo

* fix file endings

* added liveness and telemetry for telegraf (#517)

* added liveness and telemetry for telegraf

* code transfer

* removed windows liveness probe

* done

* Windows metric fix (#530)

* changes

* about to remove container fix

* moved caching code to existing loop

* removed un-necessary changes

* removed a few more un-necessary changes

* added windows node check

* fixed a bug

* everything works confirmed

* OSM doc update (#533)

* Adding MDM metrics for threshold violation (#531)

* Rashmi/april agent 2021 (#538)

* add Read_from_Head config for all fluentbit tail plugins (#539)

See the commit message of: fluent/fluent-bit@70e33fa
for details explaining the fluentbit change and what Read_from_Head does when set to true.

* fix programdata mount issue on containerd win nodes (#542)

* Update sidecar mem limits  (#541)

* David/release 4 22 2021 (#544)

* updating image tag and agent version

* updated liveness probe

* updated release notes again

* fixed date in version file

* 1m, 1m, 1s by default (#543)

* 1m, 1m, 1s by default

* setting default through a different method

* David/aad stage 1 release (#556)

* update to latest omsagent, add eastus2 to mdsd regions

* copied oneagent bits to a CI repository release

* mdsd inmem mode

* yaml for cl scale test

* yaml for cl scale test

* reverting dockerProviderVersion version to 15.0.0

* prepping for release (updated image version, dockerProviderVersion, and release notes)

* container log scaletest yamls

* forgot to update image version in chart

* fixing windows tag in dockerfile, changing release notes wording

* missed windows tag in one more place

* forgot to change the windows dockerProviderVersion back

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Update ReleaseNotes.md (#558)

fix imagetag in the release notes

* Add wait time for telegraf and also force mdm egress to use tls 1.2 (#560)

* Add wait time for telegraf and also force mdm egress to use tls 1.2

* add wait for all telegraf dependencies across all containers (ds & rs)

* remove ssl change so we don't include it as part of the other fix until we test with att nodes.

* partially disabled telegraf liveness probe check, we'll still have telemetry but the probe won't fail if telegraf isn't running (#561)

* changes for 05202021 release (#563)

* changes for 05202021 release

* fixed typos

* Rashmi/jedi wireserver (#566)

* Update ReadMe.md (#565)

* Update ReadMe.md

* Update ReadMe.md

Included feedback from OSM team and Fixed

* Gangams/aad stage2 full switch to mdsd (#559)

* full switch to mdsd, upgrade to ruby v1 & omsagent removal

* add odsdirect as fallback option

* cleanup

* cleanup

* move customRegion to stage3

* updates related to containerlog route

* make xml eventschema consistent

* add buffer settings

* address HTTPServerException deprecation in ruby 2.6

* update to official mdsd version

* fix log message issue

* fix pr feedback

* get rid of unused code from omscommon

* fix pr feedback

* fix pr feedback

* clean up

* clean up

* fix missing conf

* Send perf metrics to MDM from windows daemonset (#568)

* updating json gem to address CVE-2020-10663 (#567)

* updating json gem to address CVE-2020-10663

* updating json gem to address CVE-2020-10663

* update recommended alerts readme (#570)

@dcbrown16 pointed out that this page links to the wrong document in [this issue](#475). The content in the currently linked page is identical to the page which should be linked, so it's a simple fix.

* trying again to fix the json gem (#571)

* trying again to fix the json gem

* removing installation of newer json gem

* Addressing PR comments for - #568 (#569)

* Mem_Buf_limit  is configurable via ConfigMap (#574)

* add log rotation settings for fluentd logs (#577)

* Gangams/release 06112021 (#578)

* updates related to ciprod06112021 release

* minor update

* release note update (#579)

* Make sidecar fluentbit chunk size configurable (#573)

* Fix vulnerabilities (#583)

* test

* test1

* test-2

* test-3

* 3

* 4

* test

* 2

* 3

* 4

* 5

* 6

* rename gem for windows

* fix

* fix

* Windows build optimization (#582)

* fix windows build failure due to msys2 version

* Fix telegraf startup issue when endpoint is unreachable (#587)

* revert fbit tail plugins defaults to std defaults (#586)

* fixed another bug (#593)

* feat: add new metrics to MDM for allocatable % calculation of cpu and memory usage (#584)

* feat: allocatable cpu and memory % metrics for MDM

* maybe

* linux is working

* windows....

* some more

* comment

* better

* syntax

* ruby

* revert omsagent.yaml

* comments

* pr feedback

* pr feedback

* testing msys2 version update

* better

* update adx sdk for perf issue (#601)

* remove md check

* Gangams/release notes update for hotfix (#596)

* release notes updates

* release notes updates for ciprod06112021-1

* Cherry picking hotfix changes to ci_dev (#605)

* release changes (#607)

* Gangams/aad stage3 msi auth (#585)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* refactor the windows agent ingestion token code

* code cleanup

* fix build errors

* code clean up

* code clean up

* code clean up

* code clean up

* more refactoring

* fix bug

* fix bug

* add debug logs

* add nil checks

* revert changes

* revert yaml change since this was added on the aks side

* fix pr feedback

* fix pr feedback

* refine retry code

* update mdsd env as per official build

* cleanup

* update env vars per mdsd

* update with mdsd official build

* skip cert gen & renewal in case of aad msi auth

* add nil check

* cherry windows agent nodeip issue

* fix merge issue

Co-authored-by: rashmichandrashekar <[email protected]>

* Gangams/remove chart version dependency (#589)

* remove chart version dependency

* remove unused code

* fix resource type

* fix

* handle weird cli chars

* update release process

* Gangams/july 2021 release tasks 3 (#613)

* use artifact and pipeline creds for image push

* minor update

* add vuln fix here so that pr can be merged

* remove un-used output plugin (#614)

* fix telegraf telemetry and improve fluentd liveness (#611)

* fix telegraf telemetry and improve fluentd liveness

* address identified vuln with libsystemd0

* fix exported image file extension

* Gangams/july 2021 release tasks 2 (#612)

* tail rs mdsd err logs

* configure mdsd log rotation

* log rotation for mdsd log files

* Fix out_oms.go dependency vulnerabilities (#623)

* revert libsystemd0 update (#616)

* updates for ci-prod release instructions (#619)

* cherry pick changes from ci_prod (#622)

* Support az login for passwords starting with dash ('-') (#626)

Co-authored-by: Vladimir Babichev <[email protected]>

* Gangams/add telemetry fbit settings (#628)

* add telemetry to track fbit settings

* add telemetry to track fbit settings

* check onboarding status (#629)

* Gangams/arc k8s conformance test updates (#617)

* conf test updates

* clean up

* wip

* update with mcr cidev image

* handle log path

* cleanup

* clean up

* wip

* working

* update for mcr image

* minor

* image update

* handle latency of connected cluster resource creation

* update conftest image

* upgrade golang version for windows in pipeline build and locally (#630)

* Updating a link in Readme.md (#632)

The link to the build pipelines now goes directly to our build pipelines (instead of to all github-private pipelines)

* Updating omsagent yaml to have parity with omsagent yaml file in AKS RP (#615)

* Unit test tooling (#625)

Added tooling and examples for unit tests

* run unit tests after a merge too (#634)

* flag stale PRs & issues

* Adding script to collect logs (for troubleshooting) (#636)

* added script for collecting logs

* added windows daemonset and prometheus sidecar, as well as some explanatory prints

* added kubectl describe and kubectl logs output

* changed message to make it clearer that some errors are expected

* Sarah/ev2 (#640)

* ev2 artifacts for release pipeline

* update parameters reference

* add artifacts tar file

* changes to rollout and service model

* change agentimage path

* adding agentimage to artifact script

* removing charts from tarball

* change script to use blob storage

* change blob variables

* echo variables

* change blob uri

* use release id for blob prefix

* change to delete blob file

* add check for if blob storage file exists

* fix script errors

* update check for file in storage

* change true check

* comments and change storage account info to pipeline variables

* Changes for windows tar file

* PR changes

* documenting fbit tail plugin configmap settings. (#638)

* documenting fbit tail plugin configmap settings.

* Install unzip package on shell extension (#642)

* Changing installation in ev2 script (#644)

* Adjust release pipeline to use cdpx acr (#647)

* Adjust release pipeline to use cdpx acr

* Adjust release pipeline to use cdpx acr

* Update CDPX ACR path

* Add check for cdpx repo variable

* Sarah/ev2 prod (#649)

* Ev2 changes for prod

* CDPX repo naming change (#652)

* Sarah/ev2 update (#654)

* remove acr name from repo path

* add check to make sure tag does not exist in mcr repo

* change tag syntax for mcr repo check (#655)

* Gangams/optimize win livenessprobe (#653)

* livenessprobe optimization

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* optimize windows agent liveness probe

* Gangams/addon token adapter image tag to telemetry (#656)

* addon token adapter image tag

* addon token adapter image tag

* Sarah/ev2 helm (#658)

* Use MSI for Arc Release

* Use CIPROD_ACR AME subscription for shell extension

* remove extra line endings

* Sarah/ev2 pipeline (#661)

* testing build artifact dir changes

* add .pipelines directory and omsagent.yaml to build artifacts

* add charts directory to build artifacts (#662)

* Sarah/remove cdpx creds (#664)

* don't use cdpx acr creds from kv

* add e2etest.yaml to build output

* keep cdpx creds for now

* chart updates for rbac api version change (#660)

* chart updates for rbac api version change

* include windows ds for arc

* proxy support (for non-aks) (#665)

* changes related to aad msi auth feature

* use existing envvars

* fix imds token expiry interval

* initial proxy support

* merge?

* cleaning up some files which should've merged differently

* proxy should be working, but most tables don't have any data. About to merge, maybe whatever was wrong is now fixed

* linux AMA proxy works

* about to merge

* proxy support appears to be working, final mdsd build location will still change

* removing some unnecessary changes

* forgot to remove one last change

* redirected mdsd stderr to stdout instead of stdin

* addressing proxy password location comment

Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Gangams/agent release ciprod10082021 & win-ciprod10082021 (#666)

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* updates for the release ciprod10082021 and win-ciprod10082021

* use buildcommand for prod pipeline (#668)

* fixed merge issues. (#671) (#672)

* fix merge conflicts

* update with newimage tag

* changes related to mdsd version update (#673) (#674)

* Sarah/enable metrics (#675)

* add user assigned msi to yaml for pipeline

* update placeholders

* Gangams/chart updates oct2021 release (#676)

* chart updates for oct2021 release

* wip

* wip

* wip

* Gangams/msi mode mdsd crash fix (#677)

* update mdsd version which has fix for crash in msi mode

* image tag updates

* update to use extension GA api version (#679)

* Gangams/arm template msi onboarding (#659)

* wip

* wip

* working

* working

* working

* working

* working

* working

* shorten dcr prefix to DCR- to handle default workspace name length

* use MSCI- prefix similar to MSVMI- for dcr

* Gangams/conf test updates to handle sidecar (#681)

* wip

* test updates

* fix pr feedback

* fix pr feedback

* Fix scan break due to latest trivy changes

* Anjohans/configurable database name (#663)

* First cut at an implementation

* Reverting a change

* Moving a few lines to better align with cluster URI config

* Moving a few lines to better align with cluster URI config

* Adding an extra check that won't hurt

* Getting ADX database name from config rather than from secret

* Reverse the mangling done by editor

* Fixes to the code for reading the db name setting

* More fixes to the rb code for settings

* Tweaked and tested

* Code review

* Review follow-up

* Remove whitespace

* Gangams/troubelshooting script for arc k8s (#682)

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* doc updates

* doc updates

* wip

* wip

* update repo for issues

* fix minor one

* Sarah/remove cdpx creds (#685)

* remove download of cdpx creds

* fix: subtract number instead of string + update fluentd version 1.14.2 to fix security vulnerability (#686)

* fix: change default value to a number so that subtraction happens correctly

* update fluentd version to 1.14.2

* extra end statement

* safely set to float

* big decimal precision

* revert omsagent

* keep telemetry

* Faster Linux builds (part 1) (#687)

* moved docker image arg later on to enable docker build caching

* fixing image tag (doh)

* Sarah/fluentbit windows log (#688)

* upgrade fluentbit version for windows

* saving progress--fluent bit log tailing working for windows

* use configmap values for fluent-bit.conf where necessary and make necessary files common

* revert certificategenerator

* remove tomlparser-agent-config from linux folder

* clean up fluent.conf

* clean up fluent-bit.conf

* revert image tag

* fix agent tag

* make fluent bit flush interval configurable

* clean up unnecessary conf files

* remove unnecessary parts of fluent and fluent-bit conf

* log level back to info

* add fbit env variables for omsagent-win

* moving db files to var directory

* default to port 10250 & containerd for linux agent (#699)

* default to port 10250 & containerd

* fix pr feedback

* Updating pod annotation for latest agent version (#697)

* fix windows build failure due to msys2 version (#700)

* fix windows build failure due to msys2 version

* 20211130.0.0

* Jan agent tasks (#698)

* remove v1 fallback hidden option (#705)

* collect telemetry containerlog records with emptystamp (#703)

* collect telemetry containerlog records with emptystamp

* collect telemetry containerlog records with emptystamp

* Fixing telegraf bug for placeholder name (#706)

* Gangams/jan 2022 release tasks 3 (#702)

* add telemetry related to windows containers records

* add telemetry related to windows containers records

* containercount telemetry

* add explicit exit code in ps scripts

* node count telemetry

* telemetry for win cirecord 64KB or more

* metric to track wintelegraf metrics with tags 64kb

* metric to track wintelegraf metrics with tags 64kb

* fix pr feedback

* Gangams/jan 2022 release tasks 2 (#701)

* mdsd proc cpu and memory telemetry

* write ai logs to file and telemetry for mdsd proc

* write ai logs to file and telemetry for mdsd proc

* write ai logs to file and telemetry for mdsd proc

* fix pr feedback

* use name_prefix

* remove mdsd telemetry changes

* remove mdsd telemetry changes

* remove mdsd telemetry changes

* release updates for ciprod01312022 & win-ciprod01312022release (#707)

* release updates for ciprod01312022 release

* release updates for ciprod01312022 release

* fix pr feedback

* fix logger exception (#709)

* Gangams/chart version update for jan release (#710)

* chart updates for jan2022 release

* add missing agentversion annotations

* fix agentversion annotation issue in chart (#712)

* adx bug + misc (#714)

* fix golang dependencies

* fix adx bug

* exclude telegraf

* fix space

* include both

* exclude files specifically

* fix build break (#715)

* fix build break

* update all places

* Explicitly use win-2019 to unblock windows PRs builds

* Fixing telegraf vulnerability (#716)

* cherry picked changes from 03112022 release (#719)

* cherry picked changes from 03112022 release

* Gangams/http proxy support (#717)

* add proxy cert support

* add proxy cert support

* add proxy cert support

* add proxy cert support

* remove arbitrary username and pwd requirement

* remove arbitrary username and pwd requirement

* add proxy support for mdm

* mdsd dev build

* proxy changes

* fix typo

* mdsd dev build

* add libcurl specific things

* working mdsd proxy build

* mdsd official master build

* handle proxy endpoint which endswith /

* latest official mdsd build

* add telemetry to track proxy ca cert

* build multi-arch images (#704)

* build multi-arch linux images
* new pipelines to build multi-arch images

Co-authored-by: Amol Agrawal <[email protected]>

* add missing artifacts (#720)

* add missing artifacts

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/msi  onboarding arm template updates for AKS (#721)

* msi arm template updates

* handle space in location

* minor fixes (#722)

Co-authored-by: Amol Agrawal <[email protected]>

* specify go patch version (#723)

* specify go minor version

Co-authored-by: Amol Agrawal <[email protected]>

* User/amagraw/ciprod release 20220317 (#724)

* ciprod release march changes

Co-authored-by: Amol Agrawal <[email protected]>

* Remove health type from DCR onboarding & add private link support for windows agent in msi mode (#727)

* add private link support for windows agent in msi auth

* remove Microsoft-KubeHealth

* add private link support for windows msi

* fix bug

* fix bug

* fix bug

* fix bug

* check platform specific tags (#730) (#731)

* PodReadyPercentage metric bug fix (#734)

* update windows to ruby 2.7 (#732)

Co-authored-by: Amol Agrawal <[email protected]>

* Improve CI/CD for multi-arch (#733)

* selective push + trivy test

* keep size down

* improve CI and PR builds

* improve checks

* remove IMAGE_TAG build_arg from prod pipeline

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/ts updates for msi (#736)

* ts updates for msi based onboarding

* ts updates for msi based onboarding

* fix typo

* fix typo

* improve log message

* Sarah/health deprecation (#735)

Removes all health feature related code

* check platform specific tags (#738)

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/msi test instructions (#739)

* instructions for msi test validation

* readme updates

* readme updates

* readme updates

* readme updates

* Add CI Windows Build to MultiArch Dev pipeline (#740)

* test image in pools

* update dev pipeline - 1

* update dev -1

* fix job names

* correct paths

* test pool name

* update pool name

* updated urls

* speed up installs

* add base build

* fix paths

* do both builds

* fix bug

* add pool for common

* fix bug

* create path

* temp remove metadata windows

* fix bug

* fix docker command

* almost there

* login to acr

* create windows metadata file

* address PR comments I

Co-authored-by: Amol Agrawal <[email protected]>

* Add Windows phase (#741)

* build and release windows for prod

Co-authored-by: Amol Agrawal <[email protected]>

* Sarah/add onboarding templates (#742)

* add onboarding templates for legacy auth

* fix download (#749)

Co-authored-by: Amol Agrawal <[email protected]>

* force run trivy stage (#745)

- scans for HIGH, MEDIUM, CRITICAL CVEs with fixes available in / and /usr/lib
- breaks build if CVEs with existing fixes found
- adds trivyignore to accommodate CVEs which are understood and should not get flagged
- adds CVEs to trivyignore to unblock builds; CVEs will be fixed and removed from trivyignore in later PRs

Co-authored-by: Amol Agrawal <[email protected]>
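
The gating rule described for the trivy stage in #745 can be sketched as follows. This is only an illustration of the decision logic, not trivy's implementation or the pipeline's actual script: the `Finding` record, the function names, and the placeholder CVE IDs are all assumptions made for the example, and the path scoping (`/` and `/usr/lib`) is left out.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Severities the stage is described as scanning for.
BLOCKING_SEVERITIES = {"HIGH", "MEDIUM", "CRITICAL"}


@dataclass(frozen=True)
class Finding:
    """One scanner finding (hypothetical shape, not trivy's report schema)."""
    cve_id: str
    severity: str
    fixed_version: Optional[str]  # None when no fix has been published


def blocking_findings(findings: Iterable[Finding], ignored: Iterable[str]) -> List[Finding]:
    """Keep findings that should break the build.

    A finding blocks the build only if its severity is in scope, a fix
    exists, and its ID is not listed in the ignore file.
    """
    ignored_ids = set(ignored)
    return [
        f for f in findings
        if f.severity in BLOCKING_SEVERITIES
        and f.fixed_version is not None
        and f.cve_id not in ignored_ids
    ]


def gate_exit_code(findings: Iterable[Finding], ignored: Iterable[str]) -> int:
    """Return 1 (break the build) if any blocking finding remains, else 0."""
    return 1 if blocking_findings(findings, ignored) else 0
```

Under this sketch, adding an ID to the ignore list unblocks the build without hiding the finding from the scan report, which matches the stated intent of removing entries again once the CVE is fixed.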

* update telegraf to 1.22.2 to fix vulns (#752)

* update telegraf to 1.22.2 to fix vulns

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/arc k8s aad msi auth  (#743)

* arc k8s msi

* wip

* extension identity role

* imds sidecar integration for arc k8s

* imds sidecar integration for arc k8s

* imds endpoint for windows

* imds endpoint for windows

* wip

* fix exception

* rename param name

* arc msi imdsd container changes

* arc msi imdsd container changes

* arc msi imdsd container changes

* arc msi imdsd container changes

* arc msi imdsd container changes

* revert unneeded yaml changes

* revert unneeded yaml changes

* wip

* wip

* working

* working

* working

* add implementation for msi token for windows mdm metrics

* fix comment

* arc k8s msi onboarding templates

* fix template bug

* fix template bug

* fix template bug

* rename flag name

* fix template bug

* make useAADAuth specific to arc k8s

* set k8sport at machine scope for windows

* fix bug

* fix bug

* update rbac for arc k8s imds

* bump chart version for conformance test run

* conf test updates for msi auth

* cli extension whl file

* add containerinsights solution in msi auth mode

* unify tags

* revert test chart and image versions

* remove test whl file and fix conf test

* conf test updates for addon-token-adapter

* remove container insights solution add for msi auth

* add missing arm template param

* Gangams/ws2022 support (#756)

* use hyperv isolation

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* multi-arc image support

* doc and script updates

* add common as dependency for multi-arc job

* merge into single job for perf evaluation

* merge into single job for perf evaluation

* merge into single job for perf evaluation

* separate jobs for ltsc2019 & ltsc2022

* separate jobs for ltsc2019 & ltsc2022

* update dev image docker file & script

* remove unnecessary task

* update prod pipeline yaml for windows multi-arc image

* test yamls for ltsc2019 & ltsc2022

* fix pr checker fail

* fix repoImageWindows path in windows pipeline

* remove passing imagetag for prod

* CA Cert Fix for Mariner Hosts in Air Gap (#751)

* add cifs & fuse file systems to ignore list (#750)

* Data collection script (#759)

* Add files via upload

* Add files via upload

* Delete AKSInsightsLogCollection.sh

* Create README.md

* Add files via upload

* move script to subfolder LogCollection

* Update README.md

* Rename AKSInsightsLogCollection.sh to AgentLogCollection.sh

* Microsoft mandatory file (#763)

Co-authored-by: microsoft-github-policy-service[bot] <77245923+microsoft-github-policy-service[bot]@users.noreply.github.com>

* Adding v2 schema options (#762)

* Adding v2 schema options

Adding commented out section in log collection settings for v2 schema

* adding documentation link

* Agent release for ciprod05192022 and win-ciprod05192022  (#765)

* Making changes for the release ciprod05192022 (except release notes)

* Adding release notes

* Remove unnecessary spaces

* Updating release notes for configmap v2 and disk usage metrics

* trivy image scan (#770)

* do trivy image check in azure pipelines

* remove pr-checker github action

Co-authored-by: Amol Agrawal <[email protected]>

* Prometheus sidecar memory optimization  (#769)

Don't start telegraf, mdsd, and fluent-bit in the prometheus sidecar if it has no work to do (monitor_kubernetes_pods = false and no OSM namespaces to scrape). This part is just a resource-usage optimization.

The newly created environment variables are written to a file instead of being added to bashrc, because variables set in bashrc are not available when a script runs in a non-interactive environment. This is the case for livenessprobe.sh.
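
The two ideas in #769 can be sketched roughly as below. The agent's real startup logic is a shell script, so this Python version is only an illustration under assumptions: the function names, the variable names, and the file name are hypothetical, and the condition is taken directly from the description above (no pod monitoring and no OSM namespaces means no work).

```python
import shlex
from pathlib import Path
from typing import Mapping, Sequence


def sidecar_has_work(monitor_kubernetes_pods: bool, osm_namespaces: Sequence[str]) -> bool:
    """The sidecar needs its helper processes only if it has something to scrape."""
    return monitor_kubernetes_pods or len(osm_namespaces) > 0


def write_env_file(path: Path, variables: Mapping[str, str]) -> None:
    """Write variables as `export NAME=value` lines to a standalone file.

    A script started non-interactively (such as a liveness probe) can source
    this file explicitly, whereas it would not pick up bashrc.
    """
    lines = [f"export {name}={shlex.quote(value)}" for name, value in variables.items()]
    path.write_text("\n".join(lines) + "\n")
```

A startup script following this sketch would call `sidecar_has_work` once, skip launching telegraf, mdsd, and fluent-bit when it returns `False`, and record the decision with `write_env_file` so that the probe can read it later.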

* Gangams/fix telegraf issue (#773)

* avoid imds token call during start up

* avoid imds token call during start up

* Make metrics endpoint variable on ArcA cluster (#772)

* add integration for azure subnet ip usage (#774)

* add integration for azure cni subnet ip usage

* exclude unfixed cve & remove fixed one

* Gangams/rs hyper scale 2022 ready (#753)

* watch and multiproc implementation

* fix weird bug

* multiproc support for fluentd

* working

* fix log lines

* refactor code

* cache telemetry

* nodecount telemetry

* bug fix

* further optimize

* bugfix related typo

* node allocatable cache

* wincontainerinventory in multiproc

* disable health

* config events on different core

* add ts to logs

* move kube perf records to separate plugin

* refactor

* minor update

* remove commented code

* mdm state file

* mdm state file

* podmdm to separate plugin

* bug fixes

* bug fixes

* bug fixes

* podmdm plugin

* bug fixes

* bug fixes

* remove unneeded log lines

* more improvements

* clean up

* clean up

* add requestId header for mdm metrics

* latest mdsd and fix for threading issue in out mdm

* rs specific config for large cluster

* optimize out mdm

* bug fix

* use large queue limit for kube perf

* 5k preview rs limits

* handle resourceversion empty or 0 scenario

* handle pagination api call failures

* fix bug

* preview image for internal customer validation

* preview image

* wip

* wip

* fix trailing whitespaces

* fix bug

* remove unused envvars in yaml

* revert minor things

* telemetry tags for preview release

* revert preview image tags

* revert unintended change

* fix bug

* use same batchtime for both mdm & podinventory records

* use same batchtime for both mdm & podinventory records

* use same batchtime for both mdm & podinventory records

* use same batchtime for both mdm & podinventory records

* preview image tag with latest ci_dev changes

* change back to use prod image in docker files

* fix unit test failures

* exclude unfixed cve until it gets fixed

* fix minor issue

* increase retries to handle transient errors

* changes related to june 2022 release (#778)

* Gangams/ARM Template updates for the DCR API version and stream group (#784)

* update to use stream group

* update DCR api version & stream group

* Bump Newtonsoft.Json in /build/windows/installer/certificategenerator (#785)

Bumps [Newtonsoft.Json](https://github.com/JamesNK/Newtonsoft.Json) from 12.0.3 to 13.0.1.
- [Release notes](https://github.com/JamesNK/Newtonsoft.Json/releases)
- [Commits](JamesNK/Newtonsoft.Json@12.0.3...13.0.1)

---
updated-dependencies:
- dependency-name: Newtonsoft.Json
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Gangams/fix file access exceptions (#787)

* fix file access exception

* move insights metrics conf to common

* clear file content before writing content

* add timestamp to debug logs

* release updates for linux agent

* Adhere to containers security guidance (#783)

- move away from dockerhub images to MCR images
- parameterize images in dockerfiles
- use azure pipelines variables to pass appropriate MCR images during buildtime

Co-authored-by: Amol Agrawal <[email protected]>

* update to DCR & DCR-A api version 2021-04-01 (#789)

* fix telegraf vulns (#795)

Co-authored-by: Amol Agrawal <[email protected]>

* Address vulnerabilities through package updates (#794)

- Updates to ruby 3.1.1
- Uses RVM as ruby manager instead of the brightbox ppa
- Updates fluentd to 1.14.6
- Use default JSON gem instead of yajl-json
- Consume tomlrb as a gem instead of committed source code

Co-authored-by: Amol Agrawal <[email protected]>
Co-authored-by: Ganga Mahesh Siddem <[email protected]>

* Gangams/fix log loss inode reuse (#796)

* use ignore_older fbit default and option for configurability

* fix minor comment

* fix minor comment

* merge conflict (#799)

* update vulns (#800)

Co-authored-by: Amol Agrawal <[email protected]>

* Gangams/fix permission assignments in test scripts (#802)

* restrict rw permissions to owner

* remove usage of worldwrite file permissions

* remove worldwrite file permission

* remove worldwrite file permission

* Gangams/rs vpa (#801)

* add vpa sidecar container

* add vpa sidecar container

* add vpa sidecar container

* add vpa sidecar container

* use image which has support for only scaling limits

* rename omsagent-rs-vpa to omsagent-vpa

* add vpa configmap

* use updated version of addon-resizer

* collect omsagent-rs limits telemetry if VPA enabled

* ignore new unfixed vulnerabilities

* fix bug

* fix bug

* fix bug

* bug fix

* fix bug

* fix bug

* rename env var name

* use the addon-resizer and collect requests and limits telemetry

* fix bug

* minor update

* User/amagraw/fix milli bytes bug (#805)



Co-authored-by: Amol Agrawal <[email protected]>

* update to use GA labels (#806)

* ciprod08102022 release

* bump rs memory limit

Co-authored-by: Ganga Mahesh Siddem <[email protected]>
Co-authored-by: bragi92 <[email protected]>
Co-authored-by: Vishwanath <[email protected]>
Co-authored-by: saaror <[email protected]>
Co-authored-by: rashmichandrashekar <[email protected]>
Co-authored-by: Grace Wehner <[email protected]>
Co-authored-by: deagraw <[email protected]>
Co-authored-by: David Michelman <[email protected]>
Co-authored-by: Michael Sinz <[email protected]>
Co-authored-by: Nicolas Yuen <[email protected]>
Co-authored-by: seenu433 <[email protected]>
Co-authored-by: Tsubasa Nomura <[email protected]>
Co-authored-by: Vladimir <[email protected]>
Co-authored-by: Vladimir Babichev <[email protected]>
Co-authored-by: sarahpeiffer <[email protected]>
Co-authored-by: Anders Johansen <[email protected]>
Co-authored-by: Amol Agrawal <[email protected]>
Co-authored-by: Nina <[email protected]>
Co-authored-by: microsoft-github-policy-service[bot] <77245923+microsoft-github-policy-service[bot]@users.noreply.github.com>
Co-authored-by: Auston Li <[email protected]>
Co-authored-by: Janvi Jatakia <[email protected]>
Co-authored-by: MSFTXiangyu <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: bragi92 <[email protected]>