From 9287425be5792c4d3ea92a9708dc411d79350cf4 Mon Sep 17 00:00:00 2001 From: Ganga Mahesh Siddem Date: Wed, 13 Jan 2021 14:05:36 -0800 Subject: [PATCH] Gangams/changes for release ciprod01112021 (#490) * separate build yamls for ci_prod branch (#415) * re-enable adx path (#420) * Gangams/release changes (#419) * updates related to release * updates related to release * fix the incorrect version * fix pr feedback * fix some typos in the release notes * fix for zero filled metrics (#423) * consolidate windows agent image docker files (#422) * consolidate windows agent image docker files * revert docker file consolidation * revert readme updates * merge back windows dockerfiles * image tag update * Gangams/cluster creation scripts (#414) * onprem k8s script * script updates * scripts for creating non-aks clusters * fix minor text update * updates * script updates * fix * script updates * fix scripts to install docker * fix: Pin to a particular version of ltsc2019 by SHA (#427) * enable collecting npm metrics (optionally) (#425) * enable collecting npm metrics (optionally) * fix default enrichment value * fix adx * Saaror patch 3 (#426) * Create README.MD Creating content for Kubecon lab * Update README.MD * Update README.MD * Gangams/add containerd support to windows agent (#428) * wip * wip * wip * wip * bug fix related to uri * wip * wip * fix bug with ignore cert validation * logic to ignore cert validation * minor * fix minor debug log issue * improve log message * debug message * fix bug with nullorempty check * remove debug statements * refactor parsers * add debug message * clean up * chart updates * fix formatting issues * Gangams/arc k8s metrics (#413) * cluster identity token * wip * fix exception * fix exceptions * fix exception * fix bug * fix bug * minor update * refactor the code * more refactoring * fix bug * typo fix * fix typo * wait for 1min after token renewal request * add proxy support for arc k8s mdm endpoint * avoid additional get call * minor line ending fix * wip * have separate log for arc k8s cluster identity * fix bug on creating crd resource * remove update permission since not required * fixed some bugs * fix pr feedback * remove list since its not required * fix: Reverting back to ltsc2019 tag (#429) * more kubelet metrics (#430) * more kubelet metrics * celan up new config * fix nom issue when config is empty (#432) * support multiple docker paths when docker root is updated thru knode (#433) * Gangams/doc and other related updates (#434) * bring back nodeslector changes for windows agent ds * readme updates * chart updates for azure cluster resourceid and region * set cluster region during onboarding for managed clusters * wip * fix for onboarding script * add sp support for the login * update help * add sp support for powershell * script updates for sp login * wip * wip * wip * readme updates * update the links to use ci_prod branch * fix links * fix image link * some more readme updates * add missing serviceprincipal in ps scripts (#435) * fix telemetry bug (#436) * Gangams/readmeupdates non aks 09162020 (#437) * changes for ciprod09162020 non-aks release * fix script to handle cross sub scenario * fix minor comment * fix date in version file * fix pr comments * Gangams/fix weird conflicts (#439) * separate build yamls for ci_prod branch (#415) (#416) * [Merge] dev to prod for ciprod08072020 release (#424) * separate build yamls for ci_prod branch (#415) * re-enable adx path (#420) * Gangams/release changes (#419) * updates related to release * updates 
related to release * fix the incorrect version * fix pr feedback * fix some typos in the release notes * fix for zero filled metrics (#423) * consolidate windows agent image docker files (#422) * consolidate windows agent image docker files * revert docker file consolidation * revert readme updates * merge back windows dockerfiles * image tag update Co-authored-by: Vishwanath Co-authored-by: rashmichandrashekar Co-authored-by: Vishwanath Co-authored-by: rashmichandrashekar * fix quote issue for the region (#441) * fix cpucapacity/limit bug (#442) * grwehner/pv-usage-metrics (#431) - Send persistent volume usage and capacity metrics to LA for PVs with PVCs at the pod level; config to include or exclude kube-system namespace. - Send PV usage percentage to MDM if over the configurable threshold. - Add PV usage recommended alert template. * add new custom metric regions (#444) * add new custom metric regions * fix commas * add 'Terminating' state (#443) * Gangams/sept agent release tasks (#445) * turnoff mdm nonsupported cluster types * enable validation of server cert for ai ruby http client * add kubelet operations total and total error metrics * node selector label change * label update * wip * wip * wip * revert quotes * grwehner/pv-collect-volume-name (#448) Collect and send the volume name as another tag for pvUsedBytes in InsightsMetrics, so that it can be displayed in the workload workbook. Does not affect the PV MDM metric * Changes for september agent release (#449) Moving from v1beta1 to v1 for health CRD Adding timer for zero filling Adding zero filling for PV metrics * Gangams/arc k8s related scripts, charts and doc updates (#450) * checksum annotations * script update for chart from mcr * chart updates * update chart version to match with chart release * script updates * latest chart updates * version updates for chart release * script updates * script updates * doc updates * doc updates * update comments * fix bug in ps script * fix bug in ps script * minor update * release process updates * use consistent name across scripts * use consistent names * Install CA certs from wireserver (#451) * grwehner/pv-volume-name-in-mdm (#452) Add volume name for PV to mdm dimensions and zero fill it * Release changes for 10052020 release (#453) * Release changes for 10052020 release * remove redundant kubelet metrics as part of PR feedback * Update onboarding_instructions.md (#456) * Update onboarding_instructions.md Updated the documentation to reflect where to update the config map. 
* Update onboarding_instructions.md * Update onboarding_instructions.md * Update onboarding_instructions.md Updated the link * chart update for sept2020 release (#457) * add missing version update in the script (#458) * November release fixes - activate one agent, adx schema v2, win perf issue, syslog deactivation (#459) * activate one agent, adx schema v2, win perf issue, syslog deactivation * update chart * remove hiphen for params in chart (#462) Merging as its a simple fix (remove hiphen) * Changes for cutting a new build for ciprod10272020 release (#460) * using latest stable version of msys2 (#465) * fixing the windows-perf-dups (#466) * chart updates related to new microsoft/charts repo (#467) * Changes for creating 11092020 release (#468) * MDM exception aggregation (#470) * grwehner/mdm custom metric regions (#471) Remove custom metrics region check for public cloud * updaitng rs limit to 1gb (#474) * grwehner/pv inventory (#455) Add fluentd plugin to request persistent volume info from the kubernetes api and send to LA * Gangams/fix for build release pipeline issue (#476) * use isolated cdpx acr * correct comment * add pv fluentd plugin config to helm rs config (#477) * add pv fluentd plugin to helm rs config * helm rbac permissions for pv api calls * Gangams/fix rs ooming (#473) * optimize kpi * optimize kube node inventory * add flags for events, deployments and hpa * have separate function parseNodeLimits * refactor code * fix crash * fix bug with service name * fix bugs related to get service name * update oom fix test agent * debug logs * fix service label issue * update to latest agent and enable ephemeral annotation * change stream size to 200 from 250 * update yaml * adjust chunksizes * add ruby gc env * yaml changes for cioomtest11282020-3 * telemetry to track pods latency * service count telemetry * rename variables * wip * nodes inventory telemetry * configmap changes * add emit streams in configmap * yaml updates * fix copy and paste bug * add todo comments * fix node latency telemetry bug * update yaml with latest test image * fix bug * upping rs memory change * fix mdm bug with final emit stream * update to latest image * fix pr feedback * fix pr feedback * rename health config to agent config * fix max allowed hpa chunk size * update to use 1k pod chunk since validated on 1.18+ * remove debug logs * minor updates * move defaults to common place * chart updates * final oomfix agent * update to use prod image so that can be validated with build pipeline * fix typo in comment * Gangams/enable arc onboarding to ff (#478) * wip * updates * trigger login if the ctx cloud not same as specified cloud * add missed commit * Convert PV type dictionary to json for telemetry so it shows up in logs (#480) * fix 2 windows tasks - 1) Dont log to termination log 2) enable ADX route for containerlogs in windows (for O365) (#482) * fix ci envvar collection in large pods (#483) * grwehner/jan agent tasks (#481) - Windows agent fix to use log filtering settings in config map. - Error handling for kubelet_utils get_node_capacity in case /metrics/cadvsior endpoint fails. 
- Remove env variable for workspace key for windows agent * updating fbit version and cpu limit (#485) * reverting to older version (#487) * Gangams/add fbsettings configurable via configmap (#486) * wip * fbit config settings * add config warn message * handle one config provided but not other * fixed pr feedback * fix copy paste error * rename config parameter names * fix typo * fix fbit crash in helm path * fix nil check * Gangams/jan agent release tasks (#484) * wip * explicit amd64 affinity for hybrid workloads * fix space issue * wip * revert vscode setting file * remove per container logs in ci (#488) * updates for ciprod01112021 release Co-authored-by: Vishwanath Co-authored-by: rashmichandrashekar Co-authored-by: bragi92 Co-authored-by: saaror <31900410+saaror@users.noreply.github.com> Co-authored-by: Grace Wehner --- .pipelines/get-aad-app-creds-from-kv.sh | 14 + ...rom-cdpx-and-push-to-ci-acr-linux-image.sh | 15 +- ...m-cdpx-and-push-to-ci-acr-windows-image.sh | 14 +- ReleaseNotes.md | 78 ++ .../scripts/td-agent-bit-conf-customizer.rb | 11 +- build/common/installer/scripts/tomlparser.rb | 4 +- build/linux/installer/conf/container.conf | 2 - build/linux/installer/conf/kube.conf | 26 +- .../installer/datafiles/base_container.data | 3 +- .../scripts/tomlparser-agent-config.rb | 220 ++++++ .../scripts/tomlparser-health-config.rb | 73 -- build/version | 6 +- .../installer/certificategenerator/Program.cs | 8 +- build/windows/installer/conf/fluent.conf | 12 +- .../installer/scripts/livenessprobe.cmd | 24 +- charts/azuremonitor-containers/Chart.yaml | 2 +- .../templates/omsagent-daemonset-windows.yaml | 4 + .../templates/omsagent-daemonset.yaml | 4 + .../templates/omsagent-rbac.yaml | 2 +- .../templates/omsagent-rs-configmap.yaml | 34 +- charts/azuremonitor-containers/values.yaml | 44 +- kubernetes/linux/Dockerfile | 3 +- kubernetes/linux/main.sh | 36 +- kubernetes/linux/setup.sh | 2 +- kubernetes/omsagent.yaml | 69 +- kubernetes/windows/Dockerfile | 2 +- kubernetes/windows/main.ps1 | 18 +- .../onboarding/managed/disable-monitoring.ps1 | 34 +- .../onboarding/managed/disable-monitoring.sh | 17 + .../onboarding/managed/enable-monitoring.ps1 | 49 +- .../onboarding/managed/enable-monitoring.sh | 40 +- .../onboarding/managed/upgrade-monitoring.sh | 21 +- .../health/omsagent-template-aks-engine.yaml | 2 - scripts/preview/health/omsagent-template.yaml | 2 - source/plugins/ruby/CustomMetricsUtils.rb | 12 +- source/plugins/ruby/KubernetesApiClient.rb | 387 +++++----- source/plugins/ruby/constants.rb | 116 +-- source/plugins/ruby/filter_cadvisor2mdm.rb | 15 +- source/plugins/ruby/filter_inventory2mdm.rb | 3 +- source/plugins/ruby/filter_telegraf2mdm.rb | 3 +- source/plugins/ruby/in_kube_events.rb | 18 +- source/plugins/ruby/in_kube_nodes.rb | 410 ++++++---- source/plugins/ruby/in_kube_podinventory.rb | 720 ++++++++++-------- source/plugins/ruby/in_kube_pvinventory.rb | 253 ++++++ .../plugins/ruby/in_kubestate_deployments.rb | 424 ++++++----- source/plugins/ruby/in_kubestate_hpa.rb | 421 +++++----- .../ruby/kubernetes_container_inventory.rb | 53 +- source/plugins/ruby/out_mdm.rb | 51 +- source/plugins/ruby/podinventory_to_mdm.rb | 4 +- 49 files changed, 2428 insertions(+), 1357 deletions(-) create mode 100644 build/linux/installer/scripts/tomlparser-agent-config.rb delete mode 100644 build/linux/installer/scripts/tomlparser-health-config.rb create mode 100644 source/plugins/ruby/in_kube_pvinventory.rb diff --git a/.pipelines/get-aad-app-creds-from-kv.sh b/.pipelines/get-aad-app-creds-from-kv.sh 
index 8ef56cddb..a0ba464cc 100755 --- a/.pipelines/get-aad-app-creds-from-kv.sh +++ b/.pipelines/get-aad-app-creds-from-kv.sh @@ -11,6 +11,8 @@ do KV) KV=$VALUE ;; KVSECRETNAMEAPPID) AppId=$VALUE ;; KVSECRETNAMEAPPSECRET) AppSecret=$VALUE ;; + KVSECRETNAMECDPXAPPID) CdpxAppId=$VALUE ;; + KVSECRETNAMECDPXAPPSECRET) CdpxAppSecret=$VALUE ;; *) esac done @@ -27,4 +29,16 @@ az keyvault secret download --file ~/acrappsecret --vault-name ${KV} --name ${A echo "downloaded the appsecret from KV:${KV} and KV secret:${AppSecret}" +echo "key vault secret name for cdpx appid:${KVSECRETNAMECDPXAPPID}" + +echo "key vault secret name for cdpx appsecret:${KVSECRETNAMECDPXAPPSECRET}" + +az keyvault secret download --file ~/cdpxacrappid --vault-name ${KV} --name ${CdpxAppId} + +echo "downloaded the appid from KV:${KV} and KV secret:${CdpxAppId}" + +az keyvault secret download --file ~/cdpxacrappsecret --vault-name ${KV} --name ${CdpxAppSecret} + +echo "downloaded the appsecret from KV:${KV} and KV secret:${CdpxAppSecret}" + echo "end: get app id and secret from specified key vault" diff --git a/.pipelines/pull-from-cdpx-and-push-to-ci-acr-linux-image.sh b/.pipelines/pull-from-cdpx-and-push-to-ci-acr-linux-image.sh index 638d3a937..3844ea185 100755 --- a/.pipelines/pull-from-cdpx-and-push-to-ci-acr-linux-image.sh +++ b/.pipelines/pull-from-cdpx-and-push-to-ci-acr-linux-image.sh @@ -25,12 +25,21 @@ ACR_APP_ID=$(cat ~/acrappid) ACR_APP_SECRET=$(cat ~/acrappsecret) echo "end: read appid and appsecret" +echo "start: read appid and appsecret for cdpx" +CDPX_ACR_APP_ID=$(cat ~/cdpxacrappid) +CDPX_ACR_APP_SECRET=$(cat ~/cdpxacrappsecret) +echo "end: read appid and appsecret which has read access on cdpx acr" + + +# Name of CDPX_ACR should be in this format :Naming convention: 'cdpx' + service tree id without '-' + two digit suffix like'00'/'01 +# suffix 00 primary and 01 secondary, and we only use primary +# This configured via pipeline variable echo "login to cdpxlinux acr:${CDPX_ACR}" -docker login $CDPX_ACR --username $ACR_APP_ID --password $ACR_APP_SECRET +docker login $CDPX_ACR --username $CDPX_ACR_APP_ID --password $CDPX_ACR_APP_SECRET echo "login to cdpxlinux acr completed: ${CDPX_ACR}" echo "pull agent image from cdpxlinux acr: ${CDPX_ACR}" -docker pull ${CDPX_ACR}/artifact/3170cdd2-19f0-4027-912b-1027311691a2/official/${CDPX_REPO_NAME}:${CDPX_AGENT_IMAGE_TAG} +docker pull ${CDPX_ACR}/official/${CDPX_REPO_NAME}:${CDPX_AGENT_IMAGE_TAG} echo "pull image from cdpxlinux acr completed: ${CDPX_ACR}" echo "CI Release name is:"$CI_RELEASE @@ -41,7 +50,7 @@ echo "CI ACR : ${CI_ACR}" echo "CI AGENT REPOSITORY NAME : ${CI_AGENT_REPO}" echo "tag linux agent image" -docker tag ${CDPX_ACR}/artifact/3170cdd2-19f0-4027-912b-1027311691a2/official/${CDPX_REPO_NAME}:${CDPX_AGENT_IMAGE_TAG} ${CI_ACR}/public/azuremonitor/containerinsights/${CI_AGENT_REPO}:${imagetag} +docker tag ${CDPX_ACR}/official/${CDPX_REPO_NAME}:${CDPX_AGENT_IMAGE_TAG} ${CI_ACR}/public/azuremonitor/containerinsights/${CI_AGENT_REPO}:${imagetag} echo "login ciprod acr":$CI_ACR docker login $CI_ACR --username $ACR_APP_ID --password $ACR_APP_SECRET diff --git a/.pipelines/pull-from-cdpx-and-push-to-ci-acr-windows-image.sh b/.pipelines/pull-from-cdpx-and-push-to-ci-acr-windows-image.sh index 066410af5..095a00039 100755 --- a/.pipelines/pull-from-cdpx-and-push-to-ci-acr-windows-image.sh +++ b/.pipelines/pull-from-cdpx-and-push-to-ci-acr-windows-image.sh @@ -25,12 +25,20 @@ ACR_APP_ID=$(cat ~/acrappid ) ACR_APP_SECRET=$(cat ~/acrappsecret) echo "end: read appid 
and appsecret" +echo "start: read appid and appsecret for cdpx" +CDPX_ACR_APP_ID=$(cat ~/cdpxacrappid) +CDPX_ACR_APP_SECRET=$(cat ~/cdpxacrappsecret) +echo "end: read appid and appsecret which has read access on cdpx acr" + +# Name of CDPX_ACR should be in this format :Naming convention: 'cdpx' + service tree id without '-' + two digit suffix like'00'/'01 +# suffix 00 primary and 01 secondary, and we only use primary +# This configured via pipeline variable echo "login to cdpxwindows acr:${CDPX_ACR}" -docker login $CDPX_ACR --username $ACR_APP_ID --password $ACR_APP_SECRET +docker login $CDPX_ACR --username $CDPX_ACR_APP_ID --password $CDPX_ACR_APP_SECRET echo "login to cdpxwindows acr:${CDPX_ACR} completed" echo "pull image from cdpxwin acr: ${CDPX_ACR}" -docker pull ${CDPX_ACR}/artifact/3170cdd2-19f0-4027-912b-1027311691a2/official/${CDPX_REPO_NAME}:${CDPX_AGENT_IMAGE_TAG} +docker pull ${CDPX_ACR}/official/${CDPX_REPO_NAME}:${CDPX_AGENT_IMAGE_TAG} echo "pull image from cdpxwin acr completed: ${CDPX_ACR}" echo "CI Release name:"$CI_RELEASE @@ -40,7 +48,7 @@ imagetag="win-"$CI_RELEASE$CI_IMAGE_TAG_SUFFIX echo "agentimagetag="$imagetag echo "tag windows agent image" -docker tag ${CDPX_ACR}/artifact/3170cdd2-19f0-4027-912b-1027311691a2/official/${CDPX_REPO_NAME}:${CDPX_AGENT_IMAGE_TAG} ${CI_ACR}/public/azuremonitor/containerinsights/${CI_AGENT_REPO}:${imagetag} +docker tag ${CDPX_ACR}/official/${CDPX_REPO_NAME}:${CDPX_AGENT_IMAGE_TAG} ${CI_ACR}/public/azuremonitor/containerinsights/${CI_AGENT_REPO}:${imagetag} echo "login to ${CI_ACR} acr" docker login $CI_ACR --username $ACR_APP_ID --password $ACR_APP_SECRET diff --git a/ReleaseNotes.md b/ReleaseNotes.md index ddfd01314..2afe16481 100644 --- a/ReleaseNotes.md +++ b/ReleaseNotes.md @@ -10,6 +10,84 @@ additional questions or comments. 
## Release History
Note : The agent version(s) below has dates (ciprod), which indicate the agent build dates (not release dates)
+### 01/11/2021 -
+##### Version microsoft/oms:ciprod01112021 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod01112021 (linux)
+##### Version microsoft/oms:win-ciprod01112021 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:win-ciprod01112021 (windows)
+##### Code change log
+- Fixes for Linux Agent Replicaset Pod OOMing issue
+- Update fluentbit (1.4.2 to 1.6.8) for the Linux Daemonset
+- Make Fluentbit settings log_flush_interval_secs, tail_buf_chunksize_megabytes and tail_buf_maxsize_megabytes configurable via configmap
+- Support for PV inventory collection
+- Remove the custom metric region check for public cloud regions and use the cloud environment variable to determine custom metric support
+- For daemonset pods, set dnsConfig to ndots: 3 (from the default ndots: 5) to reduce the number of DNS API calls
+- Fix for inconsistent collection of container environment variables for pods with a high number of containers
+- Fix for std{out;err} log_collection_settings in the configmap not being honored by the windows daemonset
+- Update the windows daemonset agent to read the workspace key from the mounted file rather than an environment variable
+- Remove per container info logs in the container inventory
+- Enable ADX route for windows container logs
+- Remove logging to termination log in windows agent liveness probe
+
+
+### 11/09/2020 -
+##### Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod11092020 (linux)
+##### Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:win-ciprod11092020 (windows)
+##### Code change log
+- Fix for duplicate windows metrics
+
+### 10/27/2020 -
+##### Version microsoft/oms:ciprod10272020 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod10272020 (linux)
+##### Version microsoft/oms:win-ciprod10272020 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:win-ciprod10052020 (windows)
+##### Code change log
+- Activate oneagent in a few AKS regions (koreacentral, norwayeast)
+- Disable syslog
+- Fix timeout for Windows daemonset liveness probe
+- Make request == limit for Windows daemonset resources (cpu & memory)
+- Schema v2 for container log (ADX only - applicable only for select customers for piloting)
+
+### 10/05/2020 -
+##### Version microsoft/oms:ciprod10052020 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod10052020 (linux)
+##### Version microsoft/oms:win-ciprod10052020 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:win-ciprod10052020 (windows)
+##### Code change log
+- Health CRD to version v1 (from v1beta1) for k8s versions >= 1.19.0
+- Collection of PV usage metrics for PVs mounted by pods (kube-system pods excluded by default)(doc-link-needed)
+- Zero fill a few custom metrics under a timer, and add zero filling for the new PV usage metrics
+- Collection of additional Kubelet metrics ('kubelet_running_pod_count','volume_manager_total_volumes','kubelet_node_config_error','process_resident_memory_bytes','process_cpu_seconds_total','kubelet_runtime_operations_total','kubelet_runtime_operations_errors_total'). This also includes updates to the 'kubelet' workbook to include these new metrics
+- Collection of Azure NPM (Network Policy Manager) metrics (basic & advanced. By default, NPM metrics collection is turned OFF)(doc-link-needed)
+- Support log collection when docker root is changed with knode. Tracked by [this](https://github.com/Azure/AKS/issues/1373) issue
+- Support for Pods in 'Terminating' state for nodelost scenarios
+- Fix for reduction in telemetry for custom metrics ingestion failures
+- Fix CPU capacity/limits metrics being 0 for Virtual nodes (VK)
+- Add new custom metric regions (eastus2,westus,australiasoutheast,brazilsouth,germanywestcentral,northcentralus,switzerlandnorth)
+- Enable strict SSL validation for AppInsights Ruby SDK
+- Turn off custom metrics upload for unsupported cluster types
+- Install CA certs from wire server for windows (in certain clouds)
+
+### 09/16/2020 -
+> Note: This agent release is targeted ONLY at non-AKS clusters via the Azure Monitor for containers HELM chart update
+##### Version microsoft/oms:ciprod09162020 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod09162020 (linux)
+##### Version microsoft/oms:win-ciprod09162020 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:win-ciprod09162020 (windows)
+##### Code change log
+- Collection of Azure Network Policy Manager Basic and Advanced metrics
+- Add support in Windows Agent for Container log collection of CRI runtimes such as ContainerD
+- Alertable metrics support for Arc K8s clusters, to reach parity with AKS
+- Support for multiple container log mount paths when docker is updated through knode
+- Bug fix related to MDM telemetry
+
+### 08/07/2020 -
+##### Version microsoft/oms:ciprod08072020 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod08072020 (linux)
+##### Version microsoft/oms:win-ciprod08072020 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:win-ciprod08072020 (windows)
+##### Code change log
+- Collection of KubeState metrics for deployments and HPA
+- Add proxy support for the Windows agent
+- Fix for ContainerState in ContainerInventory to handle Failed state and collection of environment variables for terminated and failed containers
+- Change /spec to /metrics/cadvisor endpoint to collect node capacity metrics
+- Disable Health Plugin by default; it can be enabled via configmap
+- Pin version of jq to 1.5+dfsg-2
+- Bug fix for showing node as 'not ready' when there is disk pressure
+- oneagent integration (disabled by default)
+- Add region check before sending alertable metrics to MDM
+- Telemetry fix for agent telemetry for sov. clouds
+
### 11/09/2020 -
##### Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod11092020 (linux)
diff --git a/build/common/installer/scripts/td-agent-bit-conf-customizer.rb b/build/common/installer/scripts/td-agent-bit-conf-customizer.rb index fae3acb36..35b71e550 100644 --- a/build/common/installer/scripts/td-agent-bit-conf-customizer.rb +++ b/build/common/installer/scripts/td-agent-bit-conf-customizer.rb @@ -18,12 +18,17 @@ def substituteFluentBitPlaceHolders bufferChunkSize = ENV["FBIT_TAIL_BUFFER_CHUNK_SIZE"] bufferMaxSize = ENV["FBIT_TAIL_BUFFER_MAX_SIZE"] - serviceInterval = (!interval.nil? && is_number?(interval)) ? interval : @default_service_interval + serviceInterval = (!interval.nil? && is_number?(interval) && interval.to_i > 0 ) ? interval : @default_service_interval serviceIntervalSetting = "Flush " + serviceInterval - tailBufferChunkSize = (!bufferChunkSize.nil? && is_number?(bufferChunkSize)) ? bufferChunkSize : nil + tailBufferChunkSize = (!bufferChunkSize.nil?
&& is_number?(bufferChunkSize) && bufferChunkSize.to_i > 0) ? bufferChunkSize : nil - tailBufferMaxSize = (!bufferMaxSize.nil? && is_number?(bufferMaxSize)) ? bufferMaxSize : nil + tailBufferMaxSize = (!bufferMaxSize.nil? && is_number?(bufferMaxSize) && bufferMaxSize.to_i > 0) ? bufferMaxSize : nil + + if ((!tailBufferChunkSize.nil? && tailBufferMaxSize.nil?) || (!tailBufferChunkSize.nil? && !tailBufferMaxSize.nil? && tailBufferChunkSize.to_i > tailBufferMaxSize.to_i)) + puts "config:warn buffer max size must be greater or equal to chunk size" + tailBufferMaxSize = tailBufferChunkSize + end text = File.read(@td_agent_bit_conf_path) new_contents = text.gsub("${SERVICE_FLUSH_INTERVAL}", serviceIntervalSetting) diff --git a/build/common/installer/scripts/tomlparser.rb b/build/common/installer/scripts/tomlparser.rb index 7235ee0c3..fe26f639e 100644 --- a/build/common/installer/scripts/tomlparser.rb +++ b/build/common/installer/scripts/tomlparser.rb @@ -228,7 +228,7 @@ def get_command_windows(env_variable_name, env_variable_value) file.write(commands) commands = get_command_windows('AZMON_LOG_TAIL_PATH', @logTailPath) file.write(commands) - commands = get_command_windows('AZMON_LOG_EXCLUSION_REGEX_PATTERN', @stdoutExcludeNamespaces) + commands = get_command_windows('AZMON_LOG_EXCLUSION_REGEX_PATTERN', @logExclusionRegexPattern) file.write(commands) commands = get_command_windows('AZMON_STDOUT_EXCLUDED_NAMESPACES', @stdoutExcludeNamespaces) file.write(commands) @@ -244,7 +244,7 @@ def get_command_windows(env_variable_name, env_variable_value) file.write(commands) commands = get_command_windows('AZMON_CLUSTER_COLLECT_ALL_KUBE_EVENTS', @collectAllKubeEvents) file.write(commands) - commands = get_command_windows('AZMON_CONTAINER_LOGS_ROUTE', @containerLogsRoute) + commands = get_command_windows('AZMON_CONTAINER_LOGS_EFFECTIVE_ROUTE', @containerLogsRoute) file.write(commands) # Close file after writing all environment variables diff --git a/build/linux/installer/conf/container.conf b/build/linux/installer/conf/container.conf index f7e6e1da9..958a85eb6 100644 --- a/build/linux/installer/conf/container.conf +++ b/build/linux/installer/conf/container.conf @@ -45,14 +45,12 @@ #custom_metrics_mdm filter plugin type filter_cadvisor2mdm - custom_metrics_azure_regions eastus,southcentralus,westcentralus,westus2,southeastasia,northeurope,westeurope,southafricanorth,centralus,northcentralus,eastus2,koreacentral,eastasia,centralindia,uksouth,canadacentral,francecentral,japaneast,australiaeast,eastus2,westus,australiasoutheast,brazilsouth,germanywestcentral,northcentralus,switzerlandnorth metrics_to_collect cpuUsageNanoCores,memoryWorkingSetBytes,memoryRssBytes,pvUsedBytes log_level info type filter_telegraf2mdm - custom_metrics_azure_regions eastus,southcentralus,westcentralus,westus2,southeastasia,northeurope,westeurope,southafricanorth,centralus,northcentralus,eastus2,koreacentral,eastasia,centralindia,uksouth,canadacentral,francecentral,japaneast,australiaeast,eastus2,westus,australiasoutheast,brazilsouth,germanywestcentral,northcentralus,switzerlandnorth log_level debug diff --git a/build/linux/installer/conf/kube.conf b/build/linux/installer/conf/kube.conf index dbb4db0da..fb566c360 100644 --- a/build/linux/installer/conf/kube.conf +++ b/build/linux/installer/conf/kube.conf @@ -13,7 +13,14 @@ tag oms.containerinsights.KubePodInventory run_interval 60 log_level debug - custom_metrics_azure_regions 
eastus,southcentralus,westcentralus,westus2,southeastasia,northeurope,westeurope,southafricanorth,centralus,northcentralus,eastus2,koreacentral,eastasia,centralindia,uksouth,canadacentral,francecentral,japaneast,australiaeast,eastus2,westus,australiasoutheast,brazilsouth,germanywestcentral,northcentralus,switzerlandnorth + + + #Kubernetes Persistent Volume inventory + + type kubepvinventory + tag oms.containerinsights.KubePVInventory + run_interval 60 + log_level debug #Kubernetes events @@ -66,14 +73,12 @@ type filter_inventory2mdm - custom_metrics_azure_regions eastus,southcentralus,westcentralus,westus2,southeastasia,northeurope,westeurope,southafricanorth,centralus,northcentralus,eastus2,koreacentral,eastasia,centralindia,uksouth,canadacentral,francecentral,japaneast,australiaeast,eastus2,westus,australiasoutheast,brazilsouth,germanywestcentral,northcentralus,switzerlandnorth log_level info #custom_metrics_mdm filter plugin for perf data from windows nodes type filter_cadvisor2mdm - custom_metrics_azure_regions eastus,southcentralus,westcentralus,westus2,southeastasia,northeurope,westeurope,southafricanorth,centralus,northcentralus,eastus2,koreacentral,eastasia,centralindia,uksouth,canadacentral,francecentral,japaneast,australiaeast,eastus2,westus,australiasoutheast,brazilsouth,germanywestcentral,northcentralus,switzerlandnorth metrics_to_collect cpuUsageNanoCores,memoryWorkingSetBytes,pvUsedBytes log_level info @@ -98,6 +103,21 @@ max_retry_wait 5m + + type out_oms + log_level debug + num_threads 5 + buffer_chunk_limit 4m + buffer_type file + buffer_path %STATE_DIR_WS%/state/out_oms_kubepv*.buffer + buffer_queue_limit 20 + buffer_queue_full_action drop_oldest_chunk + flush_interval 20s + retry_limit 10 + retry_wait 5s + max_retry_wait 5m + + type out_oms log_level debug diff --git a/build/linux/installer/datafiles/base_container.data b/build/linux/installer/datafiles/base_container.data index ca2538b79..c680f0eea 100644 --- a/build/linux/installer/datafiles/base_container.data +++ b/build/linux/installer/datafiles/base_container.data @@ -22,6 +22,7 @@ MAINTAINER: 'Microsoft Corporation' /opt/microsoft/omsagent/plugin/filter_container.rb; source/plugins/ruby/filter_container.rb; 644; root; root /opt/microsoft/omsagent/plugin/in_kube_podinventory.rb; source/plugins/ruby/in_kube_podinventory.rb; 644; root; root +/opt/microsoft/omsagent/plugin/in_kube_pvinventory.rb; source/plugins/ruby/in_kube_pvinventory.rb; 644; root; root /opt/microsoft/omsagent/plugin/in_kube_events.rb; source/plugins/ruby/in_kube_events.rb; 644; root; root /opt/microsoft/omsagent/plugin/KubernetesApiClient.rb; source/plugins/ruby/KubernetesApiClient.rb; 644; root; root @@ -122,7 +123,7 @@ MAINTAINER: 'Microsoft Corporation' /opt/tomlparser-mdm-metrics-config.rb; build/linux/installer/scripts/tomlparser-mdm-metrics-config.rb; 755; root; root /opt/tomlparser-metric-collection-config.rb; build/linux/installer/scripts/tomlparser-metric-collection-config.rb; 755; root; root -/opt/tomlparser-health-config.rb; build/linux/installer/scripts/tomlparser-health-config.rb; 755; root; root +/opt/tomlparser-agent-config.rb; build/linux/installer/scripts/tomlparser-agent-config.rb; 755; root; root /opt/tomlparser.rb; build/common/installer/scripts/tomlparser.rb; 755; root; root /opt/td-agent-bit-conf-customizer.rb; build/common/installer/scripts/td-agent-bit-conf-customizer.rb; 755; root; root /opt/ConfigParseErrorLogger.rb; build/common/installer/scripts/ConfigParseErrorLogger.rb; 755; root; root diff --git 
a/build/linux/installer/scripts/tomlparser-agent-config.rb b/build/linux/installer/scripts/tomlparser-agent-config.rb new file mode 100644 index 000000000..e587909e5 --- /dev/null +++ b/build/linux/installer/scripts/tomlparser-agent-config.rb @@ -0,0 +1,220 @@ +#!/usr/local/bin/ruby + +#this should be require relative in Linux and require in windows, since it is a gem install on windows +@os_type = ENV["OS_TYPE"] +if !@os_type.nil? && !@os_type.empty? && @os_type.strip.casecmp("windows") == 0 + require "tomlrb" +else + require_relative "tomlrb" +end + +require_relative "ConfigParseErrorLogger" + +@configMapMountPath = "/etc/config/settings/agent-settings" +@configSchemaVersion = "" +@enable_health_model = false + +# 250 Node items (15KB per node) account to approximately 4MB +@nodesChunkSize = 250 +# 1000 pods (10KB per pod) account to approximately 10MB +@podsChunkSize = 1000 +# 4000 events (1KB per event) account to approximately 4MB +@eventsChunkSize = 4000 +# roughly each deployment is 8k +# 500 deployments account to approximately 4MB +@deploymentsChunkSize = 500 +# roughly each HPA is 3k +# 2000 HPAs account to approximately 6-7MB +@hpaChunkSize = 2000 +# stream batch sizes to avoid large file writes +# too low will consume higher disk iops +@podsEmitStreamBatchSize = 200 +@nodesEmitStreamBatchSize = 100 + +# higher the chunk size rs pod memory consumption higher and lower api latency +# similarly lower the value, helps on the memory consumption but incurrs additional round trip latency +# these needs to be tuned be based on the workload +# nodes +@nodesChunkSizeMin = 100 +@nodesChunkSizeMax = 400 +# pods +@podsChunkSizeMin = 250 +@podsChunkSizeMax = 1500 +# events +@eventsChunkSizeMin = 2000 +@eventsChunkSizeMax = 10000 +# deployments +@deploymentsChunkSizeMin = 500 +@deploymentsChunkSizeMax = 1000 +# hpa +@hpaChunkSizeMin = 500 +@hpaChunkSizeMax = 2000 + +# emit stream sizes to prevent lower values which costs disk i/o +# max will be upto the chunk size +@podsEmitStreamBatchSizeMin = 50 +@nodesEmitStreamBatchSizeMin = 50 + +# configmap settings related fbit config +@fbitFlushIntervalSecs = 0 +@fbitTailBufferChunkSizeMBs = 0 +@fbitTailBufferMaxSizeMBs = 0 + + +def is_number?(value) + true if Integer(value) rescue false +end + +# Use parser to parse the configmap toml file to a ruby structure +def parseConfigMap + begin + # Check to see if config map is created + if (File.file?(@configMapMountPath)) + puts "config::configmap container-azm-ms-agentconfig for agent settings mounted, parsing values" + parsedConfig = Tomlrb.load_file(@configMapMountPath, symbolize_keys: true) + puts "config::Successfully parsed mounted config map" + return parsedConfig + else + puts "config::configmap container-azm-ms-agentconfig for agent settings not mounted, using defaults" + return nil + end + rescue => errorStr + ConfigParseErrorLogger.logError("Exception while parsing config map for agent settings : #{errorStr}, using defaults, please check config map for errors") + return nil + end +end + +# Use the ruby structure created after config parsing to set the right values to be used as environment variables +def populateSettingValuesFromConfigMap(parsedConfig) + begin + if !parsedConfig.nil? && !parsedConfig[:agent_settings].nil? + if !parsedConfig[:agent_settings][:health_model].nil? && !parsedConfig[:agent_settings][:health_model][:enabled].nil? 
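+        # The enabled flag under [agent_settings.health_model] maps to AZMON_CLUSTER_ENABLE_HEALTH_MODEL exported at the end of this script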
+ @enable_health_model = parsedConfig[:agent_settings][:health_model][:enabled] + puts "enable_health_model = #{@enable_health_model}" + end + chunk_config = parsedConfig[:agent_settings][:chunk_config] + if !chunk_config.nil? + nodesChunkSize = chunk_config[:NODES_CHUNK_SIZE] + if !nodesChunkSize.nil? && is_number?(nodesChunkSize) && (@nodesChunkSizeMin..@nodesChunkSizeMax) === nodesChunkSize.to_i + @nodesChunkSize = nodesChunkSize.to_i + puts "Using config map value: NODES_CHUNK_SIZE = #{@nodesChunkSize}" + end + + podsChunkSize = chunk_config[:PODS_CHUNK_SIZE] + if !podsChunkSize.nil? && is_number?(podsChunkSize) && (@podsChunkSizeMin..@podsChunkSizeMax) === podsChunkSize.to_i + @podsChunkSize = podsChunkSize.to_i + puts "Using config map value: PODS_CHUNK_SIZE = #{@podsChunkSize}" + end + + eventsChunkSize = chunk_config[:EVENTS_CHUNK_SIZE] + if !eventsChunkSize.nil? && is_number?(eventsChunkSize) && (@eventsChunkSizeMin..@eventsChunkSizeMax) === eventsChunkSize.to_i + @eventsChunkSize = eventsChunkSize.to_i + puts "Using config map value: EVENTS_CHUNK_SIZE = #{@eventsChunkSize}" + end + + deploymentsChunkSize = chunk_config[:DEPLOYMENTS_CHUNK_SIZE] + if !deploymentsChunkSize.nil? && is_number?(deploymentsChunkSize) && (@deploymentsChunkSizeMin..@deploymentsChunkSizeMax) === deploymentsChunkSize.to_i + @deploymentsChunkSize = deploymentsChunkSize.to_i + puts "Using config map value: DEPLOYMENTS_CHUNK_SIZE = #{@deploymentsChunkSize}" + end + + hpaChunkSize = chunk_config[:HPA_CHUNK_SIZE] + if !hpaChunkSize.nil? && is_number?(hpaChunkSize) && (@hpaChunkSizeMin..@hpaChunkSizeMax) === hpaChunkSize.to_i + @hpaChunkSize = hpaChunkSize.to_i + puts "Using config map value: HPA_CHUNK_SIZE = #{@hpaChunkSize}" + end + + podsEmitStreamBatchSize = chunk_config[:PODS_EMIT_STREAM_BATCH_SIZE] + if !podsEmitStreamBatchSize.nil? && is_number?(podsEmitStreamBatchSize) && + podsEmitStreamBatchSize.to_i <= @podsChunkSize && podsEmitStreamBatchSize.to_i >= @podsEmitStreamBatchSizeMin + @podsEmitStreamBatchSize = podsEmitStreamBatchSize.to_i + puts "Using config map value: PODS_EMIT_STREAM_BATCH_SIZE = #{@podsEmitStreamBatchSize}" + end + nodesEmitStreamBatchSize = chunk_config[:NODES_EMIT_STREAM_BATCH_SIZE] + if !nodesEmitStreamBatchSize.nil? && is_number?(nodesEmitStreamBatchSize) && + nodesEmitStreamBatchSize.to_i <= @nodesChunkSize && nodesEmitStreamBatchSize.to_i >= @nodesEmitStreamBatchSizeMin + @nodesEmitStreamBatchSize = nodesEmitStreamBatchSize.to_i + puts "Using config map value: NODES_EMIT_STREAM_BATCH_SIZE = #{@nodesEmitStreamBatchSize}" + end + end + # fbit config settings + fbit_config = parsedConfig[:agent_settings][:fbit_config] + if !fbit_config.nil? + fbitFlushIntervalSecs = fbit_config[:log_flush_interval_secs] + if !fbitFlushIntervalSecs.nil? && is_number?(fbitFlushIntervalSecs) && fbitFlushIntervalSecs.to_i > 0 + @fbitFlushIntervalSecs = fbitFlushIntervalSecs.to_i + puts "Using config map value: log_flush_interval_secs = #{@fbitFlushIntervalSecs}" + end + + fbitTailBufferChunkSizeMBs = fbit_config[:tail_buf_chunksize_megabytes] + if !fbitTailBufferChunkSizeMBs.nil? && is_number?(fbitTailBufferChunkSizeMBs) && fbitTailBufferChunkSizeMBs.to_i > 0 + @fbitTailBufferChunkSizeMBs = fbitTailBufferChunkSizeMBs.to_i + puts "Using config map value: tail_buf_chunksize_megabytes = #{@fbitTailBufferChunkSizeMBs}" + end + + fbitTailBufferMaxSizeMBs = fbit_config[:tail_buf_maxsize_megabytes] + if !fbitTailBufferMaxSizeMBs.nil? 
&& is_number?(fbitTailBufferMaxSizeMBs) && fbitTailBufferMaxSizeMBs.to_i > 0 + if fbitTailBufferMaxSizeMBs.to_i >= @fbitTailBufferChunkSizeMBs + @fbitTailBufferMaxSizeMBs = fbitTailBufferMaxSizeMBs.to_i + puts "Using config map value: tail_buf_maxsize_megabytes = #{@fbitTailBufferMaxSizeMBs}" + else + # tail_buf_maxsize_megabytes has to be greater or equal to tail_buf_chunksize_megabytes + @fbitTailBufferMaxSizeMBs = @fbitTailBufferChunkSizeMBs + puts "config::warn: tail_buf_maxsize_megabytes must be greater or equal to value of tail_buf_chunksize_megabytes. Using tail_buf_maxsize_megabytes = #{@fbitTailBufferMaxSizeMBs} since provided config value not valid" + end + end + # in scenario - tail_buf_chunksize_megabytes provided but not tail_buf_maxsize_megabytes to prevent fbit crash + if @fbitTailBufferChunkSizeMBs > 0 && @fbitTailBufferMaxSizeMBs == 0 + @fbitTailBufferMaxSizeMBs = @fbitTailBufferChunkSizeMBs + puts "config::warn: since tail_buf_maxsize_megabytes not provided hence using tail_buf_maxsize_megabytes=#{@fbitTailBufferMaxSizeMBs} which is same as the value of tail_buf_chunksize_megabytes" + end + end + end + rescue => errorStr + puts "config::error:Exception while reading config settings for agent configuration setting - #{errorStr}, using defaults" + @enable_health_model = false + end +end + +@configSchemaVersion = ENV["AZMON_AGENT_CFG_SCHEMA_VERSION"] +puts "****************Start Config Processing********************" +if !@configSchemaVersion.nil? && !@configSchemaVersion.empty? && @configSchemaVersion.strip.casecmp("v1") == 0 #note v1 is the only supported schema version , so hardcoding it + configMapSettings = parseConfigMap + if !configMapSettings.nil? + populateSettingValuesFromConfigMap(configMapSettings) + end +else + if (File.file?(@configMapMountPath)) + ConfigParseErrorLogger.logError("config::unsupported/missing config schema version - '#{@configSchemaVersion}' , using defaults, please use supported schema version") + end + @enable_health_model = false +end + +# Write the settings to file, so that they can be set as environment variables +file = File.open("agent_config_env_var", "w") + +if !file.nil? 
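+  # Each setting is written as an export line; main.sh sources the generated agent_config_env_var file and appends the values to ~/.bashrc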
+ file.write("export AZMON_CLUSTER_ENABLE_HEALTH_MODEL=#{@enable_health_model}\n") + file.write("export NODES_CHUNK_SIZE=#{@nodesChunkSize}\n") + file.write("export PODS_CHUNK_SIZE=#{@podsChunkSize}\n") + file.write("export EVENTS_CHUNK_SIZE=#{@eventsChunkSize}\n") + file.write("export DEPLOYMENTS_CHUNK_SIZE=#{@deploymentsChunkSize}\n") + file.write("export HPA_CHUNK_SIZE=#{@hpaChunkSize}\n") + file.write("export PODS_EMIT_STREAM_BATCH_SIZE=#{@podsEmitStreamBatchSize}\n") + file.write("export NODES_EMIT_STREAM_BATCH_SIZE=#{@nodesEmitStreamBatchSize}\n") + # fbit settings + if @fbitFlushIntervalSecs > 0 + file.write("export FBIT_SERVICE_FLUSH_INTERVAL=#{@fbitFlushIntervalSecs}\n") + end + if @fbitTailBufferChunkSizeMBs > 0 + file.write("export FBIT_TAIL_BUFFER_CHUNK_SIZE=#{@fbitTailBufferChunkSizeMBs}\n") + end + if @fbitTailBufferMaxSizeMBs > 0 + file.write("export FBIT_TAIL_BUFFER_MAX_SIZE=#{@fbitTailBufferMaxSizeMBs}\n") + end + # Close file after writing all environment variables + file.close +else + puts "Exception while opening file for writing config environment variables" + puts "****************End Config Processing********************" +end diff --git a/build/linux/installer/scripts/tomlparser-health-config.rb b/build/linux/installer/scripts/tomlparser-health-config.rb deleted file mode 100644 index 14c8bdb44..000000000 --- a/build/linux/installer/scripts/tomlparser-health-config.rb +++ /dev/null @@ -1,73 +0,0 @@ -#!/usr/local/bin/ruby - -#this should be require relative in Linux and require in windows, since it is a gem install on windows -@os_type = ENV["OS_TYPE"] -if !@os_type.nil? && !@os_type.empty? && @os_type.strip.casecmp("windows") == 0 - require "tomlrb" -else - require_relative "tomlrb" -end - -require_relative "ConfigParseErrorLogger" - -@configMapMountPath = "/etc/config/settings/agent-settings" -@configSchemaVersion = "" -@enable_health_model = false - -# Use parser to parse the configmap toml file to a ruby structure -def parseConfigMap - begin - # Check to see if config map is created - if (File.file?(@configMapMountPath)) - puts "config::configmap container-azm-ms-agentconfig for agent health settings mounted, parsing values" - parsedConfig = Tomlrb.load_file(@configMapMountPath, symbolize_keys: true) - puts "config::Successfully parsed mounted config map" - return parsedConfig - else - puts "config::configmap container-azm-ms-agentconfig for agent health settings not mounted, using defaults" - return nil - end - rescue => errorStr - ConfigParseErrorLogger.logError("Exception while parsing config map for enabling health: #{errorStr}, using defaults, please check config map for errors") - return nil - end -end - -# Use the ruby structure created after config parsing to set the right values to be used as environment variables -def populateSettingValuesFromConfigMap(parsedConfig) - begin - if !parsedConfig.nil? && !parsedConfig[:agent_settings].nil? && !parsedConfig[:agent_settings][:health_model].nil? && !parsedConfig[:agent_settings][:health_model][:enabled].nil? - @enable_health_model = parsedConfig[:agent_settings][:health_model][:enabled] - puts "enable_health_model = #{@enable_health_model}" - end - rescue => errorStr - puts "config::error:Exception while reading config settings for health_model enabled setting - #{errorStr}, using defaults" - @enable_health_model = false - end -end - -@configSchemaVersion = ENV["AZMON_AGENT_CFG_SCHEMA_VERSION"] -puts "****************Start Config Processing********************" -if !@configSchemaVersion.nil? 
&& !@configSchemaVersion.empty? && @configSchemaVersion.strip.casecmp("v1") == 0 #note v1 is the only supported schema version , so hardcoding it - configMapSettings = parseConfigMap - if !configMapSettings.nil? - populateSettingValuesFromConfigMap(configMapSettings) - end -else - if (File.file?(@configMapMountPath)) - ConfigParseErrorLogger.logError("config::unsupported/missing config schema version - '#{@configSchemaVersion}' , using defaults, please use supported schema version") - end - @enable_health_model = false -end - -# Write the settings to file, so that they can be set as environment variables -file = File.open("health_config_env_var", "w") - -if !file.nil? - file.write("export AZMON_CLUSTER_ENABLE_HEALTH_MODEL=#{@enable_health_model}\n") - # Close file after writing all environment variables - file.close -else - puts "Exception while opening file for writing config environment variables" - puts "****************End Config Processing********************" -end \ No newline at end of file diff --git a/build/version b/build/version index a8b78ecac..711a96921 100644 --- a/build/version +++ b/build/version @@ -2,11 +2,11 @@ # Build Version Information -CONTAINER_BUILDVERSION_MAJOR=11 +CONTAINER_BUILDVERSION_MAJOR=12 CONTAINER_BUILDVERSION_MINOR=0 CONTAINER_BUILDVERSION_PATCH=0 -CONTAINER_BUILDVERSION_BUILDNR=1 -CONTAINER_BUILDVERSION_DATE=20201109 +CONTAINER_BUILDVERSION_BUILDNR=0 +CONTAINER_BUILDVERSION_DATE=20210111 CONTAINER_BUILDVERSION_STATUS=Developer_Build #-------------------------------- End of File ----------------------------------- diff --git a/build/windows/installer/certificategenerator/Program.cs b/build/windows/installer/certificategenerator/Program.cs index 43063c4be..e24d0e303 100644 --- a/build/windows/installer/certificategenerator/Program.cs +++ b/build/windows/installer/certificategenerator/Program.cs @@ -414,14 +414,12 @@ static void Main(string[] args) try { - if (!String.IsNullOrEmpty(Environment.GetEnvironmentVariable("WSKEY"))) - { - logAnalyticsWorkspaceSharedKey = Environment.GetEnvironmentVariable("WSKEY"); - } + // WSKEY isn't stored as an environment variable + logAnalyticsWorkspaceSharedKey = File.ReadAllText("C:/etc/omsagent-secret/KEY").Trim(); } catch (Exception ex) { - Console.WriteLine("Failed to read env variables (WSKEY)" + ex.Message); + Console.WriteLine("Failed to read secret (WSKEY)" + ex.Message); } try diff --git a/build/windows/installer/conf/fluent.conf b/build/windows/installer/conf/fluent.conf index c96300b1e..d5eb475ca 100644 --- a/build/windows/installer/conf/fluent.conf +++ b/build/windows/installer/conf/fluent.conf @@ -6,7 +6,8 @@ @type tail - path /var/log/containers/*.log + path "#{ENV['AZMON_LOG_TAIL_PATH']}" + exclude_path "#{ENV['AZMON_CLUSTER_LOG_TAIL_EXCLUDE_PATH']}" pos_file /var/opt/microsoft/fluent/fluentd-containers.log.pos tag oms.container.log.la @log_level trace @@ -28,6 +29,14 @@ @include fluent-docker-parser.conf + + @type grep + + key stream + pattern "#{ENV['AZMON_LOG_EXCLUSION_REGEX_PATTERN']}" + + + @type record_transformer # fluent-plugin-record-modifier more light-weight but needs to be installed (dependency worth it?) 
@@ -37,7 +46,6 @@ - @type forward send_timeout 60s diff --git a/build/windows/installer/scripts/livenessprobe.cmd b/build/windows/installer/scripts/livenessprobe.cmd index 06d577f31..19d0b69d7 100644 --- a/build/windows/installer/scripts/livenessprobe.cmd +++ b/build/windows/installer/scripts/livenessprobe.cmd @@ -1,40 +1,32 @@ -echo "Checking if fluent-bit is running" +REM "Checking if fluent-bit is running" tasklist /fi "imagename eq fluent-bit.exe" /fo "table" | findstr fluent-bit IF ERRORLEVEL 1 ( - echo "Fluent-Bit is not running" > /dev/termination-log + echo "Fluent-Bit is not running" exit /b 1 -) ELSE ( - echo "Fluent-Bit is running" ) -echo "Checking if config map has been updated since agent start" +REM "Checking if config map has been updated since agent start" IF EXIST C:\etc\omsagentwindows\filesystemwatcher.txt ( - echo "Config Map Updated since agent started" > /dev/termination-log + echo "Config Map Updated since agent started" exit /b 1 -) ELSE ( - echo "Config Map not Updated since agent start" ) -echo "Checking if certificate needs to be renewed (aka agent restart required)" +REM "Checking if certificate needs to be renewed (aka agent restart required)" IF EXIST C:\etc\omsagentwindows\renewcertificate.txt ( - echo "Certificate needs to be renewed" > /dev/termination-log + echo "Certificate needs to be renewed" exit /b 1 -) ELSE ( - echo "Certificate does NOT need to be renewd" ) -echo "Checking if fluentd service is running" +REM "Checking if fluentd service is running" sc query fluentdwinaks | findstr /i STATE | findstr RUNNING IF ERRORLEVEL 1 ( - echo "Fluentd Service is NOT Running" > /dev/termination-log + echo "Fluentd Service is NOT Running" exit /b 1 -) ELSE ( - echo "Fluentd Service is Running" ) exit /b 0 diff --git a/charts/azuremonitor-containers/Chart.yaml b/charts/azuremonitor-containers/Chart.yaml index 987841f77..a809a4e69 100644 --- a/charts/azuremonitor-containers/Chart.yaml +++ b/charts/azuremonitor-containers/Chart.yaml @@ -2,7 +2,7 @@ apiVersion: v1 appVersion: 7.0.0-1 description: Helm chart for deploying Azure Monitor container monitoring agent in Kubernetes name: azuremonitor-containers -version: 2.7.9 +version: 2.8.0 kubeVersion: "^1.10.0-0" keywords: - monitoring diff --git a/charts/azuremonitor-containers/templates/omsagent-daemonset-windows.yaml b/charts/azuremonitor-containers/templates/omsagent-daemonset-windows.yaml index 6a309c121..81003c704 100644 --- a/charts/azuremonitor-containers/templates/omsagent-daemonset-windows.yaml +++ b/charts/azuremonitor-containers/templates/omsagent-daemonset-windows.yaml @@ -27,6 +27,10 @@ spec: checksum/secret: {{ include (print $.Template.BasePath "/omsagent-secret.yaml") . 
| sha256sum }} checksum/config: {{ toYaml .Values.omsagent.resources | sha256sum }} spec: + dnsConfig: + options: + - name: ndots + value: "3" {{- if semverCompare ">=1.14-0" .Capabilities.KubeVersion.GitVersion }} nodeSelector: kubernetes.io/os: windows diff --git a/charts/azuremonitor-containers/templates/omsagent-daemonset.yaml b/charts/azuremonitor-containers/templates/omsagent-daemonset.yaml index d57c4d82b..3d29ede42 100644 --- a/charts/azuremonitor-containers/templates/omsagent-daemonset.yaml +++ b/charts/azuremonitor-containers/templates/omsagent-daemonset.yaml @@ -28,6 +28,10 @@ spec: checksum/config: {{ toYaml .Values.omsagent.resources | sha256sum }} checksum/logsettings: {{ toYaml .Values.omsagent.logsettings | sha256sum }} spec: + dnsConfig: + options: + - name: ndots + value: "3" {{- if .Values.omsagent.rbac }} serviceAccountName: omsagent {{- end }} diff --git a/charts/azuremonitor-containers/templates/omsagent-rbac.yaml b/charts/azuremonitor-containers/templates/omsagent-rbac.yaml index 4f7408e7c..bd4e9baf3 100644 --- a/charts/azuremonitor-containers/templates/omsagent-rbac.yaml +++ b/charts/azuremonitor-containers/templates/omsagent-rbac.yaml @@ -19,7 +19,7 @@ metadata: heritage: {{ .Release.Service }} rules: - apiGroups: [""] - resources: ["pods", "events", "nodes", "nodes/stats", "nodes/metrics", "nodes/spec", "nodes/proxy", "namespaces", "services"] + resources: ["pods", "events", "nodes", "nodes/stats", "nodes/metrics", "nodes/spec", "nodes/proxy", "namespaces", "services", "persistentvolumes"] verbs: ["list", "get", "watch"] - apiGroups: ["apps", "extensions", "autoscaling"] resources: ["replicasets", "deployments", "horizontalpodautoscalers"] diff --git a/charts/azuremonitor-containers/templates/omsagent-rs-configmap.yaml b/charts/azuremonitor-containers/templates/omsagent-rs-configmap.yaml index 475b17a46..fc7c471f8 100644 --- a/charts/azuremonitor-containers/templates/omsagent-rs-configmap.yaml +++ b/charts/azuremonitor-containers/templates/omsagent-rs-configmap.yaml @@ -18,7 +18,14 @@ data: tag oms.containerinsights.KubePodInventory run_interval 60 log_level debug - custom_metrics_azure_regions eastus,southcentralus,westcentralus,westus2,southeastasia,northeurope,westeurope,southafricanorth,centralus,northcentralus,eastus2,koreacentral,eastasia,centralindia,uksouth,canadacentral,francecentral,japaneast,australiaeast,eastus2,westus,australiasoutheast,brazilsouth,germanywestcentral,northcentralus,switzerlandnorth + + + #Kubernetes Persistent Volume inventory + + type kubepvinventory + tag oms.containerinsights.KubePVInventory + run_interval 60 + log_level debug #Kubernetes events @@ -70,14 +77,12 @@ data: type filter_inventory2mdm - custom_metrics_azure_regions eastus,southcentralus,westcentralus,westus2,southeastasia,northeurope,westeurope,southafricanorth,centralus,northcentralus,eastus2,koreacentral,eastasia,centralindia,uksouth,canadacentral,francecentral,japaneast,australiaeast,eastus2,westus,australiasoutheast,brazilsouth,germanywestcentral,northcentralus,switzerlandnorth log_level info # custom_metrics_mdm filter plugin for perf data from windows nodes type filter_cadvisor2mdm - custom_metrics_azure_regions eastus,southcentralus,westcentralus,westus2,southeastasia,northeurope,westeurope,southafricanorth,centralus,northcentralus,eastus2,koreacentral,eastasia,centralindia,uksouth,canadacentral,francecentral,japaneast,australiaeast,eastus2,westus,australiasoutheast,brazilsouth,germanywestcentral,northcentralus,switzerlandnorth metrics_to_collect 
cpuUsageNanoCores,memoryWorkingSetBytes log_level info @@ -90,7 +95,7 @@ data: type out_oms log_level debug - num_threads 5 + num_threads 2 buffer_chunk_limit 4m buffer_type file buffer_path %STATE_DIR_WS%/out_oms_kubepods*.buffer @@ -102,12 +107,27 @@ data: max_retry_wait 5m - + type out_oms log_level debug num_threads 5 buffer_chunk_limit 4m buffer_type file + buffer_path %STATE_DIR_WS%/state/out_oms_kubepv*.buffer + buffer_queue_limit 20 + buffer_queue_full_action drop_oldest_chunk + flush_interval 20s + retry_limit 10 + retry_wait 5s + max_retry_wait 5m + + + + type out_oms + log_level debug + num_threads 2 + buffer_chunk_limit 4m + buffer_type file buffer_path %STATE_DIR_WS%/out_oms_kubeevents*.buffer buffer_queue_limit 20 buffer_queue_full_action drop_oldest_chunk @@ -135,7 +155,7 @@ data: type out_oms log_level debug - num_threads 5 + num_threads 2 buffer_chunk_limit 4m buffer_type file buffer_path %STATE_DIR_WS%/state/out_oms_kubenodes*.buffer @@ -164,7 +184,7 @@ data: type out_oms log_level debug - num_threads 5 + num_threads 2 buffer_chunk_limit 4m buffer_type file buffer_path %STATE_DIR_WS%/out_oms_kubeperf*.buffer diff --git a/charts/azuremonitor-containers/values.yaml b/charts/azuremonitor-containers/values.yaml index 76ea0a26d..52e80dff8 100644 --- a/charts/azuremonitor-containers/values.yaml +++ b/charts/azuremonitor-containers/values.yaml @@ -12,10 +12,10 @@ Azure: omsagent: image: repo: "mcr.microsoft.com/azuremonitor/containerinsights/ciprod" - tag: "ciprod11092020" - tagWindows: "win-ciprod11092020" + tag: "ciprod01112021" + tagWindows: "win-ciprod01112021" pullPolicy: IfNotPresent - dockerProviderVersion: "11.0.0-1" + dockerProviderVersion: "12.0.0-0" agentVersion: "1.10.0.1" ## To get your workspace id and key do the following ## You can create a Azure Loganalytics workspace from portal.azure.com and get its ID & PRIMARY KEY from 'Advanced Settings' tab in the Ux. 
@@ -56,6 +56,21 @@ omsagent: affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - labelSelector: + matchExpressions: + - key: kubernetes.io/os + operator: In + values: + - linux + - key: type + operator: NotIn + values: + - virtual-kubelet + - key: kubernetes.io/arch + operator: In + values: + - amd64 nodeSelectorTerms: - labelSelector: matchExpressions: @@ -78,9 +93,22 @@ omsagent: operator: NotIn values: - virtual-kubelet + - key: beta.kubernetes.io/arch + operator: In + values: + - amd64 deployment: affinity: nodeAffinity: + # affinity to schedule on to ephemeral os node if it's available + preferredDuringSchedulingIgnoredDuringExecution: + - weight: 1 + preference: + matchExpressions: + - key: storageprofile + operator: NotIn + values: + - managed requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - labelSelector: @@ -97,6 +125,10 @@ omsagent: operator: NotIn values: - master + - key: kubernetes.io/arch + operator: In + values: + - amd64 nodeSelectorTerms: - labelSelector: matchExpressions: @@ -112,6 +144,10 @@ omsagent: operator: NotIn values: - master + - key: beta.kubernetes.io/arch + operator: In + values: + - amd64 ## Configure resource requests and limits ## ref: http://kubernetes.io/docs/user-guide/compute-resources/ ## @@ -133,4 +169,4 @@ omsagent: memory: 250Mi limits: cpu: 1 - memory: 750Mi + memory: 1Gi
diff --git a/kubernetes/linux/Dockerfile b/kubernetes/linux/Dockerfile index d04e86128..2e1118922 100644 --- a/kubernetes/linux/Dockerfile +++ b/kubernetes/linux/Dockerfile @@ -2,7 +2,7 @@ FROM ubuntu:18.04 MAINTAINER OMSContainers@microsoft.com LABEL vendor=Microsoft\ Corp \ com.microsoft.product="Azure Monitor for containers" -ARG IMAGE_TAG=ciprod11092020 +ARG IMAGE_TAG=ciprod01112021 ENV AGENT_VERSION ${IMAGE_TAG} ENV tmpdir /opt ENV APPLICATIONINSIGHTS_AUTH NzAwZGM5OGYtYTdhZC00NThkLWI5NWMtMjA3ZjM3NmM3YmRi @@ -15,6 +15,7 @@ ENV HOST_VAR /hostfs/var ENV AZMON_COLLECT_ENV False ENV KUBE_CLIENT_BACKOFF_BASE 1 ENV KUBE_CLIENT_BACKOFF_DURATION 0 +ENV RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR 0.9 RUN /usr/bin/apt-get update && /usr/bin/apt-get install -y libc-bin wget openssl curl sudo python-ctypes init-system-helpers net-tools rsyslog cron vim dmidecode apt-transport-https gnupg && rm -rf /var/lib/apt/lists/* COPY setup.sh main.sh defaultpromenvvariables defaultpromenvvariables-rs mdsd.xml envmdsd $tmpdir/ WORKDIR ${tmpdir}
diff --git a/kubernetes/linux/main.sh b/kubernetes/linux/main.sh index b093eb74b..2f88c07ac 100644 --- a/kubernetes/linux/main.sh +++ b/kubernetes/linux/main.sh @@ -150,6 +150,17 @@ else echo "LA Onboarding:Workspace Id not mounted, skipping the telemetry check" fi +# Set the cloud environment variable by checking the workspace domain. +if [ -z $domain ]; then + CLOUD_ENVIRONMENT="unknown" +elif [ $domain == "opinsights.azure.com" ]; then + CLOUD_ENVIRONMENT="public" +else + CLOUD_ENVIRONMENT="national" +fi +export CLOUD_ENVIRONMENT=$CLOUD_ENVIRONMENT +echo "export CLOUD_ENVIRONMENT=$CLOUD_ENVIRONMENT" >> ~/.bashrc + #Parse the configmap to set the right environment variables. /opt/microsoft/omsagent/ruby/bin/ruby tomlparser.rb @@ -160,14 +171,24 @@ done source config_env_var -#Parse the configmap to set the right environment variables for health feature. -/opt/microsoft/omsagent/ruby/bin/ruby tomlparser-health-config.rb +#Parse the configmap to set the right environment variables for agent config.
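+#tomlparser-agent-config.rb writes its settings to agent_config_env_var, which is sourced below and appended to ~/.bashrc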
+#Note > tomlparser-agent-config.rb has to be parsed first before td-agent-bit-conf-customizer.rb for fbit agent settings +/opt/microsoft/omsagent/ruby/bin/ruby tomlparser-agent-config.rb + +cat agent_config_env_var | while read line; do + #echo $line + echo $line >> ~/.bashrc +done +source agent_config_env_var + +#Parse the configmap to set the right environment variables for network policy manager (npm) integration. +/opt/microsoft/omsagent/ruby/bin/ruby tomlparser-npm-config.rb -cat health_config_env_var | while read line; do +cat integration_npm_config_env_var | while read line; do #echo $line echo $line >> ~/.bashrc done -source health_config_env_var +source integration_npm_config_env_var #Parse the configmap to set the right environment variables for network policy manager (npm) integration. /opt/microsoft/omsagent/ruby/bin/ruby tomlparser-npm-config.rb @@ -418,7 +439,7 @@ echo "export DOCKER_CIMPROV_VERSION=$DOCKER_CIMPROV_VERSION" >> ~/.bashrc #region check to auto-activate oneagent, to route container logs, #Intent is to activate one agent routing for all managed clusters with region in the regionllist, unless overridden by configmap -# AZMON_CONTAINER_LOGS_ROUTE will have route (if any) specified in the config map +# AZMON_CONTAINER_LOGS_ROUTE will have route (if any) specified in the config map # AZMON_CONTAINER_LOGS_EFFECTIVE_ROUTE will have the final route that we compute & set, based on our region list logic echo "************start oneagent log routing checks************" # by default, use configmap route for safer side @@ -451,9 +472,9 @@ else echo "current region is not in oneagent regions..." fi -if [ "$isoneagentregion" = true ]; then +if [ "$isoneagentregion" = true ]; then #if configmap has a routing for logs, but current region is in the oneagent region list, take the configmap route - if [ ! -z $AZMON_CONTAINER_LOGS_ROUTE ]; then + if [ ! -z $AZMON_CONTAINER_LOGS_ROUTE ]; then AZMON_CONTAINER_LOGS_EFFECTIVE_ROUTE=$AZMON_CONTAINER_LOGS_ROUTE echo "oneagent region is true for current region:$currentregion and config map logs route is not empty. so using config map logs route as effective route:$AZMON_CONTAINER_LOGS_EFFECTIVE_ROUTE" else #there is no configmap route, so route thru oneagent @@ -500,7 +521,6 @@ if [ ! -e "/etc/config/kube.conf" ]; then echo "starting mdsd ..." 
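The main.sh additions above derive CLOUD_ENVIRONMENT from the Log Analytics workspace domain and export it for the Ruby plugins; note that the empty-domain branch assigns ClOUD_ENVIRONMENT (capital L), so only the public/national branches actually populate the exported variable. A minimal Ruby restatement of the same mapping, for reference only (the function name is illustrative, not part of the agent):

    def cloud_environment_for(domain)
      # mirrors main.sh: empty domain -> unknown, public ingestion domain -> public,
      # any other domain (for example opinsights.azure.us) -> national
      return "unknown" if domain.to_s.empty?
      domain == "opinsights.azure.com" ? "public" : "national"
    end

    puts cloud_environment_for("opinsights.azure.com")  # => public
    puts cloud_environment_for("opinsights.azure.us")   # => national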
mdsd -l -e ${MDSD_LOG}/mdsd.err -w ${MDSD_LOG}/mdsd.warn -o ${MDSD_LOG}/mdsd.info -q ${MDSD_LOG}/mdsd.qos & - touch /opt/AZMON_CONTAINER_LOGS_EFFECTIVE_ROUTE_V2 fi fi diff --git a/kubernetes/linux/setup.sh b/kubernetes/linux/setup.sh index fb41d4782..352be06d7 100644 --- a/kubernetes/linux/setup.sh +++ b/kubernetes/linux/setup.sh @@ -71,7 +71,7 @@ chmod 777 /opt/telegraf wget -qO - https://packages.fluentbit.io/fluentbit.key | sudo apt-key add - sudo echo "deb https://packages.fluentbit.io/ubuntu/xenial xenial main" >> /etc/apt/sources.list sudo apt-get update -sudo apt-get install td-agent-bit=1.4.2 -y +sudo apt-get install td-agent-bit=1.6.8 -y rm -rf $TMPDIR/omsbundle rm -f $TMPDIR/omsagent*.sh diff --git a/kubernetes/omsagent.yaml b/kubernetes/omsagent.yaml index 7d07eafcd..67bd9cdde 100644 --- a/kubernetes/omsagent.yaml +++ b/kubernetes/omsagent.yaml @@ -21,6 +21,7 @@ rules: "nodes/proxy", "namespaces", "services", + "persistentvolumes" ] verbs: ["list", "get", "watch"] - apiGroups: ["apps", "extensions", "autoscaling"] @@ -64,7 +65,14 @@ data: tag oms.containerinsights.KubePodInventory run_interval 60 log_level debug - custom_metrics_azure_regions eastus,southcentralus,westcentralus,westus2,southeastasia,northeurope,westeurope,southafricanorth,centralus,northcentralus,eastus2,koreacentral,eastasia,centralindia,uksouth,canadacentral,francecentral,japaneast,australiaeast,eastus2,westus,australiasoutheast,brazilsouth,germanywestcentral,northcentralus,switzerlandnorth + + + #Kubernetes Persistent Volume inventory + + type kubepvinventory + tag oms.containerinsights.KubePVInventory + run_interval 60 + log_level debug #Kubernetes events @@ -117,14 +125,12 @@ data: type filter_inventory2mdm - custom_metrics_azure_regions eastus,southcentralus,westcentralus,westus2,southeastasia,northeurope,westeurope,southafricanorth,centralus,northcentralus,eastus2,koreacentral,eastasia,centralindia,uksouth,canadacentral,francecentral,japaneast,australiaeast,eastus2,westus,australiasoutheast,brazilsouth,germanywestcentral,northcentralus,switzerlandnorth log_level info #custom_metrics_mdm filter plugin for perf data from windows nodes type filter_cadvisor2mdm - custom_metrics_azure_regions eastus,southcentralus,westcentralus,westus2,southeastasia,northeurope,westeurope,southafricanorth,centralus,northcentralus,eastus2,koreacentral,eastasia,centralindia,uksouth,canadacentral,francecentral,japaneast,australiaeast,eastus2,westus,australiasoutheast,brazilsouth,germanywestcentral,northcentralus,switzerlandnorth metrics_to_collect cpuUsageNanoCores,memoryWorkingSetBytes,pvUsedBytes log_level info @@ -137,7 +143,7 @@ data: type out_oms log_level debug - num_threads 5 + num_threads 2 buffer_chunk_limit 4m buffer_type file buffer_path %STATE_DIR_WS%/out_oms_kubepods*.buffer @@ -149,10 +155,25 @@ data: max_retry_wait 5m + + type out_oms + log_level debug + num_threads 5 + buffer_chunk_limit 4m + buffer_type file + buffer_path %STATE_DIR_WS%/state/out_oms_kubepv*.buffer + buffer_queue_limit 20 + buffer_queue_full_action drop_oldest_chunk + flush_interval 20s + retry_limit 10 + retry_wait 5s + max_retry_wait 5m + + type out_oms log_level debug - num_threads 5 + num_threads 2 buffer_chunk_limit 4m buffer_type file buffer_path %STATE_DIR_WS%/out_oms_kubeevents*.buffer @@ -182,7 +203,7 @@ data: type out_oms log_level debug - num_threads 5 + num_threads 2 buffer_chunk_limit 4m buffer_type file buffer_path %STATE_DIR_WS%/state/out_oms_kubenodes*.buffer @@ -211,7 +232,7 @@ data: type out_oms log_level debug - num_threads 5 + 
num_threads 2 buffer_chunk_limit 4m buffer_type file buffer_path %STATE_DIR_WS%/out_oms_kubeperf*.buffer @@ -337,17 +358,21 @@ spec: tier: node annotations: agentVersion: "1.10.0.1" - dockerProviderVersion: "11.0.0-1" + dockerProviderVersion: "12.0.0-0" schema-versions: "v1" spec: serviceAccountName: omsagent + dnsConfig: + options: + - name: ndots + value: "3" containers: - name: omsagent - image: "mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod11092020" + image: "mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod01112021" imagePullPolicy: IfNotPresent resources: limits: - cpu: 250m + cpu: 500m memory: 600Mi requests: cpu: 75m @@ -496,23 +521,22 @@ spec: rsName: "omsagent-rs" annotations: agentVersion: "1.10.0.1" - dockerProviderVersion: "11.0.0-1" + dockerProviderVersion: "12.0.0-0" schema-versions: "v1" spec: serviceAccountName: omsagent containers: - name: omsagent - image: "mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod11092020" + image: "mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod01112021" imagePullPolicy: IfNotPresent resources: limits: cpu: 1 - memory: 750Mi + memory: 1Gi requests: cpu: 150m memory: 250Mi env: - # azure devops pipeline uses AKS_RESOURCE_ID and AKS_REGION hence ensure to uncomment these - name: AKS_RESOURCE_ID value: "VALUE_AKS_RESOURCE_ID_VALUE" - name: AKS_REGION @@ -567,6 +591,15 @@ spec: periodSeconds: 60 affinity: nodeAffinity: + # affinity to schedule on to ephemeral os node if its available + preferredDuringSchedulingIgnoredDuringExecution: + - weight: 1 + preference: + matchExpressions: + - key: storageprofile + operator: NotIn + values: + - managed requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms: - labelSelector: @@ -642,13 +675,17 @@ spec: tier: node-win annotations: agentVersion: "1.10.0.1" - dockerProviderVersion: "11.0.0-1" + dockerProviderVersion: "12.0.0-0" schema-versions: "v1" spec: serviceAccountName: omsagent + dnsConfig: + options: + - name: ndots + value: "3" containers: - name: omsagent-win - image: "mcr.microsoft.com/azuremonitor/containerinsights/ciprod:win-ciprod11092020" + image: "mcr.microsoft.com/azuremonitor/containerinsights/ciprod:win-ciprod01112021" imagePullPolicy: IfNotPresent resources: limits: diff --git a/kubernetes/windows/Dockerfile b/kubernetes/windows/Dockerfile index 10ea235b2..f852bd236 100644 --- a/kubernetes/windows/Dockerfile +++ b/kubernetes/windows/Dockerfile @@ -3,7 +3,7 @@ MAINTAINER OMSContainers@microsoft.com LABEL vendor=Microsoft\ Corp \ com.microsoft.product="Azure Monitor for containers" -ARG IMAGE_TAG=win-ciprod11092020 +ARG IMAGE_TAG=win-ciprod01112021 # Do not split this into multiple RUN! 
# Docker creates a layer for every RUN-Statement diff --git a/kubernetes/windows/main.ps1 b/kubernetes/windows/main.ps1 index 2e8659601..a297e3801 100644 --- a/kubernetes/windows/main.ps1 +++ b/kubernetes/windows/main.ps1 @@ -43,34 +43,32 @@ function Start-FileSystemWatcher { function Set-EnvironmentVariables { $domain = "opinsights.azure.com" + $cloud_environment = "public" if (Test-Path /etc/omsagent-secret/DOMAIN) { # TODO: Change to omsagent-secret before merging $domain = Get-Content /etc/omsagent-secret/DOMAIN + $cloud_environment = "national" } # Set DOMAIN [System.Environment]::SetEnvironmentVariable("DOMAIN", $domain, "Process") [System.Environment]::SetEnvironmentVariable("DOMAIN", $domain, "Machine") + # Set CLOUD_ENVIRONMENT + [System.Environment]::SetEnvironmentVariable("CLOUD_ENVIRONMENT", $cloud_environment, "Process") + [System.Environment]::SetEnvironmentVariable("CLOUD_ENVIRONMENT", $cloud_environment, "Machine") + $wsID = "" if (Test-Path /etc/omsagent-secret/WSID) { # TODO: Change to omsagent-secret before merging $wsID = Get-Content /etc/omsagent-secret/WSID } - # Set DOMAIN + # Set WSID [System.Environment]::SetEnvironmentVariable("WSID", $wsID, "Process") [System.Environment]::SetEnvironmentVariable("WSID", $wsID, "Machine") - $wsKey = "" - if (Test-Path /etc/omsagent-secret/KEY) { - # TODO: Change to omsagent-secret before merging - $wsKey = Get-Content /etc/omsagent-secret/KEY - } - - # Set KEY - [System.Environment]::SetEnvironmentVariable("WSKEY", $wsKey, "Process") - [System.Environment]::SetEnvironmentVariable("WSKEY", $wsKey, "Machine") + # Don't store WSKEY as environment variable $proxy = "" if (Test-Path /etc/omsagent-secret/PROXY) { diff --git a/scripts/onboarding/managed/disable-monitoring.ps1 b/scripts/onboarding/managed/disable-monitoring.ps1 index 1c011bfff..bcd135dba 100644 --- a/scripts/onboarding/managed/disable-monitoring.ps1 +++ b/scripts/onboarding/managed/disable-monitoring.ps1 @@ -15,6 +15,8 @@ tenantId of the service principal which will be used for the azure login .PARAMETER kubeContext (optional) kube-context of the k8 cluster to install Azure Monitor for containers HELM chart + .PARAMETER azureCloudName (optional) + Name of the Azure cloud name. 
Supported Azure cloud Name is AzureCloud or AzureUSGovernment Pre-requisites: - Azure Managed cluster Resource Id @@ -34,7 +36,9 @@ param( [Parameter(mandatory = $false)] [string]$tenantId, [Parameter(mandatory = $false)] - [string]$kubeContext + [string]$kubeContext, + [Parameter(mandatory = $false)] + [string]$azureCloudName ) $helmChartReleaseName = "azmon-containers-release-1" @@ -46,6 +50,21 @@ $isAksCluster = $false $isAroV4Cluster = $false $isUsingServicePrincipal = $false +if ([string]::IsNullOrEmpty($azureCloudName) -eq $true) { + Write-Host("Azure cloud name parameter not passed in so using default cloud as AzureCloud") + $azureCloudName = "AzureCloud" +} else { + if(($azureCloudName.ToLower() -eq "azurecloud" ) -eq $true) { + Write-Host("Specified Azure Cloud name is : $azureCloudName") + } elseif (($azureCloudName.ToLower() -eq "azureusgovernment" ) -eq $true) { + Write-Host("Specified Azure Cloud name is : $azureCloudName") + } else { + Write-Host("Specified Azure Cloud name is : $azureCloudName") + Write-Host("Only supported Azure clouds are : AzureCloud and AzureUSGovernment") + exit + } +} + # checks the required Powershell modules exist and if not exists, request the user permission to install $azAccountModule = Get-Module -ListAvailable -Name Az.Accounts $azResourcesModule = Get-Module -ListAvailable -Name Az.Resources @@ -226,14 +245,19 @@ Write-Host("Cluster SubscriptionId : '" + $clusterSubscriptionId + "' ") -Foregr if ($isUsingServicePrincipal) { $spSecret = ConvertTo-SecureString -String $servicePrincipalClientSecret -AsPlainText -Force $spCreds = New-Object -TypeName "System.Management.Automation.PSCredential" -ArgumentList $servicePrincipalClientId,$spSecret - Connect-AzAccount -ServicePrincipal -Credential $spCreds -Tenant $tenantId -Subscription $clusterSubscriptionId + Connect-AzAccount -ServicePrincipal -Credential $spCreds -Tenant $tenantId -Subscription $clusterSubscriptionId -Environment $azureCloudName } try { Write-Host("") Write-Host("Trying to get the current Az login context...") $account = Get-AzContext -ErrorAction Stop - Write-Host("Successfully fetched current AzContext context...") -ForegroundColor Green + $ctxCloud = $account.Environment.Name + if(($azureCloudName.ToLower() -eq $ctxCloud.ToLower() ) -eq $false) { + Write-Host("Specified azure cloud name is not same as current context cloud hence setting account to null to retrigger the login" ) -ForegroundColor Green + $account = $null + } + Write-Host("Successfully fetched current AzContext context and azure cloud name: $azureCloudName" ) -ForegroundColor Green Write-Host("") } catch { @@ -249,10 +273,10 @@ if ($null -eq $account.Account) { if ($isUsingServicePrincipal) { $spSecret = ConvertTo-SecureString -String $servicePrincipalClientSecret -AsPlainText -Force $spCreds = New-Object -TypeName "System.Management.Automation.PSCredential" -ArgumentList $servicePrincipalClientId,$spSecret - Connect-AzAccount -ServicePrincipal -Credential $spCreds -Tenant $tenantId -Subscription $clusterSubscriptionId + Connect-AzAccount -ServicePrincipal -Credential $spCreds -Tenant $tenantId -Subscription $clusterSubscriptionId -Environment $azureCloudName } else { Write-Host("Please login...") - Connect-AzAccount -subscriptionid $clusterSubscriptionId + Connect-AzAccount -subscriptionid $clusterSubscriptionId -Environment $azureCloudName } } catch { diff --git a/scripts/onboarding/managed/disable-monitoring.sh b/scripts/onboarding/managed/disable-monitoring.sh index c11426f30..d43a79f51 100644 --- 
a/scripts/onboarding/managed/disable-monitoring.sh +++ b/scripts/onboarding/managed/disable-monitoring.sh @@ -280,10 +280,27 @@ done } +validate_and_configure_supported_cloud() { + echo "get active azure cloud name configured to azure cli" + azureCloudName=$(az cloud show --query name -o tsv | tr "[:upper:]" "[:lower:]") + echo "active azure cloud name configured to azure cli: ${azureCloudName}" + if [ "$isArcK8sCluster" = true ]; then + if [ "$azureCloudName" != "azurecloud" -a "$azureCloudName" != "azureusgovernment" ]; then + echo "-e only supported clouds are AzureCloud and AzureUSGovernment for Azure Arc enabled Kubernetes cluster type" + exit 1 + fi + else + # For ARO v4, only supported cloud is public so just configure to public to keep the existing behavior + configure_to_public_cloud + fi +} # parse args parse_args $@ +# validate and configure azure cloud +validate_and_configure_supported_cloud + # parse cluster resource id clusterSubscriptionId="$(echo $clusterResourceId | cut -d'/' -f3 | tr "[:upper:]" "[:lower:]")" clusterResourceGroup="$(echo $clusterResourceId | cut -d'/' -f5)" diff --git a/scripts/onboarding/managed/enable-monitoring.ps1 b/scripts/onboarding/managed/enable-monitoring.ps1 index b052f22c5..5e9cb8a25 100644 --- a/scripts/onboarding/managed/enable-monitoring.ps1 +++ b/scripts/onboarding/managed/enable-monitoring.ps1 @@ -22,6 +22,8 @@ .PARAMETER proxyEndpoint (optional) Provide Proxy endpoint if you have K8s cluster behind the proxy and would like to route Azure Monitor for containers outbound traffic via proxy. Format of the proxy endpoint should be http(s://:@: + .PARAMETER azureCloudName (optional) + Name of the Azure cloud name. Supported Azure cloud Name is AzureCloud or AzureUSGovernment Pre-requisites: - Azure Managed cluster Resource Id @@ -44,9 +46,13 @@ param( [Parameter(mandatory = $false)] [string]$kubeContext, [Parameter(mandatory = $false)] - [string]$workspaceResourceId, + [string]$servicePrincipalClientSecret, + [Parameter(mandatory = $false)] + [string]$tenantId, + [Parameter(mandatory = $false)] + [string]$kubeContext, [Parameter(mandatory = $false)] - [string]$proxyEndpoint + [string]$azureCloudName ) $solutionTemplateUri = "https://raw.githubusercontent.com/microsoft/Docker-Provider/ci_dev/scripts/onboarding/templates/azuremonitor-containerSolution.json" @@ -60,9 +66,27 @@ $isUsingServicePrincipal = $false # released chart version in mcr $mcr = "mcr.microsoft.com" -$mcrChartVersion = "2.7.9" +$mcrChartVersion = "2.8.0" $mcrChartRepoPath = "azuremonitor/containerinsights/preview/azuremonitor-containers" $helmLocalRepoName = "." 
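In the enable-monitoring script changes that follow (PowerShell here, and the bash variant later), the chosen Azure cloud is mapped to the workspace ingestion domain before the helm parameters are built: AzureCloud keeps opinsights.azure.com, AzureUSGovernment switches to opinsights.azure.us, and any other value is rejected. A hedged Ruby restatement of that mapping (the helper name is illustrative):

    def oms_domain_for(azure_cloud_name)
      case azure_cloud_name.to_s.downcase
      when "", "azurecloud"       # default / public cloud
        "opinsights.azure.com"
      when "azureusgovernment"    # US Government cloud
        "opinsights.azure.us"
      else
        raise ArgumentError, "only AzureCloud and AzureUSGovernment are supported"
      end
    end

    puts oms_domain_for("AzureUSGovernment")  # => opinsights.azure.us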
+$omsAgentDomainName="opinsights.azure.com" + +if ([string]::IsNullOrEmpty($azureCloudName) -eq $true) { + Write-Host("Azure cloud name parameter not passed in so using default cloud as AzureCloud") + $azureCloudName = "AzureCloud" +} else { + if(($azureCloudName.ToLower() -eq "azurecloud" ) -eq $true) { + Write-Host("Specified Azure Cloud name is : $azureCloudName") + $omsAgentDomainName="opinsights.azure.com" + } elseif (($azureCloudName.ToLower() -eq "azureusgovernment" ) -eq $true) { + Write-Host("Specified Azure Cloud name is : $azureCloudName") + $omsAgentDomainName="opinsights.azure.us" + } else { + Write-Host("Specified Azure Cloud name is : $azureCloudName") + Write-Host("Only supported azure clouds are : AzureCloud and AzureUSGovernment") + exit + } +} # checks the required Powershell modules exist and if not exists, request the user permission to install $azAccountModule = Get-Module -ListAvailable -Name Az.Accounts @@ -244,14 +268,19 @@ Write-Host("Cluster SubscriptionId : '" + $clusterSubscriptionId + "' ") -Foregr if ($isUsingServicePrincipal) { $spSecret = ConvertTo-SecureString -String $servicePrincipalClientSecret -AsPlainText -Force $spCreds = New-Object -TypeName "System.Management.Automation.PSCredential" -ArgumentList $servicePrincipalClientId, $spSecret - Connect-AzAccount -ServicePrincipal -Credential $spCreds -Tenant $tenantId -Subscription $clusterSubscriptionId + Connect-AzAccount -ServicePrincipal -Credential $spCreds -Tenant $tenantId -Subscription $clusterSubscriptionId -Environment $azureCloudName } try { Write-Host("") Write-Host("Trying to get the current Az login context...") $account = Get-AzContext -ErrorAction Stop - Write-Host("Successfully fetched current AzContext context...") -ForegroundColor Green + $ctxCloud = $account.Environment.Name + if(($azureCloudName.ToLower() -eq $ctxCloud.ToLower() ) -eq $false) { + Write-Host("Specified azure cloud name is not same as current context cloud hence setting account to null to retrigger the login" ) -ForegroundColor Green + $account = $null + } + Write-Host("Successfully fetched current AzContext context and azure cloud name: $azureCloudName" ) -ForegroundColor Green Write-Host("") } catch { @@ -266,11 +295,12 @@ if ($null -eq $account.Account) { if ($isUsingServicePrincipal) { $spSecret = ConvertTo-SecureString -String $servicePrincipalClientSecret -AsPlainText -Force $spCreds = New-Object -TypeName "System.Management.Automation.PSCredential" -ArgumentList $servicePrincipalClientId, $spSecret - Connect-AzAccount -ServicePrincipal -Credential $spCreds -Tenant $tenantId -Subscription $clusterSubscriptionId + + Connect-AzAccount -ServicePrincipal -Credential $spCreds -Tenant $tenantId -Subscription $clusterSubscriptionId -Environment $azureCloudName } else { Write-Host("Please login...") - Connect-AzAccount -subscriptionid $clusterSubscriptionId + Connect-AzAccount -subscriptionid $clusterSubscriptionId -Environment $azureCloudName } } catch { @@ -380,7 +410,8 @@ if ([string]::IsNullOrEmpty($workspaceResourceId)) { "westeurope" = "westeurope" ; "westindia" = "centralindia" ; "westus" = "westus" ; - "westus2" = "westus2" + "westus2" = "westus2"; + "usgovvirginia" = "usgovvirginia" } $workspaceRegionCode = "EUS" @@ -531,7 +562,7 @@ try { Write-Host("helmChartRepoPath is : ${helmChartRepoPath}") - $helmParameters = "omsagent.secret.wsid=$workspaceGUID,omsagent.secret.key=$workspacePrimarySharedKey,omsagent.env.clusterId=$clusterResourceId,omsagent.env.clusterRegion=$clusterRegion" + $helmParameters = 
"omsagent.domain=$omsAgentDomainName,omsagent.secret.wsid=$workspaceGUID,omsagent.secret.key=$workspacePrimarySharedKey,omsagent.env.clusterId=$clusterResourceId,omsagent.env.clusterRegion=$clusterRegion" if ([string]::IsNullOrEmpty($proxyEndpoint) -eq $false) { Write-Host("using proxy endpoint since its provided") $helmParameters = $helmParameters + ",omsagent.proxy=$proxyEndpoint" diff --git a/scripts/onboarding/managed/enable-monitoring.sh b/scripts/onboarding/managed/enable-monitoring.sh index bb6974258..2dc0a465f 100644 --- a/scripts/onboarding/managed/enable-monitoring.sh +++ b/scripts/onboarding/managed/enable-monitoring.sh @@ -38,11 +38,13 @@ set -e set -o pipefail -# default to public cloud since only supported cloud is azure public clod +# default to public cloud since only supported cloud is azure public cloud defaultAzureCloud="AzureCloud" +# default domain will be for public cloud +omsAgentDomainName="opinsights.azure.com" # released chart version in mcr -mcrChartVersion="2.7.9" +mcrChartVersion="2.8.0" mcr="mcr.microsoft.com" mcrChartRepoPath="azuremonitor/containerinsights/preview/azuremonitor-containers" helmLocalRepoName="." @@ -307,6 +309,25 @@ parse_args() { } +validate_and_configure_supported_cloud() { + echo "get active azure cloud name configured to azure cli" + azureCloudName=$(az cloud show --query name -o tsv | tr "[:upper:]" "[:lower:]") + echo "active azure cloud name configured to azure cli: ${azureCloudName}" + if [ "$isArcK8sCluster" = true ]; then + if [ "$azureCloudName" != "azurecloud" -a "$azureCloudName" != "azureusgovernment" ]; then + echo "-e only supported clouds are AzureCloud and AzureUSGovernment for Azure Arc enabled Kubernetes cluster type" + exit 1 + fi + if [ "$azureCloudName" = "azureusgovernment" ]; then + echo "setting omsagent domain as opinsights.azure.us since the azure cloud is azureusgovernment " + omsAgentDomainName="opinsights.azure.us" + fi + else + # For ARO v4, only supported cloud is public so just configure to public to keep the existing behavior + configure_to_public_cloud + fi +} + configure_to_public_cloud() { echo "Set AzureCloud as active cloud for az cli" az cloud set -n $defaultAzureCloud @@ -398,8 +419,10 @@ create_default_log_analytics_workspace() { [westindia]=centralindia [westus]=westus [westus2]=westus2 + [usgovvirginia]=usgovvirginia ) + echo "cluster Region:"$clusterRegion if [ -n "${AzureCloudRegionToOmsRegionMap[$clusterRegion]}" ]; then workspaceRegion=${AzureCloudRegionToOmsRegionMap[$clusterRegion]} fi @@ -433,6 +456,7 @@ create_default_log_analytics_workspace() { workspaceResourceId=$(az resource show -g $workspaceResourceGroup -n $workspaceName --resource-type $workspaceResourceProvider --query id) workspaceResourceId=$(echo $workspaceResourceId | tr -d '"') + echo "workspace resource Id: ${workspaceResourceId}" } add_container_insights_solution() { @@ -504,18 +528,18 @@ install_helm_chart() { echo "using proxy endpoint since proxy configuration passed in" if [ -z "$kubeconfigContext" ]; then echo "using current kube-context since --kube-context/-k parameter not passed in" - helm upgrade --install $releaseName --set omsagent.proxy=$proxyEndpoint,omsagent.secret.wsid=$workspaceGuid,omsagent.secret.key=$workspaceKey,omsagent.env.clusterId=$clusterResourceId,omsagent.env.clusterRegion=$clusterRegion $helmChartRepoPath + helm upgrade --install $releaseName --set 
omsagent.domain=$omsAgentDomainName,omsagent.proxy=$proxyEndpoint,omsagent.secret.wsid=$workspaceGuid,omsagent.secret.key=$workspaceKey,omsagent.env.clusterId=$clusterResourceId,omsagent.env.clusterRegion=$clusterRegion $helmChartRepoPath else echo "using --kube-context:${kubeconfigContext} since passed in" - helm upgrade --install $releaseName --set omsagent.proxy=$proxyEndpoint,omsagent.secret.wsid=$workspaceGuid,omsagent.secret.key=$workspaceKey,omsagent.env.clusterId=$clusterResourceId,omsagent.env.clusterRegion=$clusterRegion $helmChartRepoPath --kube-context ${kubeconfigContext} + helm upgrade --install $releaseName --set omsagent.domain=$omsAgentDomainName,omsagent.proxy=$proxyEndpoint,omsagent.secret.wsid=$workspaceGuid,omsagent.secret.key=$workspaceKey,omsagent.env.clusterId=$clusterResourceId,omsagent.env.clusterRegion=$clusterRegion $helmChartRepoPath --kube-context ${kubeconfigContext} fi else if [ -z "$kubeconfigContext" ]; then echo "using current kube-context since --kube-context/-k parameter not passed in" - helm upgrade --install $releaseName --set omsagent.secret.wsid=$workspaceGuid,omsagent.secret.key=$workspaceKey,omsagent.env.clusterId=$clusterResourceId,omsagent.env.clusterRegion=$clusterRegion $helmChartRepoPath + helm upgrade --install $releaseName --set omsagent.domain=$omsAgentDomainName,omsagent.secret.wsid=$workspaceGuid,omsagent.secret.key=$workspaceKey,omsagent.env.clusterId=$clusterResourceId,omsagent.env.clusterRegion=$clusterRegion $helmChartRepoPath else echo "using --kube-context:${kubeconfigContext} since passed in" - helm upgrade --install $releaseName --set omsagent.secret.wsid=$workspaceGuid,omsagent.secret.key=$workspaceKey,omsagent.env.clusterId=$clusterResourceId,omsagent.env.clusterRegion=$clusterRegion $helmChartRepoPath --kube-context ${kubeconfigContext} + helm upgrade --install $releaseName --set omsagent.domain=$omsAgentDomainName,omsagent.secret.wsid=$workspaceGuid,omsagent.secret.key=$workspaceKey,omsagent.env.clusterId=$clusterResourceId,omsagent.env.clusterRegion=$clusterRegion $helmChartRepoPath --kube-context ${kubeconfigContext} fi fi @@ -560,8 +584,8 @@ enable_aks_monitoring_addon() { # parse and validate args parse_args $@ -# configure azure cli for public cloud -configure_to_public_cloud +# validate and configure azure cli for cloud +validate_and_configure_supported_cloud # parse cluster resource id clusterSubscriptionId="$(echo $clusterResourceId | cut -d'/' -f3 | tr "[:upper:]" "[:lower:]")" diff --git a/scripts/onboarding/managed/upgrade-monitoring.sh b/scripts/onboarding/managed/upgrade-monitoring.sh index 11ecf6819..8826b6df6 100644 --- a/scripts/onboarding/managed/upgrade-monitoring.sh +++ b/scripts/onboarding/managed/upgrade-monitoring.sh @@ -20,7 +20,7 @@ set -e set -o pipefail # released chart version for Azure Arc enabled Kubernetes public preview -mcrChartVersion="2.7.9" +mcrChartVersion="2.8.0" mcr="mcr.microsoft.com" mcrChartRepoPath="azuremonitor/containerinsights/preview/azuremonitor-containers" @@ -281,11 +281,26 @@ set_azure_subscription() { echo "successfully configured subscription id: ${subscriptionId} as current subscription for the azure cli" } +validate_and_configure_supported_cloud() { + echo "get active azure cloud name configured to azure cli" + azureCloudName=$(az cloud show --query name -o tsv | tr "[:upper:]" "[:lower:]") + echo "active azure cloud name configured to azure cli: ${azureCloudName}" + if [ "$isArcK8sCluster" = true ]; then + if [ "$azureCloudName" != "azurecloud" -a "$azureCloudName" != 
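Further below, CustomMetricsUtils.rb drops the custom_metrics_azure_regions parameter: custom-metrics availability is now decided from AKS_REGION, AKS_RESOURCE_ID and the new CLOUD_ENVIRONMENT variable alone. A hedged usage sketch (the environment values are illustrative placeholders):

    require_relative "CustomMetricsUtils"

    ENV["AKS_REGION"]        = "eastus"                                   # illustrative
    ENV["AKS_RESOURCE_ID"]   = "/subscriptions/.../managedClusters/demo"  # illustrative placeholder
    ENV["CLOUD_ENVIRONMENT"] = "public"

    # true only when both AKS identifiers are set and the cloud environment is public
    puts CustomMetricsUtils.check_custom_metrics_availability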
"azureusgovernment" ]; then + echo "-e only supported clouds are AzureCloud and AzureUSGovernment for Azure Arc enabled Kubernetes cluster type" + exit 1 + fi + else + # For ARO v4, only supported cloud is public so just configure to public to keep the existing behavior + configure_to_public_cloud + fi +} + # parse and validate args parse_args $@ -# configure azure cli for public cloud -configure_to_public_cloud +# configure azure cli for cloud +validate_and_configure_supported_cloud # parse cluster resource id clusterSubscriptionId="$(echo $clusterResourceId | cut -d'/' -f3 | tr "[:upper:]" "[:lower:]")" diff --git a/scripts/preview/health/omsagent-template-aks-engine.yaml b/scripts/preview/health/omsagent-template-aks-engine.yaml index 5526602c0..5e063fd54 100644 --- a/scripts/preview/health/omsagent-template-aks-engine.yaml +++ b/scripts/preview/health/omsagent-template-aks-engine.yaml @@ -108,14 +108,12 @@ data: type filter_inventory2mdm - custom_metrics_azure_regions eastus,southcentralus,westcentralus,westus2,southeastasia,northeurope,westEurope log_level info # custom_metrics_mdm filter plugin for perf data from windows nodes type filter_cadvisor2mdm - custom_metrics_azure_regions eastus,southcentralus,westcentralus,westus2,southeastasia,northeurope,westEurope metrics_to_collect cpuUsageNanoCores,memoryWorkingSetBytes log_level info diff --git a/scripts/preview/health/omsagent-template.yaml b/scripts/preview/health/omsagent-template.yaml index 6e3a52020..e58e9c33f 100644 --- a/scripts/preview/health/omsagent-template.yaml +++ b/scripts/preview/health/omsagent-template.yaml @@ -108,14 +108,12 @@ data: type filter_inventory2mdm - custom_metrics_azure_regions eastus,southcentralus,westcentralus,westus2,southeastasia,northeurope,westEurope log_level info # custom_metrics_mdm filter plugin for perf data from windows nodes type filter_cadvisor2mdm - custom_metrics_azure_regions eastus,southcentralus,westcentralus,westus2,southeastasia,northeurope,westEurope metrics_to_collect cpuUsageNanoCores,memoryWorkingSetBytes log_level info diff --git a/source/plugins/ruby/CustomMetricsUtils.rb b/source/plugins/ruby/CustomMetricsUtils.rb index a19580630..220313e6b 100644 --- a/source/plugins/ruby/CustomMetricsUtils.rb +++ b/source/plugins/ruby/CustomMetricsUtils.rb @@ -6,21 +6,15 @@ def initialize end class << self - def check_custom_metrics_availability(custom_metric_regions) + def check_custom_metrics_availability aks_region = ENV['AKS_REGION'] aks_resource_id = ENV['AKS_RESOURCE_ID'] + aks_cloud_environment = ENV['CLOUD_ENVIRONMENT'] if aks_region.to_s.empty? || aks_resource_id.to_s.empty? return false # This will also take care of AKS-Engine Scenario. AKS_REGION/AKS_RESOURCE_ID is not set for AKS-Engine. Only ACS_RESOURCE_NAME is set end - custom_metrics_regions_arr = custom_metric_regions.split(',') - custom_metrics_regions_hash = custom_metrics_regions_arr.map {|x| [x.downcase,true]}.to_h - - if custom_metrics_regions_hash.key?(aks_region.downcase) - true - else - false - end + return aks_cloud_environment.to_s.downcase == 'public' end end end \ No newline at end of file diff --git a/source/plugins/ruby/KubernetesApiClient.rb b/source/plugins/ruby/KubernetesApiClient.rb index 073eb0417..aca2142a0 100644 --- a/source/plugins/ruby/KubernetesApiClient.rb +++ b/source/plugins/ruby/KubernetesApiClient.rb @@ -172,6 +172,10 @@ def isAROV3Cluster return @@IsAROV3Cluster end + def isAROv3MasterOrInfraPod(nodeName) + return isAROV3Cluster() && (!nodeName.nil? 
&& (nodeName.downcase.start_with?("infra-") || nodeName.downcase.start_with?("master-"))) + end + def isNodeMaster return @@IsNodeMaster if !@@IsNodeMaster.nil? @@IsNodeMaster = false @@ -276,7 +280,8 @@ def getPods(namespace) def getWindowsNodes winNodes = [] begin - resourceUri = getNodesResourceUri("nodes") + # get only windows nodes + resourceUri = getNodesResourceUri("nodes?labelSelector=kubernetes.io%2Fos%3Dwindows") nodeInventory = JSON.parse(getKubeResourceInfo(resourceUri).body) @Log.info "KubernetesAPIClient::getWindowsNodes : Got nodes from kube api" # Resetting the windows node cache @@ -396,42 +401,67 @@ def getPodUid(podNameSpace, podMetadata) return podUid end - def getContainerResourceRequestsAndLimits(metricJSON, metricCategory, metricNameToCollect, metricNametoReturn, metricTime = Time.now.utc.iso8601) + def getContainerResourceRequestsAndLimits(pod, metricCategory, metricNameToCollect, metricNametoReturn, metricTime = Time.now.utc.iso8601) metricItems = [] begin clusterId = getClusterId - metricInfo = metricJSON - metricInfo["items"].each do |pod| - podNameSpace = pod["metadata"]["namespace"] - podUid = getPodUid(podNameSpace, pod["metadata"]) - if podUid.nil? - next - end - - # For ARO, skip the pods scheduled on to master or infra nodes to ingest - if isAROV3Cluster() && !pod["spec"].nil? && !pod["spec"]["nodeName"].nil? && - (pod["spec"]["nodeName"].downcase.start_with?("infra-") || - pod["spec"]["nodeName"].downcase.start_with?("master-")) - next - end + podNameSpace = pod["metadata"]["namespace"] + podUid = getPodUid(podNameSpace, pod["metadata"]) + if podUid.nil? + return metricItems + end - podContainers = [] - if !pod["spec"]["containers"].nil? && !pod["spec"]["containers"].empty? - podContainers = podContainers + pod["spec"]["containers"] - end - # Adding init containers to the record list as well. - if !pod["spec"]["initContainers"].nil? && !pod["spec"]["initContainers"].empty? - podContainers = podContainers + pod["spec"]["initContainers"] - end + nodeName = "" + #for unscheduled (non-started) pods nodeName does NOT exist + if !pod["spec"]["nodeName"].nil? + nodeName = pod["spec"]["nodeName"] + end + # For ARO, skip the pods scheduled on to master or infra nodes to ingest + if isAROv3MasterOrInfraPod(nodeName) + return metricItems + end - if (!podContainers.nil? && !podContainers.empty? && !pod["spec"]["nodeName"].nil?) - nodeName = pod["spec"]["nodeName"] - podContainers.each do |container| - containerName = container["name"] - #metricTime = Time.now.utc.iso8601 #2018-01-30T19:36:14Z - if (!container["resources"].nil? && !container["resources"].empty? && !container["resources"][metricCategory].nil? && !container["resources"][metricCategory][metricNameToCollect].nil?) - metricValue = getMetricNumericValue(metricNameToCollect, container["resources"][metricCategory][metricNameToCollect]) + podContainers = [] + if !pod["spec"]["containers"].nil? && !pod["spec"]["containers"].empty? + podContainers = podContainers + pod["spec"]["containers"] + end + # Adding init containers to the record list as well. + if !pod["spec"]["initContainers"].nil? && !pod["spec"]["initContainers"].empty? + podContainers = podContainers + pod["spec"]["initContainers"] + end + if (!podContainers.nil? && !podContainers.empty? && !pod["spec"]["nodeName"].nil?) + podContainers.each do |container| + containerName = container["name"] + #metricTime = Time.now.utc.iso8601 #2018-01-30T19:36:14Z + if (!container["resources"].nil? && !container["resources"].empty? 
&& !container["resources"][metricCategory].nil? && !container["resources"][metricCategory][metricNameToCollect].nil?) + metricValue = getMetricNumericValue(metricNameToCollect, container["resources"][metricCategory][metricNameToCollect]) + + metricItem = {} + metricItem["DataItems"] = [] + + metricProps = {} + metricProps["Timestamp"] = metricTime + metricProps["Host"] = nodeName + # Adding this so that it is not set by base omsagent since it was not set earlier and being set by base omsagent + metricProps["Computer"] = nodeName + metricProps["ObjectName"] = "K8SContainer" + metricProps["InstanceName"] = clusterId + "/" + podUid + "/" + containerName + + metricProps["Collections"] = [] + metricCollections = {} + metricCollections["CounterName"] = metricNametoReturn + metricCollections["Value"] = metricValue + + metricProps["Collections"].push(metricCollections) + metricItem["DataItems"].push(metricProps) + metricItems.push(metricItem) + #No container level limit for the given metric, so default to node level limit + else + nodeMetricsHashKey = clusterId + "/" + nodeName + "_" + "allocatable" + "_" + metricNameToCollect + if (metricCategory == "limits" && @@NodeMetrics.has_key?(nodeMetricsHashKey)) + metricValue = @@NodeMetrics[nodeMetricsHashKey] + #@Log.info("Limits not set for container #{clusterId + "/" + podUid + "/" + containerName} using node level limits: #{nodeMetricsHashKey}=#{metricValue} ") metricItem = {} metricItem["DataItems"] = [] @@ -451,32 +481,6 @@ def getContainerResourceRequestsAndLimits(metricJSON, metricCategory, metricName metricProps["Collections"].push(metricCollections) metricItem["DataItems"].push(metricProps) metricItems.push(metricItem) - #No container level limit for the given metric, so default to node level limit - else - nodeMetricsHashKey = clusterId + "/" + nodeName + "_" + "allocatable" + "_" + metricNameToCollect - if (metricCategory == "limits" && @@NodeMetrics.has_key?(nodeMetricsHashKey)) - metricValue = @@NodeMetrics[nodeMetricsHashKey] - #@Log.info("Limits not set for container #{clusterId + "/" + podUid + "/" + containerName} using node level limits: #{nodeMetricsHashKey}=#{metricValue} ") - metricItem = {} - metricItem["DataItems"] = [] - - metricProps = {} - metricProps["Timestamp"] = metricTime - metricProps["Host"] = nodeName - # Adding this so that it is not set by base omsagent since it was not set earlier and being set by base omsagent - metricProps["Computer"] = nodeName - metricProps["ObjectName"] = "K8SContainer" - metricProps["InstanceName"] = clusterId + "/" + podUid + "/" + containerName - - metricProps["Collections"] = [] - metricCollections = {} - metricCollections["CounterName"] = metricNametoReturn - metricCollections["Value"] = metricValue - - metricProps["Collections"].push(metricCollections) - metricItem["DataItems"].push(metricProps) - metricItems.push(metricItem) - end end end end @@ -488,78 +492,74 @@ def getContainerResourceRequestsAndLimits(metricJSON, metricCategory, metricName return metricItems end #getContainerResourceRequestAndLimits - def getContainerResourceRequestsAndLimitsAsInsightsMetrics(metricJSON, metricCategory, metricNameToCollect, metricNametoReturn, metricTime = Time.now.utc.iso8601) + def getContainerResourceRequestsAndLimitsAsInsightsMetrics(pod, metricCategory, metricNameToCollect, metricNametoReturn, metricTime = Time.now.utc.iso8601) metricItems = [] begin clusterId = getClusterId clusterName = getClusterName - - metricInfo = metricJSON - metricInfo["items"].each do |pod| - podNameSpace = 
pod["metadata"]["namespace"] - if podNameSpace.eql?("kube-system") && !pod["metadata"].key?("ownerReferences") - # The above case seems to be the only case where you have horizontal scaling of pods - # but no controller, in which case cAdvisor picks up kubernetes.io/config.hash - # instead of the actual poduid. Since this uid is not being surface into the UX - # its ok to use this. - # Use kubernetes.io/config.hash to be able to correlate with cadvisor data - if pod["metadata"]["annotations"].nil? - next - else - podUid = pod["metadata"]["annotations"]["kubernetes.io/config.hash"] - end + podNameSpace = pod["metadata"]["namespace"] + if podNameSpace.eql?("kube-system") && !pod["metadata"].key?("ownerReferences") + # The above case seems to be the only case where you have horizontal scaling of pods + # but no controller, in which case cAdvisor picks up kubernetes.io/config.hash + # instead of the actual poduid. Since this uid is not being surface into the UX + # its ok to use this. + # Use kubernetes.io/config.hash to be able to correlate with cadvisor data + if pod["metadata"]["annotations"].nil? + return metricItems else - podUid = pod["metadata"]["uid"] + podUid = pod["metadata"]["annotations"]["kubernetes.io/config.hash"] end + else + podUid = pod["metadata"]["uid"] + end - podContainers = [] - if !pod["spec"]["containers"].nil? && !pod["spec"]["containers"].empty? - podContainers = podContainers + pod["spec"]["containers"] - end - # Adding init containers to the record list as well. - if !pod["spec"]["initContainers"].nil? && !pod["spec"]["initContainers"].empty? - podContainers = podContainers + pod["spec"]["initContainers"] - end + podContainers = [] + if !pod["spec"]["containers"].nil? && !pod["spec"]["containers"].empty? + podContainers = podContainers + pod["spec"]["containers"] + end + # Adding init containers to the record list as well. + if !pod["spec"]["initContainers"].nil? && !pod["spec"]["initContainers"].empty? + podContainers = podContainers + pod["spec"]["initContainers"] + end - if (!podContainers.nil? && !podContainers.empty?) - if (!pod["spec"]["nodeName"].nil?) - nodeName = pod["spec"]["nodeName"] + if (!podContainers.nil? && !podContainers.empty?) + if (!pod["spec"]["nodeName"].nil?) + nodeName = pod["spec"]["nodeName"] + else + nodeName = "" #unscheduled pod. We still want to collect limits & requests for GPU + end + podContainers.each do |container| + metricValue = nil + containerName = container["name"] + #metricTime = Time.now.utc.iso8601 #2018-01-30T19:36:14Z + if (!container["resources"].nil? && !container["resources"].empty? && !container["resources"][metricCategory].nil? && !container["resources"][metricCategory][metricNameToCollect].nil?) + metricValue = getMetricNumericValue(metricNameToCollect, container["resources"][metricCategory][metricNameToCollect]) else - nodeName = "" #unscheduled pod. We still want to collect limits & requests for GPU - end - podContainers.each do |container| - metricValue = nil - containerName = container["name"] - #metricTime = Time.now.utc.iso8601 #2018-01-30T19:36:14Z - if (!container["resources"].nil? && !container["resources"].empty? && !container["resources"][metricCategory].nil? && !container["resources"][metricCategory][metricNameToCollect].nil?) 
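getContainerResourceRequestsAndLimits and getContainerResourceRequestsAndLimitsAsInsightsMetrics now accept a single pod item rather than the whole pod list, so the caller iterates pods and concatenates the per-pod records. A minimal sketch of that calling pattern, assuming podInventory is an already parsed pod list and borrowing cpuRequestNanoCores as an assumed counter name:

    metric_items = []
    podInventory["items"].each do |pod|
      # per-pod call: cpu requests for every (init) container in this pod
      metric_items.concat(
        KubernetesApiClient.getContainerResourceRequestsAndLimits(pod, "requests", "cpu", "cpuRequestNanoCores")
      )
    end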
- metricValue = getMetricNumericValue(metricNameToCollect, container["resources"][metricCategory][metricNameToCollect]) - else - #No container level limit for the given metric, so default to node level limit for non-gpu metrics - if (metricNameToCollect.downcase != "nvidia.com/gpu") && (metricNameToCollect.downcase != "amd.com/gpu") - nodeMetricsHashKey = clusterId + "/" + nodeName + "_" + "allocatable" + "_" + metricNameToCollect - metricValue = @@NodeMetrics[nodeMetricsHashKey] - end - end - if (!metricValue.nil?) - metricItem = {} - metricItem["CollectionTime"] = metricTime - metricItem["Computer"] = nodeName - metricItem["Name"] = metricNametoReturn - metricItem["Value"] = metricValue - metricItem["Origin"] = Constants::INSIGHTSMETRICS_TAGS_ORIGIN - metricItem["Namespace"] = Constants::INSIGHTSMETRICS_TAGS_GPU_NAMESPACE - - metricTags = {} - metricTags[Constants::INSIGHTSMETRICS_TAGS_CLUSTERID] = clusterId - metricTags[Constants::INSIGHTSMETRICS_TAGS_CLUSTERNAME] = clusterName - metricTags[Constants::INSIGHTSMETRICS_TAGS_CONTAINER_NAME] = podUid + "/" + containerName - #metricTags[Constants::INSIGHTSMETRICS_TAGS_K8SNAMESPACE] = podNameSpace - - metricItem["Tags"] = metricTags - - metricItems.push(metricItem) + #No container level limit for the given metric, so default to node level limit for non-gpu metrics + if (metricNameToCollect.downcase != "nvidia.com/gpu") && (metricNameToCollect.downcase != "amd.com/gpu") + nodeMetricsHashKey = clusterId + "/" + nodeName + "_" + "allocatable" + "_" + metricNameToCollect + metricValue = @@NodeMetrics[nodeMetricsHashKey] end end + if (!metricValue.nil?) + metricItem = {} + metricItem["CollectionTime"] = metricTime + metricItem["Computer"] = nodeName + metricItem["Name"] = metricNametoReturn + metricItem["Value"] = metricValue + metricItem["Origin"] = Constants::INSIGHTSMETRICS_TAGS_ORIGIN + metricItem["Namespace"] = Constants::INSIGHTSMETRICS_TAGS_GPU_NAMESPACE + + metricTags = {} + metricTags[Constants::INSIGHTSMETRICS_TAGS_CLUSTERID] = clusterId + metricTags[Constants::INSIGHTSMETRICS_TAGS_CLUSTERNAME] = clusterName + metricTags[Constants::INSIGHTSMETRICS_TAGS_CONTAINER_NAME] = podUid + "/" + containerName + #metricTags[Constants::INSIGHTSMETRICS_TAGS_K8SNAMESPACE] = podNameSpace + + metricItem["Tags"] = metricTags + + metricItems.push(metricItem) + end end end rescue => error @@ -578,32 +578,9 @@ def parseNodeLimits(metricJSON, metricCategory, metricNameToCollect, metricNamet #if we are coming up with the time it should be same for all nodes #metricTime = Time.now.utc.iso8601 #2018-01-30T19:36:14Z metricInfo["items"].each do |node| - if (!node["status"][metricCategory].nil?) 
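The InsightsMetrics variant keeps the same per-pod shape for GPU requests and limits, and still records unscheduled pods (their Computer field stays empty). A hedged sketch; containerGpuRequests is an assumed metric name used only for illustration:

    gpu_records = []
    podInventory["items"].each do |pod|
      gpu_records.concat(
        KubernetesApiClient.getContainerResourceRequestsAndLimitsAsInsightsMetrics(pod, "requests", "nvidia.com/gpu", "containerGpuRequests")
      )
    end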
- - # metricCategory can be "capacity" or "allocatable" and metricNameToCollect can be "cpu" or "memory" - metricValue = getMetricNumericValue(metricNameToCollect, node["status"][metricCategory][metricNameToCollect]) - - metricItem = {} - metricItem["DataItems"] = [] - metricProps = {} - metricProps["Timestamp"] = metricTime - metricProps["Host"] = node["metadata"]["name"] - # Adding this so that it is not set by base omsagent since it was not set earlier and being set by base omsagent - metricProps["Computer"] = node["metadata"]["name"] - metricProps["ObjectName"] = "K8SNode" - metricProps["InstanceName"] = clusterId + "/" + node["metadata"]["name"] - metricProps["Collections"] = [] - metricCollections = {} - metricCollections["CounterName"] = metricNametoReturn - metricCollections["Value"] = metricValue - - metricProps["Collections"].push(metricCollections) - metricItem["DataItems"].push(metricProps) + metricItem = parseNodeLimitsFromNodeItem(node, metricCategory, metricNameToCollect, metricNametoReturn, metricTime) + if !metricItem.nil? && !metricItem.empty? metricItems.push(metricItem) - #push node level metrics to a inmem hash so that we can use it looking up at container level. - #Currently if container level cpu & memory limits are not defined we default to node level limits - @@NodeMetrics[clusterId + "/" + node["metadata"]["name"] + "_" + metricCategory + "_" + metricNameToCollect] = metricValue - #@Log.info ("Node metric hash: #{@@NodeMetrics}") end end rescue => error @@ -612,49 +589,82 @@ def parseNodeLimits(metricJSON, metricCategory, metricNameToCollect, metricNamet return metricItems end #parseNodeLimits - def parseNodeLimitsAsInsightsMetrics(metricJSON, metricCategory, metricNameToCollect, metricNametoReturn, metricTime = Time.now.utc.iso8601) - metricItems = [] + def parseNodeLimitsFromNodeItem(node, metricCategory, metricNameToCollect, metricNametoReturn, metricTime = Time.now.utc.iso8601) + metricItem = {} begin - metricInfo = metricJSON clusterId = getClusterId - clusterName = getClusterName #Since we are getting all node data at the same time and kubernetes doesnt specify a timestamp for the capacity and allocation metrics, #if we are coming up with the time it should be same for all nodes #metricTime = Time.now.utc.iso8601 #2018-01-30T19:36:14Z - metricInfo["items"].each do |node| - if (!node["status"][metricCategory].nil?) && (!node["status"][metricCategory][metricNameToCollect].nil?) - - # metricCategory can be "capacity" or "allocatable" and metricNameToCollect can be "cpu" or "memory" or "amd.com/gpu" or "nvidia.com/gpu" - metricValue = getMetricNumericValue(metricNameToCollect, node["status"][metricCategory][metricNameToCollect]) - - metricItem = {} - metricItem["CollectionTime"] = metricTime - metricItem["Computer"] = node["metadata"]["name"] - metricItem["Name"] = metricNametoReturn - metricItem["Value"] = metricValue - metricItem["Origin"] = Constants::INSIGHTSMETRICS_TAGS_ORIGIN - metricItem["Namespace"] = Constants::INSIGHTSMETRICS_TAGS_GPU_NAMESPACE - - metricTags = {} - metricTags[Constants::INSIGHTSMETRICS_TAGS_CLUSTERID] = clusterId - metricTags[Constants::INSIGHTSMETRICS_TAGS_CLUSTERNAME] = clusterName - metricTags[Constants::INSIGHTSMETRICS_TAGS_GPU_VENDOR] = metricNameToCollect - - metricItem["Tags"] = metricTags + if (!node["status"][metricCategory].nil?) && (!node["status"][metricCategory][metricNameToCollect].nil?) 
+ # metricCategory can be "capacity" or "allocatable" and metricNameToCollect can be "cpu" or "memory" + metricValue = getMetricNumericValue(metricNameToCollect, node["status"][metricCategory][metricNameToCollect]) + + metricItem["DataItems"] = [] + metricProps = {} + metricProps["Timestamp"] = metricTime + metricProps["Host"] = node["metadata"]["name"] + # Adding this so that it is not set by base omsagent since it was not set earlier and being set by base omsagent + metricProps["Computer"] = node["metadata"]["name"] + metricProps["ObjectName"] = "K8SNode" + metricProps["InstanceName"] = clusterId + "/" + node["metadata"]["name"] + metricProps["Collections"] = [] + metricCollections = {} + metricCollections["CounterName"] = metricNametoReturn + metricCollections["Value"] = metricValue + + metricProps["Collections"].push(metricCollections) + metricItem["DataItems"].push(metricProps) + + #push node level metrics to a inmem hash so that we can use it looking up at container level. + #Currently if container level cpu & memory limits are not defined we default to node level limits + @@NodeMetrics[clusterId + "/" + node["metadata"]["name"] + "_" + metricCategory + "_" + metricNameToCollect] = metricValue + #@Log.info ("Node metric hash: #{@@NodeMetrics}") + end + rescue => error + @Log.warn("parseNodeLimitsFromNodeItem failed: #{error} for metric #{metricCategory} #{metricNameToCollect}") + end + return metricItem + end #parseNodeLimitsFromNodeItem - metricItems.push(metricItem) - #push node level metrics (except gpu ones) to a inmem hash so that we can use it looking up at container level. - #Currently if container level cpu & memory limits are not defined we default to node level limits - if (metricNameToCollect.downcase != "nvidia.com/gpu") && (metricNameToCollect.downcase != "amd.com/gpu") - @@NodeMetrics[clusterId + "/" + node["metadata"]["name"] + "_" + metricCategory + "_" + metricNameToCollect] = metricValue - #@Log.info ("Node metric hash: #{@@NodeMetrics}") - end + def parseNodeLimitsAsInsightsMetrics(node, metricCategory, metricNameToCollect, metricNametoReturn, metricTime = Time.now.utc.iso8601) + metricItem = {} + begin + #Since we are getting all node data at the same time and kubernetes doesnt specify a timestamp for the capacity and allocation metrics, + #if we are coming up with the time it should be same for all nodes + #metricTime = Time.now.utc.iso8601 #2018-01-30T19:36:14Z + if (!node["status"][metricCategory].nil?) && (!node["status"][metricCategory][metricNameToCollect].nil?) 
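parseNodeLimits keeps its list-level signature but now builds each record through parseNodeLimitsFromNodeItem, which also continues to cache node allocatable/capacity values for the container-level fallback. A sketch of that delegation for a single metric, assuming a parsed node list and cpuAllocatableNanoCores as an assumed counter name:

    metric_items = []
    nodeInventory["items"].each do |node|
      item = KubernetesApiClient.parseNodeLimitsFromNodeItem(node, "allocatable", "cpu", "cpuAllocatableNanoCores")
      metric_items.push(item) unless item.nil? || item.empty?
    end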
+ clusterId = getClusterId + clusterName = getClusterName + + # metricCategory can be "capacity" or "allocatable" and metricNameToCollect can be "cpu" or "memory" or "amd.com/gpu" or "nvidia.com/gpu" + metricValue = getMetricNumericValue(metricNameToCollect, node["status"][metricCategory][metricNameToCollect]) + + metricItem["CollectionTime"] = metricTime + metricItem["Computer"] = node["metadata"]["name"] + metricItem["Name"] = metricNametoReturn + metricItem["Value"] = metricValue + metricItem["Origin"] = Constants::INSIGHTSMETRICS_TAGS_ORIGIN + metricItem["Namespace"] = Constants::INSIGHTSMETRICS_TAGS_GPU_NAMESPACE + + metricTags = {} + metricTags[Constants::INSIGHTSMETRICS_TAGS_CLUSTERID] = clusterId + metricTags[Constants::INSIGHTSMETRICS_TAGS_CLUSTERNAME] = clusterName + metricTags[Constants::INSIGHTSMETRICS_TAGS_GPU_VENDOR] = metricNameToCollect + + metricItem["Tags"] = metricTags + + #push node level metrics (except gpu ones) to a inmem hash so that we can use it looking up at container level. + #Currently if container level cpu & memory limits are not defined we default to node level limits + if (metricNameToCollect.downcase != "nvidia.com/gpu") && (metricNameToCollect.downcase != "amd.com/gpu") + @@NodeMetrics[clusterId + "/" + node["metadata"]["name"] + "_" + metricCategory + "_" + metricNameToCollect] = metricValue + #@Log.info ("Node metric hash: #{@@NodeMetrics}") end end rescue => error @Log.warn("parseNodeLimitsAsInsightsMetrics failed: #{error} for metric #{metricCategory} #{metricNameToCollect}") end - return metricItems + return metricItem end def getMetricNumericValue(metricName, metricVal) @@ -777,5 +787,32 @@ def getKubeAPIServerUrl end return apiServerUrl end + + def getKubeServicesInventoryRecords(serviceList, batchTime = Time.utc.iso8601) + kubeServiceRecords = [] + begin + if (!serviceList.nil? && !serviceList.empty?) 
+ servicesCount = serviceList["items"].length + @Log.info("KubernetesApiClient::getKubeServicesInventoryRecords : number of services in serviceList #{servicesCount} @ #{Time.now.utc.iso8601}") + serviceList["items"].each do |item| + kubeServiceRecord = {} + kubeServiceRecord["CollectionTime"] = batchTime #This is the time that is mapped to become TimeGenerated + kubeServiceRecord["ServiceName"] = item["metadata"]["name"] + kubeServiceRecord["Namespace"] = item["metadata"]["namespace"] + kubeServiceRecord["SelectorLabels"] = [item["spec"]["selector"]] + # added these before emit to avoid memory foot print + # kubeServiceRecord["ClusterId"] = KubernetesApiClient.getClusterId + # kubeServiceRecord["ClusterName"] = KubernetesApiClient.getClusterName + kubeServiceRecord["ClusterIP"] = item["spec"]["clusterIP"] + kubeServiceRecord["ServiceType"] = item["spec"]["type"] + kubeServiceRecords.push(kubeServiceRecord.dup) + end + end + rescue => errorStr + @Log.warn "KubernetesApiClient::getKubeServicesInventoryRecords:Failed with an error : #{errorStr}" + ApplicationInsightsUtility.sendExceptionTelemetry(errorStr) + end + return kubeServiceRecords + end end end diff --git a/source/plugins/ruby/constants.rb b/source/plugins/ruby/constants.rb index 0e5099c5e..cf41900dc 100644 --- a/source/plugins/ruby/constants.rb +++ b/source/plugins/ruby/constants.rb @@ -1,61 +1,61 @@ # frozen_string_literal: true class Constants - INSIGHTSMETRICS_TAGS_ORIGIN = "container.azm.ms" - INSIGHTSMETRICS_TAGS_CLUSTERID = "container.azm.ms/clusterId" - INSIGHTSMETRICS_TAGS_CLUSTERNAME = "container.azm.ms/clusterName" - INSIGHTSMETRICS_TAGS_GPU_VENDOR = "gpuVendor" - INSIGHTSMETRICS_TAGS_GPU_NAMESPACE = "container.azm.ms/gpu" - INSIGHTSMETRICS_TAGS_GPU_MODEL = "gpuModel" - INSIGHTSMETRICS_TAGS_GPU_ID = "gpuId" - INSIGHTSMETRICS_TAGS_CONTAINER_NAME = "containerName" - INSIGHTSMETRICS_TAGS_CONTAINER_ID = "containerName" - INSIGHTSMETRICS_TAGS_K8SNAMESPACE = "k8sNamespace" - INSIGHTSMETRICS_TAGS_CONTROLLER_NAME = "controllerName" - INSIGHTSMETRICS_TAGS_CONTROLLER_KIND = "controllerKind" - INSIGHTSMETRICS_TAGS_POD_UID = "podUid" - INSIGTHTSMETRICS_TAGS_PV_NAMESPACE = "container.azm.ms/pv" - INSIGHTSMETRICS_TAGS_PVC_NAME = "pvcName" - INSIGHTSMETRICS_TAGS_PVC_NAMESPACE = "pvcNamespace" - INSIGHTSMETRICS_TAGS_POD_NAME = "podName" - INSIGHTSMETRICS_TAGS_PV_CAPACITY_BYTES = "pvCapacityBytes" - INSIGHTSMETRICS_TAGS_VOLUME_NAME = "volumeName" - INSIGHTSMETRICS_FLUENT_TAG = "oms.api.InsightsMetrics" - REASON_OOM_KILLED = "oomkilled" - #Kubestate (common) - INSIGHTSMETRICS_TAGS_KUBESTATE_NAMESPACE = "container.azm.ms/kubestate" - INSIGHTSMETRICS_TAGS_KUBE_STATE_CREATIONTIME = "creationTime" - #Kubestate (deployments) - INSIGHTSMETRICS_METRIC_NAME_KUBE_STATE_DEPLOYMENT_STATE = "kube_deployment_status_replicas_ready" - INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_NAME = "deployment" - INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_CREATIONTIME = "creationTime" - INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_STRATEGY = "deploymentStrategy" - INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_SPEC_REPLICAS = "spec_replicas" - INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_STATUS_REPLICAS_UPDATED = "status_replicas_updated" - INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_STATUS_REPLICAS_AVAILABLE = "status_replicas_available" - #Kubestate (HPA) - INSIGHTSMETRICS_METRIC_NAME_KUBE_STATE_HPA_STATE = "kube_hpa_status_current_replicas" - INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_NAME = "hpa" - INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_SPEC_MAX_REPLICAS = "spec_max_replicas" - 
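getKubeServicesInventoryRecords, added above, flattens a services list into one record per service (CollectionTime, ServiceName, Namespace, SelectorLabels, ClusterIP, ServiceType). A hedged usage sketch, assuming serviceList is the parsed response of a services API call:

    records = KubernetesApiClient.getKubeServicesInventoryRecords(serviceList, Time.now.utc.iso8601)
    records.each do |r|
      puts "#{r["Namespace"]}/#{r["ServiceName"]} type=#{r["ServiceType"]} clusterIP=#{r["ClusterIP"]}"
    end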
INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_SPEC_MIN_REPLICAS = "spec_min_replicas" - INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_SPEC_SCALE_TARGET_KIND = "targetKind" - INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_SPEC_SCALE_TARGET_NAME = "targetName" - INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_STATUS_DESIRED_REPLICAS = "status_desired_replicas" - - INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_STATUS_LAST_SCALE_TIME = "lastScaleTime" - # MDM Metric names - MDM_OOM_KILLED_CONTAINER_COUNT = "oomKilledContainerCount" - MDM_CONTAINER_RESTART_COUNT = "restartingContainerCount" - MDM_POD_READY_PERCENTAGE = "podReadyPercentage" - MDM_STALE_COMPLETED_JOB_COUNT = "completedJobsCount" - MDM_DISK_USED_PERCENTAGE = "diskUsedPercentage" - MDM_CONTAINER_CPU_UTILIZATION_METRIC = "cpuExceededPercentage" - MDM_CONTAINER_MEMORY_RSS_UTILIZATION_METRIC = "memoryRssExceededPercentage" - MDM_CONTAINER_MEMORY_WORKING_SET_UTILIZATION_METRIC = "memoryWorkingSetExceededPercentage" - MDM_PV_UTILIZATION_METRIC = "pvUsageExceededPercentage" - MDM_NODE_CPU_USAGE_PERCENTAGE = "cpuUsagePercentage" - MDM_NODE_MEMORY_RSS_PERCENTAGE = "memoryRssPercentage" - MDM_NODE_MEMORY_WORKING_SET_PERCENTAGE = "memoryWorkingSetPercentage" + INSIGHTSMETRICS_TAGS_ORIGIN = "container.azm.ms" + INSIGHTSMETRICS_TAGS_CLUSTERID = "container.azm.ms/clusterId" + INSIGHTSMETRICS_TAGS_CLUSTERNAME = "container.azm.ms/clusterName" + INSIGHTSMETRICS_TAGS_GPU_VENDOR = "gpuVendor" + INSIGHTSMETRICS_TAGS_GPU_NAMESPACE = "container.azm.ms/gpu" + INSIGHTSMETRICS_TAGS_GPU_MODEL = "gpuModel" + INSIGHTSMETRICS_TAGS_GPU_ID = "gpuId" + INSIGHTSMETRICS_TAGS_CONTAINER_NAME = "containerName" + INSIGHTSMETRICS_TAGS_CONTAINER_ID = "containerName" + INSIGHTSMETRICS_TAGS_K8SNAMESPACE = "k8sNamespace" + INSIGHTSMETRICS_TAGS_CONTROLLER_NAME = "controllerName" + INSIGHTSMETRICS_TAGS_CONTROLLER_KIND = "controllerKind" + INSIGHTSMETRICS_TAGS_POD_UID = "podUid" + INSIGTHTSMETRICS_TAGS_PV_NAMESPACE = "container.azm.ms/pv" + INSIGHTSMETRICS_TAGS_PVC_NAME = "pvcName" + INSIGHTSMETRICS_TAGS_PVC_NAMESPACE = "pvcNamespace" + INSIGHTSMETRICS_TAGS_POD_NAME = "podName" + INSIGHTSMETRICS_TAGS_PV_CAPACITY_BYTES = "pvCapacityBytes" + INSIGHTSMETRICS_TAGS_VOLUME_NAME = "volumeName" + INSIGHTSMETRICS_FLUENT_TAG = "oms.api.InsightsMetrics" + REASON_OOM_KILLED = "oomkilled" + #Kubestate (common) + INSIGHTSMETRICS_TAGS_KUBESTATE_NAMESPACE = "container.azm.ms/kubestate" + INSIGHTSMETRICS_TAGS_KUBE_STATE_CREATIONTIME = "creationTime" + #Kubestate (deployments) + INSIGHTSMETRICS_METRIC_NAME_KUBE_STATE_DEPLOYMENT_STATE = "kube_deployment_status_replicas_ready" + INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_NAME = "deployment" + INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_CREATIONTIME = "creationTime" + INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_STRATEGY = "deploymentStrategy" + INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_SPEC_REPLICAS = "spec_replicas" + INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_STATUS_REPLICAS_UPDATED = "status_replicas_updated" + INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_STATUS_REPLICAS_AVAILABLE = "status_replicas_available" + #Kubestate (HPA) + INSIGHTSMETRICS_METRIC_NAME_KUBE_STATE_HPA_STATE = "kube_hpa_status_current_replicas" + INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_NAME = "hpa" + INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_SPEC_MAX_REPLICAS = "spec_max_replicas" + INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_SPEC_MIN_REPLICAS = "spec_min_replicas" + INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_SPEC_SCALE_TARGET_KIND = "targetKind" + INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_SPEC_SCALE_TARGET_NAME = "targetName" + 
INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_STATUS_DESIRED_REPLICAS = "status_desired_replicas" + + INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_STATUS_LAST_SCALE_TIME = "lastScaleTime" + # MDM Metric names + MDM_OOM_KILLED_CONTAINER_COUNT = "oomKilledContainerCount" + MDM_CONTAINER_RESTART_COUNT = "restartingContainerCount" + MDM_POD_READY_PERCENTAGE = "podReadyPercentage" + MDM_STALE_COMPLETED_JOB_COUNT = "completedJobsCount" + MDM_DISK_USED_PERCENTAGE = "diskUsedPercentage" + MDM_CONTAINER_CPU_UTILIZATION_METRIC = "cpuExceededPercentage" + MDM_CONTAINER_MEMORY_RSS_UTILIZATION_METRIC = "memoryRssExceededPercentage" + MDM_CONTAINER_MEMORY_WORKING_SET_UTILIZATION_METRIC = "memoryWorkingSetExceededPercentage" + MDM_PV_UTILIZATION_METRIC = "pvUsageExceededPercentage" + MDM_NODE_CPU_USAGE_PERCENTAGE = "cpuUsagePercentage" + MDM_NODE_MEMORY_RSS_PERCENTAGE = "memoryRssPercentage" + MDM_NODE_MEMORY_WORKING_SET_PERCENTAGE = "memoryWorkingSetPercentage" CONTAINER_TERMINATED_RECENTLY_IN_MINUTES = 5 OBJECT_NAME_K8S_CONTAINER = "K8SContainer" @@ -77,6 +77,9 @@ class Constants OMSAGENT_ZERO_FILL = "omsagent" KUBESYSTEM_NAMESPACE_ZERO_FILL = "kube-system" VOLUME_NAME_ZERO_FILL = "-" + PV_TYPES =["awsElasticBlockStore", "azureDisk", "azureFile", "cephfs", "cinder", "csi", "fc", "flexVolume", + "flocker", "gcePersistentDisk", "glusterfs", "hostPath", "iscsi", "local", "nfs", + "photonPersistentDisk", "portworxVolume", "quobyte", "rbd", "scaleIO", "storageos", "vsphereVolume"] #Telemetry constants CONTAINER_METRICS_HEART_BEAT_EVENT = "ContainerMetricsMdmHeartBeatEvent" @@ -84,10 +87,13 @@ class Constants CONTAINER_RESOURCE_UTIL_HEART_BEAT_EVENT = "ContainerResourceUtilMdmHeartBeatEvent" PV_USAGE_HEART_BEAT_EVENT = "PVUsageMdmHeartBeatEvent" PV_KUBE_SYSTEM_METRICS_ENABLED_EVENT = "CollectPVKubeSystemMetricsEnabled" + PV_INVENTORY_HEART_BEAT_EVENT = "KubePVInventoryHeartBeatEvent" TELEMETRY_FLUSH_INTERVAL_IN_MINUTES = 10 KUBE_STATE_TELEMETRY_FLUSH_INTERVAL_IN_MINUTES = 15 ZERO_FILL_METRICS_INTERVAL_IN_MINUTES = 30 MDM_TIME_SERIES_FLUSHED_IN_LAST_HOUR = "MdmTimeSeriesFlushedInLastHour" + MDM_EXCEPTION_TELEMETRY_METRIC = "AKSCustomMetricsMdmExceptions" + MDM_EXCEPTIONS_METRIC_FLUSH_INTERVAL = 30 #Pod Statuses POD_STATUS_TERMINATING = "Terminating" diff --git a/source/plugins/ruby/filter_cadvisor2mdm.rb b/source/plugins/ruby/filter_cadvisor2mdm.rb index 3bc674ea8..8d7e729c8 100644 --- a/source/plugins/ruby/filter_cadvisor2mdm.rb +++ b/source/plugins/ruby/filter_cadvisor2mdm.rb @@ -15,7 +15,6 @@ class CAdvisor2MdmFilter < Filter config_param :enable_log, :integer, :default => 0 config_param :log_path, :string, :default => "/var/opt/microsoft/docker-cimprov/log/filter_cadvisor2mdm.log" - config_param :custom_metrics_azure_regions, :string config_param :metrics_to_collect, :string, :default => "Constants::CPU_USAGE_NANO_CORES,Constants::MEMORY_WORKING_SET_BYTES,Constants::MEMORY_RSS_BYTES,Constants::PV_USED_BYTES" @@hostName = (OMS::Common.get_hostname) @@ -42,7 +41,7 @@ def configure(conf) def start super begin - @process_incoming_stream = CustomMetricsUtils.check_custom_metrics_availability(@custom_metrics_azure_regions) + @process_incoming_stream = CustomMetricsUtils.check_custom_metrics_availability @metrics_to_collect_hash = build_metrics_hash @log.debug "After check_custom_metrics_availability process_incoming_stream #{@process_incoming_stream}" @@containerResourceUtilTelemetryTimeTracker = DateTime.now.to_time.to_i @@ -309,8 +308,16 @@ def ensure_cpu_memory_capacity_set end elsif controller_type.downcase == "daemonset" 
capacity_from_kubelet = KubeletUtils.get_node_capacity - @cpu_capacity = capacity_from_kubelet[0] - @memory_capacity = capacity_from_kubelet[1] + + # Error handling in case /metrics/cadvsior endpoint fails + if !capacity_from_kubelet.nil? && capacity_from_kubelet.length > 1 + @cpu_capacity = capacity_from_kubelet[0] + @memory_capacity = capacity_from_kubelet[1] + else + # cpu_capacity and memory_capacity keep initialized value of 0.0 + @log.error "Error getting capacity_from_kubelet: cpu_capacity and memory_capacity" + end + end end diff --git a/source/plugins/ruby/filter_inventory2mdm.rb b/source/plugins/ruby/filter_inventory2mdm.rb index b5ef587ff..38ccab885 100644 --- a/source/plugins/ruby/filter_inventory2mdm.rb +++ b/source/plugins/ruby/filter_inventory2mdm.rb @@ -13,7 +13,6 @@ class Inventory2MdmFilter < Filter config_param :enable_log, :integer, :default => 0 config_param :log_path, :string, :default => '/var/opt/microsoft/docker-cimprov/log/filter_inventory2mdm.log' - config_param :custom_metrics_azure_regions, :string @@node_count_metric_name = 'nodesCount' @@pod_count_metric_name = 'podCount' @@ -98,7 +97,7 @@ def configure(conf) def start super - @process_incoming_stream = CustomMetricsUtils.check_custom_metrics_availability(@custom_metrics_azure_regions) + @process_incoming_stream = CustomMetricsUtils.check_custom_metrics_availability @log.debug "After check_custom_metrics_availability process_incoming_stream #{@process_incoming_stream}" end diff --git a/source/plugins/ruby/filter_telegraf2mdm.rb b/source/plugins/ruby/filter_telegraf2mdm.rb index 98d258ea5..88ae428d1 100644 --- a/source/plugins/ruby/filter_telegraf2mdm.rb +++ b/source/plugins/ruby/filter_telegraf2mdm.rb @@ -15,7 +15,6 @@ class Telegraf2MdmFilter < Filter config_param :enable_log, :integer, :default => 0 config_param :log_path, :string, :default => "/var/opt/microsoft/docker-cimprov/log/filter_telegraf2mdm.log" - config_param :custom_metrics_azure_regions, :string @process_incoming_stream = true @@ -36,7 +35,7 @@ def configure(conf) def start super begin - @process_incoming_stream = CustomMetricsUtils.check_custom_metrics_availability(@custom_metrics_azure_regions) + @process_incoming_stream = CustomMetricsUtils.check_custom_metrics_availability @log.debug "After check_custom_metrics_availability process_incoming_stream #{@process_incoming_stream}" rescue => errorStr @log.info "Error initializing plugin #{errorStr}" diff --git a/source/plugins/ruby/in_kube_events.rb b/source/plugins/ruby/in_kube_events.rb index 6f59a3fc1..4f6017cc5 100644 --- a/source/plugins/ruby/in_kube_events.rb +++ b/source/plugins/ruby/in_kube_events.rb @@ -17,8 +17,9 @@ def initialize require_relative "omslog" require_relative "ApplicationInsightsUtility" - # 30000 events account to approximately 5MB - @EVENTS_CHUNK_SIZE = 30000 + # refer tomlparser-agent-config for defaults + # this configurable via configmap + @EVENTS_CHUNK_SIZE = 0 # Initializing events count for telemetry @eventsCount = 0 @@ -36,6 +37,15 @@ def configure(conf) def start if @run_interval + if !ENV["EVENTS_CHUNK_SIZE"].nil? && !ENV["EVENTS_CHUNK_SIZE"].empty? 
&& ENV["EVENTS_CHUNK_SIZE"].to_i > 0 + @EVENTS_CHUNK_SIZE = ENV["EVENTS_CHUNK_SIZE"].to_i + else + # this shouldnt happen just setting default here as safe guard + $log.warn("in_kube_events::start: setting to default value since got EVENTS_CHUNK_SIZE nil or empty") + @EVENTS_CHUNK_SIZE = 4000 + end + $log.info("in_kube_events::start : EVENTS_CHUNK_SIZE @ #{@EVENTS_CHUNK_SIZE}") + @finished = false @condition = ConditionVariable.new @mutex = Mutex.new @@ -82,6 +92,8 @@ def enumerate end $log.info("in_kube_events::enumerate : Done getting events from Kube API @ #{Time.now.utc.iso8601}") if (!eventList.nil? && !eventList.empty? && eventList.key?("items") && !eventList["items"].nil? && !eventList["items"].empty?) + eventsCount = eventList["items"].length + $log.info "in_kube_events::enumerate:Received number of events in eventList is #{eventsCount} @ #{Time.now.utc.iso8601}" newEventQueryState = parse_and_emit_records(eventList, eventQueryState, newEventQueryState, batchTime) else $log.warn "in_kube_events::enumerate:Received empty eventList" @@ -91,6 +103,8 @@ def enumerate while (!continuationToken.nil? && !continuationToken.empty?) continuationToken, eventList = KubernetesApiClient.getResourcesAndContinuationToken("events?fieldSelector=type!=Normal&limit=#{@EVENTS_CHUNK_SIZE}&continue=#{continuationToken}") if (!eventList.nil? && !eventList.empty? && eventList.key?("items") && !eventList["items"].nil? && !eventList["items"].empty?) + eventsCount = eventList["items"].length + $log.info "in_kube_events::enumerate:Received number of events in eventList is #{eventsCount} @ #{Time.now.utc.iso8601}" newEventQueryState = parse_and_emit_records(eventList, eventQueryState, newEventQueryState, batchTime) else $log.warn "in_kube_events::enumerate:Received empty eventList" diff --git a/source/plugins/ruby/in_kube_nodes.rb b/source/plugins/ruby/in_kube_nodes.rb index 4d58382f5..e7c5060a5 100644 --- a/source/plugins/ruby/in_kube_nodes.rb +++ b/source/plugins/ruby/in_kube_nodes.rb @@ -32,7 +32,12 @@ def initialize require_relative "ApplicationInsightsUtility" require_relative "oms_common" require_relative "omslog" - @NODES_CHUNK_SIZE = "400" + # refer tomlparser-agent-config for the defaults + @NODES_CHUNK_SIZE = 0 + @NODES_EMIT_STREAM_BATCH_SIZE = 0 + + @nodeInventoryE2EProcessingLatencyMs = 0 + @nodesAPIE2ELatencyMs = 0 require_relative "constants" end @@ -45,11 +50,30 @@ def configure(conf) def start if @run_interval + if !ENV["NODES_CHUNK_SIZE"].nil? && !ENV["NODES_CHUNK_SIZE"].empty? && ENV["NODES_CHUNK_SIZE"].to_i > 0 + @NODES_CHUNK_SIZE = ENV["NODES_CHUNK_SIZE"].to_i + else + # this shouldnt happen just setting default here as safe guard + $log.warn("in_kube_nodes::start: setting to default value since got NODES_CHUNK_SIZE nil or empty") + @NODES_CHUNK_SIZE = 250 + end + $log.info("in_kube_nodes::start : NODES_CHUNK_SIZE @ #{@NODES_CHUNK_SIZE}") + + if !ENV["NODES_EMIT_STREAM_BATCH_SIZE"].nil? && !ENV["NODES_EMIT_STREAM_BATCH_SIZE"].empty? 
&& ENV["NODES_EMIT_STREAM_BATCH_SIZE"].to_i > 0 + @NODES_EMIT_STREAM_BATCH_SIZE = ENV["NODES_EMIT_STREAM_BATCH_SIZE"].to_i + else + # this shouldnt happen just setting default here as safe guard + $log.warn("in_kube_nodes::start: setting to default value since got NODES_EMIT_STREAM_BATCH_SIZE nil or empty") + @NODES_EMIT_STREAM_BATCH_SIZE = 100 + end + $log.info("in_kube_nodes::start : NODES_EMIT_STREAM_BATCH_SIZE @ #{@NODES_EMIT_STREAM_BATCH_SIZE}") + @finished = false @condition = ConditionVariable.new @mutex = Mutex.new @thread = Thread.new(&method(:run_periodic)) @@nodeTelemetryTimeTracker = DateTime.now.to_time.to_i + @@nodeInventoryLatencyTelemetryTimeTracker = DateTime.now.to_time.to_i end end @@ -69,14 +93,20 @@ def enumerate currentTime = Time.now batchTime = currentTime.utc.iso8601 + @nodesAPIE2ELatencyMs = 0 + @nodeInventoryE2EProcessingLatencyMs = 0 + nodeInventoryStartTime = (Time.now.to_f * 1000).to_i + nodesAPIChunkStartTime = (Time.now.to_f * 1000).to_i # Initializing continuation token to nil continuationToken = nil $log.info("in_kube_nodes::enumerate : Getting nodes from Kube API @ #{Time.now.utc.iso8601}") resourceUri = KubernetesApiClient.getNodesResourceUri("nodes?limit=#{@NODES_CHUNK_SIZE}") continuationToken, nodeInventory = KubernetesApiClient.getResourcesAndContinuationToken(resourceUri) - $log.info("in_kube_nodes::enumerate : Done getting nodes from Kube API @ #{Time.now.utc.iso8601}") + nodesAPIChunkEndTime = (Time.now.to_f * 1000).to_i + @nodesAPIE2ELatencyMs = (nodesAPIChunkEndTime - nodesAPIChunkStartTime) if (!nodeInventory.nil? && !nodeInventory.empty? && nodeInventory.key?("items") && !nodeInventory["items"].nil? && !nodeInventory["items"].empty?) + $log.info("in_kube_nodes::enumerate : number of node items :#{nodeInventory["items"].length} from Kube API @ #{Time.now.utc.iso8601}") parse_and_emit_records(nodeInventory, batchTime) else $log.warn "in_kube_nodes::enumerate:Received empty nodeInventory" @@ -84,14 +114,26 @@ def enumerate #If we receive a continuation token, make calls, process and flush data until we have processed all data while (!continuationToken.nil? && !continuationToken.empty?) + nodesAPIChunkStartTime = (Time.now.to_f * 1000).to_i continuationToken, nodeInventory = KubernetesApiClient.getResourcesAndContinuationToken(resourceUri + "&continue=#{continuationToken}") + nodesAPIChunkEndTime = (Time.now.to_f * 1000).to_i + @nodesAPIE2ELatencyMs = @nodesAPIE2ELatencyMs + (nodesAPIChunkEndTime - nodesAPIChunkStartTime) if (!nodeInventory.nil? && !nodeInventory.empty? && nodeInventory.key?("items") && !nodeInventory["items"].nil? && !nodeInventory["items"].empty?) 
+ $log.info("in_kube_nodes::enumerate : number of node items :#{nodeInventory["items"].length} from Kube API @ #{Time.now.utc.iso8601}") parse_and_emit_records(nodeInventory, batchTime) else $log.warn "in_kube_nodes::enumerate:Received empty nodeInventory" end end + @nodeInventoryE2EProcessingLatencyMs = ((Time.now.to_f * 1000).to_i - nodeInventoryStartTime) + timeDifference = (DateTime.now.to_time.to_i - @@nodeInventoryLatencyTelemetryTimeTracker).abs + timeDifferenceInMinutes = timeDifference / 60 + if (timeDifferenceInMinutes >= Constants::TELEMETRY_FLUSH_INTERVAL_IN_MINUTES) + ApplicationInsightsUtility.sendMetricTelemetry("NodeInventoryE2EProcessingLatencyMs", @nodeInventoryE2EProcessingLatencyMs, {}) + ApplicationInsightsUtility.sendMetricTelemetry("NodesAPIE2ELatencyMs", @nodesAPIE2ELatencyMs, {}) + @@nodeInventoryLatencyTelemetryTimeTracker = DateTime.now.to_time.to_i + end # Setting this to nil so that we dont hold memory until GC kicks in nodeInventory = nil rescue => errorStr @@ -109,77 +151,32 @@ def parse_and_emit_records(nodeInventory, batchTime = Time.utc.iso8601) eventStream = MultiEventStream.new containerNodeInventoryEventStream = MultiEventStream.new insightsMetricsEventStream = MultiEventStream.new + kubePerfEventStream = MultiEventStream.new @@istestvar = ENV["ISTEST"] #get node inventory - nodeInventory["items"].each do |items| - record = {} - # Sending records for ContainerNodeInventory - containerNodeInventoryRecord = {} - containerNodeInventoryRecord["CollectionTime"] = batchTime #This is the time that is mapped to become TimeGenerated - containerNodeInventoryRecord["Computer"] = items["metadata"]["name"] - - record["CollectionTime"] = batchTime #This is the time that is mapped to become TimeGenerated - record["Computer"] = items["metadata"]["name"] - record["ClusterName"] = KubernetesApiClient.getClusterName - record["ClusterId"] = KubernetesApiClient.getClusterId - record["CreationTimeStamp"] = items["metadata"]["creationTimestamp"] - record["Labels"] = [items["metadata"]["labels"]] - record["Status"] = "" - - if !items["spec"]["providerID"].nil? && !items["spec"]["providerID"].empty? - if File.file?(@@AzStackCloudFileName) # existence of this file indicates agent running on azstack - record["KubernetesProviderID"] = "azurestack" - else - #Multicluster kusto query is filtering after splitting by ":" to the left, so do the same here - #https://msazure.visualstudio.com/One/_git/AzureUX-Monitoring?path=%2Fsrc%2FMonitoringExtension%2FClient%2FInfraInsights%2FData%2FQueryTemplates%2FMultiClusterKustoQueryTemplate.ts&_a=contents&version=GBdev - provider = items["spec"]["providerID"].split(":")[0] - if !provider.nil? && !provider.empty? - record["KubernetesProviderID"] = provider - else - record["KubernetesProviderID"] = items["spec"]["providerID"] - end - end - else - record["KubernetesProviderID"] = "onprem" - end - - # Refer to https://kubernetes.io/docs/concepts/architecture/nodes/#condition for possible node conditions. - # We check the status of each condition e.g. {"type": "OutOfDisk","status": "False"} . Based on this we - # populate the KubeNodeInventory Status field. A possible value for this field could be "Ready OutofDisk" - # implying that the node is ready for hosting pods, however its out of disk. - - if items["status"].key?("conditions") && !items["status"]["conditions"].empty? - allNodeConditions = "" - items["status"]["conditions"].each do |condition| - if condition["status"] == "True" - if !allNodeConditions.empty? 
- allNodeConditions = allNodeConditions + "," + condition["type"] - else - allNodeConditions = condition["type"] - end - end - #collect last transition to/from ready (no matter ready is true/false) - if condition["type"] == "Ready" && !condition["lastTransitionTime"].nil? - record["LastTransitionTimeReady"] = condition["lastTransitionTime"] - end - end - if !allNodeConditions.empty? - record["Status"] = allNodeConditions + nodeInventory["items"].each do |item| + # node inventory + nodeInventoryRecord = getNodeInventoryRecord(item, batchTime) + wrapper = { + "DataType" => "KUBE_NODE_INVENTORY_BLOB", + "IPName" => "ContainerInsights", + "DataItems" => [nodeInventoryRecord.each { |k, v| nodeInventoryRecord[k] = v }], + } + eventStream.add(emitTime, wrapper) if wrapper + if @NODES_EMIT_STREAM_BATCH_SIZE > 0 && eventStream.count >= @NODES_EMIT_STREAM_BATCH_SIZE + $log.info("in_kube_node::parse_and_emit_records: number of node inventory records emitted #{@NODES_EMIT_STREAM_BATCH_SIZE} @ #{Time.now.utc.iso8601}") + router.emit_stream(@tag, eventStream) if eventStream + $log.info("in_kube_node::parse_and_emit_records: number of mdm node inventory records emitted #{@NODES_EMIT_STREAM_BATCH_SIZE} @ #{Time.now.utc.iso8601}") + router.emit_stream(@@MDMKubeNodeInventoryTag, eventStream) if eventStream + + if (!@@istestvar.nil? && !@@istestvar.empty? && @@istestvar.casecmp("true") == 0) + $log.info("kubeNodeInventoryEmitStreamSuccess @ #{Time.now.utc.iso8601}") end + eventStream = MultiEventStream.new end - nodeInfo = items["status"]["nodeInfo"] - record["KubeletVersion"] = nodeInfo["kubeletVersion"] - record["KubeProxyVersion"] = nodeInfo["kubeProxyVersion"] - containerNodeInventoryRecord["OperatingSystem"] = nodeInfo["osImage"] - containerRuntimeVersion = nodeInfo["containerRuntimeVersion"] - if containerRuntimeVersion.downcase.start_with?("docker://") - containerNodeInventoryRecord["DockerVersion"] = containerRuntimeVersion.split("//")[1] - else - # using containerRuntimeVersion as DockerVersion as is for non docker runtimes - containerNodeInventoryRecord["DockerVersion"] = containerRuntimeVersion - end - # ContainerNodeInventory data for docker version and operating system. + # container node inventory + containerNodeInventoryRecord = getContainerNodeInventoryRecord(item, batchTime) containerNodeInventoryWrapper = { "DataType" => "CONTAINER_NODE_INVENTORY_BLOB", "IPName" => "ContainerInsights", @@ -187,33 +184,81 @@ def parse_and_emit_records(nodeInventory, batchTime = Time.utc.iso8601) } containerNodeInventoryEventStream.add(emitTime, containerNodeInventoryWrapper) if containerNodeInventoryWrapper - wrapper = { - "DataType" => "KUBE_NODE_INVENTORY_BLOB", - "IPName" => "ContainerInsights", - "DataItems" => [record.each { |k, v| record[k] = v }], - } - eventStream.add(emitTime, wrapper) if wrapper + if @NODES_EMIT_STREAM_BATCH_SIZE > 0 && containerNodeInventoryEventStream.count >= @NODES_EMIT_STREAM_BATCH_SIZE + $log.info("in_kube_node::parse_and_emit_records: number of container node inventory records emitted #{@NODES_EMIT_STREAM_BATCH_SIZE} @ #{Time.now.utc.iso8601}") + router.emit_stream(@@ContainerNodeInventoryTag, containerNodeInventoryEventStream) if containerNodeInventoryEventStream + containerNodeInventoryEventStream = MultiEventStream.new + end + + # node metrics records + nodeMetricRecords = [] + nodeMetricRecord = KubernetesApiClient.parseNodeLimitsFromNodeItem(item, "allocatable", "cpu", "cpuAllocatableNanoCores", batchTime) + if !nodeMetricRecord.nil? && !nodeMetricRecord.empty? 
+ nodeMetricRecords.push(nodeMetricRecord) + end + nodeMetricRecord = KubernetesApiClient.parseNodeLimitsFromNodeItem(item, "allocatable", "memory", "memoryAllocatableBytes", batchTime) + if !nodeMetricRecord.nil? && !nodeMetricRecord.empty? + nodeMetricRecords.push(nodeMetricRecord) + end + nodeMetricRecord = KubernetesApiClient.parseNodeLimitsFromNodeItem(item, "capacity", "cpu", "cpuCapacityNanoCores", batchTime) + if !nodeMetricRecord.nil? && !nodeMetricRecord.empty? + nodeMetricRecords.push(nodeMetricRecord) + end + nodeMetricRecord = KubernetesApiClient.parseNodeLimitsFromNodeItem(item, "capacity", "memory", "memoryCapacityBytes", batchTime) + if !nodeMetricRecord.nil? && !nodeMetricRecord.empty? + nodeMetricRecords.push(nodeMetricRecord) + end + nodeMetricRecords.each do |metricRecord| + metricRecord["DataType"] = "LINUX_PERF_BLOB" + metricRecord["IPName"] = "LogManagement" + kubePerfEventStream.add(emitTime, metricRecord) if metricRecord + end + if @NODES_EMIT_STREAM_BATCH_SIZE > 0 && kubePerfEventStream.count >= @NODES_EMIT_STREAM_BATCH_SIZE + $log.info("in_kube_nodes::parse_and_emit_records: number of node perf metric records emitted #{@NODES_EMIT_STREAM_BATCH_SIZE} @ #{Time.now.utc.iso8601}") + router.emit_stream(@@kubeperfTag, kubePerfEventStream) if kubePerfEventStream + kubePerfEventStream = MultiEventStream.new + end + + # node GPU metrics record + nodeGPUInsightsMetricsRecords = [] + insightsMetricsRecord = KubernetesApiClient.parseNodeLimitsAsInsightsMetrics(item, "allocatable", "nvidia.com/gpu", "nodeGpuAllocatable", batchTime) + if !insightsMetricsRecord.nil? && !insightsMetricsRecord.empty? + nodeGPUInsightsMetricsRecords.push(insightsMetricsRecord) + end + insightsMetricsRecord = KubernetesApiClient.parseNodeLimitsAsInsightsMetrics(item, "capacity", "nvidia.com/gpu", "nodeGpuCapacity", batchTime) + if !insightsMetricsRecord.nil? && !insightsMetricsRecord.empty? + nodeGPUInsightsMetricsRecords.push(insightsMetricsRecord) + end + insightsMetricsRecord = KubernetesApiClient.parseNodeLimitsAsInsightsMetrics(item, "allocatable", "amd.com/gpu", "nodeGpuAllocatable", batchTime) + if !insightsMetricsRecord.nil? && !insightsMetricsRecord.empty? + nodeGPUInsightsMetricsRecords.push(insightsMetricsRecord) + end + insightsMetricsRecord = KubernetesApiClient.parseNodeLimitsAsInsightsMetrics(item, "capacity", "amd.com/gpu", "nodeGpuCapacity", batchTime) + if !insightsMetricsRecord.nil? && !insightsMetricsRecord.empty? 
+ nodeGPUInsightsMetricsRecords.push(insightsMetricsRecord) + end + nodeGPUInsightsMetricsRecords.each do |insightsMetricsRecord| + wrapper = { + "DataType" => "INSIGHTS_METRICS_BLOB", + "IPName" => "ContainerInsights", + "DataItems" => [insightsMetricsRecord.each { |k, v| insightsMetricsRecord[k] = v }], + } + insightsMetricsEventStream.add(emitTime, wrapper) if wrapper + end + if @NODES_EMIT_STREAM_BATCH_SIZE > 0 && insightsMetricsEventStream.count >= @NODES_EMIT_STREAM_BATCH_SIZE + $log.info("in_kube_nodes::parse_and_emit_records: number of GPU node perf metric records emitted #{@NODES_EMIT_STREAM_BATCH_SIZE} @ #{Time.now.utc.iso8601}") + router.emit_stream(Constants::INSIGHTSMETRICS_FLUENT_TAG, insightsMetricsEventStream) if insightsMetricsEventStream + insightsMetricsEventStream = MultiEventStream.new + end # Adding telemetry to send node telemetry every 10 minutes timeDifference = (DateTime.now.to_time.to_i - @@nodeTelemetryTimeTracker).abs timeDifferenceInMinutes = timeDifference / 60 if (timeDifferenceInMinutes >= Constants::TELEMETRY_FLUSH_INTERVAL_IN_MINUTES) - properties = {} - properties["Computer"] = record["Computer"] - properties["KubeletVersion"] = record["KubeletVersion"] - properties["OperatingSystem"] = nodeInfo["operatingSystem"] - # DockerVersion field holds docker version if runtime is docker/moby else :// - if containerRuntimeVersion.downcase.start_with?("docker://") - properties["DockerVersion"] = containerRuntimeVersion.split("//")[1] - else - properties["DockerVersion"] = containerRuntimeVersion - end - properties["KubernetesProviderID"] = record["KubernetesProviderID"] - properties["KernelVersion"] = nodeInfo["kernelVersion"] - properties["OSImage"] = nodeInfo["osImage"] + properties = getNodeTelemetryProps(item) + properties["KubernetesProviderID"] = nodeInventoryRecord["KubernetesProviderID"] + capacityInfo = item["status"]["capacity"] - capacityInfo = items["status"]["capacity"] ApplicationInsightsUtility.sendMetricTelemetry("NodeMemory", capacityInfo["memory"], properties) - begin if (!capacityInfo["nvidia.com/gpu"].nil?) && (!capacityInfo["nvidia.com/gpu"].empty?) properties["nvigpus"] = capacityInfo["nvidia.com/gpu"] @@ -247,72 +292,32 @@ def parse_and_emit_records(nodeInventory, batchTime = Time.utc.iso8601) telemetrySent = true end end - router.emit_stream(@tag, eventStream) if eventStream - router.emit_stream(@@MDMKubeNodeInventoryTag, eventStream) if eventStream - router.emit_stream(@@ContainerNodeInventoryTag, containerNodeInventoryEventStream) if containerNodeInventoryEventStream if telemetrySent == true @@nodeTelemetryTimeTracker = DateTime.now.to_time.to_i end - - if (!@@istestvar.nil? && !@@istestvar.empty? && @@istestvar.casecmp("true") == 0 && eventStream.count > 0) - $log.info("kubeNodeInventoryEmitStreamSuccess @ #{Time.now.utc.iso8601}") + if eventStream.count > 0 + $log.info("in_kube_node::parse_and_emit_records: number of node inventory records emitted #{eventStream.count} @ #{Time.now.utc.iso8601}") + router.emit_stream(@tag, eventStream) if eventStream + $log.info("in_kube_node::parse_and_emit_records: number of mdm node inventory records emitted #{eventStream.count} @ #{Time.now.utc.iso8601}") + router.emit_stream(@@MDMKubeNodeInventoryTag, eventStream) if eventStream + eventStream = nil end - #:optimize:kubeperf merge - begin - #if(!nodeInventory.empty?) 
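# --- Editor's note: illustrative sketch only, not part of this patch ---------------
# The emit logic added above follows one batching pattern throughout: records are
# appended to a Fluent MultiEventStream and the stream is flushed to the router
# whenever the configured *_EMIT_STREAM_BATCH_SIZE is reached, with one final flush
# for any remainder. A minimal sketch of that pattern, assuming `router`, `tag`,
# `batch_size` and `emit_time` are supplied by the surrounding Fluentd input plugin
# (the helper name is hypothetical):
def emit_records_in_batches(records, router, tag, batch_size, emit_time)
  event_stream = MultiEventStream.new
  records.each do |record|
    event_stream.add(emit_time, record)
    if batch_size > 0 && event_stream.count >= batch_size
      router.emit_stream(tag, event_stream) # flush a full batch
      event_stream = MultiEventStream.new   # start accumulating the next batch
    end
  end
  # flush whatever is left once all records have been added
  router.emit_stream(tag, event_stream) if event_stream.count > 0
end
# --- end of editor's sketch ---------------------------------------------------------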
- nodeMetricDataItems = [] - #allocatable metrics @ node level - nodeMetricDataItems.concat(KubernetesApiClient.parseNodeLimits(nodeInventory, "allocatable", "cpu", "cpuAllocatableNanoCores", batchTime)) - nodeMetricDataItems.concat(KubernetesApiClient.parseNodeLimits(nodeInventory, "allocatable", "memory", "memoryAllocatableBytes", batchTime)) - #capacity metrics @ node level - nodeMetricDataItems.concat(KubernetesApiClient.parseNodeLimits(nodeInventory, "capacity", "cpu", "cpuCapacityNanoCores", batchTime)) - nodeMetricDataItems.concat(KubernetesApiClient.parseNodeLimits(nodeInventory, "capacity", "memory", "memoryCapacityBytes", batchTime)) - - kubePerfEventStream = MultiEventStream.new - - nodeMetricDataItems.each do |record| - record["DataType"] = "LINUX_PERF_BLOB" - record["IPName"] = "LogManagement" - kubePerfEventStream.add(emitTime, record) if record - end - #end - router.emit_stream(@@kubeperfTag, kubePerfEventStream) if kubePerfEventStream - - #start GPU InsightsMetrics items - begin - nodeGPUInsightsMetricsDataItems = [] - nodeGPUInsightsMetricsDataItems.concat(KubernetesApiClient.parseNodeLimitsAsInsightsMetrics(nodeInventory, "allocatable", "nvidia.com/gpu", "nodeGpuAllocatable", batchTime)) - nodeGPUInsightsMetricsDataItems.concat(KubernetesApiClient.parseNodeLimitsAsInsightsMetrics(nodeInventory, "capacity", "nvidia.com/gpu", "nodeGpuCapacity", batchTime)) - - nodeGPUInsightsMetricsDataItems.concat(KubernetesApiClient.parseNodeLimitsAsInsightsMetrics(nodeInventory, "allocatable", "amd.com/gpu", "nodeGpuAllocatable", batchTime)) - nodeGPUInsightsMetricsDataItems.concat(KubernetesApiClient.parseNodeLimitsAsInsightsMetrics(nodeInventory, "capacity", "amd.com/gpu", "nodeGpuCapacity", batchTime)) - - nodeGPUInsightsMetricsDataItems.each do |insightsMetricsRecord| - wrapper = { - "DataType" => "INSIGHTS_METRICS_BLOB", - "IPName" => "ContainerInsights", - "DataItems" => [insightsMetricsRecord.each { |k, v| insightsMetricsRecord[k] = v }], - } - insightsMetricsEventStream.add(emitTime, wrapper) if wrapper - end - - router.emit_stream(Constants::INSIGHTSMETRICS_FLUENT_TAG, insightsMetricsEventStream) if insightsMetricsEventStream - if (!@@istestvar.nil? && !@@istestvar.empty? 
&& @@istestvar.casecmp("true") == 0 && insightsMetricsEventStream.count > 0) - $log.info("kubeNodeInsightsMetricsEmitStreamSuccess @ #{Time.now.utc.iso8601}") - end - rescue => errorStr - $log.warn "Failed when processing GPU metrics in_kube_nodes : #{errorStr}" - $log.debug_backtrace(errorStr.backtrace) - ApplicationInsightsUtility.sendExceptionTelemetry(errorStr) - end - #end GPU InsightsMetrics items - rescue => errorStr - $log.warn "Failed in enumerate for KubePerf from in_kube_nodes : #{errorStr}" - $log.debug_backtrace(errorStr.backtrace) - ApplicationInsightsUtility.sendExceptionTelemetry(errorStr) + if containerNodeInventoryEventStream.count > 0 + $log.info("in_kube_node::parse_and_emit_records: number of container node inventory records emitted #{containerNodeInventoryEventStream.count} @ #{Time.now.utc.iso8601}") + router.emit_stream(@@ContainerNodeInventoryTag, containerNodeInventoryEventStream) if containerNodeInventoryEventStream + containerNodeInventoryEventStream = nil end - #:optimize:end kubeperf merge + if kubePerfEventStream.count > 0 + $log.info("in_kube_nodes::parse_and_emit_records: number of node perf metric records emitted #{kubePerfEventStream.count} @ #{Time.now.utc.iso8601}") + router.emit_stream(@@kubeperfTag, kubePerfEventStream) if kubePerfEventStream + kubePerfEventStream = nil + end + if insightsMetricsEventStream.count > 0 + $log.info("in_kube_nodes::parse_and_emit_records: number of GPU node perf metric records emitted #{insightsMetricsEventStream.count} @ #{Time.now.utc.iso8601}") + router.emit_stream(Constants::INSIGHTSMETRICS_FLUENT_TAG, insightsMetricsEventStream) if insightsMetricsEventStream + insightsMetricsEventStream = nil + end rescue => errorStr $log.warn "Failed to retrieve node inventory: #{errorStr}" $log.debug_backtrace(errorStr.backtrace) @@ -352,5 +357,112 @@ def run_periodic end @mutex.unlock end + + # TODO - move this method to KubernetesClient or helper class + def getNodeInventoryRecord(item, batchTime = Time.utc.iso8601) + record = {} + begin + record["CollectionTime"] = batchTime #This is the time that is mapped to become TimeGenerated + record["Computer"] = item["metadata"]["name"] + record["ClusterName"] = KubernetesApiClient.getClusterName + record["ClusterId"] = KubernetesApiClient.getClusterId + record["CreationTimeStamp"] = item["metadata"]["creationTimestamp"] + record["Labels"] = [item["metadata"]["labels"]] + record["Status"] = "" + + if !item["spec"]["providerID"].nil? && !item["spec"]["providerID"].empty? + if File.file?(@@AzStackCloudFileName) # existence of this file indicates agent running on azstack + record["KubernetesProviderID"] = "azurestack" + else + #Multicluster kusto query is filtering after splitting by ":" to the left, so do the same here + #https://msazure.visualstudio.com/One/_git/AzureUX-Monitoring?path=%2Fsrc%2FMonitoringExtension%2FClient%2FInfraInsights%2FData%2FQueryTemplates%2FMultiClusterKustoQueryTemplate.ts&_a=contents&version=GBdev + provider = item["spec"]["providerID"].split(":")[0] + if !provider.nil? && !provider.empty? + record["KubernetesProviderID"] = provider + else + record["KubernetesProviderID"] = item["spec"]["providerID"] + end + end + else + record["KubernetesProviderID"] = "onprem" + end + + # Refer to https://kubernetes.io/docs/concepts/architecture/nodes/#condition for possible node conditions. + # We check the status of each condition e.g. {"type": "OutOfDisk","status": "False"} . Based on this we + # populate the KubeNodeInventory Status field. 
A possible value for this field could be "Ready OutofDisk" + # implying that the node is ready for hosting pods, however its out of disk. + if item["status"].key?("conditions") && !item["status"]["conditions"].empty? + allNodeConditions = "" + item["status"]["conditions"].each do |condition| + if condition["status"] == "True" + if !allNodeConditions.empty? + allNodeConditions = allNodeConditions + "," + condition["type"] + else + allNodeConditions = condition["type"] + end + end + #collect last transition to/from ready (no matter ready is true/false) + if condition["type"] == "Ready" && !condition["lastTransitionTime"].nil? + record["LastTransitionTimeReady"] = condition["lastTransitionTime"] + end + end + if !allNodeConditions.empty? + record["Status"] = allNodeConditions + end + end + nodeInfo = item["status"]["nodeInfo"] + record["KubeletVersion"] = nodeInfo["kubeletVersion"] + record["KubeProxyVersion"] = nodeInfo["kubeProxyVersion"] + rescue => errorStr + $log.warn "in_kube_nodes::getNodeInventoryRecord:Failed: #{errorStr}" + end + return record + end + + # TODO - move this method to KubernetesClient or helper class + def getContainerNodeInventoryRecord(item, batchTime = Time.utc.iso8601) + containerNodeInventoryRecord = {} + begin + containerNodeInventoryRecord["CollectionTime"] = batchTime #This is the time that is mapped to become TimeGenerated + containerNodeInventoryRecord["Computer"] = item["metadata"]["name"] + nodeInfo = item["status"]["nodeInfo"] + containerNodeInventoryRecord["OperatingSystem"] = nodeInfo["osImage"] + containerRuntimeVersion = nodeInfo["containerRuntimeVersion"] + if containerRuntimeVersion.downcase.start_with?("docker://") + containerNodeInventoryRecord["DockerVersion"] = containerRuntimeVersion.split("//")[1] + else + # using containerRuntimeVersion as DockerVersion as is for non docker runtimes + containerNodeInventoryRecord["DockerVersion"] = containerRuntimeVersion + end + rescue => errorStr + $log.warn "in_kube_nodes::getContainerNodeInventoryRecord:Failed: #{errorStr}" + end + return containerNodeInventoryRecord + end + + # TODO - move this method to KubernetesClient or helper class + def getNodeTelemetryProps(item) + properties = {} + begin + properties["Computer"] = item["metadata"]["name"] + nodeInfo = item["status"]["nodeInfo"] + properties["KubeletVersion"] = nodeInfo["kubeletVersion"] + properties["OperatingSystem"] = nodeInfo["osImage"] + properties["KernelVersion"] = nodeInfo["kernelVersion"] + properties["OSImage"] = nodeInfo["osImage"] + containerRuntimeVersion = nodeInfo["containerRuntimeVersion"] + if containerRuntimeVersion.downcase.start_with?("docker://") + properties["DockerVersion"] = containerRuntimeVersion.split("//")[1] + else + # using containerRuntimeVersion as DockerVersion as is for non docker runtimes + properties["DockerVersion"] = containerRuntimeVersion + end + properties["NODES_CHUNK_SIZE"] = @NODES_CHUNK_SIZE + properties["NODES_EMIT_STREAM_BATCH_SIZE"] = @NODES_EMIT_STREAM_BATCH_SIZE + rescue => errorStr + $log.warn "in_kube_nodes::getContainerNodeIngetNodeTelemetryPropsventoryRecord:Failed: #{errorStr}" + end + return properties + end end # Kube_Node_Input end # module diff --git a/source/plugins/ruby/in_kube_podinventory.rb b/source/plugins/ruby/in_kube_podinventory.rb index 4880d80e7..0cff2eefe 100644 --- a/source/plugins/ruby/in_kube_podinventory.rb +++ b/source/plugins/ruby/in_kube_podinventory.rb @@ -2,7 +2,7 @@ # frozen_string_literal: true module Fluent - require_relative "podinventory_to_mdm" + require_relative 
"podinventory_to_mdm" class Kube_PodInventory_Input < Input Plugin.register_input("kubepodinventory", self) @@ -19,7 +19,7 @@ def initialize require "yajl" require "set" require "time" - + require_relative "kubernetes_container_inventory" require_relative "KubernetesApiClient" require_relative "ApplicationInsightsUtility" @@ -27,24 +27,48 @@ def initialize require_relative "omslog" require_relative "constants" - @PODS_CHUNK_SIZE = "1500" + # refer tomlparser-agent-config for updating defaults + # this configurable via configmap + @PODS_CHUNK_SIZE = 0 + @PODS_EMIT_STREAM_BATCH_SIZE = 0 + @podCount = 0 + @serviceCount = 0 @controllerSet = Set.new [] @winContainerCount = 0 @controllerData = {} + @podInventoryE2EProcessingLatencyMs = 0 + @podsAPIE2ELatencyMs = 0 end config_param :run_interval, :time, :default => 60 config_param :tag, :string, :default => "oms.containerinsights.KubePodInventory" - config_param :custom_metrics_azure_regions, :string def configure(conf) super - @inventoryToMdmConvertor = Inventory2MdmConvertor.new(@custom_metrics_azure_regions) + @inventoryToMdmConvertor = Inventory2MdmConvertor.new() end def start if @run_interval + if !ENV["PODS_CHUNK_SIZE"].nil? && !ENV["PODS_CHUNK_SIZE"].empty? && ENV["PODS_CHUNK_SIZE"].to_i > 0 + @PODS_CHUNK_SIZE = ENV["PODS_CHUNK_SIZE"].to_i + else + # this shouldnt happen just setting default here as safe guard + $log.warn("in_kube_podinventory::start: setting to default value since got PODS_CHUNK_SIZE nil or empty") + @PODS_CHUNK_SIZE = 1000 + end + $log.info("in_kube_podinventory::start : PODS_CHUNK_SIZE @ #{@PODS_CHUNK_SIZE}") + + if !ENV["PODS_EMIT_STREAM_BATCH_SIZE"].nil? && !ENV["PODS_EMIT_STREAM_BATCH_SIZE"].empty? && ENV["PODS_EMIT_STREAM_BATCH_SIZE"].to_i > 0 + @PODS_EMIT_STREAM_BATCH_SIZE = ENV["PODS_EMIT_STREAM_BATCH_SIZE"].to_i + else + # this shouldnt happen just setting default here as safe guard + $log.warn("in_kube_podinventory::start: setting to default value since got PODS_EMIT_STREAM_BATCH_SIZE nil or empty") + @PODS_EMIT_STREAM_BATCH_SIZE = 200 + end + $log.info("in_kube_podinventory::start : PODS_EMIT_STREAM_BATCH_SIZE @ #{@PODS_EMIT_STREAM_BATCH_SIZE}") + @finished = false @condition = ConditionVariable.new @mutex = Mutex.new @@ -68,12 +92,15 @@ def enumerate(podList = nil) podInventory = podList telemetryFlush = false @podCount = 0 + @serviceCount = 0 @controllerSet = Set.new [] @winContainerCount = 0 @controllerData = {} currentTime = Time.now batchTime = currentTime.utc.iso8601 - + serviceRecords = [] + @podInventoryE2EProcessingLatencyMs = 0 + podInventoryStartTime = (Time.now.to_f * 1000).to_i # Get services first so that we dont need to make a call for very chunk $log.info("in_kube_podinventory::enumerate : Getting services from Kube API @ #{Time.now.utc.iso8601}") serviceInfo = KubernetesApiClient.getKubeResourceInfo("services") @@ -85,32 +112,48 @@ def enumerate(podList = nil) serviceList = Yajl::Parser.parse(StringIO.new(serviceInfo.body)) $log.info("in_kube_podinventory::enumerate:End:Parsing services data using yajl @ #{Time.now.utc.iso8601}") serviceInfo = nil + # service inventory records much smaller and fixed size compared to serviceList + serviceRecords = KubernetesApiClient.getKubeServicesInventoryRecords(serviceList, batchTime) + # updating for telemetry + @serviceCount += serviceRecords.length + serviceList = nil end + # to track e2e processing latency + @podsAPIE2ELatencyMs = 0 + podsAPIChunkStartTime = (Time.now.to_f * 1000).to_i # Initializing continuation token to nil continuationToken = nil 
$log.info("in_kube_podinventory::enumerate : Getting pods from Kube API @ #{Time.now.utc.iso8601}") continuationToken, podInventory = KubernetesApiClient.getResourcesAndContinuationToken("pods?limit=#{@PODS_CHUNK_SIZE}") $log.info("in_kube_podinventory::enumerate : Done getting pods from Kube API @ #{Time.now.utc.iso8601}") + podsAPIChunkEndTime = (Time.now.to_f * 1000).to_i + @podsAPIE2ELatencyMs = (podsAPIChunkEndTime - podsAPIChunkStartTime) if (!podInventory.nil? && !podInventory.empty? && podInventory.key?("items") && !podInventory["items"].nil? && !podInventory["items"].empty?) - parse_and_emit_records(podInventory, serviceList, continuationToken, batchTime) + $log.info("in_kube_podinventory::enumerate : number of pod items :#{podInventory["items"].length} from Kube API @ #{Time.now.utc.iso8601}") + parse_and_emit_records(podInventory, serviceRecords, continuationToken, batchTime) else $log.warn "in_kube_podinventory::enumerate:Received empty podInventory" end #If we receive a continuation token, make calls, process and flush data until we have processed all data while (!continuationToken.nil? && !continuationToken.empty?) + podsAPIChunkStartTime = (Time.now.to_f * 1000).to_i continuationToken, podInventory = KubernetesApiClient.getResourcesAndContinuationToken("pods?limit=#{@PODS_CHUNK_SIZE}&continue=#{continuationToken}") + podsAPIChunkEndTime = (Time.now.to_f * 1000).to_i + @podsAPIE2ELatencyMs = @podsAPIE2ELatencyMs + (podsAPIChunkEndTime - podsAPIChunkStartTime) if (!podInventory.nil? && !podInventory.empty? && podInventory.key?("items") && !podInventory["items"].nil? && !podInventory["items"].empty?) - parse_and_emit_records(podInventory, serviceList, continuationToken, batchTime) + $log.info("in_kube_podinventory::enumerate : number of pod items :#{podInventory["items"].length} from Kube API @ #{Time.now.utc.iso8601}") + parse_and_emit_records(podInventory, serviceRecords, continuationToken, batchTime) else $log.warn "in_kube_podinventory::enumerate:Received empty podInventory" end end + @podInventoryE2EProcessingLatencyMs = ((Time.now.to_f * 1000).to_i - podInventoryStartTime) # Setting these to nil so that we dont hold memory until GC kicks in podInventory = nil - serviceList = nil + serviceRecords = nil # Adding telemetry to send pod telemetry every 5 minutes timeDifference = (DateTime.now.to_time.to_i - @@podTelemetryTimeTracker).abs @@ -123,14 +166,19 @@ def enumerate(podList = nil) if telemetryFlush == true telemetryProperties = {} telemetryProperties["Computer"] = @@hostName + telemetryProperties["PODS_CHUNK_SIZE"] = @PODS_CHUNK_SIZE + telemetryProperties["PODS_EMIT_STREAM_BATCH_SIZE"] = @PODS_EMIT_STREAM_BATCH_SIZE ApplicationInsightsUtility.sendCustomEvent("KubePodInventoryHeartBeatEvent", telemetryProperties) ApplicationInsightsUtility.sendMetricTelemetry("PodCount", @podCount, {}) + ApplicationInsightsUtility.sendMetricTelemetry("ServiceCount", @serviceCount, {}) telemetryProperties["ControllerData"] = @controllerData.to_json ApplicationInsightsUtility.sendMetricTelemetry("ControllerCount", @controllerSet.length, telemetryProperties) if @winContainerCount > 0 telemetryProperties["ClusterWideWindowsContainersCount"] = @winContainerCount ApplicationInsightsUtility.sendCustomEvent("WindowsContainerInventoryEvent", telemetryProperties) end + ApplicationInsightsUtility.sendMetricTelemetry("PodInventoryE2EProcessingLatencyMs", @podInventoryE2EProcessingLatencyMs, telemetryProperties) + ApplicationInsightsUtility.sendMetricTelemetry("PodsAPIE2ELatencyMs", 
@podsAPIE2ELatencyMs, telemetryProperties) @@podTelemetryTimeTracker = DateTime.now.to_time.to_i end rescue => errorStr @@ -138,260 +186,138 @@ def enumerate(podList = nil) $log.debug_backtrace(errorStr.backtrace) ApplicationInsightsUtility.sendExceptionTelemetry(errorStr) end - end + end - def parse_and_emit_records(podInventory, serviceList, continuationToken, batchTime = Time.utc.iso8601) + def parse_and_emit_records(podInventory, serviceRecords, continuationToken, batchTime = Time.utc.iso8601) currentTime = Time.now emitTime = currentTime.to_f #batchTime = currentTime.utc.iso8601 eventStream = MultiEventStream.new + kubePerfEventStream = MultiEventStream.new + insightsMetricsEventStream = MultiEventStream.new @@istestvar = ENV["ISTEST"] begin #begin block start # Getting windows nodes from kubeapi winNodes = KubernetesApiClient.getWindowsNodesArray - - podInventory["items"].each do |items| #podInventory block start - containerInventoryRecords = [] - records = [] - record = {} - record["CollectionTime"] = batchTime #This is the time that is mapped to become TimeGenerated - record["Name"] = items["metadata"]["name"] - podNameSpace = items["metadata"]["namespace"] - - # For ARO v3 cluster, skip the pods scheduled on to master or infra nodes - if KubernetesApiClient.isAROV3Cluster && !items["spec"].nil? && !items["spec"]["nodeName"].nil? && - (items["spec"]["nodeName"].downcase.start_with?("infra-") || - items["spec"]["nodeName"].downcase.start_with?("master-")) - next - end - - podUid = KubernetesApiClient.getPodUid(podNameSpace, items["metadata"]) - if podUid.nil? - next - end - record["PodUid"] = podUid - record["PodLabel"] = [items["metadata"]["labels"]] - record["Namespace"] = podNameSpace - record["PodCreationTimeStamp"] = items["metadata"]["creationTimestamp"] - #for unscheduled (non-started) pods startTime does NOT exist - if !items["status"]["startTime"].nil? - record["PodStartTime"] = items["status"]["startTime"] - else - record["PodStartTime"] = "" - end - #podStatus - # the below is for accounting 'NodeLost' scenario, where-in the pod(s) in the lost node is still being reported as running - podReadyCondition = true - if !items["status"]["reason"].nil? && items["status"]["reason"] == "NodeLost" && !items["status"]["conditions"].nil? - items["status"]["conditions"].each do |condition| - if condition["type"] == "Ready" && condition["status"] == "False" - podReadyCondition = false - break - end + podInventory["items"].each do |item| #podInventory block start + # pod inventory records + podInventoryRecords = getPodInventoryRecords(item, serviceRecords, batchTime) + podInventoryRecords.each do |record| + if !record.nil? + wrapper = { + "DataType" => "KUBE_POD_INVENTORY_BLOB", + "IPName" => "ContainerInsights", + "DataItems" => [record.each { |k, v| record[k] = v }], + } + eventStream.add(emitTime, wrapper) if wrapper + @inventoryToMdmConvertor.process_pod_inventory_record(wrapper) end end - - if podReadyCondition == false - record["PodStatus"] = "Unknown" - # ICM - https://portal.microsofticm.com/imp/v3/incidents/details/187091803/home - elsif !items["metadata"]["deletionTimestamp"].nil? && !items["metadata"]["deletionTimestamp"].empty? - record["PodStatus"] = Constants::POD_STATUS_TERMINATING - else - record["PodStatus"] = items["status"]["phase"] - end - #for unscheduled (non-started) pods podIP does NOT exist - if !items["status"]["podIP"].nil? 
- record["PodIp"] = items["status"]["podIP"] - else - record["PodIp"] = "" - end - #for unscheduled (non-started) pods nodeName does NOT exist - if !items["spec"]["nodeName"].nil? - record["Computer"] = items["spec"]["nodeName"] - else - record["Computer"] = "" - end - # Setting this flag to true so that we can send ContainerInventory records for containers # on windows nodes and parse environment variables for these containers if winNodes.length > 0 - if (!record["Computer"].empty? && (winNodes.include? record["Computer"])) + nodeName = "" + if !item["spec"]["nodeName"].nil? + nodeName = item["spec"]["nodeName"] + end + if (!nodeName.empty? && (winNodes.include? nodeName)) clusterCollectEnvironmentVar = ENV["AZMON_CLUSTER_COLLECT_ENV_VAR"] #Generate ContainerInventory records for windows nodes so that we can get image and image tag in property panel - containerInventoryRecordsInPodItem = KubernetesContainerInventory.getContainerInventoryRecords(items, batchTime, clusterCollectEnvironmentVar, true) - containerInventoryRecordsInPodItem.each do |containerRecord| - containerInventoryRecords.push(containerRecord) - end + containerInventoryRecords = KubernetesContainerInventory.getContainerInventoryRecords(item, batchTime, clusterCollectEnvironmentVar, true) + # Send container inventory records for containers on windows nodes + @winContainerCount += containerInventoryRecords.length + containerInventoryRecords.each do |cirecord| + if !cirecord.nil? + ciwrapper = { + "DataType" => "CONTAINER_INVENTORY_BLOB", + "IPName" => "ContainerInsights", + "DataItems" => [cirecord.each { |k, v| cirecord[k] = v }], + } + eventStream.add(emitTime, ciwrapper) if ciwrapper + end + end end end - record["ClusterId"] = KubernetesApiClient.getClusterId - record["ClusterName"] = KubernetesApiClient.getClusterName - record["ServiceName"] = getServiceNameFromLabels(items["metadata"]["namespace"], items["metadata"]["labels"], serviceList) - - if !items["metadata"]["ownerReferences"].nil? - record["ControllerKind"] = items["metadata"]["ownerReferences"][0]["kind"] - record["ControllerName"] = items["metadata"]["ownerReferences"][0]["name"] - @controllerSet.add(record["ControllerKind"] + record["ControllerName"]) - #Adding controller kind to telemetry ro information about customer workload - if (@controllerData[record["ControllerKind"]].nil?) - @controllerData[record["ControllerKind"]] = 1 - else - controllerValue = @controllerData[record["ControllerKind"]] - @controllerData[record["ControllerKind"]] += 1 + if @PODS_EMIT_STREAM_BATCH_SIZE > 0 && eventStream.count >= @PODS_EMIT_STREAM_BATCH_SIZE + $log.info("in_kube_podinventory::parse_and_emit_records: number of pod inventory records emitted #{@PODS_EMIT_STREAM_BATCH_SIZE} @ #{Time.now.utc.iso8601}") + if (!@@istestvar.nil? && !@@istestvar.empty? 
&& @@istestvar.casecmp("true") == 0) + $log.info("kubePodInventoryEmitStreamSuccess @ #{Time.now.utc.iso8601}") end + router.emit_stream(@tag, eventStream) if eventStream + eventStream = MultiEventStream.new end - podRestartCount = 0 - record["PodRestartCount"] = 0 - #Invoke the helper method to compute ready/not ready mdm metric - @inventoryToMdmConvertor.process_record_for_pods_ready_metric(record["ControllerName"], record["Namespace"], items["status"]["conditions"]) + #container perf records + containerMetricDataItems = [] + containerMetricDataItems.concat(KubernetesApiClient.getContainerResourceRequestsAndLimits(item, "requests", "cpu", "cpuRequestNanoCores", batchTime)) + containerMetricDataItems.concat(KubernetesApiClient.getContainerResourceRequestsAndLimits(item, "requests", "memory", "memoryRequestBytes", batchTime)) + containerMetricDataItems.concat(KubernetesApiClient.getContainerResourceRequestsAndLimits(item, "limits", "cpu", "cpuLimitNanoCores", batchTime)) + containerMetricDataItems.concat(KubernetesApiClient.getContainerResourceRequestsAndLimits(item, "limits", "memory", "memoryLimitBytes", batchTime)) - podContainers = [] - if items["status"].key?("containerStatuses") && !items["status"]["containerStatuses"].empty? - podContainers = podContainers + items["status"]["containerStatuses"] - end - # Adding init containers to the record list as well. - if items["status"].key?("initContainerStatuses") && !items["status"]["initContainerStatuses"].empty? - podContainers = podContainers + items["status"]["initContainerStatuses"] + containerMetricDataItems.each do |record| + record["DataType"] = "LINUX_PERF_BLOB" + record["IPName"] = "LogManagement" + kubePerfEventStream.add(emitTime, record) if record end - # if items["status"].key?("containerStatuses") && !items["status"]["containerStatuses"].empty? #container status block start - if !podContainers.empty? #container status block start - podContainers.each do |container| - containerRestartCount = 0 - lastFinishedTime = nil - # Need this flag to determine if we need to process container data for mdm metrics like oomkilled and container restart - #container Id is of the form - #docker://dfd9da983f1fd27432fb2c1fe3049c0a1d25b1c697b2dc1a530c986e58b16527 - if !container["containerID"].nil? - record["ContainerID"] = container["containerID"].split("//")[1] - else - # for containers that have image issues (like invalid image/tag etc..) this will be empty. do not make it all 0 - record["ContainerID"] = "" - end - #keeping this as which is same as InstanceName in perf table - if podUid.nil? || container["name"].nil? - next - else - record["ContainerName"] = podUid + "/" + container["name"] - end - #Pod restart count is a sumtotal of restart counts of individual containers - #within the pod. The restart count of a container is maintained by kubernetes - #itself in the form of a container label. 
- containerRestartCount = container["restartCount"] - record["ContainerRestartCount"] = containerRestartCount - - containerStatus = container["state"] - record["ContainerStatusReason"] = "" - # state is of the following form , so just picking up the first key name - # "state": { - # "waiting": { - # "reason": "CrashLoopBackOff", - # "message": "Back-off 5m0s restarting failed container=metrics-server pod=metrics-server-2011498749-3g453_kube-system(5953be5f-fcae-11e7-a356-000d3ae0e432)" - # } - # }, - # the below is for accounting 'NodeLost' scenario, where-in the containers in the lost node/pod(s) is still being reported as running - if podReadyCondition == false - record["ContainerStatus"] = "Unknown" - else - record["ContainerStatus"] = containerStatus.keys[0] - end - #TODO : Remove ContainerCreationTimeStamp from here since we are sending it as a metric - #Picking up both container and node start time from cAdvisor to be consistent - if containerStatus.keys[0] == "running" - record["ContainerCreationTimeStamp"] = container["state"]["running"]["startedAt"] - else - if !containerStatus[containerStatus.keys[0]]["reason"].nil? && !containerStatus[containerStatus.keys[0]]["reason"].empty? - record["ContainerStatusReason"] = containerStatus[containerStatus.keys[0]]["reason"] - end - # Process the record to see if job was completed 6 hours ago. If so, send metric to mdm - if !record["ControllerKind"].nil? && record["ControllerKind"].downcase == Constants::CONTROLLER_KIND_JOB - @inventoryToMdmConvertor.process_record_for_terminated_job_metric(record["ControllerName"], record["Namespace"], containerStatus) - end - end - - # Record the last state of the container. This may have information on why a container was killed. - begin - if !container["lastState"].nil? && container["lastState"].keys.length == 1 - lastStateName = container["lastState"].keys[0] - lastStateObject = container["lastState"][lastStateName] - if !lastStateObject.is_a?(Hash) - raise "expected a hash object. This could signify a bug or a kubernetes API change" - end - - if lastStateObject.key?("reason") && lastStateObject.key?("startedAt") && lastStateObject.key?("finishedAt") - newRecord = Hash.new - newRecord["lastState"] = lastStateName # get the name of the last state (ex: terminated) - lastStateReason = lastStateObject["reason"] - # newRecord["reason"] = lastStateObject["reason"] # (ex: OOMKilled) - newRecord["reason"] = lastStateReason # (ex: OOMKilled) - newRecord["startedAt"] = lastStateObject["startedAt"] # (ex: 2019-07-02T14:58:51Z) - lastFinishedTime = lastStateObject["finishedAt"] - newRecord["finishedAt"] = lastFinishedTime # (ex: 2019-07-02T14:58:52Z) - - # only write to the output field if everything previously ran without error - record["ContainerLastStatus"] = newRecord - - #Populate mdm metric for OOMKilled container count if lastStateReason is OOMKilled - if lastStateReason.downcase == Constants::REASON_OOM_KILLED - @inventoryToMdmConvertor.process_record_for_oom_killed_metric(record["ControllerName"], record["Namespace"], lastFinishedTime) - end - lastStateReason = nil - else - record["ContainerLastStatus"] = Hash.new - end - else - record["ContainerLastStatus"] = Hash.new - end - - #Populate mdm metric for container restart count if greater than 0 - if (!containerRestartCount.nil? && (containerRestartCount.is_a? 
Integer) && containerRestartCount > 0) - @inventoryToMdmConvertor.process_record_for_container_restarts_metric(record["ControllerName"], record["Namespace"], lastFinishedTime) - end - rescue => errorStr - $log.warn "Failed in parse_and_emit_record pod inventory while processing ContainerLastStatus: #{errorStr}" - $log.debug_backtrace(errorStr.backtrace) - ApplicationInsightsUtility.sendExceptionTelemetry(errorStr) - record["ContainerLastStatus"] = Hash.new - end + if @PODS_EMIT_STREAM_BATCH_SIZE > 0 && kubePerfEventStream.count >= @PODS_EMIT_STREAM_BATCH_SIZE + $log.info("in_kube_podinventory::parse_and_emit_records: number of container perf records emitted #{@PODS_EMIT_STREAM_BATCH_SIZE} @ #{Time.now.utc.iso8601}") + router.emit_stream(@@kubeperfTag, kubePerfEventStream) if kubePerfEventStream + kubePerfEventStream = MultiEventStream.new + end - podRestartCount += containerRestartCount - records.push(record.dup) - end - else # for unscheduled pods there are no status.containerStatuses, in this case we still want the pod - records.push(record) - end #container status block end - records.each do |record| - if !record.nil? - record["PodRestartCount"] = podRestartCount - wrapper = { - "DataType" => "KUBE_POD_INVENTORY_BLOB", - "IPName" => "ContainerInsights", - "DataItems" => [record.each { |k, v| record[k] = v }], - } - eventStream.add(emitTime, wrapper) if wrapper - @inventoryToMdmConvertor.process_pod_inventory_record(wrapper) - end + # container GPU records + containerGPUInsightsMetricsDataItems = [] + containerGPUInsightsMetricsDataItems.concat(KubernetesApiClient.getContainerResourceRequestsAndLimitsAsInsightsMetrics(item, "requests", "nvidia.com/gpu", "containerGpuRequests", batchTime)) + containerGPUInsightsMetricsDataItems.concat(KubernetesApiClient.getContainerResourceRequestsAndLimitsAsInsightsMetrics(item, "limits", "nvidia.com/gpu", "containerGpuLimits", batchTime)) + containerGPUInsightsMetricsDataItems.concat(KubernetesApiClient.getContainerResourceRequestsAndLimitsAsInsightsMetrics(item, "requests", "amd.com/gpu", "containerGpuRequests", batchTime)) + containerGPUInsightsMetricsDataItems.concat(KubernetesApiClient.getContainerResourceRequestsAndLimitsAsInsightsMetrics(item, "limits", "amd.com/gpu", "containerGpuLimits", batchTime)) + containerGPUInsightsMetricsDataItems.each do |insightsMetricsRecord| + wrapper = { + "DataType" => "INSIGHTS_METRICS_BLOB", + "IPName" => "ContainerInsights", + "DataItems" => [insightsMetricsRecord.each { |k, v| insightsMetricsRecord[k] = v }], + } + insightsMetricsEventStream.add(emitTime, wrapper) if wrapper end - # Send container inventory records for containers on windows nodes - @winContainerCount += containerInventoryRecords.length - containerInventoryRecords.each do |cirecord| - if !cirecord.nil? - ciwrapper = { - "DataType" => "CONTAINER_INVENTORY_BLOB", - "IPName" => "ContainerInsights", - "DataItems" => [cirecord.each { |k, v| cirecord[k] = v }], - } - eventStream.add(emitTime, ciwrapper) if ciwrapper + + if @PODS_EMIT_STREAM_BATCH_SIZE > 0 && insightsMetricsEventStream.count >= @PODS_EMIT_STREAM_BATCH_SIZE + $log.info("in_kube_podinventory::parse_and_emit_records: number of GPU insights metrics records emitted #{@PODS_EMIT_STREAM_BATCH_SIZE} @ #{Time.now.utc.iso8601}") + if (!@@istestvar.nil? && !@@istestvar.empty? 
&& @@istestvar.casecmp("true") == 0) + $log.info("kubePodInsightsMetricsEmitStreamSuccess @ #{Time.now.utc.iso8601}") end + router.emit_stream(Constants::INSIGHTSMETRICS_FLUENT_TAG, insightsMetricsEventStream) if insightsMetricsEventStream + insightsMetricsEventStream = MultiEventStream.new end end #podInventory block end - router.emit_stream(@tag, eventStream) if eventStream + if eventStream.count > 0 + $log.info("in_kube_podinventory::parse_and_emit_records: number of pod inventory records emitted #{eventStream.count} @ #{Time.now.utc.iso8601}") + router.emit_stream(@tag, eventStream) if eventStream + if (!@@istestvar.nil? && !@@istestvar.empty? && @@istestvar.casecmp("true") == 0) + $log.info("kubePodInventoryEmitStreamSuccess @ #{Time.now.utc.iso8601}") + end + eventStream = nil + end + + if kubePerfEventStream.count > 0 + $log.info("in_kube_podinventory::parse_and_emit_records: number of perf records emitted #{kubePerfEventStream.count} @ #{Time.now.utc.iso8601}") + router.emit_stream(@@kubeperfTag, kubePerfEventStream) if kubePerfEventStream + kubePerfEventStream = nil + end + + if insightsMetricsEventStream.count > 0 + $log.info("in_kube_podinventory::parse_and_emit_records: number of insights metrics records emitted #{insightsMetricsEventStream.count} @ #{Time.now.utc.iso8601}") + router.emit_stream(Constants::INSIGHTSMETRICS_FLUENT_TAG, insightsMetricsEventStream) if insightsMetricsEventStream + if (!@@istestvar.nil? && !@@istestvar.empty? && @@istestvar.casecmp("true") == 0) + $log.info("kubePodInsightsMetricsEmitStreamSuccess @ #{Time.now.utc.iso8601}") + end + insightsMetricsEventStream = nil + end - if continuationToken.nil? #no more chunks in this batch to be sent, get all pod inventory records to send + if continuationToken.nil? #no more chunks in this batch to be sent, get all mdm pod inventory records to send @log.info "Sending pod inventory mdm records to out_mdm" pod_inventory_mdm_records = @inventoryToMdmConvertor.get_pod_inventory_mdm_records(batchTime) @log.info "pod_inventory_mdm_records.size #{pod_inventory_mdm_records.size}" @@ -402,101 +328,36 @@ def parse_and_emit_records(podInventory, serviceList, continuationToken, batchTi router.emit_stream(@@MDMKubePodInventoryTag, mdm_pod_inventory_es) if mdm_pod_inventory_es end - #:optimize:kubeperf merge - begin - #if(!podInventory.empty?) 
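The added emit paths above cap memory by flushing each MultiEventStream as soon as it holds @PODS_EMIT_STREAM_BATCH_SIZE records instead of accumulating every record for the whole poll cycle. A minimal sketch of that batching pattern, assuming it runs inside a Fluentd input plugin like the ones in this patch (emit_batched, batch_size and the tag are illustrative names, not the plugin's exact code):

    # Flush the stream whenever it reaches the batch size; a size of 0 disables batching.
    def emit_batched(records, tag, router, batch_size)
      stream = MultiEventStream.new
      records.each do |record|
        stream.add(Time.now.to_f, record)
        if batch_size > 0 && stream.count >= batch_size
          router.emit_stream(tag, stream)   # flush a full batch
          stream = MultiEventStream.new     # start a fresh stream to bound memory
        end
      end
      router.emit_stream(tag, stream) if stream.count > 0   # flush the remainder
    end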
- containerMetricDataItems = [] - #hostName = (OMS::Common.get_hostname) - containerMetricDataItems.concat(KubernetesApiClient.getContainerResourceRequestsAndLimits(podInventory, "requests", "cpu", "cpuRequestNanoCores", batchTime)) - containerMetricDataItems.concat(KubernetesApiClient.getContainerResourceRequestsAndLimits(podInventory, "requests", "memory", "memoryRequestBytes", batchTime)) - containerMetricDataItems.concat(KubernetesApiClient.getContainerResourceRequestsAndLimits(podInventory, "limits", "cpu", "cpuLimitNanoCores", batchTime)) - containerMetricDataItems.concat(KubernetesApiClient.getContainerResourceRequestsAndLimits(podInventory, "limits", "memory", "memoryLimitBytes", batchTime)) - - kubePerfEventStream = MultiEventStream.new - insightsMetricsEventStream = MultiEventStream.new - - containerMetricDataItems.each do |record| - record["DataType"] = "LINUX_PERF_BLOB" - record["IPName"] = "LogManagement" - kubePerfEventStream.add(emitTime, record) if record - end - #end - router.emit_stream(@@kubeperfTag, kubePerfEventStream) if kubePerfEventStream - - begin - #start GPU InsightsMetrics items - - containerGPUInsightsMetricsDataItems = [] - containerGPUInsightsMetricsDataItems.concat(KubernetesApiClient.getContainerResourceRequestsAndLimitsAsInsightsMetrics(podInventory, "requests", "nvidia.com/gpu", "containerGpuRequests", batchTime)) - containerGPUInsightsMetricsDataItems.concat(KubernetesApiClient.getContainerResourceRequestsAndLimitsAsInsightsMetrics(podInventory, "limits", "nvidia.com/gpu", "containerGpuLimits", batchTime)) - - containerGPUInsightsMetricsDataItems.concat(KubernetesApiClient.getContainerResourceRequestsAndLimitsAsInsightsMetrics(podInventory, "requests", "amd.com/gpu", "containerGpuRequests", batchTime)) - containerGPUInsightsMetricsDataItems.concat(KubernetesApiClient.getContainerResourceRequestsAndLimitsAsInsightsMetrics(podInventory, "limits", "amd.com/gpu", "containerGpuLimits", batchTime)) - - containerGPUInsightsMetricsDataItems.each do |insightsMetricsRecord| - wrapper = { - "DataType" => "INSIGHTS_METRICS_BLOB", - "IPName" => "ContainerInsights", - "DataItems" => [insightsMetricsRecord.each { |k, v| insightsMetricsRecord[k] = v }], - } - insightsMetricsEventStream.add(emitTime, wrapper) if wrapper - - if (!@@istestvar.nil? && !@@istestvar.empty? && @@istestvar.casecmp("true") == 0 && insightsMetricsEventStream.count > 0) - $log.info("kubePodInsightsMetricsEmitStreamSuccess @ #{Time.now.utc.iso8601}") - end - end - - router.emit_stream(Constants::INSIGHTSMETRICS_FLUENT_TAG, insightsMetricsEventStream) if insightsMetricsEventStream - #end GPU InsightsMetrics items - rescue => errorStr - $log.warn "Failed when processing GPU metrics in_kube_podinventory : #{errorStr}" - $log.debug_backtrace(errorStr.backtrace) - ApplicationInsightsUtility.sendExceptionTelemetry(errorStr) - end - rescue => errorStr - $log.warn "Failed in parse_and_emit_record for KubePerf from in_kube_podinventory : #{errorStr}" - $log.debug_backtrace(errorStr.backtrace) - ApplicationInsightsUtility.sendExceptionTelemetry(errorStr) - end - #:optimize:end kubeperf merge - - #:optimize:start kubeservices merge - begin - if (!serviceList.nil? && !serviceList.empty?) 
- kubeServicesEventStream = MultiEventStream.new - serviceList["items"].each do |items| - kubeServiceRecord = {} - kubeServiceRecord["CollectionTime"] = batchTime #This is the time that is mapped to become TimeGenerated - kubeServiceRecord["ServiceName"] = items["metadata"]["name"] - kubeServiceRecord["Namespace"] = items["metadata"]["namespace"] - kubeServiceRecord["SelectorLabels"] = [items["spec"]["selector"]] + if continuationToken.nil? # sending kube services inventory records + kubeServicesEventStream = MultiEventStream.new + serviceRecords.each do |kubeServiceRecord| + if !kubeServiceRecord.nil? + # adding before emit to reduce memory foot print kubeServiceRecord["ClusterId"] = KubernetesApiClient.getClusterId kubeServiceRecord["ClusterName"] = KubernetesApiClient.getClusterName - kubeServiceRecord["ClusterIP"] = items["spec"]["clusterIP"] - kubeServiceRecord["ServiceType"] = items["spec"]["type"] - # : Add ports and status fields kubeServicewrapper = { "DataType" => "KUBE_SERVICES_BLOB", "IPName" => "ContainerInsights", "DataItems" => [kubeServiceRecord.each { |k, v| kubeServiceRecord[k] = v }], } kubeServicesEventStream.add(emitTime, kubeServicewrapper) if kubeServicewrapper + if @PODS_EMIT_STREAM_BATCH_SIZE > 0 && kubeServicesEventStream.count >= @PODS_EMIT_STREAM_BATCH_SIZE + $log.info("in_kube_podinventory::parse_and_emit_records: number of service records emitted #{@PODS_EMIT_STREAM_BATCH_SIZE} @ #{Time.now.utc.iso8601}") + router.emit_stream(@@kubeservicesTag, kubeServicesEventStream) if kubeServicesEventStream + kubeServicesEventStream = MultiEventStream.new + end end + end + + if kubeServicesEventStream.count > 0 + $log.info("in_kube_podinventory::parse_and_emit_records : number of service records emitted #{kubeServicesEventStream.count} @ #{Time.now.utc.iso8601}") router.emit_stream(@@kubeservicesTag, kubeServicesEventStream) if kubeServicesEventStream end - rescue => errorStr - $log.warn "Failed in parse_and_emit_record for KubeServices from in_kube_podinventory : #{errorStr}" - $log.debug_backtrace(errorStr.backtrace) - ApplicationInsightsUtility.sendExceptionTelemetry(errorStr) + kubeServicesEventStream = nil end - #:optimize:end kubeservices merge #Updating value for AppInsights telemetry @podCount += podInventory["items"].length - - if (!@@istestvar.nil? && !@@istestvar.empty? && @@istestvar.casecmp("true") == 0 && eventStream.count > 0) - $log.info("kubePodInventoryEmitStreamSuccess @ #{Time.now.utc.iso8601}") - end rescue => errorStr $log.warn "Failed in parse_and_emit_record pod inventory: #{errorStr}" $log.debug_backtrace(errorStr.backtrace) @@ -536,25 +397,238 @@ def run_periodic @mutex.unlock end - def getServiceNameFromLabels(namespace, labels, serviceList) + # TODO - move this method to KubernetesClient or helper class + def getPodInventoryRecords(item, serviceRecords, batchTime = Time.utc.iso8601) + records = [] + record = {} + + begin + record["CollectionTime"] = batchTime #This is the time that is mapped to become TimeGenerated + record["Name"] = item["metadata"]["name"] + podNameSpace = item["metadata"]["namespace"] + podUid = KubernetesApiClient.getPodUid(podNameSpace, item["metadata"]) + if podUid.nil? + return records + end + + nodeName = "" + #for unscheduled (non-started) pods nodeName does NOT exist + if !item["spec"]["nodeName"].nil? 
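The getPodInventoryRecords helper continuing below resolves the pod status with a fixed precedence: "Unknown" for NodeLost pods whose Ready condition is False, "Terminating" when a deletionTimestamp is set, and the reported phase otherwise. A standalone sketch of that resolution (helper name is illustrative):

    def resolve_pod_status(item, pod_ready_condition)
      return "Unknown" unless pod_ready_condition
      deletion_ts = item["metadata"]["deletionTimestamp"]
      return Constants::POD_STATUS_TERMINATING if !deletion_ts.nil? && !deletion_ts.empty?
      item["status"]["phase"]   # e.g. Running, Pending, Succeeded, Failed
    end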
+ nodeName = item["spec"]["nodeName"] + end + # For ARO v3 cluster, skip the pods scheduled on to master or infra nodes + if KubernetesApiClient.isAROv3MasterOrInfraPod(nodeName) + return records + end + + record["PodUid"] = podUid + record["PodLabel"] = [item["metadata"]["labels"]] + record["Namespace"] = podNameSpace + record["PodCreationTimeStamp"] = item["metadata"]["creationTimestamp"] + #for unscheduled (non-started) pods startTime does NOT exist + if !item["status"]["startTime"].nil? + record["PodStartTime"] = item["status"]["startTime"] + else + record["PodStartTime"] = "" + end + #podStatus + # the below is for accounting 'NodeLost' scenario, where-in the pod(s) in the lost node is still being reported as running + podReadyCondition = true + if !item["status"]["reason"].nil? && item["status"]["reason"] == "NodeLost" && !item["status"]["conditions"].nil? + item["status"]["conditions"].each do |condition| + if condition["type"] == "Ready" && condition["status"] == "False" + podReadyCondition = false + break + end + end + end + if podReadyCondition == false + record["PodStatus"] = "Unknown" + # ICM - https://portal.microsofticm.com/imp/v3/incidents/details/187091803/home + elsif !item["metadata"]["deletionTimestamp"].nil? && !item["metadata"]["deletionTimestamp"].empty? + record["PodStatus"] = Constants::POD_STATUS_TERMINATING + else + record["PodStatus"] = item["status"]["phase"] + end + #for unscheduled (non-started) pods podIP does NOT exist + if !item["status"]["podIP"].nil? + record["PodIp"] = item["status"]["podIP"] + else + record["PodIp"] = "" + end + + record["Computer"] = nodeName + record["ClusterId"] = KubernetesApiClient.getClusterId + record["ClusterName"] = KubernetesApiClient.getClusterName + record["ServiceName"] = getServiceNameFromLabels(item["metadata"]["namespace"], item["metadata"]["labels"], serviceRecords) + + if !item["metadata"]["ownerReferences"].nil? + record["ControllerKind"] = item["metadata"]["ownerReferences"][0]["kind"] + record["ControllerName"] = item["metadata"]["ownerReferences"][0]["name"] + @controllerSet.add(record["ControllerKind"] + record["ControllerName"]) + #Adding controller kind to telemetry ro information about customer workload + if (@controllerData[record["ControllerKind"]].nil?) + @controllerData[record["ControllerKind"]] = 1 + else + controllerValue = @controllerData[record["ControllerKind"]] + @controllerData[record["ControllerKind"]] += 1 + end + end + podRestartCount = 0 + record["PodRestartCount"] = 0 + + #Invoke the helper method to compute ready/not ready mdm metric + @inventoryToMdmConvertor.process_record_for_pods_ready_metric(record["ControllerName"], record["Namespace"], item["status"]["conditions"]) + + podContainers = [] + if item["status"].key?("containerStatuses") && !item["status"]["containerStatuses"].empty? + podContainers = podContainers + item["status"]["containerStatuses"] + end + # Adding init containers to the record list as well. + if item["status"].key?("initContainerStatuses") && !item["status"]["initContainerStatuses"].empty? + podContainers = podContainers + item["status"]["initContainerStatuses"] + end + # if items["status"].key?("containerStatuses") && !items["status"]["containerStatuses"].empty? #container status block start + if !podContainers.empty? 
#container status block start + podContainers.each do |container| + containerRestartCount = 0 + lastFinishedTime = nil + # Need this flag to determine if we need to process container data for mdm metrics like oomkilled and container restart + #container Id is of the form + #docker://dfd9da983f1fd27432fb2c1fe3049c0a1d25b1c697b2dc1a530c986e58b16527 + if !container["containerID"].nil? + record["ContainerID"] = container["containerID"].split("//")[1] + else + # for containers that have image issues (like invalid image/tag etc..) this will be empty. do not make it all 0 + record["ContainerID"] = "" + end + #keeping this as which is same as InstanceName in perf table + if podUid.nil? || container["name"].nil? + next + else + record["ContainerName"] = podUid + "/" + container["name"] + end + #Pod restart count is a sumtotal of restart counts of individual containers + #within the pod. The restart count of a container is maintained by kubernetes + #itself in the form of a container label. + containerRestartCount = container["restartCount"] + record["ContainerRestartCount"] = containerRestartCount + + containerStatus = container["state"] + record["ContainerStatusReason"] = "" + # state is of the following form , so just picking up the first key name + # "state": { + # "waiting": { + # "reason": "CrashLoopBackOff", + # "message": "Back-off 5m0s restarting failed container=metrics-server pod=metrics-server-2011498749-3g453_kube-system(5953be5f-fcae-11e7-a356-000d3ae0e432)" + # } + # }, + # the below is for accounting 'NodeLost' scenario, where-in the containers in the lost node/pod(s) is still being reported as running + if podReadyCondition == false + record["ContainerStatus"] = "Unknown" + else + record["ContainerStatus"] = containerStatus.keys[0] + end + #TODO : Remove ContainerCreationTimeStamp from here since we are sending it as a metric + #Picking up both container and node start time from cAdvisor to be consistent + if containerStatus.keys[0] == "running" + record["ContainerCreationTimeStamp"] = container["state"]["running"]["startedAt"] + else + if !containerStatus[containerStatus.keys[0]]["reason"].nil? && !containerStatus[containerStatus.keys[0]]["reason"].empty? + record["ContainerStatusReason"] = containerStatus[containerStatus.keys[0]]["reason"] + end + # Process the record to see if job was completed 6 hours ago. If so, send metric to mdm + if !record["ControllerKind"].nil? && record["ControllerKind"].downcase == Constants::CONTROLLER_KIND_JOB + @inventoryToMdmConvertor.process_record_for_terminated_job_metric(record["ControllerName"], record["Namespace"], containerStatus) + end + end + + # Record the last state of the container. This may have information on why a container was killed. + begin + if !container["lastState"].nil? && container["lastState"].keys.length == 1 + lastStateName = container["lastState"].keys[0] + lastStateObject = container["lastState"][lastStateName] + if !lastStateObject.is_a?(Hash) + raise "expected a hash object. 
This could signify a bug or a kubernetes API change" + end + + if lastStateObject.key?("reason") && lastStateObject.key?("startedAt") && lastStateObject.key?("finishedAt") + newRecord = Hash.new + newRecord["lastState"] = lastStateName # get the name of the last state (ex: terminated) + lastStateReason = lastStateObject["reason"] + # newRecord["reason"] = lastStateObject["reason"] # (ex: OOMKilled) + newRecord["reason"] = lastStateReason # (ex: OOMKilled) + newRecord["startedAt"] = lastStateObject["startedAt"] # (ex: 2019-07-02T14:58:51Z) + lastFinishedTime = lastStateObject["finishedAt"] + newRecord["finishedAt"] = lastFinishedTime # (ex: 2019-07-02T14:58:52Z) + + # only write to the output field if everything previously ran without error + record["ContainerLastStatus"] = newRecord + + #Populate mdm metric for OOMKilled container count if lastStateReason is OOMKilled + if lastStateReason.downcase == Constants::REASON_OOM_KILLED + @inventoryToMdmConvertor.process_record_for_oom_killed_metric(record["ControllerName"], record["Namespace"], lastFinishedTime) + end + lastStateReason = nil + else + record["ContainerLastStatus"] = Hash.new + end + else + record["ContainerLastStatus"] = Hash.new + end + + #Populate mdm metric for container restart count if greater than 0 + if (!containerRestartCount.nil? && (containerRestartCount.is_a? Integer) && containerRestartCount > 0) + @inventoryToMdmConvertor.process_record_for_container_restarts_metric(record["ControllerName"], record["Namespace"], lastFinishedTime) + end + rescue => errorStr + $log.warn "Failed in parse_and_emit_record pod inventory while processing ContainerLastStatus: #{errorStr}" + $log.debug_backtrace(errorStr.backtrace) + ApplicationInsightsUtility.sendExceptionTelemetry(errorStr) + record["ContainerLastStatus"] = Hash.new + end + + podRestartCount += containerRestartCount + records.push(record.dup) + end + else # for unscheduled pods there are no status.containerStatuses, in this case we still want the pod + records.push(record) + end #container status block end + + records.each do |record| + if !record.nil? + record["PodRestartCount"] = podRestartCount + end + end + rescue => error + $log.warn("getPodInventoryRecords failed: #{error}") + end + return records + end + + # TODO - move this method to KubernetesClient or helper class + def getServiceNameFromLabels(namespace, labels, serviceRecords) serviceName = "" begin if !labels.nil? && !labels.empty? - if (!serviceList.nil? && !serviceList.empty? && serviceList.key?("items") && !serviceList["items"].empty?) - serviceList["items"].each do |item| - found = 0 - if !item["spec"].nil? && !item["spec"]["selector"].nil? && item["metadata"]["namespace"] == namespace - selectorLabels = item["spec"]["selector"] - if !selectorLabels.empty? - selectorLabels.each do |key, value| - if !(labels.select { |k, v| k == key && v == value }.length > 0) - break - end - found = found + 1 + serviceRecords.each do |kubeServiceRecord| + found = 0 + if kubeServiceRecord["Namespace"] == namespace + selectorLabels = {} + # selector labels wrapped in array in kube service records so unwrapping here + if !kubeServiceRecord["SelectorLabels"].nil? && kubeServiceRecord["SelectorLabels"].length > 0 + selectorLabels = kubeServiceRecord["SelectorLabels"][0] + end + if !selectorLabels.nil? && !selectorLabels.empty? 
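The selector comparison continuing below treats a pod as belonging to a service only when every key/value pair of the service selector appears in the pod's labels; services without selectors are skipped. A standalone sketch of that check against the pre-built service records (method and variable names are illustrative):

    def service_for_pod(pod_labels, service_records, namespace)
      service_records.each do |svc|
        next unless svc["Namespace"] == namespace
        selector = (svc["SelectorLabels"] || [{}])[0] || {}   # selector labels are wrapped in an array
        next if selector.empty?                               # a service can have no selectors
        return svc["ServiceName"] if selector.all? { |k, v| pod_labels[k] == v }
      end
      ""   # no matching service
    end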
+ selectorLabels.each do |key, value| + if !(labels.select { |k, v| k == key && v == value }.length > 0) + break end + found = found + 1 end + # service can have no selectors if found == selectorLabels.length - return item["metadata"]["name"] + return kubeServiceRecord["ServiceName"] end end end diff --git a/source/plugins/ruby/in_kube_pvinventory.rb b/source/plugins/ruby/in_kube_pvinventory.rb new file mode 100644 index 000000000..861b3a8e1 --- /dev/null +++ b/source/plugins/ruby/in_kube_pvinventory.rb @@ -0,0 +1,253 @@ +module Fluent + class Kube_PVInventory_Input < Input + Plugin.register_input("kubepvinventory", self) + + @@hostName = (OMS::Common.get_hostname) + + def initialize + super + require "yaml" + require "yajl/json_gem" + require "yajl" + require "time" + require_relative "KubernetesApiClient" + require_relative "ApplicationInsightsUtility" + require_relative "oms_common" + require_relative "omslog" + require_relative "constants" + + # Response size is around 1500 bytes per PV + @PV_CHUNK_SIZE = "5000" + @pvTypeToCountHash = {} + end + + config_param :run_interval, :time, :default => 60 + config_param :tag, :string, :default => "oms.containerinsights.KubePVInventory" + + def configure(conf) + super + end + + def start + if @run_interval + @finished = false + @condition = ConditionVariable.new + @mutex = Mutex.new + @thread = Thread.new(&method(:run_periodic)) + @@pvTelemetryTimeTracker = DateTime.now.to_time.to_i + end + end + + def shutdown + if @run_interval + @mutex.synchronize { + @finished = true + @condition.signal + } + @thread.join + end + end + + def enumerate + begin + pvInventory = nil + telemetryFlush = false + @pvTypeToCountHash = {} + currentTime = Time.now + batchTime = currentTime.utc.iso8601 + + continuationToken = nil + $log.info("in_kube_pvinventory::enumerate : Getting PVs from Kube API @ #{Time.now.utc.iso8601}") + continuationToken, pvInventory = KubernetesApiClient.getResourcesAndContinuationToken("persistentvolumes?limit=#{@PV_CHUNK_SIZE}") + $log.info("in_kube_pvinventory::enumerate : Done getting PVs from Kube API @ #{Time.now.utc.iso8601}") + + if (!pvInventory.nil? && !pvInventory.empty? && pvInventory.key?("items") && !pvInventory["items"].nil? && !pvInventory["items"].empty?) + parse_and_emit_records(pvInventory, batchTime) + else + $log.warn "in_kube_pvinventory::enumerate:Received empty pvInventory" + end + + # If we receive a continuation token, make calls, process and flush data until we have processed all data + while (!continuationToken.nil? && !continuationToken.empty?) + continuationToken, pvInventory = KubernetesApiClient.getResourcesAndContinuationToken("persistentvolumes?limit=#{@PV_CHUNK_SIZE}&continue=#{continuationToken}") + if (!pvInventory.nil? && !pvInventory.empty? && pvInventory.key?("items") && !pvInventory["items"].nil? && !pvInventory["items"].empty?) 
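enumerate pages through the Kubernetes API with a limit query parameter plus a continuation token, processing and flushing each chunk before requesting the next one (the loop continues below). A minimal sketch of that pattern; process stands in for parse_and_emit_records and chunk_size for the per-plugin chunk size:

    def enumerate_paged(resource_path, chunk_size)
      continuation_token, list = KubernetesApiClient.getResourcesAndContinuationToken("#{resource_path}?limit=#{chunk_size}")
      process(list) if !list.nil? && list.key?("items") && !list["items"].empty?
      while !continuation_token.nil? && !continuation_token.empty?
        continuation_token, list = KubernetesApiClient.getResourcesAndContinuationToken("#{resource_path}?limit=#{chunk_size}&continue=#{continuation_token}")
        process(list) if !list.nil? && list.key?("items") && !list["items"].empty?
      end
      list = nil   # release the last chunk so GC can reclaim it
    end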
+ parse_and_emit_records(pvInventory, batchTime) + else + $log.warn "in_kube_pvinventory::enumerate:Received empty pvInventory" + end + end + + # Setting this to nil so that we dont hold memory until GC kicks in + pvInventory = nil + + # Adding telemetry to send pod telemetry every 10 minutes + timeDifference = (DateTime.now.to_time.to_i - @@pvTelemetryTimeTracker).abs + timeDifferenceInMinutes = timeDifference / 60 + if (timeDifferenceInMinutes >= Constants::TELEMETRY_FLUSH_INTERVAL_IN_MINUTES) + telemetryFlush = true + end + + # Flush AppInsights telemetry once all the processing is done + if telemetryFlush == true + telemetryProperties = {} + telemetryProperties["CountsOfPVTypes"] = @pvTypeToCountHash.to_json + ApplicationInsightsUtility.sendCustomEvent(Constants::PV_INVENTORY_HEART_BEAT_EVENT, telemetryProperties) + @@pvTelemetryTimeTracker = DateTime.now.to_time.to_i + end + + rescue => errorStr + $log.warn "in_kube_pvinventory::enumerate:Failed in enumerate: #{errorStr}" + $log.debug_backtrace(errorStr.backtrace) + ApplicationInsightsUtility.sendExceptionTelemetry(errorStr) + end + end # end enumerate + + def parse_and_emit_records(pvInventory, batchTime = Time.utc.iso8601) + currentTime = Time.now + emitTime = currentTime.to_f + eventStream = MultiEventStream.new + + begin + records = [] + pvInventory["items"].each do |item| + + # Node, pod, & usage info can be found by joining with pvUsedBytes metric using PVCNamespace/PVCName + record = {} + record["CollectionTime"] = batchTime + record["ClusterId"] = KubernetesApiClient.getClusterId + record["ClusterName"] = KubernetesApiClient.getClusterName + record["PVName"] = item["metadata"]["name"] + record["PVStatus"] = item["status"]["phase"] + record["PVAccessModes"] = item["spec"]["accessModes"].join(', ') + record["PVStorageClassName"] = item["spec"]["storageClassName"] + record["PVCapacityBytes"] = KubernetesApiClient.getMetricNumericValue("memory", item["spec"]["capacity"]["storage"]) + record["PVCreationTimeStamp"] = item["metadata"]["creationTimestamp"] + + # Optional values + pvcNamespace, pvcName = getPVCInfo(item) + type, typeInfo = getTypeInfo(item) + record["PVCNamespace"] = pvcNamespace + record["PVCName"] = pvcName + record["PVType"] = type + record["PVTypeInfo"] = typeInfo + + records.push(record) + + # Record telemetry + if type == nil + type = "empty" + end + if (@pvTypeToCountHash.has_key? type) + @pvTypeToCountHash[type] += 1 + else + @pvTypeToCountHash[type] = 1 + end + end + + records.each do |record| + if !record.nil? + wrapper = { + "DataType" => "KUBE_PV_INVENTORY_BLOB", + "IPName" => "ContainerInsights", + "DataItems" => [record.each { |k, v| record[k] = v }], + } + eventStream.add(emitTime, wrapper) if wrapper + end + end + + router.emit_stream(@tag, eventStream) if eventStream + + rescue => errorStr + $log.warn "Failed in parse_and_emit_record for in_kube_pvinventory: #{errorStr}" + $log.debug_backtrace(errorStr.backtrace) + ApplicationInsightsUtility.sendExceptionTelemetry(errorStr) + end + end + + def getPVCInfo(item) + begin + if !item["spec"].nil? && !item["spec"]["claimRef"].nil? 
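getPVCInfo (continuing below) and getTypeInfo read the PV's optional fields defensively: the PVC namespace and name come from spec.claimRef when the volume is bound, and the PV type is the first known volume-source key present under spec (azureDisk, azureFile, and so on from Constants::PV_TYPES). A compact sketch of the type lookup (helper name is illustrative):

    def pv_type_of(item, known_types)
      spec = item["spec"] || {}
      known_types.find { |pv_type| !spec[pv_type].nil? }   # a PV carries exactly one volume source
    end

    # e.g. pv_type_of(pv_item, Constants::PV_TYPES)  # => "azureDisk", "azureFile", ... or nil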
+ claimRef = item["spec"]["claimRef"] + pvcNamespace = claimRef["namespace"] + pvcName = claimRef["name"] + return pvcNamespace, pvcName + end + rescue => errorStr + $log.warn "Failed in getPVCInfo for in_kube_pvinventory: #{errorStr}" + $log.debug_backtrace(errorStr.backtrace) + ApplicationInsightsUtility.sendExceptionTelemetry(errorStr) + end + + # No PVC or an error + return nil, nil + end + + def getTypeInfo(item) + begin + if !item["spec"].nil? + (Constants::PV_TYPES).each do |pvType| + + # PV is this type + if !item["spec"][pvType].nil? + + # Get additional info if azure disk/file + typeInfo = {} + if pvType == "azureDisk" + azureDisk = item["spec"]["azureDisk"] + typeInfo["DiskName"] = azureDisk["diskName"] + typeInfo["DiskUri"] = azureDisk["diskURI"] + elsif pvType == "azureFile" + typeInfo["FileShareName"] = item["spec"]["azureFile"]["shareName"] + end + + # Can only have one type: return right away when found + return pvType, typeInfo + + end + end + end + rescue => errorStr + $log.warn "Failed in getTypeInfo for in_kube_pvinventory: #{errorStr}" + $log.debug_backtrace(errorStr.backtrace) + ApplicationInsightsUtility.sendExceptionTelemetry(errorStr) + end + + # No matches from list of types or an error + return nil, {} + end + + + def run_periodic + @mutex.lock + done = @finished + @nextTimeToRun = Time.now + @waitTimeout = @run_interval + until done + @nextTimeToRun = @nextTimeToRun + @run_interval + @now = Time.now + if @nextTimeToRun <= @now + @waitTimeout = 1 + @nextTimeToRun = @now + else + @waitTimeout = @nextTimeToRun - @now + end + @condition.wait(@mutex, @waitTimeout) + done = @finished + @mutex.unlock + if !done + begin + $log.info("in_kube_pvinventory::run_periodic.enumerate.start #{Time.now.utc.iso8601}") + enumerate + $log.info("in_kube_pvinventory::run_periodic.enumerate.end #{Time.now.utc.iso8601}") + rescue => errorStr + $log.warn "in_kube_pvinventory::run_periodic: enumerate Failed to retrieve pod inventory: #{errorStr}" + ApplicationInsightsUtility.sendExceptionTelemetry(errorStr) + end + end + @mutex.lock + end + @mutex.unlock + end + + end # Kube_PVInventory_Input +end # module \ No newline at end of file diff --git a/source/plugins/ruby/in_kubestate_deployments.rb b/source/plugins/ruby/in_kubestate_deployments.rb index bcf397150..27e4709a2 100644 --- a/source/plugins/ruby/in_kubestate_deployments.rb +++ b/source/plugins/ruby/in_kubestate_deployments.rb @@ -2,230 +2,238 @@ # frozen_string_literal: true module Fluent - class Kube_Kubestate_Deployments_Input < Input - Plugin.register_input("kubestatedeployments", self) - @@istestvar = ENV["ISTEST"] - # telemetry - To keep telemetry cost reasonable, we keep track of the max deployments over a period of 15m - @@deploymentsCount = 0 - - - - def initialize - super - require "yajl/json_gem" - require "yajl" - require "date" - require "time" - - require_relative "KubernetesApiClient" - require_relative "oms_common" - require_relative "omslog" - require_relative "ApplicationInsightsUtility" - require_relative "constants" - - # roughly each deployment is 8k - # 1000 deployments account to approximately 8MB - @DEPLOYMENTS_CHUNK_SIZE = 1000 - @DEPLOYMENTS_API_GROUP = "apps" - @@telemetryLastSentTime = DateTime.now.to_time.to_i - - - @deploymentsRunningTotal = 0 - - @NodeName = OMS::Common.get_hostname - @ClusterId = KubernetesApiClient.getClusterId - @ClusterName = KubernetesApiClient.getClusterName - end - - config_param :run_interval, :time, :default => 60 - config_param :tag, :string, :default => 
Constants::INSIGHTSMETRICS_FLUENT_TAG - - def configure(conf) - super - end - - def start - if @run_interval - @finished = false - @condition = ConditionVariable.new - @mutex = Mutex.new - @thread = Thread.new(&method(:run_periodic)) + class Kube_Kubestate_Deployments_Input < Input + Plugin.register_input("kubestatedeployments", self) + @@istestvar = ENV["ISTEST"] + # telemetry - To keep telemetry cost reasonable, we keep track of the max deployments over a period of 15m + @@deploymentsCount = 0 + + def initialize + super + require "yajl/json_gem" + require "yajl" + require "date" + require "time" + + require_relative "KubernetesApiClient" + require_relative "oms_common" + require_relative "omslog" + require_relative "ApplicationInsightsUtility" + require_relative "constants" + + # refer tomlparser-agent-config for defaults + # this configurable via configmap + @DEPLOYMENTS_CHUNK_SIZE = 0 + + @DEPLOYMENTS_API_GROUP = "apps" + @@telemetryLastSentTime = DateTime.now.to_time.to_i + + @deploymentsRunningTotal = 0 + + @NodeName = OMS::Common.get_hostname + @ClusterId = KubernetesApiClient.getClusterId + @ClusterName = KubernetesApiClient.getClusterName + end + + config_param :run_interval, :time, :default => 60 + config_param :tag, :string, :default => Constants::INSIGHTSMETRICS_FLUENT_TAG + + def configure(conf) + super + end + + def start + if @run_interval + if !ENV["DEPLOYMENTS_CHUNK_SIZE"].nil? && !ENV["DEPLOYMENTS_CHUNK_SIZE"].empty? && ENV["DEPLOYMENTS_CHUNK_SIZE"].to_i > 0 + @DEPLOYMENTS_CHUNK_SIZE = ENV["DEPLOYMENTS_CHUNK_SIZE"].to_i + else + # this shouldnt happen just setting default here as safe guard + $log.warn("in_kubestate_deployments::start: setting to default value since got DEPLOYMENTS_CHUNK_SIZE nil or empty") + @DEPLOYMENTS_CHUNK_SIZE = 500 end + $log.info("in_kubestate_deployments::start : DEPLOYMENTS_CHUNK_SIZE @ #{@DEPLOYMENTS_CHUNK_SIZE}") + + @finished = false + @condition = ConditionVariable.new + @mutex = Mutex.new + @thread = Thread.new(&method(:run_periodic)) end - - def shutdown - if @run_interval - @mutex.synchronize { - @finished = true - @condition.signal - } - @thread.join - end + end + + def shutdown + if @run_interval + @mutex.synchronize { + @finished = true + @condition.signal + } + @thread.join end - - def enumerate - begin - deploymentList = nil - currentTime = Time.now - batchTime = currentTime.utc.iso8601 - - #set the running total for this batch to 0 - @deploymentsRunningTotal = 0 - - # Initializing continuation token to nil - continuationToken = nil - $log.info("in_kubestate_deployments::enumerate : Getting deployments from Kube API @ #{Time.now.utc.iso8601}") - continuationToken, deploymentList = KubernetesApiClient.getResourcesAndContinuationToken("deployments?limit=#{@DEPLOYMENTS_CHUNK_SIZE}", api_group: @DEPLOYMENTS_API_GROUP) - $log.info("in_kubestate_deployments::enumerate : Done getting deployments from Kube API @ #{Time.now.utc.iso8601}") + end + + def enumerate + begin + deploymentList = nil + currentTime = Time.now + batchTime = currentTime.utc.iso8601 + + #set the running total for this batch to 0 + @deploymentsRunningTotal = 0 + + # Initializing continuation token to nil + continuationToken = nil + $log.info("in_kubestate_deployments::enumerate : Getting deployments from Kube API @ #{Time.now.utc.iso8601}") + continuationToken, deploymentList = KubernetesApiClient.getResourcesAndContinuationToken("deployments?limit=#{@DEPLOYMENTS_CHUNK_SIZE}", api_group: @DEPLOYMENTS_API_GROUP) + $log.info("in_kubestate_deployments::enumerate : Done 
getting deployments from Kube API @ #{Time.now.utc.iso8601}") + if (!deploymentList.nil? && !deploymentList.empty? && deploymentList.key?("items") && !deploymentList["items"].nil? && !deploymentList["items"].empty?) + $log.info("in_kubestate_deployments::enumerate : number of deployment items :#{deploymentList["items"].length} from Kube API @ #{Time.now.utc.iso8601}") + parse_and_emit_records(deploymentList, batchTime) + else + $log.warn "in_kubestate_deployments::enumerate:Received empty deploymentList" + end + + #If we receive a continuation token, make calls, process and flush data until we have processed all data + while (!continuationToken.nil? && !continuationToken.empty?) + continuationToken, deploymentList = KubernetesApiClient.getResourcesAndContinuationToken("deployments?limit=#{@DEPLOYMENTS_CHUNK_SIZE}&continue=#{continuationToken}", api_group: @DEPLOYMENTS_API_GROUP) if (!deploymentList.nil? && !deploymentList.empty? && deploymentList.key?("items") && !deploymentList["items"].nil? && !deploymentList["items"].empty?) + $log.info("in_kubestate_deployments::enumerate : number of deployment items :#{deploymentList["items"].length} from Kube API @ #{Time.now.utc.iso8601}") parse_and_emit_records(deploymentList, batchTime) else $log.warn "in_kubestate_deployments::enumerate:Received empty deploymentList" end - - #If we receive a continuation token, make calls, process and flush data until we have processed all data - while (!continuationToken.nil? && !continuationToken.empty?) - continuationToken, deploymentList = KubernetesApiClient.getResourcesAndContinuationToken("deployments?limit=#{@DEPLOYMENTS_CHUNK_SIZE}&continue=#{continuationToken}", api_group: @DEPLOYMENTS_API_GROUP) - if (!deploymentList.nil? && !deploymentList.empty? && deploymentList.key?("items") && !deploymentList["items"].nil? && !deploymentList["items"].empty?) - parse_and_emit_records(deploymentList, batchTime) - else - $log.warn "in_kubestate_deployments::enumerate:Received empty deploymentList" - end + end + + # Setting this to nil so that we dont hold memory until GC kicks in + deploymentList = nil + + $log.info("successfully emitted a total of #{@deploymentsRunningTotal} kube_state_deployment metrics") + # Flush AppInsights telemetry once all the processing is done, only if the number of events flushed is greater than 0 + if (@deploymentsRunningTotal > @@deploymentsCount) + @@deploymentsCount = @deploymentsRunningTotal + end + if (((DateTime.now.to_time.to_i - @@telemetryLastSentTime).abs) / 60) >= Constants::KUBE_STATE_TELEMETRY_FLUSH_INTERVAL_IN_MINUTES + #send telemetry + $log.info "sending deployemt telemetry..." + ApplicationInsightsUtility.sendMetricTelemetry("MaxDeploymentCount", @@deploymentsCount, {}) + #reset last sent value & time + @@deploymentsCount = 0 + @@telemetryLastSentTime = DateTime.now.to_time.to_i + end + rescue => errorStr + $log.warn "in_kubestate_deployments::enumerate:Failed in enumerate: #{errorStr}" + ApplicationInsightsUtility.sendExceptionTelemetry("in_kubestate_deployments::enumerate:Failed in enumerate: #{errorStr}") + end + end # end enumerate + + def parse_and_emit_records(deployments, batchTime = Time.utc.iso8601) + metricItems = [] + insightsMetricsEventStream = MultiEventStream.new + begin + metricInfo = deployments + metricInfo["items"].each do |deployment| + deploymentName = deployment["metadata"]["name"] + deploymentNameSpace = deployment["metadata"]["namespace"] + deploymentCreatedTime = "" + if !deployment["metadata"]["creationTimestamp"].nil? 
+ deploymentCreatedTime = deployment["metadata"]["creationTimestamp"] + end + deploymentStrategy = "RollingUpdate" #default when not specified as per spec + if !deployment["spec"]["strategy"].nil? && !deployment["spec"]["strategy"]["type"].nil? + deploymentStrategy = deployment["spec"]["strategy"]["type"] end - - # Setting this to nil so that we dont hold memory until GC kicks in - deploymentList = nil - - $log.info("successfully emitted a total of #{@deploymentsRunningTotal} kube_state_deployment metrics") - # Flush AppInsights telemetry once all the processing is done, only if the number of events flushed is greater than 0 - if (@deploymentsRunningTotal > @@deploymentsCount) - @@deploymentsCount = @deploymentsRunningTotal + deploymentSpecReplicas = 1 #default is 1 as per k8s spec + if !deployment["spec"]["replicas"].nil? + deploymentSpecReplicas = deployment["spec"]["replicas"] end - if (((DateTime.now.to_time.to_i - @@telemetryLastSentTime).abs)/60 ) >= Constants::KUBE_STATE_TELEMETRY_FLUSH_INTERVAL_IN_MINUTES - #send telemetry - $log.info "sending deployemt telemetry..." - ApplicationInsightsUtility.sendMetricTelemetry("MaxDeploymentCount", @@deploymentsCount, {}) - #reset last sent value & time - @@deploymentsCount = 0 - @@telemetryLastSentTime = DateTime.now.to_time.to_i + deploymentStatusReadyReplicas = 0 + if !deployment["status"]["readyReplicas"].nil? + deploymentStatusReadyReplicas = deployment["status"]["readyReplicas"] end - rescue => errorStr - $log.warn "in_kubestate_deployments::enumerate:Failed in enumerate: #{errorStr}" - ApplicationInsightsUtility.sendExceptionTelemetry("in_kubestate_deployments::enumerate:Failed in enumerate: #{errorStr}") + deploymentStatusUpToDateReplicas = 0 + if !deployment["status"]["updatedReplicas"].nil? + deploymentStatusUpToDateReplicas = deployment["status"]["updatedReplicas"] + end + deploymentStatusAvailableReplicas = 0 + if !deployment["status"]["availableReplicas"].nil? 
+ deploymentStatusAvailableReplicas = deployment["status"]["availableReplicas"] + end + + metricItem = {} + metricItem["CollectionTime"] = batchTime + metricItem["Computer"] = @NodeName + metricItem["Name"] = Constants::INSIGHTSMETRICS_METRIC_NAME_KUBE_STATE_DEPLOYMENT_STATE + metricItem["Value"] = deploymentStatusReadyReplicas + metricItem["Origin"] = Constants::INSIGHTSMETRICS_TAGS_ORIGIN + metricItem["Namespace"] = Constants::INSIGHTSMETRICS_TAGS_KUBESTATE_NAMESPACE + + metricTags = {} + metricTags[Constants::INSIGHTSMETRICS_TAGS_CLUSTERID] = @ClusterId + metricTags[Constants::INSIGHTSMETRICS_TAGS_CLUSTERNAME] = @ClusterName + metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_NAME] = deploymentName + metricTags[Constants::INSIGHTSMETRICS_TAGS_K8SNAMESPACE] = deploymentNameSpace + metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_STRATEGY] = deploymentStrategy + metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_CREATIONTIME] = deploymentCreatedTime + metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_SPEC_REPLICAS] = deploymentSpecReplicas + metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_STATUS_REPLICAS_UPDATED] = deploymentStatusUpToDateReplicas + metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_STATUS_REPLICAS_AVAILABLE] = deploymentStatusAvailableReplicas + + metricItem["Tags"] = metricTags + + metricItems.push(metricItem) end - end # end enumerate - - def parse_and_emit_records(deployments, batchTime = Time.utc.iso8601) - metricItems = [] - insightsMetricsEventStream = MultiEventStream.new - begin - metricInfo = deployments - metricInfo["items"].each do |deployment| - deploymentName = deployment["metadata"]["name"] - deploymentNameSpace = deployment["metadata"]["namespace"] - deploymentCreatedTime = "" - if !deployment["metadata"]["creationTimestamp"].nil? - deploymentCreatedTime = deployment["metadata"]["creationTimestamp"] - end - deploymentStrategy = "RollingUpdate" #default when not specified as per spec - if !deployment["spec"]["strategy"].nil? && !deployment["spec"]["strategy"]["type"].nil? - deploymentStrategy = deployment["spec"]["strategy"]["type"] - end - deploymentSpecReplicas = 1 #default is 1 as per k8s spec - if !deployment["spec"]["replicas"].nil? - deploymentSpecReplicas = deployment["spec"]["replicas"] - end - deploymentStatusReadyReplicas = 0 - if !deployment["status"]["readyReplicas"].nil? - deploymentStatusReadyReplicas = deployment["status"]["readyReplicas"] - end - deploymentStatusUpToDateReplicas = 0 - if !deployment["status"]["updatedReplicas"].nil? - deploymentStatusUpToDateReplicas = deployment["status"]["updatedReplicas"] - end - deploymentStatusAvailableReplicas = 0 - if !deployment["status"]["availableReplicas"].nil? 
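The added code above turns each deployment's ready-replica count into an InsightsMetrics item: the measured value goes in Value, and everything needed for filtering and joining (cluster, deployment name, strategy, the other replica counts) goes into Tags. A minimal sketch of that record shape (helper name is illustrative):

    def to_insights_metric(name, value, tags, computer, batch_time)
      {
        "CollectionTime" => batch_time,
        "Computer"       => computer,
        "Name"           => name,        # e.g. the kube-state deployment state metric
        "Value"          => value,       # e.g. status.readyReplicas
        "Origin"         => Constants::INSIGHTSMETRICS_TAGS_ORIGIN,
        "Namespace"      => Constants::INSIGHTSMETRICS_TAGS_KUBESTATE_NAMESPACE,
        "Tags"           => tags,
      }
    end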
- deploymentStatusAvailableReplicas = deployment["status"]["availableReplicas"] - end - - metricItem = {} - metricItem["CollectionTime"] = batchTime - metricItem["Computer"] = @NodeName - metricItem["Name"] = Constants::INSIGHTSMETRICS_METRIC_NAME_KUBE_STATE_DEPLOYMENT_STATE - metricItem["Value"] = deploymentStatusReadyReplicas - metricItem["Origin"] = Constants::INSIGHTSMETRICS_TAGS_ORIGIN - metricItem["Namespace"] = Constants::INSIGHTSMETRICS_TAGS_KUBESTATE_NAMESPACE - - metricTags = {} - metricTags[Constants::INSIGHTSMETRICS_TAGS_CLUSTERID] = @ClusterId - metricTags[Constants::INSIGHTSMETRICS_TAGS_CLUSTERNAME] = @ClusterName - metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_NAME] = deploymentName - metricTags[Constants::INSIGHTSMETRICS_TAGS_K8SNAMESPACE] = deploymentNameSpace - metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_STRATEGY ] = deploymentStrategy - metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_CREATIONTIME] = deploymentCreatedTime - metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_SPEC_REPLICAS] = deploymentSpecReplicas - metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_STATUS_REPLICAS_UPDATED] = deploymentStatusUpToDateReplicas - metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_DEPLOYMENT_STATUS_REPLICAS_AVAILABLE] = deploymentStatusAvailableReplicas - - - metricItem["Tags"] = metricTags - - metricItems.push(metricItem) - end - - time = Time.now.to_f - metricItems.each do |insightsMetricsRecord| - wrapper = { - "DataType" => "INSIGHTS_METRICS_BLOB", - "IPName" => "ContainerInsights", - "DataItems" => [insightsMetricsRecord.each { |k, v| insightsMetricsRecord[k] = v }], - } - insightsMetricsEventStream.add(time, wrapper) if wrapper - end - - router.emit_stream(Constants::INSIGHTSMETRICS_FLUENT_TAG, insightsMetricsEventStream) if insightsMetricsEventStream - $log.info("successfully emitted #{metricItems.length()} kube_state_deployment metrics") - @deploymentsRunningTotal = @deploymentsRunningTotal + metricItems.length() - if (!@@istestvar.nil? && !@@istestvar.empty? && @@istestvar.casecmp("true") == 0 && insightsMetricsEventStream.count > 0) - $log.info("kubestatedeploymentsInsightsMetricsEmitStreamSuccess @ #{Time.now.utc.iso8601}") - end - rescue => error - $log.warn("in_kubestate_deployments::parse_and_emit_records failed: #{error} ") - ApplicationInsightsUtility.sendExceptionTelemetry("in_kubestate_deployments::parse_and_emit_records failed: #{error}") + + time = Time.now.to_f + metricItems.each do |insightsMetricsRecord| + wrapper = { + "DataType" => "INSIGHTS_METRICS_BLOB", + "IPName" => "ContainerInsights", + "DataItems" => [insightsMetricsRecord.each { |k, v| insightsMetricsRecord[k] = v }], + } + insightsMetricsEventStream.add(time, wrapper) if wrapper + end + + router.emit_stream(Constants::INSIGHTSMETRICS_FLUENT_TAG, insightsMetricsEventStream) if insightsMetricsEventStream + $log.info("successfully emitted #{metricItems.length()} kube_state_deployment metrics") + + @deploymentsRunningTotal = @deploymentsRunningTotal + metricItems.length() + if (!@@istestvar.nil? && !@@istestvar.empty? 
&& @@istestvar.casecmp("true") == 0 && insightsMetricsEventStream.count > 0) + $log.info("kubestatedeploymentsInsightsMetricsEmitStreamSuccess @ #{Time.now.utc.iso8601}") end - + rescue => error + $log.warn("in_kubestate_deployments::parse_and_emit_records failed: #{error} ") + ApplicationInsightsUtility.sendExceptionTelemetry("in_kubestate_deployments::parse_and_emit_records failed: #{error}") end - - def run_periodic - @mutex.lock + end + + def run_periodic + @mutex.lock + done = @finished + @nextTimeToRun = Time.now + @waitTimeout = @run_interval + until done + @nextTimeToRun = @nextTimeToRun + @run_interval + @now = Time.now + if @nextTimeToRun <= @now + @waitTimeout = 1 + @nextTimeToRun = @now + else + @waitTimeout = @nextTimeToRun - @now + end + @condition.wait(@mutex, @waitTimeout) done = @finished - @nextTimeToRun = Time.now - @waitTimeout = @run_interval - until done - @nextTimeToRun = @nextTimeToRun + @run_interval - @now = Time.now - if @nextTimeToRun <= @now - @waitTimeout = 1 - @nextTimeToRun = @now - else - @waitTimeout = @nextTimeToRun - @now - end - @condition.wait(@mutex, @waitTimeout) - done = @finished - @mutex.unlock - if !done - begin - $log.info("in_kubestate_deployments::run_periodic.enumerate.start @ #{Time.now.utc.iso8601}") - enumerate - $log.info("in_kubestate_deployments::run_periodic.enumerate.end @ #{Time.now.utc.iso8601}") - rescue => errorStr - $log.warn "in_kubestate_deployments::run_periodic: enumerate Failed to retrieve kube deployments: #{errorStr}" - ApplicationInsightsUtility.sendExceptionTelemetry("in_kubestate_deployments::run_periodic: enumerate Failed to retrieve kube deployments: #{errorStr}") - end + @mutex.unlock + if !done + begin + $log.info("in_kubestate_deployments::run_periodic.enumerate.start @ #{Time.now.utc.iso8601}") + enumerate + $log.info("in_kubestate_deployments::run_periodic.enumerate.end @ #{Time.now.utc.iso8601}") + rescue => errorStr + $log.warn "in_kubestate_deployments::run_periodic: enumerate Failed to retrieve kube deployments: #{errorStr}" + ApplicationInsightsUtility.sendExceptionTelemetry("in_kubestate_deployments::run_periodic: enumerate Failed to retrieve kube deployments: #{errorStr}") end - @mutex.lock end - @mutex.unlock + @mutex.lock end + @mutex.unlock end -end \ No newline at end of file + end +end diff --git a/source/plugins/ruby/in_kubestate_hpa.rb b/source/plugins/ruby/in_kubestate_hpa.rb index 3ce63a75a..afecf8e3b 100644 --- a/source/plugins/ruby/in_kubestate_hpa.rb +++ b/source/plugins/ruby/in_kubestate_hpa.rb @@ -2,231 +2,236 @@ # frozen_string_literal: true module Fluent - class Kube_Kubestate_HPA_Input < Input - Plugin.register_input("kubestatehpa", self) - @@istestvar = ENV["ISTEST"] - - - def initialize - super - require "yajl/json_gem" - require "yajl" - require "time" - - require_relative "KubernetesApiClient" - require_relative "oms_common" - require_relative "omslog" - require_relative "ApplicationInsightsUtility" - require_relative "constants" - - # roughly each HPA is 3k - # 2000 HPAs account to approximately 6-7MB - @HPA_CHUNK_SIZE = 2000 - @HPA_API_GROUP = "autoscaling" - - # telemetry - @hpaCount = 0 - - @NodeName = OMS::Common.get_hostname - @ClusterId = KubernetesApiClient.getClusterId - @ClusterName = KubernetesApiClient.getClusterName - end - - config_param :run_interval, :time, :default => 60 - config_param :tag, :string, :default => Constants::INSIGHTSMETRICS_FLUENT_TAG - - def configure(conf) - super - end - - def start - if @run_interval - @finished = false - @condition = 
ConditionVariable.new - @mutex = Mutex.new - @thread = Thread.new(&method(:run_periodic)) + class Kube_Kubestate_HPA_Input < Input + Plugin.register_input("kubestatehpa", self) + @@istestvar = ENV["ISTEST"] + + def initialize + super + require "yajl/json_gem" + require "yajl" + require "time" + + require_relative "KubernetesApiClient" + require_relative "oms_common" + require_relative "omslog" + require_relative "ApplicationInsightsUtility" + require_relative "constants" + + # refer tomlparser-agent-config for defaults + # this configurable via configmap + @HPA_CHUNK_SIZE = 0 + + @HPA_API_GROUP = "autoscaling" + + # telemetry + @hpaCount = 0 + + @NodeName = OMS::Common.get_hostname + @ClusterId = KubernetesApiClient.getClusterId + @ClusterName = KubernetesApiClient.getClusterName + end + + config_param :run_interval, :time, :default => 60 + config_param :tag, :string, :default => Constants::INSIGHTSMETRICS_FLUENT_TAG + + def configure(conf) + super + end + + def start + if @run_interval + if !ENV["HPA_CHUNK_SIZE"].nil? && !ENV["HPA_CHUNK_SIZE"].empty? && ENV["HPA_CHUNK_SIZE"].to_i > 0 + @HPA_CHUNK_SIZE = ENV["HPA_CHUNK_SIZE"].to_i + else + # this shouldnt happen just setting default here as safe guard + $log.warn("in_kubestate_hpa::start: setting to default value since got HPA_CHUNK_SIZE nil or empty") + @HPA_CHUNK_SIZE = 2000 end + $log.info("in_kubestate_hpa::start : HPA_CHUNK_SIZE @ #{@HPA_CHUNK_SIZE}") + + @finished = false + @condition = ConditionVariable.new + @mutex = Mutex.new + @thread = Thread.new(&method(:run_periodic)) end - - def shutdown - if @run_interval - @mutex.synchronize { - @finished = true - @condition.signal - } - @thread.join - end + end + + def shutdown + if @run_interval + @mutex.synchronize { + @finished = true + @condition.signal + } + @thread.join end - - def enumerate - begin - hpaList = nil - currentTime = Time.now - batchTime = currentTime.utc.iso8601 - - @hpaCount = 0 - - # Initializing continuation token to nil - continuationToken = nil - $log.info("in_kubestate_hpa::enumerate : Getting HPAs from Kube API @ #{Time.now.utc.iso8601}") - continuationToken, hpaList = KubernetesApiClient.getResourcesAndContinuationToken("horizontalpodautoscalers?limit=#{@HPA_CHUNK_SIZE}", api_group: @HPA_API_GROUP) - $log.info("in_kubestate_hpa::enumerate : Done getting HPAs from Kube API @ #{Time.now.utc.iso8601}") + end + + def enumerate + begin + hpaList = nil + currentTime = Time.now + batchTime = currentTime.utc.iso8601 + + @hpaCount = 0 + + # Initializing continuation token to nil + continuationToken = nil + $log.info("in_kubestate_hpa::enumerate : Getting HPAs from Kube API @ #{Time.now.utc.iso8601}") + continuationToken, hpaList = KubernetesApiClient.getResourcesAndContinuationToken("horizontalpodautoscalers?limit=#{@HPA_CHUNK_SIZE}", api_group: @HPA_API_GROUP) + $log.info("in_kubestate_hpa::enumerate : Done getting HPAs from Kube API @ #{Time.now.utc.iso8601}") + if (!hpaList.nil? && !hpaList.empty? && hpaList.key?("items") && !hpaList["items"].nil? && !hpaList["items"].empty?) + parse_and_emit_records(hpaList, batchTime) + else + $log.warn "in_kubestate_hpa::enumerate:Received empty hpaList" + end + + #If we receive a continuation token, make calls, process and flush data until we have processed all data + while (!continuationToken.nil? && !continuationToken.empty?) 
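Both kube-state plugins now take their chunk size from an environment variable populated from the configmap (see tomlparser-agent-config), falling back to a built-in default when the value is missing or invalid. A minimal sketch of that resolution (helper name is illustrative):

    def resolve_chunk_size(env_name, default_value)
      raw = ENV[env_name]
      return raw.to_i if !raw.nil? && !raw.empty? && raw.to_i > 0
      default_value   # safeguard; this should not normally happen
    end

    # e.g. @HPA_CHUNK_SIZE = resolve_chunk_size("HPA_CHUNK_SIZE", 2000)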
+ continuationToken, hpaList = KubernetesApiClient.getResourcesAndContinuationToken("horizontalpodautoscalers?limit=#{@HPA_CHUNK_SIZE}&continue=#{continuationToken}", api_group: @HPA_API_GROUP) if (!hpaList.nil? && !hpaList.empty? && hpaList.key?("items") && !hpaList["items"].nil? && !hpaList["items"].empty?) parse_and_emit_records(hpaList, batchTime) else $log.warn "in_kubestate_hpa::enumerate:Received empty hpaList" end - - #If we receive a continuation token, make calls, process and flush data until we have processed all data - while (!continuationToken.nil? && !continuationToken.empty?) - continuationToken, hpaList = KubernetesApiClient.getResourcesAndContinuationToken("horizontalpodautoscalers?limit=#{@HPA_CHUNK_SIZE}&continue=#{continuationToken}", api_group: @HPA_API_GROUP) - if (!hpaList.nil? && !hpaList.empty? && hpaList.key?("items") && !hpaList["items"].nil? && !hpaList["items"].empty?) - parse_and_emit_records(hpaList, batchTime) - else - $log.warn "in_kubestate_hpa::enumerate:Received empty hpaList" + end + + # Setting this to nil so that we dont hold memory until GC kicks in + hpaList = nil + + # Flush AppInsights telemetry once all the processing is done, only if the number of events flushed is greater than 0 + if (@hpaCount > 0) + # this will not be a useful telemetry, as hpa counts will not be huge, just log for now + $log.info("in_kubestate_hpa::hpaCount= #{hpaCount}") + #ApplicationInsightsUtility.sendMetricTelemetry("HPACount", @hpaCount, {}) + end + rescue => errorStr + $log.warn "in_kubestate_hpa::enumerate:Failed in enumerate: #{errorStr}" + ApplicationInsightsUtility.sendExceptionTelemetry("in_kubestate_hpa::enumerate:Failed in enumerate: #{errorStr}") + end + end # end enumerate + + def parse_and_emit_records(hpas, batchTime = Time.utc.iso8601) + metricItems = [] + insightsMetricsEventStream = MultiEventStream.new + begin + metricInfo = hpas + metricInfo["items"].each do |hpa| + hpaName = hpa["metadata"]["name"] + hpaNameSpace = hpa["metadata"]["namespace"] + hpaCreatedTime = "" + if !hpa["metadata"]["creationTimestamp"].nil? + hpaCreatedTime = hpa["metadata"]["creationTimestamp"] + end + hpaSpecMinReplicas = 1 #default is 1 as per k8s spec + if !hpa["spec"]["minReplicas"].nil? + hpaSpecMinReplicas = hpa["spec"]["minReplicas"] + end + hpaSpecMaxReplicas = 0 + if !hpa["spec"]["maxReplicas"].nil? + hpaSpecMaxReplicas = hpa["spec"]["maxReplicas"] + end + hpaSpecScaleTargetKind = "" + hpaSpecScaleTargetName = "" + if !hpa["spec"]["scaleTargetRef"].nil? + if !hpa["spec"]["scaleTargetRef"]["kind"].nil? + hpaSpecScaleTargetKind = hpa["spec"]["scaleTargetRef"]["kind"] + end + if !hpa["spec"]["scaleTargetRef"]["name"].nil? + hpaSpecScaleTargetName = hpa["spec"]["scaleTargetRef"]["name"] end end - - # Setting this to nil so that we dont hold memory until GC kicks in - hpaList = nil - - # Flush AppInsights telemetry once all the processing is done, only if the number of events flushed is greater than 0 - if (@hpaCount > 0) - # this will not be a useful telemetry, as hpa counts will not be huge, just log for now - $log.info("in_kubestate_hpa::hpaCount= #{hpaCount}") - #ApplicationInsightsUtility.sendMetricTelemetry("HPACount", @hpaCount, {}) + hpaStatusCurrentReplicas = 0 + if !hpa["status"]["currentReplicas"].nil? 
+ hpaStatusCurrentReplicas = hpa["status"]["currentReplicas"] end - rescue => errorStr - $log.warn "in_kubestate_hpa::enumerate:Failed in enumerate: #{errorStr}" - ApplicationInsightsUtility.sendExceptionTelemetry("in_kubestate_hpa::enumerate:Failed in enumerate: #{errorStr}") + hpaStatusDesiredReplicas = 0 + if !hpa["status"]["desiredReplicas"].nil? + hpaStatusDesiredReplicas = hpa["status"]["desiredReplicas"] + end + + hpaStatuslastScaleTime = "" + if !hpa["status"]["lastScaleTime"].nil? + hpaStatuslastScaleTime = hpa["status"]["lastScaleTime"] + end + + metricItem = {} + metricItem["CollectionTime"] = batchTime + metricItem["Computer"] = @NodeName + metricItem["Name"] = Constants::INSIGHTSMETRICS_METRIC_NAME_KUBE_STATE_HPA_STATE + metricItem["Value"] = hpaStatusCurrentReplicas + metricItem["Origin"] = Constants::INSIGHTSMETRICS_TAGS_ORIGIN + metricItem["Namespace"] = Constants::INSIGHTSMETRICS_TAGS_KUBESTATE_NAMESPACE + + metricTags = {} + metricTags[Constants::INSIGHTSMETRICS_TAGS_CLUSTERID] = @ClusterId + metricTags[Constants::INSIGHTSMETRICS_TAGS_CLUSTERNAME] = @ClusterName + metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_NAME] = hpaName + metricTags[Constants::INSIGHTSMETRICS_TAGS_K8SNAMESPACE] = hpaNameSpace + metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_CREATIONTIME] = hpaCreatedTime + metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_SPEC_MIN_REPLICAS] = hpaSpecMinReplicas + metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_SPEC_MAX_REPLICAS] = hpaSpecMaxReplicas + metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_SPEC_SCALE_TARGET_KIND] = hpaSpecScaleTargetKind + metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_SPEC_SCALE_TARGET_NAME] = hpaSpecScaleTargetName + metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_STATUS_DESIRED_REPLICAS] = hpaStatusDesiredReplicas + metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_STATUS_LAST_SCALE_TIME] = hpaStatuslastScaleTime + + metricItem["Tags"] = metricTags + + metricItems.push(metricItem) end - end # end enumerate - - def parse_and_emit_records(hpas, batchTime = Time.utc.iso8601) - metricItems = [] - insightsMetricsEventStream = MultiEventStream.new - begin - metricInfo = hpas - metricInfo["items"].each do |hpa| - hpaName = hpa["metadata"]["name"] - hpaNameSpace = hpa["metadata"]["namespace"] - hpaCreatedTime = "" - if !hpa["metadata"]["creationTimestamp"].nil? - hpaCreatedTime = hpa["metadata"]["creationTimestamp"] - end - hpaSpecMinReplicas = 1 #default is 1 as per k8s spec - if !hpa["spec"]["minReplicas"].nil? - hpaSpecMinReplicas = hpa["spec"]["minReplicas"] - end - hpaSpecMaxReplicas = 0 - if !hpa["spec"]["maxReplicas"].nil? - hpaSpecMaxReplicas = hpa["spec"]["maxReplicas"] - end - hpaSpecScaleTargetKind = "" - hpaSpecScaleTargetName = "" - if !hpa["spec"]["scaleTargetRef"].nil? - if !hpa["spec"]["scaleTargetRef"]["kind"].nil? - hpaSpecScaleTargetKind = hpa["spec"]["scaleTargetRef"]["kind"] - end - if !hpa["spec"]["scaleTargetRef"]["name"].nil? - hpaSpecScaleTargetName = hpa["spec"]["scaleTargetRef"]["name"] - end - - end - hpaStatusCurrentReplicas = 0 - if !hpa["status"]["currentReplicas"].nil? - hpaStatusCurrentReplicas = hpa["status"]["currentReplicas"] - end - hpaStatusDesiredReplicas = 0 - if !hpa["status"]["desiredReplicas"].nil? - hpaStatusDesiredReplicas = hpa["status"]["desiredReplicas"] - end - - hpaStatuslastScaleTime = "" - if !hpa["status"]["lastScaleTime"].nil? 
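The field extraction above relies on repeated explicit nil checks with defaults (minReplicas defaults to 1 per the HPA spec, status counters default to 0). An equivalent, more compact pattern using Hash#dig is sketched here purely as an illustration; the plugin itself uses the explicit checks:

    def field_or_default(obj, default, *path)
      value = obj.dig(*path)
      value.nil? ? default : value
    end

    # e.g. hpa_spec_min_replicas = field_or_default(hpa, 1, "spec", "minReplicas")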
- hpaStatuslastScaleTime = hpa["status"]["lastScaleTime"] - end - - - metricItem = {} - metricItem["CollectionTime"] = batchTime - metricItem["Computer"] = @NodeName - metricItem["Name"] = Constants::INSIGHTSMETRICS_METRIC_NAME_KUBE_STATE_HPA_STATE - metricItem["Value"] = hpaStatusCurrentReplicas - metricItem["Origin"] = Constants::INSIGHTSMETRICS_TAGS_ORIGIN - metricItem["Namespace"] = Constants::INSIGHTSMETRICS_TAGS_KUBESTATE_NAMESPACE - - metricTags = {} - metricTags[Constants::INSIGHTSMETRICS_TAGS_CLUSTERID] = @ClusterId - metricTags[Constants::INSIGHTSMETRICS_TAGS_CLUSTERNAME] = @ClusterName - metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_NAME] = hpaName - metricTags[Constants::INSIGHTSMETRICS_TAGS_K8SNAMESPACE] = hpaNameSpace - metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_CREATIONTIME] = hpaCreatedTime - metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_SPEC_MIN_REPLICAS] = hpaSpecMinReplicas - metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_SPEC_MAX_REPLICAS] = hpaSpecMaxReplicas - metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_SPEC_SCALE_TARGET_KIND] = hpaSpecScaleTargetKind - metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_SPEC_SCALE_TARGET_NAME] = hpaSpecScaleTargetName - metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_STATUS_DESIRED_REPLICAS] = hpaStatusDesiredReplicas - metricTags[Constants::INSIGHTSMETRICS_TAGS_KUBE_STATE_HPA_STATUS_LAST_SCALE_TIME] = hpaStatuslastScaleTime - - - metricItem["Tags"] = metricTags - - metricItems.push(metricItem) - end - time = Time.now.to_f - metricItems.each do |insightsMetricsRecord| - wrapper = { - "DataType" => "INSIGHTS_METRICS_BLOB", - "IPName" => "ContainerInsights", - "DataItems" => [insightsMetricsRecord.each { |k, v| insightsMetricsRecord[k] = v }], - } - insightsMetricsEventStream.add(time, wrapper) if wrapper - end - - router.emit_stream(Constants::INSIGHTSMETRICS_FLUENT_TAG, insightsMetricsEventStream) if insightsMetricsEventStream - $log.info("successfully emitted #{metricItems.length()} kube_state_hpa metrics") - if (!@@istestvar.nil? && !@@istestvar.empty? && @@istestvar.casecmp("true") == 0 && insightsMetricsEventStream.count > 0) - $log.info("kubestatehpaInsightsMetricsEmitStreamSuccess @ #{Time.now.utc.iso8601}") - end - rescue => error - $log.warn("in_kubestate_hpa::parse_and_emit_records failed: #{error} ") - ApplicationInsightsUtility.sendExceptionTelemetry("in_kubestate_hpa::parse_and_emit_records failed: #{error}") + time = Time.now.to_f + metricItems.each do |insightsMetricsRecord| + wrapper = { + "DataType" => "INSIGHTS_METRICS_BLOB", + "IPName" => "ContainerInsights", + "DataItems" => [insightsMetricsRecord.each { |k, v| insightsMetricsRecord[k] = v }], + } + insightsMetricsEventStream.add(time, wrapper) if wrapper + end + + router.emit_stream(Constants::INSIGHTSMETRICS_FLUENT_TAG, insightsMetricsEventStream) if insightsMetricsEventStream + $log.info("successfully emitted #{metricItems.length()} kube_state_hpa metrics") + if (!@@istestvar.nil? && !@@istestvar.empty? 
&& @@istestvar.casecmp("true") == 0 && insightsMetricsEventStream.count > 0) + $log.info("kubestatehpaInsightsMetricsEmitStreamSuccess @ #{Time.now.utc.iso8601}") end - + rescue => error + $log.warn("in_kubestate_hpa::parse_and_emit_records failed: #{error} ") + ApplicationInsightsUtility.sendExceptionTelemetry("in_kubestate_hpa::parse_and_emit_records failed: #{error}") end - - def run_periodic - @mutex.lock + end + + def run_periodic + @mutex.lock + done = @finished + @nextTimeToRun = Time.now + @waitTimeout = @run_interval + until done + @nextTimeToRun = @nextTimeToRun + @run_interval + @now = Time.now + if @nextTimeToRun <= @now + @waitTimeout = 1 + @nextTimeToRun = @now + else + @waitTimeout = @nextTimeToRun - @now + end + @condition.wait(@mutex, @waitTimeout) done = @finished - @nextTimeToRun = Time.now - @waitTimeout = @run_interval - until done - @nextTimeToRun = @nextTimeToRun + @run_interval - @now = Time.now - if @nextTimeToRun <= @now - @waitTimeout = 1 - @nextTimeToRun = @now - else - @waitTimeout = @nextTimeToRun - @now - end - @condition.wait(@mutex, @waitTimeout) - done = @finished - @mutex.unlock - if !done - begin - $log.info("in_kubestate_hpa::run_periodic.enumerate.start @ #{Time.now.utc.iso8601}") - enumerate - $log.info("in_kubestate_hpa::run_periodic.enumerate.end @ #{Time.now.utc.iso8601}") - rescue => errorStr - $log.warn "in_kubestate_hpa::run_periodic: enumerate Failed to retrieve kube hpas: #{errorStr}" - ApplicationInsightsUtility.sendExceptionTelemetry("in_kubestate_hpa::run_periodic: enumerate Failed to retrieve kube hpas: #{errorStr}") - end + @mutex.unlock + if !done + begin + $log.info("in_kubestate_hpa::run_periodic.enumerate.start @ #{Time.now.utc.iso8601}") + enumerate + $log.info("in_kubestate_hpa::run_periodic.enumerate.end @ #{Time.now.utc.iso8601}") + rescue => errorStr + $log.warn "in_kubestate_hpa::run_periodic: enumerate Failed to retrieve kube hpas: #{errorStr}" + ApplicationInsightsUtility.sendExceptionTelemetry("in_kubestate_hpa::run_periodic: enumerate Failed to retrieve kube hpas: #{errorStr}") end - @mutex.lock end - @mutex.unlock + @mutex.lock end + @mutex.unlock end -end \ No newline at end of file + end +end diff --git a/source/plugins/ruby/kubernetes_container_inventory.rb b/source/plugins/ruby/kubernetes_container_inventory.rb index 4fe728579..69beca493 100644 --- a/source/plugins/ruby/kubernetes_container_inventory.rb +++ b/source/plugins/ruby/kubernetes_container_inventory.rb @@ -189,34 +189,47 @@ def getContainersInfoMap(podItem, isWindows) return containersInfoMap end - def obtainContainerEnvironmentVars(containerId) - $log.info("KubernetesContainerInventory::obtainContainerEnvironmentVars @ #{Time.now.utc.iso8601}") + def obtainContainerEnvironmentVars(containerId) envValueString = "" begin - unless @@containerCGroupCache.has_key?(containerId) - $log.info("KubernetesContainerInventory::obtainContainerEnvironmentVars fetching cGroup parent pid @ #{Time.now.utc.iso8601} for containerId: #{containerId}") + isCGroupPidFetchRequired = false + if !@@containerCGroupCache.has_key?(containerId) + isCGroupPidFetchRequired = true + else + cGroupPid = @@containerCGroupCache[containerId] + if cGroupPid.nil? || cGroupPid.empty? 
+ isCGroupPidFetchRequired = true + @@containerCGroupCache.delete(containerId) + elsif !File.exist?("/hostfs/proc/#{cGroupPid}/environ") + isCGroupPidFetchRequired = true + @@containerCGroupCache.delete(containerId) + end + end + + if isCGroupPidFetchRequired Dir["/hostfs/proc/*/cgroup"].each do |filename| begin - if File.file?(filename) && File.foreach(filename).grep(/#{containerId}/).any? + if File.file?(filename) && File.exist?(filename) && File.foreach(filename).grep(/#{containerId}/).any? # file full path is /hostfs/proc//cgroup - cGroupPid = filename.split("/")[3] - if @@containerCGroupCache.has_key?(containerId) - tempCGroupPid = @@containerCGroupCache[containerId] - if tempCGroupPid > cGroupPid + cGroupPid = filename.split("/")[3] + if is_number?(cGroupPid) + if @@containerCGroupCache.has_key?(containerId) + tempCGroupPid = @@containerCGroupCache[containerId] + if tempCGroupPid.to_i > cGroupPid.to_i + @@containerCGroupCache[containerId] = cGroupPid + end + else @@containerCGroupCache[containerId] = cGroupPid - end - else - @@containerCGroupCache[containerId] = cGroupPid + end end end - rescue SystemCallError # ignore Error::ENOENT,Errno::ESRCH which is expected if any of the container gone while we read - end - end + rescue SystemCallError # ignore Error::ENOENT,Errno::ESRCH which is expected if any of the container gone while we read + end + end end cGroupPid = @@containerCGroupCache[containerId] if !cGroupPid.nil? && !cGroupPid.empty? - environFilePath = "/hostfs/proc/#{cGroupPid}/environ" - $log.info("KubernetesContainerInventory::obtainContainerEnvironmentVars cGroupPid: #{cGroupPid} environFilePath: #{environFilePath} for containerId: #{containerId}") + environFilePath = "/hostfs/proc/#{cGroupPid}/environ" if File.exist?(environFilePath) # Skip environment variable processing if it contains the flag AZMON_COLLECT_ENV=FALSE # Check to see if the environment variable collection is disabled for this container. @@ -229,8 +242,7 @@ def obtainContainerEnvironmentVars(containerId) if !envVars.nil? && !envVars.empty? envVars = envVars.split("\0") envValueString = envVars.to_json - envValueStringLength = envValueString.length - $log.info("KubernetesContainerInventory::environment vars filename @ #{environFilePath} envVars size @ #{envValueStringLength}") + envValueStringLength = envValueString.length if envValueStringLength >= 200000 lastIndex = envValueString.rindex("\",") if !lastIndex.nil? @@ -341,5 +353,8 @@ def deleteCGroupCacheEntryForDeletedContainer(containerId) ApplicationInsightsUtility.sendExceptionTelemetry(error) end end + def is_number?(value) + true if Integer(value) rescue false + end end end diff --git a/source/plugins/ruby/out_mdm.rb b/source/plugins/ruby/out_mdm.rb index 1c805255a..6238eb51a 100644 --- a/source/plugins/ruby/out_mdm.rb +++ b/source/plugins/ruby/out_mdm.rb @@ -50,6 +50,10 @@ def initialize @cluster_identity = nil @isArcK8sCluster = false @get_access_token_backoff_expiry = Time.now + + @mdm_exceptions_hash = {} + @mdm_exceptions_count = 0 + @mdm_exception_telemetry_time_tracker = DateTime.now.to_time.to_i end def configure(conf) @@ -221,10 +225,49 @@ def format(tag, time, record) end end + def exception_aggregator(error) + begin + errorStr = error.to_s + if (@mdm_exceptions_hash[errorStr].nil?) 
+ @mdm_exceptions_hash[errorStr] = 1 + else + @mdm_exceptions_hash[errorStr] += 1 + end + #Keeping track of all exceptions to send the total in the last flush interval as a metric + @mdm_exceptions_count += 1 + rescue => error + @log.info "Error in MDM exception_aggregator method: #{error}" + ApplicationInsightsUtility.sendExceptionTelemetry(error) + end + end + + def flush_mdm_exception_telemetry + begin + #Flush out exception telemetry as a metric for the last 30 minutes + timeDifference = (DateTime.now.to_time.to_i - @mdm_exception_telemetry_time_tracker).abs + timeDifferenceInMinutes = timeDifference / 60 + if (timeDifferenceInMinutes >= Constants::MDM_EXCEPTIONS_METRIC_FLUSH_INTERVAL) + telemetryProperties = {} + telemetryProperties["ExceptionsHashForFlushInterval"] = @mdm_exceptions_hash.to_json + telemetryProperties["FlushInterval"] = Constants::MDM_EXCEPTIONS_METRIC_FLUSH_INTERVAL + ApplicationInsightsUtility.sendMetricTelemetry(Constants::MDM_EXCEPTION_TELEMETRY_METRIC, @mdm_exceptions_count, telemetryProperties) + # Resetting values after flushing + @mdm_exceptions_count = 0 + @mdm_exceptions_hash = {} + @mdm_exception_telemetry_time_tracker = DateTime.now.to_time.to_i + end + rescue => error + @log.info "Error in flush_mdm_exception_telemetry method: #{error}" + ApplicationInsightsUtility.sendExceptionTelemetry(error) + end + end + # This method is called every flush interval. Send the buffer chunk to MDM. # 'chunk' is a buffer chunk that includes multiple formatted records def write(chunk) begin + # Adding this before trying to flush out metrics, since adding after can lead to metrics never being sent + flush_mdm_exception_telemetry if (!@first_post_attempt_made || (Time.now > @last_post_attempt_time + retry_mdm_post_wait_minutes * 60)) && @can_send_data_to_mdm post_body = [] chunk.msgpack_each { |(tag, record)| @@ -247,7 +290,8 @@ def write(chunk) end end rescue Exception => e - ApplicationInsightsUtility.sendExceptionTelemetry(e) + # Adding exceptions to hash to aggregate and send telemetry for all write errors + exception_aggregator(e) @log.info "Exception when writing to MDM: #{e}" raise e end @@ -282,7 +326,6 @@ def send_to_mdm(post_body) else @log.info "Failed to Post Metrics to MDM : #{e} Response: #{response}" end - #@log.info "MDM request : #{post_body}" @log.debug_backtrace(e.backtrace) if !response.code.empty? 
&& response.code == 403.to_s @log.info "Response Code #{response.code} Updating @last_post_attempt_time" @@ -297,15 +340,15 @@ def send_to_mdm(post_body) @log.info "HTTPServerException when POSTing Metrics to MDM #{e} Response: #{response}" raise e end + # Adding exceptions to hash to aggregate and send telemetry for all 400 error codes + exception_aggregator(e) rescue Errno::ETIMEDOUT => e @log.info "Timed out when POSTing Metrics to MDM : #{e} Response: #{response}" @log.debug_backtrace(e.backtrace) - ApplicationInsightsUtility.sendExceptionTelemetry(e) raise e rescue Exception => e @log.info "Exception POSTing Metrics to MDM : #{e} Response: #{response}" @log.debug_backtrace(e.backtrace) - ApplicationInsightsUtility.sendExceptionTelemetry(e) raise e end end diff --git a/source/plugins/ruby/podinventory_to_mdm.rb b/source/plugins/ruby/podinventory_to_mdm.rb index 834515969..77370e284 100644 --- a/source/plugins/ruby/podinventory_to_mdm.rb +++ b/source/plugins/ruby/podinventory_to_mdm.rb @@ -80,14 +80,14 @@ class Inventory2MdmConvertor @@pod_phase_values = ["Running", "Pending", "Succeeded", "Failed", "Unknown"] @process_incoming_stream = false - def initialize(custom_metrics_azure_regions) + def initialize() @log_path = "/var/opt/microsoft/docker-cimprov/log/mdm_metrics_generator.log" @log = Logger.new(@log_path, 1, 5000000) @pod_count_hash = {} @no_phase_dim_values_hash = {} @pod_count_by_phase = {} @pod_uids = {} - @process_incoming_stream = CustomMetricsUtils.check_custom_metrics_availability(custom_metrics_azure_regions) + @process_incoming_stream = CustomMetricsUtils.check_custom_metrics_availability @log.debug "After check_custom_metrics_availability process_incoming_stream #{@process_incoming_stream}" @log.debug { "Starting podinventory_to_mdm plugin" } end
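
The kubernetes_container_inventory.rb hunks above stop trusting a cached cgroup pid blindly: the cache entry is reused only while its /hostfs/proc/<pid>/environ file still exists, a re-scan of /hostfs/proc is triggered otherwise, and only numeric pid components are cached. A minimal standalone sketch of that flow follows; the names PROC_ROOT, numeric?, cgroup_pid_for, and @container_cgroup_cache are illustrative stand-ins, not the plugin's own API.

    # Sketch of the cache-validation flow, assuming the illustrative names above.
    PROC_ROOT = "/hostfs/proc"
    @container_cgroup_cache = {}

    def numeric?(value)
      true if Integer(value) rescue false
    end

    def cgroup_pid_for(container_id)
      pid = @container_cgroup_cache[container_id]
      # Re-scan when there is no cached pid or its environ file has disappeared.
      if pid.nil? || pid.empty? || !File.exist?("#{PROC_ROOT}/#{pid}/environ")
        @container_cgroup_cache.delete(container_id)
        Dir["#{PROC_ROOT}/*/cgroup"].each do |filename|
          begin
            next unless File.file?(filename) && File.foreach(filename).grep(/#{container_id}/).any?
            candidate = filename.split("/")[3] # pid component of the /hostfs/proc/<pid>/cgroup path
            next unless numeric?(candidate)
            current = @container_cgroup_cache[container_id]
            # Keep the lowest numeric pid seen for this container.
            if current.nil? || current.to_i > candidate.to_i
              @container_cgroup_cache[container_id] = candidate
            end
          rescue SystemCallError
            # The process can vanish while we read; skip it and keep scanning.
          end
        end
      end
      @container_cgroup_cache[container_id]
    end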
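
The out_mdm.rb hunks above replace per-exception Application Insights calls with an in-memory aggregate that is flushed as a single metric once per flush interval. The following is a minimal sketch of the same pattern under stated assumptions: the ExceptionAggregator wrapper class, the hard-coded 30-minute interval, and the puts call are stand-ins; in the plugin the state lives on the output plugin instance and the flush goes through ApplicationInsightsUtility.sendMetricTelemetry.

    require "json"

    class ExceptionAggregator
      FLUSH_INTERVAL_MINUTES = 30 # stand-in for Constants::MDM_EXCEPTIONS_METRIC_FLUSH_INTERVAL

      def initialize
        @exceptions_hash = {}   # error message -> occurrence count
        @exceptions_count = 0   # total exceptions since the last flush
        @last_flush_time = Time.now.to_i
      end

      # Record an exception instead of emitting telemetry for every occurrence.
      def record(error)
        key = error.to_s
        @exceptions_hash[key] = (@exceptions_hash[key] || 0) + 1
        @exceptions_count += 1
      end

      # Called on every write; only emits once the flush interval has elapsed.
      def flush
        elapsed_minutes = (Time.now.to_i - @last_flush_time).abs / 60
        return if elapsed_minutes < FLUSH_INTERVAL_MINUTES
        properties = {
          "ExceptionsHashForFlushInterval" => @exceptions_hash.to_json,
          "FlushInterval" => FLUSH_INTERVAL_MINUTES,
        }
        # Stand-in for the metric telemetry call.
        puts "MdmExceptionCount=#{@exceptions_count} properties=#{properties}"
        @exceptions_hash = {}
        @exceptions_count = 0
        @last_flush_time = Time.now.to_i
      end
    end

    # Usage sketch: record on failure paths, flush on the next write cycle.
    aggregator = ExceptionAggregator.new
    begin
      raise "simulated 403 from MDM"
    rescue => e
      aggregator.record(e)
    end
    aggregator.flush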