Gangams/fix rs ooming (#473)
* optimize kpi

* optimize kube node inventory

* add flags for events, deployments and hpa

* have separate function parseNodeLimits

* refactor code

* fix crash

* fix bug with service name

* fix bugs related to get service name

* update oom fix test agent

* debug logs

* fix service label issue

* update to latest agent and enable ephemeral annotation

* change stream size to 200 from 250

* update yaml

* adjust chunksizes

* add ruby gc env

* yaml changes for cioomtest11282020-3

* telemetry to track pods latency

* service count telemetry

* rename variables

* wip

* nodes inventory telemetry

* configmap changes

* add emit streams in configmap

* yaml updates

* fix copy and paste bug

* add todo comments

* fix node latency telemetry bug

* update yaml with latest test image

* fix bug

* upping rs memory change

* fix mdm bug with final emit stream

* update to latest image

* fix pr feedback

* fix pr feedback

* rename health config to agent config

* fix max allowed hpa chunk size

* update to use 1k pod chunk since validated on 1.18+

* remove debug logs

* minor updates

* move defaults to common place

* chart updates

* final oomfix agent

* update to use prod image so that can be validated with build pipeline

* fix typo in comment
ganga1980 authored Dec 16, 2020
1 parent 9061201 commit 064bc06
Showing 14 changed files with 1,534 additions and 1,166 deletions.
2 changes: 1 addition & 1 deletion build/linux/installer/datafiles/base_container.data
@@ -123,7 +123,7 @@ MAINTAINER: 'Microsoft Corporation'
/opt/tomlparser-mdm-metrics-config.rb; build/linux/installer/scripts/tomlparser-mdm-metrics-config.rb; 755; root; root
/opt/tomlparser-metric-collection-config.rb; build/linux/installer/scripts/tomlparser-metric-collection-config.rb; 755; root; root

-/opt/tomlparser-health-config.rb; build/linux/installer/scripts/tomlparser-health-config.rb; 755; root; root
+/opt/tomlparser-agent-config.rb; build/linux/installer/scripts/tomlparser-agent-config.rb; 755; root; root
/opt/tomlparser.rb; build/common/installer/scripts/tomlparser.rb; 755; root; root
/opt/td-agent-bit-conf-customizer.rb; build/common/installer/scripts/td-agent-bit-conf-customizer.rb; 755; root; root
/opt/ConfigParseErrorLogger.rb; build/common/installer/scripts/ConfigParseErrorLogger.rb; 755; root; root
172 changes: 172 additions & 0 deletions build/linux/installer/scripts/tomlparser-agent-config.rb
@@ -0,0 +1,172 @@
#!/usr/local/bin/ruby

# This should be require_relative on Linux and require on Windows, since tomlrb is a gem install on Windows
@os_type = ENV["OS_TYPE"]
if !@os_type.nil? && !@os_type.empty? && @os_type.strip.casecmp("windows") == 0
  require "tomlrb"
else
  require_relative "tomlrb"
end

require_relative "ConfigParseErrorLogger"

@configMapMountPath = "/etc/config/settings/agent-settings"
@configSchemaVersion = ""
@enable_health_model = false

# 250 node items (15KB per node) amount to approximately 4MB
@nodesChunkSize = 250
# 1000 pods (10KB per pod) amount to approximately 10MB
@podsChunkSize = 1000
# 4000 events (1KB per event) amount to approximately 4MB
@eventsChunkSize = 4000
# roughly each deployment is 8KB
# 500 deployments amount to approximately 4MB
@deploymentsChunkSize = 500
# roughly each HPA is 3KB
# 2000 HPAs amount to approximately 6-7MB
@hpaChunkSize = 2000
# stream batch sizes to avoid large file writes
# too low a value consumes higher disk IOPS
@podsEmitStreamBatchSize = 200
@nodesEmitStreamBatchSize = 100

# The higher the chunk size, the higher the RS pod memory consumption and the lower the API latency;
# lower values reduce memory consumption but incur additional round-trip latency.
# These need to be tuned based on the workload.
# nodes
@nodesChunkSizeMin = 100
@nodesChunkSizeMax = 400
# pods
@podsChunkSizeMin = 250
@podsChunkSizeMax = 1500
# events
@eventsChunkSizeMin = 2000
@eventsChunkSizeMax = 10000
# deployments
@deploymentsChunkSizeMin = 500
@deploymentsChunkSizeMax = 1000
# hpa
@hpaChunkSizeMin = 500
@hpaChunkSizeMax = 2000

# emit stream batch sizes to prevent values that are too low, which cost disk I/O
# the max is capped at the corresponding chunk size
@podsEmitStreamBatchSizeMin = 50
@nodesEmitStreamBatchSizeMin = 50

# Returns true if the value can be parsed as an integer
def is_number?(value)
  true if Integer(value) rescue false
end

# Use parser to parse the configmap toml file to a ruby structure
def parseConfigMap
  begin
    # Check to see if config map is created
    if (File.file?(@configMapMountPath))
      puts "config::configmap container-azm-ms-agentconfig for agent settings mounted, parsing values"
      parsedConfig = Tomlrb.load_file(@configMapMountPath, symbolize_keys: true)
      puts "config::Successfully parsed mounted config map"
      return parsedConfig
    else
      puts "config::configmap container-azm-ms-agentconfig for agent settings not mounted, using defaults"
      return nil
    end
  rescue => errorStr
    ConfigParseErrorLogger.logError("Exception while parsing config map for agent settings: #{errorStr}, using defaults, please check config map for errors")
    return nil
  end
end

# Use the ruby structure created after config parsing to set the right values to be used as environment variables
def populateSettingValuesFromConfigMap(parsedConfig)
  begin
    if !parsedConfig.nil? && !parsedConfig[:agent_settings].nil?
      if !parsedConfig[:agent_settings][:health_model].nil? && !parsedConfig[:agent_settings][:health_model][:enabled].nil?
        @enable_health_model = parsedConfig[:agent_settings][:health_model][:enabled]
        puts "enable_health_model = #{@enable_health_model}"
      end
      chunk_config = parsedConfig[:agent_settings][:chunk_config]
      if !chunk_config.nil?
        nodesChunkSize = chunk_config[:NODES_CHUNK_SIZE]
        if !nodesChunkSize.nil? && is_number?(nodesChunkSize) && (@nodesChunkSizeMin..@nodesChunkSizeMax) === nodesChunkSize.to_i
          @nodesChunkSize = nodesChunkSize.to_i
          puts "Using config map value: NODES_CHUNK_SIZE = #{@nodesChunkSize}"
        end

        podsChunkSize = chunk_config[:PODS_CHUNK_SIZE]
        if !podsChunkSize.nil? && is_number?(podsChunkSize) && (@podsChunkSizeMin..@podsChunkSizeMax) === podsChunkSize.to_i
          @podsChunkSize = podsChunkSize.to_i
          puts "Using config map value: PODS_CHUNK_SIZE = #{@podsChunkSize}"
        end

        eventsChunkSize = chunk_config[:EVENTS_CHUNK_SIZE]
        if !eventsChunkSize.nil? && is_number?(eventsChunkSize) && (@eventsChunkSizeMin..@eventsChunkSizeMax) === eventsChunkSize.to_i
          @eventsChunkSize = eventsChunkSize.to_i
          puts "Using config map value: EVENTS_CHUNK_SIZE = #{@eventsChunkSize}"
        end

        deploymentsChunkSize = chunk_config[:DEPLOYMENTS_CHUNK_SIZE]
        if !deploymentsChunkSize.nil? && is_number?(deploymentsChunkSize) && (@deploymentsChunkSizeMin..@deploymentsChunkSizeMax) === deploymentsChunkSize.to_i
          @deploymentsChunkSize = deploymentsChunkSize.to_i
          puts "Using config map value: DEPLOYMENTS_CHUNK_SIZE = #{@deploymentsChunkSize}"
        end

        hpaChunkSize = chunk_config[:HPA_CHUNK_SIZE]
        if !hpaChunkSize.nil? && is_number?(hpaChunkSize) && (@hpaChunkSizeMin..@hpaChunkSizeMax) === hpaChunkSize.to_i
          @hpaChunkSize = hpaChunkSize.to_i
          puts "Using config map value: HPA_CHUNK_SIZE = #{@hpaChunkSize}"
        end

        podsEmitStreamBatchSize = chunk_config[:PODS_EMIT_STREAM_BATCH_SIZE]
        if !podsEmitStreamBatchSize.nil? && is_number?(podsEmitStreamBatchSize) &&
           podsEmitStreamBatchSize.to_i <= @podsChunkSize && podsEmitStreamBatchSize.to_i >= @podsEmitStreamBatchSizeMin
          @podsEmitStreamBatchSize = podsEmitStreamBatchSize.to_i
          puts "Using config map value: PODS_EMIT_STREAM_BATCH_SIZE = #{@podsEmitStreamBatchSize}"
        end

        nodesEmitStreamBatchSize = chunk_config[:NODES_EMIT_STREAM_BATCH_SIZE]
        if !nodesEmitStreamBatchSize.nil? && is_number?(nodesEmitStreamBatchSize) &&
           nodesEmitStreamBatchSize.to_i <= @nodesChunkSize && nodesEmitStreamBatchSize.to_i >= @nodesEmitStreamBatchSizeMin
          @nodesEmitStreamBatchSize = nodesEmitStreamBatchSize.to_i
          puts "Using config map value: NODES_EMIT_STREAM_BATCH_SIZE = #{@nodesEmitStreamBatchSize}"
        end
      end
    end
  rescue => errorStr
    puts "config::error:Exception while reading config settings for agent configuration setting - #{errorStr}, using defaults"
    @enable_health_model = false
  end
end

@configSchemaVersion = ENV["AZMON_AGENT_CFG_SCHEMA_VERSION"]
puts "****************Start Config Processing********************"
if !@configSchemaVersion.nil? && !@configSchemaVersion.empty? && @configSchemaVersion.strip.casecmp("v1") == 0 # note: v1 is the only supported schema version, so it is hardcoded
  configMapSettings = parseConfigMap
  if !configMapSettings.nil?
    populateSettingValuesFromConfigMap(configMapSettings)
  end
else
  if (File.file?(@configMapMountPath))
    ConfigParseErrorLogger.logError("config::unsupported/missing config schema version - '#{@configSchemaVersion}', using defaults, please use a supported schema version")
  end
  @enable_health_model = false
end

# Write the settings to file, so that they can be set as environment variables
file = File.open("agent_config_env_var", "w")

if !file.nil?
  file.write("export AZMON_CLUSTER_ENABLE_HEALTH_MODEL=#{@enable_health_model}\n")
  file.write("export NODES_CHUNK_SIZE=#{@nodesChunkSize}\n")
  file.write("export PODS_CHUNK_SIZE=#{@podsChunkSize}\n")
  file.write("export EVENTS_CHUNK_SIZE=#{@eventsChunkSize}\n")
  file.write("export DEPLOYMENTS_CHUNK_SIZE=#{@deploymentsChunkSize}\n")
  file.write("export HPA_CHUNK_SIZE=#{@hpaChunkSize}\n")
  file.write("export PODS_EMIT_STREAM_BATCH_SIZE=#{@podsEmitStreamBatchSize}\n")
  file.write("export NODES_EMIT_STREAM_BATCH_SIZE=#{@nodesEmitStreamBatchSize}\n")
  # Close file after writing all environment variables
  file.close
else
  puts "Exception while opening file for writing config environment variables"
end
puts "****************End Config Processing********************"
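
For reference, here is a minimal agent-settings sketch that this parser would accept. The section and key names are taken from the parser above; the values are illustrative and are only honored when they fall inside the min/max bounds it enforces:

[agent_settings.health_model]
  enabled = false

[agent_settings.chunk_config]
  NODES_CHUNK_SIZE = 250
  PODS_CHUNK_SIZE = 1000
  EVENTS_CHUNK_SIZE = 4000
  DEPLOYMENTS_CHUNK_SIZE = 500
  HPA_CHUNK_SIZE = 2000
  PODS_EMIT_STREAM_BATCH_SIZE = 200
  NODES_EMIT_STREAM_BATCH_SIZE = 100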
73 changes: 0 additions & 73 deletions build/linux/installer/scripts/tomlparser-health-config.rb

This file was deleted.

32 changes: 16 additions & 16 deletions charts/azuremonitor-containers/templates/omsagent-rs-configmap.yaml
@@ -95,7 +95,7 @@ data:
<match oms.containerinsights.KubePodInventory**>
type out_oms
log_level debug
- num_threads 5
+ num_threads 2
buffer_chunk_limit 4m
buffer_type file
buffer_path %STATE_DIR_WS%/out_oms_kubepods*.buffer
@@ -108,24 +108,24 @@
</match>
<match oms.containerinsights.KubePVInventory**>
type out_oms
log_level debug
num_threads 5
buffer_chunk_limit 4m
buffer_type file
buffer_path %STATE_DIR_WS%/state/out_oms_kubepv*.buffer
buffer_queue_limit 20
buffer_queue_full_action drop_oldest_chunk
flush_interval 20s
retry_limit 10
retry_wait 5s
max_retry_wait 5m
</match>
<match oms.containerinsights.KubeEvents**>
type out_oms
log_level debug
- num_threads 5
+ num_threads 2
buffer_chunk_limit 4m
buffer_type file
buffer_path %STATE_DIR_WS%/out_oms_kubeevents*.buffer
@@ -155,7 +155,7 @@
<match oms.containerinsights.KubeNodeInventory**>
type out_oms
log_level debug
- num_threads 5
+ num_threads 2
buffer_chunk_limit 4m
buffer_type file
buffer_path %STATE_DIR_WS%/state/out_oms_kubenodes*.buffer
@@ -184,7 +184,7 @@
<match oms.api.KubePerf**>
type out_oms
log_level debug
- num_threads 5
+ num_threads 2
buffer_chunk_limit 4m
buffer_type file
buffer_path %STATE_DIR_WS%/out_oms_kubeperf*.buffer
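A note on the num_threads 5 → 2 changes above: with buffer_chunk_limit 4m and buffer_queue_limit 20, each out_oms match can stage up to roughly 80MB in its file buffer, and each flush thread holds an up-to-4MB chunk in memory while sending. Cutting the flush threads from 5 to 2 therefore bounds the in-flight chunk memory per output at roughly 8MB instead of 20MB, at the cost of slower drain under load — presumably part of the intent behind the RS OOM fix.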
9 changes: 9 additions & 0 deletions charts/azuremonitor-containers/values.yaml
@@ -81,6 +81,15 @@ omsagent:
  deployment:
    affinity:
      nodeAffinity:
        # prefer scheduling onto an ephemeral OS disk node if one is available
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
            - key: storageprofile
              operator: NotIn
              values:
              - managed
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - labelSelector:
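The preferred block is a soft constraint: the scheduler favors nodes whose storageprofile label is not "managed" (i.e., ephemeral OS disk nodes, per the comment above) but still falls back to managed-disk nodes when none fit; the required block below it remains the hard placement rule.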
1 change: 1 addition & 0 deletions kubernetes/linux/Dockerfile
@@ -15,6 +15,7 @@ ENV HOST_VAR /hostfs/var
ENV AZMON_COLLECT_ENV False
ENV KUBE_CLIENT_BACKOFF_BASE 1
ENV KUBE_CLIENT_BACKOFF_DURATION 0
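# Lowering this factor from Ruby's 2.0 default makes major GC trigger sooner as old-generation objects accumulate, trading some GC CPU for a lower resident-memory ceiling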
ENV RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR 0.9
RUN /usr/bin/apt-get update && /usr/bin/apt-get install -y libc-bin wget openssl curl sudo python-ctypes init-system-helpers net-tools rsyslog cron vim dmidecode apt-transport-https gnupg && rm -rf /var/lib/apt/lists/*
COPY setup.sh main.sh defaultpromenvvariables defaultpromenvvariables-rs mdsd.xml envmdsd $tmpdir/
WORKDIR ${tmpdir}
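A quick way to verify the setting took effect in a running replicaset pod (an illustrative check; substitute your actual pod name):

kubectl exec <omsagent-rs-pod> -n kube-system -- /opt/microsoft/omsagent/ruby/bin/ruby -e 'puts ENV["RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR"]'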
16 changes: 8 additions & 8 deletions kubernetes/linux/main.sh
@@ -171,14 +171,14 @@ done
source config_env_var


-#Parse the configmap to set the right environment variables for health feature.
-/opt/microsoft/omsagent/ruby/bin/ruby tomlparser-health-config.rb
+#Parse the configmap to set the right environment variables for agent config.
+/opt/microsoft/omsagent/ruby/bin/ruby tomlparser-agent-config.rb

-cat health_config_env_var | while read line; do
+cat agent_config_env_var | while read line; do
#echo $line
echo $line >> ~/.bashrc
done
-source health_config_env_var
+source agent_config_env_var

#Parse the configmap to set the right environment variables for network policy manager (npm) integration.
/opt/microsoft/omsagent/ruby/bin/ruby tomlparser-npm-config.rb
@@ -429,7 +429,7 @@ echo "export DOCKER_CIMPROV_VERSION=$DOCKER_CIMPROV_VERSION" >> ~/.bashrc

#region check to auto-activate oneagent, to route container logs,
#Intent is to activate oneagent routing for all managed clusters with region in the region list, unless overridden by configmap
# AZMON_CONTAINER_LOGS_ROUTE will have route (if any) specified in the config map
# AZMON_CONTAINER_LOGS_EFFECTIVE_ROUTE will have the final route that we compute & set, based on our region list logic
echo "************start oneagent log routing checks************"
# by default, use configmap route for safer side
@@ -462,9 +462,9 @@ else
echo "current region is not in oneagent regions..."
fi

if [ "$isoneagentregion" = true ]; then
if [ "$isoneagentregion" = true ]; then
#if configmap has a routing for logs, but current region is in the oneagent region list, take the configmap route
if [ ! -z $AZMON_CONTAINER_LOGS_ROUTE ]; then
AZMON_CONTAINER_LOGS_EFFECTIVE_ROUTE=$AZMON_CONTAINER_LOGS_ROUTE
echo "oneagent region is true for current region:$currentregion and config map logs route is not empty. so using config map logs route as effective route:$AZMON_CONTAINER_LOGS_EFFECTIVE_ROUTE"
else #there is no configmap route, so route thru oneagent
@@ -511,7 +511,7 @@ if [ ! -e "/etc/config/kube.conf" ]; then

echo "starting mdsd ..."
mdsd -l -e ${MDSD_LOG}/mdsd.err -w ${MDSD_LOG}/mdsd.warn -o ${MDSD_LOG}/mdsd.info -q ${MDSD_LOG}/mdsd.qos &

touch /opt/AZMON_CONTAINER_LOGS_EFFECTIVE_ROUTE_V2
fi
fi