Gangams/fix rs ooming (#473)
* optimize kpi

* optimize kube node inventory

* add flags for events, deployments and hpa

* have separate function parseNodeLimits

* refactor code

* fix crash

* fix bug with service name

* fix bugs related to get service name

* update oom fix test agent

* debug logs

* fix service label issue

* update to latest agent and enable ephemeral annotation

* change stream size to 200 from 250

* update yaml

* adjust chunksizes

* add ruby gc env

* yaml changes for cioomtest11282020-3

* telemetry to track pods latency

* service count telemetry

* rename variables

* wip

* nodes inventory telemetry

* configmap changes

* add emit streams in configmap

* yaml updates

* fix copy and paste bug

* add todo comments

* fix node latency telemetry bug

* update yaml with latest test image

* fix bug

* upping rs memory change

* fix mdm bug with final emit stream

* update to latest image

* fix pr feedback

* fix pr feedback

* rename health config to agent config

* fix max allowed hpa chunk size

* update to use 1k pod chunk since validated on 1.18+

* remove debug logs

* minor updates

* move defaults to common place

* chart updates

* final oomfix agent

* update to use prod image so that can be validated with build pipeline

* fix typo in comment
ganga1980 authored Dec 16, 2020
1 parent 9061201 commit 064bc06
Showing 14 changed files with 1,534 additions and 1,166 deletions.
2 changes: 1 addition & 1 deletion build/linux/installer/datafiles/base_container.data
@@ -123,7 +123,7 @@ MAINTAINER: 'Microsoft Corporation'
/opt/tomlparser-mdm-metrics-config.rb; build/linux/installer/scripts/tomlparser-mdm-metrics-config.rb; 755; root; root
/opt/tomlparser-metric-collection-config.rb; build/linux/installer/scripts/tomlparser-metric-collection-config.rb; 755; root; root

-/opt/tomlparser-health-config.rb; build/linux/installer/scripts/tomlparser-health-config.rb; 755; root; root
+/opt/tomlparser-agent-config.rb; build/linux/installer/scripts/tomlparser-agent-config.rb; 755; root; root
/opt/tomlparser.rb; build/common/installer/scripts/tomlparser.rb; 755; root; root
/opt/td-agent-bit-conf-customizer.rb; build/common/installer/scripts/td-agent-bit-conf-customizer.rb; 755; root; root
/opt/ConfigParseErrorLogger.rb; build/common/installer/scripts/ConfigParseErrorLogger.rb; 755; root; root
172 changes: 172 additions & 0 deletions build/linux/installer/scripts/tomlparser-agent-config.rb
@@ -0,0 +1,172 @@
#!/usr/local/bin/ruby

# This should be require_relative on Linux and require on Windows, since tomlrb is a gem install on Windows
@os_type = ENV["OS_TYPE"]
if !@os_type.nil? && !@os_type.empty? && @os_type.strip.casecmp("windows") == 0
  require "tomlrb"
else
  require_relative "tomlrb"
end

require_relative "ConfigParseErrorLogger"

@configMapMountPath = "/etc/config/settings/agent-settings"
@configSchemaVersion = ""
@enable_health_model = false

# 250 node items (15KB per node) amount to approximately 4MB
@nodesChunkSize = 250
# 1000 pods (10KB per pod) amount to approximately 10MB
@podsChunkSize = 1000
# 4000 events (1KB per event) amount to approximately 4MB
@eventsChunkSize = 4000
# roughly each deployment is 8KB
# 500 deployments amount to approximately 4MB
@deploymentsChunkSize = 500
# roughly each HPA is 3KB
# 2000 HPAs amount to approximately 6-7MB
@hpaChunkSize = 2000
# stream batch sizes to avoid large file writes
# too low a value consumes higher disk IOPS
@podsEmitStreamBatchSize = 200
@nodesEmitStreamBatchSize = 100

# The higher the chunk size, the higher the RS pod memory consumption and the lower the API latency;
# lower values reduce memory consumption but incur additional round-trip latency.
# These need to be tuned based on the workload.
# nodes
@nodesChunkSizeMin = 100
@nodesChunkSizeMax = 400
# pods
@podsChunkSizeMin = 250
@podsChunkSizeMax = 1500
# events
@eventsChunkSizeMin = 2000
@eventsChunkSizeMax = 10000
# deployments
@deploymentsChunkSizeMin = 500
@deploymentsChunkSizeMax = 1000
# hpa
@hpaChunkSizeMin = 500
@hpaChunkSizeMax = 2000

# emit stream batch sizes to prevent values that are too low, which cost disk I/O
# the max is capped at the corresponding chunk size
@podsEmitStreamBatchSizeMin = 50
@nodesEmitStreamBatchSizeMin = 50

# Returns true if the value can be parsed as an integer
def is_number?(value)
  true if Integer(value) rescue false
end

# Use parser to parse the configmap toml file to a ruby structure
def parseConfigMap
  begin
    # Check to see if config map is created
    if (File.file?(@configMapMountPath))
      puts "config::configmap container-azm-ms-agentconfig for agent settings mounted, parsing values"
      parsedConfig = Tomlrb.load_file(@configMapMountPath, symbolize_keys: true)
      puts "config::Successfully parsed mounted config map"
      return parsedConfig
    else
      puts "config::configmap container-azm-ms-agentconfig for agent settings not mounted, using defaults"
      return nil
    end
  rescue => errorStr
    ConfigParseErrorLogger.logError("Exception while parsing config map for agent settings: #{errorStr}, using defaults, please check config map for errors")
    return nil
  end
end

# Use the ruby structure created after config parsing to set the right values to be used as environment variables
def populateSettingValuesFromConfigMap(parsedConfig)
  begin
    if !parsedConfig.nil? && !parsedConfig[:agent_settings].nil?
      if !parsedConfig[:agent_settings][:health_model].nil? && !parsedConfig[:agent_settings][:health_model][:enabled].nil?
        @enable_health_model = parsedConfig[:agent_settings][:health_model][:enabled]
        puts "enable_health_model = #{@enable_health_model}"
      end
      chunk_config = parsedConfig[:agent_settings][:chunk_config]
      if !chunk_config.nil?
        nodesChunkSize = chunk_config[:NODES_CHUNK_SIZE]
        if !nodesChunkSize.nil? && is_number?(nodesChunkSize) && (@nodesChunkSizeMin..@nodesChunkSizeMax) === nodesChunkSize.to_i
          @nodesChunkSize = nodesChunkSize.to_i
          puts "Using config map value: NODES_CHUNK_SIZE = #{@nodesChunkSize}"
        end

        podsChunkSize = chunk_config[:PODS_CHUNK_SIZE]
        if !podsChunkSize.nil? && is_number?(podsChunkSize) && (@podsChunkSizeMin..@podsChunkSizeMax) === podsChunkSize.to_i
          @podsChunkSize = podsChunkSize.to_i
          puts "Using config map value: PODS_CHUNK_SIZE = #{@podsChunkSize}"
        end

        eventsChunkSize = chunk_config[:EVENTS_CHUNK_SIZE]
        if !eventsChunkSize.nil? && is_number?(eventsChunkSize) && (@eventsChunkSizeMin..@eventsChunkSizeMax) === eventsChunkSize.to_i
          @eventsChunkSize = eventsChunkSize.to_i
          puts "Using config map value: EVENTS_CHUNK_SIZE = #{@eventsChunkSize}"
        end

        deploymentsChunkSize = chunk_config[:DEPLOYMENTS_CHUNK_SIZE]
        if !deploymentsChunkSize.nil? && is_number?(deploymentsChunkSize) && (@deploymentsChunkSizeMin..@deploymentsChunkSizeMax) === deploymentsChunkSize.to_i
          @deploymentsChunkSize = deploymentsChunkSize.to_i
          puts "Using config map value: DEPLOYMENTS_CHUNK_SIZE = #{@deploymentsChunkSize}"
        end

        hpaChunkSize = chunk_config[:HPA_CHUNK_SIZE]
        if !hpaChunkSize.nil? && is_number?(hpaChunkSize) && (@hpaChunkSizeMin..@hpaChunkSizeMax) === hpaChunkSize.to_i
          @hpaChunkSize = hpaChunkSize.to_i
          puts "Using config map value: HPA_CHUNK_SIZE = #{@hpaChunkSize}"
        end

        podsEmitStreamBatchSize = chunk_config[:PODS_EMIT_STREAM_BATCH_SIZE]
        if !podsEmitStreamBatchSize.nil? && is_number?(podsEmitStreamBatchSize) &&
           podsEmitStreamBatchSize.to_i <= @podsChunkSize && podsEmitStreamBatchSize.to_i >= @podsEmitStreamBatchSizeMin
          @podsEmitStreamBatchSize = podsEmitStreamBatchSize.to_i
          puts "Using config map value: PODS_EMIT_STREAM_BATCH_SIZE = #{@podsEmitStreamBatchSize}"
        end

        nodesEmitStreamBatchSize = chunk_config[:NODES_EMIT_STREAM_BATCH_SIZE]
        if !nodesEmitStreamBatchSize.nil? && is_number?(nodesEmitStreamBatchSize) &&
           nodesEmitStreamBatchSize.to_i <= @nodesChunkSize && nodesEmitStreamBatchSize.to_i >= @nodesEmitStreamBatchSizeMin
          @nodesEmitStreamBatchSize = nodesEmitStreamBatchSize.to_i
          puts "Using config map value: NODES_EMIT_STREAM_BATCH_SIZE = #{@nodesEmitStreamBatchSize}"
        end
      end
    end
  rescue => errorStr
    puts "config::error:Exception while reading config settings for agent configuration setting - #{errorStr}, using defaults"
    @enable_health_model = false
  end
end

@configSchemaVersion = ENV["AZMON_AGENT_CFG_SCHEMA_VERSION"]
puts "****************Start Config Processing********************"
if !@configSchemaVersion.nil? && !@configSchemaVersion.empty? && @configSchemaVersion.strip.casecmp("v1") == 0 # note: v1 is the only supported schema version, so it is hardcoded
  configMapSettings = parseConfigMap
  if !configMapSettings.nil?
    populateSettingValuesFromConfigMap(configMapSettings)
  end
else
  if (File.file?(@configMapMountPath))
    ConfigParseErrorLogger.logError("config::unsupported/missing config schema version - '#{@configSchemaVersion}', using defaults, please use a supported schema version")
  end
  @enable_health_model = false
end

# Write the settings to file, so that they can be set as environment variables
file = File.open("agent_config_env_var", "w")

if !file.nil?
  file.write("export AZMON_CLUSTER_ENABLE_HEALTH_MODEL=#{@enable_health_model}\n")
  file.write("export NODES_CHUNK_SIZE=#{@nodesChunkSize}\n")
  file.write("export PODS_CHUNK_SIZE=#{@podsChunkSize}\n")
  file.write("export EVENTS_CHUNK_SIZE=#{@eventsChunkSize}\n")
  file.write("export DEPLOYMENTS_CHUNK_SIZE=#{@deploymentsChunkSize}\n")
  file.write("export HPA_CHUNK_SIZE=#{@hpaChunkSize}\n")
  file.write("export PODS_EMIT_STREAM_BATCH_SIZE=#{@podsEmitStreamBatchSize}\n")
  file.write("export NODES_EMIT_STREAM_BATCH_SIZE=#{@nodesEmitStreamBatchSize}\n")
  # Close file after writing all environment variables
  file.close
else
  puts "Exception while opening file for writing config environment variables"
end
puts "****************End Config Processing********************"
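
For reference, here is a minimal agent-settings sketch that this parser would accept. The section and key names are taken from the parser above; the values are illustrative and are only honored when they fall inside the min/max bounds it enforces:

[agent_settings.health_model]
  enabled = false

[agent_settings.chunk_config]
  NODES_CHUNK_SIZE = 250
  PODS_CHUNK_SIZE = 1000
  EVENTS_CHUNK_SIZE = 4000
  DEPLOYMENTS_CHUNK_SIZE = 500
  HPA_CHUNK_SIZE = 2000
  PODS_EMIT_STREAM_BATCH_SIZE = 200
  NODES_EMIT_STREAM_BATCH_SIZE = 100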
73 changes: 0 additions & 73 deletions build/linux/installer/scripts/tomlparser-health-config.rb

This file was deleted.

32 changes: 16 additions & 16 deletions charts/azuremonitor-containers/templates/omsagent-rs-configmap.yaml
@@ -95,7 +95,7 @@ data:
<match oms.containerinsights.KubePodInventory**>
type out_oms
log_level debug
- num_threads 5
+ num_threads 2
buffer_chunk_limit 4m
buffer_type file
buffer_path %STATE_DIR_WS%/out_oms_kubepods*.buffer
@@ -108,24 +108,24 @@
</match>
<match oms.containerinsights.KubePVInventory**>
type out_oms
log_level debug
num_threads 5
buffer_chunk_limit 4m
buffer_type file
buffer_path %STATE_DIR_WS%/state/out_oms_kubepv*.buffer
buffer_queue_limit 20
buffer_queue_full_action drop_oldest_chunk
flush_interval 20s
retry_limit 10
retry_wait 5s
max_retry_wait 5m
</match>
<match oms.containerinsights.KubeEvents**>
type out_oms
log_level debug
- num_threads 5
+ num_threads 2
buffer_chunk_limit 4m
buffer_type file
buffer_path %STATE_DIR_WS%/out_oms_kubeevents*.buffer
@@ -155,7 +155,7 @@
<match oms.containerinsights.KubeNodeInventory**>
type out_oms
log_level debug
- num_threads 5
+ num_threads 2
buffer_chunk_limit 4m
buffer_type file
buffer_path %STATE_DIR_WS%/state/out_oms_kubenodes*.buffer
@@ -184,7 +184,7 @@
<match oms.api.KubePerf**>
type out_oms
log_level debug
- num_threads 5
+ num_threads 2
buffer_chunk_limit 4m
buffer_type file
buffer_path %STATE_DIR_WS%/out_oms_kubeperf*.buffer
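A note on the num_threads 5 → 2 changes above: with buffer_chunk_limit 4m and buffer_queue_limit 20, each out_oms match can stage up to roughly 80MB in its file buffer, and each flush thread holds an up-to-4MB chunk in memory while sending. Cutting the flush threads from 5 to 2 therefore bounds the in-flight chunk memory per output at roughly 8MB instead of 20MB, at the cost of slower drain under load — presumably part of the intent behind the RS OOM fix.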
9 changes: 9 additions & 0 deletions charts/azuremonitor-containers/values.yaml
@@ -81,6 +81,15 @@ omsagent:
  deployment:
    affinity:
      nodeAffinity:
        # prefer scheduling onto an ephemeral OS disk node if one is available
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
            - key: storageprofile
              operator: NotIn
              values:
              - managed
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - labelSelector:
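The preferred block is a soft constraint: the scheduler favors nodes whose storageprofile label is not "managed" (i.e., ephemeral OS disk nodes, per the comment above) but still falls back to managed-disk nodes when none fit; the required block below it remains the hard placement rule.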
1 change: 1 addition & 0 deletions kubernetes/linux/Dockerfile
@@ -15,6 +15,7 @@ ENV HOST_VAR /hostfs/var
ENV AZMON_COLLECT_ENV False
ENV KUBE_CLIENT_BACKOFF_BASE 1
ENV KUBE_CLIENT_BACKOFF_DURATION 0
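# Lowering this factor from Ruby's 2.0 default makes major GC trigger sooner as old-generation objects accumulate, trading some GC CPU for a lower resident-memory ceiling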
ENV RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR 0.9
RUN /usr/bin/apt-get update && /usr/bin/apt-get install -y libc-bin wget openssl curl sudo python-ctypes init-system-helpers net-tools rsyslog cron vim dmidecode apt-transport-https gnupg && rm -rf /var/lib/apt/lists/*
COPY setup.sh main.sh defaultpromenvvariables defaultpromenvvariables-rs mdsd.xml envmdsd $tmpdir/
WORKDIR ${tmpdir}
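A quick way to verify the setting took effect in a running replicaset pod (an illustrative check; substitute your actual pod name):

kubectl exec <omsagent-rs-pod> -n kube-system -- /opt/microsoft/omsagent/ruby/bin/ruby -e 'puts ENV["RUBY_GC_HEAP_OLDOBJECT_LIMIT_FACTOR"]'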
16 changes: 8 additions & 8 deletions kubernetes/linux/main.sh
@@ -171,14 +171,14 @@ done
source config_env_var


-#Parse the configmap to set the right environment variables for health feature.
-/opt/microsoft/omsagent/ruby/bin/ruby tomlparser-health-config.rb
+#Parse the configmap to set the right environment variables for agent config.
+/opt/microsoft/omsagent/ruby/bin/ruby tomlparser-agent-config.rb

-cat health_config_env_var | while read line; do
+cat agent_config_env_var | while read line; do
#echo $line
echo $line >> ~/.bashrc
done
-source health_config_env_var
+source agent_config_env_var

#Parse the configmap to set the right environment variables for network policy manager (npm) integration.
/opt/microsoft/omsagent/ruby/bin/ruby tomlparser-npm-config.rb
@@ -429,7 +429,7 @@ echo "export DOCKER_CIMPROV_VERSION=$DOCKER_CIMPROV_VERSION" >> ~/.bashrc

#region check to auto-activate oneagent, to route container logs,
#Intent is to activate oneagent routing for all managed clusters with region in the region list, unless overridden by configmap
# AZMON_CONTAINER_LOGS_ROUTE will have route (if any) specified in the config map
# AZMON_CONTAINER_LOGS_EFFECTIVE_ROUTE will have the final route that we compute & set, based on our region list logic
echo "************start oneagent log routing checks************"
# by default, use configmap route for safer side
@@ -462,9 +462,9 @@ else
echo "current region is not in oneagent regions..."
fi

if [ "$isoneagentregion" = true ]; then
if [ "$isoneagentregion" = true ]; then
#if configmap has a routing for logs, but current region is in the oneagent region list, take the configmap route
if [ ! -z $AZMON_CONTAINER_LOGS_ROUTE ]; then
AZMON_CONTAINER_LOGS_EFFECTIVE_ROUTE=$AZMON_CONTAINER_LOGS_ROUTE
echo "oneagent region is true for current region:$currentregion and config map logs route is not empty. so using config map logs route as effective route:$AZMON_CONTAINER_LOGS_EFFECTIVE_ROUTE"
else #there is no configmap route, so route thru oneagent
@@ -511,7 +511,7 @@ if [ ! -e "/etc/config/kube.conf" ]; then

echo "starting mdsd ..."
mdsd -l -e ${MDSD_LOG}/mdsd.err -w ${MDSD_LOG}/mdsd.warn -o ${MDSD_LOG}/mdsd.info -q ${MDSD_LOG}/mdsd.qos &

touch /opt/AZMON_CONTAINER_LOGS_EFFECTIVE_ROUTE_V2
fi
fi