Smart Gateway Performance Enhancement Work (#109)
* Changes to match new internal SG metrics names

* Changing exported_instance to host

* plugin_instance relabelling no longer being done in sg2

* type becomes type_instance

* Applied relevant changes to the new alerts

* Fixing plugin vs type instance on memory alarm

* Adjusted order of labels in smoketest

* Need longer interval for rates with default 10s scrape interval

* OCP metrics label change from pod_name to pod

* Raise the linkCapacity of QDR->bridge

* Adds a new edgeListener on 5673 with linkCapacity 25000
* Adjusts metrics SG bridge to connect to new listener
* Minimizes presettled metrics loss during throughput testing
  * Should also provide performance improvements in unsettled mode bursts
* Other SG modes can keep using 5672 until we converge the SG code

* New CRD and CSV for v1.0.3

* Updated scorecard paths

* Added amqpDataSource to metrics SG template

* This is so I can test changes here:
infrawatch/smart-gateway-operator#54

* In preparation to merge with:
https://github.com/infrawatch/service-telemetry-operator/pull/93/files#diff-33675527951f20f9727fb4be5c84a746R8

* Putting quickstart.sh back until build/run-ci.yaml is a true replacement

* No ability to deploy published artifacts without building
* No support for ephemeral storage

* Adding back quickstart configs

* Setup SG3 CI system (#107)

* Enhance CI automation (#106)

* Add ServiceTelemetry overrides

Allows ServiceTelemetry overrides to be expressed via Ansible extra-vars. Adds the four
main overrides you would expect in a ServiceTelemetry object, with appropriate defaults
set.

Will also allow passing in the service_telemetry_manifest as a whole object like what we
do with the Service Telemetry Operator.

* Allow for per-repo branch overrides

Allow for per-repo branch overrides for the Smart Gateway Operator and Smart Gateway
repositories via sgo_branch and sg_branch (respectively).

* Add functionality around quickstart.sh

Restores some of the functionality that was lost when quickstart.sh was dropped,
and adds some new functionality on top of it.

* Fix syntax error

* Add back quickstart.sh

Adds back a quickstart.sh that simulates the same result as the old
quickstart.sh

* Better CSV modification support

Also adds some tags to make skipping over builds for testing much
easier.

* Debugging ci.yml firing

* Make sure namespace is set before using it

* Fix syntax and documentation

* Test locally first kids

* Drop CI debug lines

* Clean up working repo clones

* Copy CSV into working directory

On subsequent runs, modifying the CSV in place inside the cloned repository can cause
issues, both in the development environment and on re-runs of the CI system. Copying the
CSV out of the repository into a working location and modifying that copy instead results
in a cleaner setup.

By doing the copy of the CSV files, we can drop the need to force clone the supporting
repositories.

Also cleans up some shell commands that were commented out now that they are being
dealt with via the replace module. Removes the extra commands added to ci.yml.

* Changes to infrared-openstack.sh for OSP13 (#102)

* Migrate OSP16 script to OSP13 directory

Uses a multi-cloud stf-connectors.yaml style configuration which directly loads
the resource lists rather than a list of environment files. Uses the same script
as used in OSP16 but subs out the network configuration for a vlan type setup and the
latest paths for async puddle.

* Working deployment of OSP13

* Migrate changes to align to existing docs

Update PR to align to existing documentation and testing the group has been working on.
Adjust the stf-connectors.yaml.template to better reflect what we've been testing.

Deployment by default will result in presettle: true, which is bad for the reliability of
message delivery.

* Get closer alignment to OSP16 setup

* Enable deployment of metric SG for Ceilometer data (#93)

* Enable deployment of metric SG for Ceilometer data

Depends-On: infrawatch/smart-gateway#83
Depends-On: infrawatch/smart-gateway-operator#48

* Add smoketest for Ceilometer data

* Listen on correct channel

* Ceilometer smoketest tuning

Makes smoketest_ceilometer_entrypoint.sh run during the smoketest job.

* Use data source setting for metrics too

* Finish ceilometer events smoke test

* Do not use hardcoded timestamps

* Finish ceilometer metrics smoketests

* Increase timeout

* Add container names

* Validate also Ceilometer metrics SG

* Update tests/smoketest/smoketest_ceilometer_entrypoint.sh

* Update tests/smoketest/smoketest_collectd_entrypoint.sh

Co-authored-by: Martin Magr <[email protected]>
Co-authored-by: Leif Madsen <[email protected]>

* Implement CI updates for SG3

* Correct value for SG bridge image path

* Lock operator-courier to 2.1.7

Lock operator-courier to 2.1.7 until we can figure out what is wrong with our CSV/CRD setup
or until the operator-courier issue noted in the related issue is resolved.

Related: #108

* Update build/stf-run-ci/tasks/main.yml

Co-authored-by: Chris Sibbitt <[email protected]>

* Adjust README to match run-ci.yaml methods

Co-authored-by: Martin Mágr <[email protected]>
Co-authored-by: Martin Magr <[email protected]>
Co-authored-by: Chris Sibbitt <[email protected]>

4 people authored Jul 13, 2020
1 parent 433e403 commit 0bd9498
Showing 19 changed files with 565 additions and 98 deletions.
6 changes: 3 additions & 3 deletions .osdk-scorecard.yaml
@@ -1,13 +1,13 @@
scorecard:
version: v1alpha2
output: text
bundle: deploy/olm-catalog/service-telemetry-operator/1.0.2/metadata
bundle: deploy/olm-catalog/service-telemetry-operator/1.0.3/metadata
plugins:
- basic:
cr-manifest:
- "deploy/crds/infra.watch_v1alpha1_servicetelemetry_cr.yaml"
csv-path: "deploy/olm-catalog/service-telemetry-operator/1.0.2/service-telemetry-operator.v1.0.2.clusterserviceversion.yaml"
csv-path: "deploy/olm-catalog/service-telemetry-operator/1.0.3/service-telemetry-operator.v1.0.3.clusterserviceversion.yaml"
- olm:
cr-manifest:
- "deploy/crds/infra.watch_v1alpha1_servicetelemetry_cr.yaml"
csv-path: "deploy/olm-catalog/service-telemetry-operator/1.0.2/service-telemetry-operator.v1.0.2.clusterserviceversion.yaml"
csv-path: "deploy/olm-catalog/service-telemetry-operator/1.0.3/service-telemetry-operator.v1.0.3.clusterserviceversion.yaml"
2 changes: 1 addition & 1 deletion .travis.yml
@@ -7,7 +7,7 @@ language: python
git:
depth: 1
install:
- pip install operator-courier
- pip install operator-courier==2.1.7
- pip install ansible-lint
script:
- operator-courier verify --ui_validate_io deploy/olm-catalog/service-telemetry-operator
14 changes: 10 additions & 4 deletions README.md
@@ -46,14 +46,20 @@ The quickest way to start up Service Telemetry Framework for development is to
run the `quickstart.sh` script located in the `deploy/` directory after starting
up a [CodeReady Containers](https://github.com/code-ready/crc) environment.

core operator code like this:
To deploy a local build of the Service Telemetry Operator itself, start by
running `ansible-playbook build/run-ci.yaml`. If you have code to coordinate
across the supporting InfraWatch repositories, you can pass the
`working_branch` parameter to the `--extra-vars` flag like so:

```shell
./build/build.sh &&\
./build/push_container2ocp.sh &&\
oc delete po -l name=service-telemetry-operator
ansible-playbook \
--extra-vars working_branch="username-new_feature" \
build/run-ci.yaml
```

Additional flags for overriding various branch and path names are documented in
`build/stf-run-ci/README.md`.

## CI

### Travis
8 changes: 5 additions & 3 deletions build/stf-run-ci/README.md
@@ -1,5 +1,5 @@
stf-run-ci
=========
==========

Run the Service Telemetry Framework CI system. This role is intended to be
called from a playbook running locally on a preconfigured test system.
@@ -9,7 +9,7 @@ Requirements
------------

- CodeReady Containers
- Ansible
- Ansible 2.9 (tested)
- `oc` command line tool

Variables
@@ -22,8 +22,10 @@ choose to override:
| ------------------------------ | ------------ | --------- | ------------------------------------ |
| `__deploy_stf` | {true,false} | true | Whether to deploy an instance of STF |
| `__local_build_enabled` | {true,false} | true | Whether to deploy STF from locally built artifacts. Also see `working_branch`, `sg_branch`, `sgo_branch` |
| `sg_branch` | <git_branch> | master | Which Smart Gateway git branch to checkout |
| `sgo_branch` | <git_branch> | master | Which Smart Gateway Operator git branch to checkout |
| `sg_branch` | <git_branch> | master | Which Smart Gateway git branch to checkout |
| `sg_core_branch` | <git_branch> | master | Which Smart Gateway Core git branch to checkout |
| `sg_bridge_branch` | <git_branch> | master | Which Smart Gateway Bridge git branch to checkout |
| `__service_telemetry_events_enabled` | {true,false} | true | Whether to enable events support in ServiceTelemetry |
| `__service_telemetry_high_availability_enabled` | {true,false} | false | Whether to enable high availability support in ServiceTelemetry |
| `__service_telemetry_metrics_enabled` | {true,false} | true | Whether to enable metrics support in ServiceTelemetry |
32 changes: 31 additions & 1 deletion build/stf-run-ci/tasks/clone_repos.yml
@@ -1,5 +1,8 @@
---
# clone our other repositories into this repo
# NOTE: since you can't loop against blocks (and we're using them for failure
# recovery when the requested branch doesn't exist) we have to define each
# of these separately rather than using a loop.
- name: Get Smart Gateway Operator
block:
- name: Try cloning same-named branch or override branch
@@ -14,7 +17,7 @@
dest: working/smart-gateway-operator
version: master

- name: Get Smart Gateway
- name: Get Smart Gateway (legacy)
block:
- name: Try cloning same-named branch or override branch
git:
@@ -28,3 +31,30 @@
dest: working/smart-gateway
version: master

- name: Get sg-core
block:
- name: Try cloning same-named branch or override branch
git:
repo: https://github.com/infrawatch/sg-core
dest: working/sg-core
version: "{{ sg_core_branch | default(branch, true) }}"
rescue:
- name: Get master branch because same-named doesn't exist
git:
repo: https://github.com/infrawatch/sg-core
dest: working/sg-core
version: master

- name: Get sg-bridge
block:
- name: Try cloning same-named branch or override branch
git:
repo: https://github.com/infrawatch/sg-bridge
dest: working/sg-bridge
version: "{{ sg_bridge_branch | default(branch, true) }}"
rescue:
- name: Get master branch because same-named doesn't exist
git:
repo: https://github.com/infrawatch/sg-bridge
dest: working/sg-bridge
version: master
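
The block/rescue pattern in the tasks above can be sketched in plain shell. This is only an illustration under stated assumptions: `pick_branch` and `clone_with_fallback` are hypothetical helper names that do not exist in this repository, and the Ansible `git` module behaves differently from `git clone` when the destination already exists.

```shell
# pick_branch OVERRIDE WORKING
# Mirrors `{{ sg_core_branch | default(branch, true) }}`: use the per-repo
# override unless it is unset or empty, otherwise fall back to the working
# branch (which itself defaults to master).
pick_branch() {
  local override="${1:-}" working="${2:-master}"
  printf '%s\n' "${override:-$working}"
}

# clone_with_fallback REPO_URL DEST BRANCH
# Try the requested branch first; if that fails (for example because the
# branch does not exist), remove any partial checkout and clone master,
# which is roughly what the rescue section does.
clone_with_fallback() {
  local repo="$1" dest="$2" branch="$3"
  git clone --quiet --branch "$branch" "$repo" "$dest" 2>/dev/null && return 0
  rm -rf "${dest:?}"
  git clone --quiet --branch master "$repo" "$dest"
}

pick_branch "" "username-new_feature"             # prints username-new_feature
pick_branch "sg-core-fix" "username-new_feature"  # prints sg-core-fix
```

A call such as `clone_with_fallback https://github.com/infrawatch/sg-core working/sg-core "$(pick_branch "$sg_core_branch" "$branch")"` would then approximate the "Get sg-core" task.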
6 changes: 5 additions & 1 deletion build/stf-run-ci/tasks/main.yml
@@ -5,11 +5,13 @@
branch: "{{ working_branch | default('master') }}"
namespace: "{{ working_namespace | default('service-telemetry') }}"

- name: Set default image paths when skipping local builds
- name: Set default image paths for local builds
set_fact:
sg_image_path: image-registry.openshift-image-registry.svc:5000/{{ namespace }}/smart-gateway:latest
sgo_image_path: image-registry.openshift-image-registry.svc:5000/{{ namespace }}/smart-gateway-operator:latest
sto_image_path: image-registry.openshift-image-registry.svc:5000/{{ namespace }}/service-telemetry-operator:latest
sg_core_image_path: image-registry.openshift-image-registry.svc:5000/{{ namespace }}/sg-core:latest
sg_bridge_image_path: image-registry.openshift-image-registry.svc:5000/{{ namespace }}/sg-bridge:latest

- block:
- name: Setup supporting repositories
@@ -28,6 +30,8 @@
- { name: service-telemetry-operator, dockerfile_path: build/Dockerfile, image_reference_name: sto_image_path, working_build_dir: ../ }
- { name: smart-gateway-operator, dockerfile_path: build/Dockerfile, image_reference_name: sgo_image_path, working_build_dir: ./working/smart-gateway-operator }
- { name: smart-gateway, dockerfile_path: Dockerfile, image_reference_name: sg_image_path, working_build_dir: ./working/smart-gateway }
- { name: sg-core, dockerfile_path: build/Dockerfile, image_reference_name: sg_core_image_path, working_build_dir: ./working/sg-core }
- { name: sg-bridge, dockerfile_path: build/Dockerfile, image_reference_name: sg_bridge_image_path, working_build_dir: ./working/sg-bridge }
loop_control:
loop_var: artifact
tags:
14 changes: 13 additions & 1 deletion build/stf-run-ci/tasks/setup_stf_local_build.yml
@@ -29,12 +29,24 @@
- name: Copy SGO CSV to working directory
command: cp working/smart-gateway-operator/deploy/olm-catalog/smart-gateway-operator/{{ sgo_current_csv.stdout }}/smart-gateway-operator.v{{ sgo_current_csv.stdout }}.clusterserviceversion.yaml working/

- name: Replace SG image path in SGO CSV
- name: Replace SG (legacy) image path in SGO CSV
replace:
path: working/smart-gateway-operator.v{{ sgo_current_csv.stdout }}.clusterserviceversion.yaml
regexp: '(\s+)value: quay\.io/infrawatch/smart-gateway\:.+$'
replace: '\1value: {{ sg_image_path }}'

- name: Replace SG core image path in SGO CSV
replace:
path: working/smart-gateway-operator.v{{ sgo_current_csv.stdout }}.clusterserviceversion.yaml
regexp: '(\s+)value: quay\.io/infrawatch/sg-core\:.+$'
replace: '\1value: {{ sg_core_image_path }}'

- name: Replace SG bridge image path in SGO CSV
replace:
path: working/smart-gateway-operator.v{{ sgo_current_csv.stdout }}.clusterserviceversion.yaml
regexp: '(\s+)value: quay\.io/infrawatch/sg-bridge\:.+$'
replace: '\1value: {{ sg_bridge_image_path }}'

- name: Replace SGO image path in SGO CSV
replace:
path: working/smart-gateway-operator.v{{ sgo_current_csv.stdout }}.clusterserviceversion.yaml
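
To see what the `replace` tasks in this hunk do to a CSV line, here is a rough `sed` equivalent of the same regular expression. It is a sketch only: `replace_image` is a hypothetical helper, the sample input line is made up, and the playbook itself uses the Ansible `replace` module rather than `sed`.

```shell
# replace_image NAME NEW_PATH
# Reads CSV text on stdin and rewrites any "value: quay.io/infrawatch/NAME:<tag>"
# entry to NEW_PATH, keeping the leading indentation (the \1 group).
replace_image() {
  local name="$1" new_path="$2"
  sed -E "s#([[:space:]]+)value: quay\.io/infrawatch/${name}:.+\$#\1value: ${new_path}#"
}

# Example input line (illustrative, not copied from the real CSV).
printf '            value: quay.io/infrawatch/sg-core:latest\n' |
  replace_image sg-core \
    "image-registry.openshift-image-registry.svc:5000/service-telemetry/sg-core:latest"
```

Because the pattern requires a colon directly after the image name, the `smart-gateway` rule does not touch `smart-gateway-operator` entries, which is why each image gets its own task.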
16 changes: 8 additions & 8 deletions deploy/alerts/alerts.yaml
@@ -139,47 +139,47 @@ spec:
annotations:
summary: CPU usage high (warning)
expr: >-
sum without(cpu,type) (collectd_cpu_percent{type=~"user|system"}) / sum without(cpu,type) (collectd_cpu_percent{type="idle"}) > 0.5
sum without(plugin_instance,type_instance) (collectd_cpu_percent{type_instance=~"user|system"}) / sum without(plugin_instance,type_instance) (collectd_cpu_percent{type_instance="idle"}) > 0.5
for: 10m
- alert: high cpu
labels:
severity: critical
annotations:
summary: CPU usage high (critical)
expr: >-
sum without(cpu,type) (collectd_cpu_percent{type=~"user|system"}) / sum without(cpu,type) (collectd_cpu_percent{type="idle"}) > 0.7
sum without(plugin_instance,type_instance) (collectd_cpu_percent{type_instance=~"user|system"}) / sum without(plugin_instance,type_instance) (collectd_cpu_percent{type_instance="idle"}) > 0.7
for: 10m
- alert: inode usage
labels:
severity: warning
annotations:
summary: Inodes usage (warning)
expr: >-
sum without (endpoint,service,type) (collectd_df_df_inodes{df="root",type="used"})/ (sum without (endpoint,service,type) (collectd_df_df_inodes{df="root",type=~"free|used"})) > 0.6
sum without (endpoint,service,type_instance) (collectd_df_df_inodes{plugin_instance="root",type_instance="used"})/ (sum without (endpoint,service,type_instance) (collectd_df_df_inodes{plugin_instance="root",type_instance=~"free|used"})) > 0.6
for: 10m
- alert: inode usage
labels:
severity: critical
annotations:
summary: Inodes usage (critical)
expr: >-
sum without (endpoint,service,type) (collectd_df_df_inodes{df="root",type="used"})/ (sum without (endpoint,service,type) (collectd_df_df_inodes{df="root",type=~"free|used"})) > 0.8
sum without (endpoint,service,type_instance) (collectd_df_df_inodes{plugin_instance="root",type_instance="used"})/ (sum without (endpoint,service,type_instance) (collectd_df_df_inodes{plugin_instance="root",type_instance=~"free|used"})) > 0.8
for: 10m
- alert: hugepages
labels:
severity: warning
annotations:
summary: Hugepages (warning)
expr: >-
sum without (type) (collectd_hugepages_vmpage_number{type="free"})/ sum without (type) (collectd_hugepages_vmpage_number) < 0.2
sum without (type_instance) (collectd_hugepages_vmpage_number{type_instance="free"})/ sum without (type_instance) (collectd_hugepages_vmpage_number) < 0.2
for: 10m
- alert: hugepages
labels:
severity: critical
annotations:
summary: Hugepages (critical)
expr: >-
sum without (type) (collectd_hugepages_vmpage_number{type="free"})/ sum without (type) (collectd_hugepages_vmpage_number) < 0.1
sum without (type_instance) (collectd_hugepages_vmpage_number{type_instance="free"})/ sum without (type_instance) (collectd_hugepages_vmpage_number) < 0.1
for: 10m
- alert: load longterm
labels:
@@ -235,13 +235,13 @@ spec:
annotations:
summary: memory low (warning)
expr: >-
sum without(memory) (collectd_memory{memory="used"})/sum without(memory) (collectd_memory) > 0.8 and sum without(memory) (collectd_memory{memory="used"})/sum without(memory) (collectd_memory) < 0.9
sum without(type_instance) (collectd_memory{type_instance="used"})/sum without(type_instance) (collectd_memory) > 0.8 and sum without(type_instance) (collectd_memory{type_instance="used"})/sum without(type_instance) (collectd_memory) < 0.9
for: 10m
- alert: memory low
labels:
severity: critical
annotations:
summary: memory low (critical)
expr: >-
sum without(memory) (collectd_memory{memory="used"})/sum without(memory) (collectd_memory) >= 0.9
sum without(type_instance) (collectd_memory{type_instance="used"})/sum without(type_instance) (collectd_memory) >= 0.9
for: 10m
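
As a worked example of the memory thresholds in the two `memory low` rules, the sketch below classifies a used/total ratio the same way: above 0.8 and below 0.9 is a warning, 0.9 or higher is critical. It ignores the `for: 10m` hold time, and the function name `memory_severity` and the sample numbers are illustrative assumptions rather than part of this commit.

```shell
# memory_severity USED TOTAL
# Prints "critical", "warning", or "ok" for the given memory figures,
# or "unknown" when TOTAL is not positive.
memory_severity() {
  awk -v used="$1" -v total="$2" 'BEGIN {
    if (total <= 0) { print "unknown"; exit }
    ratio = used / total
    if (ratio >= 0.9) {
      print "critical"
    } else if (ratio > 0.8) {
      print "warning"
    } else {
      print "ok"
    }
  }'
}

memory_severity 85 100   # prints warning
memory_severity 95 100   # prints critical
memory_severity 50 100   # prints ok
```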
8 changes: 8 additions & 0 deletions deploy/configs/default.bash
@@ -0,0 +1,8 @@
KIND_SERVICETELEMETRY="apiVersion: infra.watch/v1alpha1
kind: ServiceTelemetry
metadata:
name: stf-default
namespace: ${OCP_PROJECT}
spec:
metricsEnabled: true
eventsEnabled: true"
1 change: 1 addition & 0 deletions deploy/configs/nostf.bash
@@ -0,0 +1 @@
KIND_SERVICETELEMETRY=""
10 changes: 10 additions & 0 deletions deploy/configs/quicklab.bash
@@ -0,0 +1,10 @@
# NOTE: namespace is hardcoded because the namespace is embedded in the certs loaded by this configuration
KIND_SERVICETELEMETRY="apiVersion: infra.watch/v1alpha1
kind: ServiceTelemetry
metadata:
name: stf-default
namespace: service-telemetry
spec:
metricsEnabled: true
eventsEnabled: true
storageEphemeralEnabled: true"
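
The three config files above only set `KIND_SERVICETELEMETRY`. How `quickstart.sh` consumes that variable is not shown in this diff, so the following is a guess at the general shape: `emit_manifest` is a hypothetical helper, and the inline assignment stands in for sourcing `deploy/configs/default.bash`.

```shell
# Project name used for the namespace; service-telemetry is assumed as the default.
OCP_PROJECT="${OCP_PROJECT:-service-telemetry}"

# Stand-in for: source deploy/configs/default.bash
KIND_SERVICETELEMETRY="apiVersion: infra.watch/v1alpha1
kind: ServiceTelemetry
metadata:
  name: stf-default
  namespace: ${OCP_PROJECT}
spec:
  metricsEnabled: true
  eventsEnabled: true"

# emit_manifest
# Prints the manifest, or nothing when the config (such as nostf.bash)
# leaves KIND_SERVICETELEMETRY empty.
emit_manifest() {
  if [ -z "${KIND_SERVICETELEMETRY}" ]; then
    echo "no ServiceTelemetry object requested; skipping" >&2
    return 0
  fi
  printf '%s\n' "${KIND_SERVICETELEMETRY}"
}

# On a real cluster the output could be piped to: oc apply -f -
emit_manifest
```

Keeping the manifest in a variable lets `${OCP_PROJECT}` expand when the config is sourced, which is why `quicklab.bash` has to hardcode its namespace instead.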
@@ -0,0 +1,92 @@
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
name: servicetelemetrys.infra.watch
spec:
group: infra.watch
names:
kind: ServiceTelemetry
listKind: ServiceTelemetryList
plural: servicetelemetrys
singular: servicetelemetry
scope: Namespaced
version: v1alpha1
subresources:
status: {}
versions:
- name: v1alpha1
served: true
storage: true
validation:
openAPIV3Schema:
properties:
apiVersion:
description: 'APIVersion defines the versioned schema of this representation
of an object. Servers should convert recognized schemas to the latest
internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#resources'
type: string
kind:
description: 'Kind is a string value representing the REST resource this
object represents. Servers may infer this from the endpoint the client
submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#types-kinds'
type: string
metadata:
description: Metadata definition for the ServiceTelemetry object
type: object
spec:
description: Specification of the desired behavior of the Service Telemetry Operator.
properties:
metricsEnabled:
description: Whether the Service Telemetry Operator should enable components related to metrics collection and storage.
type: boolean
eventsEnabled:
description: Whether the Service Telemetry Operator should enable components related to events collection and storage.
type: boolean
highAvailabilityEnabled:
description: Whether to deploy the services in HA mode.
type: boolean
storageEphemeralEnabled:
description: Request ephemeral storage (non-persistent, development use only) in the storage backends such as Prometheus and ElasticSearch.
type: boolean
prometheusStorageClass:
description: Storage class name used for Prometheus PVC
type: string
prometheusStorageResources:
description: Storage resource definition for Prometheus
type: string
prometheusStorageSelector:
description: Storage selector definition for Prometheus
type: string
prometheusPvcStorageRequest:
description: PVC storage requested size for Prometheus
type: string
alertmanagerStorageClass:
description: Storage class name used for Alertmanager PVC
type: string
alertmanagerStorageResources:
description: Storage resource definition for Alertmanager
type: string
alertmanagerStorageSelector:
description: Storage selector definition for Alertmanager
type: string
alertmanagerPvcStorageRequest:
description: PVC storage requested size for Alertmanager
type: string
status:
description: Status results of an instance of Service Telemetry
properties:
conditions:
description: The resulting conditions when a Service Telemetry is instantiated
items:
properties:
status:
type: string
type:
type: string
reason:
type: string
lastTransitionTime:
type: string
type: object
type: array
type: object