Development Guide
camrossi committed Jul 18, 2024
1 parent e2abd06 commit 16cbc5f
Showing 2 changed files with 319 additions and 0 deletions.
3 changes: 3 additions & 0 deletions README.md

# Stack Development
If you want to contribute to this project, start from [here](docs/development.md).

# Stack Deployment

## Pre Requisites
316 changes: 316 additions & 0 deletions docs/development.md
# Development Guide

## Creating/Editing Dashboards

Assuming the required data is available (see the [Data Collection](#data-collection) section), you can develop or modify Grafana dashboards directly in the UI.
Once you are happy with the result, copy the JSON model into a new file (or update an existing one) and place it in the [dashboards](../charts/aci-monitoring-stack/dashboards) folder.
The Helm chart creates a new [ConfigMap](../charts/aci-monitoring-stack/templates/grafana-configmap-dashboards.yaml) for each file in the dashboards folder and mounts it into the Grafana container.
The ConfigMap name is the file name without the extension, followed by `-dashboard`.

For example, the `fabric-wide-capacity.json` file creates a ConfigMap called `fabric-wide-capacity-dashboard`.
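
As a quick illustration of the naming rule, here is a minimal Python sketch. The helper `configmap_name_for` is hypothetical and not part of the chart; it only mirrors the single example given above, so check the ConfigMap template if you need the exact behaviour.

```python
from pathlib import PurePosixPath


def configmap_name_for(dashboard_file: str) -> str:
    """Hypothetical helper: reproduce the naming seen in the example above."""
    # Drop any directory part and the file extension (".json").
    stem = PurePosixPath(dashboard_file).stem
    # Assumption inferred from the example: a "-dashboard" suffix is appended.
    return f"{stem}-dashboard"


print(configmap_name_for("dashboards/fabric-wide-capacity.json"))
```

This can be handy to predict which ConfigMap to look for after adding a new dashboard file.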


## Data Collection

Data is currently collected in two ways:

- Syslog Ingestion: The ACI-side config "decides" what to send and, assuming the correct logging level is selected, you can then build the dashboards in Grafana using Loki as a data source. You can take a look at the `Contract Drops Logs` dashboard for inspiration.

- [ACI Exporter](https://github.com/opsdis/aci-exporter) Queries: Which queries are run and how the data is collected is highly customizable.

### ACI Exporter and Prometheus

The general idea is to use aci-exporter to convert ACI REST API calls into the Prometheus exposition format.

The exporter also has the capability to scrape individual switches directly, using the aci-exporter built-in HTTP-based service discovery. Querying spines and leaves directly is typically useful in large fabrics, where sending all API calls through the APIC can put a high load on the APIC and result in high response times.

**Note:** In the context of this Helm chart a query **MUST** be executed against a switch if possible. Any code submission that does not adhere to this convention will not be accepted.

#### ACI Exporter Quick Start

Before working on aci-exporter, Prometheus and Grafana at the same time, I strongly suggest taking a look at the [ACI Exporter](https://github.com/opsdis/aci-exporter) git repo to understand how it works and how it is configured.

Here is a complete example to get you started (you need to [install go](https://go.dev/doc/install)):

- Clone the repo and compile the exporter
```bash
git clone https://github.com/opsdis/aci-exporter.git
# Build from inside the cloned repository
cd aci-exporter
go build -o build/aci-exporter *.go
```
- Create a basic config file for your ACI fabric

```yaml
fabrics:
  fab1:
    username: foo
    password: bar
    apic:
      - https://apic1
      - https://apic2
- ACI Exporter will, by default, load the queries it can execute from the `config.d` directory. For now we don't want that so we can start the exporter with this command that will just load the bare minimum config to access the fabric.

```bash
./build/aci-exporter -config fab1.yaml -config_dir /dev/null
{"level":"info","msg":"Configuration directory do not exist - stat /home/cisco/aci-exporter/dev/null: no such file or directory","time":"2024-07-18T14:17:59+10:00"}
{"fabric":"fab1","level":"info","msg":"Configured fabric","time":"2024-07-18T14:17:59+10:00"}
{"config_file":"/home/cisco/aci-exporter/fab1.yaml","level":"info","msg":"aci-exporter starting","port":9643,"read_timeout":0,"time":"2024-07-18T14:17:59+10:00","version":"undefined","write_timeout":0}
```

- ACI Exporter is now running on our host on port 9643. Let's try a service discovery: just run an HTTP request against the `/sd` URL.

``` bash
curl http://aci-exporter-ip:9643/sd
[
  {
    "targets": [
      "fab1#10.67.185.106"
    ],
    "labels": {
      "__meta_aci_exporter_fabric": "fab1",
      "__meta_address": "10.0.0.1",
      "__meta_dn": "topology/pod-1/node-1/sys",
      "__meta_fabricDomain": "ACI Fabric1",
      "__meta_fabricId": "1",
      "__meta_id": "1",
      "__meta_inbMgmtAddr": "192.168.68.2",
      "__meta_name": "fab1-apic1",
      "__meta_nameAlias": "",
      "__meta_nodeType": "unspecified",
      "__meta_oobMgmtAddr": "10.67.185.106",
      "__meta_podId": "1",
      "__meta_role": "controller",
      "__meta_serial": "WZP233907GX",
      "__meta_siteId": "0",
      "__meta_state": "in-service",
      "__meta_version": "6.1(1a)"
    }
  },
  === SNIP ===
]
```

This should return a list with all the Controllers and Switches in your fabric and is what Prometheus uses for its own service discovery.
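
To get a feel for how this response can be consumed, here is a minimal Python sketch that separates controllers from switches using the `__meta_role` label. The first entry is trimmed from the output above; the second entry (`fab1-leaf101`, role `leaf`) is a hypothetical switch added only for illustration, and `split_by_role` is a made-up helper name.

```python
import json

# Trimmed service discovery response. The second entry is hypothetical.
SD_RESPONSE = json.loads("""
[
  {
    "targets": ["fab1#10.67.185.106"],
    "labels": {
      "__meta_aci_exporter_fabric": "fab1",
      "__meta_name": "fab1-apic1",
      "__meta_oobMgmtAddr": "10.67.185.106",
      "__meta_role": "controller"
    }
  },
  {
    "targets": ["fab1#10.67.185.120"],
    "labels": {
      "__meta_aci_exporter_fabric": "fab1",
      "__meta_name": "fab1-leaf101",
      "__meta_oobMgmtAddr": "10.67.185.120",
      "__meta_role": "leaf"
    }
  }
]
""")


def split_by_role(entries):
    """Return (controller names, switch names) based on __meta_role."""
    controllers, switches = [], []
    for entry in entries:
        labels = entry["labels"]
        # Anything that is not a controller is treated as a switch here.
        bucket = controllers if labels["__meta_role"] == "controller" else switches
        bucket.append(labels["__meta_name"])
    return controllers, switches


controllers, switches = split_by_role(SD_RESPONSE)
print("controllers:", controllers)
print("switches:", switches)
```
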
Now let's try to build a query to check the `interface operation state and speed`.

- The ACI Class we can use for this query is `ethpmPhysIf`
- This class is available both on the APIC and on the switches: we will run this query **against the switches** because this is the core principle of this Helm chart and it scales better.
- *Tip:* If you use Visual Studio Code you can install the `Thunder Client` extension to test API calls.

Every switch will return one `ethpmPhysIf` object for every interface. An example is provided below:
```json
{
"ethpmPhysIf": {
"attributes": {
"accessVlan": "unknown",
"allowedVlans": "",
"backplaneMac": "00:DE:FB:21:37:E2",
"bundleBupId": "2",
"bundleIndex": "unspecified",
"cfgAccessVlan": "unknown",
"cfgNativeVlan": "unknown",
"childAction": "",
"currErrIndex": "4294967295",
"diags": "none",
"dn": "sys/phys-[eth1/34]/phys",
"encap": "3",
"errDisTimerRunning": "no",
"errVlanStatusHt": "0",
"errVlans": "",
"hwBdId": "0",
"hwResourceId": "0",
"intfT": "phy",
"iod": "38",
"lastErrors": "0",
"lastLinkStChg": "2024-07-10T00:50:13.841+00:00",
"media": "2",
"modTs": "never",
"monPolDn": "uni/infra/moninfra-default",
"nativeVlan": "unknown",
"numOfSI": "0",
"operBitset": "3-4",
"operDceMode": "edge",
"operDuplex": "full",
"operEEERxWkTime": "0",
"operEEEState": "not-applicable",
"operEEETxWkTime": "0",
"operErrDisQual": "none",
"operFecMode": "disable-fec",
"operFlowCtrl": "0",
"operMdix": "auto",
"operMode": "trunk",
"operModeDetail": "unknown",
"operPhyEnSt": "unknown",
"operRouterMac": "00:00:00:00:00:00",
"operSpeed": "10G",
"operSt": "up",
"operStQual": "none",
"operStQualCode": "0",
"operVlans": "",
"osSum": "failed",
"portCfgWaitFlags": "0",
"primaryVlan": "vlan-1",
"resetCtr": "1",
"siList": "",
"status": "",
"txT": "unknown",
"usage": "discovery",
"userCfgdFlags": "1",
"vdcId": "1"
}
}
}
```

Of all the various properties of `ethpmPhysIf` we need only three:

- `operSpeed`: The speed
- `operSt`: Whether the port is up or down
- `dn`: We use the content of the `dn` to extract two labels:
  - `interface_type`: Physical, Port-Channel etc...
  - `interface`: The interface name, i.e. Eth1/1

With this information we can create two metrics that I am going to call:

- `interface_oper_speed`
- `interface_oper_state`

Both metrics will be labeled with the `interface_type` and `interface` (name). However, we are faced with an issue: Prometheus can only ingest numbers, so we can't just pass `40G` or `up` as a valid metric value.

Thankfully one of the many ACI Exporter capabilities is to perform `value_transform` so we can write something like this:

```yaml
# For interface_oper_speed
value_transform:
  'unknown': 0
  '100M': 100000000
  '1G': 1000000000
  '10G': 10000000000
  '25G': 25000000000
  '40G': 40000000000
  '100G': 100000000000
  '400G': 400000000000
# For interface_oper_state
value_transform:
  'unknown': 0
  'down': 1
  'up': 2
  'link-up': 3
```
This converts the text values to numbers so that Prometheus can ingest the data.
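
Conceptually, a `value_transform` is just a lookup table from a string to a number. The Python sketch below illustrates the idea only; it is not how aci-exporter implements it. The `transform` helper is hypothetical, and raising an error for an unmapped value is an assumption made for this example.

```python
# Lookup tables copied from the value_transform examples in this section.
SPEED_TRANSFORM = {
    "unknown": 0,
    "100M": 100_000_000,
    "1G": 1_000_000_000,
    "10G": 10_000_000_000,
    "25G": 25_000_000_000,
    "40G": 40_000_000_000,
    "100G": 100_000_000_000,
    "400G": 400_000_000_000,
}
STATE_TRANSFORM = {"unknown": 0, "down": 1, "up": 2, "link-up": 3}


def transform(value: str, table: dict) -> float:
    """Hypothetical helper: map an ACI attribute string to a float."""
    # Assumption: an unmapped value is an error in this sketch (KeyError).
    return float(table[value])


print(transform("10G", SPEED_TRANSFORM))
print(transform("up", STATE_TRANSFORM))
```

If a fabric reports a speed that is missing from the table (for example a new port speed), the table needs a new entry before that value can be ingested.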

Lastly, we also need to extract the `labels` from the `dn`. The format for this specific class is always something similar to `"sys/phys-[eth1/34]/phys"`. To do this, ACI Exporter employs RegEx. Below is an example:

```yaml
labels:
  # The field in the json used to parse the labels from
  - property_name: ethpmPhysIf.attributes.dn
    regex: "^sys/(?P<interface_type>[a-z]+)-\\[(?P<interface>[^\\]]+)\\]/"
```
This named RegEx creates two new labels, `interface_type` and `interface`, and maps them to the interface type and the interface ID respectively. If you want to experiment with RegEx you can use https://regex101.com/ (just select the `Golang` flavor). Also, since the aci-exporter config is in YAML, some characters need to be double escaped (`\\`).
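
The same named-group pattern can be tried out locally. The sketch below uses Python's `re` module, which accepts the `(?P<name>...)` syntax used here; aci-exporter itself is written in Go, so treat this only as a quick way to check the pattern, not as the exporter's code. Note that the pattern is written with single backslashes because it is not embedded in a YAML double-quoted string.

```python
import re

# Same pattern as in the YAML config, without the YAML double escaping.
DN_PATTERN = re.compile(r"^sys/(?P<interface_type>[a-z]+)-\[(?P<interface>[^\]]+)\]/")

dn = "sys/phys-[eth1/34]/phys"
match = DN_PATTERN.match(dn)

# groupdict() returns the named groups, which become the metric labels.
labels = match.groupdict() if match else {}
print(labels)
```
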

Putting it all together, a `class_queries` entry will look like this:
```yaml
class_queries:
  # This is the name of the query. A query can generate multiple metrics.
  node_interface_info:
    # Interface speed and status
    class_name: ethpmPhysIf
    metrics:
      # The name of the metrics without prefix and unit
      - name: interface_oper_speed
        value_name: ethpmPhysIf.attributes.operSpeed
        unit: bps
        type: gauge
        help: The current operational speed of the interface, in bits per second.
        value_transform:
          'unknown': 0
          '100M': 100000000
          '1G': 1000000000
          '10G': 10000000000
          '25G': 25000000000
          '40G': 40000000000
          '100G': 100000000000
          '400G': 400000000000
      - name: interface_oper_state
        # The field in the json that is used as the metric value, qualified path (gjson) under imdata
        value_name: ethpmPhysIf.attributes.operSt
        # Type
        type: gauge
        # Help text without prefix of metrics name
        help: The current operational state of the interface. (0=unknown, 1=down, 2=up, 3=link-up)
        # A string to float64 transform table of the value
        value_transform:
          'unknown': 0
          'down': 1
          'up': 2
          'link-up': 3
    # The labels to extract as regex
    labels:
      # The field in the json used to parse the labels from
      - property_name: ethpmPhysIf.attributes.dn
        regex: "^sys/(?P<interface_type>[a-z]+)-\\[(?P<interface>[^\\]]+)\\]/"
```

Now copy and paste this into the config file.

Based on the service discovery we executed before, we have all the information required to run a query against a switch. The aci-exporter URL has the following format:

`/probe?target=<__meta_aci_exporter_fabric>&node=<__meta_inbMgmtAddr|__meta_oobMgmtAddr>&queries=<query1,query2,etc>`

Here is an example:

```bash
curl "http://aci-exporter-ip:9643/probe?target=fab1&node=192.168.68.8&queries=node_interface_info"
# HELP aci_interface_oper_speed_bps The current operational speed of the interface, in bits per second.
# TYPE aci_interface_oper_speed_bps gauge
aci_interface_oper_speed_bps{fabric="fab1",interface="eth1/1",interface_type="phys"} 1e+11
aci_interface_oper_speed_bps{fabric="fab1",interface="eth1/2",interface_type="phys"} 1e+11
aci_interface_oper_speed_bps{fabric="fab1",interface="eth1/3",interface_type="phys"} 4e+11
# HELP aci_interface_oper_state The current operational state of the interface. (0=unknown, 1=down, 2=up, 3=link-up)
# TYPE aci_interface_oper_state gauge
aci_interface_oper_state{fabric="fab1",interface="eth1/1",interface_type="phys"} 2
aci_interface_oper_state{fabric="fab1",interface="eth1/2",interface_type="phys"} 2
aci_interface_oper_state{fabric="fab1",interface="eth1/3",interface_type="phys"} 1
# HELP aci_scrape_duration_seconds The duration, in seconds, of the last scrape of the fabric
# TYPE aci_scrape_duration_seconds gauge
aci_scrape_duration_seconds{fabric="fab1"} 0.038120213
# HELP aci_up The connection state 1=UP, 0=DOWN
# TYPE aci_up gauge
aci_up{fabric="fab1"} 1
```
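
If you test many switches, it can help to generate the probe URL instead of typing it. The following Python sketch builds the URL in the format described above; `probe_url` is a hypothetical helper, and `aci-exporter-ip` is the same placeholder host used in the `curl` examples.

```python
from urllib.parse import urlencode


def probe_url(exporter: str, fabric: str, node: str, queries: list) -> str:
    """Hypothetical helper: build an aci-exporter /probe URL for one node."""
    params = {
        "target": fabric,              # __meta_aci_exporter_fabric
        "node": node,                  # in-band or out-of-band management address
        "queries": ",".join(queries),  # comma separated, no spaces
    }
    # safe="," keeps the commas between query names unescaped.
    return f"{exporter}/probe?{urlencode(params, safe=',')}"


print(probe_url("http://aci-exporter-ip:9643", "fab1", "192.168.68.8",
                ["node_interface_info"]))
```

The printed URL can be passed directly to `curl` as in the example above.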

What happens with the ACI Monitoring Stack is that Prometheus executes queries against aci-exporter.

### Adding New Queries

ACI Monitoring Stack comes pre-configured with a lot of queries, all located in the [config.d](../charts/aci-monitoring-stack/config.d) folder. The majority of these queries are directed against the switches and are prefixed with the `node` keyword. However, it is not aci-exporter that decides where to send the queries: this is based on the URL used by Prometheus, check the [ScrapeConfigs](../charts/aci-monitoring-stack/templates/prometheus/configmap-config.yaml).

You will see there are two types of scrape configs:
- `*-aci-exporter-apics`: This will execute queries against the APICs.
- `*-aci-exporter-switches`: This will execute queries against the individual switches.

The selection between APICs and switches is done by using different re-labeling configs for Prometheus. Most likely you won't need to change the re-labeling config.

To add a new query follow these steps:

- Develop a new ACI-Exporter query and test it with `curl` to ensure it returns the expected data.
- Add the query to one of the files in the [config.d](../charts/aci-monitoring-stack/config.d) folder, or create a new file if your query doesn't belong to any of the existing categories.
- Add the query name to the `queries` list of the APIC or switches job inside the [ScrapeConfigs](../charts/aci-monitoring-stack/templates/prometheus/configmap-config.yaml).

Below is a scrape config example:
```yaml
- job_name: {{ $k }}-aci-exporter-apics
  scrape_interval: 5m
  scrape_timeout: 4m
  metrics_path: /probe
  params:
    # List of the queries to execute at the fabric level. They need to match the aci-exporter config
    # DO NOT INSERT SPACES and use \ for next line or aci-exporter will not be able to parse the queries
    queries:
      - "health,fabric_node_info,max_capacity,max_global_pctags,\
        vlans,static_binding_info,node_count,object_count,\
        ps_power_usage,apic_hw_sensors,controller_topsystem"
```
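
Because a stray space in the `queries` string is easy to miss, a small check before committing can help. The Python sketch below is only an illustration: `build_queries_param` is a hypothetical helper, not part of the chart or of aci-exporter, that joins query names and rejects names containing whitespace or commas.

```python
def build_queries_param(names: list) -> str:
    """Hypothetical helper: join query names into one comma separated string."""
    for name in names:
        # Reject empty names, embedded whitespace and embedded commas.
        if not name or "," in name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid query name: {name!r}")
    return ",".join(names)


print(build_queries_param(["health", "fabric_node_info", "max_capacity"]))
```

The resulting string is what ends up in the `queries` parameter of the scrape config.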
