IoT Edge Logging and Monitoring Solution (pronounced Lm's) provides an implementation of an observability solution for a common IoT scenario. It demonstrates best practices and techniques regarding both observability pillars:
- Measuring and Monitoring
- Troubleshooting
In order to go beyond just abstract considerations, the sample implementation is based on a "real-life" use case:
The La Niña service measures surface temperature in the Pacific Ocean to predict La Niña winters. A number of buoys in the ocean carry IoT Edge devices that send the surface temperature to Azure Cloud. The telemetry data with the temperature is pre-processed by a custom module on the IoT Edge device before it is sent to the cloud. In the cloud, the data is processed by a chain of two backend Azure Functions and saved to Blob Storage. The clients of the service (sending data to ML inference workflow, decision making systems, various UIs, etc.) can pick up messages with temperature data from the Blob Storage.
The topology of the sample is represented on the following diagram:
There is an IoT Edge device with a `Temperature Sensor` custom module (C#) that generates a temperature value and sends it upstream with a telemetry message. This message is routed to another custom module, `Filter` (C#). This module checks the received temperature against a threshold window (0-100 degrees) and, if the temperature is within the window, the FilterModule sends the telemetry message to the cloud.
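The filtering rule described above can be sketched in a few lines. This is a minimal illustration in Python rather than the sample's C# module; the message shape (a dictionary with a `temperature` key), the function name `filter_message` and the inclusive bounds are assumptions made for the example.

```python
from typing import Optional

# Threshold window from the text: 0-100 degrees (bounds assumed inclusive).
MIN_TEMPERATURE = 0.0
MAX_TEMPERATURE = 100.0


def filter_message(message: dict) -> Optional[dict]:
    """Return the message if its temperature is inside the window, else None."""
    temperature = message.get("temperature")
    if temperature is None:
        return None  # nothing to check, so nothing is forwarded
    if MIN_TEMPERATURE <= temperature <= MAX_TEMPERATURE:
        return message  # would be sent upstream to the cloud
    return None  # outside the window: dropped on the device


# Hypothetical readings: only the first one is inside the window.
readings = [{"temperature": 21.5}, {"temperature": 130.0}, {"temperature": -4.2}]
forwarded = [m for m in readings if filter_message(m) is not None]
print(forwarded)  # [{'temperature': 21.5}]
```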
In the cloud, the message is processed by the backend. The backend consists of a chain of two Azure Functions and a Storage Account. The Azure .NET Function picks up the telemetry message from the IoT Hub events endpoint, processes it and sends it to the Azure Java Function. The Java function saves the message to the storage account container.
The clients of the La Niña service have some expectations from it. These expectations may be defined as Service Level Objectives reflecting the following factors:
- Coverage. The data is coming from the majority of installed buoys.
- Freshness. The data coming from the buoys is fresh and relevant.
- Throughput. The temperature data is delivered from the buoys without significant delays.
- Correctness. The ratio of lost messages (errors) is small.
These factors can be measured with the following Service Level Indicators:
Indicator | Factors |
---|---|
Ratio of on-line devices to the total number of devices | Coverage |
Ratio of devices reporting frequently to the number of reporting devices | Freshness, Throughput |
Ratio of devices successfully delivering messages to the total number of devices | Correctness |
Ratio of devices delivering messages fast to the total number of devices | Throughput |
To determine whether the service satisfies the clients' expectations, the Service Level Indicator (SLI) measurements are compared to the values defined in the formal Service Level Objectives (SLOs):
Statement | Factor |
---|---|
90% of devices reported metrics no longer than 10 mins ago (were online) for the observation interval | Coverage |
95% of online devices send temperature 10 times per minute for the observation interval | Freshness, Throughput |
99% of online devices deliver messages successfully with less than 5% of errors for the observation interval | Correctness |
95% of online devices deliver 90th percentile of messages within 50ms for the observation interval | Throughput |
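As an illustration of how such a statement can be evaluated, here is a minimal sketch of the Throughput SLO check (95% of online devices deliver the 90th percentile of messages within 50ms). The nearest-rank percentile method, the function names and the latency values are assumptions made for the example; in the sample itself this kind of evaluation is done over the collected metrics in Azure Monitor.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 1..100)."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def throughput_slo_met(latencies_by_device, pct=90, limit_ms=50.0, target=0.95):
    """latencies_by_device maps a device id to its message latencies in ms."""
    # Devices without any latency samples are treated as offline here.
    online = {d: lat for d, lat in latencies_by_device.items() if lat}
    if not online:
        return False
    fast = sum(1 for lat in online.values() if percentile(lat, pct) <= limit_ms)
    return fast / len(online) >= target


# Hypothetical latencies for two buoys.
sample = {
    "buoy-1": [12, 18, 25, 31, 40, 22, 19, 27, 33, 45],
    "buoy-2": [15, 48, 60, 75, 20, 90, 30, 25, 41, 38],
}
print(percentile(sample["buoy-1"], 90))  # 40
print(percentile(sample["buoy-2"], 90))  # 75
print(throughput_slo_met(sample))        # False: only 1 of 2 devices is fast
```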
Measuring templates applicable to all SLIs:
- Observation interval: 24h
- Aggregation interval: 10 mins
- Measurements frequency: 5 min
- What is measured: interaction between IoT Device and IoT Hub, further consumption of the temperature data is out of scope.
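To make these parameters concrete, the sketch below computes the Coverage indicator for a single measurement and lists the measurement times inside one observation interval. It assumes that each measurement is evaluated on its own against the 90% target; the device names, timestamps and function names are hypothetical.

```python
from datetime import datetime, timedelta

OBSERVATION_INTERVAL = timedelta(hours=24)
AGGREGATION_INTERVAL = timedelta(minutes=10)
MEASUREMENT_FREQUENCY = timedelta(minutes=5)
COVERAGE_TARGET = 0.90  # from the Coverage SLO


def coverage_sli(last_seen, now):
    """Ratio of devices that reported within the aggregation interval ending at now."""
    if not last_seen:
        return 0.0
    online = sum(
        1 for seen in last_seen.values()
        if timedelta(0) <= now - seen <= AGGREGATION_INTERVAL
    )
    return online / len(last_seen)


def measurement_times(end):
    """Measurement timestamps covering one observation interval, newest first."""
    count = OBSERVATION_INTERVAL // MEASUREMENT_FREQUENCY  # 24h / 5 min = 288
    return [end - i * MEASUREMENT_FREQUENCY for i in range(count)]


now = datetime(2024, 1, 1, 12, 0)
last_seen = {
    "buoy-1": now - timedelta(minutes=3),
    "buoy-2": now - timedelta(minutes=8),
    "buoy-3": now - timedelta(minutes=30),  # silent for too long: offline
    "buoy-4": now - timedelta(minutes=1),
}
sli = coverage_sli(last_seen, now)
print(sli, sli >= COVERAGE_TARGET)  # 0.75 False
print(len(measurement_times(now)))  # 288
```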
Service Level Indicators are measured by means of metrics. An IoT Edge device comes with the system modules `edgeHub` and `edgeAgent`. These modules expose a list of built-in metrics through a Prometheus endpoint; the metrics are collected and pushed to the Azure Monitor Log Analytics service by the Metrics Collector module running on the IoT Edge device.
Note: Alternatively, metrics can be delivered to the cloud through the IoT Hub message channel and then submitted to Log Analytics by a cloud workflow. See Cloud Workflow Sample for the details on this architecture pattern.
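Metrics exposed through a Prometheus endpoint use a line-based text format, which a collector scrapes and parses. The sketch below is a simplified parser for that format, not the Metrics Collector's actual implementation. The sample text is made up for illustration, and the metric and label names in it should be checked against the built-in metrics documentation.

```python
import re

# Matches "name{labels} value" or "name value"; an optional timestamp is ignored.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from Prometheus text output."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative scrape result (assumed content, not captured from a real device).
SAMPLE_TEXT = """
# TYPE edgehub_queue_length gauge
edgehub_queue_length{edge_device="buoy-1",endpoint="iothub"} 3
edgehub_messages_sent_total{edge_device="buoy-1",to="upstream"} 120
"""

for name, labels, value in parse_metrics(SAMPLE_TEXT):
    print(name, labels["edge_device"], value)
```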
SLOs and corresponding SLIs are monitored with Azure Monitor Workbooks. To achieve the best user experience the workbooks system follows the glance -> scan -> commit concept:
- Glance. SLIs at the fleet level
- Scan. Details on how devices contribute to SLIs. Easy to identify "problematic" devices.
- Commit. Details on a specific device
Another monitoring instrument, used alongside the workbooks, is Alerts. In addition to the SLIs defined in the SLOs, Alerts monitor secondary metrics (KPIs) to predict and prevent violations of the defined SLOs:
Metric | Factor |
---|---|
Device last seen time | Coverage |
Device upstream message rate (messages per min) | Freshness, Throughput |
Device messages queue length | Throughput |
Device messages latency | Throughput |
Device CPU, Memory, disk usage | Coverage, Freshness, Throughput |
Device messages error ratio | Correctness |
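The idea of predicting a violation can be sketched as an early-warning check that fires before a metric reaches its SLO limit. The limits below are taken from the SLO statements; the 80% warning factor, the metric key names and the function name are hypothetical and are not the thresholds used by the sample's alert rules.

```python
# Limits taken from the SLO statements earlier in this document.
SLO_LIMITS = {
    "minutes_since_last_seen": 10.0,
    "error_ratio": 0.05,
    "latency_p90_ms": 50.0,
}
EARLY_WARNING_FACTOR = 0.8  # hypothetical: warn at 80% of the SLO limit


def early_warnings(device_metrics):
    """Names of the metrics that are close to, or beyond, their SLO limit."""
    warnings = []
    for name, limit in SLO_LIMITS.items():
        value = device_metrics.get(name)
        if value is not None and value >= limit * EARLY_WARNING_FACTOR:
            warnings.append(name)
    return warnings


# Hypothetical device: still within every SLO, but the error ratio is close to 5%.
device = {"minutes_since_last_seen": 2.0, "error_ratio": 0.045, "latency_p90_ms": 35.0}
print(early_warnings(device))  # ['error_ratio']
```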
While Measuring and Monitoring allow you to observe and predict the system behavior, compare it to the defined expectations and ultimately detect existing or potential issues, Troubleshooting lets you identify and locate the cause of an issue.
There are two observability instruments serving troubleshooting purposes: Traces and Logs. In this sample, Traces show how a telemetry message with the ocean surface temperature travels from the sensor to the storage in the cloud, what is invoking what and with what parameters. Logs give information on what is happening inside each system component during this process. The real power of Traces and Logs comes when they are correlated. With that, it's possible to read the logs of a specific system component, such as a module on the IoT device or a backend function, while it was processing a specific telemetry message.
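The correlation itself boils down to a shared identifier: every log record carries the id of the trace it was written under. A minimal sketch, assuming log records are dictionaries with hypothetical `trace_id`, `component` and `message` fields (in the sample this lookup is done with queries in Application Insights):

```python
from collections import defaultdict


def logs_by_trace(log_records):
    """Group log records by the trace they belong to."""
    grouped = defaultdict(list)
    for record in log_records:
        grouped[record["trace_id"]].append(record)
    return grouped


def component_logs(log_records, trace_id, component):
    """Messages logged by one component while it handled one telemetry message."""
    return [
        r["message"]
        for r in logs_by_trace(log_records)[trace_id]
        if r["component"] == component
    ]


# Hypothetical records from two traces (two telemetry messages).
records = [
    {"trace_id": "aaa1", "component": "Filter", "message": "received 21.5"},
    {"trace_id": "aaa1", "component": "Filter", "message": "within window, forwarding"},
    {"trace_id": "aaa1", "component": "JavaFunction", "message": "blob saved"},
    {"trace_id": "bbb2", "component": "Filter", "message": "received 130.0"},
]
print(component_logs(records, "aaa1", "Filter"))
# ['received 21.5', 'within window, forwarding']
```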
In IoT scenarios it is very common to have only one-way connectivity, from the devices to the cloud. Due to unstable and complicated networking setups, there is no way to connect from the cloud to the devices at scale. This sample is built with this limitation in mind, so the observability data (like any other data) is pushed to the cloud rather than pulled. Please refer to the Overview of Distributed Tracing with IoT Edge for detailed considerations and different architecture patterns for distributed tracing.
The C# components of the sample, such as the device modules and the backend Azure .NET Function, use OpenTelemetry for .NET to produce tracing data.
The IoT Edge modules `Temperature Sensor` and `Filter` export the logs and tracing data via OTLP to the `OpenTelemetryCollector` module running on the same edge device. The `OpenTelemetryCollector` module, in turn, exports logs and traces to the Azure Monitor Application Insights service. This scenario assumes there is a stable connection from the device to the cloud service. Refer to OpenTelemetry for offline devices for the offline scenario recommendations.
Alternatively, for use cases where there is a connection from the cloud to the IoT devices, logs (and logs only) from the devices can be delivered on request using direct method invocation. See Cloud Workflow Sample for the details on this architecture pattern.
The Azure .NET backend Function sends the tracing data to Application Insights with the Azure Monitor OpenTelemetry direct exporter. It also sends correlated logs directly to Application Insights with a configured ILogger instance.
The Java backend function uses the OpenTelemetry auto-instrumentation Java agent to produce and export tracing data and correlated logs to the Application Insights instance.
The IoT Edge module `Temperature Sensor` starts the whole process and therefore starts an OpenTelemetry trace. It puts a W3C `traceparent` value into a property of the outgoing message. The `Filter` module receives the message on the device, extracts the `traceparent` property and uses it to continue the trace with a new span. The module puts a new `traceparent` value (with the new parent_id) into the outgoing message. The .NET Azure Function retrieves the message from the IoT Hub endpoint, extracts the `traceparent` property, continues the same trace with a new span and sends the new `traceparent` value in the header of the HTTP request to the Azure Java Function. The Azure Java Function is auto-instrumented with OpenTelemetry, so the framework "understands" the `traceparent` header, starts a new span in the same trace and creates further spans while communicating with Azure Blob Storage and the Managed Identity service.
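The propagation described above relies on the W3C Trace Context `traceparent` format, `version-trace_id-parent_id-flags`, where the trace id stays the same across all hops and the parent id is replaced at each one. The sketch below shows that mechanic with the standard library only. The function names are hypothetical, and the sample's components rely on OpenTelemetry to do this rather than on hand-written code.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent():
    """Start a trace: fresh trace-id and parent-id, sampled flag set."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def continue_trace(traceparent):
    """Keep the trace-id and flags, replace parent-id with a new span id."""
    match = TRACEPARENT_RE.match(traceparent)
    if match is None:
        raise ValueError(f"invalid traceparent: {traceparent!r}")
    version, trace_id, _parent_id, flags = match.groups()
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Sensor module -> Filter module -> backend function.
from_sensor = new_traceparent()
from_filter = continue_trace(from_sensor)
from_function = continue_trace(from_filter)

for hop in (from_sensor, from_filter, from_function):
    print(hop)  # same trace-id on every line, different parent-id
```

Because only the parent id changes, every span created along the way can be stitched back into one end-to-end trace by its trace id.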
As a result, the entire end-to-end process from the sensor to the storage can be monitored with Application Map in Application Insights:
Blobs in Azure Storage with the IoT messages are tagged with the `trace_id` (`Operation Id` in Application Insights) value. We can find and investigate in detail the end-to-end transaction for every message.
We can drill down and explore correlated logs for a specific trace or a specific span. In Application Insights terminology, `Operation Id` corresponds to `TraceId` and `Id` corresponds to `SpanId`:
Besides Application Insights, the `OpenTelemetryCollector` module can be configured to export the tracing data to alternative observability backends working on the factory floor (for example, Jaeger or Zipkin). This enables scenarios where the device goes offline but we still want to analyze what is happening on the device, or where we want to analyze the device without a round trip to the cloud.
Note: A Jaeger/Zipkin installation is not included in this sample. If you have a Jaeger installation that you want to use with this sample, provide a value for the `JAEGER_ENDPOINT` environment variable (e.g. http://myjaeger:14268/api/traces) in the device deployment template.
Implementing DevOps practices is a common way to handle the growing complexity of an observability solution and the related operational costs. This sample comes with the following Azure Pipelines:
The Infrastructure-as-code pipeline provisions all necessary Azure resources for this sample. It references the `iot-monitoring` variable group, which should be created manually in your Azure DevOps project with the following variables:
Variable | Description | Example |
---|---|---|
AZURE_RESOURCE_MANAGER_CONNECTION_NAME | Name of ARM service connection in the Azure DevOps project | iotsample-arm-con |
AZURE_LOCATION | Azure Location | West US 2 |
RG_NAME | Azure Resource Group Name | iot-e2e-rg |
IOT_ENV_SUFFIX | Unique suffix that will be added to all provisioned resources | iote2esampl |
The Observability-as-code pipeline deploys a sample Workbook and a set of Alerts and assigns them to IoT Hub.
It requires the following variables to be added to the `iot-monitoring` variable group (in addition to the variables defined for IaC):
Variable | Description |
---|---|
AZURE_SUBSCRIPTION_ID | Azure subscription Id where IoT Hub is provisioned |
CI/CD pipeline performs the following:
- Builds IoT Edge Modules Docker Images
- Runs a local smoke test to check the IoT Edge Modules containers work without errors
- Pushes the images to ACR (provisioned by the IaC pipeline)
- Builds the backend Azure Functions and archives them to zip files
- Publishes artifacts consisting of IoT Edge devices deployment profiles and backend functions archives
- Creates a new IoT Edge device deployment in IoT Hub
- Runs a smoke test to check the deployment is applied and the devices are up and running
- Deploys backend Azure Functions
It requires the following variables to be added to the `iot-monitoring` variable group (in addition to the variables defined for IaC):
Variable | Description |
---|---|
LOG_ANALYTICS_SHARED_KEY | Log Analytics Shared Key, used by devices to export metrics |
LOG_ANALYTICS_WSID | Log Analytics Workspace Id, used by devices to export metrics |
APPINSIGHTS_INSTRUMENTATION_KEY | Application Insights Instrumentation Key, used by devices to export logs and traces |
While Azure Pipelines is a must-have for a production environment, the sample comes with an alternative and convenient option for a quick `deploy-n-play`. You can deploy everything with the PowerShell script `./Scripts/deploy.ps1`. The script provisions all necessary Azure resources, deploys a sample workbook and alerts, and deploys the IoT Edge modules and backend Azure Functions.
Note: The script prompts you to select one of the available deployment options: `End-to-End Sample` or `Cloud Workflow Sample`. The `End-to-End Sample` option deploys the sample described above in this document, and the `Cloud Workflow Sample` option deploys a sample of a cloud workflow that processes logs uploaded by the device to a blob storage container, as well as metrics arriving as device-to-cloud messages in IoT Hub. Refer to the Cloud Workflow Sample for the details.
In order to successfully deploy this sample with a script, you will need the following:
- PowerShell.
- Azure CLI version 2.21 or later.
- An Azure account with an active subscription. Create one for free.
Verify your prerequisites to ensure you have the right version of Azure CLI. Open a PowerShell terminal and follow the instructions below:
- Run `az --version` to check that the Azure CLI version is 2.21 or later.
- Run `az login` to sign in to Azure and verify an active subscription.
Clone the repository:
git clone https://github.com/Azure-Samples/iotedge-logging-and-monitoring-solution.git
cd iotedge-logging-and-monitoring-solution\
.\Scripts\deploy.ps1