Commit

Merge pull request #169 from microsoft/develop
FO 3.1.17
GitTorre authored Sep 13, 2021
2 parents 534e7a4 + aaf8bad commit cb71f15
Showing 62 changed files with 2,576 additions and 1,072 deletions.
5 changes: 1 addition & 4 deletions .gitignore
@@ -335,7 +335,4 @@ ASALocalRun/
**/PublishProfiles/Cloud.xml
/FabricObserver/observer_logs
/FabricObserver/PackageRoot/Data/Plugins/SampleNewObserver.dll
/nuget.exe
/FabricObserver/PackageRoot/Data/Plugins/ContainerObserver
/FabricObserver/PackageRoot/Data/Plugins/FabricObserverMdm
/FabricObserver/PackageRoot/Config/containerobserver.config.json
/nuget.exe
8 changes: 4 additions & 4 deletions Build-SFPkgs.ps1
@@ -23,11 +23,11 @@ function Build-SFPkg {
try {
Push-Location $scriptPath

Build-SFPkg "Microsoft.ServiceFabricApps.FabricObserver.Linux.SelfContained.3.1.16" "$scriptPath\bin\release\FabricObserver\linux-x64\self-contained\FabricObserverType"
Build-SFPkg "Microsoft.ServiceFabricApps.FabricObserver.Linux.FrameworkDependent.3.1.16" "$scriptPath\bin\release\FabricObserver\linux-x64\framework-dependent\FabricObserverType"
Build-SFPkg "Microsoft.ServiceFabricApps.FabricObserver.Linux.SelfContained.3.1.17" "$scriptPath\bin\release\FabricObserver\linux-x64\self-contained\FabricObserverType"
Build-SFPkg "Microsoft.ServiceFabricApps.FabricObserver.Linux.FrameworkDependent.3.1.17" "$scriptPath\bin\release\FabricObserver\linux-x64\framework-dependent\FabricObserverType"

Build-SFPkg "Microsoft.ServiceFabricApps.FabricObserver.Windows.SelfContained.3.1.16" "$scriptPath\bin\release\FabricObserver\win-x64\self-contained\FabricObserverType"
Build-SFPkg "Microsoft.ServiceFabricApps.FabricObserver.Windows.FrameworkDependent.3.1.16" "$scriptPath\bin\release\FabricObserver\win-x64\framework-dependent\FabricObserverType"
Build-SFPkg "Microsoft.ServiceFabricApps.FabricObserver.Windows.SelfContained.3.1.17" "$scriptPath\bin\release\FabricObserver\win-x64\self-contained\FabricObserverType"
Build-SFPkg "Microsoft.ServiceFabricApps.FabricObserver.Windows.FrameworkDependent.3.1.17" "$scriptPath\bin\release\FabricObserver\win-x64\framework-dependent\FabricObserverType"
}
finally {
Pop-Location
98 changes: 92 additions & 6 deletions Documentation/Observers.md
@@ -19,13 +19,14 @@ Service Fabric Error Health Events can block upgrades and other important Fabric

| Observer | Description |
| :--- | :--- |
| [AppObserver](#appobserver) | Monitors CPU usage, Memory use, and Disk space availability for Service Fabric Application services (processes) and their spawn (child processes). Alerts when user-supplied thresholds are reached. |
| [AppObserver](#appobserver) | Monitors CPU usage, Memory use, and Disk space availability for Service Fabric Application services (processes) and their spawn (child processes). Alerts when user-supplied thresholds are breached. |
| [AzureStorageUploadObserver](#azurestorageuploadobserver) | Runs periodically (do set its RunInterval setting) and will upload dmp files that AppObserver creates when you set dumpProcessOnError to true. It will clean up files after successful upload. |
| [CertificateObserver](#certificateobserver) | Monitors the expiration date of the cluster certificate and any other certificates provided by the user. Warns when close to expiration. |
| [DiskObserver](#diskobserver) | Monitors, storage disk information like capacity and IO rates. Alerts when user-supplied thresholds are reached. |
| [ContainerObserver](#containerobserver) | Monitors container CPU and Memory use. Alerts when user-supplied thresholds are breached. |
| [DiskObserver](#diskobserver) | Monitors storage disk information such as capacity and IO rates. Alerts when user-supplied thresholds are breached. |
| [FabricSystemObserver](#fabricsystemobserver) | Monitors CPU usage, Memory use, and Disk space availability for Service Fabric System services (compare to AppObserver) |
| [NetworkObserver](#networkobserver) | Monitors outbound connection state for user-supplied endpoints (hostname/port pairs), i.e. it checks that the node can reach specific endpoints. |
| [NodeObserver](#nodeobserver) | This observer monitors VM level resource usage across CPU, Memory, firewall rules, static and dynamic ports (aka ephemeral ports). |
| [NodeObserver](#nodeobserver) | This observer monitors VM level resource usage across CPU, Memory, firewall rules, static and dynamic ports (aka ephemeral ports), and File Handles (Linux). |
| [OSObserver](#osobserver) | Records basic OS properties across OS version, OS health status, physical/virtual memory use, number of running processes, number of active TCP ports (active/ephemeral), number of enabled firewall rules, list of recent patches/hotfixes. |
| [SFConfigurationObserver](#sfconfigurationobserver) | Records information about the currently installed Service Fabric runtime environment. |

@@ -93,9 +94,9 @@ Finally, if you do not launch child processes from your services please disable


### Input
JSON config file supplied by user, stored in PackageRoot/Observers.Data folder. This data contains JSON arrays
JSON config file supplied by user, stored in PackageRoot\Config folder. This configuration is composed of JSON array
objects which constitute Service Fabric Apps (identified by service URI's). Users supply Error/Warning thresholds for CPU use, Memory use and Disk
IO, ports. Memory values are supplied as number of megabytes... CPU and Disk Space values are provided as percentages (integers: so, 80 = 80%...)...
IO, ports. Memory values are supplied as a number of megabytes or as percentage use. CPU and Disk Space values are provided as percentages (integers: so, 80 = 80%).
**Please note that you can omit any of these properties. You can also supply 0 as the value, which means that threshold
will be ignored (they are not omitted below so you can see what a fully specified object looks like).
We recommend you omit all Error thresholds until you become more comfortable with the behavior of your services and the side effects they have on machine resources**.
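
The threshold rules described above (an omitted property or a value of 0 means the threshold is ignored) can be sketched as follows. This is an illustrative Python sketch, not FabricObserver code: the function name `classify_metric`, the `>=` comparison, and the returned strings are assumptions made for the example.

```python
from typing import Optional


def classify_metric(value: float,
                    warning: Optional[float] = None,
                    error: Optional[float] = None) -> str:
    """Classify one observed metric value against optional thresholds.

    Assumption: a threshold that is omitted (None) or set to 0 is ignored,
    and a value at or above an active threshold counts as a breach.
    """
    def is_active(limit: Optional[float]) -> bool:
        return limit is not None and limit > 0

    # Check the Error threshold first so the more severe state wins.
    if is_active(error) and value >= error:
        return "Error"
    if is_active(warning) and value >= warning:
        return "Warning"
    return "Ok"
```

With only a Warning threshold supplied, as recommended above, `classify_metric(85, warning=80)` returns `"Warning"`, and a metric can never reach `"Error"` because no Error threshold is active.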
@@ -199,7 +200,7 @@ All dmp files are compressed to zip files before uploading to your storage accou
A note on resource usage: This feature is intended for the exceptional case - when your app service is truly doing something really wrong (like leaking memory, ports, handles). Make sure that you set your Error thresholds to meaningfully high values. Internally, FabricObserver will only dump a configured amount of times in a specified time window per service, per observed metric. The idea
is to not eat your local disk space and use up too much CPU for too long. Please be mindful of how you utilize this **debugging** feature. It is best to enable it in Test and Staging clusters to find the egregious bugs in your service code *before* you ship your services to production clusters.

#### Encrypting your secrets
#### Encrypting your secrets (Optional, but recommended)

It is very important that you generate an encrypted Connection String or Account Key string in a supported way: Use Service Fabric's Invoke-ServiceFabricEncryptText PowerShell cmdlet with your Cluster thumbprint or cert name/location.
Please see the [related documentation with samples](https://docs.microsoft.com/en-us/powershell/module/servicefabric/invoke-servicefabricencrypttext?view=azureservicefabricps). It is really easy to do! Non-encrypted strings are supported, but we do not recommend using them. The decision is yours to own.
@@ -242,6 +243,33 @@ Example AzureStorageUploadObserver configuration in ApplicationManifest.xml:
<Parameter Name="AzureStorageUploadObserverZipFileCompressionLevel" DefaultValue="Optimal" />
```

You do not need to encrypt your keys, but that is up to you to decide. We recommend that you do. If you do not want to, then:

In Settings.xml you must change IsEncrypted to false:

```Xml
<Section Name="AzureStorageUploadObserverConfiguration">
<!-- For Authenticating to your Storage Account, you can either provide a Connection String OR an Account Name and Account Key.
NOTE: If you do not plan on encrypting your account secrets, then set IsEncrypted to false both here and in
the AzureStorageUploadObserverConfiguration Section in ApplicationManifest.xml. -->
<Parameter Name="AzureStorageConnectionString" Value="" IsEncrypted="false" MustOverride="true" />
...
<Parameter Name="AzureStorageAccountKey" Value="" IsEncrypted="false" MustOverride="true" />
</Section>
```

In ApplicationManifest.xml you must change IsEncrypted to false:

```XML
<Section Name="AzureStorageUploadObserverConfiguration">
...
<Parameter Name="AzureStorageConnectionString" Value="[AzureStorageUploadObserverStorageConnectionString]" IsEncrypted="false" />
<!-- OR use Account Name/Account Key pair if NOT using Connection String.. -->
<Parameter Name="AzureStorageAccountName" Value="[AzureStorageUploadObserverStorageAccountName]" />
<Parameter Name="AzureStorageAccountKey" Value="[AzureStorageUploadObserverStorageAccountKey]" IsEncrypted="false" />
...
</Section>
```

## CertificateObserver
Monitors the expiration date of the cluster certificate and any other certificates provided by the user.
@@ -271,6 +299,64 @@ Monitors the expiration date of the cluster certificate and any other certificat
<Parameter
```

## ContainerObserver
Monitors CPU and Memory use of Service Fabric containerized (docker) services.

**In order for ContainerObserver to function properly on Windows, FabricObserver must be configured to run as Admin or System user.** This is not the case for Linux deployments.


### Configuration

XML:

Settings.xml

```XML
<!-- NOTE: FabricObserver must run as System or Admin on *Windows* in order to run ContainerObserver successfully. This is not the case for Linux. -->
<Section Name="ContainerObserverConfiguration">
<Parameter Name="Enabled" Value="" MustOverride="true" />
<Parameter Name="ClusterOperationTimeoutSeconds" Value="120" />
<Parameter Name="EnableCSVDataLogging" Value="" MustOverride="true" />
<Parameter Name="EnableEtw" Value="" MustOverride="true"/>
<Parameter Name="EnableTelemetry" Value="" MustOverride="true" />
<Parameter Name="EnableVerboseLogging" Value="" MustOverride="true" />
<Parameter Name="RunInterval" Value="" MustOverride="true" />
<Parameter Name="ConfigFileName" Value="" MustOverride="true" />
</Section>
```
Overridable XML settings are located in ApplicationManifest.xml, as always.

JSON:

Configuration file supplied by user, stored in PackageRoot\\Config folder.

Example JSON config file located in **PackageRoot\\Config** folder (ContainerObserver.config.json). This is an example of a configuration that applies
to all Service Fabric containerized services running on the virtual machine.
```JSON
[
{
"targetApp": "*",
"cpuWarningLimitPercent": 60,
"memoryWarningLimitMb": 1048
}
]
```
Settings descriptions:

All settings are optional, ***except targetApp***, and can be omitted if you don't want to track the corresponding metric. For memory use thresholds, you must supply MB values (for example, 1024 for 1GB).

| Setting | Description |
| :--- | :--- |
| **targetApp** | App URI string to observe. Required. |
| **memoryErrorLimitMb** | Maximum container memory use set in Megabytes that should generate a Fabric Error. |
| **memoryWarningLimitMb**| Minimum container memory set in Megabytes that should generate a Fabric Warning. |
| **cpuErrorLimitPercent** | Maximum CPU percentage that should generate a Fabric Error. |
| **cpuWarningLimitPercent** | Minimum CPU percentage that should generate a Fabric Warning. |
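
Before deploying, it can help to check a ContainerObserver config file against the settings listed in the table above. The following is a minimal Python sketch of such a check, not part of FabricObserver: the function name `validate_container_config` and the specific rules (unknown keys are reported, limits must be non-negative numbers) are assumptions made for illustration.

```python
import json

# Setting names taken from the table above.
KNOWN_SETTINGS = {
    "targetApp",
    "memoryErrorLimitMb",
    "memoryWarningLimitMb",
    "cpuErrorLimitPercent",
    "cpuWarningLimitPercent",
}


def validate_container_config(text: str) -> list:
    """Return a list of problems found in the config text (empty if none)."""
    try:
        entries = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(entries, list):
        return ["top-level value must be a JSON array"]

    problems = []
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: expected a JSON object")
            continue
        target = entry.get("targetApp")
        if not isinstance(target, str) or not target:
            problems.append(f"entry {index}: targetApp is required")
        for key, value in entry.items():
            if key not in KNOWN_SETTINGS:
                problems.append(f"entry {index}: unknown setting {key!r}")
            elif key != "targetApp":
                # bool is a subclass of int in Python, so exclude it explicitly.
                is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
                if not is_number or value < 0:
                    problems.append(f"entry {index}: {key} must be a non-negative number")
    return problems
```

Running it on the example file shown earlier yields an empty list, while an entry without `targetApp` yields one problem message.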

### Notes

**In order for ContainerObserver to function properly on Windows, FabricObserver must be configured to run as Admin or System user.** This is not the case for Linux deployments.

## DiskObserver
This observer monitors, records and analyzes storage disk information.
Depending upon configuration settings, it signals disk health status
84 changes: 84 additions & 0 deletions Documentation/OperationalTelemetry.md
@@ -0,0 +1,84 @@
## FabricObserver Operational Telemetry

FabricObserver operational data is transmitted to Microsoft and contains information about FabricObserver. This information helps us understand which observers matter in the real world, what type of environment they run in, and how many services are being monitored. This information will help us make sure we invest time in the right places. This data does not contain PII or any information about the services running in your cluster or the data handled by the applications. Nor do we capture the configurations set for FO.

**This information is only used by the Service Fabric team and will be stored (data retention) for no more than 90 days.**

Disabling / Enabling transmission of Operational Data:

Transmission of operational data is controlled by a setting and can be easily turned off. The ObserverManagerEnableOperationalTelemetry setting in ApplicationManifest.xml controls transmission of operational data.

Setting the value to false as below will immediately stop the transmission of operational data:

**\<Parameter Name="ObserverManagerEnableOperationalTelemetry" DefaultValue="false" />**

#### Questions we want to answer from data:

- Health of FO
- If FO crashes with an unhandled exception that can be caught, related error information will be sent to us (this will include the offending FO stack). This will help us improve quality.
- Enabled Observers
- Helps us focus effort on the most useful observers.
- Are there any FO plugins running?
- Is FO finding issues (generating health events)? This data is represented in the total number of Warnings/Errors an observer finds in an 8 hour window.
- This telemetry is sent every 8 hours and internal error/warning counters are reset after each telemetry transmission.
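
The count-then-reset behavior described above can be sketched as follows. This is a hypothetical Python illustration, not FabricObserver's implementation: the class `DetectionCounters` and its methods are invented for the example, and only the `...ErrorDetections` / `...WarningDetections` key naming follows the telemetry event documented on this page.

```python
class DetectionCounters:
    """Per-observer Error/Warning counters that reset after each transmission."""

    LEVELS = ("Error", "Warning")

    def __init__(self, observer_names):
        self._names = list(observer_names)
        self._counts = {}
        self._reset()

    def _reset(self):
        # Every enabled observer reports a count, even when it is zero.
        for name in self._names:
            for level in self.LEVELS:
                self._counts[f"{name}{level}Detections"] = 0

    def record(self, observer_name, level):
        """Count one health event of the given level for an observer."""
        key = f"{observer_name}{level}Detections"
        if key not in self._counts:
            raise KeyError(key)
        self._counts[key] += 1

    def flush(self):
        """Return the counts for the elapsed window and start a new window."""
        snapshot = dict(self._counts)
        self._reset()
        return snapshot
```

In this sketch a scheduler would call `flush()` once per window (8 hours in the text above) and send the returned dictionary, so each transmitted event only reflects detections from its own window.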

#### Operational data details:

Here is a full example of exactly what is sent in one of these telemetry events, in this case, from an SFRP cluster:

```JSON
{
"EventName": "OperationalEvent",
"TaskName": "FabricObserver",
"EventRunInterval": "08:00:00",
"ClusterId": "50bf5602-1611-459c-aed2-45b960e9eb16",
"ClusterType": "SFRP",
"NodeNameHash": "1672329571",
"FOVersion": "3.1.17",
"HasPlugins": "False",
"UpTime": "00:00:27.2535830",
"Timestamp": "2021-08-26T20:51:42.8588118Z",
"OS": "Windows",
"EnabledObserverCount": 5,
"AppObserverTotalMonitoredApps": 4,
"AppObserverTotalMonitoredServiceProcesses": 6,
"AppObserverErrorDetections": 0,
"AppObserverWarningDetections": 0,
"CertificateObserverErrorDetections": 0,
"CertificateObserverWarningDetections": 0,
"DiskObserverErrorDetections": 0,
"DiskObserverWarningDetections": 0,
"NodeObserverErrorDetections": 0,
"NodeObserverWarningDetections": 0,
"OSObserverErrorDetections": 0,
"OSObserverWarningDetections": 0
}
```

Let's take a look at the data and why we think it is useful to share with us. We'll go through each object property in the JSON above.
- **EventName** - this is the name of the telemetry event.
- **TaskName** - this specifies that the event is from FabricObserver.
- **EventRunInterval** - this is how often this telemetry is sent from a node in a cluster.
- **ClusterId** - this is used to both uniquely identify a telemetry event and to correlate data that comes from a cluster.
- **ClusterType** - this is the type of cluster: Standalone or SFRP.
- **NodeNameHash** - this is a hashed expression of the name of the Fabric node from where the data originates. It is used to correlate data from specific nodes in a cluster (the hashed node name will be known to be part of the cluster with a specific cluster id).
- **FOVersion** - this is the internal version of FO (if you have your own version naming, we will only know what the FO code version is (not your specific FO app version name)).
- **HasPlugins** - this informs us whether or not FO plugins are being used (we would love to know if folks are using the plugin model).
- **UpTime** - this is the amount of time FO has been running since it last started.
- **Timestamp** - this is the time, in UTC, when FO sent the telemetry.
- **OS** - this is the operating system FO is running on (Windows or Linux).
- **AppObserverTotalMonitoredApps** - this is the total number of deployed applications AppObserver is monitoring.
- **AppObserverTotalMonitoredServiceProcesses** - this is the total number of processes AppObserver is monitoring.
- **AppObserverErrorDetections** - this is how many Error level health events AppObserver generated in an 8 hour window.
- **AppObserverWarningDetections** - this is how many Warning level health events AppObserver generated in an 8 hour window.
- **[Built-in]ObserverErrorDetections** - this is how many Error level health events [Built-in]Observer generated in an 8 hour window.
- **[Built-in]ObserverWarningDetections** - this is how many Warning level health events [Built-in]Observer generated in an 8 hour window.
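
If you want to inspect one of these events yourself, the per-observer detection properties listed above can be totaled with a few lines of code. This is a minimal Python sketch under the assumption that the event is available as JSON text shaped like the example on this page; the function name `detection_totals` is hypothetical.

```python
import json


def detection_totals(event_json: str) -> dict:
    """Sum all *ErrorDetections and *WarningDetections properties in an event."""
    event = json.loads(event_json)
    totals = {"Error": 0, "Warning": 0}
    for name, value in event.items():
        for level in totals:
            # Matches e.g. "AppObserverErrorDetections" or "DiskObserverWarningDetections".
            if name.endswith(f"{level}Detections"):
                totals[level] += int(value)
    return totals
```

Properties such as `EnabledObserverCount` or `AppObserverTotalMonitoredApps` are left out of the totals because their names do not end in `ErrorDetections` or `WarningDetections`.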


Note that specific plugin data, besides whether or not plugins are in use, is not captured. Only agnostic data from built-in (ship with FO) observers is collected.

If the ClusterType is not SFRP then a TenantId (Guid) is sent for use in the same way we use ClusterId.

This information will **really** help us understand how FO is doing out there and we would greatly appreciate you sharing it with us!


2 changes: 1 addition & 1 deletion Documentation/Plugins.md
@@ -3,7 +3,7 @@

**FabricObserver version 3.1.0 introduces a refactored plugin implementation that will break existing plugins. The changes required by plugin authors are trivial, however. Please see the [SampleObserver project](/SampleObserverPlugin) for a complete sample observer plugin implementation with code comments and readme with examples of the new format.**

This document is a simple overview of how to get started with building an observer plugin. Also, for a more advanced sample, please see [ContainerObserver](https://github.com/gittorre/containerobserver).
This document is a simple overview of how to get started with building an observer plugin. Also, for a more advanced sample, please see the [ContainerObserver](https://github.com/gittorre/containerobserver) reference project (ContainerObserver is a part of FO as of 3.1.17).

Note: The plugin model depends on the following packages, which **must have the same versions in both your plugin project and FabricObserver**:

2 changes: 1 addition & 1 deletion Documentation/Using.md
@@ -559,7 +559,7 @@ $appParams = @{ "FabricSystemObserverEnabled" = "true"; "FabricSystemObserverMem
Then execute the application upgrade with

```Powershell
Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/FabricObserver -ApplicationTypeVersion 3.1.16 -ApplicationParameter $appParams -Monitored -FailureAction rollback
Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/FabricObserver -ApplicationTypeVersion 3.1.17 -ApplicationParameter $appParams -Monitored -FailureAction rollback
```

Note: On *Linux*, this will restart FO processes (one at a time, UD Walk with safety checks) due to the way Linux Capabilities work. In a nutshell, for any kind of application upgrade, we have to re-run the FO setup script to get the Capabilities in place. For Windows, FO processes will NOT be restarted.
@@ -10,6 +10,11 @@
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
<PlatformTarget>x64</PlatformTarget>
</PropertyGroup>
<ItemGroup>
<Compile Remove="Utilities\MemoryUsage\**" />
<EmbeddedResource Remove="Utilities\MemoryUsage\**" />
<None Remove="Utilities\MemoryUsage\**" />
</ItemGroup>
<ItemGroup>
<PackageReference Include="Microsoft.AspNet.WebApi.Client" Version="5.2.7" />
<PackageReference Include="Microsoft.ApplicationInsights.DependencyCollector" Version="2.17.0" />
@@ -11,7 +11,7 @@ namespace FabricObserver.Observers.Utilities
{
public interface IProcessInfoProvider
{
float GetProcessPrivateWorkingSetInMB(int processId);
float GetProcessWorkingSetMb(int processId, bool getPrivateWorkingSet = false);

/// <summary>
/// Returns the number of allocated (in use) file handles for a specified process.
@@ -27,7 +27,5 @@ public interface IProcessInfoProvider
/// <param name="process"></param>
/// <returns></returns>
List<(string ProcName, int Pid)> GetChildProcessInfo(int processId);

void Dispose();
}
}