From 5c8215f09ee6ecc4130e575753c9672bc24d72c9 Mon Sep 17 00:00:00 2001
From: Tudor Golubenco
Date: Thu, 23 Feb 2017 13:26:27 +0100
Subject: [PATCH] Filebeat modules dev guide (#3616)

Basic guide on creating new filebeat modules.
---
 filebeat/docs/modules-dev-guide.asciidoc      | 303 ++++++++++++++++++
 filebeat/docs/modules-overview.asciidoc       |   2 +-
 filebeat/docs/modules_list.asciidoc           |   2 +
 .../configuration/filebeat-options.asciidoc   |   1 +
 filebeat/scripts/docs_collector.py            |   2 +
 filebeat/scripts/module/fileset/manifest.yml  |   2 +-
 .../metricset-details.asciidoc                |   1 +
 7 files changed, 311 insertions(+), 2 deletions(-)
 create mode 100644 filebeat/docs/modules-dev-guide.asciidoc

diff --git a/filebeat/docs/modules-dev-guide.asciidoc b/filebeat/docs/modules-dev-guide.asciidoc
new file mode 100644
index 00000000000..59e2a2368d7
--- /dev/null
+++ b/filebeat/docs/modules-dev-guide.asciidoc
@@ -0,0 +1,303 @@
[[filebeat-modules-devguide]]
== Developer guide: creating a new module

This guide will walk you through creating a new Filebeat module.

All Filebeat modules currently live in the main
https://github.com/elastic/beats[Beats] repository. To clone the repository and
build Filebeat (which you will need for testing), please follow the general
Beats https://github.com/elastic/beats/blob/master/CONTRIBUTING.md[CONTRIBUTING]
guide.

=== Overview

Each Filebeat module is composed of one or more "filesets". We usually create a
module for each service that we support (`nginx` for Nginx, `mysql` for MySQL,
etc.) and a fileset for each type of log that the service creates. For
example, the Nginx module has `access` and `error` filesets. You can contribute
a new module (with at least one fileset) or only a new fileset for an existing
module.

=== Creating a new fileset

Regardless of whether you are creating a fileset in a new module or in an
existing module, the procedure is similar. Run the following command in the
`filebeat` folder:

[source,bash]
----
make create-fileset
----

You'll be prompted to enter a module and fileset name. Use only the characters
`[a-z]` and, if required, underscores (`_`). No other characters are allowed.
For the module name, you can use either a new module name or an existing module
name. If the module doesn't exist, it will be created.

In this guide we use `{fileset}` and `{module}` as placeholders for the fileset
and module names. You need to replace these with the actual names you entered
in the previous command.

After running the make command above, you'll find the fileset, along with its
generated files, under `module/{module}/{fileset}`. This directory contains the
following files:

[source,bash]
----
module/{module}/{fileset}
├── manifest.yml
├── config
│   └── {fileset}.yml
├── ingest
│   └── pipeline.json
├── _meta
│   └── fields.yml
└── test
----

Let's look at these files one by one.

==== manifest.yml

The `manifest.yml` is the control file for the module, where variables are
defined and the other files are referenced. It is a YAML file, but in many
places in it, built-in or defined variables can be used with the
`{{.variable}}` syntax.

The `var` section of the file defines the fileset variables and their default
values. The module variables can be referenced in the rest of the configuration
files, and their values can be overridden at runtime via the Filebeat
configuration.
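For example, a user could override a fileset's `paths` variable from the
Filebeat configuration. As a minimal sketch, assuming an `nginx` module with an
`access` fileset (consult the Filebeat modules documentation for the exact
format), this could look roughly like:

[source,yaml]
----
# Sketch only: module, fileset, and variable names depend on your module.
filebeat.modules:
- module: nginx
  access:
    var.paths: ["/opt/logs/nginx/access.log*"]
----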
As the fileset creator, you can use any names for the variables you define.
Each variable must have a default value. So in its simplest form, this is how
you can define a new variable:

[source,yaml]
----
var:
  - name: pipeline
    default: with_plugins
----

Most filesets should have a `paths` variable, which sets the default paths
where the log files can be found:

[source,yaml]
----
var:
  - name: paths
    default:
      - /example/test.log*
    os.darwin:
      - /usr/local/example/test.log*
      - /example/test.log*
    os.windows:
      - c:/programdata/example/logs/test.log*
----

There's quite a lot going on in this example, so let's break it down:

* The name of the variable is `paths` and the default value is an array with
  one element: `"/example/test.log*"`.
* Note that variable values don't have to be strings.
  They can also be numbers, objects, or, as in this example, arrays.
* We will use the `paths` variable to set the prospector
  <<prospector-paths,paths>> setting, so "glob" values can be used here.
* Besides the `default` value, the example also defines values for particular
  operating systems: a default for darwin/OS X/macOS systems and a default for
  Windows systems. These are introduced via the `os.darwin` and `os.windows`
  keywords. The values under these keys become the default for the variable if
  Filebeat is executed on the respective OS.

Besides the variable definitions, the `manifest.yml` file also contains
references to the ingest pipeline and prospector configuration to use (see the
next sections):

[source,yaml]
----
ingest_pipeline: ingest/pipeline.json
prospector: config/testfileset.yml
----

These should point to the respective files from the fileset.

Note that when evaluating both of these, the variables are expanded, which
enables you to select one file or the other depending on the value of a
variable. For example:

[source,yaml]
----
ingest_pipeline: ingest/{{.pipeline}}.json
----

This loads the ingest pipeline file whose name matches the value of the
`pipeline` variable.

==== config/*.yml

The `config/` folder contains the template files that generate Filebeat
prospector configurations. The Filebeat prospectors are primarily responsible
for tailing the files, filtering, and multi-line stitching, so that is what you
configure here.

A typical example, which is also what is automatically generated, looks like
this:

[source,yaml]
----
input_type: log
paths:
{{ range $i, $path := .paths }}
 - {{$path}}
{{ end }}
exclude_files: [".gz$"]
----

The template needs to generate a valid Filebeat prospector configuration in
YAML format. The options accepted by the prospector configuration are
documented in the Filebeat prospector configuration reference.

The templating language used in these files is the one from the Go standard
library's https://golang.org/pkg/text/template/[text/template] package. In this
example, the `paths` variable is used to construct the `paths` list for the
<<prospector-paths,paths>> option.

Here is another example that also configures multiline stitching:

[source,yaml]
----
input_type: log
paths:
{{ range $i, $path := .paths }}
 - {{$path}}
{{ end }}
exclude_files: [".gz$"]
multiline:
  pattern: "^# User@Host: "
  negate: true
  match: after
----

While you can add multiple configuration files under the `config/` folder, only
the one indicated by the `manifest.yml` file will be loaded. You can use
variables to dynamically switch between configurations.
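To make the template expansion concrete, here is roughly what the first
template above would generate on macOS, using the example `paths` defaults from
the manifest (an illustration only; the exact whitespace produced by the Go
template may differ):

[source,yaml]
----
# Illustrative rendered prospector configuration, not a file you edit directly.
input_type: log
paths:
 - /usr/local/example/test.log*
 - /example/test.log*
exclude_files: [".gz$"]
----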
==== ingest/*.json

The `ingest/` folder contains the Elasticsearch
{elasticsearch}/ingest.html[Ingest Node] pipeline configurations. The Ingest
Node pipelines are responsible for parsing the log lines and doing other
manipulations on the data.

The files in this folder are JSON documents representing
{elasticsearch}/pipeline.html[pipeline definitions]. Just like with the
`config/` folder, you can define multiple pipelines, but a single one is loaded
at runtime based on the information from `manifest.yml`.

The generator creates a JSON object similar to this one:

[source,json]
----
{
  "description": "Pipeline for parsing {module} {fileset} logs",
  "processors": [
    ],
  "on_failure" : [{
    "set" : {
      "field" : "error",
      "value" : "{{ _ingest.on_failure_message }}"
    }
  }]
}
----

From here, you would typically add processors to the `processors` array to do
the actual parsing. For details on how to use the Ingest Node processors,
please see the {elasticsearch}/ingest-processors.html[processors documentation].
In particular, you will likely find the Grok processor useful for parsing. Here
is an example that parses the Nginx access logs:

[source,json]
----
{
  "grok": {
    "field": "message",
    "patterns":[
      "%{IPORHOST:nginx.access.remote_ip} - %{DATA:nginx.access.user_name} \\[%{HTTPDATE:nginx.access.time}\\] \"%{WORD:nginx.access.method} %{DATA:nginx.access.url} HTTP/%{NUMBER:nginx.access.http_version}\" %{NUMBER:nginx.access.response_code} %{NUMBER:nginx.access.body_sent.bytes} \"%{DATA:nginx.access.referrer}\" \"%{DATA:nginx.access.agent}\""
    ],
    "ignore_missing": true
  }
}
----

Note that you should follow the convention of prefixing field names with the
module and fileset name: `{module}.{fileset}.field`, for example
`nginx.access.remote_ip`. Also, please review our
{libbeat}/event-conventions.html[field naming conventions].

While developing the pipeline definition, we recommend making use of the
{elasticsearch}/simulate-pipeline-api.html[Simulate Pipeline API] for testing
and quick iteration.

==== _meta/fields.yml

The `fields.yml` file contains the top-level structure for the fields in your
fileset. It is used as the source of truth for:

* the generated Elasticsearch mapping template
* the generated Kibana index pattern
* the generated documentation for the exported fields

Besides the `fields.yml` file in the fileset, there is also a `fields.yml` file
at the module level, placed under `module/{module}/_meta/fields.yml`, which
contains the fields defined at the module level and the description of the
module itself. In most cases, you should add the fields at the fileset level.
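For illustration, a fileset-level `fields.yml` might look roughly like the
following sketch. The group name and the fields shown here are hypothetical
examples; use the names and types that your ingest pipeline actually produces:

[source,yaml]
----
# Hypothetical sketch for a fileset named "access"; adjust names and types
# to match the fields produced by your ingest pipeline.
- name: access
  type: group
  description: >
    Fields parsed from the access logs of the example service.
  fields:
    - name: remote_ip
      type: keyword
      description: >
        Client IP address.
    - name: response_code
      type: long
      description: >
        HTTP response status code.
----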
==== test

In the `test/` directory, you should place sample log files generated by the
service. We have integration tests, automatically executed by CI, that will run
Filebeat on each of the log files under the `test/` folder and check that there
are no parsing errors and that all fields are documented.

In addition, assuming you have a `test.log` file, you can add a
`test.log-expected.json` file in the same directory that contains the expected
documents as they are found via an Elasticsearch search. In this case, the
integration tests will automatically check that the result is the same on each
run.

=== Module level files

Besides the files in the fileset folder, there is also data that needs to be
filled in at the module level.

==== _meta/docs.asciidoc

This file contains module-specific documentation. You should include
information about which versions of the service were tested and which variables
each fileset defines.

==== _meta/fields.yml

The module-level `fields.yml` contains descriptions for the module-level
fields. Please review and update the title and the descriptions in this file.
The title is used as a title in the docs, so it's best to capitalize it.
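As a rough sketch, and assuming the `fields.yml` layout used by the existing
modules (check a module such as `nginx` for the authoritative structure), a
module-level `fields.yml` for a hypothetical `example` module could look like
this:

[source,yaml]
----
# Hypothetical module-level fields.yml for a module named "example".
- key: example
  title: "Example"
  description: >
    Module for parsing the logs of the example service.
  fields:
    - name: example
      type: group
      description: >
        Fields from the example service logs.
----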
==== _meta/kibana

This folder contains the sample Kibana dashboards for this module. To create
them, you can build them visually in Kibana and then export them with the
following command:

[source,shell]
----
$ cd filebeat/module/{module}/
python ../../../dev-tools/export_dashboards.py --regex {module} --dir _meta/kibana
----

The `--regex` parameter should match the dashboards that you want to export.

You can find more details about the process of creating and exporting Kibana
dashboards in {libbeat}/new-dashboards.html[this guide].
diff --git a/filebeat/docs/modules-overview.asciidoc b/filebeat/docs/modules-overview.asciidoc
index 9c00813559c..0b56b7e1e6d 100644
--- a/filebeat/docs/modules-overview.asciidoc
+++ b/filebeat/docs/modules-overview.asciidoc
@@ -53,7 +53,7 @@ enable the two plugins from the configuration page.
 
 This also assumes you have Nginx installed and writing logs in the default
 location and format. If you want to monitor another service for which a module
-exists, feel free to adjust the tutorial.
+exists, adjust the commands in the tutorial accordingly.
 
 You can start Filebeat with the following command:
diff --git a/filebeat/docs/modules_list.asciidoc b/filebeat/docs/modules_list.asciidoc
index a4655a21ab1..30d29bd6b51 100644
--- a/filebeat/docs/modules_list.asciidoc
+++ b/filebeat/docs/modules_list.asciidoc
@@ -7,6 +7,7 @@ This file is generated! See scripts/docs_collector.py
   * <>
   * <>
   * <>
+  * <>
 
 --
 
@@ -16,3 +17,4 @@ include::modules/apache2.asciidoc[]
 include::modules/mysql.asciidoc[]
 include::modules/nginx.asciidoc[]
 include::modules/system.asciidoc[]
+include::modules-dev-guide.asciidoc[]
diff --git a/filebeat/docs/reference/configuration/filebeat-options.asciidoc b/filebeat/docs/reference/configuration/filebeat-options.asciidoc
index 9e7eab88c47..9e26fb8b083 100644
--- a/filebeat/docs/reference/configuration/filebeat-options.asciidoc
+++ b/filebeat/docs/reference/configuration/filebeat-options.asciidoc
@@ -33,6 +33,7 @@ One of the following input types:
 The value that you specify here is used as the `input_type` for each event
 published to Logstash and Elasticsearch.
 
+[[prospector-paths]]
 ===== paths
 
 A list of glob-based paths that should be crawled and fetched. All patterns
diff --git a/filebeat/scripts/docs_collector.py b/filebeat/scripts/docs_collector.py
index 4b495f72fd9..50f7711d52a 100644
--- a/filebeat/scripts/docs_collector.py
+++ b/filebeat/scripts/docs_collector.py
@@ -60,11 +60,13 @@ def collect(beat_name):
     module_list_output += " * <>\n"
     for m, title in sorted(modules_list.iteritems()):
         module_list_output += " * <>\n"
+    module_list_output += " * <>\n"
 
     module_list_output += "\n\n--\n\n"
 
     module_list_output += "include::modules-overview.asciidoc[]\n"
     for m, title in sorted(modules_list.iteritems()):
         module_list_output += "include::modules/" + m + ".asciidoc[]\n"
+    module_list_output += "include::modules-dev-guide.asciidoc[]\n"
 
     # Write module link list
     with open(os.path.abspath("docs") + "/modules_list.asciidoc", 'w') as f:
diff --git a/filebeat/scripts/module/fileset/manifest.yml b/filebeat/scripts/module/fileset/manifest.yml
index e8c4f0894cd..669f5e791c2 100644
--- a/filebeat/scripts/module/fileset/manifest.yml
+++ b/filebeat/scripts/module/fileset/manifest.yml
@@ -10,4 +10,4 @@ var:
     - c:/programdata/example/logs/test.log*
 
 ingest_pipeline: ingest/pipeline.json
-prospectors: config/{fileset}.yml
+prospector: config/{fileset}.yml
diff --git a/metricbeat/docs/developer-guide/metricset-details.asciidoc b/metricbeat/docs/developer-guide/metricset-details.asciidoc
index cdaa92de949..e8953702add 100644
--- a/metricbeat/docs/developer-guide/metricset-details.asciidoc
+++ b/metricbeat/docs/developer-guide/metricset-details.asciidoc
@@ -90,6 +90,7 @@ functionality of the metricset and `Fetch` method from the data mapping.
 The `fields.yml` file is used for different purposes:
 
 * Creates the Elasticsearch template
+* Creates the Kibana index pattern configuration
 * Creates the Exported Fields documentation for the metricset
 
 To make sure the Elasticsearch template is correct, it's important to keep this file up-to-date