[apache_spark] Add Apache Spark package #2811

Closed
wants to merge 18 commits into from
1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -5,6 +5,7 @@
/packages/activemq @elastic/integrations
/packages/akamai @elastic/security-external-integrations
/packages/apache @elastic/integrations
/packages/apache_spark @elastic/integrations
/packages/atlassian_bitbucket @elastic/security-external-integrations
/packages/atlassian_confluence @elastic/security-external-integrations
/packages/atlassian_jira @elastic/security-external-integrations
3 changes: 3 additions & 0 deletions packages/apache_spark/_dev/build/build.yml
@@ -0,0 +1,3 @@
dependencies:
ecs:
reference: [email protected]
96 changes: 96 additions & 0 deletions packages/apache_spark/_dev/build/docs/README.md
@@ -0,0 +1,96 @@
# Apache Spark

The Apache Spark integration collects and parses data using the Jolokia Metricbeat Module.
Contributor @akshay-saraswat commented on Apr 27, 2022:

nit: Would stating "Metricbeat module" confuse users into thinking that a separate Metricbeat module is required for this integration? Should we instead say "Jolokia input"? Although it's implicit from the requirements that the Metricbeat module is not required, it would be nice to make it explicit if you receive more comments to address and update this PR. Otherwise, this PR looks good to me. Don't update just for this nitpick.

Contributor (author) replied:

Thanks for the approval, @akshay-saraswat! This PR is the old PR, which was split into data-stream-specific PRs as discussed. The other PRs have been approved and some of them are already merged. We do have one open PR (#3070) which is approved by Jaime Soriano Pastor and is waiting for approval from CODEOWNERS. If "Jolokia input" sounds more intuitive to you, we'll update that PR with the same change; it's not a problem.


## Compatibility

This integration has been tested against Apache Spark version 3.2.0.

## Requirements

To ingest data from Apache Spark, you must know the full host addresses for the master and worker nodes.

To gather Spark statistics, download and enable the Jolokia JVM agent:

```
cd /usr/share/java/
wget -O jolokia-agent.jar http://search.maven.org/remotecontent?filepath=org/jolokia/jolokia-jvm/1.3.6/jolokia-jvm-1.3.6-agent.jar
```

Once the Jolokia JVM agent is downloaded, configure Apache Spark to use it as a Java agent and expose metrics via HTTP/JSON. Edit `spark-env.sh`, which should be in `/usr/local/spark/conf`, and add the following parameters (assuming Spark is installed in `/usr/local/spark`; if not, change the path to the one where Spark is installed):
Contributor: nit: HTTP/JSON

```
export SPARK_MASTER_OPTS="$SPARK_MASTER_OPTS -javaagent:/usr/share/java/jolokia-agent.jar=config=/usr/local/spark/conf/jolokia-master.properties"
```

Now, create the `/usr/local/spark/conf/jolokia-master.properties` file with the following content:
Contributor: Please use the words main/worker instead of master/worker everywhere.

Contributor (author): Are you suggesting using main/worker everywhere in the README, or everywhere else too (like the field names)?

Contributor: Everywhere it can be applied.


Contributor (author): Got it. Thanks for the links, I'll make the changes.

```
host=0.0.0.0
port=7777
agentContext=/jolokia
backlog=100

policyLocation=file:///usr/local/spark/conf/jolokia.policy
historyMaxEntries=10
debug=false
debugMaxEntries=100
maxDepth=15
maxCollectionSize=1000
maxObjects=0
```
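These agent options are plain `key=value` Java properties. As a quick sanity check before restarting Spark, you can parse the file and confirm the listen port matches what the integration will scrape. A minimal sketch (not part of the integration; the sample content is taken from the snippet above):

```python
# Sketch: parse a Jolokia agent properties file (key=value lines) and
# verify the listen address/port match what the integration expects.

def parse_properties(text: str) -> dict:
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

sample = """host=0.0.0.0
port=7777
agentContext=/jolokia
"""
props = parse_properties(sample)
assert props["port"] == "7777"
assert props["agentContext"] == "/jolokia"
```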

Now create `/usr/local/spark/conf/jolokia.policy` with the following content:
```
<?xml version="1.0" encoding="utf-8"?>
<restrict>
<http>
<method>get</method>
<method>post</method>
</http>
<commands>
<command>read</command>
</commands>
</restrict>
```

Contributor: I'm wondering if you can use an `xml` code fence to enable code coloring. Could you please check if it works?

Configure the agent with the following in the `conf/bigdata.ini` file:
```
[Spark-Master]
stats: http://127.0.0.1:7777/jolokia/read
```
Restart the Spark master.

Follow the same set of steps for the Spark worker, driver, and executor.
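With the agents running, it can help to sanity-check the endpoints that the `stats:` entries in `bigdata.ini` point at. Below is a minimal sketch, not part of the integration, that builds those read URLs (the host and ports are the values used in this guide; the `'/' -> '!/'` escaping follows the Jolokia protocol's GET convention):

```python
# Sketch: build the Jolokia read URLs exposed by the master (7777) and
# worker (7778) agents configured above.

def jolokia_read_url(host: str, port: int, mbean: str = "") -> str:
    """Build a Jolokia GET read URL; append an mbean name to read it."""
    base = f"http://{host}:{port}/jolokia/read"
    if mbean:
        # Jolokia's GET syntax escapes '/' inside mbean names as '!/'
        return f"{base}/{mbean.replace('/', '!/')}"
    return base

print(jolokia_read_url("127.0.0.1", 7777))  # master agent
print(jolokia_read_url("127.0.0.1", 7778))  # worker agent
```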

## Metrics

### Driver

This is the `driver` dataset.

{{event "driver"}}

{{fields "driver"}}

### Executors

This is the `executors` dataset.

{{event "executors"}}

{{fields "executors"}}

### Applications

This is the `applications` dataset.

{{event "applications"}}

{{fields "applications"}}

### Nodes

This is the `nodes` dataset.

{{event "nodes"}}

{{fields "nodes"}}
@@ -0,0 +1,16 @@
ARG SERVICE_VERSION=${SERVICE_VERSION:-3.2.0}
FROM docker.io/bitnami/spark:${SERVICE_VERSION}

USER root
RUN \
apt-get update && apt-get install -y \
wget
Contributor: Is curl present in the image? Maybe we don't need to install wget :)

Contributor (author): Good idea, thank you!


ENV JOLOKIA_VERSION=1.6.0
RUN mkdir /usr/share/java && \
wget "http://search.maven.org/remotecontent?filepath=org/jolokia/jolokia-jvm/${JOLOKIA_VERSION}/jolokia-jvm-${JOLOKIA_VERSION}-agent.jar" -O "/usr/share/java/jolokia-agent.jar" && \
echo 'export SPARK_MASTER_OPTS="$SPARK_MASTER_OPTS -javaagent:/usr/share/java/jolokia-agent.jar=config=/spark/conf/jolokia-master.properties"' >> "/opt/bitnami/spark/conf/spark-env.sh"

RUN echo '*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink' >> "/opt/bitnami/spark/conf/metrics.properties" && \
echo '*.source.jvm.class=org.apache.spark.metrics.source.JvmSource' >> "/opt/bitnami/spark/conf/metrics.properties"
HEALTHCHECK --interval=1s --retries=90 CMD curl -f http://localhost:7777/jolokia/version
Contributor: `curl -f -s`

@@ -0,0 +1,16 @@
ARG SERVICE_VERSION=${SERVICE_VERSION:-3.2.0}
Contributor: This file and the one for the master look nearly the same. Is there an option to use a common Dockerfile and use ENV vars to override the differences?

FROM docker.io/bitnami/spark:${SERVICE_VERSION}

USER root
RUN \
apt-get update && apt-get install -y \
wget

ENV JOLOKIA_VERSION=1.6.0
RUN mkdir /usr/share/java && \
wget "http://search.maven.org/remotecontent?filepath=org/jolokia/jolokia-jvm/${JOLOKIA_VERSION}/jolokia-jvm-${JOLOKIA_VERSION}-agent.jar" -O "/usr/share/java/jolokia-agent.jar" && \
echo 'export SPARK_MASTER_OPTS="$SPARK_MASTER_OPTS -javaagent:/usr/share/java/jolokia-agent.jar=config=/spark/conf/jolokia-master.properties"' >> "/opt/bitnami/spark/conf/spark-env.sh"

RUN echo '*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink' >> "/opt/bitnami/spark/conf/metrics.properties" && \
echo '*.source.jvm.class=org.apache.spark.metrics.source.JvmSource' >> "/opt/bitnami/spark/conf/metrics.properties"
HEALTHCHECK --interval=1s --retries=90 CMD curl -f http://localhost:7778/jolokia/version
20 changes: 20 additions & 0 deletions packages/apache_spark/_dev/deploy/docker/docker-compose.yml
@@ -0,0 +1,20 @@
version: '2'
services:
apache_spark:
hostname: apachesparkmaster
Contributor: nit: apache-spark-main. It's easier to read with dashes :)

build:
context: ./Dockerfiles
dockerfile: Dockerfile-master
environment:
- SPARK_MODE=master
- SPARK_RPC_AUTHENTICATION_ENABLED=no
- SPARK_RPC_ENCRYPTION_ENABLED=no
- SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
- SPARK_SSL_ENABLED=no
- SPARK_MASTER_OPTS="-javaagent:/usr/share/java/jolokia-agent.jar=config=/spark/conf/jolokia-master.properties"
Contributor: Can we move all/some of them to the Dockerfiles instead?

ports:
- 7777
Contributor: Please sort

- 8088
- 7077
volumes:
- ./jolokia-configs/master:/spark/conf/:rw
@@ -0,0 +1,2 @@
[Spark-Master]
stats: http://127.0.0.1:7777/jolokia/read
@@ -0,0 +1,12 @@
host=0.0.0.0
port=7777
agentContext=/jolokia
backlog=100

policyLocation=file:///spark/conf/jolokia.policy
historyMaxEntries=10
debug=false
debugMaxEntries=100
maxDepth=15
maxCollectionSize=1000
maxObjects=0
@@ -0,0 +1,13 @@
<?xml version="1.0" encoding="utf-8"?>
<restrict>
<http>
<method>get</method>
<method>post</method>
</http>
<commands>
<command>read</command>
<command>list</command>
<command>search</command>
<command>version</command>
</commands>
</restrict>
@@ -0,0 +1,2 @@
[Spark-Worker]
stats: http://127.0.0.1:7778/jolokia/read
@@ -0,0 +1,12 @@
host=0.0.0.0
port=7778
agentContext=/jolokia
backlog=100

policyLocation=file:///spark/conf/jolokia.policy
historyMaxEntries=10
debug=false
debugMaxEntries=100
maxDepth=15
maxCollectionSize=1000
maxObjects=0
@@ -0,0 +1,13 @@
<?xml version="1.0" encoding="utf-8"?>
<restrict>
<http>
<method>get</method>
<method>post</method>
</http>
<commands>
<command>read</command>
<command>list</command>
<command>search</command>
<command>version</command>
</commands>
</restrict>
4 changes: 4 additions & 0 deletions packages/apache_spark/_dev/deploy/variants.yml
@@ -0,0 +1,4 @@
variants:
v3.2.0:
SERVICE_VERSION: 3.2.0
default: v3.2.0
7 changes: 7 additions & 0 deletions packages/apache_spark/changelog.yml
@@ -0,0 +1,7 @@
# newer versions go on top

- version: "0.1.0"
changes:
- description: Apache Spark integration package with Driver, Executor, Master, Worker and ApplicationSource metrics
type: enhancement
link: https://github.com/elastic/integrations/pull/2811
@@ -0,0 +1,21 @@
metricsets: ["jmx"]
namespace: "metrics"
hosts:
{{#each hosts}}
- {{this}}
{{/each}}
path: {{path}}
period: {{period}}
jmx.mappings:
- mbean: 'metrics:name=application.*.runtime_ms,type=gauges'
attributes:
- attr: Value
field: application_source.runtime.ms
- mbean: 'metrics:name=application.*.cores,type=gauges'
attributes:
- attr: Value
field: application_source.cores
- mbean: 'metrics:name=application.*.status,type=gauges'
attributes:
- attr: Value
field: application_source.status
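Each mapping's `*` matches the application-specific segment of the gauge name. The matching itself is done by the Jolokia input; purely as an illustration of how a concrete mbean name lines up with these patterns (the sample mbean name below is hypothetical):

```python
import fnmatch

# Sketch: which jmx.mappings pattern does a given mbean name satisfy,
# and which target field does it map to?
patterns = {
    "metrics:name=application.*.runtime_ms,type=gauges": "application_source.runtime.ms",
    "metrics:name=application.*.cores,type=gauges": "application_source.cores",
    "metrics:name=application.*.status,type=gauges": "application_source.status",
}

def target_field(mbean: str):
    for pattern, field in patterns.items():
        if fnmatch.fnmatchcase(mbean, pattern):
            return field
    return None

mbean = "metrics:name=application.SparkPi.1646003551963.cores,type=gauges"
assert target_field(mbean) == "application_source.cores"
```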
@@ -0,0 +1,37 @@
---
description: Pipeline for parsing Apache Spark application metrics.
processors:
- set:
field: ecs.version
value: '8.0.0'
- rename:
field: jolokia
target_field: apache_spark
ignore_missing: true
- set:
field: event.type
value: info
- set:
field: event.kind
value: metric
- set:
field: event.module
value: apache_spark
- script:
lang: painless
description: This script adds the application name under the key 'application_source.name'
if: ctx?.apache_spark?.metrics?.mbean?.contains("name=application") == true
source: >-
def bean_name = ctx.apache_spark.metrics.mbean.toString().splitOnToken(".");
def app_name = "";
if (bean_name[0].contains("name=application") == true) {
app_name = bean_name[1] + "." + bean_name[2];
}
ctx.apache_spark.metrics.application_source.name = app_name;
- remove:
field: apache_spark.metrics.mbean
ignore_failure: true
on_failure:
- set:
field: error.message
value: '{{ _ingest.on_failure_message }}'
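The Painless script above splits the mbean name on `.` and joins the second and third tokens into the application name. A hypothetical Python equivalent of that logic, using a made-up mbean name of the shape Spark emits:

```python
# Sketch of the pipeline's Painless logic: extract the application name
# ("<app>.<id>") from an mbean such as
# "metrics:name=application.SparkPi.1646003551963.cores,type=gauges".
# The sample mbean value is illustrative, not taken from real data.

def application_name(mbean: str) -> str:
    parts = mbean.split(".")
    if "name=application" in parts[0]:
        return parts[1] + "." + parts[2]
    return ""

mbean = "metrics:name=application.SparkPi.1646003551963.cores,type=gauges"
assert application_name(mbean) == "SparkPi.1646003551963"
```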
@@ -0,0 +1,12 @@
- name: data_stream.dataset
type: constant_keyword
description: Data stream dataset.
- name: data_stream.namespace
type: constant_keyword
description: Data stream namespace.
- name: data_stream.type
type: constant_keyword
description: Data stream type.
- name: '@timestamp'
type: date
description: Event timestamp.
12 changes: 12 additions & 0 deletions packages/apache_spark/data_stream/applications/fields/ecs.yml
@@ -0,0 +1,12 @@
- external: ecs
name: event.kind
- external: ecs
name: event.type
- external: ecs
name: ecs.version
- external: ecs
name: tags
- external: ecs
name: service.address
- external: ecs
name: service.type
@@ -0,0 +1,23 @@
- name: apache_spark.metrics
type: group
release: beta
fields:
- name: application_source
Contributor: This should be called `applications`, like the data stream.

type: group
fields:
- name: cores
type: long
description: |
Number of cores.
- name: name
type: keyword
description: |
Name of the application.
- name: runtime_ms
type: long
description: |
Time taken to run the application (ms).
- name: status
type: keyword
description: |
Current status of the application.
31 changes: 31 additions & 0 deletions packages/apache_spark/data_stream/applications/manifest.yml
@@ -0,0 +1,31 @@
title: Apache Spark application metrics
type: metrics
release: beta
streams:
- input: jolokia/metrics
title: Apache Spark application metrics
description: Collect Apache Spark application metrics using Jolokia agent.
vars:
- name: hosts
type: text
title: Hosts
multi: true
description: |
Full host addresses of the Jolokia agents for Apache Spark (e.g. https://spark_master:jolokia_port).
required: true
show_user: true
- name: path
type: text
title: Path
multi: false
required: true
show_user: false
default: /jolokia/?ignoreErrors=true&canonicalNaming=false
- name: period
type: text
title: Period
multi: false
required: true
show_user: true
default: 60s
template_path: "stream.yml.hbs"
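For context, the `stream.yml.hbs` template shown earlier interpolates these `hosts`, `path`, and `period` variables. A rough sketch of what the rendered head of the stream config looks like with the defaults above (illustrative only; the real rendering is done by the package system's Handlebars engine, and `spark-master` is a placeholder host):

```python
# Sketch: approximate the rendered head of stream.yml.hbs using the
# manifest defaults above. "spark-master" is a placeholder hostname.
hosts = ["http://spark-master:7777"]
path = "/jolokia/?ignoreErrors=true&canonicalNaming=false"
period = "60s"

rendered = 'metricsets: ["jmx"]\nnamespace: "metrics"\nhosts:\n'
rendered += "".join(f"- {h}\n" for h in hosts)
rendered += f"path: {path}\nperiod: {period}\n"
print(rendered)
```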