
[apache_spark] Add Apache Spark package #2811

Closed
wants to merge 18 commits

Conversation

Contributor

@yug-rajani yug-rajani commented Mar 10, 2022

What does this PR do?

  • Generated the skeleton of the Apache Spark integration package.
  • Added 4 data streams (Nodes, Driver, Executors, Applications).
  • Added the data collection logic.
  • Added the ingest pipelines.
  • Mapped fields according to the ECS schema and added field metadata in the appropriate YAML files.
  • Added dashboards and visualizations.
  • Added system test cases.

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • If I'm introducing a new feature, I have modified the Kibana version constraint in my package's manifest.yml file to point to the latest Elastic stack release (e.g. ^7.13.0).

How to test this PR locally

  • Clone the integrations repo.
  • Install elastic-package locally.
  • Start the Elastic Stack using elastic-package.
  • Move to the integrations/packages/apache_spark directory.
  • Run the following command to run the tests:

elastic-package test
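
For reference, the full sequence of the steps above looks roughly like this (a sketch; the `go install` step is one common way to install elastic-package and assumes Go is available):

```shell
# Clone the integrations repository.
git clone https://github.com/elastic/integrations.git

# Install elastic-package locally (requires Go).
go install github.com/elastic/elastic-package@latest

# Start a local Elastic Stack managed by elastic-package.
elastic-package stack up -d

# Run the package tests from the package directory.
cd integrations/packages/apache_spark
elastic-package test
```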

Screenshots

(Six dashboard and visualization screenshots attached.)

@yug-rajani yug-rajani requested a review from mtojek March 10, 2022 06:45
@elasticmachine

elasticmachine commented Mar 10, 2022

💔 Build Failed


Build stats

  • Start Time: 2022-03-31T08:05:05.756+0000

  • Duration: 10 min 39 sec

Steps errors: 2

Check integration: apache_spark
  • Took 0 min 0 sec
  • Description: ../../build/elastic-package check -v
Google Storage Download
  • Took 0 min 0 sec

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@yug-rajani yug-rajani self-assigned this Mar 10, 2022
@yug-rajani yug-rajani added the enhancement, New Integration, and Team:Integrations labels Mar 10, 2022
@yug-rajani yug-rajani linked an issue Mar 10, 2022 that may be closed by this pull request
@mtojek mtojek requested review from a team and lalit-satapathy March 10, 2022 09:38
Collaborator

@lalit-satapathy lalit-satapathy left a comment

Could you populate README.md with all the details needed? It currently does not have any details about the integration.

Contributor

@mtojek mtojek left a comment

Did you take a look at the Spark project documentation? Metrics are grouped based on instances; there isn't one big bucket for all metrics.

policy_templates:
- name: apache_spark
title: Apache Spark metrics
description: Collect Apache Spark metrics
Contributor

Same problem as with Spring Boot: you need to divide metrics into logical groups.
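
For illustration, a grouping along these lines is one possibility (a rough sketch only; the data stream names are hypothetical and the field names are from memory of the package spec, so they may not be exact):

```yaml
policy_templates:
  - name: apache_spark
    title: Apache Spark metrics
    description: Collect Apache Spark metrics
    # One data stream per logical group instead of a single bucket.
    data_streams:
      - nodes
      - driver
      - executor
      - application
```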

Contributor Author

I hope we have addressed this in our previous comment.

size: 600x600
type: image/png
icons:
- src: /img/apache_spark-logo.svg
Contributor

I guess you can use the right logo?


- monitoring
release: beta
conditions:
kibana.version: ^7.16.2 || ^8.0.0
Contributor

Hm... I think you can go with ^8.0.0 only. We don't need to publish this package for old releases.

Contributor Author

The execution/sprint plan that we sent out mentioned ^7.16.0 and hence we kept this. But if you believe we should just keep ^8.0.0, we will update it.
CC: @akshay-saraswat

Contributor

8.0 sounds fine to me too. Let's reduce the supported versions and hence the need for backward compatibility as much as we can. :)

Contributor Author

Understood, thank you for the clarification. We will go with 8.0.

Contributor

Made a similar comment on the Nagios integration just now. Let's make it 8.2.0 instead; 8.0 is already released, and I am guessing we will not be able to publish this integration before 8.2.0.

Collaborator

++ on releasing new integrations only for the most recent version. I assume we don't test all the previous releases.

Contributor Author

Sure, we'll update it once it has been verified and tested on our end. We'll follow the same approach for future integrations as well.
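
For reference, the resulting condition in the package manifest.yml would be roughly the following (a sketch of the agreed change, not the exact diff):

```yaml
conditions:
  kibana.version: "^8.2.0"
```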


- version: "0.1.0"
changes:
- description: Initial draft of the package
Contributor

If you plan to merge this package with only this PR, it won't be a draft anymore :)

Contributor Author

Yes, that's in the plan for future commits.
Does this look good?
Apache Spark integration package with Driver, Executor, Master, Worker and ApplicationSource metrics


This is a new integration created using the [elastic-package](https://github.com/elastic/elastic-package) tool.

Consider using the README template file `_dev/build/docs/README.md` to generate a list of exported fields or include a sample event.
Contributor

What is required from the end user to set up the integration? What kind of changes need to be introduced to Spark?

Contributor Author

Yes, that's something we are still working on. We'll update the README with detailed steps of the changes required.

@yug-rajani
Contributor Author

Could you populate "README.md" with all details needed? It does not have any details of the integration currently.

@lalit-satapathy, yes, that's something we are still working on. This PR is still in progress, so expect system tests, dashboards, and the README in later commits.

@yug-rajani
Contributor Author

Did you take a look at the Spark project documentation? Metrics are grouped based on instances, there isn't one big bucket for all metrics.

@mtojek Yes, we did check out the documentation. However, we referred to the Hadoop integration, where we grouped multiple metrics into a single data stream with future extensibility in mind, and thought of using the same approach here for Apache Spark.

@yug-rajani yug-rajani marked this pull request as ready for review March 12, 2022 08:23
@yug-rajani yug-rajani requested a review from a team as a code owner March 12, 2022 08:23
@elasticmachine

Pinging @elastic/integrations (Team:Integrations)

@yug-rajani
Contributor Author

/test

Contributor

@mtojek mtojek left a comment

@mtojek Yes, we did check out the documentation. However, we referred to the Hadoop integration where we have clubbed multiple metrics in a single data stream keeping in mind the future extensibility and thought of using the same approach here for Apache Spark.

I'm afraid that isn't the approach we'd like to follow. We'd rather see metrics split into multiple areas, adding a new area when needed. We have to iterate on this. I will check the Hadoop integration, and if the same applies there, I guess we have to refactor that one as well.

@yug-rajani yug-rajani mentioned this pull request Mar 15, 2022
@yug-rajani
Contributor Author

/test

@yug-rajani
Contributor Author

Which provider or logs are you planning to use for the Cluster Manager metrics?

We haven't analyzed the Cluster Manager as of now. Also, it wasn't part of the documentation mentioned in the PRD.

@yug-rajani
Contributor Author

yug-rajani commented Mar 22, 2022

We are facing an issue with the system tests for the ApplicationSource, Driver, and Executor data streams, as we need to keep an application running in the container to collect the metrics. We are actively working on it and will update the PR as soon as it's done. Meanwhile, please feel free to review the other changes that are complete.

@yug-rajani
Contributor Author

Issue with system tests:
The system tests for the nodes data stream are working fine. However, for the driver and executors data streams, the metrics are only available while an application is running.

If we run a sample application (for example, WordCount) as part of the Dockerfile, it is treated as a build step and finishes once the application has run to completion, so the metrics are not found during the system tests. We also tried running the application in the background (once, and in an infinite loop) using an entrypoint script, but were unsuccessful and could not reach Jolokia even after that.

Please let us know if you have any ideas on this. We can push the code to the PR for you to take a look if you think that's a good idea.
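
For illustration, the kind of entrypoint we have been experimenting with looks roughly like this (the paths, master URL, example class, and jar name are assumptions, not the actual code in this PR):

```shell
#!/bin/bash
# Hypothetical entrypoint sketch: start the Spark main/worker daemons and keep
# re-submitting a sample application so that driver/executor metrics stay
# available while the system tests run.
/usr/local/spark/sbin/start-master.sh
/usr/local/spark/sbin/start-worker.sh spark://localhost:7077

while true; do
  /usr/local/spark/bin/spark-submit \
    --master spark://localhost:7077 \
    --class org.apache.spark.examples.SparkPi \
    /usr/local/spark/examples/jars/spark-examples_*.jar 1000 || true
  sleep 5
done
```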

Contributor

@mtojek mtojek left a comment

Yug, I did a partial review, as some comments apply to many places. Please revisit the entire PR based on those.

export SPARK_MASTER_OPTS="$SPARK_MASTER_OPTS -javaagent:/usr/share/java/jolokia-agent.jar=config=/usr/local/spark/conf/jolokia-master.properties"
```

Now, create `/usr/local/spark/conf/jolokia-master.properties` file with following content:
Contributor

Please use the words main/worker instead of master/worker everywhere.

Contributor Author

Are you suggesting to use main/worker everywhere in the README or everywhere else (like the field names)?

Contributor

Everywhere it can be applied.


Contributor Author

Got it. Thanks for passing the links, I'll make the changes.

wget -O jolokia-agent.jar http://search.maven.org/remotecontent?filepath=org/jolokia/jolokia-jvm/1.3.6/jolokia-jvm-1.3.6-agent.jar
```

As far, as Jolokia JVM Agent is downloaded, we should configure Apache Spark, to use it as JavaAgent and expose metrics via HTTP/Json. Edit spark-env.sh. It should be in `/usr/local/spark/conf` and add following parameters (Assuming that spark install folder is /usr/local/spark, if not change the path to one on which Spark is installed):
Contributor

nit: HTTP/JSON

wget -O jolokia-agent.jar http://search.maven.org/remotecontent?filepath=org/jolokia/jolokia-jvm/1.3.6/jolokia-jvm-1.3.6-agent.jar
```

As far, as Jolokia JVM Agent is downloaded, we should configure Apache Spark, to use it as JavaAgent and expose metrics via HTTP/Json. Edit spark-env.sh. It should be in `/usr/local/spark/conf` and add following parameters (Assuming that spark install folder is /usr/local/spark, if not change the path to one on which Spark is installed):
Contributor

/usr/local/spark

```

Now we need to create /usr/local/spark/conf/jolokia.policy with following content:
```
Contributor

I'm wondering if you can use ```xml to enable code coloring. Could you please check if it works?
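
For example, a minimal Jolokia access policy in an xml-fenced block would render like this (the content is illustrative only, not the exact policy from this PR):

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Illustrative Jolokia access policy: allow read-only access over HTTP GET/POST. -->
<restrict>
  <http>
    <method>get</method>
    <method>post</method>
  </http>
  <commands>
    <command>read</command>
  </commands>
</restrict>
```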

USER root
RUN \
apt-get update && apt-get install -y \
wget
Contributor

Is curl present in the image? Maybe we don't need to install wget :)
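
For example, if curl is already present in the base image, the download step could be reduced to something like this (the target path matches the -javaagent path used earlier in the README; the exact layering is an assumption):

```dockerfile
# Hypothetical sketch: fetch the Jolokia JVM agent with curl instead of installing wget.
USER root
RUN curl -L -o /usr/share/java/jolokia-agent.jar \
    "http://search.maven.org/remotecontent?filepath=org/jolokia/jolokia-jvm/1.3.6/jolokia-jvm-1.3.6-agent.jar"
```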

Contributor Author

Good idea, thank you!

"version": "8.0.0"
},
"apache_spark": {
"metrics": {
Contributor

Not updated event?

Contributor Author

Yes, this sample event is old because the system tests are only complete for the nodes data stream as of now. The sample event will be updated along with those tests.

type: group
release: beta
fields:
- name: executor
Contributor

Same here: the data stream name is inconsistent with this field.

type: long
- name: cpu_time
type: long
- name: deserialize
Contributor

Do you think we should skip these metrics?

Contributor Author

I agree, but we followed the PRD, which mentions collecting gauge and counter metrics:

Spark provides the following types of metrics - Gauge, Counter, Histogram, Meter, Timer. The most common types of metrics used in Spark instrumentation are gauges and counters. Hence, we will support only Gauge and Counter metrics from the following providers

How should we take it forward?
CC: @akshay-saraswat

Contributor

@mtojek mtojek Mar 29, 2022

Ok, I will pass the final decision to @akshay-saraswat.

BTW what you quoted here is the overall idea to collect different TYPES of metrics, not ALL metrics.

Contributor

@yug-elastic I believe both deserialize and cpu_time are timer metrics. They are neither counter nor gauge. Please correct me if I am wrong. If not, discard them, please.

Contributor

Oh, got it. Maybe this line in the Apache doc confused you - "namespace=executor (metrics are of type counter or gauge)". They are exposing it as a count but I don't think this is a count. It's time in seconds in my opinion.

Contributor Author

Right, thanks for the clarification!

type: long
- name: generated_method_size
type: long
- name: hive_client_calls
Contributor

Do we need this metric?

Contributor Author

Same answer as this comment.

Contributor

@akshay-saraswat akshay-saraswat Mar 30, 2022

Yes, I believe our competitors collect it. Let's keep it.

"version": "8.0.0"
},
"apache_spark": {
"metrics": {
Contributor

Same, outdated event

Contributor Author

Yes, this sample event is old because the system tests are only complete for the nodes data stream as of now. The sample event will be updated along with those tests.

@mtojek mtojek requested a review from ruflin March 29, 2022 08:42
@akshay-saraswat
Contributor

Which provider or logs are you planning to use for the Cluster Manager metrics?

We haven't actually analyzed the Cluster Manager as of now. Also, it wasn't a part of the documentation which was mentioned in the PRD.

I think the Cluster Manager is just a component in the architecture, and you get the associated metrics via the Master or ApplicationMaster, if I am not mistaken. Feel free to correct my understanding if you find anything different.

@ruflin
Collaborator

ruflin commented Mar 30, 2022

This PR would strongly benefit from being split up into multiple PRs: one for the foundation of Spark plus one data stream, and then an additional PR for each remaining data stream. As the comments in this PR are important and should not be lost, I recommend switching this PR to draft, opening a fresh PR, copying over the parts needed for the first foundation PR, and referencing this PR there. Then copy over each data stream directory to its own PR as soon as the first PR is merged. This should make it possible to move forward much more quickly and with more focus.

@yug-rajani
Contributor Author

yug-rajani commented Mar 30, 2022

This PR would strongly benefit from being split up into multiple PRs: one for the foundation of Spark plus one data stream, and then an additional PR for each remaining data stream. As the comments in this PR are important and should not be lost, I recommend switching this PR to draft, opening a fresh PR, copying over the parts needed for the first foundation PR, and referencing this PR there. Then copy over each data stream directory to its own PR as soon as the first PR is merged. This should make it possible to move forward much more quickly and with more focus.

Thanks @ruflin, it's a good idea. Here are the links for the PRs that this PR is split into:
#2939
#2941
#2945
#2943
#3020 (follow-up PR for visualizations)

@mtojek JFYI, the change suggested in #2811 (comment) regarding dropping/not collecting some fields because they are too detailed is left to be covered in the above PRs. We'll update them soon.

@yug-rajani yug-rajani marked this pull request as draft March 30, 2022 19:18
@ruflin
Collaborator

ruflin commented Mar 31, 2022

Thanks for the split-up, @yug-elastic. Which of the PRs should we look at in detail first? We need one PR to get merged first and then follow up with the other three; this ensures we don't review some of the content twice.

Update: Looks like #2939 is the one?

@yug-rajani
Contributor Author

Update: Looks like #2939 is the one?

Yes, #2939 is the one. Thanks for the review, @ruflin! We'll address the review comments soon.

@@ -0,0 +1,96 @@
# Apache Spark

The Apache Spark integration collects and parses data using the Jolokia Metricbeat Module.
Contributor

@akshay-saraswat akshay-saraswat Apr 27, 2022

nit: Would stating 'Metricbeat module' confuse users into thinking that a separate Metricbeat module is required for this integration? Should we instead say 'Jolokia input'? It's implicit from the requirements that the Metricbeat module is not required, but it would be nice to make it explicit if you receive more comments to address and update this PR. Otherwise, this PR looks good to me. Don't update just for this nitpick.

Contributor Author

Thanks for the approval, @akshay-saraswat! This is the old PR, which was split into data-stream-specific PRs as discussed. The other PRs have been approved and some of them are already merged. We do have one open PR (#3070) which is approved by Jaime Soriano Pastor and is waiting for approval from CODEOWNERS. If "Jolokia input" sounds more intuitive to you, we'll update that PR with that change; it's not a problem.


@yug-rajani
Contributor Author

Closing this PR as it was split up into multiple PRs as discussed in the comment #2811 (comment). All the parts are now merged and the linked issue (#493) has been closed.

Thanks a lot @mtojek, @ruflin, @jsoriano, @akshay-saraswat and @lalit-satapathy for taking out time to review the PRs and providing valuable feedback!

@yug-rajani yug-rajani closed this May 9, 2022