
[apache_spark] Add Apache Spark package #2811

Closed
wants to merge 18 commits

Conversation

Contributor

@yug-rajani yug-rajani commented Mar 10, 2022

What does this PR do?

  • Generated the skeleton of the Apache Spark integration package.
  • Added 4 data streams (Nodes, Driver, Executors, Applications).
  • Added the data collection logic.
  • Added the ingest pipelines.
  • Mapped fields according to the ECS schema and added field metadata in the appropriate YAML files.
  • Added dashboards and visualizations.
  • Added system test cases.

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • If I'm introducing a new feature, I have modified the Kibana version constraint in my package's manifest.yml file to point to the latest Elastic stack release (e.g. ^7.13.0).

How to test this PR locally

  • Clone the integrations repo.
  • Install elastic-package locally.
  • Start the Elastic Stack using elastic-package.
  • Move to the integrations/packages/apache_spark directory.
  • Run the following command to run the tests:

elastic-package test
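
For reference, the full sequence of the steps above looks roughly like this (a sketch; the `go install` step is one common way to install elastic-package and assumes Go is available):

```shell
# Clone the integrations repository.
git clone https://github.com/elastic/integrations.git

# Install elastic-package locally (requires Go).
go install github.com/elastic/elastic-package@latest

# Start a local Elastic Stack managed by elastic-package.
elastic-package stack up -d

# Run the package tests from the package directory.
cd integrations/packages/apache_spark
elastic-package test
```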

Screenshots

(Six dashboard and visualization screenshots attached.)

@yug-rajani yug-rajani requested a review from mtojek March 10, 2022 06:45
@elasticmachine

elasticmachine commented Mar 10, 2022

💔 Build Failed


Build stats

  • Start Time: 2022-03-31T08:05:05.756+0000

  • Duration: 10 min 39 sec

Steps errors: 2

Check integration: apache_spark
  • Took 0 min 0 sec
  • Description: ../../build/elastic-package check -v
Google Storage Download
  • Took 0 min 0 sec

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

@yug-rajani yug-rajani self-assigned this Mar 10, 2022
@yug-rajani yug-rajani added the enhancement, New Integration, and Team:Integrations labels Mar 10, 2022
@yug-rajani yug-rajani linked an issue Mar 10, 2022 that may be closed by this pull request
@mtojek mtojek requested review from a team and lalit-satapathy March 10, 2022 09:38
Collaborator

@lalit-satapathy lalit-satapathy left a comment

Could you populate README.md with all the details needed? It currently does not have any details about the integration.

Contributor

@mtojek mtojek left a comment

Did you take a look at the Spark project documentation? Metrics are grouped based on instances; there isn't one big bucket for all metrics.

policy_templates:
- name: apache_spark
title: Apache Spark metrics
description: Collect Apache Spark metrics
Contributor

Same problem as with Spring Boot: you need to divide metrics into logical groups.
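
For illustration, a grouping along these lines is one possibility (a rough sketch only; the data stream names are hypothetical and the field names are from memory of the package spec, so they may not be exact):

```yaml
policy_templates:
  - name: apache_spark
    title: Apache Spark metrics
    description: Collect Apache Spark metrics
    # One data stream per logical group instead of a single bucket.
    data_streams:
      - nodes
      - driver
      - executor
      - application
```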

Contributor Author

I hope we have addressed this in our previous comment.

size: 600x600
type: image/png
icons:
- src: /img/apache_spark-logo.svg
Contributor

I guess you can use the right logo?


- monitoring
release: beta
conditions:
kibana.version: ^7.16.2 || ^8.0.0
Contributor

Hm... I think you can go with ^8.0.0 only. We don't need to publish this package for old releases.

Contributor Author

The execution/sprint plan that we sent out mentioned ^7.16.0 and hence we kept this. But if you believe we should just keep ^8.0.0, we will update it.
CC: @akshay-saraswat

Contributor

8.0 sounds fine to me too. Let's reduce the supported versions and hence the need for backward compatibility as much as we can. :)

Contributor Author

Understood, thank you for the clarification. We will go with 8.0.

Contributor

Made a similar comment on the Nagios integration just now. Let's make it 8.2.0 instead; 8.0 is already released, and I am guessing we will not be able to publish this integration before 8.2.0.

Collaborator

++ on releasing new integrations only for the most recent version. I assume we don't test all the previous releases.

Contributor Author

Sure, we'll update it once it has been verified and tested on our end. We'll follow the same approach for future integrations as well.
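
For reference, the resulting condition in the package manifest.yml would be roughly the following (a sketch of the agreed change, not the exact diff):

```yaml
conditions:
  kibana.version: "^8.2.0"
```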


- version: "0.1.0"
changes:
- description: Initial draft of the package
Contributor

If you plan to merge this package with only this PR, it won't be a draft anymore :)

Contributor Author

Yes, that's in the plan for future commits.
Does this look good?
Apache Spark integration package with Driver, Executor, Master, Worker and ApplicationSource metrics


This is a new integration created using the [elastic-package](https://github.com/elastic/elastic-package) tool.

Consider using the README template file `_dev/build/docs/README.md` to generate a list of exported fields or include a sample event.
Contributor

What is required from the end user to set up the integration? What kind of changes need to be introduced to Spark?

Contributor Author

Yes, that's something we are still working on. We'll update the README with detailed steps of the changes required.

@yug-rajani
Contributor Author

Could you populate "README.md" with all details needed? It does not have any details of the integration currently.

@lalit-satapathy, yes, that's something we are still working on. This PR is still in progress, so expect system tests, dashboards, and the README in later commits.

@yug-rajani
Contributor Author

Did you take a look at the Spark project documentation? Metrics are grouped based on instances, there isn't one big bucket for all metrics.

@mtojek Yes, we did check out the documentation. However, we referred to the Hadoop integration, where we grouped multiple metrics into a single data stream with future extensibility in mind, and thought of using the same approach here for Apache Spark.

@yug-rajani yug-rajani marked this pull request as ready for review March 12, 2022 08:23
@yug-rajani yug-rajani requested a review from a team as a code owner March 12, 2022 08:23
@elasticmachine

Pinging @elastic/integrations (Team:Integrations)

@yug-rajani
Contributor Author

/test

Contributor

@mtojek mtojek left a comment

@mtojek Yes, we did check out the documentation. However, we referred to the Hadoop integration where we have clubbed multiple metrics in a single data stream keeping in mind the future extensibility and thought of using the same approach here for Apache Spark.

I'm afraid that isn't the approach we'd like to follow. We'd rather see metrics split into multiple areas, adding a new area when needed. We have to iterate on this. I will check the Hadoop integration, and if the same applies there, I guess we have to refactor that one as well.

@yug-rajani yug-rajani mentioned this pull request Mar 15, 2022
@yug-rajani
Contributor Author

/test

@yug-rajani
Contributor Author

Which provider or logs are you planning to use for the Cluster Manager metrics?

We haven't analyzed the Cluster Manager as of now. Also, it wasn't part of the documentation mentioned in the PRD.

@yug-rajani
Contributor Author

yug-rajani commented Mar 22, 2022

We are facing an issue with the system tests for the ApplicationSource, Driver, and Executor data streams, as we need to keep an application running in the container to collect the metrics. We are actively working on it and will update the PR as soon as it's done. Meanwhile, please feel free to review the other changes that are complete.

@yug-rajani
Contributor Author

Issue with system tests:
The system tests for the nodes data stream are working fine. However, for the driver and executors data streams, the metrics are only available while an application is running.

If we run a sample application (for example, WordCount) as part of the Dockerfile, it is treated as a build step and finishes once the application has run to completion, so the metrics are not found during the system tests. We also tried running the application in the background (once, and in an infinite loop) using an entrypoint script, but were unsuccessful and could not reach Jolokia even after that.

Please let us know if you have any ideas on this. We can push the code to the PR for you to take a look if you think that's a good idea.
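
For illustration, the kind of entrypoint we have been experimenting with looks roughly like this (the paths, master URL, example class, and jar name are assumptions, not the actual code in this PR):

```shell
#!/bin/bash
# Hypothetical entrypoint sketch: start the Spark main/worker daemons and keep
# re-submitting a sample application so that driver/executor metrics stay
# available while the system tests run.
/usr/local/spark/sbin/start-master.sh
/usr/local/spark/sbin/start-worker.sh spark://localhost:7077

while true; do
  /usr/local/spark/bin/spark-submit \
    --master spark://localhost:7077 \
    --class org.apache.spark.examples.SparkPi \
    /usr/local/spark/examples/jars/spark-examples_*.jar 1000 || true
  sleep 5
done
```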

Contributor

@mtojek mtojek left a comment

Yug, I did a partial review, as some comments apply to many places. Please revisit the entire PR based on those.

export SPARK_MASTER_OPTS="$SPARK_MASTER_OPTS -javaagent:/usr/share/java/jolokia-agent.jar=config=/usr/local/spark/conf/jolokia-master.properties"
```

Now, create `/usr/local/spark/conf/jolokia-master.properties` file with following content:
Contributor

Please use the words main/worker instead of master/worker everywhere.

Contributor Author

Are you suggesting to use main/worker everywhere in the README or everywhere else (like the field names)?

Contributor

Everywhere it can be applied.


Contributor Author

Got it. Thanks for passing the links, I'll make the changes.

wget -O jolokia-agent.jar http://search.maven.org/remotecontent?filepath=org/jolokia/jolokia-jvm/1.3.6/jolokia-jvm-1.3.6-agent.jar
```

As far, as Jolokia JVM Agent is downloaded, we should configure Apache Spark, to use it as JavaAgent and expose metrics via HTTP/Json. Edit spark-env.sh. It should be in `/usr/local/spark/conf` and add following parameters (Assuming that spark install folder is /usr/local/spark, if not change the path to one on which Spark is installed):
Contributor

nit: HTTP/JSON

wget -O jolokia-agent.jar http://search.maven.org/remotecontent?filepath=org/jolokia/jolokia-jvm/1.3.6/jolokia-jvm-1.3.6-agent.jar
```

As far, as Jolokia JVM Agent is downloaded, we should configure Apache Spark, to use it as JavaAgent and expose metrics via HTTP/Json. Edit spark-env.sh. It should be in `/usr/local/spark/conf` and add following parameters (Assuming that spark install folder is /usr/local/spark, if not change the path to one on which Spark is installed):
Contributor

/usr/local/spark

```

Now we need to create /usr/local/spark/conf/jolokia.policy with following content:
```
Contributor

I'm wondering if you can use ```xml to enable code coloring. Could you please check if it works?
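
For example, a minimal Jolokia access policy in an xml-fenced block would render like this (the content is illustrative only, not the exact policy from this PR):

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Illustrative Jolokia access policy: allow read-only access over HTTP GET/POST. -->
<restrict>
  <http>
    <method>get</method>
    <method>post</method>
  </http>
  <commands>
    <command>read</command>
  </commands>
</restrict>
```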

USER root
RUN \
apt-get update && apt-get install -y \
wget
Contributor

Is curl present in the image? Maybe we don't need to install wget :)
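
For example, if curl is already present in the base image, the download step could be reduced to something like this (the target path matches the -javaagent path used earlier in the README; the exact layering is an assumption):

```dockerfile
# Hypothetical sketch: fetch the Jolokia JVM agent with curl instead of installing wget.
USER root
RUN curl -L -o /usr/share/java/jolokia-agent.jar \
    "http://search.maven.org/remotecontent?filepath=org/jolokia/jolokia-jvm/1.3.6/jolokia-jvm-1.3.6-agent.jar"
```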

Contributor Author

Good idea, thank you!

"version": "8.0.0"
},
"apache_spark": {
"metrics": {
Contributor

Not updated event?

Contributor Author

Yes, this sample event is old because the system tests are only complete for the nodes data stream as of now. The sample event will be updated along with those tests.

type: group
release: beta
fields:
- name: executor
Contributor

Same here: the data stream name is inconsistent with this field.

type: long
- name: cpu_time
type: long
- name: deserialize
Contributor

Do you think we should skip these metrics?

Contributor Author

I agree, but we followed the PRD, which mentions collecting gauge and counter metrics:

Spark provides the following types of metrics - Gauge, Counter, Histogram, Meter, Timer. The most common types of metrics used in Spark instrumentation are gauges and counters. Hence, we will support only Gauge and Counter metrics from the following providers

How should we take it forward?
CC: @akshay-saraswat

Contributor

@mtojek mtojek Mar 29, 2022

Ok, I will pass the final decision to @akshay-saraswat.

BTW what you quoted here is the overall idea to collect different TYPES of metrics, not ALL metrics.

Contributor

@yug-elastic I believe both deserialize and cpu_time are timer metrics. They are neither counter nor gauge. Please correct me if I am wrong. If not, discard them, please.

Contributor

Oh, got it. Maybe this line in the Apache doc confused you - "namespace=executor (metrics are of type counter or gauge)". They are exposing it as a count but I don't think this is a count. It's time in seconds in my opinion.

Contributor Author

Right, thanks for the clarification!

type: long
- name: generated_method_size
type: long
- name: hive_client_calls
Contributor

Do we need this metric?

Contributor Author

Same answer as this comment.

Contributor

@akshay-saraswat akshay-saraswat Mar 30, 2022

Yes, I believe our competitors collect it. Let's keep it.

"version": "8.0.0"
},
"apache_spark": {
"metrics": {
Contributor

Same, outdated event

Contributor Author

Yes, this sample event is old because the system tests are only complete for the nodes data stream as of now. The sample event will be updated along with those tests.

@mtojek mtojek requested a review from ruflin March 29, 2022 08:42
@akshay-saraswat
Contributor

Which provider or logs are you planning to use for the Cluster Manager metrics?

We haven't actually analyzed the Cluster Manager as of now. Also, it wasn't a part of the documentation which was mentioned in the PRD.

I think the Cluster Manager is just a component in the architecture, and you get the associated metrics via the Master or ApplicationMaster, if I am not mistaken. Feel free to correct my understanding if you find anything different.

@ruflin
Collaborator

ruflin commented Mar 30, 2022

This PR would strongly benefit from being split up into multiple PRs: one for the foundation of Spark plus one data stream, and then an additional PR for each remaining data stream. As the comments in this PR are important and should not be lost, I recommend switching this PR to draft, opening a fresh PR, copying over the parts needed for the first foundation PR, and referencing this PR there. Then copy over each data stream directory to its own PR as soon as the first PR is merged. This should make it possible to move forward much more quickly and with more focus.

@yug-rajani
Contributor Author

yug-rajani commented Mar 30, 2022

This PR would strongly benefit from being split up into multiple PRs: one for the foundation of Spark plus one data stream, and then an additional PR for each remaining data stream. As the comments in this PR are important and should not be lost, I recommend switching this PR to draft, opening a fresh PR, copying over the parts needed for the first foundation PR, and referencing this PR there. Then copy over each data stream directory to its own PR as soon as the first PR is merged. This should make it possible to move forward much more quickly and with more focus.

Thanks @ruflin, it's a good idea. Here are the links for the PRs that this PR is split into:
#2939
#2941
#2945
#2943
#3020 (follow-up PR for visualizations)

@mtojek JFYI, the change suggested in #2811 (comment) regarding dropping/not collecting some fields because they are too detailed is left to be covered in the above PRs. We'll update them soon.

@yug-rajani yug-rajani marked this pull request as draft March 30, 2022 19:18
@ruflin
Collaborator

ruflin commented Mar 31, 2022

Thanks for the split-up, @yug-elastic. Which of the PRs should we look at in detail first? We need one PR to get merged first and then follow up with the other three; this ensures we don't review some of the content twice.

Update: Looks like #2939 is the one?

@yug-rajani
Contributor Author

Update: Looks like #2939 is the one?

Yes, #2939 is the one. Thanks for the review, @ruflin! We'll address the review comments soon.

@@ -0,0 +1,96 @@
# Apache Spark

The Apache Spark integration collects and parses data using the Jolokia Metricbeat Module.
Contributor

@akshay-saraswat akshay-saraswat Apr 27, 2022

nit: Would stating 'Metricbeat module' confuse users into thinking that a separate Metricbeat module is required for this integration? Should we instead say 'Jolokia input'? It's implicit from the requirements that the Metricbeat module is not required, but it would be nice to make it explicit if you receive more comments to address and update this PR. Otherwise, this PR looks good to me. Don't update just for this nitpick.

Contributor Author

Thanks for the approval, @akshay-saraswat! This is the old PR, which was split into data-stream-specific PRs as discussed. The other PRs have been approved and some of them are already merged. We do have one open PR (#3070) which is approved by Jaime Soriano Pastor and is waiting for approval from CODEOWNERS. If "Jolokia input" sounds more intuitive to you, we'll update that PR with that change; it's not a problem.


@yug-rajani
Contributor Author

Closing this PR as it was split up into multiple PRs as discussed in the comment #2811 (comment). All the parts are now merged and the linked issue (#493) has been closed.

Thanks a lot @mtojek, @ruflin, @jsoriano, @akshay-saraswat and @lalit-satapathy for taking out time to review the PRs and providing valuable feedback!

@yug-rajani yug-rajani closed this May 9, 2022