Add a GRPC listener service for Agent #18827

Merged
merged 8 commits into from
Jun 1, 2020

blakerouse
Contributor

@blakerouse blakerouse commented May 28, 2020

What does this PR do?

Adds a GRPC server implementation to the Elastic Agent. This is only the implementation; the server is not yet used by the Elastic Agent (that is coming in a later PR).

The GRPC server maintains the currently reported status of an application (connected or not connected), pushes config updates to the application, and informs the application when to stop. The server includes a watchdog that ensures the application checks in every 30 seconds. After the first missed window the application is marked degraded, and after a second missed window (60 seconds in total) it is marked failed. Currently nothing is done at that point; a follow-up PR will add the kill/restart logic.

Actions are also handled by the GRPC server implementation, even across connections and disconnections, including timeouts of operations. An action can time out or be cancelled depending on the application state in the GRPC server.

Usage:

type StubHandler struct{}

func (h *StubHandler) OnStatusChange(as *ApplicationState, status proto.StateObserved_Status, message string) {
	// handle status changes
}

srv, _ := server.New(logger, ":6890", &StubHandler{})
_ = srv.Start()

app := application.New(...)
as, _ := srv.Register(app)

as.UpdateConfig("new_config")

resp, err := as.PerformAction("name", map[string]interface{}{}, 30 * time.Second)  // 30 seconds to perform action

as.Stop(30 * time.Second) // 30 seconds to stop

as.Destroy()  // Remove application from server, prevent application from re-connect, and do signal stop

Why is it important?

This is needed because the contract between the Elastic Agent and the spawned applications has flipped: the applications now connect back to the Agent. Support for stopping applications and performing actions on them was also required; this PR adds those building blocks.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • Unit tests pass with data race checking go test -race github.com/elastic/beats/v7/x-pack/elastic-agent/pkg/core/server

How to test this PR locally

go test -race github.com/elastic/beats/v7/x-pack/elastic-agent/pkg/core/server

Related issues

@elasticmachine
Collaborator

Pinging @elastic/ingest-management (Team:Ingest Management)

@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels May 28, 2020
@blakerouse blakerouse requested a review from a team May 28, 2020 18:49
@ph
Contributor

ph commented May 28, 2020

@graphaelli I believe you are using gRPC in apm-server, we are planning to update the vendor version on our side. Not sure if that has an impact on you or not?

@elasticmachine
Collaborator

elasticmachine commented May 28, 2020

💔 Tests Failed

Build stats

  • Build Cause: [Pull request #18827 updated]

  • Start Time: 2020-06-01T18:45:18.158+0000

  • Duration: 72 min 14 sec

Test stats 🧪

Test Results
Failed 1
Passed 8851
Skipped 1450
Total 10302

Test errors

  • Name: Build and Test / Metricbeat OSS Unit tests / test_process – test_system.Test

    • Age: 1
    • Duration: 1.656
    • Error Details: False is not true : fd not found in any process events

Steps errors

  • Name: Mage build unitTest

    • Description: mage build unitTest

    • Duration: 11 min 48 sec

    • Start Time: 2020-06-01T19:08:59.993+0000

    • log

  • Name: Mage goIntegTest

    • Description: mage goIntegTest

    • Duration: 1 min 8 sec

    • Start Time: 2020-06-01T19:08:28.724+0000

    • log

  • Name: Report to Codecov

    • Description: curl -sSLo codecov https://codecov.io/bash for i in auditbeat filebeat heartbeat libbeat metricbeat packetbeat winlogbeat journalbeat do FILE="${i}/build/coverage/full.cov" if [ -f "${FILE}" ]; then bash codecov -f "${FILE}" fi done

    • Duration: 2 min 22 sec

    • Start Time: 2020-06-01T19:45:11.948+0000

    • log

  • Name: Report to Codecov

    • Description: curl -sSLo codecov https://codecov.io/bash for i in auditbeat filebeat heartbeat libbeat metricbeat packetbeat winlogbeat journalbeat do FILE="${i}/build/coverage/full.cov" if [ -f "${FILE}" ]; then bash codecov -f "${FILE}" fi done

    • Duration: 1 min 27 sec

    • Start Time: 2020-06-01T19:48:32.253+0000

    • log

Log output

[2020-06-01T19:57:04.199Z] + [ -f heartbeat/build/coverage/full.cov ]
[2020-06-01T19:57:04.199Z] + FILE=libbeat/build/coverage/full.cov
[2020-06-01T19:57:04.199Z] + [ -f libbeat/build/coverage/full.cov ]
[2020-06-01T19:57:04.199Z] + FILE=metricbeat/build/coverage/full.cov
[2020-06-01T19:57:04.199Z] + [ -f metricbeat/build/coverage/full.cov ]
[2020-06-01T19:57:04.199Z] + FILE=packetbeat/build/coverage/full.cov
[2020-06-01T19:57:04.199Z] + [ -f packetbeat/build/coverage/full.cov ]
[2020-06-01T19:57:04.199Z] + FILE=winlogbeat/build/coverage/full.cov
[2020-06-01T19:57:04.199Z] + [ -f winlogbeat/build/coverage/full.cov ]
[2020-06-01T19:57:04.199Z] + FILE=journalbeat/build/coverage/full.cov
[2020-06-01T19:57:04.199Z] + [ -f journalbeat/build/coverage/full.cov ]
[2020-06-01T19:57:05.676Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats
[2020-06-01T19:57:05.999Z] + find . -type f -name TEST*.xml -path */build/* -delete
[2020-06-01T19:57:06.013Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Lint
[2020-06-01T19:57:06.092Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Metricbeat-OSS-Integration-tests
[2020-06-01T19:57:06.164Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Elastic-Agent-Mac-OS-X
[2020-06-01T19:57:06.236Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Elastic-Agent-x-pack
[2020-06-01T19:57:06.311Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Winlogbeat-oss
[2020-06-01T19:57:06.388Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Auditbeat-crosscompile
[2020-06-01T19:57:06.463Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Auditbeat-oss-Mac-OS-X
[2020-06-01T19:57:06.544Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Auditbeat-x-pack-Mac-OS-X
[2020-06-01T19:57:06.622Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Dockerlogbeat
[2020-06-01T19:57:06.704Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Journalbeat-oss
[2020-06-01T19:57:06.787Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Generators-Metricbeat-Linux
[2020-06-01T19:57:06.864Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Filebeat-Mac-OS-X
[2020-06-01T19:57:06.937Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Functionbeat-x-pack
[2020-06-01T19:57:07.014Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Elastic-Agent-x-pack-Windows
[2020-06-01T19:57:07.091Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Filebeat-x-pack-Mac-OS-X
[2020-06-01T19:57:07.175Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Metricbeat-Mac-OS-X
[2020-06-01T19:57:07.247Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Metricbeat-x-pack-Mac-OS-X
[2020-06-01T19:57:07.323Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Metricbeat-OSS-Unit-tests
[2020-06-01T19:57:07.393Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Heartbeat-oss
[2020-06-01T19:57:07.465Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Auditbeat-x-pack
[2020-06-01T19:57:07.541Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Auditbeat-oss-Windows
[2020-06-01T19:57:07.615Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Metricbeat-crosscompile
[2020-06-01T19:57:07.691Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Functionbeat-Mac-OS-X-x-pack
[2020-06-01T19:57:07.763Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Auditbeat-x-pack-Windows
[2020-06-01T19:57:07.835Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Filebeat-x-pack-Windows
[2020-06-01T19:57:07.905Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Winlogbeat-Windows-x-pack
[2020-06-01T19:57:07.976Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Heartbeat-Mac-OS-X
[2020-06-01T19:57:08.049Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Auditbeat-oss-Linux
[2020-06-01T19:57:08.118Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Packetbeat-oss
[2020-06-01T19:57:08.193Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Libbeat-x-pack
[2020-06-01T19:57:08.276Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Filebeat-Windows
[2020-06-01T19:57:08.351Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Metricbeat-Windows
[2020-06-01T19:57:08.422Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Winlogbeat-Windows
[2020-06-01T19:57:08.505Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Metricbeat-x-pack-Windows
[2020-06-01T19:57:08.579Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Filebeat-x-pack
[2020-06-01T19:57:08.651Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Generators-Beat-Linux
[2020-06-01T19:57:08.724Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Metricbeat-x-pack
[2020-06-01T19:57:08.802Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Filebeat-oss
[2020-06-01T19:57:08.873Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Heartbeat-Windows
[2020-06-01T19:57:08.951Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Functionbeat-Windows
[2020-06-01T19:57:09.022Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Metricbeat-Python-integration-tests
[2020-06-01T19:57:09.093Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Generators-Metricbeat-Mac-OS-X
[2020-06-01T19:57:09.162Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Libbeat-oss
[2020-06-01T19:57:09.231Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Generators-Beat-Mac-OS-X
[2020-06-01T19:57:09.301Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Libbeat-crosscompile
[2020-06-01T19:57:09.369Z] Running in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats/Libbeat-stress-tests
[2020-06-01T19:57:09.727Z] + cat
[2020-06-01T19:57:09.727Z] + /usr/local/bin/runbld ./runbld-script
[2020-06-01T19:57:09.727Z] Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
[2020-06-01T19:57:16.330Z] runbld>>> runbld started
[2020-06-01T19:57:16.330Z] runbld>>> 1.6.11/a66728ff8f4356963772e6e6d2069392fa06acbe
[2020-06-01T19:57:18.248Z] runbld>>> The following profiles matched the job 'Beats/beats-beats-mbp/PR-18827' in order of occurrence in the config (last value wins).
[2020-06-01T19:57:19.632Z] runbld>>> Debug logging enabled.
[2020-06-01T19:57:19.632Z] runbld>>> Storing result
[2020-06-01T19:57:19.632Z] runbld>>> Store result: created {:total 2, :successful 2, :failed 0} 1
[2020-06-01T19:57:19.632Z] runbld>>> BUILD: https://c150076387b5421f9154dfbf536e5c60.us-west1.gcp.cloud.es.io:9243/build-1587637540455/t/20200601195719-73A36373
[2020-06-01T19:57:19.632Z] runbld>>> Adding system facts.
[2020-06-01T19:57:20.576Z] runbld>>> Adding vcs info for the latest commit:  c699fae0caa204fffcbdde607ef5595467e15eb5
[2020-06-01T19:57:20.837Z] runbld>>> >>>>>>>>>>>> SCRIPT EXECUTION BEGIN >>>>>>>>>>>>
[2020-06-01T19:57:20.837Z] runbld>>> Adding /usr/lib/jvm/java-8-openjdk-amd64/bin to the path.
[2020-06-01T19:57:20.837Z] + echo 'Processing JUnit reports with runbld...'
[2020-06-01T19:57:20.837Z] Processing JUnit reports with runbld...
[2020-06-01T19:57:21.410Z] runbld>>> <<<<<<<<<<<< SCRIPT EXECUTION END <<<<<<<<<<<<
[2020-06-01T19:57:21.410Z] runbld>>> DURATION: 26ms
[2020-06-01T19:57:21.410Z] runbld>>> STDOUT: 40 bytes
[2020-06-01T19:57:21.410Z] runbld>>> STDERR: 49 bytes
[2020-06-01T19:57:21.410Z] runbld>>> WRAPPED PROCESS: SUCCESS (0)
[2020-06-01T19:57:21.410Z] runbld>>> Searching for build metadata in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats
[2020-06-01T19:57:22.797Z] runbld>>> Storing build metadata: 
[2020-06-01T19:57:22.797Z] runbld>>> Adding test report.
[2020-06-01T19:57:22.797Z] runbld>>> Searching for junit test output files with the pattern: TEST-.*\.xml$ in: /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827/src/github.com/elastic/beats
[2020-06-01T19:57:23.742Z] runbld>>> Found 57 test output files
[2020-06-01T19:57:25.666Z] runbld>>> Test output logs contained: Errors: 0 Failures: 1 Tests: 10152 Skipped: 1335
[2020-06-01T19:57:25.929Z] runbld>>> Storing result
[2020-06-01T19:57:25.929Z] runbld>>> FAILURES: 1
[2020-06-01T19:57:26.192Z] runbld>>> Store result: updated {:total 2, :successful 2, :failed 0} 2
[2020-06-01T19:57:26.192Z] runbld>>> BUILD: https://c150076387b5421f9154dfbf536e5c60.us-west1.gcp.cloud.es.io:9243/build-1587637540455/t/20200601195719-73A36373
[2020-06-01T19:57:26.192Z] runbld>>> Email notification disabled by environment variable.
[2020-06-01T19:57:26.192Z] runbld>>> Slack notification disabled by environment variable.
[2020-06-01T19:57:31.701Z] Running on Jenkins in /var/lib/jenkins/workspace/Beats_beats-beats-mbp_PR-18827
[2020-06-01T19:57:31.802Z] [INFO] getVaultSecret: Getting secrets
[2020-06-01T19:57:31.853Z] Masking supported pattern matches of $VAULT_ADDR or $VAULT_ROLE_ID or $VAULT_SECRET_ID
[2020-06-01T19:57:32.494Z] + chmod 755 generate-build-data.sh
[2020-06-01T19:57:32.494Z] + ./generate-build-data.sh https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats-beats-mbp/PR-18827/ https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats-beats-mbp/PR-18827/runs/5 FAILURE 4334075
[2020-06-01T19:57:32.494Z] INFO: curl https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats-beats-mbp/PR-18827/runs/5/steps/?limit=10000 -o steps-info.json
[2020-06-01T19:57:33.837Z] INFO: curl https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats-beats-mbp/PR-18827/runs/5/tests/?status=FAILED -o tests-errors.json
[2020-06-01T19:57:34.388Z] INFO: curl https://beats-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/Beats/beats-beats-mbp/PR-18827/runs/5/log/ -o pipeline-log.txt

@blakerouse blakerouse changed the title Agent grpc server Add a GRPC listener service for Agent May 28, 2020
@blakerouse blakerouse mentioned this pull request May 28, 2020
5 tasks
@blakerouse
Contributor Author

This is actually dependent on #18829, because of the usage of tls.ClientHelloInfo.SupportsCertificate.

@graphaelli
Member

Thanks for the heads up, broadening to @elastic/apm-server

@axw
Member

axw commented May 29, 2020

Thanks for the heads up @ph. I just tested updating apm-server's grpc to v1.29.1, and it appears to be fine.

@michalpristas michalpristas self-requested a review May 29, 2020 06:38
Contributor

@michalpristas michalpristas left a comment

Lot of code, but I like it. Just a few small questions along the way.

statusMessage string
statusConfigIdx uint64
statusTime time.Time
checkinConn bool
Contributor

Could you think about a different name, please? I imagine a connection itself under this name. Something like isCheckingConnected, hasCheckingConnection...


pendingActions chan *pendingAction
sentActions map[string]*sentAction
actionsConn bool
Contributor

same as ^^

sentActions: make(map[string]*sentAction),
actionsConn: true,
}
s.lock.Lock()
Contributor

Can it be that another set happens in between the get and the set? In the agent we do things synchronously, so it should be fine.

func (s *Server) Checkin(server proto.ElasticAgent_CheckinServer) error {
firstCheckinChan := make(chan *proto.StateObserved)
go func() {
// go func will not be leaked, because when the main function
Contributor

👍 for the comments about when the goroutine is done

if err != nil {
// failed to send action; add back to channel to retry on re-connect from the client
appState.actionsLock.Unlock()
appState.pendingActions <- pending
Contributor

food for thought: can out of order application of actions be an issue? [not a blocker]

Contributor

I don't see why they would be out of order; don't we append them to the pendingAction channel?

Contributor

You have A1, A2, A3.
A1 fails, so it is sent back to the pendingActions channel, which is buffered up to 100 items. Then you proceed with A2 and A3, and after that A1 comes out of the channel again.
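
The reordering described here can be reproduced with a small sketch. It is only an illustration under simplifying assumptions: `deliveryOrder` and `failOnce` are hypothetical names, and a failed send is modelled as a flag that is cleared after the first attempt.

```go
package main

import "fmt"

// deliveryOrder sends every pending action. An action whose send fails is
// put back on the buffered channel and retried after the actions that were
// already queued behind it.
func deliveryOrder(actions []string, failOnce map[string]bool) []string {
	pending := make(chan string, 100) // buffered, like pendingActions
	for _, a := range actions {
		pending <- a
	}
	var delivered []string
	for len(pending) > 0 {
		a := <-pending
		if failOnce[a] {
			failOnce[a] = false // the retry will succeed
			pending <- a        // requeue at the back of the channel
			continue
		}
		delivered = append(delivered, a)
	}
	return delivered
}

func main() {
	order := deliveryOrder([]string{"A1", "A2", "A3"}, map[string]bool{"A1": true})
	fmt.Println(order) // prints [A2 A3 A1]
}
```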

Contributor Author

@michalpristas is correct on the ordering. But it's not an issue, because the actions on the client side do not block reading from the stream.

So even if it goes A2, A3, A1 all 3 will be executed at the same time. Now on Agent side the PerformAction is blocking, even though the communication is not.

So the order of actions is still serial on Agent side:

PerformAction("action1") // blocks waiting for the response
PerformAction("action2") // this won't even be added to the channel until action1 completes, fails, or times out
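
A minimal model of this argument, assuming a hypothetical `agent` type with a blocking `performAction` and an application that handles every action in its own goroutine (none of these names come from the PR):

```go
package main

import (
	"fmt"
	"sync"
)

// action is a hypothetical stand-in for an action sent to the application.
type action struct {
	name string
	done chan struct{} // closed once the application has handled the action
}

// agent is a much simplified model of the Agent side.
type agent struct {
	pending chan action

	mu        sync.Mutex
	completed []string // order in which the application handled actions
}

// performAction queues the action and blocks until the application replies,
// so the next action cannot be queued before this one has finished.
func (a *agent) performAction(name string) {
	act := action{name: name, done: make(chan struct{})}
	a.pending <- act
	<-act.done
}

// runApplication models the client: reading from the stream never blocks on
// execution, because every action is handled in its own goroutine.
func (a *agent) runApplication() {
	for act := range a.pending {
		go func(act action) {
			a.mu.Lock()
			a.completed = append(a.completed, act.name)
			a.mu.Unlock()
			close(act.done)
		}(act)
	}
}

// runSerial performs the named actions one after another and returns the
// order in which the application handled them.
func runSerial(names []string) []string {
	a := &agent{pending: make(chan action, 100)}
	go a.runApplication()
	for _, name := range names {
		a.performAction(name)
	}
	close(a.pending)

	a.mu.Lock()
	defer a.mu.Unlock()
	return append([]string(nil), a.completed...)
}

func main() {
	fmt.Println(runSerial([]string{"action1", "action2", "action3"}))
}
```

Because the caller blocks on each action, the handled order matches the call order even though the application side is fully concurrent.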

Contributor

Thanks for the explanation, 👍

Contributor

@ph ph left a comment

LGTM, added a few comments but testing looks good.

@@ -434,6 +440,7 @@ def detect_license_summary(content):
"MPL-2.0",
"UPL-1.0",
"ISC",
"ELASTIC",
]
SKIP_NOTICE = []

Contributor

👍

ConfigStateIdx: as.statusConfigIdx, // stopping always inform that the config it has is correct
Config: "",
}
} else if checkin.ConfigStateIdx != as.expectedConfigIdx {
Contributor

Should this check also be covered by the lock above, or should we defer the unlock of the struct?

Contributor Author

I have added it into the lock as well, thanks for pointing that out.
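
For illustration, keeping the comparison under the same lock with a deferred unlock could look roughly like this. The `appState` type, its fields, and both methods are hypothetical and only loosely modelled on the snippet under review.

```go
package main

import (
	"fmt"
	"sync"
)

// appState is a hypothetical, reduced version of the per-application state.
type appState struct {
	mu                sync.Mutex
	expectedConfigIdx uint64
	expectedConfig    string
}

// updateConfig stores a new config and bumps the expected index.
func (as *appState) updateConfig(cfg string) {
	as.mu.Lock()
	defer as.mu.Unlock()
	as.expectedConfigIdx++
	as.expectedConfig = cfg
}

// configFor compares the index reported by the application with the expected
// one. The comparison and the read of the config happen under one lock, so a
// concurrent updateConfig cannot slip in between them.
func (as *appState) configFor(observedIdx uint64) (idx uint64, cfg string, changed bool) {
	as.mu.Lock()
	defer as.mu.Unlock()
	if observedIdx != as.expectedConfigIdx {
		return as.expectedConfigIdx, as.expectedConfig, true
	}
	return as.expectedConfigIdx, "", false
}

func main() {
	as := &appState{}
	as.updateConfig("new_config")
	fmt.Println(as.configFor(0)) // application still reports index 0
	fmt.Println(as.configFor(1)) // application is up to date
}
```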

s := prevStatus
prevMessage := serverApp.statusMessage
message := prevMessage
if serverApp.status == proto.StateObserved_DEGRADED {
Contributor

love that a lot.

Contributor

Will we ever try to restart a process if the watchdog doesn't hear from the client for an extended period of time? I am curious what actions would be required to recover from that state. (Can be a follow-up.)

Contributor

I think we can add a hook here and decide later. With forking we have a watchdog for killing the process, so it could also handle this callback.

Contributor Author

Yes, that is what the OnStatusChange handler, which is given when the server is started, is for. We will add the logic in that callback to handle when an application is marked FAILED.
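
A possible shape for such a callback, sketched with stand-in types so it compiles on its own. `Status`, `ApplicationState`, `restartHandler`, and the `restart` hook are all assumptions for illustration; the real handler receives `proto.StateObserved_Status`, and the restart logic is left to the follow-up PR.

```go
package main

import "fmt"

// Status is a hypothetical stand-in for proto.StateObserved_Status.
type Status int

const (
	Healthy Status = iota
	Degraded
	Failed
)

// ApplicationState is a hypothetical stand-in for the server's per-application state.
type ApplicationState struct {
	Name string
}

// restartHandler reacts to status changes. The restart hook is hypothetical;
// it represents whatever kill/restart logic a follow-up would add.
type restartHandler struct {
	restart func(name string)
}

// OnStatusChange has the same shape as the handler in the usage example:
// it only acts when an application is marked failed.
func (h *restartHandler) OnStatusChange(as *ApplicationState, status Status, message string) {
	if status == Failed {
		h.restart(as.Name)
	}
}

func main() {
	h := &restartHandler{restart: func(name string) {
		fmt.Println("restarting", name)
	}}
	app := &ApplicationState{Name: "example-app"}
	h.OnStatusChange(app, Degraded, "missed one check-in")  // ignored
	h.OnStatusChange(app, Failed, "missed two check-ins") // triggers the hook
}
```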

@blakerouse
Contributor Author

Removing the requirement for go 1.14 requires elastic/elastic-agent-client#9 to land so I can vendor it into this PR.

Contributor

@ph ph left a comment

LGTM

@blakerouse blakerouse force-pushed the agent-grpc-server branch from a56aa73 to 320259c Compare June 1, 2020 16:23
@blakerouse blakerouse merged commit 6e91ce4 into elastic:master Jun 1, 2020
@blakerouse blakerouse deleted the agent-grpc-server branch June 1, 2020 20:35
blakerouse added a commit to blakerouse/beats that referenced this pull request Jun 1, 2020
* Work on the GRPC server for agent.

* Lots of testing.

* Fix data races.

* Add support for elastic license in generate_notice.py.

* Update to generate server name unique per application.

* Fix go vet on stackdriver metricset using latest protobuf.

* Fix data race issue.

* Fix tests.

(cherry picked from commit 6e91ce4)
v1v added a commit to v1v/beats that referenced this pull request Jun 2, 2020
…-stage-level

* upstream/master: (30 commits)
  Add a GRPC listener service for Agent (elastic#18827)
  Disable host.* fields by default for iptables module (elastic#18756)
  [WIP] Clarify capabilities of the Filebeat auditd module (elastic#17068)
  fix: rename file and remove extra separator (elastic#18881)
  ci: enable JJBB (elastic#18812)
  Disable host.* fields by default for Checkpoint module (elastic#18754)
  Disable host.* fields by default for Cisco module (elastic#18753)
  Update latest.yml testing env to 7.7.0 (elastic#18535)
  Upgrade k8s.io/client-go and k8s keystore tests (elastic#18817)
  Add missing Jenkins stages for Auditbeat (elastic#18835)
  [Elastic Log Driver] Create a config shim between libbeat and the user (elastic#18605)
  Use indexers and matchers in config when defaults are enabled (elastic#18818)
  Fix panic on `metricbeat test modules` (elastic#18797)
  [CI] Fix permissions in MacOSX agents (elastic#18847)
  [Ingest Manager] When not port are specified and the https is used fallback to 443 (elastic#18844)
  [Ingest Manager] Fix install service script for windows (elastic#18814)
  [Metricbeat] Fix getting compute instance metadata with partial zone/region config (elastic#18757)
  Improve error messages in s3 input (elastic#18824)
  Add memory metrics into compute googlecloud (elastic#18802)
  include bucket name when logging error (elastic#18679)
  ...
blakerouse added a commit that referenced this pull request Jun 2, 2020
* Work on the GRPC server for agent.

* Lots of testing.

* Fix data races.

* Add support for elastic license in generate_notice.py.

* Update to generate server name unique per application.

* Fix go vet on stackdriver metricset using latest protobuf.

* Fix data race issue.

* Fix tests.

(cherry picked from commit 6e91ce4)

6 participants