Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subclass VmScan job as ManageIQ::Providers::Amazon::CloudManager::Scanning::Job #581

Merged
merged 1 commit into from
Jan 6, 2020

Conversation

chessbyte
Copy link
Member

@chessbyte chessbyte commented Dec 12, 2019

State Machine diagram for this subclass of VmScan.

manageiq-providers-amazon-cloud_manager-scanning-job

@miq-bot
Copy link
Member

miq-bot commented Dec 12, 2019

Checked commit chessbyte@55f214d with ruby 2.5.5, rubocop 0.69.0, haml-lint 0.20.0, and yamllint 1.10.0
4 files checked, 0 offenses detected
Everything looks fine. 🏆

# * Then the job is marked as finished - and THAT is why the transition below is needed
#
# - In a different worker process, ManageIQ::Providers::Amazon::AgentCoordinatorWorker::Runner
# * checks if a ManageIQ agent is running in the Amazon AWS Cloud, and starts it, if needed
Copy link
Member Author

@chessbyte chessbyte Dec 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roliveri @hsong-rh @jerryk55 @Fryguy @agrare why not leave the job in the scanning state here? Then, in ManageIQ::Providers::Amazon::AgentCoordinatorWorker::Runner::ResponseThread#perform_metadata_sync, move to synchronizing state before going to finished state?

# * puts response into SQS queue
#
# - In a different worker process, ManageIQ::Providers::Amazon::AgentCoordinatorWorker::Runner starts a ResponseThread
# * The ResponseThread checks the SQS queue intermittently.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roliveri @hsong-rh @jerryk55 @Fryguy @agrare Can the ResponseThread be replaced by each job looking for its own Response in SQS? Was this design driven by AWS cost minimization? If so, according to https://aws.amazon.com/sqs/pricing, the first million requests per month to SQS are free and the subsequent ones cost virtually nothing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would that be better? It seems more efficient the way it is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For one thing it requires an entire extra worker. (Though if this dedicated worker morphed into an operations worker and smartstate was a type of operation, then that argument is minimized cc @agrare )

Copy link
Member

@Fryguy Fryguy Dec 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chessbyte Just to get some ballpark figures, 10,000 amazon instances doing smartstate once per day = 3,650,000 requests (and that's only polling once per Vm, not once every so often), which comes out to a grand total of $1.06 🤑

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still messier. Multiple jobs reading from the same queue. Then, if the message isn't for them, they have to re-queue the message (unless there's a better way). An extra worker may be better than having the jobs doing more work.

Or the jobs could just check S3 directly, that would eliminate the queue issues.

I still think it's a lot cleaner the way it is. Is having an additional worker such a big deal?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if we wanted to read all the responses and then distribute them to each job via MiqQueue or another mechanism, that polling on a timer can be done via the ScheduleWorker and a new schedule. The ScheduleWorker would do nothing more than queue up some work for the Amazon Operations Worker.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most, if not all, of what we're doing now is for a reason. We'll have to go back and revisit all of the design decisions to validate the feasibility of proposed changes.

The way it is now, works well, and it takes advantage of the queue, to streamline the process.
I still don't see the issue with having the worker. The worker subsystem already handles the coordination and orchestration, so we're not adding complexity. I'm really not seeing any problem that would warrant a change.

We need a central point of control to handle the agent management and queue processing. By its nature, this should be performed in an independent thread of execution, be it a separate process, or threads of an existing process. A worker seemed a logical choice to implement this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can leverage the ScheduleWorker, instead of a dedicated worker, that's fine.
However, I believe the queue operations block, so there's no deed to poll in a dedicated thread.
If this necessitates polling, we'd have to access the queue in a non-blocking manner, which would add complexity to the code.

I'm still wondering it this is a problem that warrants a change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe because I was not very involved in the initial design, I am reviewing it now since I am refactoring. So, I think having a discussion is warranted. Whether the discussion leads to a design/code change is a separate issue.

The only reason I focused in on this particular SmartState is because it is SO different from everything else - extra workers, unusual job state transitions, the parsing of the XML happening outside of the VmScan job code to name a few things that caught my eye.

Copy link
Member

@roliveri roliveri Dec 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's because Amazon is so different. But I'm not sure if all the differences in the Amazon code are required, or were just the result of expediency of development under given time constraints.

  • I think the current queue handling is probably the best way to go.

  • I'm not sure about the agent management. Hui knows most about that - but I recall it took some work to make sure it addressed all the corner cases.

  • The state transitions can be addressed, but with the advent of provider-specific scan jobs, that may not be a big issue.

  • It would be good if the XML can be parsed in a common way. I'm not sure if there's a good reason it is not.

@agrare agrare self-assigned this Jan 6, 2020
@agrare agrare merged commit 21ee878 into ManageIQ:master Jan 6, 2020
@agrare agrare added this to the Sprint 127 Ending Jan 6, 2020 milestone Jan 6, 2020
@chessbyte chessbyte deleted the subclass_vm_scan branch July 9, 2020 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants