data-plane-controller for managing data-plane life cycles #1739

Merged
jgraettinger merged 4 commits into master from johnny/data-plane-controller
Oct 29, 2024

Conversation

jgraettinger
Member

@jgraettinger jgraettinger commented Oct 27, 2024

Description:

The controller uses the recent automations crate to monitor structured
changes to a bound data_planes row and drive the data-plane to convergence.

It's modeled as a CI/CD pipeline, where an indicated branch of our
dry-dock repo (containing Pulumi and Ansible infrastructure) is deployed
for the given data-plane.

The controller performs the full lifecycle required for rolling updates:

  • pulumi up to create new resources or respond to replacements
  • Awaiting DNS propagation
  • Running the Ansible playbook to provision instances
  • Running pulumi up to reflect readiness of started instances
  • Awaiting DNS propagation (again)
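
The ordering of these steps can be sketched as a small state machine. This is only an illustration of the sequence listed above, not the controller's actual types: apart from `Idle`, which appears in a commit message of this PR, the `Step` variants and the `full_pass` helper are hypothetical names.

```rust
/// Hypothetical steps of one rolling-update pass, in the order listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Step {
    Idle,
    PulumiUpCreate, // `pulumi up` to create or replace resources
    AwaitDnsFirst,  // wait for DNS propagation
    Ansible,        // run the Ansible playbook to provision instances
    PulumiUpReady,  // `pulumi up` to reflect instance readiness
    AwaitDnsSecond, // wait for DNS propagation (again)
}

impl Step {
    /// Advance to the next step, assuming the current one succeeded.
    pub fn next(self) -> Step {
        match self {
            Step::Idle => Step::PulumiUpCreate,
            Step::PulumiUpCreate => Step::AwaitDnsFirst,
            Step::AwaitDnsFirst => Step::Ansible,
            Step::Ansible => Step::PulumiUpReady,
            Step::PulumiUpReady => Step::AwaitDnsSecond,
            Step::AwaitDnsSecond => Step::Idle,
        }
    }
}

/// Collect the steps visited between leaving Idle and returning to it.
pub fn full_pass() -> Vec<Step> {
    let mut steps = Vec::new();
    let mut current = Step::Idle.next();
    while current != Step::Idle {
        steps.push(current);
        current = current.next();
    }
    steps
}
```

A real controller would also need failure and retry handling between steps, which this sketch leaves out.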

It also periodically refreshes a stack from remote providers to detect
changed or deleted resources, such as EC2 instance replacements, and
responds accordingly to heal the infrastructure.

A number of sanity-checks are built in to verify that we're modifying data-plane configurations in allowed ways, and that we're not performing modifications while the controller is actively driving the data-plane towards convergence.
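
As a rough sketch of the second kind of check, the function below rejects a change to the configuration or deployment branch unless the data-plane is idle. The struct, its field names, and the `Converging` status are assumptions made for illustration, and the PR itself may enforce the rule elsewhere (for example in the database) rather than in code like this.

```rust
/// Hypothetical status values; only `Idle` is named in this PR.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Status {
    Idle,
    Converging,
}

/// Hypothetical, simplified view of a data_planes row.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DataPlaneRow {
    pub config: String,
    pub deploy_branch: String,
    pub status: Status,
}

/// Reject configuration or branch changes unless the prior row was Idle.
pub fn check_update(old: &DataPlaneRow, new: &DataPlaneRow) -> Result<(), String> {
    let changed = old.config != new.config || old.deploy_branch != new.deploy_branch;
    if changed && old.status != Status::Idle {
        return Err(format!(
            "cannot change config or branch while status is {:?}",
            old.status
        ));
    }
    Ok(())
}
```

Updates that leave both the configuration and the branch untouched pass regardless of status, so the controller can still record its own progress on the row.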

The controller also publishes a variety of exported Pulumi outputs which customers need to know about, such as IAM users and GCP service accounts, AWS private link bindings, and VPC CIDR blocks.

Fixes #1727

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)



@jgraettinger jgraettinger requested a review from mdibaiee October 27, 2024 22:16
Add a view_logs RPC variant which takes a lower-bound last_logged_at and
orders on ascending logged_at.

Apply various column updates to `data_planes` row for a future
data-plane-controller.

Enforce a constraint that the data_planes configuration and deployment
branch cannot be changed while its status != Idle.

Instead of fetching all logs every RPC, pass an exclusive lower-bound
logged_at from which further logs are fetched.

Also add a raw command for fetching logs of a specific bearer token.
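
The cursor pattern described in this commit message can be sketched as follows. The names `view_logs` and `logged_at` come from the commit messages, but the function signature, the in-memory `store`, and the integer timestamps are simplifying assumptions and not the real RPC.

```rust
/// Simplified log line; the integer timestamp is an assumption for brevity.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LogLine {
    pub logged_at: u64,
    pub message: String,
}

/// Stand-in for the server side: return lines strictly after `after`,
/// ordered by ascending `logged_at`.
pub fn view_logs(store: &[LogLine], after: Option<u64>) -> Vec<LogLine> {
    let mut out: Vec<LogLine> = store
        .iter()
        .filter(|line| after.map_or(true, |bound| line.logged_at > bound))
        .cloned()
        .collect();
    out.sort_by_key(|line| line.logged_at);
    out
}

/// Client side: fetch the next page and advance the cursor to the last
/// `logged_at` seen, so the next call returns only newer lines.
pub fn poll(store: &[LogLine], cursor: &mut Option<u64>) -> Vec<LogLine> {
    let page = view_logs(store, *cursor);
    if let Some(last) = page.last() {
        *cursor = Some(last.logged_at);
    }
    page
}
```

One caveat of an exclusive bound is that a line arriving later with the same `logged_at` as the cursor would be skipped; whether that can happen depends on the timestamp resolution of the real table.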

Calculate a suitable value for the Pulumi stack name of a data-plane.

Prefix transform names with "from", because hashes with leading digits
are invalid.
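
A minimal illustration of the prefixing rule: a hex hash can begin with a digit, while the prefixed name always begins with a letter. The function names here are hypothetical, and how the hash is derived is not shown in this PR.

```rust
/// Build a transform name by prefixing the hash with "from".
/// (Hypothetical helper; the hash is taken as given.)
pub fn transform_name(hash: &str) -> String {
    format!("from{}", hash)
}

/// True when the name begins with an ASCII digit.
pub fn starts_with_digit(name: &str) -> bool {
    name.chars().next().map_or(false, |c| c.is_ascii_digit())
}
```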
@jgraettinger jgraettinger force-pushed the johnny/data-plane-controller branch from 0a4d85a to be54ea5 Compare October 28, 2024 02:01
@jgraettinger
Member Author

jgraettinger commented Oct 28, 2024

Testing

I performed extensive scenario testing against a private estuary_support/ data-plane in Vultr, including:

  • Scale-up and scale-down of gazette and reactor deployments.
  • Rolling upgrades of gazette and reactors, by scaling up a new deployment and then scaling down the old one.
  • Adding a new etcd deployment and scaling it down again. (Note: etcd operations must be done one node at a time!)
  • Deleting a gazette broker via the Vultr UI, and then allowing the controller to refresh and repair the data-plane.

I used the new command flowctl raw bearer-logs to stream logs of the controller's actions as they're happening.

I also spun up a local stack and verified the new pulumi_stack column is populated properly, and performed a publication with flowctl to verify that log streaming looks good 👍

@jgraettinger jgraettinger requested a review from psFried October 28, 2024 15:30
Member

@mdibaiee mdibaiee left a comment

LGTM, left a nit/question and one sanity check suggestion

crates/data-plane-controller/src/repo.rs (outdated, resolved)
@jgraettinger jgraettinger force-pushed the johnny/data-plane-controller branch from be54ea5 to 0430e36 Compare October 29, 2024 16:32
@jgraettinger jgraettinger merged commit 5783c8c into master Oct 29, 2024
5 checks passed
@jgraettinger jgraettinger deleted the johnny/data-plane-controller branch October 29, 2024 16:32
@jgraettinger jgraettinger added the change:planned This is a planned change label Oct 29, 2024
github-actions bot pushed a commit to estuary/homebrew-flowctl that referenced this pull request Dec 12, 2024
## What's Changed
* flowctl: use new view_logs RPC with logged_at bound estuary/flow#1739
* flowctl raw bearer-logs: add --since parameter with 1 hour default estuary/flow#1752
* flowctl: add `raw spec` support for materializations estuary/flow#1798
* protocols/flow: add array inference to protocol estuary/flow#1787

**Full Changelog**: estuary/flow@v0.5.7...v0.5.8
williamhbaker added a commit to estuary/homebrew-flowctl that referenced this pull request Dec 12, 2024
Labels
change:planned This is a planned change
Development

Successfully merging this pull request may close these issues.

automations for data-plane operations
2 participants