[ECO-2188] Revise indexer NAT gateway(s), rules, autoscaling #264

Merged
merged 66 commits into from
Sep 30, 2024
66 commits
03b9302
Bump cfn-lint version
alnoki Sep 20, 2024
f7f7e77
Convert conditions to rules
alnoki Sep 23, 2024
a425160
Change MaybeDeploy to Deploy
alnoki Sep 23, 2024
00373a6
Update deploy file for rule changes
alnoki Sep 23, 2024
37442f1
Update README in prep for rules checking
alnoki Sep 23, 2024
7ec0cdf
Update env for deploy
alnoki Sep 23, 2024
ac4d8b0
Switch condition style
alnoki Sep 24, 2024
da5c56d
Try deploying repro template
alnoki Sep 24, 2024
546892f
Update README
alnoki Sep 24, 2024
5206dc9
Try provisioning VPC with new condition style
alnoki Sep 24, 2024
01a7ae0
Add missing `!Ref`
alnoki Sep 24, 2024
06f5564
Bump cfn-lint, rules logic
alnoki Sep 24, 2024
86eec30
Kill stack
alnoki Sep 24, 2024
f2d30b3
Try primary/fallback NAT gateway
alnoki Sep 24, 2024
abcecf4
Start making NAT gateway for each AZ
alnoki Sep 24, 2024
76270cc
Simplify use of Fn::Sub
alnoki Sep 24, 2024
6de2e05
Add redudant NAT gateways, kill stack
alnoki Sep 24, 2024
2d7fbcc
Add multi-NAT gateway design note
alnoki Sep 24, 2024
daeec1f
Provision entire stack
alnoki Sep 24, 2024
4c22c74
Kill WAF rules for REST, WS
alnoki Sep 24, 2024
eec18d7
Kill WAF
alnoki Sep 24, 2024
9de9ea1
Provision WAF
alnoki Sep 24, 2024
20853e9
Add alarm overload
alnoki Sep 24, 2024
c2971eb
Match original alarm exactly
alnoki Sep 24, 2024
f8b8ef0
Use cusom metric spec
alnoki Sep 24, 2024
6c92c90
Specify statistic
alnoki Sep 24, 2024
b8ead83
Pin unit
alnoki Sep 24, 2024
8484e71
Abstract autoscaling policy, targets
alnoki Sep 25, 2024
2b3130b
Add scale up and scale down alarms
alnoki Sep 25, 2024
494194d
Kill broker, postgrest
alnoki Sep 25, 2024
3d40933
Use predefined metric for scaling policy
alnoki Sep 25, 2024
bbe28fd
Kill more REST dependents
alnoki Sep 25, 2024
f535b3b
Re-deploy entire stack
alnoki Sep 25, 2024
3b46bb6
Kill WAF rules
alnoki Sep 25, 2024
0283329
Kill WAF entirely
alnoki Sep 25, 2024
2c38371
Remove typo
alnoki Sep 25, 2024
f89d239
Update bastion README
alnoki Sep 25, 2024
bfe775d
Kill broker, ALB, ALB DNS cert
alnoki Sep 25, 2024
6186455
Use step scale-in
alnoki Sep 25, 2024
8a9bd36
Re-provision broker
alnoki Sep 25, 2024
5d12e4e
Make name consistent
alnoki Sep 25, 2024
ba590bf
SPecify step scaling
alnoki Sep 25, 2024
a6dfdce
Try killing all containers
alnoki Sep 25, 2024
a9363d6
Deploy API containers
alnoki Sep 25, 2024
4b5efbd
Re-provision container extensions
alnoki Sep 26, 2024
20c5dc6
Kill ALB
alnoki Sep 26, 2024
fe50b1b
Kill broker
alnoki Sep 26, 2024
4cd0ab3
Redeploy broker, ALB
alnoki Sep 26, 2024
4108536
Kill all but DB, processor
alnoki Sep 26, 2024
ef14b83
Deploy all except WAF
alnoki Sep 26, 2024
a9f8a6a
Rename down to in
alnoki Sep 26, 2024
af74363
Kill again
alnoki Sep 26, 2024
ecc56fa
Add autoscaling architecture notes
alnoki Sep 26, 2024
1005464
Redeploy all except WAF
alnoki Sep 26, 2024
53944a8
Update bastion host line breaking
alnoki Sep 26, 2024
623bd2e
Address cspell
alnoki Sep 26, 2024
124aba1
Merge branch 'main' into ECO-2188
alnoki Sep 26, 2024
91e37f0
Merge branch 'main' into ECO-2188
alnoki Sep 27, 2024
4e07697
Kill stack to update SQL db
alnoki Sep 27, 2024
74c71ee
Bump binary versions
alnoki Sep 27, 2024
1644c62
Re-enable stack
alnoki Sep 27, 2024
f686f20
Merge branch 'main' into ECO-2188
alnoki Sep 27, 2024
9a895c2
Remove generalized cache with wildcard error
alnoki Sep 30, 2024
de6957d
Incorporate processor bump suggestion
alnoki Sep 30, 2024
06737f1
Merge branch 'main' into ECO-2188
alnoki Sep 30, 2024
d97df24
Revert environment
alnoki Sep 30, 2024
2 changes: 1 addition & 1 deletion cfg/pre-commit-config.yaml
@@ -231,7 +231,7 @@ repos:
files: '.*\.cfn\.yaml'
id: 'cfn-lint'
repo: 'https://github.com/aws-cloudformation/cfn-lint'
rev: 'v1.11.0'
rev: 'v1.15.1'
-
hooks:
-
81 changes: 40 additions & 41 deletions src/cloud-formation/README.md
@@ -22,37 +22,10 @@ under a root domain you provide, for an environment name of your choosing:

## Template parameters

`indexer.cfn.yaml` contains assorted [parameters] of the form `MaybeDeploy*`
that can be used to selectively provision and de-provision [resources]. For a
concise list of such parameters, see a [stack deployment file] at
`deploy-*.yaml`. See the template [conditions] section for associated
dependencies.

Note that even if a parameter is passed as `true`, the resources that directly
depend on it will not be created unless the condition's dependencies are also
met. All resources are eventually conditional on `MaybeDeployStack`, which can
be used to toggle provisioning and de-provisioning of all resources.

In practice this means that even if a `MaybeDeploy*` parameter is passed as
`true`, the corresponding resource(s) might not be created. For example if
`MaybeDeployStack` is `false`, then even if `MaybeDeployVpc` is `true`,
virtual private network resources won't be created because `MaybeDeployVpc`
is conditional on `MaybeDeployStack`.

In theory [rules] could be used to enforce parametric dependencies, thus
generating an error in the case that a hypothetical `DeployVpc` is passed
`true` but a hypothetical `DeployStack` is passed `false`, however rules have
several prohibitive issues in practice:

1. [`cfn-lint` issue #3630].

1. If a rule assertion fails, rather than reporting an assertion error, the
[GitSync status dashboard] instead simply halts the update with
[GitSync event] type `CHANGESET_CREATION_FAILED` and following event message,
misleadingly reporting that no changes are present when in fact the update
failure was a result of failed rule assertions:

> Changeset creation failed. The reason was No updates are to be performed..
`indexer.cfn.yaml` contains assorted [parameters] of the form `Deploy*` that can
be used to [conditionally][conditions] provision and de-provision [resources].
For a concise list of such parameters, see a [stack deployment file] at
`deploy-*.yaml`. See the template [rules] section for associated dependencies.
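
As an illustration of how a [rules] section can express such a dependency, here
is a minimal sketch. It assumes that `DeployVpc` depends on `DeployStack` (the
dependency described in the removed text above); the rule name
`DeployVpcRequiresDeployStack` and the assertion description are hypothetical
and may not match the actual contents of `indexer.cfn.yaml`:

```yaml
Parameters:
  DeployStack:
    Type: 'String'
    AllowedValues: ['true', 'false']
  DeployVpc:
    Type: 'String'
    AllowedValues: ['true', 'false']
Rules:
  # Hypothetical rule name, for illustration only.
  DeployVpcRequiresDeployStack:
    # The assertion is only evaluated when DeployVpc is 'true'.
    RuleCondition: !Equals [!Ref DeployVpc, 'true']
    Assertions:
      - Assert: !Equals [!Ref DeployStack, 'true']
        AssertDescription: 'DeployVpc requires DeployStack'
```

With a rule like this, passing `DeployVpc: 'true'` together with
`DeployStack: 'false'` should fail parameter validation rather than silently
skipping the VPC resources.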

## Setup

@@ -220,7 +193,8 @@ several prohibitive issues in practice:
1. Create a [stack deployment file] (see `deploy-*.yml`) with appropriate
[template parameters](#template-parameters).

1. [Create the stack with GitSync].
1. [Create the stack with GitSync], then monitor [GitSync events][gitsync event]
in the [GitSync status dashboard].

## Querying endpoints

@@ -306,23 +280,28 @@ deployment environment:
### Bastion host connections

Before you try connecting to the bastion host, verify that the
`MaybeDeployBastionHost` [condition][conditions] evaluates to `true`. Note
too that if you have been provisioning and de-provisioning other resources, you
might want to de-provision then provision the bastion host before running the
below commands, in order to refresh the bastion host [user data] that stores the
URLs of other resources in the stack.
`DeployBastionHost` [condition][conditions] evaluates to `true`. Note too that
if you have been provisioning and de-provisioning other resources, you might
want to de-provision then provision the bastion host before running the below
commands, in order to refresh the bastion host [user data] that stores the URLs
of other resources in the stack.

1. Install the [EC2 Instance Connect CLI]:

```sh
pip install ec2instanceconnectcli
```

1. Connect to the bastion host over the [EC2 Instance Connect Endpoint] using
your stack name, for example `emoji-dev`:
1. Set your stack name:

```sh
STACK_NAME=<STACK_NAME>
echo $STACK_NAME
```

1. Connect to the bastion host over the [EC2 Instance Connect Endpoint]:

```sh
STACK_NAME=emoji-dev
INSTANCE_ID=$(aws cloudformation describe-stacks \
--output text \
--query 'Stacks[0].Outputs[?OutputKey==`BastionHostId`].OutputValue' \
@@ -399,6 +378,10 @@ The indexer database uses [Aurora PostgreSQL] on a
[high availability][high availability for aurora] with
[fault tolerant replica promotion] and [autoscaling][aurora autoscaling].

### NAT gateway redundancy

The indexer uses [a NAT gateway in each availability zone] for high resilience.

### Permissions

The `ContainerRole` [ECS task execution IAM role] provides
@@ -425,6 +408,19 @@ toggle [rule actions] between `Block` and `Count`.

See the [Web ACL traffic overview dashboards] to monitor rules.

### Container scaling

[Container autoscaling] for both REST and WebSocket endpoints relies on a
mixture of [target tracking] for scaling out and [step scaling] for scaling in.

Scaling in uses a custom [step scale CloudWatch alarm] that only fires when more
than one instance is active, to prevent alarms from triggering when only one
instance is live and at idle.

This design ensures that at least one server container is always live for both
REST and WebSocket endpoints.
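
The split between target tracking and step scaling can be sketched roughly as
follows. This is a simplified illustration rather than the template's actual
resources: the logical IDs, capacity limits, target value, cooldown, and the
choice of `ECSServiceAverageCPUUtilization` are assumptions, `Cluster` and
`Service` stand in for an ECS cluster and service defined elsewhere, and the
custom CloudWatch alarm that invokes the scale-in policy is not shown:

```yaml
Resources:
  # Hypothetical scalable target; MinCapacity of 1 keeps one container live.
  ScalableTarget:
    Type: 'AWS::ApplicationAutoScaling::ScalableTarget'
    Properties:
      MaxCapacity: 4  # Placeholder value.
      MinCapacity: 1
      ResourceId: !Sub 'service/${Cluster}/${Service.Name}'
      ScalableDimension: 'ecs:service:DesiredCount'
      ServiceNamespace: 'ecs'
  # Target tracking handles scale-out only (scale-in disabled by assumption).
  ScaleOutPolicy:
    Type: 'AWS::ApplicationAutoScaling::ScalingPolicy'
    Properties:
      PolicyName: 'scale-out'
      PolicyType: 'TargetTrackingScaling'
      ScalingTargetId: !Ref ScalableTarget
      TargetTrackingScalingPolicyConfiguration:
        DisableScaleIn: true
        PredefinedMetricSpecification:
          PredefinedMetricType: 'ECSServiceAverageCPUUtilization'
        TargetValue: 50  # Placeholder value.
  # Step scaling removes one task each time the scale-in alarm fires.
  ScaleInPolicy:
    Type: 'AWS::ApplicationAutoScaling::ScalingPolicy'
    Properties:
      PolicyName: 'scale-in'
      PolicyType: 'StepScaling'
      ScalingTargetId: !Ref ScalableTarget
      StepScalingPolicyConfiguration:
        AdjustmentType: 'ChangeInCapacity'
        Cooldown: 60  # Placeholder value.
        MetricAggregationType: 'Average'
        StepAdjustments:
          - MetricIntervalUpperBound: 0
            ScalingAdjustment: -1
```

In a layout like this, the scale-in alarm would list `!Ref ScaleInPolicy` among
its alarm actions, so the condition "more than one instance is active" lives in
the alarm definition rather than in the policy itself.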

[a nat gateway in each availability zone]: https://docs.aws.amazon.com/vpc/latest/userguide/nat-gateway-basics.html
[amazonec2containerserviceautoscalerole]: https://docs.aws.amazon.com/autoscaling/application/userguide/security-iam-awsmanpol.html#ecs-policy
[application autoscaling iam access]: https://docs.aws.amazon.com/autoscaling/application/userguide/security_iam_service-with-iam.html
[aptos labs grpc endpoint]: https://aptos.dev/en/build/indexer/txn-stream/aptos-hosted-txn-stream#endpoints
@@ -437,6 +433,7 @@ See the [Web ACL traffic overview dashboards] to monitor rules.
[aws cloudformation]: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html
[cloudformation service role]: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-iam-servicerole.html
[conditions]: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/conditions-section-structure.html
[container autoscaling]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-auto-scaling.html
[container logging permissions]: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_awslogs.html#ec2-considerations
[create the stack with gitsync]: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/git-sync-walkthrough.html
[ec2 instance connect cli]: https://github.com/aws/aws-ec2-instance-connect-cli
@@ -468,13 +465,15 @@ See the [Web ACL traffic overview dashboards] to monitor rules.
[secrets manager secrets]: https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html
[stack]: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/stacks.html
[stack deployment file]: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/git-sync-concepts-terms.html#git-sync-concepts-terms-depoyment-file
[step scale cloudwatch alarm]: https://docs.aws.amazon.com/autoscaling/application/userguide/step-scaling-policy-overview.html#step-scaling-how-it-works
[step scaling]: https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-step-scaling-policies.html
[systems manager parameters]: https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html
[target tracking]: https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-target-tracking.html
[template file]: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/gettingstarted.templatebasics.html
[template outputs section]: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/outputs-section-structure.html
[the upstream repository credentials docs]: https://docs.aws.amazon.com/AmazonECR/latest/userguide/pull-through-cache-creating-secret.html
[transaction stream service endpoint]: https://aptos.dev/en/build/indexer/txn-stream
[user data]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html
[web acl traffic overview dashboards]: https://docs.aws.amazon.com/waf/latest/developerguide/web-acl-dashboards.html
[web application firewall]: https://docs.aws.amazon.com/waf/latest/developerguide/waf-chapter.html
[`cfn-lint` issue #3630]: https://github.com/aws-cloudformation/cfn-lint/issues/3630
[`ecr::getauthorizationtoken`]: https://docs.aws.amazon.com/AmazonECR/latest/APIReference/API_GetAuthorizationToken.html
38 changes: 19 additions & 19 deletions src/cloud-formation/deploy-dev.yaml
@@ -1,28 +1,28 @@
---
parameters:
BrokerImageVersion: '0.7.0'
EnableWafRulesGeneral: 'true'
BrokerImageVersion: '0.8.0'
DeployAlb: 'true'
DeployAlbDnsRecord: 'true'
DeployBastionHost: 'true'
DeployBroker: 'true'
DeployContainers: 'true'
DeployDb: 'true'
DeployNlb: 'true'
DeployNlbVpcLink: 'true'
DeployPostgrest: 'true'
DeployProcessor: 'true'
DeployRestApi: 'true'
DeployRestApiDnsRecord: 'true'
DeployRouteTables: 'true'
DeployStack: 'true'
DeployVpc: 'true'
DeployWaf: 'false'
EnableWafRulesGeneral: 'false'
EnableWafRulesRestApi: 'false'
EnableWafRulesWebSocket: 'false'
Environment: 'dev'
MaybeDeployAlb: 'true'
MaybeDeployAlbDnsRecord: 'true'
MaybeDeployBastionHost: 'true'
MaybeDeployBroker: 'true'
MaybeDeployContainers: 'true'
MaybeDeployDb: 'true'
MaybeDeployNlb: 'true'
MaybeDeployNlbVpcLink: 'true'
MaybeDeployPostgrest: 'true'
MaybeDeployProcessor: 'true'
MaybeDeployRestApi: 'true'
MaybeDeployRestApiDnsRecord: 'true'
MaybeDeployRouteTables: 'true'
MaybeDeployStack: 'true'
MaybeDeployVpc: 'true'
MaybeDeployWaf: 'true'
Network: 'testnet'
ProcessorImageVersion: '0.5.0'
ProcessorImageVersion: '0.6.0'
tags: null
template-file-path: 'src/cloud-formation/indexer.cfn.yaml'
...