This repository contains the reference architecture of the infrastructure needed to deploy dSPACE SIMPHERA to AWS. It does not contain the Helm chart needed to deploy SIMPHERA itself, only the base infrastructure such as Kubernetes, PostgreSQL, S3 buckets, etc.
You can use the reference architecture as a starting point for your SIMPHERA installation if you plan to deploy SIMPHERA to AWS. You can use the reference architecture as it is and only have to configure a few individual values. If you have special requirements, feel free to adapt the architecture to your needs. For example, the reference architecture does not contain any kind of VPN connection to a private, on-premises network, because this is highly user-specific. Instead, the reference architecture is configured so that the ingress points are available on the public internet.
Using the reference architecture, you can deploy a single instance or multiple instances of SIMPHERA, e.g., one for production and one for testing.
The following figure shows the main resources of the architecture:
The main building block of the SIMPHERA reference architecture for AWS is the Amazon EKS cluster. The cluster contains two auto scaling groups: the first group is reserved for SIMPHERA services and other auxiliary third-party services like Keycloak, nginx, etc.; the second group is for the executors that test the system under test. The data for SIMPHERA projects is stored in an Amazon RDS for PostgreSQL instance. Keycloak stores SIMPHERA users in a separate Amazon RDS for PostgreSQL instance. Executors need licenses to execute tests and simulations, which they obtain from a license server deployed on an EC2 instance. Project files and test results are stored in a non-public Amazon S3 bucket. For the initial setup of the license server, several files need to be exchanged between an administration PC and the license server. These files are exchanged via a non-public S3 bucket that both the administration PC and the license server can read from and write to. A detailed list of the AWS resources that are mandatory or optional for operating SIMPHERA can be found in the AWSCloudSpec.
Charges may apply for the following AWS resources and services:
Service | Description | Mandatory? |
---|---|---|
Amazon Elastic Kubernetes Service | A Kubernetes cluster is required to run SIMPHERA. | Yes |
Amazon Virtual Private Cloud | Virtual network for SIMPHERA. | Yes |
Elastic Load Balancing | SIMPHERA uses a network load balancer. | Yes |
Amazon EC2 Auto Scaling | SIMPHERA automatically scales compute nodes if the capacity is exhausted. | Yes |
Amazon Relational Database Service | Project and authorization data is stored in Amazon RDS for PostgreSQL instances. | Yes |
Amazon Simple Storage Service | Binary artifacts are stored in an S3 bucket. | Yes |
Amazon Elastic File System | Binary artifacts are stored temporarily in EFS. | Yes |
AWS Key Management Service (AWS KMS) | Encryption for Kubernetes secrets is enabled by default. | |
Amazon Elastic Compute Cloud | Optionally, you can deploy a dSPACE license server on an EC2 instance. Alternatively, you can deploy the server on external infrastructure. | |
Amazon CloudWatch | SIMPHERA can send metrics and container logs to CloudWatch. It is recommended to deploy the dSPACE monitoring stack in Kubernetes. | |
To create the AWS resources that are required for operating SIMPHERA, you need to accomplish the following tasks:
- install Terraform on your local administration PC
- register an AWS account where the resources needed for SIMPHERA are created
- create an IAM user with the least privileges required to create the resources for SIMPHERA
- create security credentials for that IAM user
- request a service quota increase for GPU instances, if needed
- create a non-public S3 bucket for the Terraform state
- create an IAM policy that gives the IAM user access to the S3 bucket
- clone this repository onto your local administration PC
- create AWS Secrets Manager secrets
- adjust Terraform variables
- apply Terraform configuration
- connect to the Kubernetes cluster
This reference architecture is provided as a Terraform configuration. Terraform is an open-source command-line tool to automatically create and manage cloud resources. A Terraform configuration consists of various `.tf` text files. These files contain the specifications of the resources to be created in the cloud infrastructure, which is why this approach is called infrastructure-as-code. The main advantage of this approach is reproducibility, because the configuration can be maintained in a source control system such as Git.
Terraform uses variables to make the specification configurable. The concrete values for these variables are specified in `.tfvars` files, so it is the task of the administrator to fill the `.tfvars` files with the correct values. This is explained in more detail in a later chapter.
Terraform has the concept of a state. On the one hand, there are the resource specifications in the `.tf` files. On the other hand, there are the resources in the cloud infrastructure that are created based on these files. Terraform needs to store mapping information about which element of the specification belongs to which resource in the cloud infrastructure. This mapping is called the state. In general, you could store the state on your local hard drive. But that is not a good idea, because then nobody else could change settings and apply these changes. Therefore, the state itself should be stored in the cloud.
If you want to run AURELION with your SIMPHERA solution, you need to add GPU instances to your cluster.
If you want to add a GPU node pool to your AWS infrastructure, you might have to increase the quota for the GPU instance type you have selected. By default, the SIMPHERA Reference Architecture for AWS uses p3.2xlarge instances. The quota "Running On-Demand P instances" sets the maximum number of vCPUs assigned to running On-Demand P instances in a specific AWS region. Every p3.2xlarge instance has 8 vCPUs, so the quota has to be at least 8 for the AWS region where you want to deploy the instances, plus 8 for every additional GPU node that should run at the same time.
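The quota arithmetic can be sketched in a few lines of shell. This is a minimal sketch, not part of the reference architecture: the function name `required_gpu_quota` is hypothetical, the 8 vCPUs per p3.2xlarge instance come from the paragraph above, and 12 is the default of `gpuNodeCountMax` from the inputs table. It assumes all GPU nodes use the same instance type.

```bash
#!/usr/bin/env bash
# Hypothetical helper: minimum "Running On-Demand P instances" quota (in vCPUs)
# needed to run a given number of GPU nodes of one instance type at once.
required_gpu_quota() {
  local vcpus_per_instance=$1   # e.g. 8 for p3.2xlarge
  local node_count=$2           # e.g. the value of gpuNodeCountMax
  echo $(( vcpus_per_instance * node_count ))
}

required_gpu_quota 8 1    # a single p3.2xlarge node, prints 8
required_gpu_quota 8 12   # default gpuNodeCountMax of 12, prints 96
```

With the default `gpuNodeCountMax` of 12, the quota would have to be 96 vCPUs for the GPU node pool to scale out completely. Other P instances already running in the same region count against the same quota.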
You can create security credentials for that IAM user with the AWS console. Terraform uses these security credentials to create AWS resources on your behalf.
On your administration PC, you need to install Terraform and the AWS CLI. To configure your AWS account, run the following command:
```shell
aws configure --profile <profile-name>
AWS Access Key ID [None]: *********
AWS Secret Access Key [None]: *******
Default region name [None]: eu-central-1
Default output format [None]: json
```
If you have been provided with a session token, you can add it with the following command:
```shell
aws configure set aws_session_token "<your_session_token>" --profile <profile-name>
```
Access credentials are typically stored in `~/.aws/credentials` and configurations in `~/.aws/config`.
There are various ways to authenticate Terraform against AWS. Which one to use depends on your specific setup.
Verify connectivity and your access credentials by executing the following command:
```shell
aws sts get-caller-identity
{
    "UserId": "REWAYDCFMNYCPKCWRZEHT:<user-email>",
    "Account": "592245445799",
    "Arn": "arn:aws:sts::592245445799:assumed-role/AWSReservedSSO_AdministratorAccess_vmcbaym7ueknr9on/<user-email>"
}
```
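Because Terraform creates resources in whichever account the selected credentials belong to, it can help to check the account before running Terraform. The sketch below is a hypothetical guard, not part of this repository: the function names `arn_account_id` and `require_account` and the example ARN are made up for illustration. The account ID is the fifth colon-separated field of an ARN; `aws sts get-caller-identity --query Account --output text` also returns it directly.

```bash
#!/usr/bin/env bash
# Hypothetical guard (not part of this repository): fail unless an ARN
# belongs to the expected AWS account.
# ARN layout: arn:partition:service:region:account-id:resource

# Print the account ID, i.e. the fifth colon-separated field of the ARN.
arn_account_id() {
  echo "$1" | cut -d: -f5
}

# Return 0 if the ARN belongs to the expected account, 1 otherwise.
require_account() {
  local expected=$1 arn=$2 actual
  actual=$(arn_account_id "$arn")
  if [ "$actual" != "$expected" ]; then
    echo "Wrong AWS account: expected $expected, got $actual" >&2
    return 1
  fi
}

# With configured credentials, the ARN of the current identity can be read with:
#   arn=$(aws sts get-caller-identity --query Arn --output text --profile <profile-name>)
arn="arn:aws:iam::123456789012:user/terraform"   # hypothetical example ARN
require_account "123456789012" "$arn" && echo "account OK"
```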
As mentioned before, Terraform stores the state of the resources it creates within an S3 bucket. The bucket name needs to be globally unique.
After you have created the bucket, you need to link it with Terraform. To do so, make a copy of the file `state-backend-template`, name it `state-backend.tf`, and open the file in a text editor. The values have to point to an existing S3 bucket to be used to store the Terraform state:
```hcl
terraform {
  backend "s3" {
    # The name of the bucket to be used to store the Terraform state. You need to create this bucket manually.
    bucket = "terraform-state"
    # The name of the file inside the bucket to be used for this Terraform state.
    key = "simphera.tfstate"
    # The region of the bucket.
    region = "eu-central-1"
  }
}
```
Important: It is highly recommended to enable server-side encryption of the state file. Encryption is not enabled by default.
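Terraform's S3 backend has an `encrypt` argument that enables server-side encryption of the state file. As a minimal sketch, assuming you prefer to generate `state-backend.tf` instead of editing the copy of `state-backend-template` by hand, the following shell function writes the backend block with `encrypt = true`. The function name `write_state_backend` and the bucket name `my-terraform-state` are hypothetical; the key and region are the values from the template above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): write state-backend.tf
# for the S3 backend with server-side encryption of the state file enabled.
write_state_backend() {
  local bucket=$1 key=$2 region=$3 file=${4:-state-backend.tf}
  cat > "$file" <<EOF
terraform {
  backend "s3" {
    bucket  = "${bucket}"
    key     = "${key}"
    region  = "${region}"
    encrypt = true
  }
}
EOF
}

# "my-terraform-state" is a hypothetical name; use your own globally unique bucket.
write_state_backend "my-terraform-state" "simphera.tfstate" "eu-central-1"
```

Terraform requires you to run `terraform init` again whenever the backend configuration of an already initialized directory changes.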
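Create the following policy for accessing the Terraform state bucket. Because it contains `Principal` elements, it is a resource-based policy and has to be attached to the state bucket as a bucket policy. If you instead attach the permissions directly to the IAM user as an identity-based policy, omit the `Principal` elements: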
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "<your_account_arn>"
            },
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::terraform-state"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "<your_account_arn>"
            },
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::terraform-state/<storage_key_state_backend>"
        }
    ]
}
```
Your account ARN (Amazon Resource Name) is shown in the output of the `aws sts get-caller-identity` command.
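If you prefer not to replace the placeholders by hand, the policy above can be rendered from shell variables. This is a sketch under stated assumptions: the function name `render_state_policy` and the example bucket, key, and ARN values are hypothetical; the statements, actions, and resource formats are copied from the policy above. The account ARN can be read with `aws sts get-caller-identity --query Arn --output text`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): print the Terraform
# state bucket policy shown above with the placeholders filled in.
render_state_policy() {
  local bucket=$1      # name of the Terraform state bucket
  local key=$2         # value of "key" in state-backend.tf
  local principal=$3   # your account ARN
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "${principal}" },
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::${bucket}"
    },
    {
      "Effect": "Allow",
      "Principal": { "AWS": "${principal}" },
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::${bucket}/${key}"
    }
  ]
}
EOF
}

# Hypothetical example values
render_state_policy "my-terraform-state" "simphera.tfstate" \
  "arn:aws:iam::123456789012:user/terraform" > state-policy.json
```

The resulting `state-policy.json` could then be attached as a bucket policy, for example with `aws s3api put-bucket-policy --bucket my-terraform-state --policy file://state-policy.json`.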
Username and password for the PostgreSQL databases are stored in AWS Secrets Manager.
Before you let Terraform create AWS resources, you need to manually create a Secrets Manager secret that stores the username and password.
It is recommended to create individual secrets per SIMPHERA instance (e.g. production and staging instance).
To create the secret, open the Secrets Manager console and click the button Store a new secret. As secret type, choose Other type of secret.
The password must contain 8 to 128 characters and must not contain any of the following: / (slash), ' (single quote), " (double quote), and @ (at sign).
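To check a candidate password against these rules before storing it, a small shell function is enough. This is a minimal sketch, not part of this repository: the function name `is_valid_db_password` is hypothetical, and it checks only the rules listed above, not any further constraints AWS may apply.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): check a candidate
# PostgreSQL password against the rules stated above, i.e. 8 to 128 characters
# and none of / ' " @. Returns 0 if the password passes, 1 otherwise.
is_valid_db_password() {
  local pw=$1
  local len=${#pw}
  # Length check
  if (( len < 8 || len > 128 )); then
    return 1
  fi
  # Forbidden characters: slash, single quote, double quote, at sign
  case "$pw" in
    */* | *\'* | *\"* | *@*) return 1 ;;
  esac
  return 0
}

if is_valid_db_password 'Secr3t-Passw0rd'; then echo "valid"; else echo "invalid"; fi
```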
Open the Plaintext tab, paste the following JSON object, and enter your password:
```json
{
    "postgresql_password": "<your password>"
}
```
Alternatively, you can create the secret with the following PowerShell script:
```powershell
$region = "<your region>"
$postgresqlCredentials = @"
{
    "postgresql_password" : "<your password>"
}
"@ | ConvertFrom-Json | ConvertTo-Json -Compress
# Escape double quotes (doubling any preceding backslashes) so that the JSON
# is passed intact as an argument to the native aws command
$postgresqlCredentials = $postgresqlCredentials -replace '([\\]*)"', '$1$1\"'
aws secretsmanager create-secret --name "<secret name>" --secret-string $postgresqlCredentials --region $region
```
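On Linux or macOS, the same secret can be created from bash. In this sketch the function name `build_secret_string` is hypothetical; the `aws secretsmanager create-secret` call and its options are the ones from the PowerShell script. Because bash passes `"$secret_string"` to the aws command as a single unmodified argument, the extra quote escaping that the PowerShell script needs is not required. The sketch assumes the password already satisfies the rules above and contains no control characters.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): build the compact JSON
# secret string {"postgresql_password":"..."} for a password that already
# satisfies the rules above. Backslashes are escaped so the result stays valid
# JSON; double quotes are not handled because the rules forbid them.
build_secret_string() {
  local pw
  pw=$(printf '%s' "$1" | sed 's/\\/\\\\/g')
  printf '{"postgresql_password":"%s"}' "$pw"
}

secret_string=$(build_secret_string '<your password>')
echo "$secret_string"

# Requires configured AWS credentials; uncomment and fill in the placeholders:
# aws secretsmanager create-secret --name "<secret name>" \
#   --secret-string "$secret_string" --region "<your region>"
```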
In the console, on the next page you can define a name for the secret. Automatic credential rotation is currently not supported by SIMPHERA, but you can rotate secrets manually. You have to provide the name of the secret in your Terraform variables. The next section describes how you need to adjust your Terraform variables.
For your configuration, rename the template file `terraform.tfvars.example` to `terraform.tfvars` and open it in a text editor.
This file contains all configurable variables, including their documentation. Adapt the values before you deploy the resources.
```hcl
simpheraInstances = {
  "production" = {
    secretname = "<secret name>"
  }
}
```
Also rename the file `providers.tf.example` to `main.tf` and fill in the name of the AWS profile you have created before.
```hcl
provider "aws" {
  profile = "<profile-name>"
}
```
Before you can deploy the resources to AWS, you have to initialize Terraform:
```shell
terraform init
```
Afterwards, you can deploy the resources:
```shell
terraform apply
```
Terraform automatically loads the variables from your `terraform.tfvars` variable definition file.
Installation times may vary, but the deployment is expected to take up to 30 minutes to complete.
It is recommended to use an AWS admin account, or to ask your AWS administrator to assign the necessary IAM roles and permissions to your user.
Resources that contain data, i.e., the databases, the S3 storage, and the recovery points in the backup vault, are protected against unintentional deletion. Warning: If you continue with the procedure described in this section, your data will be irretrievably deleted.
Before the backup vault can be deleted, all continuous recovery points for the S3 storage and the databases need to be deleted, for example by using the following PowerShell snippet:
```powershell
$vaults = terraform output -json backup_vaults | ConvertFrom-Json
$profile = "<profile_name>"
foreach ($vault in $vaults) {
    Write-Host "Deleting $vault"
    $recoverypoints = aws backup list-recovery-points-by-backup-vault --profile $profile --backup-vault-name $vault | ConvertFrom-Json
    # Request the deletion of every recovery point in the vault
    foreach ($rp in $recoverypoints.RecoveryPoints) {
        aws backup delete-recovery-point --profile $profile --backup-vault-name $vault --recovery-point-arn $rp.RecoveryPointArn
    }
    # Wait until each recovery point is gone (describe-recovery-point fails once it no longer exists)
    foreach ($rp in $recoverypoints.RecoveryPoints) {
        do {
            Start-Sleep -Seconds 10
            aws backup describe-recovery-point --profile $profile --backup-vault-name $vault --recovery-point-arn $rp.RecoveryPointArn | ConvertFrom-Json
        } while ($LASTEXITCODE -eq 0)
    }
    # The vault can only be deleted once it is empty
    aws backup delete-backup-vault --profile $profile --backup-vault-name $vault
}
```
Before the databases can be deleted, you need to remove their deletion protection. The following snippet removes it and then deletes each database:
```powershell
$databases = terraform output -json database_identifiers | ConvertFrom-Json
foreach ($db in $databases) {
    Write-Host "Deleting database $db"
    aws rds modify-db-instance --profile $profile --db-instance-identifier $db --no-deletion-protection
    aws rds delete-db-instance --profile $profile --db-instance-identifier $db --skip-final-snapshot
}
```
You can remove the S3 buckets like this:
```powershell
$buckets = terraform output -json s3_buckets | ConvertFrom-Json
foreach ($bucket in $buckets) {
    aws s3 rb s3://$bucket --force --profile $profile
}
```
The remaining infrastructure resources can be deleted via Terraform.
Due to a bug, Terraform is not able to plan the removal of resources in the right order, which leads to a deadlock.
To work around the bug, you need to remove the `eks-addons` module first:
```shell
terraform destroy -target="module.eks-addons"
```
To delete the remaining resources, run the following command:
```shell
terraform destroy
```
This deployment contains a managed Kubernetes cluster (EKS).
In order to use command-line tools such as `kubectl` or `helm`, you need a kubeconfig configuration file.
You can update your kubeconfig using the AWS CLI `update-kubeconfig` command:
```shell
aws eks --region <region> update-kubeconfig --name <cluster_name> --kubeconfig <filename>
```
SIMPHERA stores data in the PostgreSQL database and in S3 buckets (MinIO) that need to be backed up. AWS supports continuous backups for Amazon RDS for PostgreSQL and S3 that allow point-in-time recovery. Point-in-time recovery lets you restore your data to any point in time within a defined retention period.
This Terraform module creates an AWS backup plan that makes continuous backups of the PostgreSQL database and S3 buckets.
The backups are stored in an AWS backup vault per SIMPHERA instance.
An IAM role is also automatically created that has proper permissions to create backups.
To enable backups for your SIMPHERA instance, make sure you have set the flag `enable_backup_service` in your `.tfvars` file:
```hcl
simpheraInstances = {
  "production" = {
    enable_backup_service = true
  }
}
```
Create a target RDS instance (backup server) that is a copy of a source RDS instance (production server) at a specific point in time.
The command `restore-db-instance-to-point-in-time` creates the target database.
Most of the configuration settings are copied from the source database.
To be able to connect to the target instance, the easiest way is to explicitly set the same security group and subnet group as used for the source instance.
Restoring an RDS instance can be done via PowerShell as follows:
```powershell
aws rds restore-db-instance-to-point-in-time `
    --source-db-instance-identifier simphera-reference-production-simphera `
    --target-db-instance-identifier simphera-reference-production-simphera-backup `
    --vpc-security-group-ids sg-0b954a0e25cd11b6d `
    --db-subnet-group-name simphera-reference-vpc `
    --restore-time 2022-06-16T23:45:00.000Z `
    --tags Key=timestamp,Value=2022-06-16T23:45:00.000Z
```
Execute the following command to create the `pgdump` pod using the standard `postgres` image and open a Bash shell:
```shell
kubectl run pgdump -ti -n simphera --image postgres --kubeconfig .\kube.config -- bash
```
In the pod's Bash shell, use the `pg_dump` and `pg_restore` commands to stream the data from the backup server to the production server:
```bash
pg_dump -h simphera-reference-production-simphera-backup.cexy8brfkmxk.eu-central-1.rds.amazonaws.com -p 5432 -U dbuser -Fc simpherareferenceproductionsimphera \
  | pg_restore --clean --if-exists -h simphera-reference-production-simphera.cexy8brfkmxk.eu-central-1.rds.amazonaws.com -p 5432 -U dbuser -d simpherareferenceproductionsimphera
```
Alternatively, you can restore the RDS instance via the AWS console.
This Terraform configuration creates an S3 bucket for project data and results and enables versioning of the S3 bucket, which is a requirement for point-in-time recovery.
To restore the S3 buckets to an older version, you need to create an IAM role that has the proper permissions:
```powershell
$rolename = "restore-role"
# Trust policy that allows AWS Backup to assume the role
$trustrelation = @"
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": ["sts:AssumeRole"],
            "Effect": "Allow",
            "Principal": {
                "Service": ["backup.amazonaws.com"]
            }
        }
    ]
}
"@
# Write the file as ASCII; in Windows PowerShell, ">" would write UTF-16
Set-Content -Path trust.json -Value $trustrelation -Encoding ascii
aws iam create-role --role-name $rolename --assume-role-policy-document file://trust.json --description "Role to restore"
aws iam attach-role-policy --role-name $rolename --policy-arn="arn:aws:iam::aws:policy/AWSBackupServiceRolePolicyForS3Restore"
aws iam attach-role-policy --role-name $rolename --policy-arn="arn:aws:iam::aws:policy/service-role/AWSBackupServiceRolePolicyForRestores"
# --output text returns the plain ARN without surrounding JSON quotes
$rolearn = aws iam get-role --role-name $rolename --query 'Role.Arn' --output text
```
You can restore the S3 data in-place, into another existing bucket, or into a new bucket. Restoring an S3 bucket can be done via PowerShell as follows:
```powershell
$uuid = New-Guid
$metadata = @"
{
    "DestinationBucketName": "man-validation-platform-int-results",
    "NewBucket": "true",
    "RestoreTime": "2022-06-20T23:45:00.000Z",
    "Encrypted": "false",
    "CreationToken": "$uuid"
}
"@
# Escape double quotes so that the JSON is passed intact to the native aws command
$metadata = $metadata -replace '([\\]*)"', '$1$1\"'
aws backup start-restore-job `
    --recovery-point-arn "arn:aws:backup:eu-central-1:012345678901:recovery-point:continuous:simphera-reference-production-0f51c39b" `
    --iam-role-arn $rolearn `
    --metadata $metadata
```
Alternatively, you can restore the S3 data via the AWS console.
Encryption is enabled for all AWS resources that are created by Terraform:
- PostgreSQL databases
- S3 buckets
- EFS (Elastic File System)
- CloudWatch logs
- Backup Vault
Credentials can be rotated manually:
Open the secret in the Secrets Manager console and change the passwords.
Fill in the placeholders `<namespace>` and `<path_to_kubeconfig>` and run the following command to remove SIMPHERA from your Kubernetes cluster:
```shell
helm delete simphera -n <namespace> --kubeconfig <path_to_kubeconfig>
```
Reinstall the SIMPHERA Quickstart Helm chart so that all Kubernetes pods and jobs retrieve the new credentials. Important: During credential rotation, SIMPHERA will not be available for a short period.
Name | Version |
---|---|
terraform | >= 1.1.7 |
aws | >= 3.72, < 5.0.0 |
helm | >= 2.4.1 |
kubernetes | >= 2.10 |
Name | Version |
---|---|
aws | 4.67.0 |
kubernetes | 2.23.0 |
Name | Source | Version |
---|---|---|
eks | git::https://github.com/aws-ia/terraform-aws-eks-blueprints.git | v4.32.1 |
eks-addons | git::https://github.com/aws-ia/terraform-aws-eks-blueprints.git//modules/kubernetes-addons | v4.32.1 |
security_group | terraform-aws-modules/security-group/aws | ~> 4 |
simphera_instance | ./modules/simphera_aws_instance | n/a |
vpc | terraform-aws-modules/vpc/aws | v3.11.0 |
Name | Description | Type | Default | Required |
---|---|---|---|---|
cloudwatch_retention | Global cloudwatch retention period for the EKS, VPC, SSM, and PostgreSQL logs. | `number` | `7` | no |
cluster_autoscaler_helm_config | Cluster Autoscaler Helm Config | `any` | `{}` | no |
enable_aws_for_fluentbit | Install FluentBit to send container logs to CloudWatch. | `bool` | `false` | no |
enable_ingress_nginx | Enable Ingress Nginx add-on | `bool` | `false` | no |
enable_patching | Scans license server EC2 instance and EKS nodes for updates. Installs patches on license server automatically. EKS nodes need to be updated manually. | `bool` | `false` | no |
gpuNodeCountMax | The maximum number of nodes for gpu job execution | `number` | `12` | no |
gpuNodeCountMin | The minimum number of nodes for gpu job execution | `number` | `0` | no |
gpuNodeDiskSize | The disk size in GiB of the nodes for the gpu job execution | `number` | `100` | no |
gpuNodePool | Specifies whether an additional node pool for gpu job execution is added to the kubernetes cluster | `bool` | `false` | no |
gpuNodeSize | The machine size of the nodes for the gpu job execution | `list(string)` | `[` | no |
gpuNvidiaDriverVersion | The NVIDIA driver version for GPU node group. | `string` | `"535.54.03"` | no |
infrastructurename | The name of the infrastructure. e.g. simphera-infra | `string` | `"simphera"` | no |
install_schedule | 6-field Cron expression describing the install maintenance schedule. Must not overlap with variable scan_schedule. | `string` | `"cron(0 3 * * ? *)"` | no |
kubernetesVersion | The version of the EKS cluster. | `string` | `"1.22"` | no |
licenseServer | Specifies whether a license server VM will be created. | `bool` | `false` | no |
linuxExecutionNodeCountMax | The maximum number of Linux nodes for the job execution | `number` | `10` | no |
linuxExecutionNodeCountMin | The minimum number of Linux nodes for the job execution | `number` | `0` | no |
linuxExecutionNodeSize | The machine size of the Linux nodes for the job execution | `list(string)` | `[` | no |
linuxNodeCountMax | The maximum number of Linux nodes for the regular services | `number` | `12` | no |
linuxNodeCountMin | The minimum number of Linux nodes for the regular services | `number` | `1` | no |
linuxNodeSize | The machine size of the Linux nodes for the regular services | `list(string)` | `[` | no |
maintainance_duration | How long in hours for the maintenance window. | `number` | `3` | no |
map_accounts | Additional AWS account numbers to add to the aws-auth ConfigMap | `list(string)` | `[]` | no |
map_roles | Additional IAM roles to add to the aws-auth ConfigMap | `list(object({` | `[]` | no |
map_users | Additional IAM users to add to the aws-auth ConfigMap | `list(object({` | `[]` | no |
scan_schedule | 6-field Cron expression describing the scan maintenance schedule. Must not overlap with variable install_schedule. | `string` | `"cron(0 0 * * ? *)"` | no |
simpheraInstances | A list containing the individual SIMPHERA instances, such as 'staging' and 'production'. | `map(object({` | `{` | no |
tags | The tags to be added to all resources. | `map(any)` | `{}` | no |
vpcCidr | The CIDR for the virtual private cloud. | `string` | `"10.1.0.0/18"` | no |
vpcDatabaseSubnets | List of CIDRs for the database subnets. | `list(any)` | `[` | no |
vpcPrivateSubnets | List of CIDRs for the private subnets. | `list(any)` | `[` | no |
vpcPublicSubnets | List of CIDRs for the public subnets. | `list(any)` | `[` | no |
Name | Description |
---|---|
account_id | The AWS account id used for creating resources. |
backup_vaults | Backup vaults from all SIMPHERA instances. |
database_endpoints | Endpoints of the SIMPHERA and Keycloak databases from all SIMPHERA instances. |
database_identifiers | Identifiers of the SIMPHERA and Keycloak databases from all SIMPHERA instances. |
eks_cluster_id | Amazon EKS Cluster Name |
s3_buckets | S3 buckets from all SIMPHERA instances. |