Highly available cluster with multiple nodes #1571

tapaszto · 2021-05-10T11:56:06Z

We are trying to set up a highly available Atlantis cluster with multiple nodes for prod environment and currently testing with two nodes behind a load balancer. In order to have the nodes with the same data/status we deployed Atlantis data folder as a common file share (Azure files) and mounted this share to both nodes, but unfortunately both nodes start to fail and send application exceptions that I attached.

Questions:
Can the same set of data files be shared among multiple Atlantis server instances as we envisioned?
Is this issue due to a specific file locking mechanism of Atlantis?
Can this issue be fixed with a small code change, or would it require substantial changes? We intend to put development effort into it if it is easily achievable.
Generally, what is the advice/best practice for a highly available Atlantis environment with multiple nodes?

jamengual · 2021-05-10T17:56:55Z

Atlantis was not designed to be set up with multiple nodes.
I think this will require a significant amount of code changes to be achieved. Basically, you will have to replace boltdb with some distributed DB and make tons of code changes to be able to sync properly, keep status, etc.

Usually, Atlantis users do not have highly available Atlantis servers; they have one big instance or multiple instances handling different webhook integrations (maybe behind the same LB).

Now, about the reason for having such a setup: why does it need to be HA? Infra as code is not a service, so it does not have service dependencies, meaning it does not need to be "up".

acastle · 2021-05-10T23:52:15Z

As Pepe mentioned, our reliance on BoltDB is really the limiting factor here. Bolt is intended to be used as an embedded database for applications and cannot be safely shared between processes. There have been a few PR discussions around creating a unified abstraction over database access to allow for pluggable database providers, but to my knowledge no work has been done yet.

I believe it would be possible to run multiple atlantis instances with project configuration to limit each instance to only handling a subset of files but it is not possible to run multiple instances that function as one server.
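
For anyone wondering what "cannot be safely shared between processes" looks like mechanically, here is a minimal sketch. It assumes the database file is guarded by an exclusive advisory file lock (flock on Unix-like systems) with a timeout, which is how I understand Bolt behaves; `open_exclusive` and `demo` are hypothetical names for illustration only, not Atlantis or BoltDB APIs. It is written in Python for brevity, and lock behaviour on network file systems may differ from a local disk.

```python
import fcntl
import os
import tempfile
import time


def open_exclusive(path, timeout=1.0, poll=0.05):
    """Open path and take an exclusive advisory lock, or raise TimeoutError."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Non-blocking attempt; fails while another opener holds the lock.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            if time.monotonic() >= deadline:
                os.close(fd)
                raise TimeoutError(f"timeout locking {path}")
            time.sleep(poll)


def demo():
    """Return True if a second opener is blocked while the first holds the lock."""
    path = os.path.join(tempfile.mkdtemp(), "atlantis.db")
    first = open_exclusive(path)
    try:
        second = open_exclusive(path, timeout=0.2)
        os.close(second)
        blocked = False
    except TimeoutError:
        blocked = True
    # Closing the descriptor releases the lock, so a later opener succeeds.
    os.close(first)
    third = open_exclusive(path, timeout=0.2)
    os.close(third)
    return blocked
```

Under these assumptions, a second instance pointed at the same file simply waits and then gives up, which would match a startup timeout rather than data corruption.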

tapaszto · 2021-05-11T12:21:30Z

Hi @jamengual & @acastle,

As I can see there is a Locker interface and it's implemented by boltdb.go, this persists the state into atlantis.db file.
Can we achieve our goal by creating a new implementation of the Locker interface which connects to a distributed DB like Azure Cosmos DB?
Are there any other server state files besides atlantis.db?
What other tasks are required besides the new implementation of the Locker interface? E.g. new configuration settings, initiating the new implementation instead of the current one according to specific server setting, etc.
Can you provide us a more granular work item list please? We are trying to have a better understanding to be able to estimate the required effort of this development.

jamengual · 2021-05-12T16:32:01Z

This is a lot of work just to itemize the needed changes and right now this is out of scope for us.

lkysow · 2021-05-13T00:07:02Z

#265 (comment) talks about some of the work. The locker isn't the hard part. It's the reliance on the filesystem for storing plans and for knowing which PRs are in progress.

jasonrberk · 2021-07-30T15:44:52Z

in the docs

Atlantis has no external database. Atlantis stores Terraform plan files on disk. If Atlantis loses that data in between a plan and apply cycle, then users will have to re-run plan. Because of this, you may want to provision a persistent disk for Atlantis.

I set up EFS and specified ATLANTIS_DATA_DIR as the mount. My first instance started fine, but when I made some other changes, Fargate started the second instance before the first instance got killed, which failed with

Error: initializing server: starting BoltDB: timeout (a possible cause is another Atlantis instance already running)

so my question is, can a BoltDB created from instance "A" on EFS be picked up and used by instance "B"?

I think we can make FG completely kill the old container before starting the new one. If so, all the locks will be available to the new instance, so devs don't have to "re-plan". But will Bolt have issues with that design?

dohnto · 2022-04-28T14:20:23Z

We currently run a single Atlantis instance, but I landed here because I was considering the Provider Plugin Cache for Atlantis, and it explicitly mentions that

Note: The plugin cache directory is not guaranteed to be concurrency safe. The provider installer's behavior in environments with multiple terraform init calls is undefined.

so I got interested whether it is possible to run multiple replicas of atlantis and whether the cache should be per replica or not.

The answer for me is that I don't need to consider multiinstance scenario just yet - I think I might not be the only one and it would be worth mentioning in the Atlantis Docs, that it is expected to run just a single Atlantis replica.

EDIT: I am also wondering how one can run Atlantis as Kubernetes Deployment, where it is not guaranteed, that there will always be just a single replica.

tapaszto · 2022-11-03T14:26:50Z

Hi @jamengual & @acastle,

I would like to follow up this topic as the Redis locking DB is available now. Referring to my original question, is it feasible to use Atlantis with multiple nodes as of now? I can envision a two tenant "cluster" environment with an active and passive node, the locking DB is hosted in Azure Redis and the working directory is on a shared drive. Only one node is active at any time period in order to avoid any interference, a load balancer (e.g. Azure Traffic Manager) would monitor the active node and the nodes could be swapped in case of any issue of the active one. Is this design feasible?

jamengual · 2022-11-03T22:53:10Z

So to have HA with Atlantis using Redis you still need a way to share the Atlantis data dir between containers. If you do that you can have active-active containers, and some people are already running like that.

tapaszto · 2022-11-03T23:12:44Z

Sharing the data dir is easily achievable in our Azure environment. But won't we have any issues when multiple active nodes are writing the same data dir files? Does the current Atlantis design exclude this?

jamengual · 2022-11-03T23:24:10Z

No, because the lock is now on Redis (if you enable it).

nitrocode · 2022-11-07T20:09:53Z

To close this ticket, I think some official docs are needed on how redis locking can be used to spin up more than one instance/pod of atlantis

jamengual · 2022-11-07T20:59:01Z

I agree.

gartemiev · 2023-01-16T11:02:05Z

Any news when this will be officially documented regarding implementation? This also blocks: terraform-aws-modules/terraform-aws-atlantis#322

nitrocode · 2023-01-16T16:26:18Z

@gartemiev none. This is an open source project and we depend 100% on user contributions. Please feel free to try out this feature, experiment, and see what works. If you can get it working and document it, everyone would appreciate it.

jamengual · 2023-01-16T17:46:11Z

did you enable Redis locking? are you running parallel plans and applies?

albertorm95 · 2023-03-02T11:21:33Z

So in order to have HA in Atlantis:

Load Balancer in front of Atlantis "Cluster"
The actual Atlantis "Cluster" in our scenario ECS
Share disk space between all Nodes/Containers in the Cluster
Switch locking-db-type to Redis

I'll test this

Relates to:

External Locking DB: Redis #2491 (comment)
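
To illustrate why moving the lock out of the local BoltDB file matters for this checklist, here is a rough sketch of a lock store that every node can reach. It is an in-memory stand-in, not a Redis client and not Atlantis code: `ProjectKey`, `SharedLockStore`, and the (repo, path, workspace) key shape are assumptions made up for the example.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectKey:
    """Hypothetical lock key: one lock per repo, directory, and workspace."""
    repo: str
    path: str
    workspace: str


class SharedLockStore:
    """In-memory stand-in for a lock store shared by all nodes."""

    def __init__(self):
        self._locks = {}
        self._mutex = threading.Lock()

    def try_lock(self, key, pull_number):
        """Return (acquired, holder); only one pull request can hold a key."""
        with self._mutex:
            holder = self._locks.setdefault(key, pull_number)
            return holder == pull_number, holder

    def unlock(self, key, pull_number):
        """Release the key if this pull request holds it; return True on success."""
        with self._mutex:
            if self._locks.get(key) == pull_number:
                del self._locks[key]
                return True
            return False
```

Whichever node receives the webhook asks the same store, so two pull requests cannot plan the same project at once. Note this only covers locks; the cloned repos and plan files are a separate problem.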

nitrocode · 2023-03-02T13:30:04Z

You may not need to share disk space. I'm unsure of this since i haven't tested it, but it's possible that redis is housing not only the lock but possibly the plans as well.

Please test with shared disk space and without. This will be handy in documentation on the website

albertorm95 · 2023-03-03T16:21:39Z

NFS or any shared file system is required since the plans are NOT stored in Redis.

There were some weird behaviours like this error being shown in multiple atlantis instances and multiple times in some of them, this might not be related to the solution:

{"level":"error","ts":"2023-03-03T16:04:32.624Z","caller":"logging/simple_logger.go:163","msg":"invalid key: b5bacfe9-e187-4e6b-af0a-d169b785e0a2","json":{},"stacktrace":"github.com/runatlantis/atlantis/server/logging.(*StructuredLogger).Log\n\tgithub.com/runatlantis/atlantis/server/logging/simple_logger.go:163\ngithub.com/runatlantis/atlantis/server/controllers.(*JobsController).respond\n\tgithub.com/runatlantis/atlantis/server/controllers/jobs_controller.go:92\ngithub.com/runatlantis/atlantis/server/controllers.(*JobsController).getProjectJobsWS\n\tgithub.com/runatlantis/atlantis/server/controllers/jobs_controller.go:70\ngithub.com/runatlantis/atlantis/server/controllers.(*JobsController).GetProjectJobsWS\n\tgithub.com/runatlantis/atlantis/server/controllers/jobs_controller.go:83\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2109\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\tgithub.com/gorilla/[email protected]/mux.go:210\ngithub.com/urfave/negroni/v3.Wrap.func1\n\tgithub.com/urfave/negroni/[email protected]/negroni.go:59\ngithub.com/urfave/negroni/v3.HandlerFunc.ServeHTTP\n\tgithub.com/urfave/negroni/[email protected]/negroni.go:33\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/[email protected]/negroni.go:51\ngithub.com/runatlantis/atlantis/server.(*RequestLogger).ServeHTTP\n\tgithub.com/runatlantis/atlantis/server/middleware.go:70\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/[email protected]/negroni.go:51\ngithub.com/urfave/negroni/v3.(*Recovery).ServeHTTP\n\tgithub.com/urfave/negroni/[email protected]/recovery.go:210\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/[email protected]/negroni.go:51\ngithub.com/urfave/negroni/v3.(*Negroni).ServeHTTP\n\tgithub.com/urfave/negroni/[email protected]/negroni.go:111\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2947\nnet/http.(*conn).serve\n\tnet/http/server.go:1991"}

Also, with multiple Atlantis instances behind the LB, when you try to look at the logs of a plan you might or might not get them, since the request is being load balanced 😅. With some flags, maybe the Atlantis logs could also be "centralized" so any Atlantis instance can show you the logs.

Also, at least using NFS it felt slow, so maybe looking into storing the plans in Redis could improve this 👀

cc: @nitrocode

nitrocode · 2023-03-03T17:23:42Z

Ooofa... Thanks for the update

anhdle14 · 2023-03-22T13:26:16Z

How about a failover mechanism with a shared PV/PVC? I don't think HA with multiple nodes is a good way to solve this for Terraform and Atlantis, because normally only 1 worker at any time can execute the plan/apply against any tfstate.

So how about supporting a failover mechanism like:

/webhook -> instance 1
instance 1 went down
/webhook -> instance 2 with same setup.

I think a standby instance could just solve this easily.
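
A tiny sketch of the routing rule I have in mind, assuming some health check exists in front of the instances. `pick_backend` and the instance names are hypothetical; in practice this decision would live in the load balancer, not in Atlantis.

```python
def pick_backend(backends, is_healthy):
    """Return the first healthy backend in priority order, or None if all are down."""
    for backend in backends:
        if is_healthy(backend):
            return backend
    return None


# Example: instance-1 is preferred; instance-2 only receives /webhook when it is down.
health = {"instance-1": False, "instance-2": True}
target = pick_backend(["instance-1", "instance-2"], lambda name: health[name])
```

Only one instance receives webhooks at any time, so the two never write to the shared volume concurrently.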

jukie · 2023-03-22T13:45:36Z

If you run in Kubernetes or an Autoscaling group of 1 you'd already get that experience though @anhdle14.
Having multiple nodes would mean zero downtime and give the ability to distribute work if you have multiple projects/repos managed by atlantis.

anhdle14 · 2023-03-22T15:07:58Z

yeah that is true, I was thinking more of a failover scenario when cluster went down for a particular zone/region. I think for my case I actually need to have the deployment on multiple clusters but this should work for a single cluster deployment.
Also, my solution to multiple projects/repos is to have multiple deployments, for isolation / blast radius and multi-tenancy.

atlantis.example.com/team point to different deployment etc...

Dilergore · 2023-04-12T08:36:13Z

I am working with @tapaszto who originally opened this thread. Since I can see there are some recent comments here let me share my thoughts.

We have been using Atlantis for almost 3 years. First, we were hosting it in an ACI then migrated to App Service and right now we are discussing moving to AKS. Since the Redis option became available for the lock DB we were planning to make our environment more resilient. The ultimate goal would be to have a multi-zone and multi-region active-active-active deployment.

Our preference would be to stay on App Service, that said there are certain storage limitations there. Since the repo content is still stored on disk, the disk needs to be shared across the nodes. For that either we use Azure Files (SMB) or Blobfuse (with AKS), but both of these are at least 5x slower than writing the content to a local disk. These are not options sadly because of the performance.

AKS is now offering shared ZRS Managed Disk support which we are actively exploring. This might solve the zone redundancy requirement if we move to AKS but still will not solve the geo-redundancy requirement. For now, we are considering a primary-secondary (active-passive) deployment, potentially sharing the locking database across regions but not the files as there is no technical solution for that.

I think that the next step for this project when it comes to resiliency is to have a solution for the git content/plan files.

jukie · 2023-04-12T16:47:49Z

I was pointed to the https://github.com/lyft/atlantis fork which makes use of temporal workflows. That would be a heavy lift to pull in but something fully distributed like that is what I'd prefer vs NFS shares.

Dilergore · 2023-04-13T06:19:10Z

@jukie - Thanks for sharing this, I spent some time going over this and I definitely have some questions and thoughts.

I went through the README of project Neptune. I can see how potentially Temporal and its engine would solve failures and would enable HA even across regions.

That said, it would be great to understand whether project Neptune is just a fork which is planned to be used within Lyft, or whether there is a plan to merge it back upstream in some shape or form as a new major version in the future.

It seems the Neptune workflow is targeting Terraform actions happening after a PR merge. This is a big behavioral change which (at least for us) would not be the preferred way of handling deployments. There might be some edge cases, but the majority of our deployments must happen the way they happen today for consistency purposes: the code cannot be merged before a terraform apply succeeds. For the type of workflow that Neptune tries to cover we already have options like CI/CD pipelines.

Do not get me wrong, this is useful and I see the value, but this fork is raising a lot of questions in my mind and it would be really great to see what is the future of the upstream version of Atlantis.

jamengual · 2023-04-13T16:59:00Z

@nishkrishnan from Lyft may have some comments about this too.

I think Atlantis is great, but it is lacking in a few areas when it comes to enterprise deployments and very busy deployments. The current workarounds work, but at its core Atlantis was not built to be highly available, and that is a requirement for some companies.

I'm not opposed (but I'm not the only maintainer) to expanding the Redis usage, or maybe even bringing some of the Lyft work upstream, with the modifications needed to keep the current flow, into an Atlantis 2.0 for example.

This kind of effort will need coordination (which I'm willing to provide) and multiple people working actively/committed to this effort.

The possibility of multiple companies contributing to this is possible too.

@nitrocode @GenPage

nishkrishnan · 2023-04-13T19:44:22Z

Hey, i can speak a little bit about Lyft.

We completely rearchitected Atlantis to the point that a lot of the original stuff in there is pretty much unused/deleted. Atlantis in its current state is great in terms of flexibility, but that's a double-edged sword and especially impacts the testability and iterative development of the product. So in order to ease the rearchitecture and simplify things a bit, we've made an opinionated version, in a way, with fewer features but enough to POC the new backend in a reasonable amount of time.

That said, I don't believe it's worth it to try and re-integrate with upstream given the divergence. I see it as a new product entirely with a heavier dependency tree (ie. Temporal). Lyft initially wanted to have another repo in the Atlantis org owned by us where we could own, build and iterate on this version, but there were some political differences that stopped us, so we kept our work in our fork.

As for what we plan to do with it, i think that depends on general interest. I'd love to hear from the community about their usecases/setups etc. I'm usually out and about in the Atlantis slack channel so feel free to hmu and we can chat.

Dilergore · 2023-04-14T05:26:45Z

I like the idea building the platform on Temporal, as I mentioned above, I totally see the benefits.

When it comes to use cases and setups, I will try to collect some of ours:

Zone and Geo redundancy
Support for Azure DevOps
Custom workflow capabilities
Kind of base workflow as what we have in the upstream Atlantis (PR comments and apply before merge)

I think the above are the most important ones. In my opinion there should be a tactical and a strategic solution. The tactical could be supporting Redis for the file store, while I could easily imagine a strategic end goal for a new more sophisticated major version maybe based on Temporal and pulling some of the Lyft code in.

Let me know your thoughts.

jamengual · 2023-04-14T05:39:46Z

I agree, but whatever is built needs to still support the current VCS types we have and a streamlined configuration method (so we don't have 150 flags) with the major and most popular options, which the Lyft fork does not support.

As for the geo settings that I think can be achieved at the infrastructure level so if just a HA version is built we can improve from that.

Dilergore · 2023-04-14T08:49:38Z

Definitely, If we choose Redis for the file backend, that would solve the zone/geo requirement.

https://redis.com/redis-enterprise/technology/active-active-geo-distribution/

GenPage · 2023-04-14T17:40:06Z

I think there is something to work on here that can address Atlantis without having a massive lift and shift of the backend like Lyft chose to do with Temporal. As Nish said, it's an entirely new product at that point.

My focus as of late as a new maintainer is trying to organize the project at a higher level after a transition in maintainers. We've set up a new Google Group and calendar event for Office Hours to try and organize around the community on key pain point areas that Atlantis is lacking.

The core issue we have is that we need to be backward compatible to a certain degree and our release process needs to reflect that. As Pepe mentioned, the over-abundance of configuration flags shows the feature set fracturing when there is not a clear direction on what problems Atlantis is trying to solve.

You'll see some structure around higher-level objectives that the community as a whole is experiencing coming soon as we try to organize the community, especially around reliability and scalability. For example, there have been numerous regressions lately due to new features that attempt to solve an edge case for one user but break entire features for the rest of the community.

We have to remember as an open-source project we do not have the time or resources to compete with paid offerings and should not. We will never reach feature parity at the same level of quality. We need to be focusing on core workflows/features that address the majority of the needs of the community.

GenPage · 2023-04-14T17:44:59Z

For the sake of this issue, I see these things being bottlenecks for HA:

Control Plane
1. State (currently solved by Redis instead of internal boltDB)
2. Routing Events
  a. Incoming Webhooks
  b. API plan/applies
Data Plane
1. git clone
2. terraform plans
3. project locks

A proposal from the community is welcome for solving these, as it is a bigger architectural shift than the current single binary design. I will set up some templates soon for proposals (taking influence from other OSS/CNCF/Sig projects). Let me know what you all think.

mattb18 · 2023-04-14T17:58:04Z

I'd be interested in this functionality, but from more of a scaling perspective rather than availability.

The Atlantis instance at my company is inactive 99% of the time (which I imagine is the case for most). Due to this we run Atlantis in a micro GCP instance. This generally works fine until you have a large Terraform project to plan or a few parallel plans at the same time (where the instance becomes CPU bound).

Ideally I'd like to run the 'scheduler' side of things in the micro VM that basically just handles webhooks and the Redis queue. In addition to this there would then be an autoscaling pool of Atlantis workers which pick up plans and applies from the Redis queue. If the plans were kept in Redis maybe this would work? For my use case it wouldn't matter too much if there is a short initialisation time for the worker to pull the repo etc.
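
Roughly what I am picturing, as a sketch only. Atlantis has no such scheduler/worker mode as far as this thread establishes; the in-process `queue.Queue` stands in for the Redis queue, and `run_workers` and the job names are made up for illustration.

```python
import queue
import threading


def run_workers(jobs, handle, worker_count=3):
    """Drain jobs with a pool of worker threads and return (worker, job, outcome) tuples."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)

    results = []
    results_lock = threading.Lock()

    def worker(name):
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return  # Nothing left to do; a real worker would scale down here.
            outcome = handle(job)
            with results_lock:
                results.append((name, job, outcome))

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

The scheduler only enqueues and stays cheap, while the CPU-heavy plan/apply work runs wherever a worker is free.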

Happy to help with contributions where required.

jstewart612 · 2023-04-20T17:02:32Z

So, we got plans/applies/locks to Redis, which is great, but still compel ReadWriteOnce in the helm chart for the PVC template on the StatefulSet? Any reason for that? If plans/applies/locks can now go to Redis, aren't the only filesystem objects the Terraform binaries it downloads, plus cloned git repositories, with the locking and concurrency governed by the plans/applies/locks managed by Redis entries?

If making that configurable in the Helm chart is the only blocker there, I can have a PR up in five minutes. I just want to make sure I'm not missing something important before I put one up.

jamengual · 2023-04-20T17:19:08Z

there is still information in boltDB ( the atlantis db) that was not migrated to redis because it required a lot of other code changes. so Redis locking is not enough by itself to make it HA, you will have other issues after that.

jstewart612 · 2023-04-20T17:55:47Z

@jamengual dreams, dashed!

Ah well. Will continue to keep eyes-on this issue and hope :)

yasinlachiny · 2023-04-21T09:14:33Z

We are using Atlantis as a single pod in EKS but the problem is when we wanna upgrade EKS cluster it kills the Atlantis pod and the pipeline fails. If we could use a disruption budget we could solve the problem but we need an Atlantis cluster.

Running Atlantis in EC2 also can solve the problem but we are trying to just use EKS not other services.

Dilergore · 2023-04-21T15:29:32Z

@jamengual - Maybe I am missing something here. I thought that if you use Redis there is no BoltDB file in the file system. You are right that there are other issues, like the previously mentioned repository content that comes from git.

jamengual · 2023-04-21T16:27:45Z

I'm no expert on this part of the code, but I remember there were other things stored in BoltDB; it is not just the locks that are stored there.

jukie · 2023-04-21T18:19:27Z

Only the locks are stored in redis, not plan files yet as far as I'm aware.

Dilergore · 2023-04-24T06:12:08Z

The plan files were never stored in the database. The plan files are on the filesystem. I just deployed an instance with Redis and I cannot see the BoltDB file.

Dilergore · 2023-04-28T08:37:49Z

I am doing some testing on my end, so let me add some further details.

Lock resiliency is currently resolved by Redis. When you enable Redis there will not be a file based (BoltDB) locking database created.
If you are hosting your instances on Kubernetes, you can deploy the replicas in a StatefulSet so every replica will have its own disk to store the repo content / plan files. With a clever ingress rule you can persist sessions for the incoming webhooks by setting the following annotations for Nginx:

Annotations

The below is for Azure DevOps. It parses the JSON body of the hooks and extracts the PR URL as a "key" to base the session persistence on. Nginx then hashes this value and, based on the number of instances, sends the traffic to a certain backend. This is deterministic, so as long as you have the same number of instances behind it, the session will persist and will be bound to the same instance.

nginx.ingress.kubernetes.io/upstream-hash-by: "$session_id"
nginx.ingress.kubernetes.io/configuration-snippet: |
      # Define Lua function to extract session ID from JSON request body
      set $session_id "";

      access_by_lua_block {
        -- Default to a random value so that we balance randomly if there is nothing to extract
        ngx.var.session_id = math.random()

        -- Read the body. Required so that get_body_data() has something to return
        ngx.req.read_body()
        -- cjson.safe returns nil on invalid JSON instead of raising an error
        local json = require "cjson.safe"
        local req_body = ngx.req.get_body_data()

        -- Extract the resource.url property from the JSON after parsing it
        if req_body then
          local body_table = json.decode(req_body)
          if type(body_table) == "table" then
            if type(body_table.resource) == "table" then
              if body_table.resource.url then

                -- Set the PR URL for session ID
                ngx.var.session_id = body_table.resource.url
              end
            end
          end
        end
      }
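
For readers who do not speak Lua, the same idea can be sketched in Python. This is a simplification: the hash-modulo step below is not the algorithm Nginx actually uses for upstream-hash-by, and `session_key` / `pick_replica` are hypothetical names. It only shows why the same PR URL keeps landing on the same replica while the replica count stays fixed.

```python
import hashlib
import json


def session_key(raw_body):
    """Return resource.url from a webhook body, or None if it cannot be extracted."""
    try:
        payload = json.loads(raw_body)
    except (TypeError, ValueError):
        return None
    if not isinstance(payload, dict):
        return None
    resource = payload.get("resource")
    if not isinstance(resource, dict):
        return None
    url = resource.get("url")
    return url if isinstance(url, str) else None


def pick_replica(key, replica_count):
    """Map a key to a replica index; stable only while replica_count is unchanged."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % replica_count
```

With this kind of mapping, changing the number of replicas can move a PR to a different replica, which is the reason for the "same number of instances" condition above.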

Then there are still a few caveats:

UI based unlocks
- If a user hits the unlock button on the Atlantis UI that command will be likely executed on a node where the repo files and plans are not present.
- This will likely cause no failure, it will remove the entries from the lock DB but we end up having garbage files on the node which originally handled the requests.
- Never tested, but under certain circumstances this might cause some inconsistency (if the PR gets abandoned and reopened)
- IMO this is a minor issue which we could live with until we have a better solution
Job streaming
- If a user on the portal of the VCS clicks the button to see the live stream of the job.
- The session will likely go to a node where the job is not running.
- While this causes no issue, and with only a few nodes a couple of browser refreshes would likely land you on the proper node, it is bad for user experience. Our users heavily rely on this feature by now, so it is not acceptable in production.
API endpoints for plan/apply
- Never tested as we do not rely on this functionality as of now.
- That said, I think something in the body of the POST request could be used for key hashes for session persistence, similarly to the above mentioned VCS provider hooks.

An alternative option which I considered was to write a wrapper around the entire thing. The wrapper would be the "frontend" of the app which then would handle persistence based on its own database. While I do not like the idea of maintaining something extra and homegrown, theoretically it would work except again some caveats:

There is no List/Get API to retrieve data from the lock DB
- Minor issue, could be handled in the wrapper and its database
The database currently stores no information on what host executed previous plans/applies.
- Minor issue, could be handled in the wrapper and its database
Jobs have UUIDs in the URL and there is no way to retrieve any information on them, as it all happens in memory. Even with a wrapper in front, routing the requests for jobs is not possible because of this non-deterministic behavior.
- Major issue as we heavily rely on this functionality

jamengual · 2023-04-28T17:05:11Z

Your setup might work on K8s with nginx ingress, but if you were in AWS using ECS you would need to use WAF to do that, and that is just not the best solution IMHO. Anything that requires you to look at the body of a request to decide where to send it is neither scalable nor easy to maintain, nor is it good practice from an application development perspective.

As you noted about users hitting the logs page and having to refresh to find the right server, that is something that will have to be fixed in Atlantis. Once that is fixed, you are one step closer to Atlantis being able to know which server a request belongs to (disk not shared) and send it over.

If the Atlantis API offered a way to push the received event on to the right server, that would solve that issue. (As you noted too, it would help if there were a way to get the jobID+server from the API in order to send the request.)

Redis solves some issues, but some data is not there. If we could store the job ID, the workspace status (PR repo clones), and such, we could maybe make the API more powerful to help redirect the calls to the right server.
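
A rough sketch of that idea, under loud assumptions: nothing like this exists in Atlantis today, the `JobRegistry` class and `route` function are hypothetical names, and a plain dict stands in for Redis so the example runs on its own.

```python
from typing import Dict, Optional, Tuple


class JobRegistry:
    """Maps a job ID to the server that runs it.

    A dict stands in for a shared store such as Redis (assumption for this sketch).
    """

    def __init__(self) -> None:
        self._owner: Dict[str, str] = {}

    def register(self, job_id: str, server: str) -> None:
        # Called by the server that starts the job.
        self._owner[job_id] = server

    def lookup(self, job_id: str) -> Optional[str]:
        return self._owner.get(job_id)


def route(registry: JobRegistry, job_id: str, local_server: str) -> Tuple[str, Optional[str]]:
    """Decide what the server that received the request should do with it."""
    owner = registry.lookup(job_id)
    if owner is None:
        return ("not_found", None)
    if owner == local_server:
        return ("serve_locally", owner)
    return ("forward", owner)
```

With something like this, whichever server receives a request for a job log could forward it to the owner instead of the user refreshing until the load balancer picks the right one.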

Dilergore · 2023-05-02T09:37:54Z

@jamengual - We have a hard requirement to increase the resiliency of our internal setup by the end of this year. I 100% agree with you that using the ingress to do this is not ideal, nor should it be done in a perfect world, but I just cannot rule out any workaround at this point due to the requirements. The above was meant as a summary for everyone who has similar requirements or wants to understand what the issues are.

I agree with you that most of this should be solved on the Atlantis code level and not with infra workarounds.

jamengual · 2023-05-02T15:55:05Z

Totally, @Dilergore. We all have constraints in our companies and I understand where you are coming from. I had to do something similar using AWS WAF due to business requirements.

Pardeep009 · 2023-12-30T07:24:58Z

Hi Folks!
I have implemented a multi-node setup of Atlantis in my organisation and have written a Medium blog post about it, hoping it might help.

nthienan · 2024-01-01T12:22:18Z

Awesome, @Pardeep009. Thanks for sharing this. It's very useful. Hopefully, your changes for problem #2.2 will get merged into Atlantis officially.

johnjelinek · 2024-01-05T18:24:22Z

@Pardeep009: is there a PR open here to incorporate your changes for the additional lock?

Pardeep009 · 2024-01-05T18:28:10Z

There is one in draft state. It requires more work before it can be raised for review.

johnjelinek · 2024-01-05T18:34:52Z

I meant: is there a PR upstream here instead of in your fork? I like where your idea is heading. I wonder if @jamengual had a reason to keep this lock separate from the other lock that allows you to configure a backend.

Pardeep009 · 2024-01-05T18:39:46Z

No, there is no PR in the upstream here.

jamengual · 2024-01-05T18:40:47Z

I did not code the original lock implementation.

johnjelinek · 2024-01-05T18:44:30Z

@Pardeep009: maybe it would be good for you to explain in a new issue what your thought process is (you can re-use the points from your blog post). It'd be good to get some feedback from the atlantis engineering team.

jmbravo · 2024-02-09T10:42:22Z

@Pardeep009 but this solution is not valid for an EKS StatefulSet, because even if you have EFS, a PVC is created for each replica, so the replicas end up with different volumes.

GMartinez-Sisti · 2024-11-15T21:34:53Z

Reviving this thread 😁

The helm chart currently supports using a storage class that can be configured as EFS. So one less blocker.

nitrocode mentioned this issue Nov 4, 2022: Add high availability using redis for locking terraform-aws-modules/terraform-aws-atlantis#322 (Closed)

dupuy26 mentioned this issue Mar 2, 2023: Atlantis command to disable (or enable?) auto-merge persistently for a particular PR #3181 (Open)

jamengual mentioned this issue Sep 28, 2023: Multiple atlantis instances - Installation using Helm chart #3795 (Open)

dosubot bot added the Stale label Oct 15, 2024
