API: Extremely Poor Docker Resource Utilization Efficiency #2730

Open
palisadoes opened this issue Dec 2, 2024 · 115 comments
Assignees
Labels
bug Something isn't working good first issue Good for newcomers

Comments

@palisadoes
Contributor

Describe the bug

We run a demonstration instance of Talawa-API on a GoDaddy VPS server running Ubuntu. It has the following resources:

  1. 1 core
  2. 2 GB of RAM
  3. 40 GB of disk

Other information:

  1. The demo instance is intended to create an evaluation environment for new GitHub contributors and users alike as they decide whether to use Talawa. The DB of the demo instance gets reset every day.
  2. Talawa API runs natively on this VPS server with acceptable performance with one user. The load average is approximately 1, which is the target value for a system with only 1 core.
  3. When Talawa API runs on the server using Docker, the load average reaches 130 and the swap process is the top CPU resource user. The system is so overloaded that only one SSH session at a time is achievable.
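
To make the comparison between the two load averages concrete, here is a minimal sketch that normalizes a load average by core count. The function names and the threshold of 1.0 per core are illustrative assumptions, not part of Talawa or any existing tooling.

```python
def load_per_core(load_average: float, cores: int) -> float:
    """Return the load average divided by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


def is_overloaded(load_average: float, cores: int, threshold: float = 1.0) -> bool:
    """Treat the host as overloaded when the per-core load exceeds the threshold.

    The default threshold of 1.0 per core is an assumption taken from the
    target value mentioned in this issue.
    """
    return load_per_core(load_average, cores) > threshold


if __name__ == "__main__":
    # Figures reported in this issue for the 1-core VPS.
    print("native:", is_overloaded(1.0, cores=1))
    print("docker:", is_overloaded(130.0, cores=1))
```

On a live Unix host the inputs could come from `os.getloadavg()[0]` and `os.cpu_count()` instead of hard-coded values.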

The purpose of this issue is to find ways to tune all Talawa-API Dockerfile and app configurations to lower its CPU and RAM utilization by at least 75%.

  1. With the current Docker performance, very few developers or end users will want to try Talawa themselves.
  2. This has been a recurring issue with Talawa API. The poor performance threatens the success of our current MongoDB-based MVP.

To Reproduce
Steps to reproduce the behavior:

  1. Run Talawa-API on a system
  2. See excessive resource utilization

Expected behavior

  1. Acceptable resource usage, such that it can run easily on a mid-range laptop without impacting its performance

Actual behavior

  1. Poor performance

Screenshots

image

Additional details
Add any other context or screenshots about the feature request here.

Potential internship candidates

Please read this if you are planning to apply for a Palisadoes Foundation internship

@palisadoes palisadoes added the bug Something isn't working label Dec 2, 2024
@github-actions github-actions bot added feature request unapproved Unapproved for Pull Request labels Dec 2, 2024
@prayanshchh
Contributor

Can you please assign me? I want to work on this issue, but I will need guidance.

@varshith257
Member

This is mostly related to reducing the Docker image size.

@varshith257 varshith257 removed the unapproved Unapproved for Pull Request label Dec 3, 2024
@palisadoes palisadoes changed the title Extremely Poor Docker Resource Utilization Efficiency API: Extremely Poor Docker Resource Utilization Efficiency Dec 4, 2024
@palisadoes palisadoes added good first issue Good for newcomers and removed feature request labels Dec 4, 2024
@prayanshchh
Contributor

prayanshchh commented Dec 6, 2024

Different ways to approach this issue:

1. Multi-Stage Builds
Using a multi-stage build can help separate the build and runtime environments, ensuring that only production-ready artifacts are included in the final image. This can be achieved by:

Installing dependencies and building the application in the first stage.
Copying only the necessary files (e.g., dist, node_modules) into a minimal runtime stage.

2. Optimizing Base Images
Switching to optimized base images can dramatically reduce size:

Baseline Image (Full Node.js): ~900 MB
Using Multi-Stage with Slim: ~400–500 MB
Using Multi-Stage with Alpine: ~250–300 MB
With Distroless: ~150–200 MB

3. Using Compression Tools
Tools like docker-slim can further compress the final image by analyzing and stripping unused dependencies and files:
With docker-slim: ~100–150 MB.
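
As a rough worked comparison of the estimates above, the sketch below computes how much smaller each option would be relative to the ~900 MB baseline. The sizes are the approximate figures quoted in this comment, not measurements, and image size is not the same thing as runtime CPU or RAM usage.

```python
BASELINE_MB = 900  # approximate full Node.js image size quoted above

# Approximate (low, high) size estimates in MB, copied from the list above.
IMAGE_SIZES_MB = {
    "multi-stage with slim": (400, 500),
    "multi-stage with Alpine": (250, 300),
    "distroless": (150, 200),
    "docker-slim": (100, 150),
}


def reduction_percent(size_mb: float, baseline_mb: float = BASELINE_MB) -> float:
    """Percentage by which an image is smaller than the baseline."""
    return (baseline_mb - size_mb) / baseline_mb * 100


def reduction_range(low_mb: float, high_mb: float) -> tuple[float, float]:
    """Return (smallest, largest) reduction; the larger image gives the smaller reduction."""
    return reduction_percent(high_mb), reduction_percent(low_mb)


if __name__ == "__main__":
    for name, (low, high) in IMAGE_SIZES_MB.items():
        worst, best = reduction_range(low, high)
        print(f"{name}: {worst:.0f}% to {best:.0f}% smaller than the baseline")
```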

Please suggest a method that doesn't impact compatibility with the codebase.

@palisadoes
Contributor Author

@prayanshchh

Please investigate the best solution and propose it after testing on your system. It's not just RAM, but also ways to reduce the CPU overhead.

@prayanshchh
Contributor

alright sir

@vasujain275
Contributor

@palisadoes @prayanshchh

The main problem I found with the API is that we have to run it in dev mode in the production Docker environment because our build process for the Talawa API is broken, so we can't use npm run start. If we resolve the build issue, we can drastically improve performance and security of the docker container.

I think @varshith257 also tried to solve the build process issue a few months back. Any updates on that?

@palisadoes
Contributor Author

Would this PR by @adithyanotfound provide any insights?

@palisadoes
Contributor Author

@vasujain275 Why do you say the build process is broken? Can you create an issue for someone else to try to fix it?

@prayanshchh
Contributor

> Would this PR by @adithyanotfound provide any insights?

Yes, this helps. I will start work on this in two days, as I have end-of-semester exams.

@prayanshchh
Contributor

I am unassigning myself from the issue due to lack of progress.

@prayanshchh prayanshchh removed their assignment Dec 14, 2024
@PurnenduMIshra129th

@palisadoes please assign me.

@PurnenduMIshra129th

@palisadoes What is the load average when the API runs without Docker, i.e. what is the baseline performance? I need this because I will focus only on improving Docker performance. Otherwise I will have to use a profiler to work out whether the issue is related to the Docker container or to the code, for example unoptimized queries.

@PurnenduMIshra129th

PurnenduMIshra129th commented Dec 17, 2024

@palisadoes For now I have limited its CPU and memory usage. I also added a multi-stage build and used a lightweight image. But I think this will only handle up to a certain number of users. To handle load effectively, can I use Kubernetes or another service? It would scale the pods when load increases, reducing CPU usage and improving performance. If not, can the VPS server where the container is hosted provide this mechanism? One more doubt: how do I put more load on this API? At the time of testing, I am the only user.

@vasujain275
Contributor

vasujain275 commented Dec 17, 2024

> @palisadoes For now I have limited its CPU and memory usage. I also added a multi-stage build and used a lightweight image. But I think this will only handle up to a certain number of users. To handle load effectively, can I use Kubernetes or another service? It would scale the pods when load increases, reducing CPU usage and improving performance. If not, can the VPS server where the container is hosted provide this mechanism?

  1. We don't need k8s
  2. Multistage builds and lightweight base image will not help, we already have multi stage builds with alpine images. The main issue is our build process.
  3. @palisadoes Due to my end-semester exams, I am not able right now to create the GraphQL build error issue, which is the main performance blocker here. I will get to it in 2-3 days once my exams end. Sorry for the delay.
  4. I think we should close the docker performance related issues as they create unnecessary confusion. Our docker images are well optimised. The main issue is that we are running our api in dev mode in them, once the build is fixed we can modify the docker files to see the performance improvements.

@PurnenduMIshra129th

PurnenduMIshra129th commented Dec 17, 2024

I don't understand what you mean by a build-related issue. Are you saying that unnecessary node modules or something similar end up in the build when the Docker image is built, and that they are causing the issue? I need further clarity. Also, above you commented that you are not able to run npm run start, but it is working fine, because the API service is starting.

@palisadoes
Contributor Author

> @palisadoes For now I have limited its CPU and memory usage. I also added a multi-stage build and used a lightweight image. But I think this will only handle up to a certain number of users. To handle load effectively, can I use Kubernetes or another service? It would scale the pods when load increases, reducing CPU usage and improving performance. If not, can the VPS server where the container is hosted provide this mechanism?
>
> 1. We don't need k8s
> 2. Multistage builds and lightweight base image will not help, we already have multi stage builds with alpine images. The main issue is our build process.
> 3. @palisadoes Due to my end-semester exams, I am not able right now to create the GraphQL build error issue, which is the main performance blocker here. I will get to it in 2-3 days once my exams end. Sorry for the delay.
> 4. I think we should close the docker performance related issues as they create unnecessary confusion. Our docker images are well optimised. The main issue is that we are running our api in dev mode in them, once the build is fixed we can modify the docker files to see the performance improvements.

OK.

@PurnenduMIshra129th

@palisadoes I ran a load test on the server with Docker and without Docker. With a duration of 30 seconds at 2 requests/sec, i.e. a total of 60 requests in 30 seconds, both setups had an equal success rate. When I ran the same test for the same duration at 5 requests/sec, i.e. 150 requests in 30 seconds, the server without Docker performed slightly better. But the server can't handle 150 requests in 30 seconds: many requests were still being processed and never completed, and only 40 of them were successful. If you want to run Docker on a low-end server for a small user base, say 50 to 60 requests in 60 seconds (assuming a mediocre device with 4 GB of RAM and 4 cores), it will handle the requests easily if talawa-api reduces its CPU-intensive work. If we also limit the CPU usage it will still cope, but with some slowness. What do you say?
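
The arithmetic behind these load-test figures can be sketched as follows. The helper names are illustrative assumptions and do not come from any load-testing tool; the inputs are the numbers quoted in this comment.

```python
def total_requests(rate_per_sec: int, duration_sec: int) -> int:
    """Number of requests issued at a constant rate over the test duration."""
    return rate_per_sec * duration_sec


def success_rate(successful: int, total: int) -> float:
    """Fraction of issued requests that completed successfully."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successful / total


if __name__ == "__main__":
    light = total_requests(rate_per_sec=2, duration_sec=30)
    heavy = total_requests(rate_per_sec=5, duration_sec=30)
    print(f"light run: {light} requests")
    print(f"heavy run: {heavy} requests, {success_rate(40, heavy):.1%} successful")
```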

@palisadoes
Contributor Author

@PurnenduMIshra129th please coordinate with @vasujain275

There appear to be multiple causes. The application is clearly overusing resources.

Here is additional information.

@PurnenduMIshra129th

@vasujain275 Yes, you are correct, the build process is broken. After the build, it does not work properly. Also, when I try to run npm run prod, it does not run and gives multiple errors. Do you have any thoughts on this? Should we use import instead of require?

@bandhan-majumder

@palisadoes is there anything I can help with?

@palisadoes
Contributor Author

We need to focus on the app performance which is causing docker to appear to be slow.

@PurnenduMIshra129th the bare minimum needs to be done to get the demo instance usable on the cloud server. Please coordinate with @vasujain275

@gautam-divyanshu
Member

@vasujain275 @PurnenduMIshra129th What's the status?

@palisadoes
Contributor Author

palisadoes commented Jan 31, 2025

  1. I need a volunteer to take this issue over as @vasujain275 doesn't seem to be available.
  2. The API needs to run under the talawa-api user and Admin needs to run under the talawa-admin user
  3. The host is api-demo.talawa.io which is the same OS instance as admin-demo.talawa.io
  4. Whoever is interested in working on this, contact me on Slack to post your public SSH key so you can log in to the server. You will also get sudo access.
  5. This absolutely needs to be resolved this weekend. We are highlighting the demo as part of our GSoC 2025 application, as justifiable validation of our progress.

@PurnenduMIshra129th

@palisadoes @varshith257 If you provide some guidance, then I can give it a try.

@palisadoes
Contributor Author

> @palisadoes @varshith257 If you provide some guidance, then I can give it a try.

The guidance is clearly defined in the issue.

@PurnenduMIshra129th

@palisadoes Let me first set up the GoDaddy server on my machine.

@palisadoes
Contributor Author

You can't. You don't have access to the server in question and its current configuration. You need login access to do so. This is what needs to be done.

@PurnenduMIshra129th

PurnenduMIshra129th commented Jan 31, 2025

@palisadoes I think I need some credentials to log in to the server. Can I get the credentials so that I can try? My steps would be: first I will create the Docker image in the production environment, then I will set its limits to get minimum performance. If applicable, I can then look at any other changes that would be helpful.

@palisadoes
Contributor Author

> @palisadoes I think I need some credentials to log in to the server. Can I get the credentials so that I can try?

Send me your SSH public key and the username you require in Slack.

@varshith257
Member

@palisadoes I think the deployment that @vasujain275 shared with me is done. All that is left is replacing the dev image with the prod image, and then we should be good to go, I guess.

@PurnenduMIshra129th

@gautam-divyanshu I now have access to the server, so what should I try? I don't have much of an idea about this.

@PurnenduMIshra129th

@palisadoes @varshith257 Yes, it is running on the server, but I can see that its memory limit is still set to the full amount. See the screenshot:

Image
See the limit column.

@palisadoes
Contributor Author

@PurnenduMIshra129th

  1. Is this the develop branch running on the server? That is known to work.
  2. Also, is there any room left for talawa-admin to run in docker too? We cannot have the apps consuming all available resources.
  3. Is it set up to:
    1. Reload the sample DB every day?
    2. Reload the app whenever there are updates to the develop branch?

These are fundamental questions to get the app running.

@PurnenduMIshra129th

PurnenduMIshra129th commented Jan 31, 2025

@palisadoes Currently Docker is running on the server, but I can see that no limit is set there. For now, talawa-api has access to all the RAM and CPU the system has, so we have to set the limits in its compose file and deploy again. The load is also not too high at the moment. Once we update the limits on the API side, we can run Admin. Since the load is low right now, we could run Talawa-Admin already, but it is better to first set the resource limits on both sides and then deploy. After we deploy both, we can benchmark them as suggested by @varshith257.

I did not check the sample DB reloading or the updating from the develop branch.
I am currently writing test cases, so I am a bit busy.

Image
See the limit column: it has access to all of our available RAM, so in a critical condition it will use the whole RAM. Then swapping will start in order to manage the processes, and eventually the load will increase if multiple users come in at the same time.
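
One way to check the limit column programmatically is sketched below. It assumes the `MEM USAGE / LIMIT` column format printed by `docker stats` (binary units such as MiB and GiB) and relies on the limit showing the host's total memory when no limit is configured. The function names, the 5% tolerance, and the sample strings are made up for illustration and are not taken from the screenshot.

```python
import re

# Binary unit multipliers as used in the `docker stats` memory column.
_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}


def parse_size(text: str) -> float:
    """Convert a size such as '512MiB' or '1.934GiB' to bytes."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([KMGT]iB|B)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNITS[unit]


def limit_is_unset(mem_column: str, host_total_bytes: float, tolerance: float = 0.05) -> bool:
    """Return True when the container limit is effectively the whole host RAM.

    mem_column is the 'MEM USAGE / LIMIT' text, e.g. '350MiB / 1.934GiB'.
    The tolerance is an assumed allowance for rounding in the displayed value.
    """
    _usage, limit = mem_column.split("/")
    return parse_size(limit) >= host_total_bytes * (1 - tolerance)


if __name__ == "__main__":
    host_total = parse_size("1.934GiB")  # hypothetical host RAM
    print(limit_is_unset("350MiB / 1.934GiB", host_total))  # no limit configured
    print(limit_is_unset("350MiB / 512MiB", host_total))    # limit configured
```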

@VanshikaSabharwal

Can I work on this issue, @palisadoes @varshith257, so that it gets solved faster?

@palisadoes
Contributor Author

That means the docker file in develop is configured in a suboptimal way. It's probably true in Admin too.

@PurnenduMIshra129th

@palisadoes Yes, it is true.

@palisadoes
Contributor Author

@PurnenduMIshra129th

  1. Since they need to be updated, can you lead that effort?
  2. Both develop* branches in admin will need to be updated

@PurnenduMIshra129th

@palisadoes OK, I will do that.

@palisadoes
Contributor Author

  1. @VanshikaSabharwal has got access to the server. Please coordinate with her.
  2. Create PRs to fix the resource allocation against this issue for all affected branches.

@PurnenduMIshra129th

@palisadoes OK, I will contact her on Slack.

@adithyanotfound

I had previously made a PR that optimized the admin for production, but I can’t find the changes in the codebase. Were they reverted or overwritten?

@palisadoes
Contributor Author

palisadoes commented Feb 2, 2025

@adithyanotfound

  1. They were. We overwrote the develop-postgres branch on top of develop. This was because we had cloned develop-postgres from develop and stopped updates on develop at the same time. Your PR must have been missed in the review period.
  2. Can you reapply the changes?

I'm sorry this happened. There are a lot of moving pieces to manage, and I missed this.

@adithyanotfound

@palisadoes No worries! I’ll reapply the changes and open a new PR shortly.

@PurnenduMIshra129th

@palisadoes @adithyanotfound Can you tell me what the changes were, or which files they were in?

@PurnenduMIshra129th

@palisadoes To start talawa-api on the server, do we only have to start the container, or is there anything else we have to do after that?

@palisadoes
Contributor Author

@PurnenduMIshra129th

@VanshikaSabharwal and I managed to set up the API on the server using the development environment.

We created a /etc/cron.d/talawa-api file that explains the process

Please coordinate with her.

We need better online documentation on this too.

@palisadoes
Contributor Author

Ideally, we should be using production and not develop instances for the API and Admin on the server. Please try to get that working.

@PurnenduMIshra129th

@palisadoes OK, I will complete that soon.

Projects
None yet
10 participants