-
TL;DR: What is the performance increase from implementing public S3 access and MongoDB? Has it ever been measured?

To the best of my knowledge, the companies using these libraries run production servers on Amazon Web Services. I am currently using your h5p-client and h5p-server ReactJS project, with several modifications, to implement development, testing, and production servers for my employer. Very soon I will be implementing the backend.

What is the performance difference between hosting everything inside one backend Lambda function and hosting each part separately: MongoDB, S3, etc. for the library, content, and metadata, respectively? Also, is there any indication of how much performance is gained by using public rather than private S3 endpoints?

I am preparing and collecting information about this problem, but if somebody with more experience on this issue could give some pointers, I would be much obliged. For context: I might be serving data to 100 users at first, then as many as 1,000, and probably never more than 10,000.
-
Do you mean the functionality to serve files directly from S3, as described in the docs section "Increasing scalability by getting content files directly from S3"? Or do you mean how the Mongo/S3 storages compare to the file system storages?

The Mongo/S3 storage classes are certainly superior to file system storage in general, especially when listing content. Their main benefit, however, is scalability beyond a single instance, as the file system storage classes won't work (well) across machines or containers. If you want to use the library in a scaled environment, you'll have to use Mongo/S3 (and Redis). If you use public S3 links as described in the docs section above, the performance gain depends on several factors.
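To make the trade-off concrete: with a private bucket, every content file is either streamed through your backend or served via a presigned URL you generate per request, whereas with public access the player fetches the object URL directly. Here is a minimal TypeScript sketch of the two URL styles; the bucket name, region, and key are hypothetical, and the presigned variant is only indicated in a comment because it requires the AWS SDK and credentials:

```typescript
// Builds the public virtual-hosted-style URL for an S3 object.
// This only works if the bucket policy allows public reads.
function publicS3Url(bucket: string, region: string, key: string): string {
    // encodeURIComponent would mangle the "/" separators in the key,
    // so encode each path segment individually instead.
    const encodedKey = key.split("/").map(encodeURIComponent).join("/");
    return `https://${bucket}.s3.${region}.amazonaws.com/${encodedKey}`;
}

// With a private bucket you would instead generate a short-lived
// presigned URL per request, e.g. with @aws-sdk/s3-request-presigner:
//   getSignedUrl(s3Client, new GetObjectCommand({ Bucket, Key }), { expiresIn: 300 })
// That signing step, plus the round trip through your backend to obtain
// the URL, is exactly the overhead the public-endpoint option removes.

console.log(publicS3Url("my-h5p-content", "eu-central-1", "content/42/images/earth.png"));
```

Whether removing that per-request signing work is worth making the bucket world-readable is the cost/benefit question discussed below.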
To sum it up, I don't think it's worth using the public S3 endpoint functionality unless you have a very big user base. Keep in mind that serving files directly from the S3 service is charged for by AWS (as far as I know) and that you have to put something like CloudFront in front of it to really get high speed.

We don't have measurements of the performance gains. The Mongo/S3 content storage adapter is not in large-scale use that we manage ourselves. I wrote it as contractual work, and I don't know exactly how the customer is using the adapter or how it performs depending on the configuration. I haven't heard any complaints, though.

If you want to use the Mongo/S3 library storage adapter, I would highly recommend using CloudFront to serve the library files and changing the library files URL accordingly. H5P creates a ton of HTTP requests, and they must be as fast as possible. h5p-server has a cache-busting mechanism that allows safe use of a CDN for library files. In general, I think most of the performance issues can be solved by having super-fast, cached library file storage. It's also important to use the cached library storage to improve metadata access times; the gains can be more than 1000% compared to uncached access.

I haven't used AWS Lambda, as we deploy on self-hosted Kubernetes. I think you should use AWS S3, DocumentDB, and their Redis implementation as backend services. It should be fine to put all H5P functionality into a single Lambda. H5P wasn't designed for a microservice approach, so splitting the core H5P functionality into several services is really hard: there are only a few routes, and each contains a lot of functionality. (I have a few thoughts on improving this, but I don't have time for it at the moment.) Your custom functionality to perform CRUD on H5Ps can be separated into its own Lambda, but I don't think you really have to.
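The cached-library-storage point can be illustrated with a generic memoizing wrapper. To be clear, this is not h5p-server's actual cached storage class, just a sketch of the idea with a hypothetical, much-simplified interface: metadata reads are answered from an in-memory map after the first hit, which is what turns repeated per-request storage round trips into dictionary lookups.

```typescript
// Hypothetical minimal interface for a library metadata store;
// the real h5p-server storage interfaces are much richer than this.
interface ILibraryMetadataStore {
    getMetadata(machineName: string): Promise<object>;
}

// Wraps any metadata store with an in-memory cache. H5P resolves
// library metadata on nearly every request, so even a naive cache
// like this removes the bulk of the storage round trips.
class CachedMetadataStore implements ILibraryMetadataStore {
    private cache = new Map<string, object>();

    constructor(private inner: ILibraryMetadataStore) {}

    async getMetadata(machineName: string): Promise<object> {
        const hit = this.cache.get(machineName);
        if (hit !== undefined) {
            return hit; // cache hit: no call to the backing storage
        }
        const metadata = await this.inner.getMetadata(machineName);
        this.cache.set(machineName, metadata);
        return metadata;
    }

    // Must be called whenever a library is updated, otherwise stale
    // metadata keeps being served.
    invalidate(machineName: string): void {
        this.cache.delete(machineName);
    }
}
```

A real implementation would also need invalidation across instances (e.g. via Redis), which is why the library's own caching plus CDN cache busting is the safer route.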
One thing to bear in mind is that you might run into issues if users upload massive .h5p files. The library extracts these files to local temporary storage to perform validation and then persists them in the backend storage. (Pure in-memory validation turned out to be a problem in the past.) From what I've seen, there is only about 500 MB of temporary storage in Lambdas, so this might be a problem.

Are your user numbers concurrent users or total users? If they are total users, I wouldn't worry too much about premature performance optimization.
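To guard against that /tmp limit, you could reject oversized uploads before extraction even starts. A rough sketch: the 512 MB figure is AWS's documented default for Lambda ephemeral storage (it is configurable up to 10 GB nowadays), but the 2x extraction headroom factor is an assumption you would tune for your content, not a measured ratio.

```typescript
// Default Lambda ephemeral storage (/tmp) is 512 MB; it can be
// raised via the function's ephemeral storage setting.
const DEFAULT_TMP_BYTES = 512 * 1024 * 1024;

// .h5p files are zip archives, so the extracted size exceeds the
// upload size. 2x is an assumed worst case; packages dominated by
// already-compressed media (images, video) expand much less.
const EXTRACTION_HEADROOM = 2;

// Returns true if an uploaded .h5p file of the given size can
// plausibly be extracted and validated within the available /tmp.
function canValidateInTmp(
    uploadBytes: number,
    tmpBytes: number = DEFAULT_TMP_BYTES
): boolean {
    return uploadBytes * EXTRACTION_HEADROOM <= tmpBytes;
}
```

A check like this in your upload route turns a hard-to-diagnose "no space left on device" failure inside the Lambda into a clean error response to the user.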