Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce sharding rules to MongoDB collections #642

Closed
wants to merge 11 commits into from

Conversation

sejongk
Copy link
Contributor

@sejongk sejongk commented Sep 13, 2023

What this PR does / why we need it:
This PR introduces sharding rules to MongoDB collections to distribute loads on the database cluster.
It takes reference from #472.

Which issue(s) this PR fixes:

Addresses #673

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


Additional documentation:


Checklist:

  • Added relevant tests or not required
  • Didn't break anything

@sejongk sejongk added the enhancement 🌟 New feature or request label Sep 13, 2023
@sejongk sejongk self-assigned this Sep 13, 2023
@sejongk sejongk marked this pull request as draft September 13, 2023 09:47
@codecov
Copy link

codecov bot commented Sep 13, 2023

Codecov Report

Attention: 116 lines in your changes are missing coverage. Please review.

Comparison is base (940941a) 49.47% compared to head (bb175e1) 49.19%.
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #642      +/-   ##
==========================================
- Coverage   49.47%   49.19%   -0.28%     
==========================================
  Files          69       69              
  Lines        9951    10129     +178     
==========================================
+ Hits         4923     4983      +60     
- Misses       4512     4608      +96     
- Partials      516      538      +22     
Files Coverage Δ
server/backend/database/mongo/indexes.go 57.14% <ø> (ø)
server/backend/database/mongo/client.go 38.95% <39.58%> (-0.52%) ⬇️

... and 8 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sejongk
Copy link
Contributor Author

sejongk commented Sep 29, 2023

There are some issues I’ve been handling.

The first thing is how to guarantee the uniqueness of fields. In the Yorkie, there is a requirement to force unique constraints on several field combinations (e.g. owner and name in the project collection).

However, MongoDB does not support unique indexes across shards, except when the unique index contains the full shard key as a prefix of the index (ref. https://www.mongodb.com/docs/manual/core/sharding-shard-key/). In these situations MongoDB will enforce uniqueness across the full key, not a single field . It seems related to how the sharding works in MongoDB. Indexing is done and maintained in each shard, not globally, so that the unique indexes are also applied per shard. In addition, uniqueness is supported only for ranged shard key, not for hashed shard key.

It suggests to use a proxy collection for each combination to be globally unique (ref. https://www.mongodb.com/docs/manual/tutorial/unique-constraints-on-arbitrary-fields/) So, I implemented this method (the commit Introduce proxy collections to guarantee uniqueness). It looks fine but I think the collection design is going to be complex, the additional executions for proxies are costly, and there should be a way to make those executions atomic.

Therefore, I’ve implemented to use the application-level lock to get rid of proxy collections. Before insertion operations, it checks if there is a document that already has the same combination, and it creates a new document only if it does not exist (using upsert in MongoDB). Plus, the application-level locking is used to make the operation atomic and prevent conflicts between server instances. The current memory lock seems not to support distributed locks. We can use the Redis or TTLed MongoDB instead later for this.

I’m going to check benchmark results to figure out the accurate tradeoffs for performance.

The second thing is about the shard key. I choose the following keys for good performance under the current query patterns. Note that the server_seq ranged-sharding is used for changes, because range query is frequently used for the changes.

sh.shardCollection("yorkie-meta.projects", { _id: "hashed" })
sh.shardCollection("yorkie-meta.users", { username: "hashed" })
sh.shardCollection("yorkie-meta.clients", { project_id: "hashed" })
sh.shardCollection("yorkie-meta.documents", { project_id: "hashed" })
sh.shardCollection("yorkie-meta.changes", { doc_id: "hashed", server_seq: 1 })
sh.shardCollection("yorkie-meta.snapshots", { doc_id: "hashed" })
sh.shardCollection("yorkie-meta.syncedseqs", { doc_id: "hashed" })

MongoDB provides the balancing mechanisms for sharding (ref. https://www.mongodb.com/docs/manual/core/sharding-data-partitioning/#range-migration, https://www.mongodb.com/docs/manual/core/sharding-balancer-administration/) The balancer process automatically migrates data when there is an uneven distribution of a sharded collection's data across the shards. See Migration Thresholds (ref. https://www.mongodb.com/docs/manual/core/sharding-balancer-administration/#std-label-sharding-migration-thresholds ) for more details.

Any idea about these issues?

@sejongk sejongk marked this pull request as ready for review September 29, 2023 11:42
Copy link
Member

@krapie krapie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution.
I left some small comments below.

The first thing is how to guarantee the uniqueness of fields. In the Yorkie, there is a requirement to force unique constraints on several field combinations (e.g. owner and name in the project collection).

I personally think locking with distributed keys is the lowest priority option that we have considering our current sharded cluster mode, since we are avoiding communications between server instances in sharded cluster mode.

Proxy collection seems more attractive to me for now, since it is MongoDB's official suggestion and we already have some DB actions that require atomic execution.

But benchmarking will clearly show what is most suitable for our situation.

The second thing is about the shard key. I choose the following keys for good performance under the current query patterns. Note that the server_seq ranged-sharding is used for changes, because range query is frequently used for the changes.

I think we have to constantly benchmark and tune collection shard keys based on your shard key selection.

}

info.ID = types.ID(result.UpsertedID.(primitive.ObjectID).Hex())
println("infoID!!", info.ID, result.UpsertedID)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this is for testing purpose.
Consider removing this code later.

]
}
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add newline here.

return err
}

// NOTE: If the project is already being created by another, it is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this comment mean?
Could you give me an explanation for this?

@@ -1466,6 +1641,24 @@ func (c *Client) collection(
Collection(name, opts...)
}

func (c *Client) deleteProjectProxyInfo(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we still using this function?
It seems like you have introduced lock instead of proxy collection.

@hackerwins hackerwins force-pushed the main branch 3 times, most recently from 7892142 to fdc2e1c Compare November 4, 2023 10:22
@hackerwins hackerwins marked this pull request as draft November 23, 2023 01:45
@sejongk sejongk closed this Nov 23, 2023
@sejongk sejongk deleted the add-sharding-rule-to-mongodb branch November 23, 2023 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement 🌟 New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants