Introduce sharding rules to MongoDB collections #642
Conversation
Codecov Report

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main     #642    +/-   ##
========================================
- Coverage   49.47%   49.19%   -0.28%
========================================
  Files          69       69
  Lines        9951    10129   +178
========================================
+ Hits         4923     4983    +60
- Misses       4512     4608    +96
- Partials      516      538    +22
```

☔ View full report in Codecov by Sentry.
There are some issues I’ve been handling.

The first thing is how to guarantee the uniqueness of fields. In Yorkie, there is a requirement to enforce unique constraints on several field combinations (e.g. owner and name in the project collection). However, MongoDB does not support unique indexes across shards, except when the unique index contains the full shard key as a prefix of the index (ref. https://www.mongodb.com/docs/manual/core/sharding-shard-key/). In that case, MongoDB enforces uniqueness across the full key, not a single field. This is related to how sharding works in MongoDB: indexes are built and maintained per shard, not globally, so unique indexes are also applied per shard. In addition, uniqueness is supported only for ranged shard keys, not for hashed shard keys. The documentation suggests using a proxy collection for each combination that must be globally unique (ref. https://www.mongodb.com/docs/manual/tutorial/unique-constraints-on-arbitrary-fields/). I implemented this method first (see the commit), but then switched to an application-level lock to get rid of the proxy collections: before an insertion, it checks whether a document with the same combination already exists and creates a new document only if it does not. I’m going to check benchmark results to figure out the accurate performance tradeoffs.

The second thing is about the shard key. I chose the following keys for good performance under the current query patterns:

```
sh.shardCollection("yorkie-meta.projects", { _id: "hashed" })
sh.shardCollection("yorkie-meta.users", { username: "hashed" })
sh.shardCollection("yorkie-meta.clients", { project_id: "hashed" })
sh.shardCollection("yorkie-meta.documents", { project_id: "hashed" })
sh.shardCollection("yorkie-meta.changes", { doc_id: "hashed", server_seq: 1 })
sh.shardCollection("yorkie-meta.snapshots", { doc_id: "hashed" })
sh.shardCollection("yorkie-meta.syncedseqs", { doc_id: "hashed" })
```

Note that server_seq ranged sharding is used for changes, because range queries are frequently used on the changes.

MongoDB provides balancing mechanisms for sharding (ref. https://www.mongodb.com/docs/manual/core/sharding-data-partitioning/#range-migration, https://www.mongodb.com/docs/manual/core/sharding-balancer-administration/). The balancer process automatically migrates data when there is an uneven distribution of a sharded collection's data across the shards. See Migration Thresholds (ref. https://www.mongodb.com/docs/manual/core/sharding-balancer-administration/#std-label-sharding-migration-thresholds) for more details.

Any idea about these issues?
Thank you for your contribution.
I left some small comments below.
The first thing is how to guarantee the uniqueness of fields. In Yorkie, there is a requirement to enforce unique constraints on several field combinations (e.g. owner and name in the project collection).
I personally think locking with distributed keys is the lowest-priority option we have, considering that we are avoiding communication between server instances in sharded cluster mode.
The proxy collection seems more attractive to me for now, since it is MongoDB's official suggestion and we already have some DB actions that require atomic execution.
But benchmarking will clearly show what is most suitable for our situation.
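For comparison, the proxy-collection approach from the MongoDB tutorial linked above boils down to something like the sketch below. The proxy collection name and fields are hypothetical; the point is that the unique index lives on the proxy collection (whose shard key can be the unique field itself), and the main sharded collection is written only after the reservation succeeds:

```go
package main

import (
	"context"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// reserveProjectName inserts a reservation document into a proxy collection
// that has a unique index on {owner: 1, name: 1}. A duplicate-key error means
// the combination is already taken, so the caller must not create the project
// in the main sharded collection.
func reserveProjectName(
	ctx context.Context,
	proxies *mongo.Collection, // e.g. a hypothetical "project_proxies" collection
	owner, name string,
) (bool, error) {
	_, err := proxies.InsertOne(ctx, bson.M{"owner": owner, "name": name})
	if mongo.IsDuplicateKeyError(err) {
		return false, nil // already reserved by another request
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```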
The second thing is about the shard key. I chose the following keys for good performance under the current query patterns. Note that server_seq ranged sharding is used for changes, because range queries are frequently used on the changes.
I think we have to constantly benchmark and tune collection shard keys based on your shard key selection.
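As a concrete illustration of the query pattern behind the { doc_id: "hashed", server_seq: 1 } key, a typical changes lookup filters on an exact doc_id plus a server_seq range, so mongos can target a single shard instead of broadcasting the query. This is only a sketch; the field names follow the shard-key commands above, and the function itself is hypothetical:

```go
package main

import (
	"context"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// findChangesInRange fetches the changes of a single document whose server_seq
// lies in (from, to]. Because the filter contains the full doc_id (the hashed
// part of the shard key), the query can be routed to one shard, while the
// ranged server_seq component keeps the range scan efficient within it.
func findChangesInRange(
	ctx context.Context,
	changes *mongo.Collection,
	docID interface{}, // e.g. the document's ObjectID
	from, to int64,
) (*mongo.Cursor, error) {
	filter := bson.M{
		"doc_id":     docID,
		"server_seq": bson.M{"$gt": from, "$lte": to},
	}
	return changes.Find(ctx, filter,
		options.Find().SetSort(bson.D{{Key: "server_seq", Value: 1}}))
}
```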
```
info.ID = types.ID(result.UpsertedID.(primitive.ObjectID).Hex())
println("infoID!!", info.ID, result.UpsertedID)
```
Seems like this is for testing purposes.
Consider removing this code later.
```
]
}
)
```
Please add a newline here.
```
	return err
}

// NOTE: If the project is already being created by another, it is
```
What does this comment mean?
Could you give me an explanation for this?
```
@@ -1466,6 +1641,24 @@ func (c *Client) collection(
	Collection(name, opts...)
}

func (c *Client) deleteProjectProxyInfo(
```
Are we still using this function?
It seems like you have introduced a lock instead of the proxy collection.
What this PR does / why we need it:
This PR introduces sharding rules to MongoDB collections to distribute load across the database cluster.
It takes reference from #472.
Which issue(s) this PR fixes:
Addresses #673
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation:
Checklist: