feat(azure-cosmosdb): Add support for Azure Cosmos DB NoSQL #4678
Conversation
Please @RustedBones / @clairemcginty, what's the next step?
Hi 👋🏽, just an FYI regarding the Azure Cosmos DB emulator image: it doesn't run on ubuntu-latest in GHA. So far, it only works with 18.04. See Azure/azure-cosmos-db-emulator-docker#45 and Azure/azure-cosmos-db-emulator-docker#56
Thanks for the PR. Here are the main points to address:
- Let's not depend on scribe for logging, even for tests.
- The BoundedReader should be implemented respecting Beam's API (see the sketch below).
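For reference, a minimal sketch of what "respecting Beam's API" means here: a `BoundedSource` that describes the read and a `BoundedSource.BoundedReader` that drives the Cosmos result iterator. All class and parameter names are illustrative, not this PR's actual code, and the Cosmos calls assume the azure-cosmos 4.x client:

```scala
import java.util.Collections

import com.azure.cosmos.{CosmosClient, CosmosClientBuilder}
import com.azure.cosmos.models.CosmosQueryRequestOptions
import org.apache.beam.sdk.coders.{Coder, SerializableCoder}
import org.apache.beam.sdk.io.BoundedSource
import org.apache.beam.sdk.options.PipelineOptions
import org.bson.Document

// Illustrative bounded source: connection details are plain constructor
// parameters so the source itself stays serializable.
class CosmosDbBoundedSource(
  val endpoint: String,
  val key: String,
  val database: String,
  val container: String,
  val query: String
) extends BoundedSource[Document] {

  // No generic way to split an arbitrary Cosmos query here; return the source unsplit.
  override def split(
    desiredBundleSizeBytes: Long,
    options: PipelineOptions
  ): java.util.List[_ <: BoundedSource[Document]] =
    Collections.singletonList(this)

  // Unknown up front; 0 tells Beam we have no estimate.
  override def getEstimatedSizeBytes(options: PipelineOptions): Long = 0L

  override def getOutputCoder: Coder[Document] =
    SerializableCoder.of(classOf[Document])

  override def createReader(options: PipelineOptions): BoundedSource.BoundedReader[Document] =
    new CosmosDbBoundedReader(this)
}

// Illustrative reader: opens the client in start(), walks the query results in
// advance(), and releases the client in close(), as Beam's reader contract expects.
class CosmosDbBoundedReader(source: CosmosDbBoundedSource)
    extends BoundedSource.BoundedReader[Document] {

  private var client: CosmosClient = _
  private var results: java.util.Iterator[Document] = _
  private var current: Document = _

  override def start(): Boolean = {
    client = new CosmosClientBuilder()
      .endpoint(source.endpoint)
      .key(source.key)
      .buildClient()
    results = client
      .getDatabase(source.database)
      .getContainer(source.container)
      .queryItems(source.query, new CosmosQueryRequestOptions(), classOf[Document])
      .iterator()
    advance()
  }

  override def advance(): Boolean =
    if (results.hasNext) { current = results.next(); true } else false

  override def getCurrent: Document = current

  override def getCurrentSource: CosmosDbBoundedSource = source

  override def close(): Unit = if (client != null) client.close()
}
```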
build.sbt (outdated)
val bsonVersion = "4.8.1"
val cosmosVersion = "4.37.1"
val cosmosContainerVersion = "1.17.5"
val scribeVersion = "3.10.7"
The project is not using scribe. The new module should also stick with the logging conventions and use slf4j.
Also, try to respect alphabetical ordering in the imports.
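For illustration, the convention this refers to: one slf4j logger per class, backed by the `"org.slf4j" % "slf4j-api"` dependency in build.sbt (the class name below is hypothetical):

```scala
import org.slf4j.{Logger, LoggerFactory}

// Hypothetical module class: an slf4j logger per class, no scribe calls.
class CosmosDbRead {
  private val log: Logger = LoggerFactory.getLogger(getClass)

  def run(): Unit =
    log.info("Reading from Cosmos DB")
}
```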
Yes, it's more a coherence issue. As maintainers, we prefer all modules to look as similar as possible.
The same goes for the Beam implementations: we've made a habit of writing them in Java so that contributing them upstream to Beam is easy.
Since you've also opened a PR in Beam, would you like some support on that?
Review threads on:
- scio-cosmosdb/src/it/scala/com/spotify/scio/cosmosdb/CosmosDbIOIT.scala (outdated, resolved)
- scio-cosmosdb/src/main/scala/com/spotify/scio/cosmosdb/syntax/ScioContextSyntax.scala (resolved)
- scio-cosmosdb/src/main/scala/com/spotify/scio/cosmosdb/read/CosmosDbBoundedReader.scala (outdated, resolved)
- scio-cosmosdb/src/main/scala/com/spotify/scio/cosmosdb/read/CosmosDbRead.scala (outdated, resolved)
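A rough sketch of what the syntax file above exposes, assuming the `CosmosDbBoundedSource` sketched earlier; the method name matches the `readCosmosDbCoreApi` mentioned later in the thread, but the parameters are illustrative:

```scala
import com.spotify.scio.ScioContext
import com.spotify.scio.values.SCollection
import org.apache.beam.sdk.io.Read
import org.bson.Document

object ScioContextSyntax {
  implicit class CosmosDbScioContextOps(private val sc: ScioContext) extends AnyVal {
    // Wraps the bounded source in a Beam Read and registers it as a custom input.
    def readCosmosDbCoreApi(
      endpoint: String,
      key: String,
      database: String,
      container: String,
      query: String
    ): SCollection[Document] =
      sc.customInput(
        "readCosmosDbCoreApi",
        Read.from(new CosmosDbBoundedSource(endpoint, key, database, container, query))
      )
  }
}
```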
To fix the …
Ready!
Add support for reading data from Azure Cosmos DB with Core (SQL) API. Refs: spotify#4675
…ContextOps#readCosmosDbCoreApi and add slf4j-api dependency
…dempotent, add @experimental annotations, and simplify the creation of the CosmosDbBoundedSource
I've checked the Azure Cosmos DB documentation a bit, and it is very similar to Cassandra or Bigtable.
@RustedBones, this is why it is experimental. I already managed to migrate 72k rows to JSON (1.2 GB) in Azure Storage in 20 min, which is much better than the other alternatives; that is why I have implemented the basic methods. I still don't know what the equivalent of split would be for Cosmos, but it is functional and I am already using it; as soon as I can, I will share the metrics from running it in production. And of course it can be improved: I'm still looking into how to do that partitioning, but first I have this first version. PS: the same basic idea is what I have in mind for the Azure Table Storage implementation.
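On the split question, one candidate approach, assuming the azure-cosmos 4.x client, is the FeedRange API: each feed range maps to a physical partition and could back one Beam split. A standalone sketch, untested here and not part of this PR (endpoint, key, and names are placeholders):

```scala
import com.azure.cosmos.CosmosClientBuilder
import com.azure.cosmos.models.CosmosQueryRequestOptions
import org.bson.Document

import scala.jdk.CollectionConverters._

object FeedRangeSketch {
  def main(args: Array[String]): Unit = {
    val client = new CosmosClientBuilder()
      .endpoint("https://<account>.documents.azure.com:443/")
      .key("<key>")
      .buildClient()
    val container = client.getDatabase("<database>").getContainer("<container>")

    // Each FeedRange corresponds to a physical partition: a natural split unit.
    val ranges = container.getFeedRanges.asScala.toList

    ranges.foreach { range =>
      // Scope the query to a single range; a split bounded source would do exactly this.
      val opts = new CosmosQueryRequestOptions().setFeedRange(range)
      val count = container
        .queryItems("SELECT * FROM c", opts, classOf[Document])
        .iterator()
        .asScala
        .size
      println(s"$range -> $count documents")
    }
    client.close()
  }
}
```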
The current implementation moves 68,729 documents from Cosmos DB to Azure Blob Storage formatted as JSON (986 MB) in 10 minutes, including all the time it takes to start the machine on Kubernetes.
@RustedBones, any news on this pull request? It's been 2 weeks.
Will base the implementation on Beam, as explained in #4675.
Add support for reading data from Azure Cosmos DB with the Core (SQL) API
Fixes: #4675
Refs: apache/beam#23604
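For context, an end-to-end usage sketch matching the migration described above (Cosmos DB to JSON text files on blob storage), built on the hypothetical `readCosmosDbCoreApi` from the syntax sketch earlier; argument names and the output scheme are placeholders:

```scala
import com.spotify.scio._

object CosmosToJsonJob {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    import ScioContextSyntax._ // the hypothetical syntax sketched earlier

    sc.readCosmosDbCoreApi(
        endpoint = args("endpoint"),
        key = args("key"),
        database = args("database"),
        container = args("container"),
        query = "SELECT * FROM c"
      )
      .map(_.toJson)                  // org.bson.Document#toJson
      .saveAsTextFile(args("output")) // e.g. a wasb:// or abfs:// path
    sc.run().waitUntilFinish()
    ()
  }
}
```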