feat azure-cosmosdb: Add support for azure cosmos DB NoSQL #4678

Miuler · 2023-01-27T03:26:43Z

Add support for reading data from Azure Cosmos DB with the Core (SQL) API.

Miuler · 2023-01-28T01:24:24Z

Please, @RustedBones / @clairemcginty, what's the next step?

eddumelendez · 2023-01-31T21:26:29Z

Hi 👋🏽, just an FYI regarding the Azure Cosmos DB emulator image: it doesn't run on ubuntu-latest in GHA. So far, it only works with 18.04. See Azure/azure-cosmos-db-emulator-docker#45 and Azure/azure-cosmos-db-emulator-docker#56

RustedBones

Thanks for the PR. Here are the main points to address:

Let's not depend on scribe for logging, even for tests.
The BoundedReader should be implemented in a way that respects Beam's API.
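
As an illustration of the second point, here is a minimal sketch of the start/advance/getCurrent contract that a bounded reader is expected to honor. `SimpleBoundedReader`, `IteratorReader`, and `ReaderDemo` are hypothetical stand-ins written for this example and are not Beam classes; a real connector would extend Beam's `BoundedSource.BoundedReader` instead, and the in-memory iterator stands in for a Cosmos DB query result.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Simplified stand-in for the reader contract (NOT Beam's own class).
interface SimpleBoundedReader<T> extends AutoCloseable {
    boolean start();    // position on the first record; false if there is none
    boolean advance();  // move to the next record; false once exhausted
    T getCurrent();     // return the current record without moving the cursor
    @Override
    void close();
}

// Hypothetical reader backed by an iterator that stands in for a query result.
final class IteratorReader<T> implements SimpleBoundedReader<T> {
    private final Iterator<T> source;
    private T current;
    private boolean hasCurrent;

    IteratorReader(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public boolean start() {
        return advance();
    }

    @Override
    public boolean advance() {
        // Only start/advance consume from the underlying source.
        hasCurrent = source.hasNext();
        current = hasCurrent ? source.next() : null;
        return hasCurrent;
    }

    @Override
    public T getCurrent() {
        // Idempotent: repeated calls return the same cached record.
        if (!hasCurrent) {
            throw new NoSuchElementException("no current record");
        }
        return current;
    }

    @Override
    public void close() {
        // A real reader would release its client connection here.
    }
}

final class ReaderDemo {
    // Drains a reader the way a runner would: start once, then advance until false.
    static <T> List<T> readAll(SimpleBoundedReader<T> reader) {
        List<T> out = new ArrayList<>();
        for (boolean more = reader.start(); more; more = reader.advance()) {
            out.add(reader.getCurrent());
        }
        reader.close();
        return out;
    }
}
```

The key design choice is that only `start` and `advance` touch the underlying source, so `getCurrent` can be called any number of times between them and always returns the same record.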

RustedBones · 2023-02-01T08:12:06Z

build.sbt

+val bsonVersion = "4.8.1"
+val cosmosVersion = "4.37.1"
+val cosmosContainerVersion = "1.17.5"
+val scribeVersion = "3.10.7"


The project is not using scribe. The new module should also stick to the project's slf4j logging conventions.
Also, try to respect alphabetical ordering in the imports.

The production code is implemented with slf4j (import org.slf4j.LoggerFactory).

However, the dependency was implicit, so I'm adding it explicitly.

As for scribe, I'm using it only in tests, because it gives me explicit line numbers without overhead.

Yes. It's more of a consistency issue. As a maintainer, it is preferable to have all modules look as similar as possible.
It's the same with the Beam implementations. We've taken the habit of writing them in Java so that upstream contribution to Beam is easy.
Since you've also opened a PR in Beam, would you like some support on that?

Is there no possibility of supporting scribe only in tests? It's very Scala-style, plus it looks beautiful :)

build.sbt

scio-cosmosdb/src/it/scala/com/spotify/scio/cosmosdb/CosmosDbIOIT.scala

scio-cosmosdb/src/main/scala/com/spotify/scio/cosmosdb/CosmosDbIO.scala

scio-cosmosdb/src/main/scala/com/spotify/scio/cosmosdb/syntax/ScioContextSyntax.scala

scio-cosmosdb/src/main/scala/com/spotify/scio/cosmosdb/read/CosmosDbBoundedReader.scala

scio-cosmosdb/src/main/scala/com/spotify/scio/cosmosdb/read/CosmosDbRead.scala

RustedBones · 2023-02-01T09:14:44Z

To fix the checks in CI, you can run sbt headerCreateAll

Miuler · 2023-02-01T17:13:29Z

To fix the checks in CI, you can run sbt headerCreateAll

Ready!


RustedBones · 2023-02-06T13:58:58Z

I've looked through the Azure Cosmos DB documentation a bit, and it is very similar to Cassandra or Bigtable.
The connector should be written following the same concepts as the CassandraIO/BigtableIO connectors from Beam and should support splitting reads by partition key ranges (taking care of size and locality).
Right now, the proposed implementation can't be parallelized to run on multiple nodes. People using such a connector will hit poor performance when consuming large amounts of data.
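
To make the splitting idea concrete, here is a minimal sketch of cutting one key range into contiguous sub-ranges that separate workers could read in parallel. `KeyRange` and `RangeSplitter` are hypothetical names, and the numeric key space and the byte-size estimate are assumptions made for illustration; a real connector would use whatever range abstraction and size statistics the Cosmos DB SDK exposes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical half-open key range [start, end) over a numeric key space.
final class KeyRange {
    final long start;
    final long end;

    KeyRange(long start, long end) {
        this.start = start;
        this.end = end;
    }

    long width() {
        return end - start;
    }
}

final class RangeSplitter {
    // Splits a range into contiguous sub-ranges that each cover roughly
    // desiredBundleBytes, given an estimate of the bytes stored in the range.
    static List<KeyRange> split(KeyRange range, long estimatedBytes, long desiredBundleBytes) {
        if (desiredBundleBytes <= 0) {
            throw new IllegalArgumentException("desiredBundleBytes must be positive");
        }
        // Ceiling division, with at least one bundle.
        long bundles = Math.max(1, (estimatedBytes + desiredBundleBytes - 1) / desiredBundleBytes);
        // Never create more bundles than there are keys in the range.
        bundles = Math.min(bundles, Math.max(1, range.width()));

        List<KeyRange> out = new ArrayList<>();
        long step = range.width() / bundles;
        long cursor = range.start;
        for (long i = 0; i < bundles; i++) {
            // The last sub-range absorbs any remainder so the union equals the input.
            long next = (i == bundles - 1) ? range.end : cursor + step;
            out.add(new KeyRange(cursor, next));
            cursor = next;
        }
        return out;
    }
}
```

For example, `RangeSplitter.split(new KeyRange(0, 100), 1000, 250)` yields four sub-ranges of width 25. Each sub-range could then back its own source and reader, which is what allows a runner to spread the read across nodes.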

Miuler · 2023-02-07T02:16:18Z

I've looked through the Azure Cosmos DB documentation a bit, and it is very similar to Cassandra or Bigtable. The connector should be written following the same concepts as the CassandraIO/BigtableIO connectors from Beam and should support splitting reads by partition key ranges (taking care of size and locality). Right now, the proposed implementation can't be parallelized to run on multiple nodes. People using such a connector will hit poor performance when consuming large amounts of data.

@RustedBones, that is why it is marked experimental. I have already managed to migrate 72k rows to JSON (1.2 GB) in Azure Storage in 20 minutes, which is much better than the other alternatives. That is why I have implemented only the basic methods.

...I still don't know what the equivalent of split would be for Cosmos DB:

But it is still functional, and I am already using it. As soon as I can, I will share metrics from running it in production.

Of course it can be improved, and I am looking into how to do that partitioning, but I wanted to have this first version in place first.

P.S. I have the same basic idea for the implementation of Azure Table Storage.

Miuler · 2023-02-08T03:05:31Z

The current implementation moves 68729 documents from Cosmos DB to Azure Blob Storage, formatted as JSON (986M), in 10 minutes, including the time it takes to start the machine in Kubernetes.
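
For reference, the figures above can be turned into per-second rates with a small helper. This is only arithmetic on the reported numbers: `Throughput` is a hypothetical name, and reading "986M" as megabytes is an assumption.

```java
final class Throughput {
    // Converts a total amount processed over a duration in minutes into a per-second rate.
    static double perSecond(double amount, double minutes) {
        return amount / (minutes * 60.0);
    }

    public static void main(String[] args) {
        // Reported run: 68729 documents and 986 MB (assumed unit) in 10 minutes.
        System.out.printf("documents/s: %.1f%n", perSecond(68729, 10));
        System.out.printf("MB/s: %.2f%n", perSecond(986, 10));
    }
}
```

Under those assumptions the run works out to roughly 114.5 documents per second and about 1.64 MB per second, with startup time included in the 10 minutes.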

Miuler · 2023-02-22T05:15:28Z

@RustedBones, any news on this pull request? It's been two weeks.

RustedBones · 2023-07-11T10:18:13Z

We will base the implementation on Beam, as explained in #4675.

Miuler mentioned this pull request Jan 27, 2023

feat(scio-cosmosdb): Add support for cosmosdb with Core (SQL) API #4675

Open

Miuler force-pushed the feature/4675_azure_cosmosdb branch from 2cd4fe2 to d7e1c45 Compare January 27, 2023 20:24

Miuler marked this pull request as ready for review January 27, 2023 20:25

Miuler force-pushed the feature/4675_azure_cosmosdb branch 3 times, most recently from f582184 to aa12d8c Compare January 27, 2023 23:02

Miuler force-pushed the feature/4675_azure_cosmosdb branch 2 times, most recently from 8b98860 to 5396b84 Compare January 30, 2023 22:13

RustedBones requested changes Feb 1, 2023


Miuler force-pushed the feature/4675_azure_cosmosdb branch from d7c035c to 5a85e78 Compare February 1, 2023 21:24

Miuler requested a review from RustedBones February 1, 2023 21:32

Miuler force-pushed the feature/4675_azure_cosmosdb branch 4 times, most recently from 0fe0b7d to dabc978 Compare February 2, 2023 15:42

Miuler added 3 commits February 2, 2023 10:43

feat(scio-cosmosdb): feat(azure-cosmosdb): Add support for cosmosdb w…

20c0208

…ith Core (SQL) API Refs: spotify#4675

docs(scio-cosmosdb): Add Header license, add scaladoc in CosmosDbScio…

e07ab10

…ContextOps#readCosmosDbCoreApi and add slf4j-api dependency

test(scio-cosmosdb): Refactor for add the testcontainers-scala-scalatest

a79c089

Miuler changed the title ~~feat(scio-cosmosdb): feat(azure-cosmosdb): Add support for cosmosdb with Core (SQL) API~~ fat(azure-cosmosdb): Add support for cosmosdb with Core (SQL) API Feb 2, 2023

Miuler force-pushed the feature/4675_azure_cosmosdb branch 2 times, most recently from 8a58fc7 to 091e441 Compare February 2, 2023 19:23

fix(scio-cosmosdb): Fix the CosmosDbBoundedReader#getCurrent is now i…

091e441

…dempotent, add @experimental annotations and simplifying the creation of the CosmosDbBoundedSource

RustedBones changed the title ~~fat(azure-cosmosdb): Add support for cosmosdb with Core (SQL) API~~ feat azure-cosmosdb: Add support for azure cosmos DB Feb 6, 2023

RustedBones changed the title ~~feat azure-cosmosdb: Add support for azure cosmos DB~~ feat azure-cosmosdb: Add support for azure cosmos DB NoSQL Feb 6, 2023

RustedBones closed this Jul 11, 2023
