Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat azure-cosmosdb: Add support for azure cosmos DB NoSQL #4678

Closed
wants to merge 4 commits into from

Conversation

Miuler
Copy link

@Miuler Miuler commented Jan 27, 2023

Add suppor for read data from Azure Cosmos DB with Core (SQL) API

Fixes: #4675
Refs: apache/beam#23604

@Miuler Miuler force-pushed the feature/4675_azure_cosmosdb branch from 2cd4fe2 to d7e1c45 Compare January 27, 2023 20:24
@Miuler Miuler marked this pull request as ready for review January 27, 2023 20:25
@Miuler Miuler force-pushed the feature/4675_azure_cosmosdb branch 3 times, most recently from f582184 to aa12d8c Compare January 27, 2023 23:02
@Miuler
Copy link
Author

Miuler commented Jan 28, 2023

Please @RustedBones / @clairemcginty, what's next step?

@Miuler Miuler force-pushed the feature/4675_azure_cosmosdb branch 2 times, most recently from 8b98860 to 5396b84 Compare January 30, 2023 22:13
@eddumelendez
Copy link

Hi 👋🏽, Just a FYI regarding azure cosmosdb image. It doesn't run with ubuntu-latest in GHA. So far, it only works with 18.04. See Azure/azure-cosmos-db-emulator-docker#45 and Azure/azure-cosmos-db-emulator-docker#56

Copy link
Contributor

@RustedBones RustedBones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Here are the main points to address:

Let's no depend on scribe for logging, even for tests.
The BoundedReader should be implemented respecting beam's API.

build.sbt Outdated
val bsonVersion = "4.8.1"
val cosmosVersion = "4.37.1"
val cosmosContainerVersion = "1.17.5"
val scribeVersion = "3.10.7"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The project is not using scribe. The new module should also stick with the logging conventions with slf4j.
also, try to respect alphabetical orderings in the imports.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The production code is implement with Slf4j (import org.slf4j.LoggerFactory)

imagen

But the dependency is implicit, I'm add the dependency explicit.

And respect with scribe, I'm using only in test, because is gives me explicit number line without overload.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It's more a coherence issue. As maintainer, it is preferable to have all module look as similar as possible.
It's also the same with the beam implementations. We've taken the habit to write them in java so upstream contribution to beam can be easy.
Since you've also opened a PR in beam, would you like some support on that ?

Copy link
Author

@Miuler Miuler Feb 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there no possibility to support scribe only in tests? It's very scala style, plus it looks beautiful :)

imagen

build.sbt Show resolved Hide resolved
@RustedBones
Copy link
Contributor

To fix the checks in CI, you can run sbt headerCreateAll

@Miuler
Copy link
Author

Miuler commented Feb 1, 2023

To fix the checks in CI, you can run sbt headerCreateAll

Ready!

@Miuler Miuler force-pushed the feature/4675_azure_cosmosdb branch from d7c035c to 5a85e78 Compare February 1, 2023 21:24
@Miuler Miuler requested a review from RustedBones February 1, 2023 21:32
@Miuler Miuler force-pushed the feature/4675_azure_cosmosdb branch 4 times, most recently from 0fe0b7d to dabc978 Compare February 2, 2023 15:42
@Miuler Miuler changed the title feat(scio-cosmosdb): feat(azure-cosmosdb): Add support for cosmosdb with Core (SQL) API fat(azure-cosmosdb): Add support for cosmosdb with Core (SQL) API Feb 2, 2023
@Miuler Miuler force-pushed the feature/4675_azure_cosmosdb branch 2 times, most recently from 8a58fc7 to 091e441 Compare February 2, 2023 19:23
…dempotent, add @experimental annotations and simplifying the creation of the CosmosDbBoundedSource
@RustedBones RustedBones changed the title fat(azure-cosmosdb): Add support for cosmosdb with Core (SQL) API feat azure-cosmosdb: Add support for azure cosmos DB Feb 6, 2023
@RustedBones RustedBones changed the title feat azure-cosmosdb: Add support for azure cosmos DB feat azure-cosmosdb: Add support for azure cosmos DB NoSQL Feb 6, 2023
@RustedBones
Copy link
Contributor

I've checked a bit Azure cosmos DB documentation and it is very similar to Cassandra or BigTable.
The connector should be written following the same concepts of the CassadraIO/BigtableIO from beam and support splitting reads by partition key ranges (taking care of size and locality).
Right now, the proposed implementation can't be parallelized to run on multiple nodes. People using such connector will hit poor performance when consuming large amount of data

@Miuler
Copy link
Author

Miuler commented Feb 7, 2023

I've checked a bit Azure cosmos DB documentation and it is very similar to Cassandra or BigTable. The connector should be written following the same concepts of the CassadraIO/BigtableIO from beam and support splitting reads by partition key ranges (taking care of size and locality). Right now, the proposed implementation can't be parallelized to run on multiple nodes. People using such connector will hit poor performance when consuming large amount of data

@RustedBones, This is why it is an experimental one, I already managed to migrate 72k rows to json (1.2Gb) in an azure storage in 20min, this is much better than other alternatives. That is why I have implemented the basic methods.

...I still don't know what would be the equivalent of split for cosmos:

imagen

imagen

But it is still functional, I am already using it, as soon as I can, I could give the metrics running in PRD.

And of course it can be improved, looking for how I could do that partitioning, but first I have this first version.

PD. The same basic idea is the one I have for the implementation of azure table storage

@Miuler
Copy link
Author

Miuler commented Feb 8, 2023

The current implementation moves 68729 documents from a cosmosdb to an azure blob storage formatted to json (986M), in 10 minutes accounting for all the time it takes to start the machine in a kubernetes.

@Miuler
Copy link
Author

Miuler commented Feb 22, 2023

@RustedBones, any news on this pull request? it's been 2 weeks

@RustedBones
Copy link
Contributor

Will base the implementation on beam as explained in #4675

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat(scio-cosmosdb): Add support for cosmosdb with Core (SQL) API
3 participants