
Source MongoDB: Failed to fetch Schema #14246

Open
shubham307307 opened this issue Jun 29, 2022 · 19 comments
Assignees
Labels
autoteam community connectors/source/mongodb frozen Not being actively worked on team/db-dw-sources Backlog for Database and Data Warehouse Sources team type/bug Something isn't working

Comments

shubham307307 commented Jun 29, 2022

I created a source (MongoDB) and a destination (BigQuery), and the connection check passed for both. Then I set up a new connection and started fetching the data schema from Mongo; the operation ran for about 30 minutes and failed. Can you please tell me whether this is due to the large number of collections or a data-format issue on the Mongo side?

[screenshot: error212]

harshithmullapudi (Contributor) commented

Could this be because of the large number of collections? Could you help us with more information:

  1. How many collections are there?
  2. Can you open the browser's network tab, check the error, and share the logs here?

JCWahoo commented Jul 6, 2022

This has been happening to me for quite some time, with roughly 40 collections. The old Ruby connector doesn't have this issue.

JCWahoo commented Jul 6, 2022

Actually, I can no longer get the schema on either version. Both are still running loads just fine; only schema discovery fails.

@harshithmullapudi harshithmullapudi added connectors/source/mongodb team/connectors-java type/bug Something isn't working and removed team/tse Technical Support Engineers labels Jul 7, 2022
@alexandr-shegeda alexandr-shegeda changed the title Failed to fetch Schema Source MongoDB: Failed to fetch Schema Sep 21, 2022
@VitaliiMaltsev VitaliiMaltsev self-assigned this Sep 22, 2022
@grishick grishick added the team/db-dw-sources Backlog for Database and Data Warehouse Sources team label Sep 27, 2022
VitaliiMaltsev (Contributor) commented

@JCWahoo I can't reproduce this issue. I created a dataset of 40 collections with 100,000 documents in each, and schema discovery succeeds.
Do you have any other tips on how to reproduce this?

JCWahoo commented Sep 30, 2022

@VitaliiMaltsev We have more than 40 collections, some with millions of documents. No log messages are returned when using the v2 connector so I've got nothing to go on. The old Ruby connector returns an error around using mapReduce on a view when validating schema... Not sure if that is helpful or not.

When I watch Mongo during schema discovery in v2, it appears the connector is retrieving more than 10k documents for schema evaluation

VitaliiMaltsev (Contributor) commented

@JCWahoo I just tested a dataset of 100 collections with 1.5 million documents in each, and schema discovery is still successful.
I'm not sure how to reproduce the same behaviour as yours.

JCWahoo commented Oct 3, 2022

@VitaliiMaltsev I know, it's frustrating not having any error in the logs. It's an Atlas cluster, and I'm connecting to the replica shard in standalone mode. The collections have several layers of nesting within documents. Happy to provide any more detail I can; I've been blocked from making any updates to the Mongo connection because schema discovery continues to fail. The old connector and new connector are still pulling data hourly without issue, I just can't change them.

VitaliiMaltsev (Contributor) commented

@JCWahoo in that case, could you please provide more information about your source so I can try to reproduce it again:

  1. Total number of databases in your mongo db and their names
  2. Total number of collections in each of the databases
  3. The name of the database you are using for the sync
  4. Approximate number of documents in each collection
  5. The data structure of documents that you use in collections with examples

JCWahoo commented Oct 3, 2022

@VitaliiMaltsev Sure thing. For simplicity/privacy we'll call the database "Production". There are 80 collections; the largest is roughly 50 million documents. I'm using production as the database name for sync and admin as the authentication source. My user has readAnyDatabase @ admin and read @ local permissions in Mongo. The largest collection has around 20 fields, some with several layers of nesting, such as:

  • collection
    • array1
      • object1
        • nestedobject1
        • nestedobject2
        • nestedobject3
          • yetanotherlevelofnesting
      • object2

VitaliiMaltsev (Contributor) commented

Please provide an example document as JSON, with all levels of nesting and the same field names as your largest collection.

JCWahoo commented Oct 3, 2022

Can I email it to you rather than here?

VitaliiMaltsev (Contributor) commented

Sure, my email is [email protected]

VitaliiMaltsev (Contributor) commented

@JCWahoo I tried to reproduce this problem following all your details (80 collections, the biggest with 50 million documents, and the same data structure you sent me), without success. In my environment schema discovery works fine.
On the other hand, I found two potential bottlenecks that could lead to this problem.

  1. Potential Mongo Atlas latency issue.
    Try changing the region of your cluster as described here:
    https://www.mongodb.com/docs/atlas/tutorial/move-cluster/
  2. For each collection, the final result displayed in the UI is assembled in a plain sequential loop, which can be quite slow if you have many collections/documents.
    I created a PR to parallelize this process, which should significantly improve performance.
    Hope this helps :)
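The parallelization idea in point 2 can be sketched as follows. This is not the connector's actual (Java) code; it is a minimal Python illustration in which `discover_collection_schema` is a stand-in for the per-collection work, and the worker count is an arbitrary choice:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-collection discovery step; the real connector samples
# documents from MongoDB, here we just union the field names of the docs.
def discover_collection_schema(name_and_docs):
    name, docs = name_and_docs
    fields = set()
    for doc in docs:
        fields.update(doc.keys())
    return name, sorted(fields)

collections = {
    "users": [{"_id": 1, "email": "a@b.c"}, {"_id": 2, "age": 30}],
    "orders": [{"_id": 1, "total": 9.99}],
}

# Instead of a sequential loop over collections (the pre-PR behaviour),
# each collection is discovered on its own worker thread.
with ThreadPoolExecutor(max_workers=4) as pool:
    schema = dict(pool.map(discover_collection_schema, collections.items()))

print(schema["users"])  # → ['_id', 'age', 'email']
```

With many collections, the wall-clock time drops roughly to that of the slowest collection per batch of workers, rather than the sum over all collections.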

JCWahoo commented Oct 5, 2022 via email

VitaliiMaltsev (Contributor) commented

@JCWahoo I just released mongodb-source version 0.1.19. Please check it out and try refreshing the schema once more.

JCWahoo commented Oct 11, 2022 via email

VitaliiMaltsev (Contributor) commented

@JCWahoo I have one more assumption: your collections may contain documents with different structures within the same collection. This may be causing your issue.
I can advise you to try significantly reducing the DISCOVER_LIMIT constant in the MongoUtils class.
At the moment it is hardcoded to 10,000 documents.
Try setting it to 100 documents, build the source-mongodb-v2 connector as a dev version, and maybe that will solve your problem.

JCWahoo commented Oct 11, 2022

Thanks - any docs/guidance on that last bit? Not sure I've done that before, or it's been so long I don't recall.

VitaliiMaltsev (Contributor) commented

I believe our team needs to implement this as an option you can choose yourself in the UI.

@bleonard bleonard added the frozen Not being actively worked on label Mar 22, 2024
7 participants