Question about data size limits #3
Hi, first of all, thanks for the application! I've been trying it out with different datasets and it works great with the smaller ones. But the application stalls with bigger datasets. My particular case is a dataset of 120 GB with 2000 million records, and I want to run DBScan with an eps of 0.0001. I don't know whether I'm configuring the ppd parameter badly (with a value of 100 it stalls indefinitely, while with smaller values there seems to be some progress, even though it still hangs), or whether it simply won't work with such a large dataset and such a small eps.

Is there any chance that I'm configuring it wrong? Thanks in advance!
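A rough back-of-the-envelope calculation of why this combination can stall, assuming ppd means grid partitions per dimension and the points are 2-dimensional; both of those are my assumptions, not confirmed against the project's code:

```scala
// Plain Scala, no Spark needed: estimates how many points land in each
// grid cell. The reading of `ppd` as "partitions per dimension" is an
// assumption here, not something the project documents confirm.
object GridEstimate {
  def main(args: Array[String]): Unit = {
    val numPoints     = 2e9                     // ~2000 million records
    val ppd           = 100.0                   // assumed: partitions per dimension
    val dims          = 2.0                     // assumed: 2-dimensional points
    val numCells      = math.pow(ppd, dims)     // 10,000 grid cells
    val pointsPerCell = numPoints / numCells    // ~200,000 points per cell
    // Per-cell DBSCAN is quadratic in the worst case, so even one dense
    // or skewed cell can keep a single task running for a very long time.
    println(f"cells: $numCells%.0f, avg points per cell: $pointsPerCell%.0f")
  }
}
```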
Hi, if it works for smaller data sets I would guess that:

Can you tell us about the setup you are running your program on? Number of nodes, CPU, etc.
Hi, I also tried using the BSPartitioner, but it crashed earlier than with the GridPartitioner. And I also tried repartitioning the data before calling the DBScan run function, but it still gets stalled :( I noticed there is a comment in DBScan.scala (line 148) that suggests repartitioning the data after the groupBy. Maybe I'll try to modify that part and see if there is any improvement; a sketch of what I have in mind follows. Do you think it could work?
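A minimal illustration of that idea in plain Spark, assuming the data is an RDD keyed by a grid-cell id; the names here (regroupCells, the Array[Double] payload) are mine, not identifiers taken from DBScan.scala:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical sketch of "repartition after the groupBy": spread the
// grouped cells across more partitions so that a few heavy cells do not
// all land on the same executor.
def regroupCells(points: RDD[(Int, Array[Double])],
                 numPartitions: Int): RDD[(Int, Iterable[Array[Double]])] =
  points
    .groupByKey()               // gather each grid cell's points together
    .repartition(numPartitions) // then rebalance the groups across the cluster
```

Passing the partition count directly as groupByKey(numPartitions) would achieve the same spread while avoiding the extra shuffle that a separate repartition incurs.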
It may help. Though, I also found a place where we collected the cluster content in a … I will try to solve this.
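If the place in question gathers the cluster content onto the driver, for example via collect(), a common remedy, sketched below with made-up names, is to keep the labeled points distributed and write them out from the executors instead:

```scala
import org.apache.spark.rdd.RDD

// If cluster contents are pulled to the driver, e.g.:
//   val clusters = labeledPoints.collect()  // materializes everything in one JVM
// a 120 GB dataset will not fit. This keeps the data as an RDD and writes
// it directly from the executors. Names and the output path are illustrative.
def saveClusters(labeledPoints: RDD[(Int, Array[Double])], path: String): Unit =
  labeledPoints
    .map { case (clusterId, point) => s"$clusterId,${point.mkString(",")}" }
    .saveAsTextFile(path) // one output part per partition, no driver-side collect
```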
Ok, thank you very much!!