[Training] Cleaner inheritance #1
base: master
Conversation
- added parallelism possibility
- added missing variables
- wrote the constructor in a pyspark/java way
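For context, the parallel-fit pattern PySpark's own `CrossValidator` uses is a thread pool over the candidate parameter maps, sized by the `parallelism` param. Below is a minimal, pyspark-free sketch of that pattern; `train_one` and the `grid` contents are illustrative stand-ins, not the actual module code:

```python
from multiprocessing.pool import ThreadPool

# Hypothetical stand-in for fitting one candidate model; in the real
# class this would call est.fit(train_df, param_map) and evaluate it.
def train_one(param_map):
    return {"params": param_map, "metric": sum(param_map.values())}

def fit_candidates(param_maps, parallelism=2):
    # Threads (not processes) suffice because each Spark job blocks
    # on the driver thread while the cluster does the work.
    pool = ThreadPool(processes=min(parallelism, len(param_maps)))
    try:
        return pool.map(train_one, param_maps)
    finally:
        pool.close()

grid = [{"regParam": 0.1}, {"regParam": 0.5}, {"regParam": 1.0}]
results = fit_candidates(grid, parallelism=2)
```

With `parallelism=1` this degenerates to the original sequential behaviour, which is a useful baseline for the benchmarks discussed below.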
Hi, thank you for enhancing the module. Parallelism was definitely on my mind, but I could never focus enough to get it done. I have gone through the code you added and it makes sense. Before I merge the PR, though, would it be possible for you to post some Spark metrics that show definitively that we are getting parallelism? More often than not, I have seen the promise of parallelism turn out to be elusive when I look at the actual metrics/DAGs in my Spark implementations. These numbers would also be a good reference for future tests. We will need one run with the original code base and one run with your modifications, along with the corresponding performance parameters. Let me know if you can run the tests; if not, I will run them myself after the holidays and merge your PR. Thanks
Hello, thanks for your response. I added a benchmark test for both classes (an initial fit, then two timeit runs repeated three times each). My PR:
Current version in the repo:
All of them were executed on my local machine running in battery mode with an Apple chip, so I would encourage you to also perform these tests :). Currently my PR does not seem to improve anything, at least not on my machine. What really surprises me is that the first iteration always takes longer; I wonder whether something is being cached after the first run. Do you know how to track Spark DAGs, so we can look into the internal mechanics?
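A sketch of the benchmarking procedure described above, using only the standard library; `fit_once` is a placeholder for the actual cross-validator fit call:

```python
import timeit

# Placeholder workload standing in for CrossValidator.fit(train_df).
def fit_once():
    return sum(i * i for i in range(10_000))

# Warm up once first (JIT-like effects, caching), then time two runs,
# repeated three times, mirroring the procedure described above.
fit_once()
timings = timeit.repeat(fit_once, repeat=3, number=2)
best = min(timings)  # report the minimum, as the timeit docs recommend
```

Comparing the minimum rather than the mean reduces noise from background load, which matters when runs are taken on a laptop in battery mode.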
The best way to try out performance tests and see DAGs and other metrics is to use the Spark UI.
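As a complement, finished runs can be inspected after the fact via the Spark history server, provided event logging was enabled for the benchmark run. A sketch of the relevant configuration (the log directory and script name are placeholders):

```shell
# Enable event logging so the finished application shows up in the
# history server; spark.eventLog.dir must exist before submitting.
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=/tmp/spark-events \
  benchmark_script.py
```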
Hello Rajarshi,
first of all thank you very much for this super useful implementation and this really elegant solution!
While working with your implementation, I stumbled across two problems: your solution does not support parallelism like the original CrossValidator (which was problematic when working on a cluster), and some other input variables were missing. I wrote setter and getter methods for your two additional variables and designed the constructor in a more Java/PySpark style. I tried to extend your class and combined it with the original code of the pyspark library.
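The setter/getter and chained-constructor convention I mean can be sketched, without a pyspark dependency, roughly like this; the class and its params (`numFolds`, `parallelism`) are illustrative, not the actual PR code:

```python
class CrossValidatorSketch:
    """Illustrative only: mimics the PySpark convention where each
    setX method returns self so calls can be chained."""

    def __init__(self, estimator=None, numFolds=3, parallelism=1):
        # Real pyspark classes use @keyword_only and _set(**kwargs);
        # plain attributes keep this sketch self-contained.
        self._estimator = estimator
        self._numFolds = numFolds
        self._parallelism = parallelism

    def setNumFolds(self, value):
        self._numFolds = value
        return self  # returning self enables pyspark-style chaining

    def getNumFolds(self):
        return self._numFolds

    def setParallelism(self, value):
        self._parallelism = value
        return self

    def getParallelism(self):
        return self._parallelism

cv = CrossValidatorSketch().setNumFolds(5).setParallelism(4)
```

Matching this convention means the extended class can be dropped into existing pipelines that already configure the stock CrossValidator through chained setters.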
I would be really happy if we could discuss the changes :)
Cheers,
Florian