
[Training] Cleaner inheritance #1

Open · wants to merge 17 commits into base: master
Conversation


@Flo-Wo Flo-Wo commented Dec 14, 2021

Hello Rajarshi,
first of all thank you very much for this super useful implementation and this really elegant solution!
While working with your implementation, I stumbled across two problems: your solution did not support parallelism like the original CrossValidator (which was problematic while working on a cluster), and some other input variables were missing. I wrote setter and getter methods for your two additional variables and designed the constructor in a more Java/PySpark style. I tried to extend your class and combined it with the original code of the PySpark library.

I would be really happy if we could discuss the changes :)

Cheers,

Florian

        - added parallelism support
        - added missing variables
        - wrote the constructor in a PySpark/Java style (see the sketch below)
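
Roughly, the constructor/setter/getter pattern follows the standard pyspark.ml conventions; a minimal sketch (the class name CustomCrossValidator and the extra parameter extraParam are placeholders for illustration, not the names in the actual code):

```python
from pyspark import keyword_only
from pyspark.ml.param import Param, Params
from pyspark.ml.tuning import CrossValidator


class CustomCrossValidator(CrossValidator):
    # one additional Param on top of the stock CrossValidator (name is a placeholder)
    extraParam = Param(Params._dummy(), "extraParam", "an additional input variable")

    @keyword_only
    def __init__(self, *, estimator=None, estimatorParamMaps=None, evaluator=None,
                 numFolds=3, seed=None, parallelism=1, extraParam=None):
        # capture the caller's kwargs before super().__init__() overwrites _input_kwargs
        kwargs = self._input_kwargs
        super().__init__()
        self._setDefault(extraParam=None)
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, *, estimator=None, estimatorParamMaps=None, evaluator=None,
                  numFolds=3, seed=None, parallelism=1, extraParam=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def setExtraParam(self, value):
        return self._set(extraParam=value)

    def getExtraParam(self):
        return self.getOrDefault(self.extraParam)
```

The parallelism keyword is simply forwarded to the parent CrossValidator (which already inherits HasParallelism), so the number of models fitted concurrently is controlled the same way as in the stock class.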
@RajarshiBhadra (Owner)

Hi,

Thank you for enhancing the module. Parallelism was definitely on my mind, but I could never focus enough to get this done. I have gone through the code you have added and it makes sense. But before I merge the PR, would it be possible for you to post some Spark metrics that definitively show we are getting parallelism? More often than not, I have seen the promise of parallelism prove elusive when I look at the actual metrics/DAGs in my Spark implementations, and this would also be a good reference for future tests. We will need one run with the original code base and one run with your modifications, along with the corresponding performance numbers. Let me know if you can run the tests; if not, I will run them myself after the holidays and merge your PR.

Thanks

@Flo-Wo Flo-Wo (Author) commented Dec 22, 2021

Hello,

thanks for your response. I added a benchmark test for both classes (an initial fit, followed by timeit runs repeated three times).
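
A benchmark of roughly this shape can be used to produce such numbers (a sketch only: CustomCrossValidator stands in for the class under test, and the tiny synthetic data set is just a placeholder so the snippet is self-contained):

```python
import timeit

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# tiny synthetic training set, only so the snippet runs end to end
train_df = spark.createDataFrame(
    [(Vectors.dense([float(i), float(i % 3)]), float(i % 2)) for i in range(100)],
    ["features", "label"],
)

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()
cv = CustomCrossValidator(  # placeholder name for the cross-validator being benchmarked
    estimator=lr,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(),
    numFolds=3,
    parallelism=5,
)

initial = timeit.timeit(lambda: cv.fit(train_df), number=1)              # initial fit
repeats = timeit.repeat(lambda: cv.fit(train_df), repeat=3, number=1)    # three timed fits
print(initial, repeats)
```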

My PR:

  • parallelism=1: initial: 42.08579206466675, timeit: [35.692217707999994, 29.944519624999998, 22.568345499999992]
  • parallelism=2: initial: 41.17821216583252, timeit: [34.591567084000005, 26.926313542000003, 22.74932529099999]
  • parallelism=5: initial: 41.06698298454285, timeit: [31.794554874999996, 23.588971333999993, 17.96917808299999]
  • parallelism=10: initial: 43.786065101623535, timeit: [33.374571125, 27.449010374999986, 22.526496959]

Current version in the repo:

  • initial: 17.43117618560791, timeit: [28.174317915999993, 23.965052333000003, 23.813772040999993]

All of them were executed on my local machine running in battery mode with an Apple chip, so I would encourage you to also run these tests :). Currently my PR does not seem to improve anything, at least not on my machine.

What really surprises me is the fact that the first iteration always takes longer; I wonder if any information is stored in the DataFrame after the .train() call?

Do you know how to track Spark DAGs, so we can look into the internal mechanics?

@RajarshiBhadra (Owner)

The best way to try out performance tests and see DAGs and other metrics is to use the Databricks Community Edition. There you can run on dedicated clusters with workers and check from the Spark UI how the DAGs are executed.
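
For a local run, a quick way to find the Spark UI (the Jobs/Stages/SQL tabs show the executed DAGs) is to ask the SparkContext for its address; a small sketch, assuming a running SparkSession:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# URL of the live Spark UI for this application (only reachable while the app is running)
print(spark.sparkContext.uiWebUrl)
```

For applications that have already finished, enabling event logging (spark.eventLog.enabled=true together with spark.eventLog.dir) and opening the run in the Spark History Server gives the same DAG and stage views after the fact.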
