
[Training] Cleaner inheritance #1

Open · wants to merge 17 commits into base: master
Conversation


@Flo-Wo Flo-Wo commented Dec 14, 2021

Hello Rajarshi,
first of all thank you very much for this super useful implementation and this really elegant solution!
While working with your implementation, I stumbled across two problems: your solution did not support parallelism like the original CrossValidator (which was problematic while working on a cluster), and some other input variables were missing. I wrote setter and getter methods for your two additional variables and designed the constructor in a more Java/PySpark style. I tried to extend your class and combined it with the original code of the PySpark library.

I would be really happy if we could discuss the changes :)

Cheers,

Florian

        - added parallelism support
        - added missing variables
        - wrote the constructor in a PySpark/Java style (see the sketch below)
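
Roughly, the constructor/setter/getter pattern follows the standard pyspark.ml conventions; a minimal sketch (the class name CustomCrossValidator and the extra parameter extraParam are placeholders for illustration, not the names in the actual code):

```python
from pyspark import keyword_only
from pyspark.ml.param import Param, Params
from pyspark.ml.tuning import CrossValidator


class CustomCrossValidator(CrossValidator):
    # one additional Param on top of the stock CrossValidator (name is a placeholder)
    extraParam = Param(Params._dummy(), "extraParam", "an additional input variable")

    @keyword_only
    def __init__(self, *, estimator=None, estimatorParamMaps=None, evaluator=None,
                 numFolds=3, seed=None, parallelism=1, extraParam=None):
        # capture the caller's kwargs before super().__init__() overwrites _input_kwargs
        kwargs = self._input_kwargs
        super().__init__()
        self._setDefault(extraParam=None)
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, *, estimator=None, estimatorParamMaps=None, evaluator=None,
                  numFolds=3, seed=None, parallelism=1, extraParam=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def setExtraParam(self, value):
        return self._set(extraParam=value)

    def getExtraParam(self):
        return self.getOrDefault(self.extraParam)
```

The parallelism keyword is simply forwarded to the parent CrossValidator (which already inherits HasParallelism), so the number of models fitted concurrently is controlled the same way as in the stock class.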
@RajarshiBhadra (Owner)

Hi,

Thank you for enhancing the module. Parallelism was definitely on my mind, but I could never focus enough to get this done. I have gone through the code you have added and it makes sense. But before I merge the PR, would it be possible for you to post some Spark metrics that definitively show we are getting parallelism? More often than not, I have seen the promise of parallelism prove elusive when I look at the actual metrics/DAGs in my Spark implementations, and this would also be a good reference for future tests. We will need one run with the original code base and one run with your modifications, along with the corresponding performance numbers. Let me know if you can run the tests; if not, I will run them myself after the holidays and merge your PR.

Thanks

@Flo-Wo Flo-Wo (Author) commented Dec 22, 2021

Hello,

thanks for your response. I added a benchmark test for both classes (an initial fit, followed by timeit runs repeated three times).
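
A benchmark of roughly this shape can be used to produce such numbers (a sketch only: CustomCrossValidator stands in for the class under test, and the tiny synthetic data set is just a placeholder so the snippet is self-contained):

```python
import timeit

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# tiny synthetic training set, only so the snippet runs end to end
train_df = spark.createDataFrame(
    [(Vectors.dense([float(i), float(i % 3)]), float(i % 2)) for i in range(100)],
    ["features", "label"],
)

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()
cv = CustomCrossValidator(  # placeholder name for the cross-validator being benchmarked
    estimator=lr,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(),
    numFolds=3,
    parallelism=5,
)

initial = timeit.timeit(lambda: cv.fit(train_df), number=1)              # initial fit
repeats = timeit.repeat(lambda: cv.fit(train_df), repeat=3, number=1)    # three timed fits
print(initial, repeats)
```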

My PR:

  • parallelism=1: initial: 42.08579206466675, timeit: [35.692217707999994, 29.944519624999998, 22.568345499999992]
  • parallelism=2: initial: 41.17821216583252, timeit: [34.591567084000005, 26.926313542000003, 22.74932529099999]
  • parallelism=5: initial: 41.06698298454285, timeit: [31.794554874999996, 23.588971333999993, 17.96917808299999]
  • parallelism=10: initial: 43.786065101623535, timeit: [33.374571125, 27.449010374999986, 22.526496959]

Current version in the repo:

  • initial: 17.43117618560791, timeit: [28.174317915999993, 23.965052333000003, 23.813772040999993]

All of them were executed on my local machine running in battery mode with an Apple chip, so I would encourage you to also run these tests :). Currently my PR does not seem to improve anything, at least not on my machine.

What really surprises me is the fact that the first iteration always takes longer; I wonder if any information is stored in the DataFrame after the .train() call?

Do you know how to track Spark DAGs, so we can look into the internal mechanics?

@RajarshiBhadra (Owner)

The best way to try out performance tests and see DAGs and other metrics is to use the Databricks Community Edition. There you can run on dedicated clusters with workers and check from the Spark UI how the DAGs are executed.
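
For a local run, a quick way to find the Spark UI (the Jobs/Stages/SQL tabs show the executed DAGs) is to ask the SparkContext for its address; a small sketch, assuming a running SparkSession:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# URL of the live Spark UI for this application (only reachable while the app is running)
print(spark.sparkContext.uiWebUrl)
```

For applications that have already finished, enabling event logging (spark.eventLog.enabled=true together with spark.eventLog.dir) and opening the run in the Spark History Server gives the same DAG and stage views after the fact.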
