Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IndexSelector: use dynamic options #219

Merged
merged 15 commits into from
Jun 15, 2022
Merged

IndexSelector: use dynamic options #219

merged 15 commits into from
Jun 15, 2022

Conversation

achimgaedke
Copy link
Contributor

addresses loading timeouts for large datasets as described in #205

This is rough and ready, keen to get some feedback.

@oegedijk
Copy link
Owner

Cool, really nice approach! I actually noticed that dynamic dropdown thing in the documentation the other day as well. Very nifty!

Wonder if there are any performance tradeoffs though such that for small datasets (or rather maybe memory usage of self.idxs) we would use the old_fashioned dropdown and for larger ones the dynamic one.

But if it is perfomative enough than we can just default to the dynamic one...

@oegedijk
Copy link
Owner

This seems quite a bit slower, but has the benefit of being case sensitive:

image

@oegedijk
Copy link
Owner

Probably makes sense to wrap this into an IndexDropdownComponent so that is easy to replicate across components.

@oegedijk
Copy link
Owner

so this seems to work (check the feature input component index selector on the whatif tab)...

@oegedijk
Copy link
Owner

hmm, tests seem to fail on ubuntu. May check tomorrow...

@achimgaedke
Copy link
Contributor Author

Glad you like it.

Why not use the IndexSelector everywhere?

Maybe there's something I'm missing in the code. It's likely that I actually haven't found all index selection locations.

The limit of options returned (1000) is probably best to make configurable... a bit like plot_sample.

@achimgaedke
Copy link
Contributor Author

Btw, very pleased with the load time of the application utilising 1M data points.

@oegedijk
Copy link
Owner

Good question, and answer is that I forgot I had already written IndexSelector, but am apparently good at reinventing the wheel :)

@oegedijk
Copy link
Owner

I was looking at the FeatureInputComponent which didn't use the IndexSelector, my bad. Anyway, will have a look later. Also at this weird stochastic ubuntu bug.

@oegedijk
Copy link
Owner

Okay, I now extended the IndexSelector to all existing ExplainerComponents, and also added a functionality that that when the index is set directly the options simply get set to [index]. Before if the index was set from e.g. a random index selector and the chosen index was not already in the dropdown, it would not show.

Since you have ready access to a million row index dataset, could you test performance of this self.explainer.get_index_list().str.contains(search_value, case=False)][:self.max_idxs_in_dropdown].tolist() vs the islice approach you had first?

I would like search to be case-insensitive, so if islice is much faster on big datasets, then we could add an .index_list_lower() and compare search_value.lower() to it...

@oegedijk
Copy link
Owner

bumped sklearn to 1.1 and python to 3.8 and voila tests are passing :)

@achimgaedke
Copy link
Contributor Author

achimgaedke commented May 24, 2022 via email

@achimgaedke
Copy link
Contributor Author

I have reviewed the changes.

I propose some changes to the IndexSelector class from a software-engineering perspective, mainly from a consistency perspective and a bug in the conditions between the layout and the callback creation.

There's one difference in the use of max_idxs_in_dropdown: The original fix proposes to send dropdown options always as a callback result when requested. This version goes the middle way: allowing the dropdown values to be included in the app layout (multiple times).

Is a purely callback-based option too slow in some cases? If not, I'd suggest keeping only that one.

All in all, I'm worried about startup performance (and hitting a timeout with JSON becoming too slow to load) rather than runtime performance. Do we want two different thresholds?

I am not too concerned with the CPU performance difference of iclice vs pandas.DataFrame.contains for now. One would also need to consider memory consumption.

@oegedijk
Copy link
Owner

I think some subtle bug may have been introduced with the cleanup, but will have a closer look tomorrow..

thanks again for the work!

@oegedijk oegedijk merged commit c247f47 into oegedijk:master Jun 15, 2022
@oegedijk
Copy link
Owner

Alright, looks good to me, let's merge it!

Thanks again for this awesome initiave!

Will give you a shout out on linkedin once I'll release this version...

@oegedijk
Copy link
Owner

just released it as version 0.4.0

@achimgaedke achimgaedke deleted the dynamic-index-selector branch November 24, 2022 03:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants