Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dask] Make output of feature contribution predictions for sparse matrices match those from sklearn estimators (fixes #3881) #4378

Merged
merged 39 commits into from
Jul 7, 2021

Conversation

jameslamb
Copy link
Collaborator

@jameslamb jameslamb commented Jun 14, 2021

Fixes #3881.

Resolves this comment from #4351 (comment).

however this is failing for test_classifier_pred_contrib[multiclass-classification-scipy_csr_matrix]

Changes in this PR

  • changes lgb.dask.DaskLGBMClassifier.predict() to return a Dask Bag a list of Dask Arrays when input data is composed of sparse matrices and pred_contrib=True
  • changes data creation function in the unit tests for lightgbm.dask so that it now correctly sets local data to a scipy CSR matrix when asked to do so (previously, it would return a dense numpy array)

How this improves lightgbm.dask

Background

#3000 added support for getting feature contributions (using .predict(pred_contrib=True)) for input data stored in sparse matrices. The design decision was made in that PR that for one specific case (multiclass classificaiont + sparse input data + pred_contrib=True), LightGBM should return a list of sparse matrices (one per class) instead of a single concatenated-together new matrix. Maybe this is because of #3000 (comment) not sure.

I think concatenating sparse matrices together into one from multiple is a bit more expensive than for dense data - eg you have to re-calculate the indptr

Notes for Reviewers

When given a Dask Array, lightgbm.dask._predict() can use Dask's built-in .map_blocks() to say "call predict(**kwargs) on each chunk of this array, then return me a new Dask Array of all those results". However, operations where you want to call a function that does not return an array on each chunk of a Dask Array and then combine those results are not well supported. See the discussion in dask/dask#7589 for an explanation of why this is a very difficult problem.

The best I could come up with was this approach, which produces a Dask Bag where, if you call .compute(), you'll get a result that is identical to the equivalent non-Dask code with lgb.sklearn.LGBMClassifier.predict().

I am ok with this change going into the 3.3.0 release (#4310) even though it is a breaking change, since other things are reliant on it, since it is such a narrow part of the API (only applies if using Dask and using sparse Dask Arrays and using multiclass classification and looking for feature contributions), and since it makes the Dask interface more consistent with the scikit-learn interface. The Dask community has, generally speaking, been comfortable with breaking changes if they're made in pursuit of better consistency with scikit-learn / numpy/ scikit-learn. But will defer to your opinion on that @StrikerRUS .

@jameslamb
Copy link
Collaborator Author

@jmoralez whenever you have time, could you please take a look at this and let me know what you think?

Here's some simpler code you could use to test that might be a little easier to experiment with interactively than the unit tests.

import dask.array as da
import numpy as np
import lightgbm as lgb
from dask import delayed
from dask.distributed import Client, LocalCluster
from lightgbm.dask import DaskLGBMClassifier
from lightgbm.sklearn import LGBMClassifier
from scipy.sparse import csc_matrix, csr_matrix
from sklearn.datasets import make_blobs

n_workers = 3
cluster = LocalCluster(n_workers=n_workers)
client = Client(cluster)
client.wait_for_workers(n_workers)

print(f"View the dashboard: {cluster.dashboard_link}")

chunk_size = 50
X, y = make_blobs(n_samples=100, centers=3, random_state=42)
rnd = np.random.RandomState(42)
dX = da.from_array(X, chunks=(chunk_size, X.shape[1])).map_blocks(csc_matrix)
dy = da.from_array(y, chunks=chunk_size)

dask_clf = DaskLGBMClassifier(n_estimators=5, num_leaves=2, tree_learner="data")
dask_clf.fit(dX, dy)

preds = dask_clf.predict(dX, pred_contrib=True)
preds_computed = preds.compute()

@jameslamb jameslamb changed the title [dask] Make output of feature contribution predictions for sparse matrices match those from sklearn estimators [dask] Make output of feature contribution predictions for sparse matrices match those from sklearn estimators (fixes #3881) Jun 14, 2021
@jmoralez
Copy link
Collaborator

Nice idea using bags! Although I'm not 100% convinced on the type of preds (dask.bag.core.Item) and I think maybe building the bag from delayeds instead of sequence might be a better idea, since I think we're nesting delayeds here (that .compute() in the function looks suspicious).

I believe we can maybe get preds to be List[dask.array.Array] where each array has the same chunks as dX and such that client.compute(preds) returns List[Future] where each future's result type is scipy.sparse.csr_matrix.

I think it's really up to deciding which consistency to have, with the result type of predict or with the result type of preds.compute()

@jameslamb
Copy link
Collaborator Author

I believe we can maybe get preds to be List[dask.array.Array] where each array has the same chunks as dX

hmmm, I couldn't think of a way to do this that didn't involve having num_classes_ duplicate calls of predict(). Since each call of predict() on one chunk of dX returns a list with num_classes_ elements.

If you have an idea about the chain of calls that could produce such an output (even if it's just pseudocode), that would help me.

I'll experiment a bit more.

@jmoralez
Copy link
Collaborator

I was experimenting with this in the if statement for the pred_contrib, definitely not pretty (I imported dask, dask.array and dask.bag without using compat) and client.compute(preds) returns futures of type COO instead of CSR, but maybe it can help you.

delayed_chunks = data.to_delayed()
bag = db.from_delayed(delayed_chunks[:, 0])

predict_function = partial(
 _predict_part,
 model=model,
 raw_score=False,
 pred_proba=False,
 pred_leaf=False,
 pred_contrib=True,
 **kwargs
)

@dask.delayed
def extract(l, i):
    return l[i]

preds = bag.map_partitions(predict_function)
chunks = data.chunks[0]
out = [[] for _ in range(model.n_classes_)]
for j, partition in enumerate(preds.to_delayed()):
    for i in range(model.n_classes_):
        part = da.from_delayed(
            extract(partition, i),
            shape=(chunks[j], model.n_classes_),
            meta=data._meta
        )
        out[i].append(part)
for i in range(model.n_classes_):
     out[i] = da.concatenate(out[i])
return out

@jameslamb
Copy link
Collaborator Author

oooo nice, thank you! That's very helpful. I'm going to keep experimenting.

@jameslamb
Copy link
Collaborator Author

I'll move this back to a draft for now

@jameslamb jameslamb marked this pull request as draft June 15, 2021 03:27
@jameslamb
Copy link
Collaborator Author

All CI jobs except one passed. QEMU_multiarch bdist failed with this error:

E ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=10497&view=logs&j=c2f9361f-3c13-57db-3206-cee89820d5e3&t=bf5b33f9-9072-549e-d8b1-cab79a1dd2a9

Looks like that is coming from

assert len(np.unique(preds_with_contrib[:, num_features]) == 1)

I think only that QEMU job was failing because different versions of numpy will treat that case slightly differently. Anyway, it's a mistake in the test (a misplaced parenthesis).

I fixed that issue in 95ae45f.

@jameslamb
Copy link
Collaborator Author

ah! This test just failed again!

https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=10497&view=logs&j=c2f9361f-3c13-57db-3206-cee89820d5e3&t=bf5b33f9-9072-549e-d8b1-cab79a1dd2a9

It seems the problem is just about about the use of len() on a numpy array, but how np.unique() works in the newest version of numpy.

I'm able to reproduce the problem on my Mac by upgrading to the newest numpy (v1.2.1) and scipy (1.7.0)

pip install --upgrade numpy scipy
cd python-package
python setup.py install
cd ../tests/python_package_test
# manually comment out "if not platform.startswith('linux')"
pytest test_dask.py::test_classifier_pred_contrib

I'll investigate this later tonight.

@jameslamb
Copy link
Collaborator Author

@StrikerRUS I was able to reproduce this tonight! I believe there was a breaking change in numpy 1.21.0 (released June 22, 2021), which causes the error I saw in #4378 (comment).

I've opened numpy/numpy#19405 to report it.

That issue is specific to using np.unique() on a np.matrix object. I learned that calling .todense() on a sparse matrix returns a np.matrix, not a np.array.

We didn't hit this issue before in LightGBM's tests because of the uses of np.array() to wrap results of .todense() (removed in #4378 (comment)).

Given that, I think we should just restore those uses of np.array(). They don't introduce that much overhead and they allow these tests to not be blocked by this possible regression in numpy.

I restored them in 831927b.

@rkern
Copy link

rkern commented Jul 5, 2021

As mentioned on the numpy issue, this is not so much of a regression. np.unique() never worked correctly on np.matrix objects. The fact that it raises an exception in one version and doesn't in an older version is not a regression (arguably, an enhancement, though both are basically accidents of never thinking about np.matrix in the implementation). Coercing to true ndarrays was always the right idea.

@rkern
Copy link

rkern commented Jul 5, 2021

Consider using np.asarray(), which will not copy the underlying memory.

@jameslamb
Copy link
Collaborator Author

Coercing to true ndarrays was always the right idea.
Consider using np.asarray(), which will not copy the underlying memory.

Thanks very much for taking the time to come visit this PR and share some advice with us, @rkern! We'll definitely do that here.

@rkern
Copy link

rkern commented Jul 5, 2021

There's also toarray() on the scipy.sparse matrix objects, which bypasses np.matrix entirely. That would be the most efficient and idiomatic way to write that.

@StrikerRUS StrikerRUS self-requested a review July 5, 2021 11:12
@jameslamb
Copy link
Collaborator Author

There's also toarray() on the scipy.sparse matrix objects, which bypasses np.matrix entirely. That would be the most efficient and idiomatic way to write that.

ah, even better! I'll make that change here. Thanks again for your help.

@jameslamb
Copy link
Collaborator Author

ok switching from .todense() to .toarray() worked and all checks are passing 🙌

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@StrikerRUS StrikerRUS merged commit b09da43 into master Jul 7, 2021
@StrikerRUS StrikerRUS deleted the fix/dask-multiclass-sparse-predict branch July 7, 2021 11:27
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
4 participants