Fixed `AnchorTabular` length discrepancy between `feature` and `names` field. #902

RobertSamoilescu · 2023-04-13T18:32:11Z

This PR fixes the AnchorTabular length discrepancy between the feature and names fields returned in the explanation object. To describe what caused the issue, let us consider the following example.

Suppose the dataset has a numerical feature f. Because Anchors can only handle discrete data, a discretization step is required for numerical features. In our example, we discretize the numerical values based on the 25%, 50%, and 75% quantiles. Let t25, t50, t75 be the associated quantile values. This discretizes the numerical feature f into 4 bins: [-inf, t25], [t25, t50], [t50, t75], and [t75, +inf], encoded by 0, 1, 2, and 3, respectively.
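The discretization step can be sketched as follows. This is a minimal illustration using only the standard library, not the discretizer alibi actually uses; the helper names `quartile_thresholds` and `discretize` are hypothetical, and the sketch assumes each bin is closed on the right (a value equal to t25 lands in bin 0).

```python
import bisect
import statistics


def quartile_thresholds(values):
    """Return [t25, t50, t75] for a list of numbers (hypothetical helper)."""
    return statistics.quantiles(values, n=4, method="inclusive")


def discretize(value, thresholds):
    """Map a value to its bin index, assuming bins are closed on the right.

    bisect_left counts the thresholds strictly smaller than `value`,
    so value <= t25 gives 0 and value > t75 gives 3.
    """
    return bisect.bisect_left(thresholds, value)


# Toy data: the numbers 1..100 stand in for the numerical feature f.
values = [float(v) for v in range(1, 101)]
thresholds = quartile_thresholds(values)

# One sample value per bin.
bins = [discretize(v, thresholds) for v in (10.0, 30.0, 60.0, 90.0)]
print(bins)  # [0, 1, 2, 3]
```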

Suppose we want to explain an instance X, and let X[f] denote the value of feature f for the instance X. Assume that X[f] falls in bin number 2, and is therefore encoded by the value 2.

For numerical features, the AnchorTabular algorithm creates multiple predicates associated with the same feature f. Those predicates correspond to intervals from which numerical samples can be drawn for the perturbation step in the algorithm. The code for this can be seen here. In our case the following predicates will be created:

P1 = [1, 2, 3],
P2 = [2, 3],
P3 = [0, 1, 2]

Note that each predicate Pi corresponds to an interval from which we can sample values for the feature f. For example, P1 is associated with the interval [t25, +inf], P2 with [t50, +inf], and P3 with [-inf, t75].
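To make the relation between predicates and intervals concrete, here is a small sketch that reproduces P1, P2, P3 for an instance in bin 2 of 4. It is a reconstruction from the description above, not the alibi source; `build_predicates` and `predicate_interval` are hypothetical names, and the thresholds are made-up numbers.

```python
import math


def build_predicates(instance_bin, n_bins):
    """One predicate per threshold, always on the side containing the instance.

    Threshold i separates bins 0..i from bins i+1..n_bins-1.
    """
    predicates = []
    for i in range(n_bins - 1):
        if instance_bin <= i:
            predicates.append(list(range(0, i + 1)))       # f <= t_i
        else:
            predicates.append(list(range(i + 1, n_bins)))  # f > t_i
    return predicates


def predicate_interval(predicate, thresholds):
    """Translate a list of bin ids into the (low, high) sampling interval."""
    n_bins = len(thresholds) + 1
    low = thresholds[min(predicate) - 1] if min(predicate) > 0 else -math.inf
    high = thresholds[max(predicate)] if max(predicate) < n_bins - 1 else math.inf
    return (low, high)


thresholds = [10.0, 20.0, 30.0]  # stand-ins for t25, t50, t75
predicates = build_predicates(instance_bin=2, n_bins=4)
print(predicates)  # [[1, 2, 3], [2, 3], [0, 1, 2]]
for p in predicates:
    print(p, predicate_interval(p, thresholds))
```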

The final anchor may contain several of the three predicates Pi listed above. Let us assume that it ends up containing P1 and P2. With this assumption, let us move to the construction of the human-interpretable representation of the anchor implemented here.

Let's say that the anchor is composed of three predicates encoded by [1, 2, 3], where 1 is associated with a feature g different from f, and 2, 3 correspond to the predicates P1, P2 associated with feature f.

Following the code line by line, we have:

anchor_idxs = explanation['feature']     # anchor_idxs = [1, 2, 3]

explanation['names'] = []

explanation['feature'] = [self.enc2feat_idx[idx] for idx in anchor_idxs]  # explanation['feature'] = [g, f, f]

ordinal_ranges = {self.enc2feat_idx[idx]: [float('-inf'), float('inf')] for idx in anchor_idxs}  # ordinal_ranges = {g: [-inf, +inf], f: [-inf, +inf]}

We already see at this point that the length of explanation['feature'] differs from the number of keys in ordinal_ranges, because explanation['feature'] contains f twice.
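The mismatch is easy to reproduce in isolation. The snippet below is a standalone sketch with a hypothetical `enc2feat_idx` mapping, where the strings "g" and "f" stand in for feature column ids.

```python
# Hypothetical mapping from encoded predicate id to feature column.
enc2feat_idx = {1: "g", 2: "f", 3: "f"}
anchor_idxs = [1, 2, 3]

# One entry per predicate, so f appears twice.
features = [enc2feat_idx[idx] for idx in anchor_idxs]

# One entry per distinct feature, because dictionary keys are unique.
ordinal_ranges = {
    enc2feat_idx[idx]: [float("-inf"), float("inf")] for idx in anchor_idxs
}

print(features)              # ['g', 'f', 'f']
print(list(ordinal_ranges))  # ['g', 'f']
```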

The following block of code correctly intersects and refines the intervals for each feature in the anchor:

for idx in set(anchor_idxs) - self.cat_lookup.keys():
    feat_id = self.enc2feat_idx[idx]  # feature col. id
    if 0 in self.ord_lookup[idx]:  # tells if the feature in X falls in a higher or lower bin
        ordinal_ranges[feat_id][1] = min(
            ordinal_ranges[feat_id][1], max(list(self.ord_lookup[idx]))
        )
    else:
        ordinal_ranges[feat_id][0] = max(
            ordinal_ranges[feat_id][0], min(list(self.ord_lookup[idx])) - 1
        )
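As a sanity check, the same loop can be run on the toy anchor from this example. The lookups below are hypothetical stand-ins for the explainer's attributes: predicate 1 is assumed to be categorical (feature g), while predicates 2 and 3 carry the bin sets of P1 and P2.

```python
import math

# Hypothetical stand-ins for self.enc2feat_idx, self.cat_lookup, self.ord_lookup.
enc2feat_idx = {1: "g", 2: "f", 3: "f"}
cat_lookup = {1: "g = some category"}
ord_lookup = {2: {1, 2, 3}, 3: {2, 3}}  # P1 and P2

anchor_idxs = [1, 2, 3]
ordinal_ranges = {enc2feat_idx[idx]: [-math.inf, math.inf] for idx in anchor_idxs}

for idx in set(anchor_idxs) - cat_lookup.keys():
    feat_id = enc2feat_idx[idx]
    if 0 in ord_lookup[idx]:
        # Predicate of the form f <= t: tighten the upper bound.
        ordinal_ranges[feat_id][1] = min(ordinal_ranges[feat_id][1], max(ord_lookup[idx]))
    else:
        # Predicate of the form f > t: tighten the lower bound.
        ordinal_ranges[feat_id][0] = max(ordinal_ranges[feat_id][0], min(ord_lookup[idx]) - 1)

print(ordinal_ranges["f"])  # [1, inf]
```

P1 and P2 collapse into a single range for f whose lower bound is threshold index 1 (t50), the tighter of the two, so only one name is produced for f.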

Finally, the human-interpretable representation of the anchor for numerical features is constructed here based on the dictionary ordinal_ranges.

Note that the explanation['names'] field avoids duplicating the same feature, hence the difference in length from explanation['feature'].

The way to fix this issue is to set explanation['names'] to the keys list in ordinal_ranges.

jklaise · 2023-04-17T12:16:54Z

Nice! Thanks also for the thorough explanation.

Fixed AnchorTabular length discrepancy between feature and names field.

c21dd84

RobertSamoilescu requested a review from jklaise April 13, 2023 18:32

jklaise merged commit 89eb7d0 into SeldonIO:master Apr 17, 2023
