update docs to v0.6.9
paulbkoch committed Jan 6, 2025
1 parent d692bba commit 4ef6815
Showing 37 changed files with 625 additions and 427 deletions.
6 changes: 3 additions & 3 deletions _sources/ebm-internals-classification.ipynb
@@ -180,7 +180,7 @@
"\n",
"As described in [part 1](./ebm-internals-regression.ipynb), continuous feature binning is defined with a list of cut points that partition the continuous range into regions. In this example, our dataset has 3 unique values for the continuous feature: 7.0, 8.0, and 9.0. Similarly to [part 1](./ebm-internals-regression.ipynb) the main effects in this example have 2 bin cuts that separate these into 3 regions. In this example, the bin cuts for main effects are again 7.5 and 8.5.\n",
"\n",
"EBMs support the ability to reduce the binning resolution when binning a feature for interactions. In the call to \\_\\_init\\_\\_ for the ExplainableBoostingClassifier, we specified max_interaction_bins=4, which limited the EBM to creating just 4 bins when binning for interactions. Two of those bins are reserved for 'missing' and 'unknown' values, which leaves the model with 2 bins for the remaining continuous feature values. We have 3 unique values in our dataset though, so the EBM is forced to decide which of these values to group together and choose a single cut point that separate them into the 2 regions. In this example, the EBM could have chosen any cut point between 7.0 and 9.0. It chose 8.5, which puts the 7.0 and 8.0 values in the lower bin and 9.0 in the upper bin.\n",
"EBMs support the ability to reduce the binning resolution when binning a feature for interactions. In the call to \\_\\_init\\_\\_ for the ExplainableBoostingClassifier, we specified max_interaction_bins=4, which limited the EBM to creating just 4 bins when binning for interactions. Two of those bins are reserved for 'missing' and 'unseen' values, which leaves the model with 2 bins for the remaining continuous feature values. We have 3 unique values in our dataset though, so the EBM is forced to decide which of these values to group together and choose a single cut point that separate them into the 2 regions. In this example, the EBM could have chosen any cut point between 7.0 and 9.0. It chose 8.5, which puts the 7.0 and 8.0 values in the lower bin and 9.0 in the upper bin.\n",
"\n",
"The binning definitions for main effect and interactions are stored in a list for each feature in the ebm.bins_ attribute. In this example, ebm.bins_[1] contains a list of arrays: [array([7.5, 8.5]), array([8.5])]. The first array of [7.5, 8.5] at ebm.bins_[1][0] is the binning resolution for main effects. The second array of [8.5] at ebm.bins_[1][1] is the binning resolution used when binning for interactions.\n",
"\n",
@@ -218,7 +218,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"ebm.term_scores_[1] is the lookup table for the second feature in this example. Since the second feature is a continuous feature, we use cut points for binning. The 0th bin index is again reserved for missing values, and the last bin index is again reserved for unknown values. In this example, the 0th bin score is non-zero because we included a missing value in the dataset for this feature.\n",
"ebm.term_scores_[1] is the lookup table for the second feature in this example. Since the second feature is a continuous feature, we use cut points for binning. The 0th bin index is again reserved for missing values, and the last bin index is again reserved for unseen values. In this example, the 0th bin score is non-zero because we included a missing value in the dataset for this feature.\n",
"\n",
"The ebm.bins_[1] attribute contains a list having 2 arrays of cut points. In this case we are binning a main effects feature, so we use the bins at index 0, which is ebm.bins_[1][0]."
]
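As an illustration of that lookup, here is a minimal sketch assuming the fitted ebm from this example; main_effect_score is a hypothetical helper, not part of the interpret API:

```python
import math
import numpy as np

def main_effect_score(ebm, term_idx, feature_val):
    """Hypothetical helper: score one continuous main-effect term."""
    cuts = ebm.bins_[term_idx][0]  # index 0 holds the main-effect resolution
    if isinstance(feature_val, float) and math.isnan(feature_val):
        bin_idx = 0  # the 0th bin is reserved for missing values
    else:
        bin_idx = np.digitize(feature_val, cuts) + 1  # +1 skips the missing bin
    return ebm.term_scores_[term_idx][bin_idx]
```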
@@ -249,7 +249,7 @@
"source": [
"<h2>Sample code</h2>\n",
"\n",
"Finally, here's some code which puts the above considerations together into a function that can make predictions for simplified scenarios. This code does not handle things like regression, multiclass, unknown values, or interactions beyond pairs.\n",
"Finally, here's some code which puts the above considerations together into a function that can make predictions for simplified scenarios. This code does not handle things like regression, multiclass, unseen values, or interactions beyond pairs.\n",
"\n",
"If you need a drop-in complete function that can work in all EBM scenarios, see the multiclass example in [part 3](./ebm-internals-multiclass.ipynb) which handles regression and binary classification in addition to multiclass and all the other nuances."
]
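The function body itself sits outside this hunk; as a rough sketch of what such simplified prediction code can look like (binary classification with mains and pairs, missing values handled, unseen values not), assuming the ebm attributes term_features_, bins_, term_scores_, and intercept_ described in this series:

```python
import math
import numpy as np

def predict_proba_simplified(ebm, X):
    """Sketch only: binary-classification probabilities, mains and pairs."""
    probabilities = []
    for sample in X:
        logit = ebm.intercept_[0]  # binary classification: one intercept logit
        for term_idx, features in enumerate(ebm.term_features_):
            tensor_index = []
            for feature_idx in features:
                bin_levels = ebm.bins_[feature_idx]
                # mains use bin_levels[0]; pairs use the coarser resolution
                bins = bin_levels[min(len(bin_levels), len(features)) - 1]
                val = sample[feature_idx]
                if isinstance(bins, dict):
                    bin_idx = bins[val]  # categorical (unseen not handled)
                elif val is None or (isinstance(val, float) and math.isnan(val)):
                    bin_idx = 0  # the 0th bin is reserved for missing values
                else:
                    bin_idx = np.digitize(val, bins) + 1  # skip missing bin
                tensor_index.append(bin_idx)
            logit += ebm.term_scores_[term_idx][tuple(tensor_index)]
        probabilities.append(1.0 / (1.0 + math.exp(-logit)))
    return np.array(probabilities)
```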
12 changes: 6 additions & 6 deletions _sources/ebm-internals-multiclass.ipynb
@@ -8,7 +8,7 @@
"\n",
"This is part 3 of a 3 part series describing EBM internals and how to make predictions. For part 1, click [here](./ebm-internals-regression.ipynb). For part 2, click [here](./ebm-internals-classification.ipynb).\n",
"\n",
"In this part 3 we'll cover multiclass, specified bin cuts, term exclusion, and unknown values. Before reading this part you should be familiar with the information in [part 1](./ebm-internals-regression.ipynb) and [part 2](./ebm-internals-classification.ipynb)"
"In this part 3 we'll cover multiclass, specified bin cuts, term exclusion, and unseen values. Before reading this part you should be familiar with the information in [part 1](./ebm-internals-regression.ipynb) and [part 2](./ebm-internals-classification.ipynb)"
]
},
{
@@ -224,7 +224,7 @@
"source": [
"ebm.term_scores_[0] is the lookup table for the nominal categorical feature containing country names. For multiclass, each bin consists of an array of logits with 1 logit per class being predicted. In this example, each row corresponds to a bin. There are 4 bins in the outer index and 3 class logits in the inner index.\n",
"\n",
"Missing values are once again placed in the 0th bin index, shown above as the first row of 3 logits. The unknown bin is the last row of zeros.\n",
"Missing values are once again placed in the 0th bin index, shown above as the first row of 3 logits. The unseen bin is the last row of zeros.\n",
"\n",
"Since this feature is a nominal categorial, we use the dictionary {'Fiji': 1, 'Peru': 2} to lookup which row of logits to use for each categorical string."
]
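A small self-contained sketch of that dictionary lookup, with a zero table standing in for ebm.term_scores_[0]:

```python
import numpy as np

categories = {'Fiji': 1, 'Peru': 2}   # ebm.bins_[0][0] in this example
term_scores = np.zeros((4, 3))        # stand-in for ebm.term_scores_[0]

row_idx = categories.get('Peru', -1)  # unseen strings map to the last row (-1)
logits = term_scores[row_idx]         # one row: 3 class logits
```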
@@ -242,7 +242,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"ebm.term_scores_[1] is the lookup table for the continuous feature. Once again, the 0th and last index are for missing values, and unknown values respectively. This particular example has 5 bins consisting of the 0th missing bin index, the three partitions from the 2 cuts, and the unknown bin index. Each row is a single bin that contains 3 class logits. "
"ebm.term_scores_[1] is the lookup table for the continuous feature. Once again, the 0th and last index are for missing values, and unseen values respectively. This particular example has 5 bins consisting of the 0th missing bin index, the three partitions from the 2 cuts, and the unseen bin index. Each row is a single bin that contains 3 class logits. "
]
},
{
@@ -291,17 +291,17 @@
"\n",
" if isinstance(bins, dict):\n",
" # categorical feature\n",
" # 'unknown' category strings are in the last bin (-1)\n",
" # 'unseen' category strings are in the last bin (-1)\n",
" bin_idx = bins.get(feature_val, -1)\n",
" else:\n",
" # continuous feature\n",
" try:\n",
" # try converting to a float, if that fails it's 'unknown'\n",
" # try converting to a float, if that fails it's 'unseen'\n",
" feature_val = float(feature_val)\n",
" # add 1 because the 0th bin is reserved for 'missing'\n",
" bin_idx = np.digitize(feature_val, bins) + 1\n",
" except ValueError:\n",
" # non-floats are 'unknown', which is in the last bin (-1)\n",
" # non-floats are 'unseen', which is in the last bin (-1)\n",
" bin_idx = -1\n",
" \n",
" tensor_index.append(bin_idx)\n",
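After this loop fills tensor_index, the notebook's next step (outside this hunk) indexes the term's score tensor with it; a small self-contained sketch with hypothetical shapes:

```python
import numpy as np

# hypothetical pair term: 4 x 5 bins, 3 class logits per cell
term_scores = np.zeros((4, 5, 3))
tensor_index = [1, 2]  # bin indices produced by the loop above

sample_logits = np.zeros(3)
sample_logits += term_scores[tuple(tensor_index)]  # adds one logit per class
```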
8 changes: 4 additions & 4 deletions _sources/ebm-internals-regression.ipynb
@@ -107,9 +107,9 @@
"\n",
"If there are any feature values that are equal to a bin cut value, they are placed into the upper bin choice. To convert a continuous feature into bins, we can use the numpy.digitize function with a slight adjustment that we'll see below.\n",
"\n",
"EBMs also include 2 special bins: the missing bin, and the unknown bin. The missing bin is the bin that we use if there are any feature values that are missing, like NaN or 'None'. The unknown bin is used whenever we see a categorical value during prediction that was not present in the training set. For example, if our testing dataset had the categorical value \"Vietnam\", or \"Brazil\", then we would use the unknown bin in this example since those countries did not appear in the training set.\n",
"EBMs also include 2 special bins: the missing bin, and the unseen bin. The missing bin is the bin that we use if there are any feature values that are missing, like NaN or 'None'. The unseen bin is used whenever we see a categorical value during prediction that was not present in the training set. For example, if our testing dataset had the categorical value \"Vietnam\", or \"Brazil\", then we would use the unseen bin in this example since those countries did not appear in the training set.\n",
"\n",
"The missing bin is always located at the 0th index, and the unknown bin is always at the last index."
"The missing bin is always located at the 0th index, and the unseen bin is always at the last index."
]
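A minimal sketch of the adjustment mentioned above, using the example's cut points: numpy.digitize returns 0-based region indices, so adding 1 leaves index 0 free for the missing bin.

```python
import numpy as np

cuts = np.array([7.5, 8.5])  # example cut points from this notebook
for val in [7.0, 7.5, 8.0, 9.0]:
    # ties go to the upper bin; +1 reserves index 0 for missing values
    print(val, np.digitize(val, cuts) + 1)
# prints bin indices 1, 2, 2, 3; the index after 3 is the unseen bin
```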
},
{
@@ -145,7 +145,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"ebm.term_scores_[1] is the lookup table for the continuous feature in our dataset. The 0th index is again reserved for missing values, and the last index is again reserved for unknown values. In the context of a continuous feature, the unknown bin is used for anything that cannot be converted to a float, so if we had been given the string \"BAD_VALUE\" instead of a number, then we'd use the last bin. The unknown bin's score value can optionally be set to NaN if you would prefer an error condition.\n",
"ebm.term_scores_[1] is the lookup table for the continuous feature in our dataset. The 0th index is again reserved for missing values, and the last index is again reserved for unseen values. In the context of a continuous feature, the unseen bin is used for anything that cannot be converted to a float, so if we had been given the string \"BAD_VALUE\" instead of a number, then we'd use the last bin. The unseen bin's score value can optionally be set to NaN if you would prefer an error condition.\n",
"\n",
"The 3 remaining scores in the middle correspond to the 3 binned regions, which in our example are: bin #1 [-numpy.inf, 7.5), bin #2 [7.5, 8.5), and bin #3 [8.5, +numpy.inf].\n",
"\n",
@@ -176,7 +176,7 @@
"source": [
"<h2>Sample code</h2>\n",
"\n",
"Finally, here's some code which puts the above considerations together into a function that can make predictions for simplified scenarios. This code does not handle things like interactions, missing values, unknown values, or classification.\n",
"Finally, here's some code which puts the above considerations together into a function that can make predictions for simplified scenarios. This code does not handle things like interactions, missing values, unseen values, or classification.\n",
"\n",
"If you need a drop-in complete function that can work in all EBM scenarios, see the multiclass example in [part 3](./ebm-internals-multiclass.ipynb) which handles regression and binary classification in addition to multiclass and all the other nuances."
]
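The function body sits outside this hunk; as a rough sketch under the same limitations (pure GAM regression, so terms are assumed to line up one-to-one with features, and missing/unseen values are not handled):

```python
import numpy as np

def predict_simplified(ebm, X):
    """Sketch only: pure-GAM regression predictions."""
    predictions = []
    for sample in X:
        score = ebm.intercept_  # regression stores a single intercept
        for term_idx, feature_val in enumerate(sample):
            bins = ebm.bins_[term_idx][0]  # main-effect binning resolution
            if isinstance(bins, dict):
                bin_idx = bins[feature_val]  # categorical string -> bin index
            else:
                # +1 because the 0th bin is reserved for missing values
                bin_idx = np.digitize(feature_val, bins) + 1
            score += ebm.term_scores_[term_idx][bin_idx]
        predictions.append(score)
    return np.array(predictions)
```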
2 changes: 1 addition & 1 deletion _sources/ebm-internals.md
@@ -5,4 +5,4 @@ For a reference on the ExplainableBoostingClassifier and ExplainableBoostingRegr
This section is divided into 3 parts that build upon each other:
[Part 1](./ebm-internals-regression.ipynb) Covers regression for pure GAM models (no interactions).
[Part 2](./ebm-internals-classification.ipynb) Covers binary classification with interactions, ordinals, and missing values.
- [Part 3](./ebm-internals-multiclass.ipynb) Covers multiclass, and 'unknowns'.
+ [Part 3](./ebm-internals-multiclass.ipynb) Covers multiclass, and unseen values.
@@ -248,7 +248,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>For RMSE regression, the EBM's intercept should be close to the mean</h2>"
"<h2>For RMSE regression, the EBM's intercept should be identical to the mean</h2>"
]
},
{
@@ -257,7 +257,7 @@
"metadata": {},
"outputs": [],
"source": [
"print(np.average(y))\n",
"print(np.average(y_train))\n",
"print(ebm.intercept_)"
]
},
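A quick hedged check of that claim, with y_train and ebm assumed from the notebook's training cells:

```python
import numpy as np

# for RMSE regression the intercept should equal the training-label mean
assert np.isclose(np.average(y_train), ebm.intercept_)
```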
Large diffs in the following files are not rendered by default:

12 changes: 6 additions & 6 deletions aplr.html
14 changes: 7 additions & 7 deletions dpebm.html
12 changes: 6 additions & 6 deletions dt.html
40 changes: 20 additions & 20 deletions ebm-internals-classification.html
54 changes: 27 additions & 27 deletions ebm-internals-multiclass.html
40 changes: 20 additions & 20 deletions ebm-internals-regression.html
2 changes: 1 addition & 1 deletion ebm-internals.html
@@ -423,7 +423,7 @@ <h1>EBM internals<a class="headerlink" href="#ebm-internals" title="Permalink to
<p>This section is divided into 3 parts that build upon each other:
<a class="reference internal" href="ebm-internals-regression.html"><span class="doc std std-doc">Part 1</span></a> Covers regression for pure GAM models (no interactions).
<a class="reference internal" href="ebm-internals-classification.html"><span class="doc std std-doc">Part 2</span></a> Covers binary classification with interactions, ordinals, and missing values.
<a class="reference internal" href="ebm-internals-multiclass.html"><span class="doc std std-doc">Part 3</span></a> Covers multiclass, and ‘unknowns’.</p>
<a class="reference internal" href="ebm-internals-multiclass.html"><span class="doc std std-doc">Part 3</span></a> Covers multiclass, and unseen values.</p>
<div class="toctree-wrapper compound">
</div>
</section>
Large diffs in the following files are not rendered by default:

12 changes: 6 additions & 6 deletions ebm.html
35 changes: 23 additions & 12 deletions framework.html
56 changes: 39 additions & 17 deletions index.html
8 changes: 4 additions & 4 deletions lime.html
14 changes: 7 additions & 7 deletions lr.html
8 changes: 4 additions & 4 deletions msa.html
8 changes: 4 additions & 4 deletions pdp.html