docs: documentation for reference types
sbrugman committed Jul 5, 2022
1 parent e4df434 commit 9cf8117
Showing 5 changed files with 101 additions and 48 deletions.
47 changes: 1 addition & 46 deletions docs/source/configuration.rst
@@ -6,52 +6,6 @@ Some more details on stability report settings, in particular how to set:
the reference dataset, binning specifications, monitoring rules, and where to plot boundaries.


Reference types
---------------

When generating a report from a DataFrame, the reference type can be set with the option ``reference_type``,
in four different ways:

1. Using the DataFrame on which the stability report is built as a self-reference. This reference method is static: each time slot is compared to all the slots in the DataFrame (all included in one distribution). This is the default reference setting.

.. code-block:: python

    # generate stability report with specific monitoring rules
    report = df.pm_stability_report(reference_type="self")

2. Using an external reference DataFrame or set of histograms. This is also a static method: each time slot is compared to all the time slots in the reference data.

.. code-block:: python

    # generate stability report with specific monitoring rules
    report = df.pm_stability_report(reference_type="external", reference=reference)

3. Using a rolling window within the same DataFrame as reference. This method is dynamic: we can set the size of the window and the shift from the current time slot. By default the 10 preceding time slots are used as reference (shift=1, window_size=10).

.. code-block:: python

    settings = Settings()
    settings.comparison.window = 10
    settings.comparison.shift = 1

    # generate stability report with specific monitoring rules
    report = df.pm_stability_report(reference_type="rolling", settings=settings)

4. Using an expanding window on all preceding time slots within the same DataFrame. This is also a dynamic method, with variable window size. All the available previous time slots are used. For example, if we have 2 time slots available and shift=1, window size will be 1 (so the previous slot is the reference), while if we have 10 time slots and shift=1, window size will be 9 (and all previous time slots are reference).

.. code-block:: python

    settings = Settings()
    settings.comparison.shift = 1

    # generate stability report with specific monitoring rules
    report = df.pm_stability_report(reference_type="expanding", settings=settings)

Note that, by default, popmon also performs a rolling comparison of the histograms in each time period with those in the
previous time period. The results of these comparisons contain the term "prev1", and are found in the comparisons section
of a report.


Binning specifications
----------------------

@@ -277,6 +231,7 @@ Now that spark is installed, restart the runtime.
.config("spark.sql.session.timeZone", "GMT")
.getOrCreate()
)
Troubleshooting Spark
~~~~~~~~~~~~~~~~~~~~~

1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -14,6 +14,7 @@ Contents
:maxdepth: 2

introduction
reference_types
profiles
comparisons
tutorials
7 changes: 5 additions & 2 deletions docs/source/introduction.rst
@@ -9,7 +9,8 @@ And you probably want to know this, as you might want to retrain your model.

To monitor the stability over time, we have developed popmon (**pop**\ ulation shift **mon**\ itor). Popmon takes as input a DataFrame (either pandas or Spark), of which one of the columns should represent the date, and will then produce a report that indicates how stable all other columns are over time.

For each column, the stability is determined by taking a reference (for example the data on which you have trained your classifier) and contrasting each time slot to this reference. This can be done in various ways:
For each column, the stability is determined by taking a :doc:`reference <reference_types>` (for example the data on which you have trained your classifier) and contrasting each time slot to this reference.
This can be done in various ways:

* :doc:`Profiles <profiles>`: for example tracking the mean over time and contrasting this to the reference data. Similar analyses can be done with other summary statistics, such as median, min, max or quartiles.
* :doc:`Comparisons <comparisons>`: statistically comparing each time slot to the reference data (for example using Kolmogorov-Smirnov, chi-squared, or Pearson correlation).
@@ -52,4 +53,6 @@ Of course, the exact thresholds (four and seven standard deviations) can be conf

Illustration of how traffic light bounds are determined using reference data.

For speed of processing, the data is converted into histograms prior to the comparisons. This greatly simplifies comparisons of large amounts of data with each other, which is especially beneficial for Spark DataFrames. In addition, it enables you to store the histograms together with the report (since the histograms are just a fraction of the size of the original data), making it easy to go back to a previous report and investigate what happened.
For speed of processing, the data is converted into histograms prior to the comparisons.
This greatly simplifies comparisons of large amounts of data with each other, which is especially beneficial for Spark DataFrames.
In addition, it enables you to store the histograms together with the report (since the histograms are just a fraction of the size of the original data), making it easy to go back to a previous report and investigate what happened.
8 changes: 8 additions & 0 deletions docs/source/popmon.pipeline.rst
@@ -12,6 +12,14 @@ popmon.pipeline.amazing\_pipeline module
:undoc-members:
:show-inheritance:

popmon.pipeline.dataset\_splitter module
----------------------------------------

.. automodule:: popmon.pipeline.dataset_splitter
:members:
:undoc-members:
:show-inheritance:

popmon.pipeline.metrics module
------------------------------

86 changes: 86 additions & 0 deletions docs/source/reference_types.rst
@@ -0,0 +1,86 @@
Reference types
===============

When generating a report from a DataFrame, the reference type is selected with the ``reference_type`` option.
The following reference types are supported:

+-----------------+
| Reference Type |
+=================+
| Self |
+-----------------+
| External |
+-----------------+
| Rolling |
+-----------------+
| Expanding |
+-----------------+

Note that, by default, ``popmon`` also performs a rolling comparison of the histograms in each time period with those in the
previous time period. The results of these comparisons contain the term "prev1", and are found in the comparisons section
of a report.

Self reference
--------------

The DataFrame on which the stability report is built is used as its own reference. This reference method is static: each time slot is compared to all the slots in the DataFrame (all included in one distribution). This is the default reference setting.

.. code-block:: python

    # generate a stability report that uses the full DataFrame as reference
    report = df.pm_stability_report(reference_type="self")

By default, the self-reference compares against the full dataset.
It is also possible to use only a subset from the beginning of the data as the reference, e.g. the training data for a model.
The size of this subset is determined by the ``split`` parameter.
``split`` accepts (1) a number of samples (integer), (2) a fraction of the dataset (float) or (3) a condition (string).

.. code-block:: python

    # use the first 1000 rows as reference
    report = df.pm_stability_report(reference_type="self", split=1000)

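
How an integer or fractional ``split`` could translate into a reference subset is sketched below in plain Python. The helper ``reference_size`` is hypothetical and does not reflect popmon's internals; it assumes that the rows after the reference subset are the ones being monitored. String conditions are left out because their syntax is not described here.

.. code-block:: python

    from typing import Union


    def reference_size(n_rows: int, split: Union[int, float]) -> int:
        """Number of leading rows used as reference (hypothetical helper)."""
        if isinstance(split, bool):
            raise TypeError("split must be an int or a float")
        if isinstance(split, int):
            size = split  # absolute number of samples
        elif isinstance(split, float):
            if not 0.0 < split < 1.0:
                raise ValueError("a fractional split must lie between 0 and 1")
            size = int(n_rows * split)  # fraction of the dataset, rounded down
        else:
            raise TypeError("split must be an int or a float")
        if not 0 < size < n_rows:
            raise ValueError("split must leave rows for both reference and comparison")
        return size


    # stand-in for a dataset of 5000 rows, ordered in time
    rows = list(range(5000))
    n_ref = reference_size(len(rows), 0.2)
    reference, remainder = rows[:n_ref], rows[n_ref:]
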
External reference
------------------

An external reference DataFrame or set of histograms is used as reference. This is also a static method: each time slot is compared to all the time slots in the reference data.

.. code-block:: python

    # generate a stability report against an external reference
    report = df.pm_stability_report(reference_type="external", reference=reference)

Rolling reference
-----------------

A rolling window within the same DataFrame is used as reference. This method is dynamic: both the size of the window and its shift from the current time slot can be set. By default the 10 preceding time slots are used as reference (``shift=1``, ``window=10``).

.. code-block:: python

    from popmon.config import Settings

    # reference_type should be passed to the settings when provided
    settings = Settings(reference_type="rolling")
    settings.comparison.window = 10
    settings.comparison.shift = 1

    # alternatively, set the reference type on an existing settings object
    settings.reference_type = "rolling"

    # generate a stability report with a rolling reference
    report = df.pm_stability_report(settings=settings)

Expanding reference
-------------------

An expanding window over all preceding time slots within the same DataFrame is used as reference. This is also a dynamic method, with a variable window size: all available previous time slots are used. For example, with 2 time slots available and ``shift=1``, the window size is 1 (so the previous slot is the reference), while with 10 time slots and ``shift=1``, the window size is 9 (and all previous time slots form the reference).

.. code-block:: python

    from popmon.config import Settings

    settings = Settings(reference_type="expanding")
    settings.comparison.shift = 1

    # generate a stability report with an expanding reference
    report = df.pm_stability_report(settings=settings)

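
The window logic of the rolling and expanding references can be illustrated with a small, self-contained sketch. It is not popmon's implementation: the helper ``reference_slots`` and its 0-based slot indexing are assumptions made purely for illustration.

.. code-block:: python

    from typing import List, Optional


    def reference_slots(current: int, shift: int = 1, window: Optional[int] = None) -> List[int]:
        """Return the indices of the time slots used as reference for slot ``current``.

        Hypothetical helper: ``window=None`` mimics an expanding reference,
        an integer ``window`` mimics a rolling reference.
        """
        last = current - shift  # most recent slot that is part of the reference
        if last < 0:
            return []  # no earlier slots available yet
        first = 0 if window is None else max(0, last - window + 1)
        return list(range(first, last + 1))


    # expanding: with 10 time slots, the last slot (index 9) is compared to the 9 before it
    print(reference_slots(current=9, shift=1))
    # rolling: only the 3 slots directly preceding slot 9 are used
    print(reference_slots(current=9, shift=1, window=3))

With ``window=1`` and ``shift=1`` the helper selects only the previous slot, which is the idea behind the "prev1" comparisons mentioned at the top of this page.
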
