Uwe Reichel, audEERING GmbH, Gilching, Germany
- machine learning data splitting tool that allows for:
- group-disjunct splits (e.g. different speakers in train, dev, and test partition)
- stratification on multiple target and grouping variables (e.g. emotion, gender, language)
- set up a virtual environment, activate it, and install splitutils. For Linux this works e.g. as follows:
$ virtualenv --python="/usr/bin/python3" venv_splitutils
$ source venv_splitutils/bin/activate
(venv_splitutils) $ pip install splitutils
- project URL: https://github.com/reichelu/splitutils
- set up a virtual environment venv_splitutils, activate it, and install requirements. For Linux this works e.g. as follows:
$ git clone git@github.com:reichelu/splitutils.git
$ cd splitutils/
$ virtualenv --python="/usr/bin/python3" venv_splitutils
$ source venv_splitutils/bin/activate
(venv_splitutils) $ pip install -r requirements.txt
def optimize_traintest_split(X, y, split_on, stratify_on, weight=None,
                             test_size=.1, k=30, seed=42):
    ''' optimize group-disjunct split into training and test set, which is guided by:
    - disjunct split of values in SPLIT_ON
    - stratification by all keys in STRATIFY_ON (targets and groupings)
    - test set proportion in X should be close to test_size (which is the test
      proportion in set(split_on))

    Args:
    X: (np.array or pd.DataFrame) of features
    y: (np.array) of targets of length N
        if type(y[0]) in ["str", "int"]: y is assumed to be categorical, so that it is
        additionally tested that all partitions cover all classes.
        Else y is assumed to be numeric and no coverage test is done.
    split_on: (np.array) list of length N with grouping variable (e.g. speaker IDs),
        on which the group-disjunct split is to be performed. Must be categorical.
    stratify_on: (dict) Dict-keys are variable names (targets and/or further groupings)
        the split should be stratified on (groupings could e.g. be sex, age class, etc).
        Dict-Values are np.array-s of length N that contain the variable values. All
        variables must be categorical.
    weight: (dict) weight for each variable in stratify_on. Defines their amount of
        contribution to the optimization score. Uniform weighting by default. Additional
        key: "size_diff" defines how the test size difference should be weighted.
    test_size: (float) test proportion in set(split_on), e.g. 10% of speakers to be
        held-out
    k: (int) number of different splits to be tried out
    seed: (int) random seed

    Returns:
    train_i: (np.array) train set indices in X
    test_i: (np.array) test set indices in X
    info: (dict) detail information about reference and achieved prob distributions
        "size_testset_in_spliton": intended test_size
        "size_testset_in_X": optimized test proportion in X
        "p_{c}_ref": reference class distribution calculated from stratify_on[c]
        "p_{c}_test": test set class distribution calculated from stratify_on[c][test_i]
    '''
def optimize_traindevtest_split(X, y, split_on, stratify_on, weight=None,
                                dev_size=.1, test_size=.1, k=30, seed=42):
    ''' optimize group-disjunct split into training, dev, and test set, which is
    guided by:
    - disjunct split of values in SPLIT_ON
    - stratification by all keys in STRATIFY_ON (targets and groupings)
    - dev and test set proportions in X should be close to dev_size and test_size
      (which are the dev and test proportions in set(split_on))

    Args:
    X: (np.array or pd.DataFrame) of features
    y: (np.array) of targets of length N
        if type(y[0]) in ["str", "int"]: y is assumed to be categorical, so
        that it is additionally tested that all partitions cover all classes.
        Else y is assumed to be numeric and no coverage test is done.
    split_on: (np.array) list of length N with grouping variable (e.g. speaker IDs),
        on which the group-disjunct split is to be performed. Must be categorical.
    stratify_on: (dict) Dict-keys are variable names (targets and/or further groupings)
        the split should be stratified on (groupings could e.g. be sex, age class, etc).
        Dict-Values are np.array-s of length N that contain the variable values. All
        variables must be categorical.
    weight: (dict) weight for each variable in stratify_on. Defines their amount of
        contribution to the optimization score. Uniform weighting by default. Additional
        key: "size_diff" defines how the corresponding size differences should be weighted.
    dev_size: (float) proportion in set(split_on) for dev set, e.g. 10% of speakers
        to be held-out
    test_size: (float) proportion in set(split_on) for test set
    k: (int) number of different splits to be tried out
    seed: (int) random seed

    Returns:
    train_i: (np.array) train set indices in X
    dev_i: (np.array) dev set indices in X
    test_i: (np.array) test set indices in X
    info: (dict) detail information about reference and achieved prob distributions
        "size_devset_in_spliton": intended grouping dev_size
        "size_devset_in_X": optimized dev proportion of observations in X
        "size_testset_in_spliton": intended grouping test_size
        "size_testset_in_X": optimized test proportion of observations in X
        "p_{c}_ref": reference class distribution calculated from stratify_on[c]
        "p_{c}_dev": dev set class distribution calculated from stratify_on[c][dev_i]
        "p_{c}_test": test set class distribution calculated from stratify_on[c][test_i]
    '''
def binning(x, nbins=2, lower_boundaries=None, seed=42):
    ''' bins numeric data.
    If x is one-dimensional:
        binning is done either intrinsically into nbins classes
        based on an equidistant percentile split, or extrinsically
        by using the lower_boundaries values.
    If x is two-dimensional:
        binning is done by kmeans clustering into nbins clusters

    Args:
    x: (list, np.array) with numeric data.
    nbins: (int) number of bins
    lower_boundaries: (list) of lower bin boundaries.
        If x is 1-dim and lower_boundaries is provided, nbins will be ignored
        and x is binned extrinsically. The first value of lower_boundaries
        is always corrected not to be higher than min(x).
    seed: (int) random seed for kmeans

    Returns:
    c: (np.array) integers as bin IDs
    '''
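The two one-dimensional binning modes described above can be sketched in plain Python. This is only an illustration of the idea, not the code used by splitutils: the helper names `bin_by_lower_boundaries` and `bin_by_percentiles` are hypothetical, the percentile boundaries are approximated by ranks, and the two-dimensional kmeans case is left out.

```python
from bisect import bisect_right


def bin_by_lower_boundaries(x, lower_boundaries):
    """Extrinsic binning: bin ID = index of the last lower boundary <= value."""
    bounds = sorted(lower_boundaries)
    # the first boundary must not be higher than the smallest value
    bounds[0] = min(bounds[0], min(x))
    return [bisect_right(bounds, v) - 1 for v in x]


def bin_by_percentiles(x, nbins=2):
    """Intrinsic binning: nbins classes of roughly equal size (rank-based)."""
    ranked = sorted(x)
    # lower boundary of bin b is the value at rank position b/nbins
    bounds = [ranked[(b * len(ranked)) // nbins] for b in range(nbins)]
    return bin_by_lower_boundaries(x, bounds)


x = [0.1, 0.9, 0.4, 0.7, 0.2, 0.5]
intrinsic = bin_by_percentiles(x, nbins=3)
extrinsic = bin_by_lower_boundaries(x, [0.0, 0.33, 0.66])
print(intrinsic)  # [0, 2, 1, 2, 0, 1]
print(extrinsic)  # [0, 2, 1, 2, 0, 1]
```

For this toy input both modes happen to yield the same bin IDs. With skewed data they generally differ, since the intrinsic boundaries follow the data while the extrinsic ones are fixed.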
if you use this software for a publication please cite:
Reichel, U.: splitutils - machine learning data partitioning software, version 0.3.0. doi:10.5281/zenodo.10793086, 2024.
author = {Reichel, U.},
title = {splitutils -- machine learning data partitioning software, version 0.3.0},
howpublished = {doi:10.5281/zenodo.10793086},
year = {2024}
Example 1: Split dummy data into training and test partitions
- see
- partitions are:
- disjunct on categorical "split_var"
- stratified on categorical "target", "strat_var1", "strat_var2"
- each contain all levels of "target"
import numpy as np
import os
import sys
# add this line if you have cloned the code from github to PROJECT_DIR
# sys.path.append(PROJECT_DIR)
from splitutils import optimize_traintest_split
# set seed
seed = 42
# size
n = 100
# feature array
data = np.random.rand(n, 20)
# target variable
target = np.random.choice(["A", "B"], size=n, replace=True)
# array with variable on which to do a disjunct split
split_var = np.random.choice(["D", "E", "F", "G", "H", "I", "J", "K"],
                             size=n, replace=True)
# dict of variables to stratify on. Key names are arbitrary.
stratif_vars = {
    "target": target,
    "strat_var1": np.random.choice(["L", "M"], size=n, replace=True),
    "strat_var2": np.random.choice(["N", "O"], size=n, replace=True)
}
# weight importance of all stratification variables in stratify_on
# as well as of "size_diff", which punishes the deviation of intended
# and received partition sizes.
# Key names must match the names in stratif_vars.
weights = {
    "target": 2,
    "strat_var1": 1,
    "strat_var2": 1,
    "size_diff": 1
}
# test partition proportion (from 0 to 1)
test_size = .2
# number of disjunct splits to be tried out in brute force optimization
k = 30
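# find the optimal split; keyword names follow the function signature
train_i, test_i, info = optimize_traintest_split(
    X=data,
    y=target,
    split_on=split_var,
    stratify_on=stratif_vars,
    weight=weights,
    test_size=test_size,
    k=k,
    seed=seed
)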
print("test levels of split_var:", sorted(set(split_var[test_i])))
print("goodness of split:", info)
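Note that test_size refers to the share of distinct split_on levels, not to the share of rows, so the proportion of rows that ends up in the test set can deviate from it when groups differ in size. A minimal sketch with made-up speaker IDs (the helper `heldout_proportions` is hypothetical and not part of splitutils):

```python
from collections import Counter


def heldout_proportions(split_on, heldout_groups):
    """Return (share of groups, share of rows) that are held out."""
    rows_per_group = Counter(split_on)
    group_share = len(heldout_groups) / len(rows_per_group)
    row_share = sum(rows_per_group[g] for g in heldout_groups) / len(split_on)
    return group_share, row_share


# made-up grouping variable: 4 speakers with unequal numbers of rows
split_on = ["spk1"] * 5 + ["spk2"] * 2 + ["spk3"] * 2 + ["spk4"] * 1

# holding out one of four speakers is 25% of the groups,
# but here it removes 50% of the rows
group_share, row_share = heldout_proportions(split_on, {"spk1"})
print(group_share, row_share)  # 0.25 0.5
```

The "size_diff" weight controls how strongly such a mismatch is penalized during optimization.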
Example 2: Split dummy data into training, development, and test partitions
- see
- Partitions are
- disjunct on categorical "split_var"
- stratified on categorical "target", "strat_var1", "strat_var2"
- each contain all levels of "target"
import numpy as np
import os
import sys
# add this line if you have cloned the code from github to PROJECT_DIR
# sys.path.append(PROJECT_DIR)
from splitutils import optimize_traindevtest_split
# set seed
seed = 42
# size
n = 100
# feature array
data = np.random.rand(n, 20)
# target variable
target = np.random.choice(["A", "B"], size=n, replace=True)
# array with variable on which to do a disjunct split
split_var = np.random.choice(["D", "E", "F", "G", "H", "I", "J", "K"],
                             size=n, replace=True)
# dict of variables to stratify on. Key names are arbitrary.
stratif_vars = {
    "target": target,
    "strat_var1": np.random.choice(["F", "G"], size=n, replace=True),
    "strat_var2": np.random.choice(["H", "I"], size=n, replace=True)
}
# weight importance of all stratification variables in stratify_on
# as well as of "size_diff", which punishes the deviation of intended
# and received partition sizes.
# Key names must match the names in stratif_vars.
weights = {
    "target": 2,
    "strat_var1": 1,
    "strat_var2": 1,
    "size_diff": 1
}
# dev and test partition proportion (from 0 to 1)
dev_size = .1
test_size = .1
# number of disjunct splits to be tried out in brute force optimization
k = 30
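# find the optimal split; keyword names follow the function signature
train_i, dev_i, test_i, info = optimize_traindevtest_split(
    X=data,
    y=target,
    split_on=split_var,
    stratify_on=stratif_vars,
    weight=weights,
    dev_size=dev_size,
    test_size=test_size,
    k=k,
    seed=seed
)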
print("test levels of split_var:", sorted(set(split_var[test_i])))
print("goodness of split:", info)
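The properties listed for this example (partitions disjunct on split_var, each partition containing all levels of target) can be verified on the returned indices. A minimal sketch in plain Python on made-up toy indices; `check_partitions` is a hypothetical helper, not part of splitutils:

```python
def check_partitions(split_on, y, partitions):
    """Check group-disjunctness and class coverage of index partitions.

    partitions: dict mapping a partition name to a sequence of row indices.
    Returns (is_disjunct, covers_all_classes).
    """
    # grouping levels (e.g. speakers) that occur in each partition
    groups = {name: {split_on[i] for i in idx}
              for name, idx in partitions.items()}
    names = list(groups)
    is_disjunct = all(
        groups[a].isdisjoint(groups[b])
        for pos, a in enumerate(names)
        for b in names[pos + 1:]
    )
    # every partition must contain every class found in y
    all_classes = set(y)
    covers = all({y[i] for i in idx} == all_classes
                 for idx in partitions.values())
    return is_disjunct, covers


# made-up toy data: 6 rows, 3 speakers, 2 classes
split_on = ["s1", "s1", "s2", "s2", "s3", "s3"]
y = ["A", "B", "A", "B", "A", "B"]

clean = {"train": [0, 1], "dev": [2, 3], "test": [4, 5]}
print(check_partitions(split_on, y, clean))  # (True, True)

# moving one row of speaker s1 into the dev set breaks both properties
leaky = {"train": [0], "dev": [1, 2, 3], "test": [4, 5]}
print(check_partitions(split_on, y, leaky))  # (False, False)
```

With the example above, the same check should be applicable to split_var, target, and a dict built from train_i, dev_i, and test_i, since the helper only relies on integer indexing.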
Example 3: Split dummy data into training, development, and test partitions, the target and several stratification variables being numeric
- see
- Partitions are
- disjunct on categorical "split_var"
- stratified on numeric "target", and on 3 other numeric stratification variables
import numpy as np
import os
import sys
# add this line if you have cloned the code from github to PROJECT_DIR
# sys.path.append(PROJECT_DIR)
from splitutils import (
    binning,
    optimize_traindevtest_split
)

'''
example script how to split dummy data into training, development,
and test partitions that are
* disjunct on categorical "split_var"
* stratified on numeric "target", and on 3 other numeric stratification
  variables
'''
# set seed
seed = 42
# size
n = 100
# features
data = np.random.rand(n, 20)
# numeric target variable
num_target = np.random.rand(n)
# array with variable on which to do a disjunct split
split_var = np.random.choice(["D", "E", "F", "G", "H", "I", "J", "K"],
                             size=n, replace=True)
# further numeric variables to stratify on
num_strat_vars = np.random.rand(n, 3)
# intrinsically bin target into 3 bins by equidistant
# percentile boundaries
binned_target = binning(num_target, nbins=3)
# ... alternatively, a variable can be extrinsically binned by
# specifying lower boundaries:
# binned_target = binning(num_target, lower_boundaries=[0, 0.33, 0.66])
# bin other stratification variables into a single variable with 6 bins
# (2-dim input is binned by StandardScaling and KMeans clustering)
binned_strat_var = binning(num_strat_vars, nbins=6)
# ... alternatively, each stratification variable could be binned
# individually - intrinsically or extrinsically the same way as num_target
# strat_var1 = binning(num_strat_vars[:,0], nbins=...) etc.
# now add the obtained categorical variable to stratification dict
stratif_vars = {
    "target": binned_target,
    "strat_var": binned_strat_var
}
# weight importance of all stratification variables in stratify_on
# as well as of "size_diff", which punishes the deviation of intended
# and received partition sizes
weights = {
    "target": 2,
    "strat_var": 1,
    "size_diff": 1
}
# dev and test partition proportion (from 0 to 1)
dev_size = .1
test_size = .1
# number of disjunct splits to be tried out in brute force optimization
k = 30
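# find the optimal split; keyword names follow the function signature.
# y is assumed here to be the numeric target (no class coverage test);
# stratification uses its binned version in stratif_vars.
train_i, dev_i, test_i, info = optimize_traindevtest_split(
    X=data,
    y=num_target,
    split_on=split_var,
    stratify_on=stratif_vars,
    weight=weights,
    dev_size=dev_size,
    test_size=test_size,
    k=k,
    seed=seed
)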
print("test levels of split_var:", sorted(set(split_var[test_i])))
print("goodness of split:", info)
- find optimal train, dev, and test set split based on:
- disjunct split of a categorical grouping variable G (e.g. speaker)
- optimized joint stratification on an arbitrary number of categorical target and grouping variables (e.g. emotion, gender, ...)
- close match of partition proportions in G and underlying dataset X
- brute-force optimization on k disjunct splits of G
- score to be minimized for train/test set split:
(sum_v[w(v) * irad(v)] + w(d) * d) / (sum_v[w(v)] + w(d))
v: variables to be stratified on
w(v): their weight
irad(v): information radius between reference and test set distribution of factor levels in v
d: absolute difference between test proportions of X and G, i.e. between the proportion of test
samples and the proportion of groups (e.g. speakers) that go into the test set
w(d): its weight
- score to be minimized for train / dev / test set split:
(sum_v[w(v) * max_irad(v)] + w(d) * max_d) / (sum_v[w(v)] + w(d))
v: variables to be stratified on
w(v): their weight
max_irad(v): maximum information radius of reference distribution of classes in v and
- dev set distribution,
- test set distribution
max_d: maximum of absolute difference between proportions of X and G (see above) calculated for
the dev and test set
w(d): its weight
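The train/test score and the brute-force search over k disjunct splits can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the splitutils implementation: the information radius is computed here as the Jensen-Shannon divergence with base-2 logarithm (the exact variant used by the library is not stated in this document, so absolute score values may differ), d uses the share of groups actually drawn into the test set, and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def distribution(labels):
    """Relative frequency of each factor level."""
    counts = Counter(labels)
    return {level: n / len(labels) for level, n in counts.items()}


def irad(p, q):
    """Information radius (Jensen-Shannon divergence, base 2) of p and q."""
    total = 0.0
    for level in set(p) | set(q):
        pi, qi = p.get(level, 0.0), q.get(level, 0.0)
        mi = (pi + qi) / 2
        for xi in (pi, qi):
            if xi > 0:
                total += 0.5 * xi * math.log2(xi / mi)
    return total


def traintest_score(stratify_on, split_on, test_idx, weight):
    """(sum_v[w(v) * irad(v)] + w(d) * d) / (sum_v[w(v)] + w(d))"""
    num = 0.0
    for v, values in stratify_on.items():
        p_ref = distribution(values)
        p_test = distribution([values[i] for i in test_idx])
        num += weight[v] * irad(p_ref, p_test)
    # d: |share of test rows in X - share of test groups in G|
    row_share = len(test_idx) / len(split_on)
    group_share = len({split_on[i] for i in test_idx}) / len(set(split_on))
    num += weight["size_diff"] * abs(row_share - group_share)
    # assumes weight holds exactly the keys of stratify_on plus "size_diff"
    return num / sum(weight.values())


def brute_force_split(stratify_on, split_on, test_size, weight, k=30, seed=42):
    """Try k random group-disjunct splits and keep the lowest-scoring one."""
    rng = random.Random(seed)
    groups = sorted(set(split_on))
    n_test = max(1, round(test_size * len(groups)))
    best = None
    for _ in range(k):
        test_groups = set(rng.sample(groups, n_test))
        test_idx = [i for i, g in enumerate(split_on) if g in test_groups]
        score = traintest_score(stratify_on, split_on, test_idx, weight)
        if best is None or score < best[0]:
            best = (score, test_idx)
    return best


# made-up toy data: 100 rows, 8 groups, one binary target
rng = random.Random(0)
n = 100
split_on = [rng.choice("DEFGHIJK") for _ in range(n)]
stratify_on = {"target": [rng.choice("AB") for _ in range(n)]}
weight = {"target": 2, "size_diff": 1}

score, test_idx = brute_force_split(stratify_on, split_on, 0.2, weight)
print(score, sorted({split_on[i] for i in test_idx}))
```

For the train / dev / test case the same structure applies, with irad(v) and d each replaced by the maximum over the dev and test set, as defined above.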
- Let's look at Example 2 above. There, the returned info dict looks like this:
{
    'score': 0.030828359568603338,
    'size_devset_in_spliton': 0.1,
    'size_devset_in_X': 0.14,
    'size_testset_in_spliton': 0.1,
    'size_testset_in_X': 0.13,
    'p_target_ref': {'B': 0.49, 'A': 0.51},
    'p_target_dev': {'A': 0.5, 'B': 0.5},
    'p_target_test': {'A': 0.5384615384615384, 'B': 0.46153846153846156},
    'p_strat_var1_ref': {'G': 0.56, 'F': 0.44},
    'p_strat_var1_dev': {'G': 0.5714285714285714, 'F': 0.42857142857142855},
    'p_strat_var1_test': {'F': 0.5384615384615384, 'G': 0.46153846153846156},
    'p_strat_var2_ref': {'I': 0.48, 'H': 0.52},
    'p_strat_var2_dev': {'I': 0.5, 'H': 0.5},
    'p_strat_var2_test': {'I': 0.46153846153846156, 'H': 0.5384615384615384}
}
- Explanations
- score: see above, the score to be minimized for the train / dev / test set split
- size_devset_in_spliton: proportion of to-be-split-on variable levels in development set
- size_devset_in_X: proportion of rows in X in development set
- size_testset_in_spliton: proportion of to-be-split-on variable levels in test set
- size_testset_in_X: proportion of rows in X in test set
- p_target_ref: reference target class distribution over all data
- p_target_dev: target class distribution in development set
- p_target_test: target class distribution in test set
- p_strat_var1_ref: first stratification variable's reference distribution over all data
- p_strat_var1_dev: first stratification variable's class distribution in development set
- p_strat_var1_test: first stratification variable's class distribution in test set
- p_strat_var2_ref: second stratification variable's reference distribution over all data
- p_strat_var2_dev: second stratification variable's class distribution in development set
- p_strat_var2_test: second stratification variable's class distribution in test set
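To see at a glance where a split deviates most from the reference, the p_... entries can be compared directly. The sketch below copies the distributions from the output shown above; `max_abs_deviation` is a hypothetical helper, not part of splitutils, and the absolute difference is used only as a simple readable measure (it is not the information radius used in the score).

```python
# distributions copied from the info dict of Example 2 shown above
info = {
    "p_target_ref": {"B": 0.49, "A": 0.51},
    "p_target_dev": {"A": 0.5, "B": 0.5},
    "p_target_test": {"A": 0.5384615384615384, "B": 0.46153846153846156},
    "p_strat_var1_ref": {"G": 0.56, "F": 0.44},
    "p_strat_var1_dev": {"G": 0.5714285714285714, "F": 0.42857142857142855},
    "p_strat_var1_test": {"F": 0.5384615384615384, "G": 0.46153846153846156},
    "p_strat_var2_ref": {"I": 0.48, "H": 0.52},
    "p_strat_var2_dev": {"I": 0.5, "H": 0.5},
    "p_strat_var2_test": {"I": 0.46153846153846156, "H": 0.5384615384615384},
}


def max_abs_deviation(info, variables, partitions=("dev", "test")):
    """Largest absolute class-probability difference to the reference,
    per (variable, partition) pair."""
    result = {}
    for v in variables:
        ref = info[f"p_{v}_ref"]
        for part in partitions:
            obs = info[f"p_{v}_{part}"]
            result[(v, part)] = max(
                abs(ref[level] - obs.get(level, 0.0)) for level in ref
            )
    return result


deviations = max_abs_deviation(info, ["target", "strat_var1", "strat_var2"])
worst = max(deviations, key=deviations.get)
print(worst, round(deviations[worst], 4))  # ('strat_var1', 'test') 0.0985
```

Here the test set distribution of strat_var1 is furthest from its reference, while target, which carries the highest weight in the example, is matched more closely.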
- Remarks
- for train/test splits no development set results are reported
- all distribution keys (p_...): key names are derived from the key names in stratify_on
- for