DataGenerators is a data generation package. It enables the generators for structured data to be defined in a natural manner as Julia functions. It can apply search and optimisation techniques to find data that, for example, can improve software testing by generating more effective test data.
You can write your own data generators utilizing the full power of Julia, or use the DataGeneratorTranslators package to automatically create data generators from specifications such as Backus-Naur Form (BNF), XML Schema Definition (XSD), and regular expressions.
Install by cloning the package directly from GitHub -- including two packages it requires, DataGeneratorTranslators
and BaseTestMulti
-- from a Julia REPL:
julia> Pkg.clone("https://github.com/simonpoulding/BaseTestMulti.jl")
julia> Pkg.clone("https://github.com/simonpoulding/DataGeneratorTranslators.jl")
julia> Pkg.clone("https://github.com/simonpoulding/DataGenerators.jl")
Don't forget to load the package:
julia> using DataGenerators
A generator consists of rules which are written as Julia functions, and are defined using the @generator
macro:
julia> @generator NumXStrGen begin
start() = join(plus(item()))
item() = 'X'
item() = choose(Int,0,9)
end
Data generation begins by executing the start
rule-function, which in turn calls other rule-functions and accepts the values they return. The value returned by the start
rule-function is the value emitted when the generator is run.
Here the start
rule-function uses plus
, a special DataGenerator construct that creates a list of length of 1 or more by repeatedly executing its argument. The length of the list is decided each time plus
is executed.
The argument of plus
is a call to the item
rule-function. There are two item
rule-functions defined in the generator, and which one of the two is executed is decided each time the rule-function is called.
The second item
rule-function uses another DataGenerator construct, choose
, to select a value from a data type: here an integer between 0 and 9. Again, which value is returned is decided each time choose
is executed.
The macro creates the generator as Julia type, in this case named NumXStrGen
. To run the generator so that it emits a datum, first create an instance of the generator, and then call choose on that instance:
julia> g = NumXStrGen()
data generator NumXStrGen with 3 choice points using sampler choice model
julia> choose(g)
"8X37X"
julia> choose(g)
"X0"
(For convenience, it is also possible to apply choose
directly to the generator type itself: choose(NumXStrGen)
.)
The NumXStrGen
generator emits a string consisting of digits and the letter X. The length of the list returned by plus
, which item
rule-function is executed, and the digit returned by choose
are called choice points. The default behaviour is that the random choices are made at these choice points. (A powerful feature of DataGenerators
is that this behaviour can be changed and refined: see the section 'Choice Models' below.) Therefore each time the generator is run using choose
, a string of different length and consisting of different combinations of Xs and digits is returned.
Rule choice points occur when two or more rule-functions in the generator have the same name.
The default behaviour is to choose one of the rule-functions at random, with all having the same probability of being chosen.
Sequence choice points are defined using one of the following constructs:
mult(x)
- returns a list (Vector) of zero or more itemsplus(x)
- returns a list of one or more itemsreps(x, a, b)
- returns a list with length betweena
andb
inclusive (ifb
is omitted, then there is no upper bound)
x
is often a call to another rule-function, in which case either the just rule-function name (e.g. item
) or the full call syntax (item()
) may be used. x
many also be any expression that returns a value, including a constant.
The default behaviour is to choose the length of list according to a Geometric distribution so that short lists are more likely than longer lists.
Value choice points are defined using one of the following constructs:
choose(Bool)
- returnstrue
orfalse
(i.e. a value of type Bool). The default behaviour is to choose these values using a Bernoulli distribution such that true and false have the same probability of being chosen.choose(T, a, b)
whereT
is one of the following numeric types:Int
,Int8
,Int16
,Int32
,Int64
,UInt
,UInt8
,UInt16
,UInt32
,UInt64
,Float16
,Float32
, orFloat64
- returns a value of that is betweena
andb
. If botha
andb
are omitted, no bounds are placed on the value chosen. Ifb
is omitted, no upper is placed on the value chosen. The default behaviour is to choose from values according a uniform distribution so that all values in the range have the same probability of being chosen.choose(String, r)
- returns a string that conforms to the regular expressionr
(r
should be specified using a Julia String rather than a Regex type). Ifr
is not specified, a variable length string of any characters is returned. (This construct uses theDataGeneratorTranslators
package to construct additional generator rule-functions to return values that satisfy the regular expression.)
The Julia built-in function rand
could be used instead of choose
when random values are required, but choose
is preferred since enables finer control over how the values are chosen (see the section 'Choice Models' below). A call to rand
is not identified as a choice point by the @generator
macro.
Although generator rule-functions resemble the production rules of a formal grammar, they differ in that they are functions written in a Turing complete language, here Julia. This is one of the distinguising features of DataGenerators
. It enables rule-functions to be much expressive and compact than formal production rules since, like any other Julia functions, rule-functions can use all the features of the Julia standard library and installed packages. For example, the NumXStrGen
generator above uses the function join
from the standard library to concenate the items in a list to form a string.
The rule-functions need not be limited to short-form function syntax such as item() = choose(Int,0,9)
. Longer-form syntax may also be used in generators, including functions with local variables:
julia> @generator FibStrGen begin
start() = begin
join(plus(fib), " ")
end
function fib()
fn0 = 0
fn1 = 1
for i in 1:choose(Int,0,10)
fn2 = fn0 + fn1
fn0 = fn1
fn1 = fn2
end
fn0
end
end
julia> choose(FibStrGen)
"55 1 34"
julia> choose(FibStrGen)
"0 21 2 55"
Another difference from the production rules of a formal grammar is that arguments may be passed between rule-functions. A consequence of this is that generator as a whole, or a subset of rule-functions, can pass state between them. This mechanism enables constraints between elements within the datum emitted by the generator to be satisfied in a straightforward manner. For example:
julia> @generator DateGen begin
start() = begin
y = year()
m = month()
d = day(y, m)
Date(y, m, d)
end
year() = choose(Int, 1583, 2999)
month() = choose(Int, 1, 12)
day(y, m) = choose(Int, 1, Dates.daysinmonth(Date(y, m)))
end
(Dates.daysinmonth
is a built-in Julia function)
Generators may call other generators. The generators that are called -- the subgenerators -- are declared as parameters in the generator definition, and then instances of the subgenerators are passed as arguments when an instance of the generator is created. For example:
julia> @generator DictGen(keyGen, valueGen) begin
start() = Dict(plus(pair))
pair() = choose(keyGen)=>choose(valueGen)
end
julia> @generator ShortStringGen begin
start() = choose(String, "[A-Z]{5,15}")
end
julia> @generator SmallIntGen begin
start() = choose(Int16)
end
julia> sg = ShortStringGen()
data generator ShortStringGen with 2 choice points using sampler choice model
julia> ig = SmallIntGen()
data generator SmallIntGen with 1 choice points using sampler choice model
julia> g = DictGen(sg, ig)
data generator DictGen with 1 choice points using sampler choice model
julia> choose(g)
Dict{String,Int16} with 3 entries:
"DPFMV" => 16152
"MUFIOY" => -17445
"RSNWTD" => 5122
Additional named parameters can be passed to choose(g)
(where g is a generator instances) to control the generation process:
startrule=<rulename>
(default::start
) - generation begins from specified rule name (which should be specified as a Symbol)maxchoices=<integer>
(default: 10017) - specifies the maximum number of choices to be made, above which an exception is raisedmaxruledepth=<integer>
(default: 11765) - specifies the maximum rule call depth, above which an exception is raisedmaxseqreps=<integer>
(default: 4848) - specifies the an upper limit on the length of lists (sequences), above which an exception is raised
The purpose of the last three parameters is to limit data structures that are unbounded in size (e.g. tree-like structures). A GenerationTerminatedException
is raised when one of the limits is exceeded. Alternatively, robustchoose
can be used in place of choose
in which case the exception is silently caught and nothing
returned as the datum from the generator.
As alternative to writing generators manually, the DataGeneratorTranslators package automatically create generators from supported specifications such as Backus-Naur Form (BNF), XML Schema Definition (XSD), and regular expressions. The resulting generator code can be used immediately by the DataGenerators
package, refined manually (e.g. to incorporate constraints missing from the specification), or as the starting point for manually-created generator. See the README in the DataGeneratorTranslators package for more details.
A choice model determines how choices at choices points are made. There is a clear separation of choice model from the generator in DataGenerators
, and this abstraction permits algorithms to be applied to the choice model regardless of the specifics of the generator. For example, the choice model can be manipulated by metaheuristic algorithms to optimise the characteristics of the data returned by generator; one such characteristic may be the effectiveness of the data in finding faults when used as test inputs.
Choice models may be deterministic or stochastic. By default, a generator is assigned a sampler choice model which is stochastic: each choice point has a probability distribution assigned to it and random numbers sampled from the distribution are used to make the choices. The default behaviour described in the section 'Choice Points' above is a result of the sampler choice model.
Each type choice model provided by DataGenerators
supplies a function that sets the choice model of a generator instance:
julia> @generator SmallIntGen begin
start() = choose(Int16)
end
julia> g = SmallIntGen()
data generator SmallIntGen with 1 choice points using sampler choice model
julia> setchoicemodel!(g, SimpleChoiceModel())
data generator SmallIntGen with 1 choice points using simple choice model
julia> setchoicemodel!(g, SamplerChoiceModel(g))
data generator SmallIntGen with 1 choice points using sampler choice model
julia> setchoicemodel!(g, NMCSChoiceModel(g, x->length(x)))
data generator SmallIntGen with 1 choice points using NMCS choice model (policy: sampler choice model)
The simple choice model is a naive stochastic choice model that is used mainly for testing the DataGenerators
package. The NMCS choice model is described below in the section 'Optimising the Generation Process'.
Further choice models, such as deterministic choice models, will be provided in future versions of the DataGenerators
package. Currently, the default sampler choice model is recommended for generating random data.
Choice models typically have parameters. The parameters of a sampler choice model are the set of distribution parameters that control the shape of the probability distributions assigned to each choice point. For example, a Geometric distribution that determines the length of lists created by a sequence choice point has one parameter between 0 and 1. Values of this parameter closer to 0 result in longer sequence on average; values closer to 1 results in shorter lists on average. Manipulating the parameters of the choice model changes the characteristics of the data emitted by the generator. To put it another way: the sampler choice model defines a probability distribution over all data that could be emitted by the generator, and changing the parameters changes that distribution.
If we wish to optimise the probability distribution defined by the choice model -- for example, to favour test data with particular useful characteristics -- this can be done by manipulating the parameters of the choice model. The following functions are provided for this purpose:
choicemodel(g)
whereg
is a generator instance - returns the current choice model for that generatorgetparams(cm)
wherecm
is a choicemodel - returns the choice model parameters as aVector{Float64}
setparams!(cm, v)
wherecm
is a choicemodel, andv
is aVector{Float64}
- sets the choice model parameters to those in Vector (in the same order asgetparams
)paramranges(cm)
wherecm
is a choicemodel - returns the bounds on the choice model parameters as aVector{Tuple{Float64,Float64}}
(in the same order asgetparams
) where the first entry in the Tuple is the lower bound, and the second the upper bound
To facilitate optimisation by metaheuristic algorithms, the sampler choice model accepts any vector of parameters that satisfy the ranges specified by paramranges
. Any constraints between subsets of parameter values in the Vector that are not met are handled sensibly by `setparams!' instead of raising an exception, and so such constraints need not be considered by the algorithm.
The following example uses Different Evolution algorithm to optimise the parameters of the model to return arithmetic expressions with a length of approximately 100. (To install the BlackBoxOptim
package used in this example, use Pkg.add("BlackBoxOptim")
.)
using DataGenerators
using BlackBoxOptim
# generator for arithmetic expressions
@generator ExprGen begin
start() = expression()
expression() = operand() * operator() * operand()
operand() = "(" * expression() * ")"
operand() = (choose(Bool) ? "-" : "") * join(plus(digit))
digit() = choose(Int,0,9)
operator() = "+"
operator() = "-"
operator() = "/"
operator() = "*"
end
# function to return length of the expression, or a penalty value of 1000 if a limit was reached in the generator
exprlength(g) = begin
expr = robustchoose(g)
expr == nothing ? 1000 : length(expr)
end
# mean length of 200 expressions sampled for the generator
meanexprlength(g) = mean([exprlength(g) for i in 1:200])
# function to return a closure that acts as a fitness function (lower is better) for the generator
fitnessfn(g) = params -> begin
setparams!(choicemodel(g), params)
abs(100 - meanexprlength(g))
end
# create a generator instance
eg = ExprGen()
# optimise for a maximum of 60 seconds
optimresult = bboptimize(fitnessfn(eg); SearchRange=paramranges(choicemodel(eg)), MaxTime=60.0)
# set the parameter of the choice model to the best candidate found
setparams!(choicemodel(eg), best_candidate(optimresult))
# generate using the optimised choice model
choose(eg)
See our paper "Finding Test Data with Specific Properties via Metaheuristic Search" [1] for an example of this technique applied to generating trees of a specified size and depth.
An alternative to optimising the choice model parameters is to optimise each choice made by the generator as it is made. One such approach is to 'look ahead' to estimate the effect of each choice using Nested Monte-Carlo Search (NMCS), a form of Monte-Carlo Tree Search. We implement this as the NMCS choice model as the following example, using the same expression generator as above, demonstrates:
using DataGenerators
@generator ExprGen begin
start() = expression()
expression() = operand() * operator() * operand()
operand() = "(" * expression() * ")"
operand() = (choose(Bool) ? "-" : "") * join(plus(digit))
digit() = choose(Int,0,9)
operator() = "+"
operator() = "-"
operator() = "/"
operator() = "*"
end
# define fitness function for a single datum (penalty value of 1000 if a limit was reached in the generator)
# target is expression of length 100
fitness(expr) = expr == nothing ? 1000 : abs(100 - length(expr))
# create a generator instance
eg = ExprGen()
# set NMCS choice model: the second parameter is a fitness function applied to the generated datum,
# and the third number of choices to evaluate at each choice point: higher numbers improve accuracy
# at the cost of time taken to generate
# note that the original choice model is retained as the policy used by NMCS
setchoicemodel!(eg, NMCSChoiceModel(eg, fitness, 2))
# generate using the NMCS choice model
choose(eg)
See our papers "Generating Structured Test Data with Specific Properties using Nested Monte-Carlo Search" [2] and "The Automated Generation of Human-Comprehensible XML Test Sets [3] for examples of this technique applied to generating trees and XML respectively.
A second approach for optimising choice model parameters is to re-estimate these parameters from choices made when generating data that has the desired characteristics. This approach uses generation traces -- records of the choices made -- that can be obtained by using generate
instead of choose
to execute the generator. In the following example, this technique is applied to the same expression generator as in the two examples above:
using DataGenerators
# generator for arithmetic expressions
@generator ExprGen begin
start() = expression()
expression() = operand() * operator() * operand()
operand() = "(" * expression() * ")"
operand() = (choose(Bool) ? "-" : "") * join(plus(digit))
digit() = choose(Int,0,9)
operator() = "+"
operator() = "-"
operator() = "/"
operator() = "*"
end
# create a generator instance
eg = ExprGen()
# execute the generator 200 times, keeping traces where the expression length is within 20% of the target length
besttraces = Vector{Any}()
for i in 1:200
try
expr, state = generate(eg)
if 80 <= length(expr) <= 120
push!(besttraces, state.cmtrace)
end
catch _e
if !isa(_e, GenerationTerminatedException)
throw(_e)
end
end
end
# re-estimate choice model parameters from the best traces
estimateparams!(choicemodel(eg), besttraces)
# generate using the optimised choice model
choose(eg)
See our paper "Automated Random Testing in Multiple Dispatch Languages" [7] for an example of this technique applied to the automated generation of typed data for testing Julia functions.
DataGenerators is based in a number of research articles describing our approach (called GodelTest):
[1] R. Feldt and S. Poulding, "Finding Test Data with Specific Properties via Metaheuristic Search", ISSRE 2013 (best paper award!)
[2] S. Poulding and R. Feldt, "Generating Structured Test Data with Specific Properties using Nested Monte-Carlo Search", GECCO 2014
[3] S. Poulding and R. Feldt, "The Automated Generation of Human-Comprehensible XML Test Sets, NasBASE, 2015
[4] S. Poulding and R. Feldt, "Re-using Generators of Complex Test Data", ICST 2015
[5] R. Feldt and S. Poulding, "Broadening the Search in Search-Based Software Testing: It Need Not Be Evolutionary", SBST 2015
[6] R. Feldt, S. Poulding, D. Clark and S. Yoo, "Test Set Diameter: Quantifying the Diversity of Sets of Test Cases", ICST 2016
[7] S. Poulding and R. Feldt, "Automated Random Testing in Multiple Dispatch Languages", ICST 2017