- Title: Nullable Redesign
- Authors: John Myles White <[email protected]>
- Created: December 9, 2016
- Status: work in progress
The representation of missing values in Julia is too complicated. We currently have at least three kinds of null-like constructs:
- A
Void
type with a singleton valuenothing
, which is the return value of functions that don't return another kind of value; - A
Nullable{T}
parametric type with many values, which allows programmers to represent the possibility of missing/invalid values; - An
#undef
quasi-value, which reflects the use of null pointers in Julia's implementation and is exposed after allocating memory without initializing that memory.
This Julep describes a proposal to merge these three constructs into one construct.
In particular, the proposal is to replace all existing null-like constructs
with a redesigned version of the current Nullable
type. This new version of
Nullable
will involve both (a) changing the internal implementation of the
current Nullable
type and (b) clarifying the semantics of the Nullable
type. Because the name Nullable{T}
is a little long, we will use the
notation, ?T
, as a short-hand for Nullable{T}
throughout this document.
Making this short-hand a genuine part of Julia's syntax is one part of this
Julep's proposed design.
The core goals for this Julep are to:
- Simplify the language by replacing many similar null-like constructs with a single null-like construct.
- Ensure that the resulting representation of
null
is a first-class citizen of the language. In particular, we will remove the second-class citizen status of#undef
, which is a quasi-value that can be present in uninitialized data structures, but cannot be inserted into those data structures. - Clarify how functions can be applied to nullable types by formalizing an
extensible protocol for applying methods defined over
T
to values of type?T
. - Make it possible to write optimized code for nullable types without having
to call
Core.Inference.return_type
orpromote_type
. - Introduce syntactic sugar for nullable types and also introduce special binary operators for working with nullable-typed objects.
The optional goals for this Julep are to:
- Replace the
NullableArray{T}
type withArray{?T}
, while maintaining the ability to pass the raw masked values as a single block of memory through Julia's FFI in the way thatNullableArray{T}
currently allows passing thevalues
field without passing theisnull
mask.
The prerequisites for this Julep are to:
- Complete optimizations to the Julia compiler to handle
Union{T, Void}
as efficiently as our currentNullable
type handles the union of the set of non-null valuesNullable{T}(val)
and the singleton set containing the null valueNullable{T}()
. These compiler changes may include memory layout changes as well as optimization of run-time dispatch among pre-specified types via low-cost branches.
Assuming that compiler optimizations are put in place for union types, the
first step for this Julep is to replace the concrete type Nullable{T}
with an
optimized non-concrete union type. The options for replacing Nullable{T}
that have been proposed are the following union types:
Union{T, Void}
Union{Some{T}, Void}
Union{T, Null{T}}
Any newly optimized union type would have the same semantics as union types currently have: in particular, a union type is not a concrete type and, as a consequence, there are no values of any non-trivial union type. The use of a non-concrete type is important both for optimization purposes and for clarifying the possible semantics we can define for the new nullable types, as we would be replacing a concrete type with a non-concrete type.
Deciding between these three options will require balancing a variety of
concerns. The arguments for each option are summarized below. We include all
of them for clarity, although we believe that the arguments on behalf of
Union{Some{T}, Void}
are the strongest.
When a function returns Union{T, Void}
, the user can immediately apply all
methods defined on T
when the value happens to be of type T
. Allowing users
to call existing methods without having to unwrap non-null values can be
convenient.
In this regard, the use of a Union{T, NAtype}
pattern in DataArrays was
perceived by many to offer greater convenience than our current Nullable{T}
type provides, although the use of a union type was also the cause of that
library's historical poor performance because of the compiler's non-optimized
handling of union types. This Julep assumes that union types will be optimized
such that poor performance is no longer a consequence of using a union type as
the implementation of ?T
.
Instead of concerns about inferior performance, one major argument against
using Union{T, Void}
is that a Union{T, Void}
type can only be used
reliably in code if the methods being called on instances of the union type
are defined for both the T
and Void
sub-types of that union. Going forward,
this document will refer to these two cases as the non-null and null cases,
respectively.
Because the Union{T, Void}
representation uses a union type, it is possible
for a method to be defined only over a subset of the values in the union. In
particular, the possibility of directly calling f(x::T)
on a value of type
Union{T, Void}
means that it is possible for code to handle the non-null case
without being capable of handling the null case. We will refer to situations in
which code handles only one of the non-null or null cases without handling the
other case as failing to exhibit totality, a term used because the code does
not define an execution path for the totality of all possible inputs.
To ensure totality, users must either (a) use functions for which both
f(x::T)
and f(x::Void)
are defined or (b) use higher-order functions
that can take in a function with a method of the form f(x::T)
and employ a
default semantics for null values to handle the f(x::Void)
case. In
particular, we will refer to the pattern of extending an f(x::T)
method to
handle f(x::Union{T, Void)
by assuming that f(x::Void) = nothing
as
lifting.
The unique concern with Union{T, Void}
is f(x::T)
will often be defined
because it is exists even in the parts of Julia that deal with only
non-nullable types. When T
is Int
, this path is any function on integers.
Many such functions exist and it is possible that users will assume that
f(x::Void)
will also exist because f(x::Int)
exists.
In DataArrays, the closely related problem of needing definitions of
f(x::NAtype) = NA
for all functions f
was addressed by manually defining
f(x::NAtype) = NA
for all functions f
that commonly occurred in practice.
Although this piece-meal approach worked effectively in many cases because the
majority of real code makes use of only a small number of core functions,
there was always yet another function that to had be defined manually when a
user found that f(x::NAtype)
was not defined for their favorite function f
.
In this sense, the exact problems introduced by manually vectorizing scalar
functions in Base Julia were repeated again when adding methods for NAtype
:
instead of describing a universal pattern using a generic abstraction, we
repeated that pattern manually over and over again, while always discovering
new cases that had been left out.
In addition to requiring repetitive code to provide uniform semantics for
f(x::Void)
, there are concerns that the convenience of not having to unwrap
values may encourage unsafe practices that will not be caught by shallow testing
practices. For example, if a user defines f(x::T)
, but not f(x::Void)
,
and they only test their code against values of type T
, they may release
code that they believe will work for Union{T, Void}
, but will actually
raise a MethodError
at run-time if nothing
is ever encountered.
That said, if we adopted typealias Nullable{T} Union{T, Void}
as our new
implementation of Nullable{T}
, such concerns about safety could be addressed
by static analysis tools that check that null values are always handled
whenever they are allowed by a method's type signature. We might use something
like the approach taken in
Flow for Javascript.
Finally, the use of Union{T, Void}
introduces issues with method specificity
and the related issue of replacing run-time dispatch with compile-time
dispatch. In particular, the core problem is that T <: Union{T, Void}
, so the
types are not disjoint, leading to specificity conflicts.
To see how the lack of disjointness can introduce problems, imagine that a user defines the following two functions:
f(x::Int) = 1
f(x::Union{Int, Void}) = 2
In this case, the second method, which provides a definition for values of type
Int
, will actually never be called on any values of type Int
because the
second method definition is strictly less specific than the first method
definition for values of type Int
. This can even lead to universally dead
paths in code that are not obvious at first glance:
f(x::Int) = 1
f(x::Union{Int, Void}) = isa(x, Void) ? 2 : 3
Here the second part of the branch will never be taken, but the unreachability
of this branch can only be determined by non-local analysis as it is the
existence of the method, f(x::Int) = 1
, that renders a branch in other code
unreachable. Such dead paths are expected in large Julia codebases and can be
addressed by static analysis tools, but it may be problematic that this
representation of ?T
could increase the frequency with which these dead paths
are defined by accident.
As such, the Union{T, Void}
approach suggests that one would only need to
define methods for f(x::Int)
and f(x::Void)
in practice, rather than
the union type. Because the f(x::Void)
definitions would be trivial in
many cases, we would still need to use general approaches for lifting.
This approach involves wrapping the non-null side of a nullable in a Some
wrapper that allows one to distinguish T
from Some{T}
. This small amount of
indirection has many implications because it ensures that
T <: Union{Some{T}, Void}
is not true, breaking many of the patterns we noted
with Union{T, Void}
.
For example, because a wrapper is employed, no functions with methods defined
on T
will work on objects of type Union{Some{T}, Void}
unless those methods
are defined for f(x::Some{T})
and f(x::Void)
. For example, the existence of
f(x::Int)
does not imply that f(x::Some{Int})
or f(x::Void)
will be
defined.
This limitation can also be framed as a benefit because it ensures that
multiple methods can co-exist without the specificity concerns raised for
Union{T, Void}
.
For example, it would be possible to write code that hits all branches of the following method definitions:
f(x::Int) = 1
f(x::Union{Some{Int}, Void}) = isa(x, Void) ? 2 : 3
To repeat this point in other terms,
the use of Some{T}
instead of T
provides a way of having the T
and ?T
types co-exist without specificity issues, but it simultaneously ensures
that no existing methods for T
automatically apply to ?T
. It provides
arguably cleaner semantics (by making T
and ?T
disjoint) at the cost of
making the non-null case for nullable types less convenient to work with.
As such, this implementation of ?T
would require that we push users towards
the consistent usage of specialized lifting machinery that knows how to handle
f(x::Union{Some{T}, Void})
given only knowledge of how f(x::T)
works. For
one approach to this problem, see the later section on lifting functions from
non-nullable types to nullable types.
Because users will have to work with lifted functions, this approach makes it
more difficult for users to write code that works for the non-null case, but
does not work for the null case. But the possibility of defining methods for
only one of Some{T}
or Void
means that a totality checker would still be
needed because users can always define only one of f(x::Some{T})
and
f(x::Void)
. (We should note that the current implementation of Nullable{T}
also does not ensure totality.)
Another point in favor of Some{T}
is that it is sometimes useful to
distinguish between two confusable outcomes in cases where a method returns
something like Union{Some{Void}, Void}
-- as might happen in a hash table
implementation that returns Some(nothing)
when a key is present and
associated with nothing
as a value, but returns nothing
when a key is not
present. This is another implication of working with disjoint types for T
and
?T
.
On the other hand, the use of Some{T}
raises issues about how we expect T
,
Some{T}
and Void
to interact. For example, consider the following code:
function sum(xs::Vector{Union{Some{Int}, Void}})
s = 0
for x in xs
s += x
end
return x
end
What type does this return when xs = ?Int[]
? What type does it return when
xs = ?Int[Some(1)]
? What type does it return when xs = ?Int[nothing]
?
At a minimum, the use of Some{T}
would require careful thought about
type conversion and type promotion rules since we would need to define methods
for combinations like +(x::Int, y::Some{T})
. Almost surely, we would need to
allow promoting T
to Some{T}
to make common code patterns work without
much verbosity.
This final approach allows us to distinguish Null{Bool}
from Null{Int}
in
cases in which that distinction might be useful. R makes such a distinction,
although it seems to be done only because it is falls out naturally from an
implementation that uses distinct sentinel values for each nullable type in
the language. Otherwise, this approach is similar to the Union{T, Void}
case
described above, except for one crucial problem that would seem to suggest this
approach offers few benefits over our current implementation of Nullable{T}
.
That problem is essentially the problem we already have with Nullable{T}
:
writing functions that return concrete types as output generally requires
information about the return types of any functions used inside of a function
because one needs to know the return type associated with the non-null case
even when writing code to handle the null case. Needing to know about the code
path not taken will be referred to the "counterfactual return type problem" in
what follows.
To see how this issue arises in practice, consider the following possible
implementation of map(f, x::Nullable{T})
given our current implementation of
Nullable{T}
:
function map(f, x::Nullable{T})
if !isnull(x)
return Nullable(f(get(x)))
else
return Nullable{Core.Inference.return_type(f, (T, ))}()
end
end
For example, ensuring that map(sqrt, Nullable{Int})
always returns
Nullable{Float64}
requires that the null path in the code be aware of the
return type of sqrt(x::Int)
. This information is not generally available in
Julia.
In static languages with arrow types, the counterfactual return type problem is
not a problem at all because the return type of a function is guaranteed to be
well-defined by the semantics of the language and guaranteed to be knowable at
compile-time, so one can refer to properties of f
without having to call into
the implementation details of the language. But Julia does not provide such
guarantees, so type-inference must be invoked to determine information about
the return type of the counterfactual branch. An alternative approach is to
use promote_type
to explicitly register return types, but this is clearly
problematic as it essentially duplicates a function's declared return type,
but does so using a mechanism that can be inaccurate.
Note that this problem does not generally occur for the non-null case because
we are expected to evaluate f
in that scenario. It is only a problem when we
need to know the return type of f
, but must not determine this type by
any actual evaluation of f
. Because this problem does not occur in the
non-null case, it does arise for either Union{T, Void}
or
Union{Some{T}, Void}
. The problem is unique to Union{T, Null{T}}
and our
current Nullable{T}
implementation as one must determine the type parameter
T
even for the null case.
Once an implementation of Nullable{T}
is agreed upon, the core semantic issue
for nullable types is maximizing code reuse by making lifting easy to use. In
other words, we want to make it easy to evaluate f(x::Nullable{T})
without
requiring users to define methods for every function to handle the null case,
for which the semantics are so consistent. We want to avoid requiring that
users write code for the null case because we know that the overwhelming
majority of functions will return null
as their output whenever any of
their arguments are null
.
As mentioned earlier, extending the definition of functions from T
to ?T
is called lifting. We propose the following extensible approach to implementing
lifting that defaults to the semantics we expect will occur most commonly, but
allows implementing alternative semantics where needed.
First, we define broadcast(f, x::Nullable)
as follows:
immutable Lifted{T}
func::T
end
lift(f) = Lifted(f)
broadcast(f, x::Nullable) = broadcast(lift(f), x)
broadcast(f′::Lifted, x::Nullable) = hasvalue(x) ? f′.func(value(x)) : null
This allows us to give default semantics to all functions that implement the natural approach to lifting, while still making it possible to handle special case functions that require distinct semantics.
Some such functions are:
isnull(x::?T)
get(x::?T)
get(x::?T, default::T)
Because we would use broadcasting, lifting would happen via dot-syntax:
for example, x .+ y
would be lifted by default. In addition, we would special
several operators like +
, -
, ==
, etc. to automatically be equivalent to
broadcasting on nullables.
Following the lead of
Flow for Javascript, we
would introduce ?T
as sugar for Nullable{T}
. We have also considered C#'s
T?
. Which is chosen will likely depend upon how they interact with the
ternary operator.
We might also introduce a null-coalescing operator, ??
, with
right-associativity that essentially performs lifting: for example,
x ?? y ?? 0
would be equivalent to hasvalue(x) ? x : (hasvalue(y) ? y : 0)
.
Finally, we might introduce .?
for lifted field access such that
x?.field_name
would be equivalent to getfield.(x, :field_name)
.
There are several special operators that need to be defined. In particular, we want to distinguish between non-short-circuiting Booleans and short-circuiting Booleans. Following C#'s lead, we will not allow short-circuiting Boolean operators to be applied to nullable types.
At present, #undef
occurs whenever memory is allocated for mutable types
without initializing that memory. Thus in the current version of Julia, the
following occurs:
julia> Array(String, 1)
1-element Array{String,1}:
#undef
Our proposal would remove this behavior entirely.
Instead, there would be two possibilities:
Array{String}
can never contain#undef
values and therefore must be initialized whenever it is allocated.Array{?String}
is always initialized by the mere act of allocation, as the implementation will default to setting the value ofnull
in all positions in the array.
This would likely occur concurrently with an attempt to use calloc
everywhere instead of malloc
.
When working with nullables, we generally need to check a mask before we can decide the value of an operation on nullable typed objects. This check can sometimes induce a branch that is costly relative to the cost of the core operation being performed (e.g. addition) and the existence of the branch may also impede compiler optimizations such as automatic SIMD vectorization.
As such, we will want to ensure that lift(f, args)
is customized to use
ifelse
vs if
for some functions like addition.
mean(v::Array{?Int})
median(v::Array{?Int})
pdf(Binomial(n, 0.5), x)
wheren::?Int
andx::?Int