A data-friendly alternative to Nullable #132

Closed
andyferris opened this issue Oct 20, 2016 · 78 comments

Comments

@andyferris
Member

The semantics of Nullable have been under heavy discussion over in Julia base at JuliaLang/julia#19034 (comment).

Personally, I had been somewhat confused as to the core problems but @johnmyleswhite makes a very compelling argument which really solidifies (for me) why there has been tension regarding adding features to Nullable. I'll copy it here (I hope you don't mind, John).

There are two almost totally irreconcilable reasons you might care about Julia's Nullable type:

  • You are a software engineer and you deal with things that are "null" -- like null pointers and null handles to databases. You absolutely do not want such a value to propagate automatically. Many systems have been rebuilt almost from scratch (Javascript in Flow and PHP in Hack) in part because automatic propagation of nulls is a nightmare for large-scale software design.
  • You are a data analyst and you want to deal with missing values. You want to be able to execute arbitrary expressions against databases that may contain missing values and you want these missing values to be propagated automatically.

Julia's Nullable type was not meant to be optimized for either use case-- it was meant to be a building block for other packages to expand upon. This has come into direct conflict with the community's increasing desire to prevent type piracy. There isn't a simple solution, but it's worth keeping in mind that many of the proposals in this thread aren't Pareto improvements for the Julia community: allowing expressions like 1 + Nullable(1) is likely to do harm to software engineers in order to benefit data analysts.

To me, if we are going to make a way forward, we should really begin developing a data-friendly alternative to Nullable and leave Base.Nullable for the software engineers. I think this will allow for much more rapid progress. My proposal would be to create a type which behaves semantically as close as possible to Union{T, NA} while remaining type-stable, where NA is the missing value type in the DataArrays.jl package which behaves somewhat like NaN does for Float64.
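As a rough illustration of what "behaves somewhat like NaN" could mean for a non-float column, here is a minimal sketch in Julia 1.x syntax (newer than the Julia version this thread started on). `NAType` and `NA` are illustrative stand-ins defined on the spot, not the DataArrays.jl definitions.

```julia
# Hypothetical singleton marking a missing observation (a stand-in, not DataArrays.NA).
struct NAType end
const NA = NAType()

# Arithmetic involving NA yields NA, much as NaN propagates through Float64 arithmetic.
Base.:(+)(::NAType, ::Number) = NA
Base.:(+)(::Number, ::NAType) = NA
Base.:(+)(::NAType, ::NAType) = NA

# A column whose element type states that entries may be missing.
column = Union{Int, NAType}[1, NA, 3]

# The missing entry stays missing; the other entries are incremented.
shifted = Union{Int, NAType}[x + 1 for x in column]
```

Only `+` is covered here; a real package would need the same treatment for every operator it wants to propagate through, which is the scaling problem discussed later in the thread.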

However, this needs to be a discussion for the community, and it doesn't necessarily have to involve Base Julia at all. I'll ping a bunch of people and see what happens.

@johnmyleswhite @JuliaData (not sure if that works so I'll just add everyone @ararslan @davidagold @dmbates @kleinschmidt @nalimilan @quinnj @richardreeve @Scidom @shashi @simonbyrne )

@ararslan
Member

I was actually thinking the same thing going through the base thread, that it may be worthwhile to pursue an entirely separate type for statistical missingness. Thanks for getting this discussion started, Andy, much appreciated!

As we think about this, we should also take a hard look back at DataArrays, which was the initial attempt to do exactly this. What did and did not work well?

I'll also cc @JeffBezanson and @StefanKarpinski here.

@andyferris
Member Author

What did and did not work well?

I'll start with the obvious and say type stability was a "slight" issue...

@ararslan
Member

We should also be careful not to negate the massive amount of work many of the JuliaStats folks have put into making Nullables work in a statistical context.

@davidanthoff

I like this idea a lot. In fact, the more I think about it, I think I'll just use that approach for now in Query.jl. There is a super simple way for me to make Query.jl still work with sources and sinks that are based on Nullable, but inside I would use a NAable type...

@andyferris
Member Author

andyferris commented Oct 21, 2016

Sure. There's no reason we can't define convert(::Type{NAable}, ::Nullable) (or whatever we call this new type).
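A sketch of what such a conversion might look like, in Julia 1.x syntax. Since `Nullable` is no longer in Base in Julia 1.x, `SimpleNullable` below is a stand-in with the same shape (a flag plus a value); both types, and their field names, are assumptions made for this example rather than any package's actual definitions.

```julia
# Stand-in for the Base.Nullable of Julia 0.5/0.6: a value plus a "has value" flag.
struct SimpleNullable{T}
    hasvalue::Bool
    value::T
    SimpleNullable{T}() where {T} = new{T}(false)      # null: `value` is left uninitialized
    SimpleNullable{T}(v) where {T} = new{T}(true, v)
end

# Hypothetical data-oriented counterpart with the same layout.
struct NAable{T}
    hasvalue::Bool
    value::T
    NAable{T}() where {T} = new{T}(false)
    NAable{T}(v) where {T} = new{T}(true, v)
end

# Conversions in both directions, so sources and sinks can keep using the nullable type.
Base.convert(::Type{NAable{T}}, x::SimpleNullable{T}) where {T} =
    x.hasvalue ? NAable{T}(x.value) : NAable{T}()
Base.convert(::Type{SimpleNullable{T}}, x::NAable{T}) where {T} =
    x.hasvalue ? SimpleNullable{T}(x.value) : SimpleNullable{T}()

a = convert(NAable{Int}, SimpleNullable{Int}(3))   # holds 3
b = convert(NAable{Int}, SimpleNullable{Int}())    # holds nothing
```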

On that note, the obvious bikeshedding - what is a good name for a stats-y nullable type?

  • Optional{T}
  • Maybe{T}
  • NAable{T}
  • NA{T}
  • Missing{T} (maybe Missable{T})
  • ?{T} (yes this actually works, also ¿{T} and ⸮{T} but they don't parse)
  • Knowable{T} (or Unknowable{T}) - a value that might be either known or unknown.

@davidanthoff

Given that this type is meant to squarely target data scientists, it would be nice to stay as close as possible to names/concepts that folks in that area know. I think that speaks for something with NA in it. If we were to go with NAable, we could define const NA = NAable{Union{}}, and then one could even write things like a == NA instead of isna(a), which I think would make this even simpler to use.

@ararslan
Member

ararslan commented Oct 21, 2016

I dislike NAable (and I also dislike the precedent in R of NA denoting missing). Optional and Maybe I think are more computer science-y than would be familiar for most statistically minded folks (though I think I could learn to embrace either). Of @andyferris' suggestions, 👍 to Missing{T}.

@andyferris
Member Author

we could define const NA = NAable{Union{}}

I'm surprised Nullable{Union{}}() actually constructs... I have to say that is kind-of cool. I do support having a generic null value that can be isequal or maybe == to others.

@ararslan
Member

one could even write things like a == NA instead of isna(a), which I think would make this even simpler to use.

I think a behavior like this could be nice, though in general I think I prefer using is* functions for things like this. But if we do go with NA, that has the potential to be very confusing for people coming here from R, since anything == NA in R is NA.

@davidanthoff

Missing as the type seems odd to me, after all sometimes the value is not missing :) Maybe Missable for the type, and then const missing = Missable{Union{}}()? On the flipside, missing is long, NA is much nicer on that score.

But if we do go with NA, that has the potential to be very confusing for people coming here from R, since anything == NA in R is NA.

That is a good point...

@ararslan
Member

"Missable" is a word but it means something unrelated.

@andyferris
Member Author

andyferris commented Oct 21, 2016

Looking at some other languages (from wikipedia), there are only two common classes of naming:

  • Many languages use a typename like Optional and have a null instance none, nothing or null and many seem to use the word some to construct or pattern match a non-null value.
  • Haskell (and idris): Type Maybe, null instance Nothing and other instance Just x.

Optional is overwhelmingly most common on wikipedia's list. Doesn't mean we can't go our own direction.

"Missable" is a word but it means something unrelated.

Yes I don't think Missable is quite correct.

@nalimilan
Member

I've contemplated creating an alternative nullable-like type a few times, but honestly I'm not sure it would solve all problems.

  1. The argument that we shouldn't repeat the same mistakes as for vectorized versions of functions is fundamentally right as regards the general case: we cannot reimplement all functions manually to have lifting semantics. So we need a generic system to lift a function at the call site. The framework at WIP: Nullable lifting infrastructure JuliaLang/julia#18758 fills most needs, and I think some syntactic sugar like f?(x) or f.?(x) which @JeffBezanson mentioned would be nice. Since that's explicit, it isn't an issue for people who don't want null propagation.

  2. There's actually an argument for not using a custom type: since Nullable is a standard type, we can imagine introducing syntactic sugar for it (as in many other languages), like Int? -> Nullable{Int}, 1? -> Nullable(1), etc. We can't have this with a custom type. Same if you would like all functions to automatically lift by default: this would be possible only via a change in Base.

  3. So the main (or maybe only) conflict regards lifting operators and mixing nullables and scalars. People who don't want nulls to propagate don't want standard operators to have lifting semantics, and it might indeed be hard to satisfy both conceptions of Nullable. I don't have a good solution to that, except maybe to suggest that we use Nullable for missing data semantics, and create a new Option type in Base with a strict non-propagation behavior. Both types would behave the same in most contexts, except with regard to operators. Not sure it's worth it (but the definition of Nullable is only a few lines, so it's not a big cost either).

  4. I don't think we even agree about the semantics == should have on nullables: should it return Bool or Nullable{Bool}? Should NULL == NULL return true or false? I don't even know myself what solution I prefer. So adding a separate type wouldn't solve that debate either (or at least not yet).

@andyferris
Member Author

andyferris commented Oct 21, 2016

I totally agree there will be (major!) difficulties, @nalimilan. And there are major issues of semantics that need to be discussed.

In my opinion, I see two self-consistent, useful, sensible semantics for a nullable type. I think we should implement both.

  • A container which has zero or one elements. All unpacking is done by the user. It is safe for purposes where automatic nullable propagation is better to avoid. This will be Base.Nullable (if only for the fact that that is how Jeff thinks Nullable should behave, and it's more-or-less how things stand now).
  • A new type such as NA{T} which aims to be equivalent to Union{NULL,T} in a semantic sense. I think since we have isequal to solve NaN difficulties, we can have NULL != NULL but isequal(NULL,NULL) = true. This makes NULL and NaN share similar properties. We will have problems making it work for all functions automatically, but we can aim to solve 95% of common cases.
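To make the NaN analogy in the second option concrete, here is a small sketch in Julia 1.x syntax. `NAType` and `NA` are hypothetical names introduced for the example; the point is only that `==` and `isequal` can disagree for a null value in the same way they already do for `NaN`.

```julia
# Hypothetical singleton null for the data-analysis semantics described above.
struct NAType end
const NA = NAType()

# Like NaN: not == to itself, but isequal to itself.
Base.:(==)(::NAType, ::NAType) = false
Base.isequal(::NAType, ::NAType) = true

# Dict lookup relies on isequal and hash, so both NA and NaN still work as keys.
labels = Dict{Any, String}(NA => "missing", NaN => "not a number")
```

Because `!=` falls back to the negation of `==`, `NA != NA` is true here, which matches `NaN != NaN`. Whether that is desirable is exactly what the rest of the thread debates.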

In the future, as ideas progress and Julia evolves, I'm betting we'll find nicer ways of working with the former (e.g. using . broadcast syntax or new syntax with ?).

Even if the latter type is implemented outside of Base (for now) I'm also betting working with it will get easier. E.g. NA{T} could inherit traits from T in a future version of Julia that easily allows traits for dispatch. Or if we solve JuliaLang/julia#14919, then maybe we can define things like

(f::Function)(x::Nullable) = hasvalue(x) ? f(x.value) : NA{return_type(f, Tuple{eltype(x)})}()

There could be a million other helpful things we can't even imagine yet. Like @davidanthoff said quite well here, we shouldn't give up on trying just because it's a little disgusting or difficult or imperfect on our first try.

@nalimilan
Member

I don't see anything in your proposal which says what using a new type would allow us to do right now. How would it help get type-stable nullable arrays/data frames ready for a release in the next few weeks (or even by the time Julia 1.0 is released)?

The (f::Function)(x::Nullable) approach is certainly appealing, but it would only work for one-argument functions. It's hard to design a similar mechanism which would work for any number of arguments (both nullable and non-nullable) without introducing lots of ambiguities with any function which accepts nullables. I think we would need changes inside Julia to get this to work. So, yes, maybe at that point we could introduce a separate type which would automatically lift arguments, but that would be the final step after a full design has been implemented inside Base.

(See discussion at JuliaStats/NullableArrays.jl#85 about == semantics. In general I encourage you to read all issues linked in the various related PRs as well as in JuliaData/DataFrames.jl#1008. There have been quite a lot of design discussions over the years.)

@davidagold

Call overloading on functions doesn't seem to be permitted:

julia> (f::F){F<:Function}(x) = f(x)
ERROR: function type in method definition is not a type

But if it were, writing (f::F){F<:Function}(xs...) = lift(f, xs...) would immediately handle argument signatures of all lengths and "mixedness".
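Without call overloading, an explicit `lift` at the call site can already handle any number of mixed arguments. The following is a sketch in Julia 1.x syntax; `NAable`, `isna`, `unwrap`, `payload_type` and this particular `lift` are all hypothetical names for the example, not the API of the lifting PR mentioned above. It leans on `Base.promote_op`, an unexported helper that asks inference for a result type, so the type parameter of the NA result depends on what inference can prove.

```julia
# Hypothetical wrapper: a flag plus a value, like the Nullable layout.
struct NAable{T}
    hasvalue::Bool
    value::T
    NAable{T}() where {T} = new{T}(false)
    NAable{T}(v) where {T} = new{T}(true, v)
end
NAable(v::T) where {T} = NAable{T}(v)

isna(x::NAable) = !x.hasvalue
isna(x) = false                      # plain values are never NA

unwrap(x::NAable) = x.value
unwrap(x) = x

payload_type(::NAable{T}) where {T} = T
payload_type(x) = typeof(x)

# Apply f to the unwrapped arguments, or return an NA of the inferred result type
# as soon as any argument is NA. Works for any mix of wrapped and plain arguments.
function lift(f, xs...)
    if any(isna, xs)
        T = Base.promote_op(f, map(payload_type, xs)...)
        return NAable{T}()
    end
    return NAable(f(map(unwrap, xs)...))
end

r = lift(+, NAable(1), 2)                   # wraps 3
s = lift(+, NAable{Int}(), 2, NAable(5))    # NA, because one argument is NA
```

The cost is that every call has to opt in by writing `lift(f, ...)`, which is the explicitness that syntax like `f?(x)` was proposed to make cheaper.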

@davidanthoff

I don't see anything in your proposal which says what using a new type would allow us to do right now.

I'm converting Query.jl to use a new type NAable now, and it pretty much solves all the problems I had in that area. The main benefit is that I can now implement the white-list approach without having to a) convince folks that it should be in base (seems hopeless at this point) and b) without type piracy. I should also add that I actually don't have any open questions in terms of what the semantics of these lifted functions should be within the context of Query.jl (things like == etc.), so having my own type allows me to just implement those semantics.

This approach works well for Query.jl, even if the rest of the data universe doesn't adopt the NAable type. Right now I simply convert anything Nullable into a NAable at the start of any query, and then the other way round when someone collects it into a data structure where one would expect a Nullable.

Obviously this doesn't help with the old API stuff like df[:a] = df[:b] + 2 etc., but at least for Query.jl I think that pretty much solves the problem, and it should also make Query.jl pretty usable for a release of DataFrames that uses NullableArrays.

I'd be happy to a) rename the type to something better if there is a consensus and b) move all of that into its own package eventually. But for now I just want to move forward with Query.jl, so I'll just have it there for now.

@quinnj
Member

quinnj commented Oct 21, 2016

It's certainly great to have someone charge forward on this approach. One of my biggest concerns was just finding someone to implement and maintain the code. I think it'll be great to see how things unfold with its usage and it will provide some good chances to learn as we go (separate from Base and with more flexibility to change/update things).

@amellnik

This is an exciting thread. My use of Julia falls squarely into the data analyst camp, and being able to easily work with dataframes is my highest priority in these changes. The allure of DataArrays was significant -- they worked almost exactly like normal arrays in most cases, and isna and similar took care of the rest. The associated performance cost is offset by the ease of use and development in my use case.

@andyferris's second option or something along those lines seems like a good step forward, until Nullables are in a better place. For me at least, the DataFrames master is practically unusable.

@andyferris
Member Author

Thanks @davidanthoff for going forward with this. I was going to tell @nalimilan that unfortunately I quite probably wouldn't have time to create an entirely new and complete system in a few weeks, but it seems that David will. :)

Though, I would recommend splitting off NAable into a package ASAP. This will allow you to journal your design decisions cleanly, gather feedback and feature requests from the community, and let others contribute and make PRs - the administration effort should pay off for you in terms of contributed ideas and code (and I find users are really good at finding bugs). I know you want to move quickly with Query.jl but I hope this wouldn't get in your way (if the package changes API or even name while it is quite young, you won't be annoying anyone too much).

@andyferris
Member Author

I've been reading (and re-reading) about earlier discussions surrounding == as Milan suggested, and I see David and Milan (and many others) have been over all of this before. Though, I still firmly believe that users will sometimes want NULL == NULL to be false and sometimes they will want it to be true. Milan wrote

I've done a small review of choices made by other languages, and indeed they don't seem to bother about the inconsistency between NaN and NULL. All languages I surveyed which support == for the equivalent of NULL consider two NULL values as equal. In Python, None == None -> True (though using is is recommended). In Ruby and Swift, nil == nil -> true. In C++, C#, Java, JavaScript and Kotlin, null == null -> true.

Only SQL, R and SAS return the equivalent of Nullable{Bool}, and there NULL == NULL -> NULL.

However I'm not sure (I could be wrong) those languages had a well thought out system of == and isequal, and < and isless, like Julia has had since the beginning. I think a lot of thought went into those features from the Julia designers, and I really appreciate it. And despite the fact that all those other languages chose something different in the past, I don't think that means we have to do the same (caveat: I realize Query.jl aims to be close to LINQ syntax). I really think we should follow NaN semantics, and from my experience learning Julia I discovered the difference between == and isequal relatively early on so I don't feel this will be a huge burden to new users. After all, the idea is simpler to explain than three valued logic.

Another thing - I saw a comment equating hashing and isequal by Stefan. It would be nice to ensure NAable hashes in a deterministic way if it is NULL (maybe a clever inner constructor could zero out the T bits, or we overload hash).

@nalimilan
Member

Though, I still firmly believe that users will sometimes want NULL == NULL to be false and sometimes they will want it to be true.

And sometimes they will want it to be NULL, which is the hardest decision to make.

Another thing - I saw a comment equating hashing and isequal by Stefan. It would be nice to ensure NAable hashes in a deterministic way if it is NULL (maybe a clever inner constructor could zero out the T bits, or we overload hash).

That's easy: don't hash the value of null.
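A sketch of the "don't hash the value of null" approach, in Julia 1.x syntax. The `NAable` layout, `isna`, and the seed constant are assumptions made for this example; the idea is simply that the `value` field is never read when the flag says the entry is NA, so whatever bits happen to be stored there cannot affect the hash.

```julia
# Hypothetical wrapper: a flag plus a value; `value` is uninitialized when NA.
struct NAable{T}
    hasvalue::Bool
    value::T
    NAable{T}() where {T} = new{T}(false)
    NAable{T}(v) where {T} = new{T}(true, v)
end

isna(x::NAable) = !x.hasvalue

# Two NAables are isequal if both are NA, or if both hold isequal values.
function Base.isequal(a::NAable, b::NAable)
    (isna(a) || isna(b)) && return isna(a) && isna(b)
    return isequal(a.value, b.value)
end

const NA_SEED = UInt(0x4e41)   # arbitrary constant mixed into every NAable hash

# Only read `value` when it is present, so the hash of an NA is deterministic.
Base.hash(x::NAable, h::UInt) =
    isna(x) ? hash(NA_SEED, h) : hash(x.value, hash(NA_SEED, h))

a = NAable{Int}()
b = NAable{Int}()
unique_entries = Set([a, b, NAable{Int}(7), NAable{Int}(7)])
```

With these two methods, `isequal` and `hash` agree with each other, which is what `Set` and `Dict` rely on.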

@davidanthoff

Here we go: https://github.com/davidanthoff/NAables.jl

I also have a branch in Query.jl that uses NAable instead of Nullable everywhere: queryverse/Query.jl#70. It works without problems with any source or sink that uses Nullable, so from an interoperability point of view this doesn't really change anything.

Right now all the code in NAables.jl is also in Query.jl, but that is just until we've decided on a final name for NAables.jl and METADATA registration.

I'm still somewhat nervous about the Query.jl switch, so I'll probably wait a little longer before I merge this. But right now it solves a huge issue, namely I can get rid of all type piracy, and I don't really see any downside...

@ExpandingMan

ExpandingMan commented Oct 24, 2016

I think the answer to the question of the behavior of Nullable(a) == Nullable(b) is simple: it's false, a plain Bool.

This is quite simply because false == (NaN == NaN). It's always been this way, and it's always worked fine, in fact that's why we have isnan. I think the behavior of NaN is a very sensible guide to how to make a data nullable type behave.

Note that since (NaN == NaN) isa Bool even though NaN is floating point, it follows that it's perfectly fine for (Nullable(a) == Nullable(b)) to be a Bool and not a Nullable{Bool}.

@nalimilan
Member

No, it's definitely not that simple (read discussions I linked to above). For example, R uses a completely different approach. There are several legitimate strategies, and it's hard to know which one is best. There are also subsidiary questions, like do we want null != null to return true (like for NaN) or false (for consistency with ==)?

@davidagold

Having Nullable{T}() != Nullable{T}(x) seems like a great way to subtly invalidate an analysis, especially if the rest of your data manipulation framework propagates null values.

@ExpandingMan

I don't know, I've worked with NaN's for a long time. In fact, before my current job I was a physicist, and since I dealt almost exclusively with doubles, NaN was pretty much the only type of null value I ever used, and I never, ever had any reason to complain about it. In fact, if I were still in that position and didn't have to use annoying data types like String and DateTime, I'd probably only ever use NaN now and would be completely disinterested in this whole discussion. Furthermore NaN != NaN was a nice easy way of checking for them.

I guess I have to concede that not everyone shares this experience, but I would like to make one further point: NaN is an ultra-standard thing that's been around for 40 years in all sorts of code used for all sorts of purposes. R (and pandas for that matter, as they do a really abysmal job of dealing with this particular issue, but I guess it is kind of the canonical example of what's wrong with python), on the other hand, is a niche thing.

@nalimilan
Member

If you think R and Pandas are a niche thing, would you also qualify SQL as such (cf. Wikipedia on NULLs in SQL)? Then what's left in the realm of data tables management?

@ExpandingMan

Sorry, I was probably a bit overzealous in my language. I was just trying to make the point that floating point NaN is a very standard, universally understood thing in a way that certain conventions in R and pandas are not. I'm still arguing for using NaN as the paradigm, but completely understand that there are other, perfectly legitimate ways of doing things.

@StefanKarpinski
Member

StefanKarpinski commented Sep 19, 2017

Based on a Slack conversation, this is what I feel needs to be done for 1.0:

  1. Remove Nullable{T} from Base.
  2. Add Some{T} to Base
  3. Use Union{Some{T}, N} and Union{T, N} variously in place of Nullable{T} (depending on whether N <: T or not), where N is either Void or some other engineer’s null type.
  4. Decide what N above is: it could be Void with instance nothing or some new Null type with instance null (presumably).
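A small sketch of why steps 2 and 3 pair `Some{T}` with the null type, written against Julia 1.x, where `Some` is in Base and `Void` has been renamed `Nothing`. The function name `find_value` is made up for the example. When `nothing` is itself a legal stored value (the N <: T case), wrapping hits in `Some` keeps "found a nothing" distinct from "found no entry".

```julia
# Returns Some(value) when the key is present, and `nothing` when it is absent.
# The return type is Union{Some{...}, Nothing}, the engineer's replacement for Nullable.
function find_value(d::Dict{K,V}, key::K) where {K,V}
    return haskey(d, key) ? Some(d[key]) : nothing
end

# Values may legitimately be `nothing`, so a bare `nothing` result would be ambiguous.
settings = Dict{String, Union{Int, Nothing}}("retries" => 3, "timeout" => nothing)

present      = find_value(settings, "retries")   # Some(3)
stored_null  = find_value(settings, "timeout")   # Some(nothing): the key exists
absent       = find_value(settings, "missing")   # nothing: no such key
```

When the null can never be a valid value of `T`, the plain `Union{T, Nothing}` form from step 3 is enough and the `Some` wrapper is unnecessary.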

Things that don't need to be done for 1.0:

  • Add a data scientist’s null to Base. This may never need to happen: there are different semantics of missing data and people may want to use different ones or even many at the same time. The bits-type union optimization work means that user-defined null types are automatically efficient.

  • Give a meaning to T?. Even though we want to use this syntax for some kind of missing data, now that we've freed up the syntax, we can reclaim it at any point in the 1.x release series. We may end up wanting this syntax for the built-in engineer's null, for some "standard" data scientist's null, or let it be user-definable so that different people can decide for themselves what they want it to mean.

@quinnj
Member

quinnj commented Oct 2, 2017

yet there is total disagreement on exactly how such a type should work or even what it should be called

Since there is no single coherent story

These statements are too strong and are probably where some of the frustration is stemming from in this discussion. As far as I know from this and other related discussions, there is only a single dissenter (@davidanthoff) on how a missing value type should behave, be defined, or be called. Everyone else who has chimed in has agreed that true | invalid = invalid, which is the behavior on Nulls.jl master.

For me, I certainly understand the hesitance to consider including another "standard library" package at this point before a 1.0 release, but I also firmly believe that this is a topic that has already been iterated on for years in the julia ecosystem, with all design & iteration leading to current Nulls.jl. I also firmly believe that it would be a godsend for the data ecosystem to have a "blessed" representation of missingness, that packages could rely on and build against. It would go a long ways to making the "marketing event" of 1.0 more complete as a swath of data-related packages could also be released using a coherent strategy for missingness. I think Julia stands to "miss out" on more than it gains by leaving them out for 1.0.

@StefanKarpinski
Member

StefanKarpinski commented Oct 2, 2017

I should point out that any data-frames-like data analysis tooling package should absolutely not make assumptions about what kind of nulls are used. This was one of the major mistakes of the original DataFrames – the DataFrame type should be completely agnostic to how columns are represented and the whole business of DataArrays lying about their element type is a disaster (i.e. claiming that the eltype of DataVector{T} is T instead of Union{T, NAtype}).

If you want a data column that can hold integers or missing values, its type should be Union{Int, Missing}. This means that the user could decide exactly what null semantics they need – it would not be forced on them by the ecosystem. In fact, even if we agree on a single null type for data analysis, the tooling should still be diligent in not assuming that type since it will result in better factored software to not do so, and it will leave the window open for future innovation in this area.

Many people also don't want to have null support forced on them – it should be possible to have an Int typed column that cannot hold any kind of null value. I think we're all on the same page here, but it bears stating explicitly.
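A sketch of what "agnostic to the null type" might look like in practice, assuming Julia 1.x, where `Missing` exists in Base. `Unknown` and `count_nulls` are hypothetical names for the example. The column's element type states exactly what it can hold, and the tooling takes the null type as a parameter instead of assuming one.

```julia
# Count entries of a given null type N without assuming which null the column uses.
count_nulls(col::AbstractVector, ::Type{N}) where {N} = count(x -> x isa N, col)

# A user-defined null with whatever semantics the user wants (hypothetical).
struct Unknown end

plain_ints   = [1, 2, 3]                                   # cannot hold any null
with_missing = Union{Int, Missing}[1, missing, 3]
with_unknown = Union{Int, Unknown}[1, Unknown(), Unknown()]
```

Here `eltype(with_missing)` reports `Union{Int, Missing}` rather than `Int`, so nothing about the column's contents is hidden from generic code.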

@mlhetland

mlhetland commented Oct 2, 2017

@quinnj From what I can see, the definition is (|)(b::Bool, a::Null) = ifelse(b, true, null)? I.e., it would return true, not null (or, under the renaming, invalid; i.e., it would be more like @StefanKarpinski's unknown)? I guess I'm just misunderstanding :-)

@quinnj
Member

quinnj commented Oct 2, 2017

@mlhetland, no, sorry, it was my mistake of saying Nulls.jl follows @StefanKarpinski's description of invalid. I confused it with the earlier discussion around true == null = null, and failed to look closely enough to see it was in fact true | null = true.
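For reference, the "unknown" reading of `|` quoted above extends naturally to `&`. This is a sketch in Julia 1.x syntax with a locally defined `Null` type, so it illustrates the three-valued logic being discussed rather than reproducing the Nulls.jl source; only the first method mirrors the definition quoted by @mlhetland, the rest are assumptions.

```julia
# Locally defined stand-in for a null type with "unknown" semantics.
struct Null end
const null = Null()

# true | unknown is true whatever the unknown is; false | unknown stays unknown.
Base.:(|)(b::Bool, ::Null) = ifelse(b, true, null)
Base.:(|)(n::Null, b::Bool) = b | n
Base.:(|)(::Null, ::Null) = null

# false & unknown is false whatever the unknown is; true & unknown stays unknown.
Base.:(&)(b::Bool, ::Null) = ifelse(b, null, false)
Base.:(&)(n::Null, b::Bool) = b & n
Base.:(&)(::Null, ::Null) = null
```

Under the stricter "invalid" reading, every one of these methods would return `null` instead.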

@richardreeve

@quinnj's "marketing event" is exactly what I'm talking about - I think this would be a real declaration that the language is intended for use by life scientists, and I could personally use this to good effect tomorrow to argue that we ought to be thinking of a (long-term!) strategy of moving our department's research and ultimately our teaching over to a real programming language.

I also believe that @ExpandingMan is completely wrong both about NaN-like behaviour being okay and about people not caring about inconsistencies. Missing data is everywhere in the life sciences, and behaves exactly how @quinnj and I have described. It's also often not floating point - factors are a big issue in particular. I've personally stopped using R in anger completely and replaced it with Julia, but I will never try to teach a language with contradictory and inconsistent syntax to my students because it's more trouble than it's worth.

I'm not trying to put words in anyone's mouth (sorry, @StefanKarpinski!), I'm just telling you what I believe to be true. As everyone acknowledges, this is a really important thing to get right, but we know how to do it now, and so we should just bite the bullet and do it.

PS None of this requires DataFrame to behave in any particular way, as I said above, and it would be a bad thing if it did, as would lying about element types. All it requires is to define in Base some type that behaves like Null in JuliaData/Nulls.jl master - I couldn't care less what it's called - and then define a non-overloadable ? postfix operator such that T? is just Union{T, Null}. Then we have a language that has built-in support for missing data.

@rofinn
Member

rofinn commented Oct 2, 2017

FWIW, I still haven't seen a strong argument for why Nulls.jl needs to be in Base. As it stands, I'd like to keep Base as minimal as possible, so the language can iterate independent of these datascience features. Is the issue simply not wanting to type using Nulls?

@StefanKarpinski
Member

You wouldn't even need to do using Nulls if packages for loading data used Nulls for you.

@quinnj
Member

quinnj commented Oct 2, 2017

@rofinn, technically, it's mainly about being able to have the syntax T? desugar to Union{T, Null}, as well as having a few definitions for any, all, and == on AbstractArrays that work correctly on Union{T, Null} eltype arrays, which could otherwise be considered type piracy, (though admittedly a lightweight form, so more like type-petty-theft).

Another benefit I just thought of is the "proximity to the type system"; the Null type in Nulls.jl carries interesting semantics and use-cases with it as a form of "sentinel value for any type when part of a Union". Having the type definition "close" to where changes to the type system happen is nice because it gets more actively considered, much like @JeffBezanson's comment the other day here.

I stand by my arguments from above though that it also sends out a strong message that we've finally settled on a sensible standard that everyone can rely on.

@StefanKarpinski
Member

There are a few other reasons one might want a built in null type. Making null || true == true and null || false == null etc. work is one. There's also some consideration about whether nothing is appropriate as a missing or unknown value since it's more like C's void return type – which cannot be used as a value – than like C's null value.

@rofinn
Member

rofinn commented Oct 2, 2017

though admittedly a lightweight form, so more like type-petty-theft

Yeah, I'd be okay with that.

I'm still concerned that we're conflating missing values in software development and statistics. I can understand wanting something like Union{Some{T}, Null} and Union{T, Null} in Base instead of Nullable, but I feel like that's a separate conversation from how nulls should be used for statistics. IMHO, I'd still expect 2 + null to error and 2 + NA == NA.
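The distinction in that last sentence can be sketched with two separate singleton types, in Julia 1.x syntax. Both `Null`/`null` and `NAType`/`NA` are defined locally for the example and are not any package's definitions: the engineer's null gets no arithmetic methods at all, so using it in an expression fails loudly, while the statistical NA opts in to propagation.

```julia
# Engineer's null: deliberately has no arithmetic defined.
struct Null end
const null = Null()

# Statistician's missing value: arithmetic propagates it.
struct NAType end
const NA = NAType()
Base.:(+)(::Number, ::NAType) = NA
Base.:(+)(::NAType, ::Number) = NA

# Helper: report whether calling f() raises a MethodError.
function throws_method_error(f)
    try
        f()
        return false
    catch err
        return err isa MethodError
    end
end

null_fails    = throws_method_error(() -> 2 + null)   # no method +(::Int, ::Null)
na_propagates = (2 + NA) === NA
```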

@richardreeve

I'm not sure why you'd expect that. Adding a number to a missing value isn't an error, we just don't know what the answer is, surely? Or am I getting confused about what you're saying?

@andyferris
Member Author

The problem with ~ was not that it was overloadable, but that it was a macro – that was just such a weird quirk of the operator that it ended up being regrettable almost as soon as we introduced it.

Yes, there were many problems here... however the specific problem I was referring to was scarcity - that it was unique. It was the only macro operator, and it becomes a race to see who claims it first. I decided not to use it specifically because I could foresee that another popular package might be used in tandem with whatever it was I was playing with at the time (sorry, the details elude me right now). It seems to me that if there were different "nulls" using ? from different packages, it's easy to foresee some user wanting (or being forced into by dependencies) to use more than one such package simultaneously.

@andyferris
Member Author

andyferris commented Oct 2, 2017

I would recommend that the name of the missing value type should suggest its semantics. The word "null" is lousy since it suggests no semantics at all. There are at least three kinds of missing data: ...

+100 to this.

Not intentionally trying to muddy the waters, but I did start playing with Optional{T} being a container with a single Union{T, Unknown{T}} inside. I assumed it would be easier for users to know how to overload a method to take Unknown{T} - consider the projection of the function from all members of T and write down the most specific output (such as true | Unknown{Bool}() == true but true & Unknown{Bool}() == Unknown{Bool}()). There were issues discussed relating to covariance vs invariance and the pros and cons of having the T parameter, but I still think "unknown" conveys the semantics for what data scientists want for missing data (well, at least in my case).

As for null / Null.... I do wonder whether this should be something that has to do with missing references and null pointers. (E.g. a Nullable{T} could be much like a RefValue{T} but be allowed to contain the zero pointer - together with the realization that mutable struct T is behaving more-or-less like a Ref{T} (for an immutable struct T) with some getfield/setfield! overloading, it might work out OK for uninitialized arrays of references, etc - I dunno, pretty speculative and off-topic, but just making the point that the term "null" could have a better/different/more precise use).

@rofinn
Member

rofinn commented Oct 2, 2017

@richardreeve Adding a number to a missing value could be an error depending on what the context of that missing value is (e.g., we don't know what T is in Union{T, Null}). Given the broad use of null (or Nullable) in software development I'd be inclined to err on the side of not propagating. I'm fine if there is a separate NA type which does propagate, but that provides context on what the missing value represents and how it is intended to be used.

@richardreeve

@rofinn Okay, I was getting confused about what you were saying - I thought you were discussing using null in statistics from the context of the message... I agree that other contexts have different requirements. Hence your observation that we're conflating missing values in different situations I guess! I'd be very happy to see different names (e.g. NA) for the different contexts to avoid confusion, but that doesn't mean we don't know how each should work in its own right.

@andyferris
Member Author

andyferris commented Oct 2, 2017

At the moment I see these 4 as semantically clear and distinct possibilities. @rofinn, @richardreeve - it seems to me that you are discussing 3 vs 4.

  1. nothing, Void - the result of no code, e.g. nothing === (begin; end) (a part of the core language)
  2. null, isnull, Nullable - something to do with null pointers. Could alternatively use "undef" as the noun here as that's what users currently "see". (a part of the core language)
  3. Maybe{T} - a container containing either zero or one element of type T, useful for "software engineering". Also, Some{T} vs. exceptional result types (error code, etc) would be a very decent alternative here, or we currently have Nullable{T} for this and might want to avoid churn. In any of these cases, it requires manual unwrapping. (could be in Base, a stdlib, or a package)
  4. Unknown / Unknown{T} - semantically means that we expect in reality there exists some value of type T that should be here, but we don't currently know what it is. Intended for data scientists, meant to be used unwrapped like true | unknown == true. (could be in Base, a stdlib, or a package, with the caveat that perhaps we might decide we want this to interact with if in some way)

While I appreciate that other languages have lamented the sheer number of different types of "nullable" types, I see here four semantically distinct things - it would be hard to accidentally use one of the 4 in place of another and not get errors relatively quickly.
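As a rough illustration of how differently the four behave, here is a sketch that assumes a Julia 1.x session and uses the built-in nothing, undef array slots, Some and missing as stand-ins for cases 1 to 4. The mapping is my own approximation of the list above, and find_first_even is a hypothetical helper:

```julia
# 1. `nothing`: the value of code that produces no value.
no_value = (begin; end)

# 2. Undefined reference: an uninitialized slot in an array of references.
slots = Vector{String}(undef, 2)
slot_is_set = isassigned(slots, 1)

# 3. Zero-or-one container: the caller must unwrap by hand.
function find_first_even(xs)
    for x in xs
        iseven(x) && return Some(x)
    end
    return nothing
end

hit = find_first_even([1, 4, 5])
unwrapped = something(hit, 0)        # explicit unwrap, with a default for `nothing`
wrapped_add_fails = try
    hit + 1                          # no `+` method for `Some`, so nothing propagates
    false
catch err
    err isa MethodError
end

# 4. Unknown value: used unwrapped, propagates through ordinary operations.
propagated = missing + 1
decided = true | missing
```

Mixing them up fails fast in this sketch: adding to the wrapped result throws a MethodError, while the unknown value flows through arithmetic untouched.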

@richardreeve

@andyferris And to be honest, although I understand the importance of the others, I only care about 4. And though there may be technical reasons for putting each of them in Base or not, I think all of them should go in, partly for "marketing" reasons, to make a "strong statement" about support, and partly because, as we've just perfectly demonstrated, this is very confusing, so we have to have a completely clear core definition of each so there is no disagreement about which one is being talked about and used at any given time!

@andyferris
Member Author

I very much appreciate that sentiment, @richardreeve. If I might offer some interpretation here for you - I think to many (most) here, the v1.0 milestone marks the beginning of some stability in the language - that we won't be changing the syntax or making too many breaking changes to the types in Base and so-forth (though I'm sure change will remain much more rapid than many other languages). It definitely doesn't mean the language supports all the features we want it to include.

Thus, the prerequisite to putting something completely new like this in Base at this stage is that it has (nearly) universal support (which is a simple proxy/indication that no-one can see that we need to make breaking changes in the immediate future). I still feel we have a problem in reaching such universal consensus (and I dunno, the fact that brainstorming of radical changes is still occurring might indicate we haven't really reached a point of stability/maturity, either).

The nice thing is that we can still have definite plans to add case 4 to Base or a standard library in v1.1, and release v1.0 sooner, which might benefit a wider range of users than delaying v1.0 for a long time, or making mistakes that force us to release v2.0 soon with a lot of breaking changes. We can still make this "strong statement" at a future date. OTOH it's my opinion that we can (and should) tolerate a small delay to v1.0 to sort this stuff out, if that's at all feasible.

@mlhetland

@andyferris Which other uses than no. 3 do you see for null pointers (i.e., no. 2)? And how is this different from the use of nothing, which seems to be the current plan for canonical handling of this? If code either returns something or not, you get Union{T, Void}, no? Assigning that to a variable seems to be an example of a case where there might or might not be a value. As opposed to the unknown value case (NA). And these are the main two variants that have been discussed, no? Something vs nothing and known vs unknown?

@JeffBezanson

null || false == null

I do think there are cases in Base where it will be nice/ergonomic to handle null correctly out of the box, but I really, really don't want to do this. The problem is that since you don't know whether null is true or false, you don't know whether the branch should be taken. This would also require putting Null not just in Base, but deep inside the compiler.

@nalimilan
Member

Yes, null | false === null is probably enough. || and && are control flow operators, so I think they should behave like if and throw an error when they encounter null. For control flow you have to either execute a block or not execute it: null propagation is not really possible, you have to choose to treat null either as true or as false. Incidentally, this is what Erik Meijer and Eric Lippert advised based on the C# behavior and experience, while advocating at the same time three-valued logic for & and |.
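The split described here (propagate through | and &, throw in control flow) can be tried out in a Julia 1.x session, assuming the built-in missing as a stand-in for the null discussed in this thread:

```julia
# Assumption: `missing` (Julia 1.x) plays the role of the thread's `null`.
x = missing

# Non-short-circuit operators follow three-valued logic and propagate.
bitwise_or  = x | false    # unknown: the result depends on `x`
bitwise_and = x & false    # `false`, whatever `x` turns out to be

# `||` needs a Bool on its left-hand side, so it errors just like `if` does.
or_throws = try
    x || false
    false
catch err
    err isa TypeError
end

if_throws = try
    if x
        1
    end
    false
catch err
    err isa TypeError
end
```

Both control-flow forms raise a TypeError rather than silently picking a branch.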

@mlhetland

It may be a cheat, but … it may not be entirely unreasonable to just forbid basing control flow on it, as you don’t know what you’re supposed to do? I mean, if you have null || func(), and func has side effects, returning null won’t cut it. Seems to me this should just be an error, just like using null as the condition of an if statement?

@mlhetland

Eh. Right. Wrote my response in parallel with @nalimilan, there.

@StefanKarpinski
Member

StefanKarpinski commented Oct 3, 2017

The problem is that since you don't know whether null is true or false, you don't know whether the branch should be taken.

Both. For || you evaluate until you hit a true, for && you evaluate until you hit a false. I was not advocating for allowing null in other boolean contexts like if or ? : – only in short-circuit chains. This behavior is not magical – it falls straight out of the 3VL truth tables and stopping evaluation as soon as the result of a chain is determined, which is exactly the rule we already use. I don't care too much about this, but if you're going to have null in base and allow it to participate in & and | logic then making && and || work analogously seems somewhat sensible.
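To make the "evaluate until you hit a true" rule concrete, here is a sketch of a three-valued || chain written as an ordinary function. lazy_or is a hypothetical helper, operands are passed as zero-argument closures so they are only evaluated when reached, and missing (Julia 1.x) stands in for null. It illustrates the proposal; it is not how || behaves in Julia:

```julia
# Hypothetical helper: a three-valued `||` chain over zero-argument closures.
function lazy_or(thunks...)
    seen_unknown = false
    for thunk in thunks
        v = thunk()
        if ismissing(v)
            seen_unknown = true      # the chain can no longer be a definite `false`
        elseif v
            return true              # a definite `true` decides the chain: stop here
        end
    end
    return seen_unknown ? missing : false
end

calls = Symbol[]
r1 = lazy_or(() -> missing, () -> (push!(calls, :second); false))
r2 = lazy_or(() -> missing, () -> true)
r3 = lazy_or(() -> true, () -> error("never evaluated"))
```

Note that an unknown operand does not stop evaluation: in r1 the second closure still runs (calls records :second), which is the side-effect concern raised earlier for null || func().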

@mlhetland

Sure, not magical or crazy :-) But it does seem that the short-circuit operators are idiomatically used as conditionals in Julia, in which case the lines seem to blur, perhaps? Things like something || error("…") are everywhere. In practice, then, null is masquerading as false. Might be fine, but if it doesn't make sense for if or ? :, then maybe this doesn't make sense either?

More destructive in cases like x == 0 && println("It's zero!"), which would print if x is null, no? At the very least the documentation should warn against the use of such idioms whenever null might appear, I guess?

@nalimilan
Member

That makes sense, though that's low priority compared with other features we would want to work with null. If we add support for null in Base that question would certainly have to be considered.

@oxinabox

oxinabox commented Feb 6, 2019

close now?

@nalimilan
Member

At last!

@nalimilan nalimilan transferred this issue from another repository May 5, 2021