
Non uniform null behaviour #830

Closed
avianey opened this issue Apr 17, 2017 · 24 comments

@avianey commented Apr 17, 2017

Mathjs has some strange behaviour regarding null:

math.eval('null')

returns null

math.eval('null+null')

returns 0, and function calls behave similarly: cos(null), for instance, returns 1.
Shouldn't the parser return null as long as there's at least one undefined value in the tree?

@josdejong (Owner)

The reason is that math.js converts null into a number with value 0 when needed to do a calculation with it. Same with strings: math.add("2", 3) returns 5. I'm always open to reconsidering these decisions if there is a good reason.
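
For reference, the v3 behavior described above looks like this:

math.eval('null + null') // 0 (null is implicitly converted to the number 0)
math.eval('cos(null)')   // 1, i.e. cos(0)
math.add("2", 3)         // 5 (the string "2" is converted to the number 2)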

Related topic: #353.

@Nekomajin42 (Contributor)

I think there is a good reason why null exists. It means something does not have a value. It's empty. Zero means it has a numeric value, which is zero. In statistic functions it matters if you have an array containing empty elements or zeroes.

@avianey (Author) commented Apr 18, 2017

In fact I need to identify when an expression has an undefined result...
For instance, if x = null, then any expression using x should be null... meaning undefined.
What I get instead is cos(x) = 1.
(I'm using null because undefined breaks mathjs ;-))

I'll try to

  1. math.compile(expr)
  2. identify vars from the scope used in expr
  3. then see if they are undefined in the scope

I'll post here if this approach succeeds... (a rough sketch of the idea below)
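
A rough sketch of that approach, assuming mathjs's math.parse and Node.filter APIs (the hasUndefinedVars helper name is hypothetical):

function hasUndefinedVars(expr, scope) {
  // collect the names of all symbols that appear in the parse tree
  const names = math.parse(expr)
    .filter(node => node.isSymbolNode)
    .map(node => node.name)
  // skip built-in functions/constants like cos or pi, then check
  // whether any remaining variable is missing or null in the scope
  return names
    .filter(name => !(name in math))
    .some(name => scope[name] === undefined || scope[name] === null)
}

hasUndefinedVars('cos(x)', { x: null }) // true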

@balagge commented Apr 19, 2017

I also believe the general conversion logic that converts null to 0 is too universal, and it causes loss of information if null is intended to mean "value not available or not applicable".

I use null in that sense. The meaning of null is also different from undefined, because null means a deliberate omission of a value (i.e. the value is known to be missing or empty), whereas undefined may mean "not yet computed, but may be computed later".

The current conversion makes null a synonym of 0, so having null as a value makes no difference (and, after a computation/function call, this information is lost completely from the result). In any case, I would also support the idea of removing the default universal conversion of null to 0, thus creating the possibility to use null meaningfully (even if that use is quite limited). In my opinion, any computation performed on null should also return null (as a general rule; of course, exceptions may exist and should be studied individually).

One case where null could effectively mean 0 is sparse matrices, but I am not fully aware of the sparse matrix implementation, so I am not sure if that makes sense or is relevant at all.

josdejong mentioned this issue May 7, 2017
@josdejong (Owner)

Thanks guys, these are good arguments. OK, let's change the behavior of null in the next breaking release (v4), see #682.

Thinking about the changes that we need to do:

  • Drop implicit conversion from null to 0, like in math.add(null, 0)
  • Change the behavior of stats functions like min, max, mean to simply ignore null values, so mean([2, null, 4]) should return 3 instead of 2.

Are there any more behaviors that need to be changed regarding null?

@whindes commented Jun 1, 2017

We need an implementation for parse.set and parse.eval:

let expression = "(calcium_mgdL < 7.2) or (calcium_mmolL > 3.5)"

// We want this to evaluate to FALSE (or be ignored), but because null
// is evaluated as zero, the left comparison becomes 0 < 7.2, which is TRUE.
parse.set("calcium_mgdL", null);
// TRUE: 4.1 > 3.5
parse.set("calcium_mmolL", 4.1);

// Expected behavior:
// The left side of the expression (before `or`) should evaluate to FALSE.
// The right side of the expression (after `or`) evaluates to TRUE.
parse.eval(expression) // === true, even though the first part evaluates 0 < 7.2

The current behavior when using undefined instead of null is that the whole expression fails, which is something we do not want.

The workaround is to write the expression so it treats null/zero as boolean false:

let workaroundExpression = "(calcium_mgdL and (calcium_mgdL < 7.2)) or (calcium_mmolL and (calcium_mmolL > 3.5))"
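
For what it's worth, a sketch of the same idea using plain math.eval with a scope (parse.set/parse.eval above appear to be a custom wrapper, not a mathjs API; this assumes mathjs v3's coercion of null to 0/false):

const scope = { calcium_mgdL: null, calcium_mmolL: 4.1 };
math.eval(workaroundExpression, scope); // true: the `calcium_mgdL and ...` guard makes the left side false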

@josdejong (Owner)

@whindes I'm not sure whether I understand your point. Do you mean that changing the behavior for null would make it more cumbersome to work with variables that can be either null or a number? (i.e. would the proposed breaking change make your life harder?)

@firepick1 (Collaborator) commented Jun 14, 2017

Having recently been burned by null and the native Math.min, I have become convinced that null should not be coerced to anything but itself. The native Math.min and Math.max do the following:

var x = null; // variable to be clipped to the range [-1, 100]
Math.min(Math.max(x, -1), 100) // returns 0 ! yukky :(

This "feature" of native Math wasted hours of my time, since I was relying on the assumption that operations with a missing value propagate it (the way 1 + undefined returns NaN). My brain simply applied induction to arrive at the assumption that max and min would behave in a similar civilized manner. They do not.

The expectation I have is that any math package should be consistent in its handling of null. Since mathjs is built on Javascript, we do have the language itself creating the expectation that null is not zero.

I use min and max as mathematical operators (like +), not as statistical operators.

For the statistical use of min and max with collections of things, I would even prefer to write this out at length, since the intent reads clearly:

Math.min(...lotsOfNumbers.map(n => n == null ? 0 : n));

If we need a statistical min and max that assume 0, I suggest we introduce new functions to handle the expected statistical behavior. For example, we might have minStat and maxStat for null coercion to 0.
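
A sketch of what those variants could look like (minStat/maxStat are only the names suggested above, nothing exists yet):

// coerce nulls to 0, then take the minimum of the collection
function minStat(values) {
  return Math.min(...values.map(n => n == null ? 0 : n));
}

// same idea for the maximum
function maxStat(values) {
  return Math.max(...values.map(n => n == null ? 0 : n));
}

minStat([2, null, 4]); // 0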

P.S. Amusingly, Wolfram Alpha parses null as 0, so 1 + null is 1 for them. But we are Javascript.

@Nekomajin42 (Contributor)

@firepick1
The problem with null is that it means a missing value. We don't know if it would be zero or any other number. It's like calling every stranger Joe. Sure, we can do that, but because we don't know their names, we should say "person unknown" or some such.

In this case, I think the good solution is what MS Excel does: it just skips the null values in an array and works only with the rest of the array.

If you have a rare use case where you have to treat null as zero, you can always use the built-in map function to pre-process your data and have actual zeros in the array.

@firepick1 (Collaborator)

For mathematics, I favor consistency over "good", since "good" is subjective and endlessly debatable. I will admit that with Javascript we may be doomed in any effort to achieve consistency. For example, 1 + undefined is NaN, but -1 < null is true (YUK!!!). Sadly, math.min is actually consistent with Javascript's own < operator, so I can't really complain. Objection retracted. I'm already Javascript abused, so I'll keep my peace.

@josdejong (Owner)

I think we all agree that the behavior of null should be changed from implicitly converting it to 0 to keeping it explicitly null (unlike the behavior of JavaScript).

See also the arguments of @honestserpent in #682 (comment), though let's continue the discussion here in #830.

To summarize the thoughts so far:

  • mean([2, null, 4]) should return null. Same with max, min, etc.
  • We add a separate function or mode that deals with null by ignoring it. This could be nanmean, or an option like mean([...], {'null': 'ignore'}) (see the sketch below).
  • In case you want to treat null as 0, you can always convert nulls explicitly via number(matrix).
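
To make the bullets above concrete (nanmean here is only a placeholder name, nothing is decided yet):

math.mean([2, null, 4])              // null: propagate by default
math.nanmean([2, null, 4])           // 3: ignore nulls (placeholder name)
math.mean(math.number([2, null, 4])) // 2: treat null as 0 explicitly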

Any help with implementing the new behavior will be welcome. Anyone interested in picking it up?

@Nekomajin42 (Contributor)

@josdejong
I don't think that every function should automatically return null if some of the inputs are null. I think they should filter out the null values and not take them into account. A few examples:

  • mean([2, null, 4]) should return 3
  • min([2, null, 4]) should return 2
  • 2 + null should return 2 (because there is nothing to add)
  • 2 * null should also return 2 (because there is nothing to multiply by)

On the other hand, sqrt(null), pow(null, 2) or pow(2, null) and other functions like these should of course return null. (While sqrt([4, null, 9]) should return [2, null, 3].)

Can the behaviour be tied to the predictable config option? If it's true, all the functions should return null. If it's false, functions should filter the null values and do their jobs with the remaining inputs.
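
To illustrate (a sketch only: the predictable config option exists in mathjs, but tying null handling to it is just the proposal above):

math.config({ predictable: true })
math.mean([2, null, 4]) // null under this proposal

math.config({ predictable: false })
math.mean([2, null, 4]) // 3 under this proposal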

@josdejong (Owner)

hm, I'm not sure about the behavior of operators + and * that you propose. I can follow your reasoning but I think the behavior will become quite unpredictable if some functions ignore null whilst others process it and return null. Ideally, the behavior should be easy to understand.

I find it a strong argument that when some value is null, there is something special going on, and the default behavior should be to return null in that case and force the user to take action.

As for the behavior of statistics functions like mean([2, null, 4]), we need two different implementations, and we need to decide upon the default behavior:

  • one implementation that returns null when any of the values is null.
  • one implementation that ignores null values.

I think that the most common use case for a function like mean is wanting to ignore null values, and it may be inconvenient to have to deal with nulls every single time. On the other hand, it shouldn't be that hard to use an alternative like nanmean or nullmean.

@Nekomajin42 (Contributor)

The way I see it, each function treats null the same way: they all ignore it.

  • If null is the only input, they return null, because there is nothing to work on.
  • If there are multiple inputs, there are two cases:
    • If a function works on an array element-wise, then null elements return null, and other elements work normally.
    • If a function works with the whole array, then null values are filtered.

About addition and multiplication: prod([2, null, 3]) should return 6. It is the equivalent of 2 * null * 3.

This is how MS Excel and similar spreadsheet programs work. I don't know about MATLAB and other software, but it seems quite logical to me.

About the implementations, I think it would be best to handle both cases with one function, and a global config option or an optional parameter. Multiple implementations with different names would be messy.

Either way, in my opinion, the default behaviour should be the one that ignores null, because it is the simple case: don't mind null, carry on. In the other case, if we do mind null, there can be different solutions to handle it: keep it as is and get null in return, or treat it as 0 (or any other default value) and do the given calculation with that.

@josdejong (Owner)

That makes sense indeed, it's a consistent behavior. Let me give it some more thought.

@officer-rosmarino commented Jun 19, 2017

@Nekomajin42 @josdejong I think these are just different ways of approaching the problem. Of course both of them are OK and will handle null values properly.

But here are my 2 cents (I am going to use the mean function as an example, but this applies to every other function: median, sum, etc.).

I think that making the default mean function behavior ignore and remove null may leave developers unable to spot potential bugs in their code. For example: in my case, I have to calculate the mean of an array of non-null values. I need to know if some value was erroneously set to null.

If the default behaviour of mean is to just ignore null values, I may be led to think everything is fine, while instead there is a bug that has to be found and fixed, and it may be found only after a lot of time, because it would not result in an error.

Of course I can always check before running mean, but that means a lot more code every time I use the mean function.

Instead, if the mean function does not ignore null values by default, a potential bug would be instantly found.

I think that if the developer wants to ignore null values, they should explicitly opt in, either by using a separate function (e.g. nanmean) or by providing an option: mean([...], {'null': 'ignore'}).

I think this behavior should apply to every function, sum and prod included.

Summarizing, here is what I would expect to see (substitute the nan-versions of the functions with options if you want):

  1. mean([1,4, null]) should return null
  2. nanmean([1,4, null]) should return 2.5
  3. sum([1,4, null]) should return null
  4. nansum([1,4, null]) should return 5
  5. other functions

I also would like to add that MATLAB, Python (NumPy) and R, three of the most used programming languages for data science, all deal with NaN values by treating them as NaN, not by ignoring them by default.

Some example links:
Matlab's sum
Matlab's nansum
Matlab's mean
Matlab's nanmean
numpy's mean
numpy's nanmean

@josdejong (Owner)

Thanks for sharing your ideas @honestserpent. This echoes the opinion of others: null is a special value, and when encountering null, something special is going on; the user should handle it explicitly. Indeed similar to NaN.

@sfescape commented Nov 9, 2017

I would also suggest allowing a default value for null (or undefined) to be explicitly set, perhaps in the scope variable.
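
Something along these lines, perhaps (purely illustrative; evalWithDefault is a hypothetical helper, not an existing mathjs API):

// replace null/undefined entries in the scope with an explicit default
// value before evaluating the expression
function evalWithDefault(expr, scope, defaultValue) {
  const cleaned = {};
  for (const key in scope) {
    cleaned[key] = scope[key] == null ? defaultValue : scope[key];
  }
  return math.eval(expr, cleaned);
}

evalWithDefault('cos(x)', { x: null }, 0); // 1, with x defaulted to 0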

@josdejong (Owner)

Thanks for your input @sfescape. Some functions allow passing your own default value, like subset; is that what you mean?

@sfescape

Yes, exactly.

@officer-rosmarino

Do you guys think this may be a good first contribution, or would it be too hard? I would like to take it and try to develop it as my first GitHub contribution ever.

@josdejong (Owner)

@sfescape OK, that's indeed also my preferred solution: passing it along with the function itself (not something globally configurable).

@honestserpent great that you want to contribute! I think this topic (#830) is not the best one to start with, because it requires changing code in quite a few different places and levels of the project. An easy one to start with is for example #964. Or you could go deep into maths and try to improve the performance of determinant, see #908. There are plenty of issues that you can pick up; are any of particular interest to you?

@josdejong (Owner)

In the v4 branch I've removed the implicit conversion of null to 0, and I've improved the error messages of the statistics functions a bit. This means that the statistics functions will simply throw an error when you pass a null, forcing you to solve the issue explicitly yourself, either by converting null to 0 or by filtering out the nulls. For example:

math.mean([1, 2, null, 3])
// TypeError: Cannot calculate mean, unexpected type of argument (type: null, value: null)

// Strategy 1: convert null to 0
math.mean(math.number([1, 2, null, 3])) // 1.5

// Strategy 2: filter null values
math.mean(math.filter([1, 2, null, 3], x => x !== null)) // 2

@josdejong (Owner)

mathjs v4 was released today, containing the improved handling of null.
