add `Base.isunique` #15803

tpapp · 2016-04-08T13:27:24Z

A common idiom for testing if values in a collection C are unique is

length(unique(C))==length(C)

or variations, eg when the length n is known,

length(unique(C))==n

These show up in various tests, assertions, inner constructors. However, if the goal is only to test for uniqueness, not to obtain a list of unique elements, then something like

function isunique(C)
    seen = Set{eltype(C)}()
    for x in C
        if in(x, seen)
            return false
        else
            push!(seen, x)
        end
    end
    true
end

is much more efficient: it can terminate early and avoids the construction of the intermediate array. Also, programmer's the intent is more clear.

I am of course reluctant to suggest adding functions to Base, but it would good to have unique and isunique in the same module.

The text was updated successfully, but these errors were encountered:

toivoh · 2016-04-08T13:34:13Z

I agree that this is a nice thing to have, but I'm maybe we can find a better name? It's not really about whether the argument is unique, but whether it has unique elements. I feel that this could lead to some confusion. I guess that you would also want to add a method that always returns true for sets, and maybe for key iterators for dicts.

tpapp · 2016-04-08T13:48:39Z

Another name I was thinking of is allunique, but maybe someone has a better suggestion.

And of course there should be specialized methods when they make sense (eg also for subtypes of Range).

If you agree on a name, I would be happy to submit a PR.

StefanKarpinski · 2016-04-08T14:02:59Z

alldistinct?

stevengj · 2016-04-08T20:33:32Z

+1 for alldistinct or isdistinct.

To bolster @tpapp's point, I did a quick search of registered Julia packages, and found at least 10 packages which use this idiom.

A couple of additional suggestions:

Might be nice to have an isdistinct(collections...) which loops over multiple collections, so that you don't need to concatenate them into a single collection first just to check uniqueness. Though technically this is redundant with isdistinct(flatten([collections....])), given flatten from add Flatten iterator #14805.
I found a number of packages with checks like length(unique(C)) >= M or length(unique(C)) == N for some small constant N, e.g. length(unique(C)) == 1 checks. Obviously, this can be done much more efficiently just by adding length(seen) >= M && return true to something like your implementation above. Maybe that could be combined into the same function via a keyword argument isdistinct(collection, numdistinct=N, mindistinct=M)

JeffreySarnoff · 2016-04-09T02:39:47Z

+1 This introduces simplification wlog and improves 'at a glance' understanding.
alldifferent

JaredCrean2 · 2016-04-09T03:05:51Z

Is this going to cause confusion with functions (that don't currently exist, but might one day) to check if elements of an array alias each other? unique strikes me as referring to values but distinct sounds more like identity to me.

tpapp · 2016-04-09T06:01:47Z

@stevengj: For the second use case (which actually uses the value of length(unique(C))) I find the keyword argument solution confusing. One could introduce countdistinct, or even better, a function distinctset, which is like unique but returns a set instead of a vector. Indeed, if unique did not need to maintain order and could return a set, the two functions would be the same.

However, in the PR for this issue I don't want to complicate things, so I will just do a PR for alldistinct. The other function can be added later on if there is a demonstrated need.

Regarding the first use case: I really like the idiom with flatten in #14805. Given that it will be available, I don't think it is necessary to provide alldistinct(collections...), even though alldistinct(C, s::Set=Set()) is tempting because I could just chain them together.

toivoh · 2016-04-09T06:46:39Z

+1 for alldistinct, and for keeping it simple.

stevengj · 2016-04-09T12:36:29Z

@tpapp, for the second use case, the point is you don't need to know the value of length(unique(C)). You only need to know a lower bound on the value. In most cases this will allow you to exit the loop early, before examining all the elements.

nalimilan · 2016-04-23T13:27:05Z

Sorry for restarting the bikeshedding so late, but what's the reason to prefer alldistinct to allunique? Saying "distinct" instead of "unique" sounds like it relies on a different concept, which isn't the case.

nalimilan · 2016-04-24T13:59:07Z

Scott proposed another possible name: hasduplicates.

tpapp · 2016-04-24T14:23:08Z

@nalimilan: I would be fine with either as I see no a priori reason for picking one over the other (I guess this happens with names frequently); however, alldistinct is supported by multiple people above so I did the PR that way. If there is a compelling reason I can redo it, but if not, I would prefer to have #15914 merged so that the issue can be closed.

nalimilan · 2016-04-24T14:29:51Z

Choice of terms is indeed in large part arbitrary. But once a term is chosen, it's important IMHO to be consistent across the API. In the present case, I thought allunique would be better for consistency with unique. For example, ?unique would list both functions. (Again, sorry for the annoyance.)

@StefanKarpinski @stevengj Any reason why you preferred alldistinct rather than allunique?

kmsquire · 2016-04-25T00:22:37Z

FWIW, I like this argument for consistency.

On Sunday, April 24, 2016, Milan Bouchet-Valat [email protected]
wrote:

Choice of terms is indeed in large part arbitrary. But once a term is
chosen, it's important IMHO to be consistent across the API. In the present
case, I thought allunique would be better for consistency with unique.
For example, ?unique would list both functions. (Again, sorry for the
annoyance.)

@StefanKarpinski https://github.com/StefanKarpinski @stevengj
https://github.com/stevengj Any reason why you preferred alldistinct
rather than allunique?

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
#15803 (comment)

StefanKarpinski · 2016-04-25T16:01:43Z

I don't think that saying the elements are "all unique" is correct. Each element is in itself unique, but the fact that they are all individually unique doesn't mean anything – that's true of any collection. When you say unique(itr) you're asking for its unique values. It is meaningful to say that all of the values are distinct, on the other hand.

StefanKarpinski · 2016-04-25T16:04:13Z

To argue against myself, the term "unique" means "unlike any other" which in this case could implicitly be taken to mean "within this collection" – and if each value is unlike any other in the collection then that is what we want. I guess I just had a preference for "distinct" since it focuses on the relationship between values and not a property of each value. Maybe either would be ok.

nalimilan · 2016-04-25T16:13:04Z

Indeed no element is "unique" individually, but that also applies to "distinct". I really don't see the difference. :-)

That's why I think we should keep the "unique" terminology.

StefanKarpinski · 2016-04-25T16:16:21Z

Yeah, I'm ok with allunique then.

jeffreysarnoff-dev · 2016-04-25T16:54:31Z

I see the difference. A is distinct from B iff there is a distinction that distinguishes A from B (and this is the sense that alldistinct carries). A is unique iff there is exactly one occurrence of A (singleton types are allunique).

tpapp · 2016-04-27T17:43:43Z

It seems that the only thing that's preventing a merge is a decision on naming. I have a weak preference for alldistinct for reasons mentioned by @StefanKarpinski and @jeffreysarnoff-dev, but given a signal from a Julia language member that this is all that is preventing the merge and a decision for an alternative name, I would be happy to update the PR to any other name.

While I understand that various arguments can be made for various names, IMO the benefit of having a function for this common idiom is the most important thing.

nalimilan · 2016-04-27T18:14:04Z

I was under the impression that I won the fight for allunique since nobody appears to be in opposition to it. :-) So why not update the PR to use that name? Then the merge won't need to wait.

StefanKarpinski · 2016-04-27T18:14:38Z

+1 on allunique

Tetralux · 2016-04-27T18:15:50Z

~~I also +1 on allunique~~ (See my comment, below.)

tpapp · 2016-04-27T18:41:17Z

Thanks for the quick replies, made the change in the PR.

Allows testing if elements in a collection are distinct (when compared with isequal). Terminates early in case of repeated elements, has methods for collections which are by construction distinct. Also added tests. See discussion at issue #15803.

StefanKarpinski · 2016-04-27T21:13:31Z

Closed by 2cc803b

tkelman · 2016-04-28T03:05:45Z

I missed the change in opinion here, but allunique to me has the connotation of all(unique(foo)) for collections of collections which this is not doing

StefanKarpinski · 2016-04-28T03:12:30Z

Given that all(unique(x)) and all(x) do the same thing this confusion seems unlikely.

tkelman · 2016-04-28T03:15:47Z

Not if x is a collection of collections. Elsewhere when we concatenate names like maxabs2 it's because combining the composition into a single pass is more efficient. This isn't consistent with that precedent.

tkelman · 2016-04-28T04:45:45Z

I actually liked the hasduplicates suggestion, and that has no chance for confusion since there will likely never be a one argument has

Tetralux · 2016-04-28T05:37:47Z

On reflection, I think I may have been a little hasty with my +1...

I think that @StefanKarpinski's concession...

I guess I just had a preference for "distinct" since it focuses on the relationship between values and not a property of each value. Maybe either would be ok.

...was a mistake.

Distinct-ness has a clear linguistic meaning, whereas unique is a little ambiguous in this case; "[They are] readily distinguishable" vs "the only one of its kind".

I do see the merit in @tkelman's thought, though. I doubt there would be any confusion with that one.

Based on which, alldistinct or hasduplicates would both be better suited to this.

It is worth noting however, that if we were to call this function alldistinct, it might not make a lot of sense to have unique; it might want a rename to distinct. Something to think about.

tkelman · 2016-04-28T05:43:37Z

If unique were renamed to distinct then alldistinct would have the same might-imply-composition downside as allunique has now. We might want to use the name distinct for something later, and you could argue that our use of unique is a little different than Matlab's and coreutils uniq since we don't output sorted results (or require sorted inputs).

tkelman · 2016-04-30T09:36:21Z

Would anyone object to renaming to hasduplicates? If I don't hear anything I'll open a PR, though we should leave it open a little longer this time to get more opinions.

KristofferC · 2016-04-30T09:46:08Z

Late to the party here but hasduplicates is in my view a better name that is easier to mentally parse.

stevengj · 2016-04-30T12:34:41Z

I don't see the problem with allunique.

tkelman · 2016-04-30T12:50:47Z

It's unrelated to all, that name sounds like a composition to me but it's really a different collection property concept. Trying to see what's overall least objectionable.

nalimilan · 2016-04-30T14:01:43Z

The term duplicates is used e.g. in DataFrames.jl, but nowhere in Base, so I still prefer allunique. I don't think there's any possible confusion with all(unique, x). We don't ship any function starting with all which would be a shortcut for all(f, x).

tkelman · 2016-04-30T14:19:16Z

I was actually quite surprised that the convention was so uniform, we have nothing else with an all prefix and only haskey with a has prefix, every other exported property checking function starts with is.

What is duplicates used for in DataFrames? It seems more self explanatory than unique which you'd have to be familiar with from another language to recognize what that concept is referring to.

johnmyleswhite · 2016-04-30T15:27:24Z

duplicates finds rows that would violate a unique key constraint if all of the columns of the DataFrame were a unique key.

Perhaps the names at https://stat.ethz.ch/R-manual/R-devel/library/base/html/duplicated.html are better: duplicated and anyduplicated.

tkelman · 2016-04-30T15:38:40Z

Judging by

anyDuplicated(.) is a “generalized” more efficient shortcut for any(duplicated(.)).

I'm guessing you mean "better" for DataFrames' usage rather than the function being discussed in this issue?

johnmyleswhite · 2016-04-30T15:41:26Z

I was actually suggesting anyduplicated as the name being bikeshedded here.

nalimilan · 2016-04-30T16:35:02Z

That would indeed make sense if Base had a duplicated function (which would return true for elements which are equal to at least one element preceding them in the collection). Doesn't sound like an absurd addition to Base.

StefanKarpinski · 2016-04-30T17:58:10Z

Wouldn't the more general version of this be a function which given a vector returns an Int vector of the same length where each value is the number of times that element has been seen at that point? I'm not entirely convinced that such a specific thing belongs in the standard library, however – it's pretty easy to write this function yourself, after all.

tkelman · 2016-05-03T02:30:56Z

Ah. I didn't really follow what "violate a unique key constraint ..." meant.

The R precedent is good enough for me, final up/down on anyduplicated? #16152

StefanKarpinski · 2016-05-03T14:50:54Z

Honestly, I think I'm going to think allunique and then remember, "oh wait, it's not called that, it's called something weird... oh yeah, anyduplicated... and then I need to negate it too."

StefanKarpinski · 2016-05-03T14:52:07Z

I'm also just not concerned about ambiguity between allunique(x) and all(unique(x)) because the latter is a silly thing to do.

tkelman · 2016-05-03T14:57:59Z

What does the "all" mean in the name then?

stevengj · 2016-05-03T15:15:19Z

Given that we have unique(a), I think that allunique(a) is easier to remember and is reasonably unambiguous. English is not a very precise language, but I think "allunique = "all elements are unique" = "all elements are distinct" = "no elements are duplicated"` should be clear to most people.

stevengj · 2016-05-03T15:18:26Z

(If you google "all unique" elements you'll find that this usage is pretty commonplace.)

ymer · 2017-01-04T23:51:58Z

How about allsame(). Doesn't it have the same argument for inclusion?

nalimilan · 2017-01-05T13:44:32Z

@ymer That's easy to write efficiently as all(x->x == X[1], X).

stevengj · 2017-01-05T14:20:28Z

function allsame(X)
    isempty(X) && return true
    X1 = first(X)
    return all(x -> isequal(x, X1), X)
end

would be more correct (doesn't assume X is indexable), and uses isequal like allunique. The fact that it is easy to get this wrong is an argument for inclusion.

On the other hand Python doesn't seem to have an all_same function (though of the Python posted solutions on stackoverflow are buggy for the same reasons as above), nor does Ruby, nor does Haskell. If other languages don't include an allsame function in their standard libraries, that's an argument against including it in Julia.

ymer · 2017-01-05T14:54:29Z

But those languages don't have an allunique function either. So it seems to me that the argument for inclusion of allsame (or allidentical maybe) still stands, given that allunique is included. It's also assymmetrical to have one function and not the other, even if the implementation is easier for one of them.

ymer · 2017-01-05T15:19:08Z

I also prefer alldistinct to allunique semantically. (And prefer both to anyduplicated.) That's what I would say in normal speech "they are all distinct (from each other)". But I see the point that unique is already a function. Interestingly the Iterators package has a distinct function.

tpapp · 2017-01-05T15:21:54Z

Maybe there is a common pattern here:

function allsatisfy(pred, itr)    # is there a nicer way to code this?
    state = start(itr)
    done(itr, state) && return true
    (prev_elt, state) = next(itr, state)
    while !done(itr, state)
        (elt, state) = next(itr, state)
        if pred(prev_elt, elt)
            prev_elt = elt
        else
            return false
        end
    end
    true
end

allsatisfy(<, 1:10)             # true
allsatisfy(>, 1:10)             # false
allsatisfy(==, 1:10)            # false
allsatisfy(<, [])               # true
allsatisfy(<, [1])              # true
allsatisfy(==, fill(1,10))      # true

Caveats:

not tested extensively,
I am terrible at naming things,
perhaps this should go in a separate issue? it has little to do with allunique IMO.

jeffreysarnoff-dev · 2017-02-22T08:04:23Z

I am not attending these:

same	duplicate
more often we apply our computational	too two-y
skill to ascertain and utilize nonsameness	less flexible

📆

candidates	the contraquality	usefulness with its dual
distinct	indistict, absent discriminative factor	2
unique	alike, with discernable commonality	3

Conceptually, anyunique is more crisp andclearly stated than anydistinct.
That is how I arrive here to suggest we use allunique.

tpapp mentioned this issue Apr 17, 2016

Add Base.allunique #15914

Merged

StefanKarpinski closed this as completed Apr 27, 2016

tkelman mentioned this issue May 1, 2016

Rename allunique to anyduplicated #16152

Closed

add Base.isunique #15803

add Base.isunique #15803

Comments

tpapp commented Apr 8, 2016

toivoh commented Apr 8, 2016 via email

tpapp commented Apr 8, 2016

StefanKarpinski commented Apr 8, 2016

stevengj commented Apr 8, 2016

JeffreySarnoff commented Apr 9, 2016

JaredCrean2 commented Apr 9, 2016

tpapp commented Apr 9, 2016

toivoh commented Apr 9, 2016

stevengj commented Apr 9, 2016

nalimilan commented Apr 23, 2016

nalimilan commented Apr 24, 2016

tpapp commented Apr 24, 2016

nalimilan commented Apr 24, 2016

kmsquire commented Apr 25, 2016

StefanKarpinski commented Apr 25, 2016

StefanKarpinski commented Apr 25, 2016

nalimilan commented Apr 25, 2016

StefanKarpinski commented Apr 25, 2016

jeffreysarnoff-dev commented Apr 25, 2016

tpapp commented Apr 27, 2016

nalimilan commented Apr 27, 2016

StefanKarpinski commented Apr 27, 2016

Tetralux commented Apr 27, 2016 • edited Loading

tpapp commented Apr 27, 2016

StefanKarpinski commented Apr 27, 2016

tkelman commented Apr 28, 2016

StefanKarpinski commented Apr 28, 2016

tkelman commented Apr 28, 2016

tkelman commented Apr 28, 2016

Tetralux commented Apr 28, 2016

tkelman commented Apr 28, 2016

tkelman commented Apr 30, 2016

KristofferC commented Apr 30, 2016

stevengj commented Apr 30, 2016

tkelman commented Apr 30, 2016

nalimilan commented Apr 30, 2016

tkelman commented Apr 30, 2016

johnmyleswhite commented Apr 30, 2016

tkelman commented Apr 30, 2016

johnmyleswhite commented Apr 30, 2016

nalimilan commented Apr 30, 2016

StefanKarpinski commented Apr 30, 2016

tkelman commented May 3, 2016

StefanKarpinski commented May 3, 2016

StefanKarpinski commented May 3, 2016

tkelman commented May 3, 2016

stevengj commented May 3, 2016

stevengj commented May 3, 2016

ymer commented Jan 4, 2017

nalimilan commented Jan 5, 2017

stevengj commented Jan 5, 2017 • edited Loading

ymer commented Jan 5, 2017

ymer commented Jan 5, 2017

tpapp commented Jan 5, 2017

jeffreysarnoff-dev commented Feb 22, 2017

add `Base.isunique` #15803

add `Base.isunique` #15803

Tetralux commented Apr 27, 2016 •

edited

Loading

stevengj commented Jan 5, 2017 •

edited

Loading