Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add Base.isunique #15803

Closed
tpapp opened this issue Apr 8, 2016 · 55 comments
Closed

add Base.isunique #15803

tpapp opened this issue Apr 8, 2016 · 55 comments

Comments

@tpapp
Copy link
Contributor

tpapp commented Apr 8, 2016

A common idiom for testing if values in a collection C are unique is

length(unique(C))==length(C)

or variations, eg when the length n is known,

length(unique(C))==n

These show up in various tests, assertions, inner constructors. However, if the goal is only to test for uniqueness, not to obtain a list of unique elements, then something like

function isunique(C)
    seen = Set{eltype(C)}()
    for x in C
        if in(x, seen)
            return false
        else
            push!(seen, x)
        end
    end
    true
end

is much more efficient: it can terminate early and avoids the construction of the intermediate array. Also, programmer's the intent is more clear.

I am of course reluctant to suggest adding functions to Base, but it would good to have unique and isunique in the same module.

@toivoh
Copy link
Contributor

toivoh commented Apr 8, 2016 via email

@tpapp
Copy link
Contributor Author

tpapp commented Apr 8, 2016

Another name I was thinking of is allunique, but maybe someone has a better suggestion.

And of course there should be specialized methods when they make sense (eg also for subtypes of Range).

If you agree on a name, I would be happy to submit a PR.

@StefanKarpinski
Copy link
Member

alldistinct?

@stevengj
Copy link
Member

stevengj commented Apr 8, 2016

+1 for alldistinct or isdistinct.

To bolster @tpapp's point, I did a quick search of registered Julia packages, and found at least 10 packages which use this idiom.

A couple of additional suggestions:

  • Might be nice to have an isdistinct(collections...) which loops over multiple collections, so that you don't need to concatenate them into a single collection first just to check uniqueness. Though technically this is redundant with isdistinct(flatten([collections....])), given flatten from add Flatten iterator #14805.
  • I found a number of packages with checks like length(unique(C)) >= M or length(unique(C)) == N for some small constant N, e.g. length(unique(C)) == 1 checks. Obviously, this can be done much more efficiently just by adding length(seen) >= M && return true to something like your implementation above. Maybe that could be combined into the same function via a keyword argument isdistinct(collection, numdistinct=N, mindistinct=M)

@JeffreySarnoff
Copy link
Contributor

+1 This introduces simplification wlog and improves 'at a glance' understanding.
alldifferent

@JaredCrean2
Copy link
Contributor

Is this going to cause confusion with functions (that don't currently exist, but might one day) to check if elements of an array alias each other? unique strikes me as referring to values but distinct sounds more like identity to me.

@tpapp
Copy link
Contributor Author

tpapp commented Apr 9, 2016

@stevengj: For the second use case (which actually uses the value of length(unique(C))) I find the keyword argument solution confusing. One could introduce countdistinct, or even better, a function distinctset, which is like unique but returns a set instead of a vector. Indeed, if unique did not need to maintain order and could return a set, the two functions would be the same.

However, in the PR for this issue I don't want to complicate things, so I will just do a PR for alldistinct. The other function can be added later on if there is a demonstrated need.

Regarding the first use case: I really like the idiom with flatten in #14805. Given that it will be available, I don't think it is necessary to provide alldistinct(collections...), even though alldistinct(C, s::Set=Set()) is tempting because I could just chain them together.

@toivoh
Copy link
Contributor

toivoh commented Apr 9, 2016

+1 for alldistinct, and for keeping it simple.

@stevengj
Copy link
Member

stevengj commented Apr 9, 2016

@tpapp, for the second use case, the point is you don't need to know the value of length(unique(C)). You only need to know a lower bound on the value. In most cases this will allow you to exit the loop early, before examining all the elements.

@nalimilan
Copy link
Member

Sorry for restarting the bikeshedding so late, but what's the reason to prefer alldistinct to allunique? Saying "distinct" instead of "unique" sounds like it relies on a different concept, which isn't the case.

@nalimilan
Copy link
Member

Scott proposed another possible name: hasduplicates.

@tpapp
Copy link
Contributor Author

tpapp commented Apr 24, 2016

@nalimilan: I would be fine with either as I see no a priori reason for picking one over the other (I guess this happens with names frequently); however, alldistinct is supported by multiple people above so I did the PR that way. If there is a compelling reason I can redo it, but if not, I would prefer to have #15914 merged so that the issue can be closed.

@nalimilan
Copy link
Member

Choice of terms is indeed in large part arbitrary. But once a term is chosen, it's important IMHO to be consistent across the API. In the present case, I thought allunique would be better for consistency with unique. For example, ?unique would list both functions. (Again, sorry for the annoyance.)

@StefanKarpinski @stevengj Any reason why you preferred alldistinct rather than allunique?

@kmsquire
Copy link
Member

FWIW, I like this argument for consistency.

On Sunday, April 24, 2016, Milan Bouchet-Valat [email protected]
wrote:

Choice of terms is indeed in large part arbitrary. But once a term is
chosen, it's important IMHO to be consistent across the API. In the present
case, I thought allunique would be better for consistency with unique.
For example, ?unique would list both functions. (Again, sorry for the
annoyance.)

@StefanKarpinski https://github.com/StefanKarpinski @stevengj
https://github.com/stevengj Any reason why you preferred alldistinct
rather than allunique?


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
#15803 (comment)

@StefanKarpinski
Copy link
Member

I don't think that saying the elements are "all unique" is correct. Each element is in itself unique, but the fact that they are all individually unique doesn't mean anything – that's true of any collection. When you say unique(itr) you're asking for its unique values. It is meaningful to say that all of the values are distinct, on the other hand.

@StefanKarpinski
Copy link
Member

To argue against myself, the term "unique" means "unlike any other" which in this case could implicitly be taken to mean "within this collection" – and if each value is unlike any other in the collection then that is what we want. I guess I just had a preference for "distinct" since it focuses on the relationship between values and not a property of each value. Maybe either would be ok.

@nalimilan
Copy link
Member

Indeed no element is "unique" individually, but that also applies to "distinct". I really don't see the difference. :-)

That's why I think we should keep the "unique" terminology.

@StefanKarpinski
Copy link
Member

Yeah, I'm ok with allunique then.

@jeffreysarnoff-dev
Copy link

I see the difference. A is distinct from B iff there is a distinction that distinguishes A from B (and this is the sense that alldistinct carries). A is unique iff there is exactly one occurrence of A (singleton types are allunique).

@tpapp
Copy link
Contributor Author

tpapp commented Apr 27, 2016

It seems that the only thing that's preventing a merge is a decision on naming. I have a weak preference for alldistinct for reasons mentioned by @StefanKarpinski and @jeffreysarnoff-dev, but given a signal from a Julia language member that this is all that is preventing the merge and a decision for an alternative name, I would be happy to update the PR to any other name.

While I understand that various arguments can be made for various names, IMO the benefit of having a function for this common idiom is the most important thing.

@nalimilan
Copy link
Member

I was under the impression that I won the fight for allunique since nobody appears to be in opposition to it. :-) So why not update the PR to use that name? Then the merge won't need to wait.

@StefanKarpinski
Copy link
Member

+1 on allunique

@Tetralux
Copy link
Contributor

Tetralux commented Apr 27, 2016

I also +1 on allunique (See my comment, below.)

@tpapp
Copy link
Contributor Author

tpapp commented Apr 27, 2016

Thanks for the quick replies, made the change in the PR.

StefanKarpinski pushed a commit that referenced this issue Apr 27, 2016
Allows testing if elements in a collection are distinct (when compared
with isequal). Terminates early in case of repeated elements, has
methods for collections which are by construction distinct. Also added
tests. See discussion at issue #15803.
@StefanKarpinski
Copy link
Member

Closed by 2cc803b

@tkelman
Copy link
Contributor

tkelman commented Apr 28, 2016

I missed the change in opinion here, but allunique to me has the connotation of all(unique(foo)) for collections of collections which this is not doing

@StefanKarpinski
Copy link
Member

Given that all(unique(x)) and all(x) do the same thing this confusion seems unlikely.

@tkelman
Copy link
Contributor

tkelman commented Apr 28, 2016

Not if x is a collection of collections. Elsewhere when we concatenate names like maxabs2 it's because combining the composition into a single pass is more efficient. This isn't consistent with that precedent.

@tkelman
Copy link
Contributor

tkelman commented Apr 28, 2016

I actually liked the hasduplicates suggestion, and that has no chance for confusion since there will likely never be a one argument has

@Tetralux
Copy link
Contributor

On reflection, I think I may have been a little hasty with my +1...

I think that @StefanKarpinski's concession...

I guess I just had a preference for "distinct" since it focuses on the relationship between values and not a property of each value. Maybe either would be ok.

...was a mistake.

Distinct-ness has a clear linguistic meaning, whereas unique is a little ambiguous in this case; "[They are] readily distinguishable" vs "the only one of its kind".

I do see the merit in @tkelman's thought, though. I doubt there would be any confusion with that one.

Based on which, alldistinct or hasduplicates would both be better suited to this.

It is worth noting however, that if we were to call this function alldistinct, it might not make a lot of sense to have unique; it might want a rename to distinct. Something to think about.

@tkelman
Copy link
Contributor

tkelman commented Apr 28, 2016

If unique were renamed to distinct then alldistinct would have the same might-imply-composition downside as allunique has now. We might want to use the name distinct for something later, and you could argue that our use of unique is a little different than Matlab's and coreutils uniq since we don't output sorted results (or require sorted inputs).

@tkelman
Copy link
Contributor

tkelman commented Apr 30, 2016

Would anyone object to renaming to hasduplicates? If I don't hear anything I'll open a PR, though we should leave it open a little longer this time to get more opinions.

@KristofferC
Copy link
Member

Late to the party here but hasduplicates is in my view a better name that is easier to mentally parse.

@stevengj
Copy link
Member

I don't see the problem with allunique.

@tkelman
Copy link
Contributor

tkelman commented Apr 30, 2016

It's unrelated to all, that name sounds like a composition to me but it's really a different collection property concept. Trying to see what's overall least objectionable.

@nalimilan
Copy link
Member

The term duplicates is used e.g. in DataFrames.jl, but nowhere in Base, so I still prefer allunique. I don't think there's any possible confusion with all(unique, x). We don't ship any function starting with all which would be a shortcut for all(f, x).

@tkelman
Copy link
Contributor

tkelman commented Apr 30, 2016

I was actually quite surprised that the convention was so uniform, we have nothing else with an all prefix and only haskey with a has prefix, every other exported property checking function starts with is.

What is duplicates used for in DataFrames? It seems more self explanatory than unique which you'd have to be familiar with from another language to recognize what that concept is referring to.

@johnmyleswhite
Copy link
Member

duplicates finds rows that would violate a unique key constraint if all of the columns of the DataFrame were a unique key.

Perhaps the names at https://stat.ethz.ch/R-manual/R-devel/library/base/html/duplicated.html are better: duplicated and anyduplicated.

@tkelman
Copy link
Contributor

tkelman commented Apr 30, 2016

Judging by

anyDuplicated(.) is a “generalized” more efficient shortcut for any(duplicated(.)).

I'm guessing you mean "better" for DataFrames' usage rather than the function being discussed in this issue?

@johnmyleswhite
Copy link
Member

I was actually suggesting anyduplicated as the name being bikeshedded here.

@nalimilan
Copy link
Member

That would indeed make sense if Base had a duplicated function (which would return true for elements which are equal to at least one element preceding them in the collection). Doesn't sound like an absurd addition to Base.

@StefanKarpinski
Copy link
Member

Wouldn't the more general version of this be a function which given a vector returns an Int vector of the same length where each value is the number of times that element has been seen at that point? I'm not entirely convinced that such a specific thing belongs in the standard library, however – it's pretty easy to write this function yourself, after all.

@tkelman
Copy link
Contributor

tkelman commented May 3, 2016

Ah. I didn't really follow what "violate a unique key constraint ..." meant.

The R precedent is good enough for me, final up/down on anyduplicated? #16152

@StefanKarpinski
Copy link
Member

Honestly, I think I'm going to think allunique and then remember, "oh wait, it's not called that, it's called something weird... oh yeah, anyduplicated... and then I need to negate it too."

@StefanKarpinski
Copy link
Member

I'm also just not concerned about ambiguity between allunique(x) and all(unique(x)) because the latter is a silly thing to do.

@tkelman
Copy link
Contributor

tkelman commented May 3, 2016

What does the "all" mean in the name then?

@stevengj
Copy link
Member

stevengj commented May 3, 2016

Given that we have unique(a), I think that allunique(a) is easier to remember and is reasonably unambiguous. English is not a very precise language, but I think "allunique = "all elements are unique" = "all elements are distinct" = "no elements are duplicated"` should be clear to most people.

@stevengj
Copy link
Member

stevengj commented May 3, 2016

(If you google "all unique" elements you'll find that this usage is pretty commonplace.)

@ymer
Copy link

ymer commented Jan 4, 2017

How about allsame(). Doesn't it have the same argument for inclusion?

@nalimilan
Copy link
Member

@ymer That's easy to write efficiently as all(x->x == X[1], X).

@stevengj
Copy link
Member

stevengj commented Jan 5, 2017

function allsame(X)
    isempty(X) && return true
    X1 = first(X)
    return all(x -> isequal(x, X1), X)
end

would be more correct (doesn't assume X is indexable), and uses isequal like allunique. The fact that it is easy to get this wrong is an argument for inclusion.

On the other hand Python doesn't seem to have an all_same function (though of the Python posted solutions on stackoverflow are buggy for the same reasons as above), nor does Ruby, nor does Haskell. If other languages don't include an allsame function in their standard libraries, that's an argument against including it in Julia.

@ymer
Copy link

ymer commented Jan 5, 2017

But those languages don't have an allunique function either. So it seems to me that the argument for inclusion of allsame (or allidentical maybe) still stands, given that allunique is included. It's also assymmetrical to have one function and not the other, even if the implementation is easier for one of them.

@ymer
Copy link

ymer commented Jan 5, 2017

I also prefer alldistinct to allunique semantically. (And prefer both to anyduplicated.) That's what I would say in normal speech "they are all distinct (from each other)". But I see the point that unique is already a function. Interestingly the Iterators package has a distinct function.

@tpapp
Copy link
Contributor Author

tpapp commented Jan 5, 2017

Maybe there is a common pattern here:

function allsatisfy(pred, itr)    # is there a nicer way to code this?
    state = start(itr)
    done(itr, state) && return true
    (prev_elt, state) = next(itr, state)
    while !done(itr, state)
        (elt, state) = next(itr, state)
        if pred(prev_elt, elt)
            prev_elt = elt
        else
            return false
        end
    end
    true
end

allsatisfy(<, 1:10)             # true
allsatisfy(>, 1:10)             # false
allsatisfy(==, 1:10)            # false
allsatisfy(<, [])               # true
allsatisfy(<, [1])              # true
allsatisfy(==, fill(1,10))      # true

Caveats:

  1. not tested extensively,
  2. I am terrible at naming things,
  3. perhaps this should go in a separate issue? it has little to do with allunique IMO.

@jeffreysarnoff-dev
Copy link

I am not attending these:

same duplicate
more often we apply our computational too two-y
skill to ascertain and utilize nonsameness less flexible

📆

candidates the contraquality usefulness with its dual
distinct indistict, absent discriminative factor 2
unique alike, with discernable commonality 3

Conceptually, anyunique is more crisp andclearly stated than anydistinct.
That is how I arrive here to suggest we use allunique.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests