Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

About SparseVectors #11324

Closed
lindahua opened this issue May 18, 2015 · 63 comments
Closed

About SparseVectors #11324

lindahua opened this issue May 18, 2015 · 63 comments
Labels
needs decision A decision on this change is needed sparse Sparse arrays
Milestone

Comments

@lindahua
Copy link
Contributor

There's been a lot of debate about whether we should incorporate sparse vectors into Julia 0.4.

Here are three options:

  • (A) Use SparseMatrixCSC to emulate a sparse vector. This, as many have pointed out in other issues, breaks the consistency with the dense arrays. It is also unnatural and inefficient.
  • (B) Add SparseVector to base before 0.4 is released. The functionalities have been developed and tested in SparseVectors.jl. These functionalities are relatively simple and they don't need a long time to get mature. ([WIP] SparseVector support + tests #8718)
  • (C) Deprecate sparse vector related functions in Base, and add SparseVector to base after 0.4 is released. (Remove methods for creating "sparse vectors" #11323)

Personally, I don't like (A), and I am fine with either (B) or (C).

One opinion against (B) is that there can be multiple ways to represent a sparse vector. However, IMHO, the way used in SparseVectors.jl seems to be the most natural way, and it is consistent with both CSC and COO formats. I understand that other formats may be more suitable for certain situations. For such cases, it is not difficult to develop a new sparse vector format (using a different type name of course).

cc: @tkelman @ViralBShah @StefanKarpinski @mlubin

Since @ViralBShah has tagged #8718 as 0.4, and that PR was closed, I tag this as 0.4 in order to make sure that we come to a decision before 0.4 is released.

@lindahua lindahua added needs decision A decision on this change is needed sparse Sparse arrays labels May 18, 2015
@lindahua lindahua added this to the 0.4.0 milestone May 18, 2015
@ViralBShah
Copy link
Member

I am in favour of (B) and to get it all right for 0.4. I don't like just removing the functionality and then adding it again later, which would break codes yet again. This will be an improvement over what already exists, IMO.

@mlubin
Copy link
Member

mlubin commented May 18, 2015

Agreed that the implementation in SparseVectors.jl is an improvement over what already exists. Don't take my objection over the maturity and name too seriously if there's a consensus for option (B).

@andreasnoack
Copy link
Member

The "inconsistency" between dense and sparse arrays doesn't go away by adding a sparse vector type. Even with a sparse vector, there is still no sparse arrays for dimensions higher than two, so I don't buy the consistency argument and therefore also not the "unnatural" argument. It might be very useful with a sparse vector type, but I think we should focus of the class of operations that could benefit from a sparse vector type and some performance measures.

@tkelman
Copy link
Contributor

tkelman commented May 19, 2015

Here's your inconsistency argument:

julia> A = rand(5,5); As = sparse(A);

julia> size(A[:, 3])
(5,)

julia> size(As[:, 3])
(5,1)

I'm repeating myself, but I don't care that sparse doesn't generalize beyond dimension 2 yet. We're doing dimension 1 wrong, along with the operations that result in dimension 1 from dimension 2. Let's just fix that for now.

@ViralBShah
Copy link
Member

It is these kinds of inconsistencies I don't like, which make the system feel lacking. I wish we had N-d too, but that is likely to happen in JuliaSparse rather than Base.

I also feel that a sparse vector will be useful with non-numeric data types as well.

@ScottPJones
Copy link
Contributor

Although, for non-numeric data types, what is the type for fields not represented?
The types of sparse arrays and vectors I have implemented (for database use) didn't store null or undefined fields. For Julia, I guess you'd need type Any, and Nothing.

@mlubin
Copy link
Member

mlubin commented May 19, 2015

@andreasnoack, the storage for a sparse vector and a 1-column CSC matrix is essentially the same, I don't think there are any performance differences. The argument is that the current semantics are wrong.

@andreasnoack
Copy link
Member

I still think that it would be useful to see examples application of sparse vectors. The indexing behavior of special matrix types is hard. It is a standard attack on the special matrix types and one of the reasons they might not fit well as subtypes of AbstractArray. I don't think indexing is the most important feature of sparse matrices or special matrices in general. They are mainly about * and \.

@johnmyleswhite
Copy link
Member

One common sparse vector application:

  • Read in a text corpus document-by-document and train a logistic regression classifier with an L1 penalty on parameters. Your parameter vector is a sparse vector and each new document (of which there should be exactly one in memory at a time) is a sparse vector. The only thing you need to do is compute dot products and update the parameter vector.

@lindahua
Copy link
Contributor Author

The SparseVectors package was initially motivated by the need to represent sparse samples in machine learning applications (@johnmyleswhite's example is an important case).

For sparse vectors, we often want to compute dot products, compute distances between them, add a sparse vector to another vector (in the gradient descent step), etc. Matrix-vector product is also very useful.

These are real applications. Look at recent papers on stochastic optimization, many new algorithms studying the procedures that work on sparse samples, one at a time.

@lindahua
Copy link
Contributor Author

Looks like, the general opinion is towards option (B). I will make a PR later this week.

@andreasnoack
Copy link
Member

Thanks for the examples. I think all of them could work for vectors represented by Nx1 sparse matrices. After all, that approach works well in MATLAB, but go ahead with another type. I won't obstruct the process further.

@mbauman
Copy link
Member

mbauman commented May 19, 2015

I preface this with the caveat that I do not do sparse linear algebra, but I have been playing with indexing a lot lately. I've not tried this yet, either.

Would it be possible to add a "phony" dimensionality to the CSC type, allowing it to pretend to be one or multi-dimensional? All indexing would be reduced or expanded to its canonical two dimensions. This will be very inefficient for high dimensional arrays, but it should be fairly easy to make it semantically correct and consistent (especially if we get #10525 and #11242). The JuliaSparse organization can provide more specialized types that are more efficient in the non-2-dimensional cases.

@ScottPJones
Copy link
Contributor

Somewhat OT: Since SparseVectors have decidedly numeric meaning in Julia (i.e. it is zero values that are removed, instead of null values), how would you name a package for storing efficiently sparse vectors of any type, where many values may be null? [think of JSON arrays, RDBMS rows, etc.]... Does something like that already exist, and I just haven't found it yet? Thanks!

@quinnj
Copy link
Member

quinnj commented May 19, 2015

Scott, check out the DataArrays.jl package.

On Tue, May 19, 2015 at 8:29 AM, Scott P. Jones [email protected]
wrote:

Somewhat OT: Since SparseVectors have decidedly numeric meaning in Julia
(i.e. it is zero values that are removed, instead of null values), how
would you name a package for storing efficiently sparse vectors of any
type, where many values may be null? [think of JSON arrays, RDBMS rows,
etc.]... Does something like that already exist, and I just haven't found
it yet? Thanks!


Reply to this email directly or view it on GitHub
#11324 (comment).

@tkelman
Copy link
Contributor

tkelman commented May 19, 2015

@ScottPJones Nullable types, or DataArrays potentially.

@mbauman that sounds finicky and more complicated than just adding a separate type for 1-d. If we're going to distinguish vectors from matrices within the type system for dense arrays, consistency really demands that we do the same for sparse. Faking it adds complexity to the CSC type, whose internals are already quite commonly used and relied upon not changing by packages.

I have an application where I build up a sparse CSC matrix representation column-by-column, often reordering columns but rarely changing their contents. For this, an array-of-SparseVectors is an ideal representation. Representing this as a 1-column CSC matrix per "vector" adds overhead for an extra integer field and an extra 2-element integer array for the column pointers for each object, which I'd really rather avoid since my columns have very few nonzeros on average.

@mlubin
Copy link
Member

mlubin commented May 19, 2015

@tkelman, I would love to be able to vcat an array of sparse vectors to get a sparse CSC matrix. I think I'm coming around on this.

@tkelman
Copy link
Contributor

tkelman commented May 19, 2015

hcat, you mean?

@mlubin
Copy link
Member

mlubin commented May 19, 2015

Yes, hcat

@tkelman
Copy link
Contributor

tkelman commented May 19, 2015

sparse matrices or special matrices in general [...] are mainly about * and \.

I'm going to have to disagree with you pretty strongly there. I use sparse matrices very heavily, but I do essentially zero linear algebra with them in Julia. Why? Because Julia's sparse matvec and matmul are still quite a bit slower than doing it in C++ or using MKL's sparse blas extensions, and the sparse factorizations available in Base aren't complete enough for my needs (SuiteSparse has no Bunch-Kaufman indefinite LDL^T).

My sparse matrices do have linear (and often custom nonlinear) algebraic meaning, but that gets sent to a compiled solver to do all the actual work. So indexing, reductions, data structure traversal and creation within Julia are really important to me on sparse and structured array types (re: JuliaLang/LinearAlgebra.jl#136). I should send Jiahao an email and see about visiting MIT and working on it for a few weeks leading up to JuliaCon.

Eventually Julia will be fast enough to write the solvers themselves - if I can get diagonal-D quasidefinite factorizations from Cholmod to work properly I've been meaning to write a Mehrotra QP solver as a proof-of-concept. Will be interesting to benchmark against Ipopt.

@ViralBShah
Copy link
Member

@jiahao is going to be travelling from now to JuliaCon. I will probably be around MIT a few days before and after JuliaCon - so would love to push on sparse matrix work in any way possible.

@ViralBShah
Copy link
Member

@mbauman The CSC/CSR format is dense in one dimension. It doesn't generalize well to higher dimensions, where you really just want the COO format - representing indices as tuples. I have thought about faking it in CSC, but you just get extra overhead for processing vectors. I would love to make the two representations close enough, so that one can write generic code for iterating through either SparseMatrixCSC or SparseVector.

@ViralBShah
Copy link
Member

@ScottPJones You can store anything in Julia's sparse data structures already:

julia> sparse([1,2], [1,2], ["hello", "world"])
2x2 sparse matrix with 2 ASCIIString entries:
    [1, 1]  =  "hello"
    [2, 2]  =  "world"

@ScottPJones
Copy link
Contributor

@ViralBShah What about the following?

julia> z = sparse([1,2], [1,2], ["hello", 1.5])
2x2 sparse matrix with 2 Any entries:
    [1, 1]  =  "hello"
    [2, 2]  =  1.5

julia> z[1,2]
ERROR: `zero` has no method matching zero(::Type{Any})
 in getindex at sparse/sparsematrix.jl:773

The problem is that Julia's sparse data structures all have 0 not be represented, but for RDBMS rows, you want NULL to not be represented....

@ViralBShah
Copy link
Member

Julia's sparse data structures although not strictly restricted to numeric values, have only been used with numeric values so far. Hence all the rough edges for non-numeric data. This discussion should really be on the mailing list and not here in this issue, where it is off-topic.

@tkelman
Copy link
Contributor

tkelman commented May 21, 2015

Why should SparseVectors be in base? What is the benefit of this?

This is a bugfix. Read the entire thread, this is fixing a semantic inconsistency in the way the current sparse matrix implementation interacts with 1-dimensional indexing.

Sparse matrices are far too large, important, and widely used to consider removing before 0.4. We're already months behind schedule, let's not make it any worse.

How much effort that takes and pain it causes is contingent on how cleanly the excision goes (which is an open question).

Considering this has gone poorly every time we've tried it, I have serious reservations about doing this again until we work out how to do it more smoothly. For one thing I've suggested leaving functionality in base but moving it into independent modules instead of exporting it by default, which would leave the functionality still available but behind an import, and cause a much easier-to-deal-with breakage in packages.

@mlubin
Copy link
Member

mlubin commented May 21, 2015

@tknopp @StefanKarpinski, I was on the fence about including sparse vectors in Base, but really this addresses a significant usability and semantics issue with the current implementation, and better yet there's already code to do so and @lindahua is willing to push through a PR.

Agreed with @tkelman. Let's make the incremental improvement now and not create a huge new time sink for everyone to deal with.

@tkelman
Copy link
Contributor

tkelman commented May 21, 2015

Precompilation will be addressed in time, more significant is what consequences saying "sparse matrices are not important enough to worry about in base" would have to the package version of sparse matrices being able to work well at all. If base Julia ignores the existence of sparse matrices and writes all routines in a way that they can only ever be efficient for dense matrices, that will be crippling. You need to rewrite all of LinAlg, every reduction, all of broadcast, all indexing and concatenation over again for every sparse matrix type. And again, squared, for all conversions and linear algebra operations between different types. We need a better framework in base first, for which SparseMatrixCSC can be the first implementation - ideally a small, pluggable one that could live out in a package, but the framework needs to exist first.

@StefanKarpinski
Copy link
Member

@ViralBShah wrote:

I believe this functionality is useful as is too. We will discover fallbacks to generic implementations, but we are pretty good about fixing them quickly.

This attitude is what I object to. We had this ongoing steady stream of broken sparse matrix stuff for a long time. We are past the point where it's acceptable to have that in Base – we're trying to get rid of that kind of thing; it's certainly shouldn't be adding more of.

@tkelman
Copy link
Contributor

tkelman commented May 21, 2015

We had this ongoing steady stream of broken sparse matrix stuff for a long time

Stefan, I don't understand why you think it's suddenly fixed. It isn't yet.

@StefanKarpinski
Copy link
Member

It's better than it used to be and I object to introducing more half-baked functionality into base.

@mlubin
Copy link
Member

mlubin commented May 21, 2015

@StefanKarpinski, I think @lindahua's proposed PR is already or can be brought to a state that's not half baked before merging.

@tkelman
Copy link
Contributor

tkelman commented May 21, 2015

This isn't half-baked. Things have only gotten better by gradual improvement. Which this is a case of.

@tknopp
Copy link
Contributor

tknopp commented May 21, 2015

Sparse matrices are far too large, important, and widely used to consider removing before 0.4. We're already months behind schedule, let's not make it any worse.

Why should being to large be an argument for not moving it in a package? Importance and widely used are pretty subjective. In my subjective opinion PyPlot is a lot more important and widely used than the sparse module.

If I understand Tony correctly there will be major overhaul for the sparse module in the future (e.g. generalization to CSR matrices...) and in my opinion there is no benefit of having this development in base. It will make the development even much harder than if it would be developed in a package. Or does nobody see this point?

@mlubin
Copy link
Member

mlubin commented May 21, 2015

@tknopp, I agree with you and have stated before that CSR shouldn't be developed in base, but that's not related to this PR. Sparse vectors aren't a major overhaul, they're a bugfix.

@tknopp
Copy link
Contributor

tknopp commented May 21, 2015

@mlubin: This is not a PR yet but a question how to proceed. If you all want this than simply go ahead and do it. The reason I started this discussion is to show the advantages of this being in a package. And I still have not seen an argument why it would be a bad thing.

@tkelman
Copy link
Contributor

tkelman commented May 21, 2015

We agree with you on everything except timeframe, and maturity of the code removal process. 0.4 shouldn't be removing any more code. 0.4 should also stop adding code that isn't fixing bugs for that matter, and try to get out the door before fall. The bug here is As[3,:] returns a 1-column matrix instead of a 1-d vector, if there's a way to fix that bug before 0.4 without adding a new type or introducing some ugly fragile workarounds (#11323), speak up. #8718 was on Viral's personal wishlist to get into 0.4 for some time.

There's also a non-negligible risk of the sparse code bitrotting and losing a lot of visibility if it gets moved out of base. Sparse matrices will need the right infrastructure to exist in base first (which will have to be developed here) before they can work well as a package, and just as importantly they will need enough people to maintain them or they will wither and die.

@tknopp
Copy link
Contributor

tknopp commented May 21, 2015

We agree with you on everything except timeframe, and maturity of the code removal process. 0.4 shouldn't be removing any more code. 0.4 should also stop adding code that isn't fixing bugs for that matter, and try to get out the door before fall. The bug here is As[3,:] returns a 1-column matrix instead of a 1-d vector, if there's a way to fix that bug before 0.4 without adding a new type or introducing some ugly fragile workarounds (#11323), speak up. #8718 was on Viral's personal wishlist to get into 0.4 for some time.

  • Timeframe will always be bad
  • Removing the package is an equally valid bugfix ;-)

There's also a non-negligible risk of the sparse code bitrotting and losing a lot of visibility if it gets moved out of base. Sparse matrices will need the right infrastructure to exist in base first (which will have to be developed here) before they can work well as a package, and just as importantly they will need enough people to maintain them or they will wither and die.

This is where I object to. If there is not enough momentum to maintain this then the code is simply not important and should definitely not be in base. Visibility loss is also something which can be said about almost every package and we need other mechanisms to highlight "must have" packages.

@tkelman
Copy link
Contributor

tkelman commented May 21, 2015

Timeframe will always be bad

Completely wrong. The perfect time to make major, significant, breaking reorganizations and changes is early in a -dev cycle. We're not early in a -dev cycle, we're overdue for determining scope completion to close the -dev cycle to move to -pre and a release. We've blown past the 6 months for -dev then 3 months in -pre plan already.

Removing the package is an equally valid bugfix ;-)

Not until a replacement is available, tested, and working. And Base is capable of supporting that package working at all.

If there is not enough momentum to maintain this then the code is simply not important and should definitely not be in base.

We're way off topic, but the counterargument to that is how few commits to base are necessary to make use of this type of code once it's there and in place. Miles, Iain, etc have very few base commits, but their packages are a major driver of adoption. If Julia has a single killer app right now where it's undoubtedly better than the competition, it's OR, which you can't do at all without sparse matrices. Let's not cause major breakage there by deleting code without preparing a replacement first.

@johnmyleswhite
Copy link
Member

If Julia has a single killer app right now where it's undoubtedly better than the competition, it's OR, which you can't do at all without sparse matrices.

I think this is the core point of Tony's argument and it's completely accurate from what I can see.

@tknopp
Copy link
Contributor

tknopp commented May 21, 2015

Completely wrong. The perfect time to make major, significant, breaking reorganizations and changes is early in a -dev cycle. We're not early in a -dev cycle, we're overdue for determining scope completion to close the -dev cycle to move to -pre and a release. We've blown past the 6 months for -dev then 3 months in -pre plan already.

Yes you are right, I agree with that

We're way off topic

These discussions always end up with a statement that we are off topic. But where if not here should these discussions happen?

Let's not cause major breakage there by deleting code without preparing a replacement first.

I did not say first remove it and then do the package. Of course the order is the other way around.
But you are probably right that doing the package is a little bit more complicated due to the binary dependencies. If these were not there it would be a simple copy of the sparse directory into a new git repo

@tknopp
Copy link
Contributor

tknopp commented May 21, 2015

@johnmyleswhite:

  • What is OR?
  • Does OR require a package? If yes where would be the issue if an additional package is loaded?

@tkelman
Copy link
Contributor

tkelman commented May 21, 2015

OR = operations research, aka optimization, so JuMP, Convex, MathProgBase, the solver bindings, and to some extent Optim.

Yes these require external packages, and it probably wouldn't be the end of the world to add an additional dependency package to the mix, BUT, it needs to be done right. It's not yet clear to me, as things stand right now, if taking sparse matrices out of base wholesale can leave both parts working as well together as they did before moving them.

edit: I do think this is something we should eventually do, but it's a substantial amount of work and we shouldn't rush it. Either early in 0.5-dev or early in 0.6-dev, if someone wants to push on this then it's worth doing at that time.

@mlubin
Copy link
Member

mlubin commented May 21, 2015

We all want the same thing here, but I don't find the proposal to block a bugfix in favor of a major reorganization as we're trying to finish up 0.4 very convincing. @lindahua has put his valuable time and effort into developing sparse vectors, and we all have better things to spend our time on than dealing with the fallout of removing everything sparse from Base.

@ViralBShah
Copy link
Member

+1

@lindahua lindahua mentioned this issue May 24, 2015
6 tasks
@ViralBShah ViralBShah modified the milestones: 0.5, 0.4.0 Jun 5, 2015
@tknopp
Copy link
Contributor

tknopp commented Jun 5, 2015

Reading this thread in retrospect taking into account Virals change in the milestone hopefully shows that it could be a good move to decouple the sparse array development from Julia...

@tkelman
Copy link
Contributor

tkelman commented Jun 5, 2015

@tknopp there's still no particularly clean way of both fixing the indexing inconsistency bug with SparseVectors as a package while also avoiding massive package breakage or name conflicts by trying to move sparse matrices wholesale into packages. If you would like to start helping do some of the work to prepare package migrations for some of the functionality that's in base right now, please go ahead.

@ViralBShah
Copy link
Member

@tknopp I request not bringing off-topic discussions into issues. I also do not see how this conclusion is drawn - but this is a discussion to be had elsewhere. If you do wish to continue this discussion, I suggest the mailing list or an appropriate issue.

@malmaud
Copy link
Contributor

malmaud commented Oct 14, 2015

Can this be closed in light of #13440?

@tkelman tkelman closed this as completed Oct 14, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs decision A decision on this change is needed sparse Sparse arrays
Projects
None yet
Development

No branches or pull requests