Include parallel collections in Toolkit? #31
Comments
Would be nice to have indeed For bigger projects that would require things like |
If you depend on Toolkit and other libraries too, it all just ends up piled into your classpath and your application code is free to use or ignore or whatever it wants. |
@iusildra see also the typelevel toolkit. |
I feel like something like parallel collections would be nice, but probably not parallel collections themselves. IIRC the codebase was pretty old, had a lot of weird non-idiomatic/no-longer-idiomatic APIs (e.g. configuring things via mutability?), wasn't super well maintained, and didn't get widely adopted even after 10+ years. AFAIK the recommendation for years about using |
@lihaoyi I am not using the Toolkit and I am not likely to use it in the near future, but I am using parallel collections in my application to parallelize a few inner loops processing thousands of items. If I am not to use parallel collections, what should I use instead?
Could you provide some link? I do not see any kind of warning in https://docs.scala-lang.org/overviews/parallel-collections/overview.html. It says:
I'll answer this point by point.
I would describe it as mature and well-established. No one has needed to write a competitor, because this one does the job. I don't see why it matters how old the code is unless there is actually something wrong with the code.
I acknowledge that configuring the thread pool usage is a little weird, but it's rare that anyone doing Toolkit-y, script-y type things would need to configure it at all. Most of the time you just want to do some stuff in parallel and it does that. Easy things easy, harder things possible.
It doesn't need any maintenance. There aren't any open bugs to speak of.
No? There are 3500 hits at https://github.com/search?q=scala-parallel-collections&type=code . You might have this impression if people aren't talking about it, but sometimes there isn't a lot to say about a workhorse library like this.
I agree with Ondrej. There is no such recommendation. One aspect of the parallel collections that did have a negative reputation was how much they deepened and complicated the collections hierarchy when they were introduced. But we fixed that in Scala 2.13, when the parallel collections were re-engineered, became a separate library, and stopped being intertwined with serial collections. |
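For readers who have not seen the thread pool configuration discussed above, here is a minimal sketch of what it looks like. It assumes the scala-parallel-collections module is on the classpath; `TaskSupportSketch`, `sumOfSquares`, and the version in the dependency line are illustrative choices, not something taken from this thread.

```scala
//> using dep "org.scala-lang.modules::scala-parallel-collections:1.0.4"
// (the version above is an assumption; use whichever release you depend on)
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.CollectionConverters._
import scala.collection.parallel.ForkJoinTaskSupport

object TaskSupportSketch {
  // Hypothetical helper: square and sum `xs` on a dedicated pool of the given size.
  def sumOfSquares(xs: Vector[Int], parallelism: Int): Int = {
    val pool = new ForkJoinPool(parallelism)
    try {
      val pxs = xs.par
      // The pool is configured by assigning to a mutable field on the collection.
      pxs.tasksupport = new ForkJoinTaskSupport(pool)
      pxs.map(x => x * x).sum
    } finally pool.shutdown()
  }

  def main(args: Array[String]): Unit =
    println(sumOfSquares(Vector(1, 2, 3, 4), parallelism = 2))
}
```

If `tasksupport` is never assigned, the collection runs on a default pool, which is the "easy things easy" case.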
From the toolkit introduction:
I was under the impression that only libraries supporting Scala.js and Scala Native would be included, which I don't think is the case for the parallel collections. Although I guess one could publish them for Scala.js and Scala Native 0.4.x with dummy implementations that just wrap the sequential collections. (As an aside, I do use the parallel collections a lot on scala-cli scripts, so I'm excited to see this landing) |
This may be where my impression came from. I must admit, I don't have any concrete objections here, other than a vague feeling. Perhaps it's no longer a problem since it's been modularized. How do parallel collections play with If parallel collections are now a module, does that mean they are open to modifications? If so, then we should definitely consider sanding off some of the rough edges as part of including it in the toolkit, e.g. replacing the mutable threadpool thing with an
I've generally used val foo = items.map(x => Future{doThing(x)}).map(Await.result(_, Duration.Inf)) I admit it's a bit more clunky to use than |
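A self-contained version of the `Future`-based pattern quoted above might look like the sketch below. It uses only the standard library. `doThing` is a placeholder for whatever work is being done, and `parallelMap` is a hypothetical helper name, not an existing API.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object FutureMapSketch {
  // Hypothetical stand-in for the `doThing` in the comment above.
  def doThing(x: Int): Int = x * x

  // Start one Future per item, then block until each result is available.
  // Results come back in the same order as the input.
  def parallelMap[A, B](items: Seq[A])(f: A => B): Seq[B] = {
    val started = items.map(x => Future(f(x))) // every task is submitted here
    started.map(Await.result(_, Duration.Inf)) // then awaited one by one
  }

  def main(args: Array[String]): Unit =
    println(parallelMap(Seq(1, 2, 3, 4))(doThing))
}
```

Note that this relies on `items` being a strict collection: with a lazy view, each `Future` would only be created when it is awaited, and the work would run sequentially.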
Nowadays, there are fs2, ZIO, and Pekko/Akka Streams out there. -1 for including parallel collections.
There isn't anything I'm aware of that matches the simplicity of the parallel collections, so I think it's worth including for that reason. |
Just to echo Mr Li's impression, but maybe it's just that (as happens) the real-world effectiveness did not match heightened expectations. As usual, Mr Ichoran's summation is especially succinct and persuasive. The reply to He-Pin's objection, and one is liable to mix up one's kerrs, is that the toolkit is about simplicity and scope. So an easy-to-use solution of limited scope is desirable. |
What about sending an MR to the scala/dotty compiler and letting the compiler use it first? @SethTisue, how about submitting a poll on Reddit to see?
I forgot to say that I checked Released for Scala 3 nullifies any suggestion the project is moribund. Actually, #14 in this list is impressive: Worth adding that lack of tickets does not directly imply quality, but may imply usability in proportion to the use cases people use it for. @He-Pin also worth adding that suitability for a limited use case does not imply suitability for others. But one may ask if "inclusion in the toolkit" constitutes an endorsement, and should the user receive further guidance. I'm sorry this topic missed the recent survey that folks were complaining was already too long. "And before we let you hit submit, please tell us your thoughts on the parallel collections. Yes, that one. Good for noobs who have no opinions about concurrency and/or parallelism? or is par short for paradigmatic?" |
@som-snytt Hmm, interesting, ZIO is on the list too. IIRC, ZIO has zero dependencies. @SethTisue What's the status of scala/scala-parallel-collections#22?
That's about 2.12 and 2.12 isn't relevant to the Toolkit, which only targets 2.13 and 3
Interesting. I've recorded the suggestion at scala/scala-parallel-collections#251 . Note that the Toolkit includes os-lib, so that's a major precedent for including something that does stuff that only works on the JVM. |
I put a little straw poll up at https://twitter.com/SethTisue/status/1703774958789255199 to see what people think |
The point of having Toolkit is to be batteries-included for day-to-day operations, like parsing JSON a la Python, and hopefully to set new and experienced users on the paved path of Scala. Parallel operation is an interesting one because it is one of the recognizable benefits of adopting functional programming, and to some extent it has been the battleground of "how we do things" in Scala for the last decade, especially in terms of balancing a high volume of requests against slow/limited IO operations. I feel like, whether you're coming from Akka, Typelevel, or even plain ExecutionContext, one consensus is to avoid performing blocking operations without marking them as such. In other words, the simplicity of parallel collections might do the wrong thing for exactly the kind of use cases where one would want to use them, like reading many files.
There is also this course: https://www.coursera.org/learn/scala-parallel-programming. IIRC, when I went through the course, it wasn't spelled out too clearly/often that while the parallel collections used to be integrated with the main Scala release, they later got separated out into module https://github.com/scala/scala-parallel-collections. I see in https://github.com/scala/scala-parallel-collections/blob/439b9c6e7e68c0407d69f7b09074ed03c82271aa/README.md?plain=1#L36 that in older versions of Scala one could just invoke .par on a collection, whereas in later versions, the following import is needed to be able to do that: import scala.collection.parallel.CollectionConverters._ One thing that might discourage some people from looking further into parallel collections is the sentence in bold in the following passage in https://www.packtpub.com/product/learning-concurrent-programming-in-scala/9781783281411 (1st edition):
Not sure if the second edition differs in this respect. P.S. if there is ever a third edition, updated for Scala 3, I suggested a photo for the front cover: https://twitter.com/philip_schwarz/status/1530584650481127430 |
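To make the import requirement mentioned above concrete, here is a minimal sketch for Scala 2.13 or 3. It assumes the separate scala-parallel-collections module is on the classpath. The version in the dependency line, `ParSketch`, and `slowSquare` are illustrative assumptions.

```scala
//> using dep "org.scala-lang.modules::scala-parallel-collections:1.0.4"
// (the version above is an assumption; use whichever release you depend on)
import scala.collection.parallel.CollectionConverters._ // brings `.par` into scope

object ParSketch {
  // Hypothetical unit of work that is slow enough to be worth parallelizing.
  def slowSquare(x: Int): Int = {
    Thread.sleep(10)
    x * x
  }

  // `.par` converts to a parallel collection; `map` then runs on a thread pool.
  // The element order of the input is kept in the result.
  def squares(xs: Vector[Int]): Vector[Int] =
    xs.par.map(slowSquare).toVector

  def main(args: Array[String]): Unit =
    println(squares(Vector(1, 2, 3, 4)))
}
```

Without the `CollectionConverters._` import, the call to `.par` does not compile on 2.13 and later.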
However, it's released only for the JVM. It makes sense, as there is currently no way to do that on other platforms. It will change after the next minor of Scala Native, so we may consider releasing Adding a dummy implementation for any platform would do more harm than good. One would expect that If we added |
One question here: does scala-parallel-collections have any issues around blocking? e.g. A naive When you use |
scala-parallel-collections is implemented using @szymon-rd will the new Scala Native support @lihaoyi I suspect this answers your question as well. The implementation isn't naive. I don't know if @axel22 sees GitHub notifications, but perhaps he'd like to weigh in, as the first author of https://infoscience.epfl.ch/record/165523/files/techrep.pdf ("On a Generic Parallel Collection Framework", Aleksandar Prokopec, Phil Bagwell, Tiark Rompf, Martin Odersky, June 2011) |
That it uses ForkJoinPool doesn't really answer my question; Futures can use a fork-join pool too, and still suffer this failure mode. One option is that you run out of threads; another is that you spawn more threads and eventually run out of memory, because threads are expensive. I'm not aware of any third option apart from Loom virtual threads, which are a lot cheaper. Naive or not, these are hard problems in the design space for concurrency/threading frameworks, so I'd want to know how this stuff works. The fact that someone probably thought very hard about it a decade ago doesn't tell me what tradeoffs they ended up choosing, and thus what pitfalls any user-land code will have to be careful to avoid. A quick skim of the docs didn't pull up anything here. Maybe it's not a problem, but it's worth confirming.
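For context on the failure mode described above, this is roughly how plain `Future` code marks blocking work when it runs on the standard library's global `ExecutionContext`. It is a sketch of the `Future` side only and says nothing about how scala-parallel-collections behaves. `slowRead` is a hypothetical stand-in for a blocking I/O call.

```scala
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BlockingSketch {
  // Hypothetical slow I/O call, simulated here with a sleep.
  def slowRead(id: Int): Int = {
    Thread.sleep(50)
    id
  }

  def readAll(ids: List[Int]): List[Int] = {
    val futures = ids.map { id =>
      Future {
        // `blocking` tells the global pool that this task is about to block,
        // so the pool may temporarily add threads instead of starving.
        blocking(slowRead(id))
      }
    }
    Await.result(Future.sequence(futures), 1.minute)
  }

  def main(args: Array[String]): Unit =
    println(readAll((1 to 20).toList))
}
```

This is the "spawn more threads" option from the comment above: the extra threads are bounded by the pool's configuration, so it trades thread starvation for higher memory use under heavy blocking.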
Yes, it already does in the 0.5.0 snapshots. |
Probably I'm the only one who ported https://github.com/axel22/ScalaDays2012-TrieMap shortly before bed last night. It was a no-brainer and worked the first time in 2023 under WSL, the requisite decade since
I wonder if toolkit includes |
occasional/casual users are still a use case |
some support at https://users.scala-lang.org/t/scala-toolkit-0-2-0-is-out-discussion/9355/4 :
Yes I support this, especially currently the students of the Parallel Programming course are in dire need of an easy quick way to use |
I was a bit surprised recently to realize that we didn't include scala-parallel-collections in the Toolkit.
Just now I looked at the old spreadsheet of candidate libraries, assuming I'd see it was considered and rejected or postponed, but I don't even see a spreadsheet entry for it? Was it really never considered?
I think we should include it, or at least discuss including it. It has a lot of things going for it:
scala. namespace
What about the library's usefulness/importance?
For some tasks, it's extremely convenient and can give a large speed benefit. I think it didn't end up being quite as widely used as we'd originally envisioned, but when the unit of work that you need to happen in parallel is large enough, it's super easy to use parallel collections to get a big speedup.
Here's an example. I have a Scala-CLI script that does several hundred independent GitHub API queries. With only this small diff:
I sped up the script from taking 5 minutes to taking less than 30 seconds.