
Make materialize_batch extendable without piracy #18

Open
ericphanson opened this issue Feb 12, 2023 · 5 comments

@ericphanson (Member) commented Feb 12, 2023

It would be nice to be able to extend it, in case e.g. something else needs to happen when collecting individual batch items for a particular type of Batcher or BatchItem. I think the issue is that we now have a table instead of a single item, so we can't just dispatch off the item type. One option is to let users pass a singleton type that gets threaded through and used for dispatch, something like the sketch below.
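
A rough sketch of that singleton idea (all names here are hypothetical, not the current API):

```julia
using Tables

# Hypothetical sketch, not the current API: a strategy singleton threaded
# through batch materialization, so downstream packages can extend behavior
# by overloading on a type they own instead of committing piracy.
abstract type MaterializeStrategy end
struct DefaultStrategy <: MaterializeStrategy end

# Stand-in for the real per-item loader; dispatching on the strategy lets
# packages customize this step too.
materialize_batch_item(::MaterializeStrategy, row) = row

# Fallback keeps today's behavior: asyncmap over the batch table's rows.
materialize_batch(strategy::MaterializeStrategy, table) =
    asyncmap(row -> materialize_batch_item(strategy, row), Tables.rows(table))

# A downstream package owns its singleton, so this overload is not piracy:
struct MyStrategy <: MaterializeStrategy end
materialize_batch(strategy::MyStrategy, table) =
    map(row -> materialize_batch_item(strategy, row), Tables.rows(table))
```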

In my case, I have two needs that the current method doesn't meet:

  • I want to use map instead of asyncmap, since my materialize_batch_item is doing a bit more work and isn't IO-bound (I think).
  • I don't want to return Arrays yet, but rather a Vector{Samples} and an array, since I still want the Samples wrapper for a bit longer. [I could avoid this by refactoring to move more work into this portion, since I do need a raw array in the end, but it would be nice to get there incrementally without piracy.]

edit: I also realized that materialize_batch_item tends to need piracy too, since the item comes in as a NamedTuple, not a BatchItem (or similar) object, at least with the Legolas v0.4 branch.

@kleinschmidt (Member)

The extension approach so far has focused on dispatching on the fields of the batch item (e.g. batch_channels can be something other than a Vector{String}). But yeah, I can see how that doesn't work at the level of batch assembly that you're talking about here.

> I want to use map instead of asyncmap since my materialize_batch_item is doing a bit more work and it's not IO bound (I think)

Is there a downside to using asyncmap in your case? I could foresee that if each batch item is chunky, you may need to materialize them sequentially to fit in RAM, but otherwise I can't really see one.

> I don't want to actually return Arrays yet but rather a Vector{Samples} and an array, since I still want the samples wrapper for a bit longer

This one is harder; I think your idea of wrapping the output of iterate_batch from your custom batches struct in a type you control is probably the best option. It looks like that would (at the moment) require an additional bit of indirection to make get_channel_data return the Samples instead of the array, but that can be controlled with dispatch on batch_channels; see the sketch below.
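
Something like this minimal sketch (hypothetical names; only get_channel_data and the batch_channels slot come from the actual code):

```julia
# The downstream package owns this selector type, which stands in for the
# usual Vector{String} of channel names in the batch row, so overloading
# get_channel_data on it is not piracy.
struct KeepSamplesSelector
    channels::Vector{String}
end

# Returns the Samples wrapper instead of a raw array, assuming channel-name
# indexing preserves the Samples type (as in Onda).
get_channel_data(samples, selector::KeepSamplesSelector) =
    samples[selector.channels, :]
```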

@ericphanson (Member, Author)

For asyncmap: I am calling into XGBoost to try to get some heuristic labels via another model (…really not sure this will help 😄), and I don't really know how well the C wrapper handles concurrency.

I would also like to pass an RNG through materialize_batch to do some augmentation there.

@kleinschmidt (Member)

IMO the Right Way to handle the augmentation threading is a "channel selector" type that "freezes" the RNG state/augmentation parameters at iterate_batch time. In general, we can't guarantee that materialize_batch happens in any particular order, so the batch itself needs to contain all the state necessary to reproducibly load the data, if that's important. Roughly like the sketch below.
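
A sketch of that freezing idea (AugmentedChannels, freeze, and augment are all hypothetical names, not real API):

```julia
using Random

# The seed is drawn once, at iterate_batch time, so materialization stays
# reproducible no matter when or in what order materialize_batch later runs.
struct AugmentedChannels
    channels::Vector{String}
    seed::UInt64  # frozen when the batch is assembled
end

# Called from iterate_batch: consume the batch-level RNG exactly once here.
freeze(channels::Vector{String}, rng::AbstractRNG) =
    AugmentedChannels(channels, rand(rng, UInt64))

# Called during materialization: rebuild a local RNG from the frozen seed,
# then apply the (hypothetical) augmentation to the selected channels.
materialize_item(data, selector::AugmentedChannels) =
    augment(data[selector.channels, :], Xoshiro(selector.seed))
```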

@kleinschmidt (Member)

> I don’t really know how well the C wrapper handles concurrency.

I guess whether that's an issue will depend on whether the Julia scheduler will switch to another task while the C call is happening.
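
For what it's worth, the scheduler only switches tasks at yield points, and a plain ccall never yields, so concurrent tasks serialize around it. A quick way to see this on Linux/macOS (a sketch; libc's sleep stands in for the XGBoost call):

```julia
# A raw ccall never hits a yield point, so the tasks asyncmap spawns run
# back-to-back; Julia's own sleep yields, so those tasks overlap.
blocking_sleep(s) = ccall(:sleep, Cuint, (Cuint,), s)

@time asyncmap(blocking_sleep, fill(1, 4))  # ≈ 4 s: no switching during ccall
@time asyncmap(_ -> sleep(1), 1:4)          # ≈ 1 s: tasks interleave
```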

@ericphanson (Member, Author)

I think it also depends on how well global state etc. is managed in the C lib, because even if Julia doesn't task-switch at a ccall, it could switch in between setup and execution calls (if there are separate ones; I haven't looked into it). Hopefully it would be ok.

glennmoy pushed a commit that referenced this issue Apr 26, 2023