Skip to content
This repository has been archived by the owner on Oct 26, 2024. It is now read-only.

User-facing API review #268

Open
3 tasks
tbenthompson opened this issue Feb 7, 2023 · 11 comments
Open
3 tasks

User-facing API review #268

tbenthompson opened this issue Feb 7, 2023 · 11 comments

Comments

@tbenthompson
Copy link
Member

tbenthompson commented Feb 7, 2023

Time calcifies designs. We're writing more and more models and tutorials using imprint. Every month, it'll get more expensive (in terms of effort/time) to change APIs. Once we have external users, we'll need to worry about backwards compatibility before changing APIs.

James, Mike, could you both spend 30-60m thinking about this and share thoughts?

I want to mostly focus on imprint rather than "internal" stuff like Adagrid since it will always be easier to make changes on internal tools where we have access to 100% of the calling code.

Let's use this issue as a place to gather complaints and comments about the comment design. Either add a comment below, or add a link to another issue.

External/open-source tooling:

Internal tooling:

@tbenthompson
Copy link
Member Author

  • Actually implementing and trying out the detailed parameter for sim_batch. Is that a useful thing?
  • Maybe: rename sim_batch to stat_batch.

@JamesYang007
Copy link
Member

JamesYang007 commented Feb 7, 2023

I've actually never played around with calibration so no inputs there yet. All the model interface looks good to me for now. I haven't felt like it was awkward in any way. Only one suggestion:

  • Gridding was something that felt like a barrier sometimes. Sometimes I want to actually manually just create a grid and feed it in. Problem is that it's not just the thetas I need worry about; I need to make sure all the other columns in grid.df is initialized properly as well. Can we standardize a function for general gridding where user feeds in any subset of the columns in grid.df and the function assumes they're "correct" and infers what the rest of the columns are? For example, if I feed in thetas only, it will figure out null hypos flags (among other columns). If I feed in both thetas and null_hypos flags, it will only figure out the rest of the columns.

@tbenthompson
Copy link
Member Author

Problem is that it's not just the thetas I need worry about; I need to make sure all the other columns in grid.df is initialized properly as well.

I'm guessing you want something halfway in between init_grid and cartesian_grid. I've wanted that too.

Function signature something like:

def create_grid(theta, radii, null_hypos=None, prune=True):
    ... 

Then, I think I would set grid.init_grid to be private and call it like _raw_init_grid. Then the two main grid creation functions would be create_grid and cartesian_grid.

If I feed in both thetas and null_hypos flags, it will only figure out the rest of the columns.

Does this work? I think we always need theta and radii. I suppose if we just had theta, we could construct a Voronoi diagram for theta and then make bounding boxes for each voronoi cell to construct the tiles. Is that what you were imagining? That actually could be super nice and ergonomic for 1D and 2D problems. It would get a bit harder to understand and implement in 3D and higher...

@tbenthompson
Copy link
Member Author

tbenthompson commented Feb 7, 2023

Question for @JamesYang007.

Should Grid.prune() remove inactive tiles?

  1. What is an inactive tile? Our grids are append-only until explicitly pruned or subsetted. So, when a null hypothesis splits an input tile into two children, what happens is that the new grid actually contains three tiles: the parent tile marked as parent['active'] = False and the children with children['active'] = True. We also have a parent_id column. This inactivity/append-only design is also used heavily in the adagrid code. The nice thing about this is that you can look at the tiles to figure out how the grid got to be the way it is.
  2. Currently, Grid.prune() removes tiles that are fully in the alternative space, but does nothing about inactive tiles.
  3. Having inactive tiles accidentally in our results has caused a couple bugs recently.
  4. This is because ip.validate and ip.calibrate output results for all the tiles that are passed to them including tiles that are inactive. This seems like something worth changing! I think ip.validate and ip.calibrate should produce output only for active tiles (and perhaps also print a warning if those functions were passed inactive tiles).
  5. Grid.active() explicitly removes the inactive tiles.
  6. Should we make Grid.prune() also remove inactive tiles?
  7. We could rename Grid.active to Grid.prune_inactive and rename Grid.prune to Grid.prune_alternative.
  8. We could add a second (default True) parameter to cartesian_grid (and create_grid) above called prune_inactive=True.

@JamesYang007
Copy link
Member

JamesYang007 commented Feb 7, 2023

I'm guessing you want something halfway in between init_grid and cartesian_grid. I've wanted that too.

Function signature something like:

def create_grid(theta, radii, null_hypos=None, prune=True):
    ... 

Then, I think I would set grid.init_grid to be private and call it like _raw_init_grid. Then the two main grid creation functions would be create_grid and cartesian_grid.

Yah this is perfect!

If I feed in both thetas and null_hypos flags, it will only figure out the rest of the columns.

Does this work? I think we always need theta and radii.

Oopsie I forgot about radii, but yes, the create_grid is basically what I'm talking about.

@JamesYang007
Copy link
Member

JamesYang007 commented Feb 7, 2023

As a user of just the cartesian_grid (and soon create_grid) I like prune_inactive=True!! While you're at it, to keep the prune_inactive, prune_alternative naming consistent, we can rename prune to prune_alternative in cartesian_grid.

@tbenthompson
Copy link
Member Author

Current public API of imprint:

  • Grid class and its methods
    • also grid constructors like create_grid and cartesian_grid.
    • The contents of the grid dataframe.
    • plot_grid
  • NullHypothesis class
    • HyperPlane
    • hypo helper function for easily defining planar nulls.
  • Model class
    • family and family_params
    • sim_batch
  • calibrate
  • validate
  • batch and batch_all - currently "public" because they're potentially useful for a model writer.
  • logging (but imprint doesn't do very much logging right now)

@tbenthompson
Copy link
Member Author

Reading through the above, I think one more place where we should modify the API is family and family_params. It would be nice to allow defining a custom family. We can just check whether family is a string or not. If it's not, we expect it to follow the same interface as the built-in bounds.

@JamesYang007
Copy link
Member

JamesYang007 commented Feb 7, 2023

Ah I was gonna go about this by making the users register it in the dictionary:

bound_dict = {
"normal": bound.normal.NormalBound,
"normal2": bound.normal2.Normal2Bound,
"scaled_chisq": bound.scaled_chisq.ScaledChiSqBound,
"binomial": bound.binomial.BinomialBound,
"exponential": bound.exponential.ExponentialBound,
}

but this is probably safer and more scalable.

@tbenthompson
Copy link
Member Author

tbenthompson commented Feb 7, 2023

We could get rid of family_params and have the Model-writer pass an object for family:

import imprint as ip
class ExampleModel(ip.Model):
    def __init__(self, ...):
        self.family = ip.bound.Binomial(n=10)

@JamesYang007
Copy link
Member

I like this a lot more. I wasn't fond of having to think about family and family_params as separate entities. We also don't have to do the dictionary checking anymore then. Validation/calibration just needs to check at some point that the model class does have a member called family.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants