
Compensation (Python API) #340

Open · wants to merge 28 commits into develop

Conversation

castillohair
Collaborator

Solves the Python API portion of #238.

@JS3xton and I discussed a few issues to resolve before this gets accepted. I will transcribe those later (unless you beat me to it).

@castillohair
Collaborator Author

castillohair commented Oct 9, 2020

These are replies to comments sent by @JS3xton via email.

I would want to see a minimum working example.

I'm going to assume that this point refers to an example here in the PR comments, showing that the function works. I'll do that soon.

I would want to see the new documentation, tutorial, and example material.

Three things here:

  • Documentation: Are you referring to anything besides the function docstring?
  • Tutorial: This depends on having an example dataset. See next point.
  • Example material: I guess this refers to a public-facing example dataset. One issue is that it's proving surprisingly hard to find a dataset appropriate for compensation, since we rarely used multiple fluorophores. I have kind of a crappy example (with cells moving all over the place in FSC/SSC because their growth was weird). Karl hasn't been able to find a full compensation dataset yet. I remember you measured a GFP/mCherry transfer function. Do you have this dataset? Does it have autofluorescence and single-fluorophore controls? If so, it would be a great example dataset. The other issue is whether this new dataset replaces the one you just added for the violin plots, or coexists with it. I don't like the latter too much, but I could live with it, I guess. If you found your dual-color example, it could actually illustrate both violins and compensation very effectively.

It would be nice if Excel UI integration was also done so all compensation-related changes could be introduced at once, but I don't think it's necessary. Moreover, I think major unresolved issues remain regarding whether and how to report summary statistics for compensated data. Without the Excel UI, though, compensation is only useful for Python API users.

Yeah, unfortunately we never got any of this done, so I'm not willing to port this into the Excel UI for now. There's precedent for capabilities only available to Python users: the ellipse gate function, the new reusable density gate you made, and maybe more.

I think compensate.compensate() should be in the transform module (it conforms to the transform module function signatures), e.g. transform.to_compensated().

Eh. While I think you're technically right, there is precedent for moving transform-like functionality outside of this module to showcase a specific feature (i.e. the mef module). More generally, I've become skeptical about the need to have a dedicated module for "transformations". I think this came out of our old view that it was worth distinguishing between "channel" units and "a.u.", and therefore having a module that transformed between these two. But having worked with a lot of flow cytometry data, including data from more modern instruments which are stored directly in a.u., I started seeing channel units as an intermediate step that should not be used for anything. If present-day me had to remake FlowCal from scratch, I'd probably have FCSData objects be directly converted to a.u. upon loading, and eliminate the transform module, since we never used it for anything other than the to_rfi() function. That way FCSData objects from old and new instruments will be automatically in a.u., improving consistency. tldr: I prefer to keep compensation in its own module.

To conform to the mef precedent, compensate should have compensate.get_transform_fxn(). It may still be worth keeping compensate.get_compensation_params() if easily retrieving the autofluorescence vector and the bleedthrough matrix is useful.

compensate.get_transform_fxn() is not a bad idea. But note that the use case is different compared to calibration: People expect to be able to use compensation controls from a previous day, the same way you use autofluorescence from a different day. This means that you should have a way to actually obtain the compensation coefficients from a control experiment, and a way to use them later (or I guess you could load the raw data of your control experiment every time). In contrast, you're always expected to have a beads file with your current experiment that you can load and use with mef.get_transform_fxn(). I'll think about how to organize this.

I think the default statistic_fxn() for compensate.get_compensation_params() should be mean, since it has a probability-theory-justified interpretation, as discussed in the issue. (But I'm not willing to die on that hill.)

Sure.

What if I had replicates of the controls used by compensate.get_compensation_params() to calculate the autofluorescence vector or the bleedthrough matrix? Do we just not support that?

Yeah that's a good question. a0 is the result of applying statistic_fxn() to the autofluorescence sample. If we had many, do we make a0 the mean of these? or the statistic_fxn() of these? or do we concatenate events and take the statistic_fxn() of the result? You can see how this can turn into one of those endless stats debates. I'm fine supporting only one replicate.

In the GitHub issue, you mention labeling the compensated data "Compensated [channel name]", but the code doesn't appear to do that. (I don't think to_mef() changes the channel names, so this may no longer be a good idea. to_mef() does appear to ensure the range is updated, though.)

Yeah I don't know what I was thinking. We don't change the channel name to "Calibrated FL1" after applying to_mef().

Thinking wistfully ahead, I wonder where code to compensate to "absolute" fluorescence would go? I.e. using emission and absorbance spectra, laser power, etc. to create a signal independent of the instrument, like we discussed in the interlab study. Would that belong in compensate? If so, it would be nice if that could be added later in a backwards-compatible way.

I think at the end it will come down to a linear transformation anyway, which is what this module implements. And making room for a more sophisticated compensation method is yet another reason to keep compensate as its own separate module.
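For concreteness, the linear transformation being discussed can be sketched in plain numpy. Everything here (names, signature) is illustrative rather than FlowCal's API; it assumes the model s = A·f + a0 per event, as in this thread:

```python
import numpy as np

# Minimal sketch of compensation as a linear transformation: given an
# events-by-channels array `s`, a bleedthrough matrix `A`, and an
# autofluorescence vector `a0`, recover f = A^-1 (s - a0) for each event.

def compensate_events(s, A, a0=None):
    s = np.asarray(s, dtype=float)
    A = np.asarray(A, dtype=float)
    if a0 is None:
        # No autofluorescence control: pure "axis rotation" compensation.
        a0 = np.zeros(s.shape[1])
    # Solve A f = (s - a0) per event; solve() is preferred over an
    # explicit matrix inverse for numerical stability.
    return np.linalg.solve(A, (s - a0).T).T
```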

@JS3xton
Contributor

JS3xton commented Oct 10, 2020

  • Documentation: Are you referring to anything besides the function docstring?

Maybe the compensate module docstring, and any other function docstrings that arise, but yeah that's generally what I had in mind.

I remember you measured a GFP/mCherry transfer function. Do you have this dataset? Does it have autofluorescence and single-fluorophore controls? If so, it would be a great example dataset. The other issue is whether this new dataset replaces the one you just added for the violin plots, or coexists with it. I don't like the latter too much, but I could live with it, I guess. If you found your dual-color example, it could actually illustrate both violins and compensation very effectively.

Ooo boy, I'll have to dig and see if I can find that. I don't remember which controls I ran either, so I'll get back to you.

If that dataset looks good, I agree it could very well replace the current example dataset we just added in the plot.violin() PR. I would want all the same functionality to be demonstrated with it, though (e.g., we'd have to calculate a new mathematical model). That dataset is also technically unpublished, so we may want to get approval to use it.

I think compensate.compensate() should be in the transform module (it conforms to the transform module function signatures), e.g. transform.to_compensated().

Eh. While I think you're technically right, there is precedent for moving transform-like functionality outside of this module to showcase a specific feature (i.e. the mef module). More generally, I've become skeptical about the need to have a dedicated module for "transformations". I think this came out of our old view that it was worth distinguishing between "channel" units and "a.u.", and therefore having a module that transformed between these two. But having worked with a lot of flow cytometry data, including data from more modern instruments which are stored directly in a.u., I started seeing channel units as an intermediate step that should not be used for anything. If present-day me had to remake FlowCal from scratch, I'd probably have FCSData objects be directly converted to a.u. upon loading, and eliminate the transform module, since we never used it for anything other than the to_rfi() function. That way FCSData objects from old and new instruments will be automatically in a.u., improving consistency. tldr: I prefer to keep compensation in its own module.

To conform to the mef precedent, compensate should have compensate.get_transform_fxn(). It may still be worth keeping compensate.get_compensation_params() if easily retrieving the autofluorescence vector and the bleedthrough matrix is useful.

compensate.get_transform_fxn() is not a bad idea. But note that the use case is different compared to calibration: People expect to be able to use compensation controls from a previous day, the same way you use autofluorescence from a different day. This means that you should have a way to actually obtain the compensation coefficients from a control experiment, and a way to use them later (or I guess you could load the raw data of your control experiment every time). In contrast, you're always expected to have a beads file with your current experiment that you can load and use with mef.get_transform_fxn(). I'll think about how to organize this.

The compensate and mef modules do the same thing: They calculate transforms from data. I think they should conform to a common interface.

I've been reflecting on the transform module. Some thoughts:

  • It's always hard for me to remember how the MEF transformation traces its way through the mef and transform modules. I would be in favor of simplifying its derivation and exposure to the user.
  • I think I originally thought there were going to be a lot more transformations we would want to support (e.g., log, logicle, etc.). In practice, those have largely manifested themselves in the plot module.
  • The transform module is still useful for FCSData bookkeeping (e.g., making a copy of the FCSData object and updating FCSData.range()). I could envision this functionality being absorbed into FCSData, though (e.g., via a FCSData.transform() function).
  • The transform module is also still useful for applying transforms to non-FCSData data (e.g., a numpy array) in a standardized way. I don't know how many users use non-FCSData data, though. Moreover, if transform.to_mef() and transform.to_compensated() were moved back to their respective modules, those functions could still be written to support non-FCSData arrays.
  • I agree transform.to_rfi() might make more sense as an internal processing step of FCSData and doesn't really need to be exposed to the user as overtly as it currently is.
  • The mef module currently kind of bends over backwards to interface with the transform module (e.g., by producing a transformation function via mef.get_transform_fxn()). If that driving rationale is removed, the primary mef module interface point could possibly be simplified. I'm not sure what form it (and compensate) should take to simplify them, though. (Do we still return transformation functions? Do the modules provide functions that operate directly on data, like gate module functions?)

Upon reflection, I currently favor removing transform, updating FCSData, and simplifying the common mef and compensate module interfaces.

What if I had replicates of the controls used by compensate.get_compensation_params() to calculate the autofluorescence vector or the bleedthrough matrix? Do we just not support that?

Yeah that's a good question. a0 is the result of applying statistic_fxn() to the autofluorescence sample. If we had many, do we make a0 the mean of these? or the statistic_fxn() of these? or do we concatenate events and take the statistic_fxn() of the result? You can see how this can turn into one of those endless stats debates. I'm fine supporting only one replicate.

Yeah, I was thinking along the exact same lines. I also agree supporting one replicate should be fine. The user could concatenate data beforehand if they really wanted to. And anything more complicated than that (e.g., taking the mean of means) could be left to the user to figure out on their own.
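The concatenate-beforehand option is trivial for the user, assuming plain numpy event arrays (made-up numbers for illustration):

```python
import numpy as np

# Sketch: pool replicate autofluorescence controls into one event array,
# then compute a single per-channel statistic over all pooled events.
rep1 = np.array([[10., 5.], [12., 6.]])   # events x channels, replicate 1
rep2 = np.array([[11., 4.], [9., 7.]])    # events x channels, replicate 2

pooled = np.concatenate([rep1, rep2], axis=0)
a0 = np.mean(pooled, axis=0)              # one statistic over all events
```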

JS3xton mentioned this pull request Oct 12, 2020
@castillohair
Collaborator Author

Upon reflection, I currently favor removing transform, updating FCSData, and simplifying the common mef and compensate module interfaces.

It seems that we're on the same page regarding transform. HOWEVER, it was not my intention to suggest that we should do this now. This is way outside the scope of this PR, and will require many changes across the whole package that I do not have the time to work on now. My point was that we should not go in the direction of adding more stuff to transform while working on this PR. I can modify the module to use a get_transform_fxn function, although I prefer the module as it is now even if it's not consistent with mef.

I have looked at the example files you sent me. Unfortunately, your FL1 intensities were never high enough to significantly bleed into FL3. The following is the autofluorescence control:

[Image: NFC]

And the following is the sfGFP control:

[Image: SFC1]

FL3 fluorescence seems to be at around the same place. Therefore, while we can make a technically valid demonstration, it would not illustrate the point of the module very well.

My B. subtilis dataset is only slightly better. This is a sample with low fluorescence in both FL1/FL3:

[Image: S0001]

This is one with high FL1 bleeding slightly into FL3:

[Image: S0024]

I talked to Karl and it seems he might not have the data we need with the appropriate controls. Therefore, the best example dataset so far seems to be my B. subtilis dataset. If you can think of any better data source let me know.

@JS3xton
Contributor

JS3xton commented Oct 13, 2020

It seems that we're on the same page regarding transform. HOWEVER, it was not my intention to suggest that we should do this now. This is way outside the scope of this PR, and will require many changes across the whole package that I do not have the time to work on now.

I agree this would be a major change, and I'm in favor of filing a new issue for it to be completed when we have time. I'm also willing to take a crack at it now, but not if you want the first attempt or if you won't be able to consider it quickly.

My point was that we should not go in the direction of adding more stuff to transform while working on this PR. I can modify the module to use a get_transform_fxn function, although I prefer the module as it is now even if it's not consistent with mef.

I feel strongly that compensate should be consistent with mef. I would rather remove transform now and make them both nice and simple, but I understand if you don't have time for that right now.

That being the case, compensate.compensate() should be moved to transform.to_compensated(), and compensate.get_transform_fxn() should prepackage transform.to_compensated() with appropriate A and a0 values calculated from no-fluorophore and single-fluorophore control data. The partial function returned by compensate.get_transform_fxn() could then be used on multicolor data.
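The "prepackaged transform" pattern described above could look roughly like this with functools.partial. All names here mirror the discussion and are placeholders, not the final FlowCal API:

```python
import functools
import numpy as np

# Hypothetical sketch: a standalone transform function, plus a factory
# that binds the A and a0 estimated from control data so the returned
# function can be applied directly to multicolor samples.

def to_compensated(data, A, a0):
    """Apply f = A^-1 (data - a0) to an events-by-channels array."""
    A = np.asarray(A, dtype=float)
    data = np.asarray(data, dtype=float)
    return np.linalg.solve(A, (data - a0).T).T

def get_transform_fxn(A, a0):
    # In the real module, A and a0 would be estimated from the
    # no-fluorophore and single-fluorophore controls.
    return functools.partial(to_compensated, A=A, a0=a0)
```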

Unfortunately, your FL1 intensities were never high enough to significantly bleed into FL3.

Yeah, I'm not surprised to hear that. I didn't compensate that data set, and I don't recall having any problems with it.

Could bleed-through and compensation be better demonstrated with FL2? If not, then I agree the B. subtilis data set is probably best.

@castillohair
Collaborator Author

castillohair commented Oct 14, 2020

I think the best course of action for now would be to make compensate conform with mef (I can take care of this), and leave the transform overhaul/elimination for another time.

Could bleed-through and compensation be better demonstrated with FL2?

That's a good idea. Do you remember if we ever agreed on what calibrated units were appropriate for FL2?

@JS3xton
Contributor

JS3xton commented Oct 15, 2020

I think the best course of action for now would be to make compensate conform with mef (I can take care of this), and leave the transform overhaul/elimination for another time.

Sounds good.

That's a good idea. Do you remember if we ever agreed on what calibrated units were appropriate for FL2?

I don't think we ever did. Investigating that:

(I have two sources for the MEF Channel Configurations, and they don't always agree—specifically for MEPE and MEAP.)

Looks like MEFL aligns well with (and thus appropriately calibrates) FL1, as does MECY with FL3 (as previously determined).

If I had to guess the correct MEPE configuration, I would guess MEPE1, which looks like it would calibrate FL2 well. MEPTR also looks like it could be used, though.

@castillohair
Collaborator Author

castillohair commented Oct 20, 2020

Sorry, I'm not sure I understand the last graph. What's the difference between MEPE1 and MEPE2? I agree that MEPE1 sounds like the best MEF to calibrate against.

There seems to be a small but noticeable difference in FL2 signal when sfGFP is present in your example. This is all based on made up FL2 MEF calibration values, so this may change a little when we figure out what values to actually use. Still, it may not be enough.

[Image]

I can try to run my B. subtilis example with FL2 later today and see how it looks.

@JS3xton
Contributor

JS3xton commented Oct 20, 2020

Sorry, I'm not sure I understand the last graph. What's the difference between MEPE1 and MEPE2? I agree that MEPE1 sounds like the best MEF to calibrate against.

That plot, as best I can determine, shows the configuration of the cytometer used by Spherotech to cross-calibrate rainbow calibration beads to beads surface-labeled with a known amount of corresponding fluorophore (e.g., FITC for MEFL, PE for MEPE, etc.).

However, I found two different emails listing two different emission filter sets for their PE and APC channels, hence MEPE1 and MEPE2, etc. Our interlab colleagues have embraced MEPE1, though.

There seems to be a small but noticeable difference in FL2 signal when sfGFP is present in your example.

It's hard for me to tell because they're not plotted on the same scale. It looks like the FL2 peak shifts from ~60 MEF to ~95 MEF. I don't know if that's enough for a compelling compensation demonstration, though; I'd have to see the final version.

@castillohair
Collaborator Author

OK, I'll use MEPE. Do you happen to have the beads datasheet or the MEPE values corresponding to your example?

@JS3xton
Contributor

JS3xton commented Oct 20, 2020

Do you happen to have the beads datasheet or the MEPE values corresponding to your example?

I just emailed it to you.

@castillohair
Collaborator Author

I reorganized the module and moved the functions to more closely resemble what we did with mef. @JS3xton see what you think.

Note that this method can be used in two ways. The "full" version uses a non-fluorescent control and subtracts autofluorescence from the samples. This will result in non-fluorescent samples with histograms centered around zero, even with old flow cytometers that don't normally record events with negative fluorescence values. In newer cytometers, I assume it would center non-fluorescent histograms at zero if/when they're not already there. For a detailed justification of why I think this is the correct way to compensate, see here. On the other hand, one can pass None as the non-fluorescent control, in which case autofluorescence will be set to zero and the compensation procedure will be equivalent to an axis rotation. This is what most people do when they talk about compensation.

Now for the examples: This is the data from your dual reporter gate, before and after color compensation but without compensating for autofluorescence:

[Image: violins_no_nfc]

And this is with autofluorescence compensation (the mean of the autofluorescence control was subtracted from the data in the two top panels, but no further color compensation was performed):

[Image: violins_with_nfc]

In both cases, FL2 output is noticeably lower after compensating. I think the second plot is better at showing this difference, though.

As for the B. subtilis data, an apparent increase in FL2 fluorescence is eliminated after compensation. This is what that looks like without autofluorescence compensation.

[Image: violins_with_fl2]

Furthermore, samples at t=330 and 450 min have a high FL2 fluorescence subpopulation. This is actually what I would expect since FL3, which doesn't suffer from much bleedthrough in this example, has the same long tails:

[Image: violins_with_fl3]

I would like to hear opinions on which dataset to include as an example. I would prefer if only one example is included with FlowCal, so if we choose the B. subtilis example we should eliminate the previous gate dataset.

@JS3xton
Contributor

JS3xton commented Oct 28, 2020

I reorganized the module and moved the functions to more closely resemble what we did with mef. @JS3xton see what you think.

Awesome, yeah it looks much better. Some additional thoughts:

  • We should add unit tests, ideally using a simple synthetic data set.
  • compensate.get_transform_fxn() should be able to relax this constraint: "The number of channels should match the number of single-fluorophore control samples." I think we just need as many channels as fluorophores (including more channels than fluorophores). More channels than fluorophores may be an overdetermined system, and they often don't have exact solutions, but a least-squares method can be used to find a good solution (numpy recommends linalg.lstsq()). It would be nice if FlowCal automatically handled calculating an exact solution for a square matrix or a least-squares solution for an overdetermined system.
  • Pursuant with the previous point, I don't think compensate.get_transform_fxn() should prepopulate transform.to_compensated() with channels (mef.get_transform_fxn() does not do this). Instead, the channels asked for by the user should be extracted from A (throwing an error if they're not present), and then either the exact solution or a least-squares solution should be calculated and used to compensate the specified channels.
  • Calculation of the summary statistics in compensate.get_transform_fxn() looks precarious. I tried to reproduce those calculations manually with np.mean() and got a single scalar where FlowCal.stats.mean() returned a vector with one scalar per channel, so I think compensate.get_transform_fxn() probably doesn't behave correctly with numpy stats functions.
  • For code readability, I would prefer a little more discipline with the matrix algebra data structures. E.g., a0 really ought to be a column vector (internally) all the way through (coerced to such if necessary). When done properly, I think you can use much more natural matrix algebra expressions instead of the less clear list comprehensions and all the transposes (.T). (Same for arriving at A.)
  • On the Notes section in the docstring:
    • Explaining the math is a good idea, but at present it's long and a bit of an impenetrable alphabet soup. Can we just explain it with matrix notation? We might also explicitly mention that A is the "spillover matrix" and A^-1 is the "compensation matrix", as they appear to be called elsewhere. (1) (2)
    • I struggled to determine the units on A (or f for that matter); the best I came up with for A was: channel i fluorescence per compensated or "unmixed" fluorescence (i.e., fluorescence of the primary channel measuring only its designated fluorophore). The units of A^-1 therefore become the inverse: compensated fluorescence per channel i fluorescence, and the units of f become compensated fluorescence (which I guess makes sense). Ideally, we would state this up front (not at the end of the section). It was confusing for me, though, so I'm not sure it helps to discuss it.
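The square-vs-overdetermined logic from the second bullet could be sketched as follows. This is a hypothetical helper for discussion, not FlowCal code:

```python
import numpy as np

# Sketch: exact solve when the bleedthrough matrix is square (channels ==
# fluorophores); least-squares via np.linalg.lstsq() when there are more
# channels than fluorophores and the system is overdetermined.

def unmix(s, A):
    """Recover per-fluorophore signals f from measured channels s = A f."""
    A = np.asarray(A, dtype=float)
    s = np.asarray(s, dtype=float)
    if A.shape[0] == A.shape[1]:
        return np.linalg.solve(A, s)       # exact solution, square system
    f, *_ = np.linalg.lstsq(A, s, rcond=None)  # least-squares solution
    return f
```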

On the other hand, one can pass None as a non-fluorescent control, in which case autofluorescence will be set to zero and the compensation procedure will be equivalent to an axis rotation. This is what most people do when they talk about compensation.

Do you think nfc_sample should default to None (i.e., in the compensate.get_transform_fxn() function signature)? Or do you want the user to have to explicitly specify that. I don't really feel strongly either way, but it was an alternative that occurred to me.

Now for the examples.

Interesting. I would try to simplify them. E.g., use fewer data points such that the violins don't overlap (but keep them roughly evenly spaced). And try to achieve a square plot aspect ratio.

You might also show the compensated and uncompensated violins on the same plot. E.g., uncompensated in the background in gray and compensated in color in the foreground. That might make the compensation effect more stark. Right now, it's hard to compare them (especially when the scale is not the same between the two plots).

Showing the green violins almost feels unnecessary and confusing. If they explain a compensation effect that is greater at low aTc (when green is high) and lower at high aTc (when green is low), then they may be helpful, though.

I don't have strong feelings on the use of autofluorescence compensation; I'm generally in favor of whatever is simpler and more widely used (which sounded like not using it?).

If I gave you a Min and Max for the transfer function data, then I think it could be used to replace the current plot.violin() examples (I wouldn't overcomplicate the plot.violin() tutorial and example with the green violins, though). And if the simplified B. subtilis data also looks good, I could envision perhaps including it in the tutorial but leaving it out of the /examples folder? I don't have strong feelings on that, though.

@JS3xton
Contributor

JS3xton commented Oct 28, 2020

Also, regarding the examples:

Compensation is traditionally shown with a 2D plot. I'm all for pushing the new violin plots, but we might also want to pick the most egregious sample and show a more traditional 2D plot of skewed data before and unskewed data after compensation.

@castillohair
Collaborator Author

I reorganized the module and moved the functions to more closely resemble what we did with mef. @JS3xton see what you think.

Awesome, yeah it looks much better. Some additional thoughts:

  • We should add unit tests, ideally using a simple synthetic data set.

Yeah I was thinking the same. I'll add this next.

  • compensate.get_transform_fxn() should be able to relax this constraint: "The number of channels should match the number of single-fluorophore control samples." I think we just need as many channels as fluorophores (including more channels than fluorophores). More channels than fluorophores may be an overdetermined system, and they often don't have exact solutions, but a least-squares method can be used to find a good solution (numpy recommends linalg.lstsq()). It would be nice if FlowCal automatically handled calculating an exact solution for a square matrix or a least-squares solution for an overdetermined system.

This sounds really confusing to me. From a user's perspective, why would I want to specify more channels than those for which I have controls? And even if I wanted to do this, how do I specify which channels each control corresponds to? Will we use another channels argument? Will we allow None in the list of controls? This would also be inconsistent with MEF calibration, where we don't allow channels for which we don't have MEF values. And considering the additional burden of handling over/underdetermined equation systems, I think this is a lot of trouble for making the API flexible enough for an edge case that we should actually be discouraging (i.e. not having a full list of controls).

  • Pursuant with the previous point, I don't think compensate.get_transform_fxn() should prepopulate transform.to_compensated() with channels (mef.get_transform_fxn() does not do this). Instead, the channels asked for by the user should be extracted from A (throwing an error if they're not present), and then either the exact solution or a least-squares solution should be calculated and used to compensate the specified channels.

I would argue that this is different from mef because compensation is performed in several channels simultaneously, whereas calibration proceeds in each channel semi-independently. Thus I think we should discourage not compensating in all channels at the same time. That said, I will try to implement this, but if it turns out it's too much trouble I'll probably argue for dropping it.

  • Calculation of the summary statistics in compensate.get_transform_fxn() looks precarious. I tried to reproduce those calculations manually with np.mean() and got a single scalar where FlowCal.stats.mean() returned a vector with one scalar per channel, so I think compensate.get_transform_fxn() probably doesn't behave correctly with numpy stats functions.

Thanks for pointing that out. I think issues with handling other statistics functions or numpy arrays will come out and be solved when I make the unit tests.
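For reference, the axis pitfall behind this discrepancy is easy to reproduce with plain numpy:

```python
import numpy as np

# np.mean over a 2D events-by-channels array returns a single scalar
# unless axis=0 is given, whereas a per-channel statistic should return
# one value per channel.
events = np.array([[1., 10.], [3., 30.]])   # 2 events x 2 channels

print(np.mean(events))           # scalar over all values: 11.0
print(np.mean(events, axis=0))   # per-channel vector: [ 2. 20.]
```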

  • For code readability, I would prefer a little more discipline with the matrix algebra data structures. E.g., a0 really ought to be a column vector (internally) all the way through (coerced to such if necessary). When done properly, I think you can use much more natural matrix algebra expressions instead of the less clear list comprehensions and all the transposes (.T). (Same for arriving at A.)

It seems clear enough to me. List comprehensions are, to me, way more readable than broadcasting, which seems to be what you want. And the statements can be further clarified with comments if necessary. I don't know what other "discipline" issues you have in mind, but I think our lives are gonna be easier if you propose code snippets to replace what's currently there. I can incorporate them after the unit tests, to be sure that the results are identical to what we currently have.
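As a neutral illustration of the two styles under discussion, here is the same bleedthrough matrix assembled both ways (the stats values are made up for the example):

```python
import numpy as np

# Per-channel statistic vectors, one per single-fluorophore control.
stats = [np.array([100., 6.]),   # control for fluorophore 1
         np.array([4., 80.])]    # control for fluorophore 2

# List-comprehension style, with a transpose at the end:
A_lc = np.array([list(s) for s in stats]).T

# Column-stacking style (each control's statistics become one column of A):
A_cs = np.column_stack(stats)

assert np.array_equal(A_lc, A_cs)
```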

  • On the Notes section in the docstring:

    • Explaining the math is a good idea, but at present it's long and a bit of an impenetrable alphabet soup. Can we just explain it with matrix notation? We might also explicitly mention that A is the "spillover matrix" and A^-1 is the "compensation matrix", as they appear to be called elsewhere. (1) (2)
    • I struggled to determine the units on A (or f for that matter); the best I came up with for A was: channel i fluorescence per compensated or "unmixed" fluorescence (i.e., fluorescence of the primary channel measuring only its designated fluorophore). The units of A^-1 therefore become the inverse: compensated fluorescence per channel i fluorescence, and the units of f become compensated fluorescence (which I guess makes sense). Ideally, we would state this up front (not at the end of the section). It was confusing for me, though, so I'm not sure it helps to discuss it.

One issue is that the matrix terms in the diagonal are treated differently than the others, which makes a fully matrix-based treatment challenging. I'll see what I can do. I think mentioning units is a good idea though.
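As a starting point for the matrix-notation treatment, here is a toy numerical restatement of the model (made-up numbers, not FlowCal code):

```python
import numpy as np

# Per event, the observed channel vector s is modeled as
#     s = A @ f + a0
# where A is the spillover matrix (diagonal normalized to 1, so f stays in
# the units of each fluorophore's "primary" channel) and a0 is the
# autofluorescence vector. Compensation inverts the model.
A = np.array([[1.0, 0.2],
              [0.1, 1.0]])
a0 = np.array([5.0, 8.0])
s = np.array([115.0, 63.0])   # one made-up observed event

# f = A^-1 (s - a0): the "compensation matrix" A^-1 applied to the
# background-subtracted signal.
f = np.linalg.solve(A, s - a0)

# The round trip recovers the observation exactly.
assert np.allclose(A @ f + a0, s)
```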

On the other hand, one can pass None as a non-fluorescent control, in which case autofluorescence will be set to zero and the compensation procedure will be equivalent to an axis rotation. This is what most people do when they talk about compensation.

Do you think nfc_sample should default to None (i.e., in the compensate.get_transform_fxn() function signature)? Or do you want the user to have to explicitly specify that? I don't really feel strongly either way, but it was an alternative that occurred to me.

Yeah I'm not sure either. I'll leave it as it is for now.

Now for the examples.

Interesting. I would try to simplify them. E.g., use fewer data points such that the violins don't overlap (but keep them roughly evenly spaced). And try to achieve a square plot aspect ratio.

You might also show the compensated and uncompensated violins on the same plot. E.g., uncompensated in the background in gray and compensated in color in the foreground. That might make the compensation effect more stark. Right now, it's hard to compare them (especially when the scale is not the same between the two plots).

These are all good ideas. The plots I attached were not meant to be the final plots for the tutorials. But I'll come back to these suggestions when I get there.

Showing the green violins almost feels unnecessary and confusing. If they explain a compensation effect that is greater at low aTc (when green is high) and lower at high aTc (when green is low), then they may be helpful, though.

Yeah I think the point is to show that the biggest correction occurs when FL1 signal is higher. Again, I'll have an opinion on whether to keep it or not when I actually get to the tutorials.

I don't have strong feelings on the use of autofluorescence compensation; I'm generally in favor of whatever is simpler and more widely used (which sounded like not using it?).

If I gave you a Min and Max for the transfer function data, then I think it could be used to replace the current plot.violin() examples (I wouldn't overcomplicate the plot.violin() tutorial and example with the green violins, though). And if the simplified B. subtilis data also looks good, I could envision perhaps including it in the tutorial but leaving it out of the /examples folder? I don't have strong feelings on that, though.

Having the min/max samples may make the compensation effect more obvious without doing autofluorescence compensation. On the other hand, I would prefer if the example showed the most complete version of compensation, which includes autofluorescence compensation. I'll have more to say when I get there, but please do send me the min/max samples.

The B. subtilis data has the drawback that the biology is hard to explain, and the results don't match similar published experiments for various very technical reasons. I eventually ran better experiments, but these are not good compensation examples. This is why I would prefer to go with your dataset.

@JS3xton
Contributor

JS3xton commented Oct 29, 2020

why would I want to specify more channels than those for which I have controls?

Modern cytometers often have many channels, whereas a user may only require 2 or 3 differently colored fluorophores (with 2 or 3 appropriate controls). If additional channels could be specified trivially and result in "better" compensation, that seems worth supporting to me.

(I don't know how much better using an overdetermined system is; it's hard to know without a good example. Intuitively, though, I thought using more information from more channels seemed like it would do a better job. If implemented well, it would also not preclude users from explicitly specifying one channel per fluorophore as you propose; it would be a pure extension of functionality.)

The compensate.get_transform_fxn() function might even include all channels in A by default; the user wouldn't have to think about channels until applying the compensation with the partial transform.to_compensated() function.

how do I specify which channels each control corresponds to?

Normalize the columns of A to the column max instead of arbitrarily picking a corresponding channel for a fluorophore. The max channel then becomes the de facto corresponding channel.

(There may be some weird edge cases where one channel becomes the corresponding channel for multiple fluorophores, i.e., the a=1 coefficient falls multiple times in the same row. I haven't done the math to determine if the resulting compensation fails, nor do I understand the relationship between the channels and the fluorophores in that situation.)

I think this normalization would need to be done after extracting the requested channels from A and a0, though (and after raising any errors if too few channels are extracted and A is rendered underdetermined), so determining the final A and a0 values would be moved to the partial transform.to_compensated() function, and compensate.get_transform_fxn() would basically just package the control data into a larger "pre-A" matrix. (The user should not be concerned with these logistics; they should simply specify channels to the partial transform.to_compensated() function and it should just work.)
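A minimal sketch of the column-max normalization (hypothetical data, borrowing the example spillover values shown later in this thread):

```python
import numpy as np
import pandas as pd

# Made-up "pre-A" matrix: rows are channels, columns are fluorophores.
pre_A = pd.DataFrame([[42.00, 17.25],
                      [21.00, 51.75],
                      [ 0.42, 69.00]],
                     index=['FL1', 'FL2', 'FL3'],
                     columns=['fluor1', 'fluor2'])

# Normalize each column to its max. The row that ends up holding the 1.0
# becomes the de facto corresponding channel for that fluorophore.
A = pre_A / pre_A.max(axis=0)
corresponding_channels = A.idxmax(axis=0)
# Here: fluor1 -> FL1, fluor2 -> FL3.
```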

the additional burden of handling over/underdetermined equation systems

I thought this was basically just gonna require swapping linalg.solve() with linalg.lstsq(). And maybe some extra channel checking to build A and a0 correctly.
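A quick sketch of that swap with synthetic numbers (not the PR's code), showing lstsq() recovering fluorophore signals from an overdetermined 4-channel x 2-fluorophore system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up tall spillover matrix: 4 channels, 2 fluorophores.
A = np.array([[1.00, 0.25],
              [0.50, 0.75],
              [0.01, 1.00],
              [0.30, 0.40]])
f_true = np.array([100.0, 50.0])
s = A @ f_true + rng.normal(0.0, 0.5, size=4)   # noisy observed channels

# linalg.solve() would fail on a non-square A; lstsq() minimizes
# ||A @ f - s||^2 instead.
f_hat, residuals, rank, sv = np.linalg.lstsq(A, s, rcond=None)
# f_hat should land close to f_true despite the per-channel noise.
```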

I think our lives are gonna be easier if you propose code snippets to replace what's currently there.

Yeah, fair enough. I'll wait for other things to resolve and then see if I think anything is worth clarifying.

The plots I attached were not meant to be the final plots for the tutorials.

Ahh, OK sorry, I was getting a little ahead of myself, haha. I look forward to seeing the final plots.

Having the min/max samples may make the compensation effect more obvious without doing autofluorescence compensation. On the other hand, I would prefer if the example showed the most complete version of compensation, which includes autofluorescence compensation. I'll have more to say when I get there, but please do send me the min/max samples.

I'm not sure I completely understand you here. I don't think Min and Max should be included in the compensation demonstration because I think they would just make it more confusing. I do think they should be included in the separate plot.violin() tutorial and examples, though, to demonstrate some of the violin plot features.

@JS3xton
Contributor

JS3xton commented Oct 29, 2020

but please do send me the min/max samples.

Oh, and the Min and Max data should already be included for the green violins (which are the gate output).

@castillohair
Collaborator Author

castillohair commented Oct 29, 2020

Oh, I think I understand your idea of using more channels than fluorophores now. If that could be implemented that would be cool.

Besides the channel selection issue, there's also the issue of what data structure to return. For example, if you're using a flow cytometer with fluorescence channels FL1, FL2, and FL3, you want to use all three for compensation, and you're trying to reconstruct signals from fluorophores GFP and mCherry, what would the output data structure look like? If the algorithm determines that FL1 and FL3 are "better" for GFP and mCherry respectively, should it discard FL2? Should it leave it unmodified? How does the transformation function tell the user which channels contain the GFP and mCherry signals? Even if the math ends up working out, there's a lot to think about from the API level.

(btw the overdetermined system can be solved using a pseudoinverse which directly solves the least-squares problem).
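For reference, a small sketch (made-up matrix and event, not FlowCal data) showing that the pseudoinverse and lstsq() give the same least-squares solution:

```python
import numpy as np

# Made-up tall (overdetermined) spillover matrix and observed channel vector.
A = np.array([[1.0, 0.2],
              [0.1, 1.0],
              [0.4, 0.6]])
s = np.array([104.0, 60.0, 70.0])

# The Moore-Penrose pseudoinverse gives the least-squares solution directly...
f_pinv = np.linalg.pinv(A) @ s

# ...and matches numpy's dedicated least-squares solver.
f_lstsq, *_ = np.linalg.lstsq(A, s, rcond=None)

assert np.allclose(f_pinv, f_lstsq)
```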

@JS3xton
Contributor

JS3xton commented Nov 13, 2020

This ended up being quite a chore. I had to make a small synthetic data set and basically implement my proposed API (two things I was trying to avoid doing unnecessarily), but I think it's been illuminating. The math underlying my API was a little more complicated than I thought, but it's actually quite doable. The core of it is the following three lines of pseudocode:

# Solve for independent fluorophore signals. `spillover` does NOT need to be
# normalized. Use of `lstsq()` eliminates the need to extract channels from
# `spillover` and `data` to get square matrix. Should also be more robust to
# noise in real data.
f_hat, *_ = np.linalg.lstsq(spillover, data.T, rcond=None)  # lstsq returns (solution, residuals, rank, sv)

# Reconstruct requested channel-fluorophore signals by selectively applying
# `spillover` back to `f_hat`.
for channel,fluorophore in zip(*channel_fluorophore_pairs):
    data[:,fluorophore@channel] = spillover.loc[channel,fluorophore] * f_hat[fluorophore,:]

My proposed API can reconstruct all channel-fluorophore signals from a single spillover matrix by selectively reapplying the spillover matrix to f_hat (pseudocode line 3). In contrast, the current method manipulates the spillover matrix (reordering columns and normalizing them to the diagonal) to make f_hat the signals of "primary" channel-fluorophore pairs. It does not reconstruct all pairs, though; to get other pairs, you have to make a new spillover matrix (by calling compensate.get_transform_fxn() again). This can get quite tedious, for example in Example 2 below.

Example 1: Unmixing the contributions of two fluorophores to one channel

One example I would really like to (eventually) support in a clean way is simply unmixing the contributions of two fluorophores to a single channel (here, GFP and mCherry to FL2). With the current constraints on the output object—two new channels cannot replace one old channel—we cannot do this in one command. However, we should be able to do it in two passes:

code

Using the current implementation:

compensate_transform_fxn1 = compensate.get_transform_fxn(nfc_sample=None,
                                                         sfc_samples=[gfp_data,mcher_data],
                                                         channels=['FL2','FL3'])
fl2_gfp = compensate_transform_fxn1(data=multicolor_data)[:,'FL2']

compensate_transform_fxn2 = compensate.get_transform_fxn(nfc_sample=None,
                                                         sfc_samples=[mcher_data,gfp_data],
                                                         channels=['FL2','FL1'])
fl2_mcher = compensate_transform_fxn2(data=multicolor_data)[:,'FL2']

Using my proposed API:

compensate_transform_fxn = compensate.get_transform_fxn(nfc_sample=None,
                                                        sfc_samples={'gfp'  :gfp_data,
                                                                     'mcher':mcher_data})
fl2_gfp   = compensate_transform_fxn(data=multicolor_data,
                                     channel_fluorophore_pairs=[('FL2','gfp'),
                                                                ('FL3','mcher')])[:,'FL2']
fl2_mcher = compensate_transform_fxn(data=multicolor_data,
                                     channel_fluorophore_pairs=[('FL2','mcher'),
                                                                ('FL1','gfp')])[:,'FL2']

(Eventually:)

compensate_transform_fxn = compensate.get_transform_fxn(nfc_sample=None,
                                                        sfc_samples={'gfp'  :gfp_data,
                                                                     'mcher':mcher_data})
multicolor_data = compensate_transform_fxn(data=multicolor_data,
                                           channel_fluorophore_pairs=[('FL2','gfp'),
                                                                      ('FL2','mcher')],
                                           compensating_channels=['FL1','FL2','FL3'])
fl2_gfp   = multicolor_data[:,'gfp@FL2']
fl2_mcher = multicolor_data[:,'mcher@FL2']

(Note: The last example would require m-to-n compensation given the channels I chose to unmix with; the prior behavior could be reproduced using a second call to compensate_transform_fxn(). You will always need to specify at least one additional compensating channel, though.)


Example 2: Compensating multiple channels to a single fluorophore with crosstalk

Compensate FL1, FL2, and FL3 to fluor1 and FL3 and FL4 to fluor2.

code

Using the current implementation:

# assume FL1 is the primary channel for fluor1 and FL4 for fluor2
compensate_transform_fxn1 = compensate.get_transform_fxn(nfc_sample=None,
                                                         sfc_samples=[fluor1_data,fluor2_data],
                                                         channels=['FL1','FL4'])
fluor1_FL1 = compensate_transform_fxn1(data=multicolor_data)[:,'FL1']
fluor2_FL4 = compensate_transform_fxn1(data=multicolor_data)[:,'FL4']

compensate_transform_fxn2 = compensate.get_transform_fxn(nfc_sample=None,
                                                         sfc_samples=[fluor1_data,fluor2_data],
                                                         channels=['FL2','FL4'])
fluor1_FL2 = compensate_transform_fxn2(data=multicolor_data)[:,'FL2']

compensate_transform_fxn3 = compensate.get_transform_fxn(nfc_sample=None,
                                                         sfc_samples=[fluor1_data,fluor2_data],
                                                         channels=['FL3','FL4'])
fluor1_FL3 = compensate_transform_fxn3(data=multicolor_data)[:,'FL3']

compensate_transform_fxn4 = compensate.get_transform_fxn(nfc_sample=None,
                                                         sfc_samples=[fluor1_data,fluor2_data],
                                                         channels=['FL1','FL3'])
fluor2_FL3 = compensate_transform_fxn4(data=multicolor_data)[:,'FL3']

fluor1_data = (fluor1_FL1,fluor1_FL2,fluor1_FL3)
fluor2_data = (fluor2_FL3,fluor2_FL4)

Using my proposed API:

compensate_transform_fxn = compensate.get_transform_fxn(nfc_sample=None,
                                                        sfc_samples={'fluor1':fluor1_data,
                                                                     'fluor2':fluor2_data})
fluor1_data = compensate_transform_fxn(data=multicolor_data,
                                       channel_fluorophore_pairs=[('FL1','fluor1'),
                                                                  ('FL2','fluor1'),
                                                                  ('FL3','fluor1'),
                                                                  ('FL4','fluor2')])[:,['FL1','FL2','FL3']]
fluor2_data = compensate_transform_fxn(data=multicolor_data,
                                       channel_fluorophore_pairs=[('FL1','fluor1'),
                                                                  ('FL2','fluor1'),
                                                                  ('FL3','fluor2'),
                                                                  ('FL4','fluor2')])[:,['FL3','FL4']]

(Note: Use of np.linalg.lstsq() permits a 4-channel x 2-fluorophore spillover matrix, which renders the computation more consistent between function calls and likely more robust.)

(Eventually:)

compensate_transform_fxn = compensate.get_transform_fxn(nfc_sample=None,
                                                        sfc_samples={'fluor1':fluor1_data,
                                                                     'fluor2':fluor2_data})
multicolor_data = compensate_transform_fxn(data=multicolor_data,
                                           channel_fluorophore_pairs=[('FL1','fluor1'),
                                                                      ('FL2','fluor1'),
                                                                      ('FL3','fluor1'),
                                                                      ('FL3','fluor2'),
                                                                      ('FL4','fluor2')])

fluor1_data = multicolor_data[:,['fluor1@FL1','fluor1@FL2','fluor1@FL3']]
fluor2_data = multicolor_data[:,['fluor2@FL3','fluor2@FL4']]

Example 3: Separately unmixing two sets of channels

A bit esoteric, but if you wanted to unmix two separate sets of crosstalking fluorophores, the current implementation again requires that you make two transform functions. (Best justification I can think of for this might be if you had an established compensation protocol for two fluorophores that you wanted to preserve and you added two additional orthogonal fluorophores).

code

Using the current implementation:

compensate_transform_fxn1 = compensate.get_transform_fxn(nfc_sample=None,
                                                         sfc_samples=[fluor1_data,fluor2_data],
                                                         channels=['FL1','FL2'])
multicolor_data = compensate_transform_fxn1(data=multicolor_data)

compensate_transform_fxn2 = compensate.get_transform_fxn(nfc_sample=None,
                                                         sfc_samples=[fluor3_data,fluor4_data],
                                                         channels=['FL3','FL4'])
multicolor_data = compensate_transform_fxn2(data=multicolor_data)

(Note: This approach uses two 2x2 submatrices of A instead of the full 4x4 matrix, which could also be used and would only require one transform function.)


Using my proposed API:

compensate_transform_fxn = compensate.get_transform_fxn(nfc_sample=None,
                                                        sfc_samples={'fluor1':fluor1_data,
                                                                     'fluor2':fluor2_data,
                                                                     'fluor3':fluor3_data,
                                                                     'fluor4':fluor4_data})
multicolor_data = compensate_transform_fxn(data=multicolor_data,
                                           channel_fluorophore_pairs=[('FL1','fluor1'),
                                                                      ('FL2','fluor2')])
multicolor_data = compensate_transform_fxn(data=multicolor_data,
                                           channel_fluorophore_pairs=[('FL3','fluor3'),
                                                                      ('FL4','fluor4')])

Example 4: Swapping the fluorophores for which two channels are compensated

Switching from fluor1@FL1 and fluor2@FL2 to fluor2@FL1 and fluor1@FL2 again requires making another transform function.

code

Using the current implementation:

compensate_transform_fxn1 = compensate.get_transform_fxn(nfc_sample=None,
                                                         sfc_samples=[fluor1_data,fluor2_data],
                                                         channels=['FL1','FL2'])
multicolor_data1 = compensate_transform_fxn1(data=multicolor_data)

compensate_transform_fxn2 = compensate.get_transform_fxn(nfc_sample=None,
                                                         sfc_samples=[fluor2_data,fluor1_data],
                                                         channels=['FL1','FL2'])
multicolor_data2 = compensate_transform_fxn2(data=multicolor_data)

Using my proposed API:

compensate_transform_fxn = compensate.get_transform_fxn(nfc_sample=None,
                                                        sfc_samples={'fluor1':fluor1_data,
                                                                     'fluor2':fluor2_data})
multicolor_data1 = compensate_transform_fxn(data=multicolor_data,
                                            channel_fluorophore_pairs=[('FL1','fluor1'),
                                                                       ('FL2','fluor2')])
multicolor_data2 = compensate_transform_fxn(data=multicolor_data,
                                            channel_fluorophore_pairs=[('FL1','fluor2'),
                                                                       ('FL2','fluor1')])

Example 5: Compensating a subset of channels (possibly outdated)

At least as currently implemented, compensating a subset of the channels originally specified would require making a new transform function.

code

Using the current implementation:

compensate_transform_fxn1 = compensate.get_transform_fxn(nfc_sample=None,
                                                         sfc_samples=[fluor1_data,fluor2_data,fluor3_data],
                                                         channels=['FL1','FL2','FL3'])

# This would require FL2 (and subsequently change it). It would also likely
# yield slightly different compensated FL1 and FL3 signals.
multicolor_data = compensate_transform_fxn1(data=multicolor_data)

# This would outright fail, I assume.
multicolor_data = compensate_transform_fxn1(data=multicolor_data, channels=['FL1','FL3'])

compensate_transform_fxn2 = compensate.get_transform_fxn(nfc_sample=None,
                                                         sfc_samples=[fluor1_data,fluor3_data],
                                                         channels=['FL1','FL3'])
multicolor_data = compensate_transform_fxn2(data=multicolor_data)

Using my proposed API:

compensate_transform_fxn = compensate.get_transform_fxn(nfc_sample=None,
                                                        sfc_samples={'fluor1':fluor1_data,
                                                                     'fluor2':fluor2_data,
                                                                     'fluor3':fluor3_data})
multicolor_data = compensate_transform_fxn(data=multicolor_data,
                                           channel_fluorophore_pairs=[('FL1','fluor1'),
                                                                      ('FL3','fluor3')])

Example 6: Common use case

For completeness, here's how I envision the common use case would look.

code

Using the current implementation:

compensate_transform_fxn = compensate.get_transform_fxn(nfc_sample=None,
                                                        sfc_samples=[fluor1_data,fluor2_data,fluor3_data],
                                                        channels=['FL1','FL2','FL3'])
multicolor_data = compensate_transform_fxn(data=multicolor_data)

Using my proposed API:

compensate_transform_fxn = compensate.get_transform_fxn(nfc_sample=None,
                                                        sfc_samples={'fluor1':fluor1_data,
                                                                     'fluor2':fluor2_data,
                                                                     'fluor3':fluor3_data})
multicolor_data = compensate_transform_fxn(data=multicolor_data,
                                           channel_fluorophore_pairs=[('FL1','fluor1'),
                                                                      ('FL2','fluor2'),
                                                                      ('FL3','fluor3')])

@castillohair
Collaborator Author

castillohair commented Nov 14, 2020

Thanks for the detailed examples. To be honest, though, I remain skeptical about the general idea of using explicit fluorophore labels. I think the crux of the disagreement is in this statement of yours:

In contrast, the current method [...] does not reconstruct all pairs, though; to get other pairs, you have to make a new spillover matrix (by calling compensate.get_transform_fxn() again).

I am not convinced this is a remotely common use case. Trying different channel/fluorophore combinations doesn't seem to be something a user would do frequently, with the exception of an initial exploratory phase in which many calls to compensate.get_transform_fxn() don't seem like a big deal to me. More specifically:

  • Example 1: Here I'm going to assume that you're thinking about the Moake cytometer. Is there any reason why you would want to get gfp and mcherry fluorescence in units of the same channel? Having the same fluorescence signal in the same arbitrary channel doesn't make the fluorophore concentrations biologically comparable.
  • Example 2: Again, why would you want the same signal many times in different units?
  • Example 3: I'm not very clear on what your math is doing (see below), but I don't quite understand how this would be different from
compensate_transform_fxn1 = compensate.get_transform_fxn(nfc_sample=None,
                                                         sfc_samples=[fluor1_data,fluor2_data,fluor3_data,fluor4_data],
                                                         comp_channels=['FL1','FL2','FL3','FL4'])
multicolor_data = compensate_transform_fxn1(data=multicolor_data)

And if they're different, I'd want to know whether one is more "correct".

  • Example 4: Again, I don't think people are gonna be doing this a lot except for an exploratory phase, at most once every time they change fluorophores.

  • Example 5: As far as I understand, we agreed on allowing the following:

compensate_transform_fxn1 = compensate.get_transform_fxn(nfc_sample=None,
                                                         sfc_samples=[fluor1_data,fluor2_data,fluor3_data],
                                                         comp_channels=['FL1','FL2','FL3'])
# The following would do the same as `compensate_transform_fxn1(data=multicolor_data)`,
# but only modify FL1 and FL3 in `multicolor_data`.
multicolor_data = compensate_transform_fxn1(data=multicolor_data, channels=['FL1','FL3'])

Another issue is that, since you didn't show how you're calculating your spillover matrix, I'm not sure what your math is doing exactly. I assumed that, for example, in this case:

compensate_transform_fxn = compensate.get_transform_fxn(nfc_sample=None,
                                                        sfc_samples={'fluor1':fluor1_data,
                                                                     'fluor2':fluor2_data,
                                                                     'fluor3':fluor3_data,
                                                                     'fluor4':fluor4_data})
multicolor_data = compensate_transform_fxn(data=multicolor_data,
                                           channel_fluorophore_pairs=[('FL1','fluor1'),
                                                                      ('FL2','fluor2')])
multicolor_data_fl1_only = compensate_transform_fxn(data=multicolor_data,
                                                    channel_fluorophore_pairs=[('FL1','fluor1')])

the FL1 channel in multicolor_data would contain the signal of fluor1 in units of FL1, reconstructed using all the information in all the fluorescent channels on which the controls contain data. But I could also interpret it as the signal of fluor1 in units of FL1, reconstructed using the information in FL1 and FL2 only. If what you're doing is the latter, does that mean that multicolor_data[:, 'FL1'] is going to be different from multicolor_data_fl1_only[:, 'FL1']? That would be very confusing in my opinion. Why not use all the channel information, all the time instead? Also, how exactly are you calculating the spillover matrix in compensate.get_transform_fxn if you don't specify any fluorescent channels? Would you calculate the matrix on every possible channel? If so, would you have issues with channels such as FSC, SSC, and time, whose means should not depend on the controls?

The use of fluorophore labels has one additional, potentially very confusing issue in flow cytometers with channel names taken from fluorophores. For example, if channels are FITC, PE, and APC, and you're using fluorophores of the same names, the following would be allowed in your version of the API:

multicolor_data = compensate_transform_fxn(data=multicolor_data,
                                           channel_fluorophore_pairs=[('FITC','PE'),
                                                                      ('PE','APC'),
                                                                      ('APC','FITC')])

which is very confusing imo.

To be clear, I don't think your API version is bad. I think what you have in example 6 is perfectly usable (unless you're using a flow cytometer with fluorophore names used for channels, in which case it becomes kind of confusing). But I'm not convinced that the additional complexity of introducing fluorophore labels is justified for n-to-n compensation, given that the simplest use case will be used significantly more than any other. I'm kind of torn, however, because I think with (significantly) more time and discussion some of these ideas could work well with m-to-n compensation. I think I'd like to wait for a reply and think about this a little more.

(EDIT: changed last paragraph)

@JS3xton
Contributor

JS3xton commented Nov 16, 2020

The math underlying compensation describes all of the unmixed signals—signals about which I think I would absolutely be curious. A well-designed API would expose all those signals in an elegant way that does not detract from the common use case.

That being said, I agree examination of the non-"primary" signals would probably not be the most common use case.

To address spillover: It would need to be calculated dynamically in transform.to_compensated(), not compensate.get_transform_fxn() (as I've been trying to say... XD). I was thinking I'd probably use all unique channels and fluorophores specified by the user, like this:

channels, fluorophores = zip(*channel_fluorophore_pairs)
channels     = pd.unique(list(channels))      # preserves first-seen order
fluorophores = pd.unique(list(fluorophores))

spillover = pd.DataFrame([[statistic_fxn(sfc_samples[fluorophore][:, channel])
                           for fluorophore in fluorophores]
                          for channel in channels],
                         index=channels,
                         columns=fluorophores)
# Example result:
# >>> spillover
#      fluor1  fluor2
# FL1   42.00   17.25
# FL2   21.00   51.75
# FL3    0.42   69.00

(The above would have to be modified to incorporate background subtraction. And sfc_samples is not technically available in transform.to_compensated(), but you get the idea. And yes, given this, multicolor_data[:, 'FL1'] would absolutely be different from multicolor_data_fl1_only[:, 'FL1']; I suspect multicolor_data_fl1_only[:, 'FL1'] would be unchanged relative to multicolor_data.)
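To show how the background subtraction might slot in, here is a runnable sketch using stand-in dict-of-arrays controls instead of FCSData objects (all names and numbers are made up, chosen so the result lands near the example spillover above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
statistic_fxn = np.median

channels = ['FL1', 'FL2', 'FL3']
# Stand-ins for control samples: dicts mapping channel -> event array.
nfc_sample = {ch: rng.normal(5.0, 1.0, 1000) for ch in channels}
sfc_samples = {
    'fluor1': {ch: rng.normal(mu, 1.0, 1000)
               for ch, mu in zip(channels, [47.00, 26.00, 5.42])},
    'fluor2': {ch: rng.normal(mu, 1.0, 1000)
               for ch, mu in zip(channels, [22.25, 56.75, 74.00])},
}

# Autofluorescence per channel, from the non-fluorescent control.
a0 = pd.Series({ch: statistic_fxn(nfc_sample[ch]) for ch in channels})

# Background-subtracted spillover: subtract a0 from each SFC statistic.
spillover = pd.DataFrame(
    {fluor: pd.Series({ch: statistic_fxn(sample[ch]) for ch in channels}) - a0
     for fluor, sample in sfc_samples.items()})
```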

My goal, though, is really as stated at the beginning of this post (elegantly exposing all unmixed signals), and if there's a better API to achieve that, I'm amenable (even if we can't fully realize it in this next FlowCal release). Fluorophore labels, for example, are largely just kludgy glue that bridges the two compensation functions (compensate.get_transform_fxn() and transform.to_compensated()); if compensation was done in a single function, I don't think we'd need them. Furthermore, a different output object would allow us to return more signals (necessary for exposing all unmixed signals). I was therefore thinking the following might be a better path forward:

  • For now: Keep the current behavior; it's undoubtedly best to provide the "primary" compensated signals if we can only provide one signal per channel.
  • Later (when we remove transform): Consolidate compensate into one function and modify the output object(s) to return all unmixed signals by default (with no fluorophore labels to differentiate compensated channels in a common output object, maybe we return a separate numpy array, pd.DataFrame or FCSData object for every single-fluorophore control?). This would eliminate the need to specify individual signals with channel_fluorophore_pairs. Indeed, the compensate function signature would look like the current signature for compensate.get_transform_fxn(), and it would more simply and directly construct spillover (channels would specify rows to extract and sfc_samples would specify raw data for the column vectors). (You could even have spillover be an optional additional parameter to short circuit sfc_samples for a user that has their own spillover calculated.)

And we can continue to wait on m-to-n compensation (sorry, my intuition just says that's gonna be the way superior strategy when you're considering more channels than fluorophores, so I keep bringing it back in, but you're wise to wait for evidence).
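For the record, the m-to-n idea is just an overdetermined linear system solved in the least-squares sense; a toy sketch with made-up numbers (3 channels, 2 fluorophores):

```python
import numpy as np

# Toy spillover: 3 channels (rows) x 2 fluorophores (columns).
S = np.array([[1.00, 0.05],
              [0.10, 1.00],
              [0.30, 0.20]])
# One event: true abundances (4, 2) plus a little per-channel noise.
x = S @ np.array([4.0, 2.0]) + np.array([0.1, -0.05, 0.02])
# Least squares uses all three channels to estimate the two abundances.
f_hat, residuals, rank, _ = np.linalg.lstsq(S, x, rcond=None)
```

The extra channel gives the estimator information to average over noise, which is the intuition for why this should win when channels outnumber fluorophores.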

@castillohair
Collaborator Author

Sounds good. I'll move on to unit tests for the current version then.

Feel free to open issues for removing transform and for the next compensation method anytime.

@JS3xton
Contributor

JS3xton commented Nov 17, 2020

Some other details as I think about the FCSData attributes of compensated channels:

For uncalibrated data, I think we would technically need to check that all detector voltages match among sfc_samples and data for each specified channel. I believe the contributions from each fluorophore to a channel need to be in the same units as data; i.e., spillover.loc['FL1','fluorN'] * f_hat[fluorN] needs to have the same units for all N, and they need to match those of data[:,'FL1']. A concrete thought experiment: What if sfc_samples[fluor1] was measured at FL1 detector voltage 500, but sfc_samples[fluor2] was measured at FL1 detector voltage 700? If fluorescence varied linearly with detector voltage, I would actually expect np.linalg.solve() to compensate for detector voltage discrepancies (effectively bundling a constant into the calculated f_hat for any fluorophore measured with a detector voltage not matching that of data), but I don't think we believe fluorescence varies linearly with detector voltage, so I don't expect the resulting compensation to be correct. I don't know how to check for calibration in the general sense, though, so maybe the best thing is just to clearly state this requirement in the docstring? (I.e., uncalibrated data and controls must be measured at common detector voltages.) We could also throw a warning if mismatched detector voltages are detected, but that could get annoying for the perfectly legitimate compensation of calibrated data measured at different detector voltages.
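The "constant bundled into f_hat" intuition above can be checked with toy numbers (illustrative only, not real cytometer data):

```python
import numpy as np

S = np.array([[1.0, 0.3],
              [0.2, 1.0]])        # true spillover, columns = fluorophores
x = S @ np.array([4.0, 2.0])      # event measured with matched voltages
k = 2.5                           # fluor2 control gained by a constant k
S_mismatched = S.copy()
S_mismatched[:, 1] *= k           # spillover column built from that control
f_hat = np.linalg.solve(S_mismatched, x)
# The system is still solved exactly, but fluor2 comes back as 2.0 / k:
# a strictly linear voltage effect is silently absorbed into the units
# of that fluorophore's abundance. A nonlinear voltage response would
# instead produce an incorrect unmixing.
```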

For calibrated data, I don't think detector voltage should matter anymore? And in that case, I guess I would preserve the original detector voltage in data? (This makes me think we should consider using None as the attribute value for derived channels in the future, though.)

Regarding amplification_type and amplifier_gain, if everything has been transformed to RFI, then I again don't think they should matter (or change)? I again don't know how to check for RFI units in general, but maybe that's not worth worrying about (beyond perhaps mentioning in the docstring) since RFI transformation is going to become automatic in the future.

I don't think resolution should change.

I also don't think range should change? It technically would if we used transform.transform(transform_fxn=minus) to subtract out non-"primary" signals, but I don't think compensation gives us additional range on the low end, so that intuitively seems inappropriate. (The range parameter is less useful for floating-point data anyways since floating-point data can occur outside it.)

@JS3xton
Contributor

JS3xton commented Nov 18, 2020

For calibrated data

Thinking more about this use case (compensating calibrated data), I'm left wondering whether compensation and calibration are commutative. I think calibration of multicolor data should comport with calibration of single-color data if the multicolor data is first compensated (thereby calibrating the fluorescence of a fluorophore as if it were measured in isolation), but I don't know what happens if you compensate after calibration. Making the tutorial and examples should clarify this. If compensation and calibration are not commutative, though, we should probably address that explicitly in the compensation docstring. And we should more seriously consider throwing warnings or errors if sfc_samples and data detector voltages don't match (as discussed in my previous post) because compensating uncalibrated data would be the expected use case.

@castillohair
Collaborator Author

To reply to the last few points raised:

  • I don't think the compensation functions should check for detector voltages, as they don't have a method for determining whether the data has been calibrated or not.
  • .range should be modified along with the event data, as has been done with every other transformation function.
  • I don't know if compensation and calibration are commutative. That would require analyzing the relevant equations, which I think will get tricky when thinking about how compensation affects bead spectra, and the filters of whatever cytometer you're working with. However, what I can say easily is that compensation should work with calibrated data, as the equations only require controls and data to be in the same units. I'm fine with recommending one of the following: a) calibration then compensation, on the basis that compensation only requires controls and data to be in the same units, or b) choosing an order and being consistent.
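To make the commutativity question concrete, here is a toy numerical check (made-up numbers): a per-channel calibration applied before vs. after compensation gives different answers if the same spillover matrix is reused, because a diagonal scaling generally doesn't commute with the inverse of a non-diagonal spillover.

```python
import numpy as np

A_spill = np.array([[1.0, 0.2],
                    [0.1, 1.0]])   # toy spillover (non-diagonal)
D = np.diag([100.0, 250.0])        # toy per-channel calibration factors
x = np.array([5.0, 3.0])           # one uncalibrated event

comp_then_cal = D @ np.linalg.solve(A_spill, x)
cal_then_comp = np.linalg.solve(A_spill, D @ x)
# These differ, so order matters unless the spillover matrix is rebuilt
# from controls in the same units as the data being compensated.
```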

@castillohair
Collaborator Author

castillohair commented Jan 13, 2021

I think I've addressed all previous issues with the last few commits. Among these:

  • Example scripts have been modified to use your dual reporter experiment with the last controls that you sent me.
  • New API tutorial for compensation has been written. This is using non-calibrated samples for compensation for simplicity.
  • Docstrings of transform.to_compensated() and compensate.get_transform_fxn() have been rewritten. In particular, the math portion of the docstring of compensate.get_transform_fxn() was simplified to use more vectors and matrices, and the function code was modified to be more similar to the docstring math and hopefully easier to understand.
  • Unit tests were written for transform.to_compensated. Unit tests were not written for compensate.get_transform_fxn() since this function requires many .fcs files to work (it would be hard to use a synthetic dataset). Currently, mef.get_transform_fxn() is not unit tested for similar reasons.
  • EDIT: Forgot to mention that the readme.txt file in the examples directory still needs to be updated to describe the new dataset. I would prefer if you did this since you're the one who built those plasmids and took the data.

I'll wait for comments.

@castillohair
Collaborator Author

In the process of merging develop, I reaffirmed my opinion that the gray histogram colormap in the example files doesn't look very good. In particular, the first few histograms are white and not even visible. I know you added a small constant to the inducer levels to fix this, but it seems to me like a better fix would be a better colormap. This is what it looks like if I just change the colormap to gray_r:

[Image: histograms with the gray_r colormap]

I am gonna press strongly again for leaving the default colormap (no colormap specified). I always found the discussion on "perceptually uniform colormaps" a little out of place, and I think it only adds bloat to an example file that's already large. This is what it looks like with the default colormap:

[Image: histograms with the default colormap]

I would also be in favor of using viridis just because it doesn't look bad, although if we do that I would like to not have a big comment above talking about perceptually uniform colormaps.

[Image: histograms with the viridis colormap]

I'm leaving the default colors for now until we reach an agreement.
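For comparison, this is how discrete per-histogram colors come out of each candidate colormap (plain Matplotlib, independent of FlowCal; Matplotlib >= 3.5 assumed for the colormaps registry):

```python
from matplotlib import colormaps  # Matplotlib >= 3.5

n = 10  # number of inducer levels in the example
for name in ('gray_r', 'viridis', 'viridis_r'):
    cmap = colormaps[name]
    colors = [cmap(i / (n - 1)) for i in range(n)]
    print(name, 'first color:', colors[0])
# gray_r starts at pure white, which is exactly why the lowest-inducer
# histograms vanish against a white plot background.
```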

@JS3xton
Contributor

JS3xton commented Feb 10, 2021

viridis looks fine to me. (I think the imperceptible yellow 2.0ng/mL histogram with the default color map is just as bad as the white histograms with the unmodified gray_r color map.) Can you try viridis_r, though? I reversed gray because I thought more inducer should be darker.

I really think A and a0 should be pandas DataFrame and Series data structures, respectively. These data structures are far superior for labeling rows or columns, which we are doing. They should also allow us to remove comp_channels from FlowCal.transform.to_compensated(), which is confusing right now and doesn't do what I originally wanted it to do. (I originally wanted it to allow users to specify a submatrix of A to use for compensation, but that's not how it's currently being used—right now it's just carrying around the row labels. At this point, I think it should just be removed.) Also, the comp_channels description in FlowCal.compensate.get_transform_fxn() still needs to be more descriptive. E.g., "Channels to compensate. comp_channels[i] is compensated to describe the sfc_samples[i] fluorophore as if it were measured in isolation."
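To illustrate what the labeled structures would buy us (all values, channel names, and fluorophore names below are hypothetical, not the current API):

```python
import numpy as np
import pandas as pd

channels = ['FL1', 'FL2']
fluorophores = ['sfGFP', 'mCherry']

# Hypothetical labeled spillover matrix A and autofluorescence vector a0.
A = pd.DataFrame([[1.00, 0.05],
                  [0.10, 1.00]],
                 index=channels, columns=fluorophores)
a0 = pd.Series([30.0, 45.0], index=channels)

# Row labels travel with the data structures themselves, so nothing like
# comp_channels is needed just to carry channel names around.
x = pd.Series([500.0, 120.0], index=channels)   # one event
f_hat = pd.Series(np.linalg.solve(A.values, (x - a0).values),
                  index=fluorophores)
```

The output Series is then labeled by fluorophore rather than by channel, which sidesteps the channel/fluorophore pairing question entirely.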

The updated example plots are confusing now and overwhelmed by the compensation workflow, and the compensation tutorial could be more focused. (The violin plot tutorial also looks bad now and incorrectly shows Min and Max in the last plot.) In particular, I think it's really confusing to show both the input (orange) and output (green) data at every stage of the analysis.

  • For the example scripts, can we analyze only the uncompensated output (green) data first (i.e., histograms, dose response, and violin dose response plots), and then introduce the input (orange) data, compensate it, and make one plot showing the results? (4 plots total. And I still believe the most effective way to show the compensation effect is with uncompensated violins in the background in light gray and compensated violins in the foreground in color. The current plots, which have incomparable y-axis scales, do a poor job of showing the effect.) With this more modular structure, simple features can still be plucked from the example scripts without having to understand the more complicated ones.
  • For the compensation tutorial, can we similarly focus on just compensating the input (orange) data? Also, while I think the "Observing bleedthrough" section is a good idea in principle, I don't think it's useful here with the data we currently have—the effect is very small and it's not well highlighted in the figure shown. (Just focus on the sfGFP bleedthrough if you really want to keep the section, or outsource to a different dataset / website.) Also, if you want to discuss the nfc_sample=None use case, maybe give it its own section, e.g., "Eliminating bleedthrough only via compensation" (and title the previous section "Eliminating bleedthrough and autofluorescence via compensation"). The nfc_sample=None use case looks worse, though, so I actually think it's better to just remove it entirely from the tutorial. (If you keep it, though, can you add the parameter names to the last FlowCal.compensate.get_transform_fxn() function call?)
  • For the violin plot tutorial, does the uncompensated input (orange) data look better in the violin plots? If not, can we at least reduce the width on the output data violins to make the plot less crowded and overlapping in the middle?

Other minor issues:

  • Does FlowCal.compensate.get_transform_fxn() work with a Numpy-based statistic_fxn?
  • There was a typo—"accross"—in the code on the Compensate tutorial page.
  • Where is "Part 2" in the analyze_no_mef.py example script? (We may have lost that a while ago, but I did a Ctrl+F for it and couldn't find it...)
  • You used the phrase "compensation coefficients" in several places, but its meaning is not clear to me.
  • Yes, I'll update the README.

I'll continue my review after we've addressed the major issues listed above.
