Option to set IID subsample #56
Conversation
@eurunuela @tsalo @CesarCaballeroGaudes @javiergcas @dowdlelt @n-reddy I have some results from testing this idea and I wanted to share. As a reminder, for this dataset I noticed that most runs estimated a subsampling depth of 2 (to make the remaining sparse data IID), but a bunch of runs had estimates of 1 or 3, which resulted in the estimated number of PCA components being way too high or low. The idea was that, given the acquisition parameters are identical, the spatial smoothness of all runs should be similar, so setting the subsampling depth to 2 for all runs might address the issue.

The attached data are from a study where 25 participants did event-related word/nonword task runs, movie viewing with paced breathing runs, and paced breathing runs. In the figure, each marker is a run and there are 174 runs total. In case it matters, each type of run is a different marker shape. The runs where the program estimated a depth of 2 are green, while estimates of 1 or 3 are blue or red. In the first row, the x-axis is the number of components for the AIC, KIC, and MDL criteria using the estimated subsampling depth, and the y-axis is the number when I set the depth to 2 using this PR (axis scaling is equal, so the green markers, where the depth is 2 in both cases, lie on the unity line). You can see that the number of components gets closer to the typical values from runs where the estimate was 2, but some are still near the extremes. The second row shows the variance explained by the retained PCA components and, again, the values are a bit less extreme with a fixed value. The third row is variance explained vs. number of components when subsampling is set to 2, and the fourth row is the same for the estimated subsample size. Here it's very clear that setting the value to 2 means all the values now seem to fall within a similar overall distribution.

The big minus is that there are still extreme cases with too many retained components (only with AIC) and cases where the total variance explained by the PCA seems a bit low. This makes me think there is an underlying limit to the MAPCA method that we still haven't characterized, and I'd like to figure out a better option (possibly taking advantage of multi-echo info). Still, I think this is an improvement and I'd like to clean up this PR to merge and link it to tedana. A rough sketch of the kind of comparison plot described above is below.
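For reference, here is a minimal sketch of how the estimated-vs-fixed comparison in the first row could be drawn. The arrays are placeholders, not the actual per-run results; in practice they would come from running MAPCA twice per run.

```python
# Hypothetical sketch (not the actual analysis code): compare per-run component
# counts from MAPCA with the estimated subsampling depth vs. the depth fixed to 2.
import numpy as np
import matplotlib.pyplot as plt

# Placeholder arrays; in practice these come from the per-run MAPCA outputs.
n_comps_estimated = np.array([45, 60, 130, 20, 58])  # depth chosen by MAPCA
n_comps_fixed2 = np.array([47, 60, 75, 50, 58])      # depth forced to 2
est_depth = np.array([1, 2, 1, 3, 2])                # the depth MAPCA estimated

colors = {1: "tab:blue", 2: "tab:green", 3: "tab:red"}
fig, ax = plt.subplots(figsize=(5, 5))
for depth in (1, 2, 3):
    sel = est_depth == depth
    ax.scatter(n_comps_estimated[sel], n_comps_fixed2[sel],
               color=colors[depth], label=f"estimated depth = {depth}")
lims = [0, max(n_comps_estimated.max(), n_comps_fixed2.max()) + 10]
ax.plot(lims, lims, "k--", linewidth=1)  # unity line: no change from fixing the depth
ax.set_xlabel("Components (estimated subsampling depth)")
ax.set_ylabel("Components (subsampling depth fixed to 2)")
ax.set_aspect("equal")
ax.legend()
plt.show()
```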
I cleaned up the code and documentation and fixed an existing bug that was causing integration tests to fail (a file name was mistyped). If we decide this option is worth adding, this PR is ready for review/merging.
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## main #56 +/- ##
==========================================
+ Coverage 90.26% 95.43% +5.17%
==========================================
Files 3 3
Lines 298 307 +9
==========================================
+ Hits 269 293 +24
+ Misses 29 14 -15
☔ View full report in Codecov by Sentry.
Am I right in observing that, for the runs where the autoselection would have chosen 3, setting the value to 2 led to a relatively low amount of variance explained? Something about that jumps out to me. I am also troubled by the large jump from 2 to 3; that's a huge change for a single increment in a parameter. Fixing the value at 2 should likely help folks, but I agree with your "find a better method" comment. I am fully supportive of this idea and will take a closer look at things, but I'm just trying to grapple with the image.
@dowdlelt I see a way that might make sense to me. If we assume that the runs where a sparseness factor of 3 is estimated have slightly higher spatial smoothness, then it would be possible to model those runs with slightly fewer components than the other runs. This is testable. I could make a smoothness estimate for each run, check whether the ones with an estimate of 3 are slightly smoother, and see if there's a more general relationship between spatial smoothness and the number of components that were estimated. I'm not sure I'll be able to get to running that test during the next few days. A rough sketch of the check I have in mind is below.
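Something along these lines, with placeholder file names and depths, and a crude lag-1 spatial autocorrelation proxy rather than a formal FWHM estimate:

```python
# Rough sketch of the proposed per-run smoothness check (placeholder inputs,
# and a simple autocorrelation proxy instead of a proper FWHM estimate).
import nibabel as nib
import numpy as np
from scipy.stats import spearmanr


def smoothness_proxy(fname):
    """Mean lag-1 spatial autocorrelation of the run's mean volume.

    Higher values indicate smoother data; this is only a proxy, not FWHM.
    """
    data = nib.load(fname).get_fdata()
    mean_vol = data.mean(axis=-1)  # collapse the time dimension
    corrs = []
    for axis in range(3):
        a = np.moveaxis(mean_vol, axis, 0)
        x, y = a[:-1].ravel(), a[1:].ravel()
        corrs.append(np.corrcoef(x, y)[0, 1])
    return float(np.mean(corrs))


# Placeholder inputs: one optimally combined file and one MAPCA-estimated depth per run.
run_files = ["run-01_optcom.nii.gz", "run-02_optcom.nii.gz", "run-03_optcom.nii.gz"]
est_depths = [2, 3, 2]

smoothness = [smoothness_proxy(f) for f in run_files]
rho, pval = spearmanr(smoothness, est_depths)
print(f"Spearman rho between smoothness proxy and estimated depth: {rho:.2f} (p={pval:.3f})")
```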
Someone suggested to me that the sparse sampling approach might fail with multi-slice sequences, since there is some dependence between non-neighboring slices. To quickly check this, I ran MAPCA on 206 runs of optimally combined data from 2 participants. The subsampling value was consistently 2 in all runs. This hints that multi-slice acquisition might be a confounding factor. That said, the component estimates ranged from 40 to 120 (out of 150 volumes), so there are other factors as well. I'm leaning more toward saying MAPCA is just not working as intended. This PR might help isolate some problems, but it doesn't solve the underlying issues. I'm torn on merging this PR and giving users this option. Thoughts from others?
I wanted to test it real quick before giving a thumbs up, but I haven't had a chance. I definitely think it is the right thing to do - it's a sneaky parameter, and being able to ask people what it was when they report something on Neurostars is worth it (and suggest setting it to 2).
Hey @dowdlelt, have you had the chance to test it? I lean towards merging this PR.
I've found some good data to play with, but in the meantime, looks good to me. Happy subsampling!
@eurunuela - I don't think you ever actually approved this; it came up in the meeting today. Assuming you approve, we should be good to go.
Sorry @dowdlelt, I do approve. Didn't know we had a meeting!
GitHub confuses me - maybe because I wasn't a requested reviewer, my previous reviews didn't count? Do it live, I say!
(but really, should be good to go)
As discussed at the May 2023 tedana developers call, we decided to add an option to let a user set the IID subsampling value `sub_iid_sp_median`. I think this PR has it working, but I want to make sure everything is correct and properly documented before saying it's ready to merge.

As part of this PR, we'll always estimate the IID subsample, and both the estimated IID subsample and the one actually used (different if the user provided one) will be logged. A minimal usage sketch is below.
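Roughly how this might look from a user's side. This is only a sketch: the keyword name `subsample_depth` and the exact call signature are assumptions for illustration, not the final API exposed by this PR.

```python
# Minimal usage sketch, assuming a MovingAveragePCA-style interface; the keyword
# name `subsample_depth` is a placeholder for whatever this PR ends up exposing.
import nibabel as nib
from mapca import MovingAveragePCA

img = nib.load("sub-01_task-rest_optcom.nii.gz")  # placeholder optimally combined data
mask = nib.load("sub-01_task-rest_mask.nii.gz")   # placeholder brain mask

# Default behavior: the IID subsampling depth (sub_iid_sp_median) is estimated from the data.
pca_auto = MovingAveragePCA(criterion="mdl")
pca_auto.fit_transform(img, mask)

# With this PR: the user can fix the depth (e.g., to 2); the estimated value
# and the value actually used are both logged.
pca_fixed = MovingAveragePCA(criterion="mdl", subsample_depth=2)  # hypothetical keyword
pca_fixed.fit_transform(img, mask)
```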