Different results from different installs #114
@erimjr If you are able to upload your table and mapping file, I would be interested to play around with it. From what you have described I can't see anything immediately wrong with what you are doing.
@johnchase No worries. Here you go! GitHub tells me it won't support my .biom file type, so I'll drop a text version of the table here and you can convert it with biom convert: non_rarefied_results_first_r.txt
@lkursell pointed out that the alpha2 parameter default has changed between versions (the parameter that accounts for the counts of the "unknown" source). Explicitly setting this in the two versions should produce the same results. I am going to close this for now, as my testing suggests this accounts for the difference, but feel free to comment if it does not resolve the discrepancy.
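In case it helps anyone hitting this later, a sketch of pinning the prior explicitly so both installs use the same value (the input paths below are placeholders, not files from this thread):

```shell
# Pin alpha2 to the old default (0.001) so both versions use the same
# prior weight for the "unknown" source; paths are placeholders.
sourcetracker2 gibbs \
    -i otu_table.biom \
    -m map.txt \
    -o st2_alpha2_pinned \
    --alpha2 0.001
```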
@johnchase and @lkursell
Yes, that should be the case. I am unsure about the timeline for #112 |
@johnchase sorry to @ you again, I know you guys are really busy. I've rerun my small data set with the alpha2 parameter set to 0.001 (which is the default in the original R code as well as in the previous version of SourceTracker2), and the results are basically identical to the run with alpha2 set at the new default (0.1):

```shell
sourcetracker2 gibbs -i /otu_table_second_sourcetracker.biom -m /map_second_sourcetracker.txt -o /20181205_st2.1_first --source_rarefaction_depth 0 --sink_rarefaction_depth 0 --per_sink_feature_assignments --alpha2 .001
```

Thanks again for your help and advice! 20181129_mixing_proportions.pdf
I can try to take a look into this a bit deeper this weekend; I have not been able to reproduce the results that you are seeing with the development version of sourcetracker.
Hi @johnchase, have you been able to find my problem? I've tried to rerun this several times with slightly different parameters, but I always end up with the same results: nearly identical outputs from the original R script (which takes about 4 days to run on my machine) and the SourceTracker2 script installed from Bioconda, and wildly different outputs from the SourceTracker2 script installed from GitHub (which still errors out on my Ubuntu machine, but will run on my MacBook).
Hi @erimjr, thank you for being patient! It took me a minute to track down what is going on here. Previously I was doing all of my testing with the sourcetracker API rather than the command line interface. That was a mistake on my part, because the differences you are seeing are a direct result of running sourcetracker through the command line interface. When you run sourcetracker through the CLI a few things happen to your data; the one we are interested in here is that all of the sources from the same environment are collapsed into a single source sample. In previous versions of sourcetracker the collapsing was done by summing all of the feature (OTU) counts; in the development version it is done by taking the mean of the counts. I was able to dig up some discussion about this from a while back here: #51 (comment)

@lkursell @wdwvt1 Would you want to comment on the validity of summing vs. averaging sample counts? IIRC I tended to prefer summing counts rather than taking the mean, although ultimately I believe it comes down to the assumptions regarding one's data. I wonder if a solution would be to add this as a command line parameter to let the user decide?

@erimjr In the meantime I don't have a fantastic solution for quickly reproducing results between versions, although one possibility would be to collapse your data (i.e. combine your source samples) prior to running sourcetracker.

The following will illustrate the differences that @erimjr is seeing:

```python
import pandas as pd
from ipyparallel import Client
from sourcetracker import gibbs

samples = pd.read_csv('map_second_sourcetracker.txt', sep='\t', index_col=0)
data = pd.read_csv('otu_table_second_sourcetracker.txt', sep='\t',
                   index_col=0, skiprows=1).T
c = Client()  # ipyparallel client

sinks = data.loc[samples[samples.SourceSink == 'sink'].index].copy()

# Take the sum
sources = data.loc[samples[samples.SourceSink == 'source'].index].copy()
sources = pd.DataFrame(sources.sum()).T
sources.index = ['Sewer']
mp, mp_std, feats = gibbs(sources, sinks)
print(mp)

# Take the mean
sources = data.loc[samples[samples.SourceSink == 'source'].index].copy()
sources = pd.DataFrame(sources.mean()).T
sources.index = ['Sewer']
mp, mp_std, feats = gibbs(sources, sinks)
print(mp)
```
Thanks @johnchase and @lkursell! I can try it! My only question: since my data is incidence data, I would guess that using the mean number of counts across replicate sources gives a better representation of the source data. However, I think I remember reading somewhere in the documentation that the counts need to be integers, but maybe I'm wrong, since I can't seem to find that advice again after looking.
Not exactly. When I ran sourcetracker through the API there was no formatting of the data by sourcetracker; it was all done manually. When I collapsed the data manually, I made the mistaken assumption that the command line sourcetracker call was collapsing samples by summing them; this, it turns out, is the difference in results between the current and previous versions of sourcetracker.

If that is the case and you wish to use the mean, then you don't need to do anything; the latest version of sourcetracker on GitHub will do this automatically. If you wish to collapse by summing the features of each sample, you can do so prior to running sourcetracker; then, when you run sourcetracker with only one sample per source, no collapsing will take place.
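For example, a minimal pandas sketch of collapsing source samples by summing before the sourcetracker run (the table, sample IDs, and Env values here are toy data, not from this thread):

```python
import pandas as pd

# Toy feature table: rows are samples, columns are OTUs (hypothetical names)
data = pd.DataFrame(
    {"OTU_1": [3, 1, 0, 2], "OTU_2": [0, 2, 5, 1]},
    index=["src_a1", "src_a2", "src_b1", "sink1"],
)
# Toy mapping file with the usual SourceSink/Env columns
samples = pd.DataFrame(
    {
        "SourceSink": ["source", "source", "source", "sink"],
        "Env": ["Sewer", "Soil", "Soil", "Gut"],
    },
    index=data.index,
)

# Keep only the source samples, then sum feature counts within each
# source environment so each environment becomes a single sample
sources = data.loc[samples.SourceSink == "source"]
collapsed = sources.groupby(samples.loc[sources.index, "Env"]).sum()
print(collapsed)
```

After writing `collapsed` (plus the untouched sinks) back out to a table with one sample per source environment, the CLI has nothing left to collapse.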
@johnchase Ok, I think I get it now. So I guess the R version of sourcetracker and the version on Bioconda are summing the counts in replicate sources, which seems to give a more "correct" answer in my case? That seems strange to me: since it's an incidence table, I would guess that simply adding all the 1's together would make source samples with more replicates look very different from the sink samples. Clearly I'm missing something, though, since the result comes out better this way.
Great, I will be interested to hear what you find! |
@johnchase @lkursell |
Now I just need to figure out how to interpret the per sink feature assignments 👍 |
Thanks for the update. This is what I had seen with your data as well. I'm fascinated by this, as we had tended not to see large differences as a result of how sources were collapsed. I think I still have much to understand about how 16S data is interpreted by sourcetracker!
Hi, I've never left a comment on GitHub before so I hope I'm doing this properly.
I'm running SourceTracker2 on a MacBook. I previously installed SourceTracker2 into a dedicated virtual environment using the Bioconda channel ($conda install sourcetracker), but that version doesn't have the --per_sink_feature_assignments flag available, which I want. I'll call this install ST2.
I installed the new master version of SourceTracker2 into a different virtual environment according to the instructions in issue #85 (using $pip install https://github.com/biota/sourcetracker2/archive/master.zip); I'll call this install ST2.1.
As an aside, when I install ST2.1 the script runs fine on my MacBook (Mojave, 10.14.1), but in Ubuntu (16.04.5 LTS, xenial) I get the applymap error mentioned in issue #111 .
When I run the same data set with the same commands in ST2 and ST2.1 I get very different results (well outside of deviation). Most notably, two of my sink samples are actually my source samples and ST2.1 only finds 67 and 81 % contribution from the sinks, respectively, whereas ST2 finds >99.9 % contribution for both (which is a more accurate result).
I already know there are problems with my training data set (I need a greater diversity of sinks for more reliable results and I need more replicates), but I'm trying to establish a workflow that can be followed by other users before putting the rest of the data in.
I've attached the output mixing_proportions.txt files from each of my runs. If you want me to upload my .biom table and map I'm happy to, just tell me.
I'm running each install with no rarefaction in sources or sinks (`sourcetracker2 gibbs -i ./otu_table_second_sourcetracker.biom -m ./20181030_map_second_sourcetracker.txt -o ./st2_small_rerun --source_rarefaction_depth 0 --sink_rarefaction_depth 0`), with the --per_sink_feature_assignments flag in ST2.1 and without it in ST2.
Finally, the results from ST2 line up well with results from the original R script.
I think there's a question buried in there somewhere. It's something like, "I've clearly done something wrong, can you help me figure it out?"
Thanks for your help!
mixing_proportions_ST2.1.txt
mixing_proportions_stds_ST2.1.txt
mixing_proportions_stds_ST2.txt
mixing_proportions_ST2.txt