-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lme2 #38
Lme2 #38
Changes from 11 commits
6a247a7
f13954d
5a7ad7a
241f2a2
5a53409
145ca2e
66e146d
ea3669c
e10bd6f
ec77bfa
af5f730
eecf811
a78b49f
9922347
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -37,6 +37,7 @@ def _intersect_of_table_metadata_tree(table, metadata, tree): | |
|
||
def _to_balances(table, tree): | ||
""" Converts a table of abundances to balances given a tree. | ||
|
||
Parameters | ||
---------- | ||
table : pd.DataFrame | ||
|
@@ -197,7 +198,126 @@ def ols(formula, table, metadata, tree, **kwargs): | |
# http://stackoverflow.com/a/22439820/1167475 | ||
stats_formula = '%s ~ %s' % (b, formula) | ||
|
||
mdf = smf.ols(stats_formula, data=data).fit() | ||
mdf = smf.ols(stats_formula, data=data, **kwargs).fit() | ||
fits.append(mdf) | ||
return RegressionResults(fits, basis=basis, | ||
feature_names=table.columns) | ||
|
||
|
||
def mixedlm(formula, table, metadata, tree, groups, **kwargs): | ||
""" Linear Mixed Effects Models applied to balances. | ||
|
||
Parameters | ||
---------- | ||
formula : str | ||
Formula representing the statistical equation to be evaluated. | ||
These strings are similar to how equations are handled in R. | ||
Note that the dependent variable in this string should not be | ||
specified, since this method will be run on each of the individual | ||
balances. See `patsy` for more details. | ||
table : pd.DataFrame | ||
Contingency table where samples correspond to rows and | ||
features correspond to columns. | ||
metadata: pd.DataFrame | ||
Metadata table that contains information about the samples contained | ||
in the `table` object. Samples correspond to rows and covariates | ||
correspond to columns. | ||
tree : skbio.TreeNode | ||
Tree object where the leaves correspond to the columns contained in | ||
the table. | ||
groups : str | ||
Column names in `metadata` that specifies the groups. These groups are | ||
often associated with individuals repeatedly sampled, typically | ||
longitudinally. | ||
**kwargs : dict | ||
Other arguments accepted into `statsmodels.regression.linear_model.OLS` | ||
|
||
Returns | ||
------- | ||
RegressionResults | ||
Container object that holds information about the overall fit. | ||
|
||
Examples | ||
-------- | ||
>>> import pandas as pd | ||
>>> import numpy as np | ||
>>> from skbio.stats.composition import ilr_inv | ||
>>> from skbio import TreeNode | ||
>>> from gneiss import mixedlm | ||
|
||
Here, we will define a table of proportions with 3 features | ||
`a`, `b`, and `c` across 12 samples. | ||
|
||
>>> table = pd.DataFrame({ | ||
... 'x1': ilr_inv(np.array([1.1, 1.1])), | ||
... 'x2': ilr_inv(np.array([1., 2.])), | ||
... 'x3': ilr_inv(np.array([1.1, 3.])), | ||
... 'y1': ilr_inv(np.array([1., 2.1])), | ||
... 'y2': ilr_inv(np.array([1., 3.1])), | ||
... 'y3': ilr_inv(np.array([1., 4.])), | ||
... 'z1': ilr_inv(np.array([1.1, 5.])), | ||
... 'z2': ilr_inv(np.array([1., 6.1])), | ||
... 'z3': ilr_inv(np.array([1.1, 7.])), | ||
... 'u1': ilr_inv(np.array([1., 6.1])), | ||
... 'u2': ilr_inv(np.array([1., 7.])), | ||
... 'u3': ilr_inv(np.array([1.1, 8.1]))}, | ||
... index=['a', 'b', 'c']).T | ||
|
||
Now we are going to define some of the external variables to | ||
test for in the model. Here we will be testing a hypothetical | ||
longitudinal study across 3 time points, with 4 patients | ||
`x`, `y`, `z` and `u`, where `x` and `y` were given treatment `1` | ||
and `z` and `u` were given treatment `2`. | ||
|
||
>>> metadata = pd.DataFrame({ | ||
... 'patient': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4], | ||
... 'treatment': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], | ||
... 'time': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3] | ||
... }, index=['x1', 'x2', 'x3', 'y1', 'y2', 'y3', | ||
... 'z1', 'z2', 'z3', 'u1', 'u2', 'u3']) | ||
|
||
Finally, we need to define a bifurcating tree used to convert the | ||
proportions to balances. If the internal nodes aren't labels, | ||
a default labeling will be applied (i.e. `y1`, `y2`, ...) | ||
|
||
>>> tree = TreeNode.read(['(c, (b,a)Y2)Y1;']) | ||
>>> print(tree.ascii_art()) | ||
/-c | ||
-Y1------| | ||
| /-b | ||
\Y2------| | ||
\-a | ||
|
||
Now we can run the linear mixed effects model on the proportions. | ||
Underneath the hood, the proportions will be transformed into balances, | ||
so that the linear mixed effects models can be run directly on balances. | ||
Since each patient was sampled repeatedly, we'll specify them separately | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. extra space between "was" and "sampled" |
||
in the groups. In the linear mixed effects model `time` and `treatment` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. extra space between "effects" and "model" |
||
will be simultaneously tested for with respect to the balances. | ||
|
||
>>> res = mixedlm('time + treatment', table, metadata, tree, | ||
... groups='patient') | ||
|
||
See Also | ||
-------- | ||
statsmodels.regression.linear_model.MixedLM | ||
ols | ||
|
||
""" | ||
table, metadata, tree = _intersect_of_table_metadata_tree(table, | ||
metadata, | ||
tree) | ||
ilr_table, basis = _to_balances(table, tree) | ||
data = pd.merge(ilr_table, metadata, left_index=True, right_index=True) | ||
|
||
fits = [] | ||
for b in ilr_table.columns: | ||
# mixed effects code is obtained here: | ||
# http://stackoverflow.com/a/22439820/1167475 | ||
stats_formula = '%s ~ %s' % (b, formula) | ||
mdf = smf.mixedlm(stats_formula, data=data, | ||
groups=data[groups], | ||
**kwargs).fit() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Any easy way to exercise the kwargs in the unit tests? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Possibly. Give me a sec. |
||
fits.append(mdf) | ||
return RegressionResults(fits, basis=basis, | ||
feature_names=table.columns) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,3 @@ | ||
#!/usr/bin/env python | ||
|
||
# ---------------------------------------------------------------------------- | ||
# Copyright (c) 2016--, gneiss development team. | ||
# | ||
|
@@ -39,26 +37,29 @@ def __init__(self, | |
self.basis = basis | ||
self.results = stat_results | ||
|
||
# sum of squares error. Also referred to as sum of squares residuals | ||
sse = 0 | ||
# sum of squares regression. Also referred to as | ||
# explained sum of squares. | ||
ssr = 0 | ||
# See `statsmodels.regression.linear_model.RegressionResults` | ||
# for more explanation on `ess` and `ssr`. | ||
|
||
# obtain pvalues | ||
self.pvalues = pd.DataFrame() | ||
for r in self.results: | ||
p = r.pvalues | ||
p.name = r.model.endog_names | ||
self.pvalues = self.pvalues.append(p) | ||
sse += r.ssr | ||
ssr += r.ess | ||
|
||
@property | ||
def r2(self): | ||
""" Calculates the coefficients of determination """ | ||
# Reason why we wanted to move this out was because not | ||
# all types of statsmodels regressions have this property. | ||
|
||
# See `statsmodels.regression.linear_model.RegressionResults` | ||
# for more explanation on `ess` and `ssr`. | ||
# sum of squares regression. Also referred to as | ||
# explained sum of squares. | ||
ssr = sum([r.ess for r in self.results]) | ||
# sum of squares error. Also referred to as sum of squares residuals | ||
sse = sum([r.ssr for r in self.results]) | ||
# calculate the overall coefficient of determination (i.e. R2) | ||
sst = sse + ssr | ||
self.r2 = 1 - sse / sst | ||
return 1 - sse / sst | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are there any tests that need to be modified/created? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nope. This is handled within the ols results. Note that I wanted to move this outside of the constructor, because linear mixed effects models don't have an r2. |
||
|
||
def _check_projection(self, project): | ||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be
mixedlm
instead ofOLS