The goal of this project is to transcompile a subset of the SAS/Base language to the SciPy stack.
The tests can be run directly inside your git clone (without having to install stan) by typing:

```shell
nosetests stan
```
Some differences from SAS to note:

- A data step `merge` will not require the data to be sorted beforehand. Data will be implicitly sorted (similar to the SPDE engine).
- Dates will be supported in a different manner (coming soon).
- `format`, `length`, and `informats` will not be necessary (we shall use `dtype` in `numpy`).
- Pandas supports column names with spaces in them. This may cause issues, since SAS automatically changes spaces to `_`.
- Pandas is case sensitive; SAS is not.
- There is no `macro` facility. It can be replicated (to a degree) using IPython.
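Since pandas allows spaces in column names and is case sensitive while SAS is neither, a small normalization pass can keep the two conventions aligned. This is a minimal sketch (the helper name `sasify_columns` is hypothetical, not part of the package):

```python
from pandas import DataFrame

def sasify_columns(df):
    # hypothetical helper: rename columns the way SAS would --
    # spaces become underscores, and lower-casing sidesteps
    # pandas' case sensitivity
    df = df.copy()
    df.columns = [c.replace(" ", "_").lower() for c in df.columns]
    return df

df = DataFrame({"My Col": [1, 2], "Other": [3, 4]})
print(sasify_columns(df).columns.tolist())  # ['my_col', 'other']
```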
```python
from stan.transcompile import transcompile
import stan.stan_magic
from pandas import DataFrame
import numpy as np
import pkgutil
from numpy import nan
import stan.proc_functions as proc_func

mod_name = ["from stan.proc_functions import %s" % name
            for _, name, _ in pkgutil.iter_modules(proc_func.__path__)]
exec("\n".join(mod_name))
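The loop above builds import statements as strings and `exec`s them. The same effect can be had without `exec` via `importlib`; this is a standalone sketch, demonstrated on a stdlib package so it does not depend on stan being installed:

```python
import importlib
import pkgutil

def import_submodules(package_name):
    """Import every submodule of a package and return them keyed by name."""
    package = importlib.import_module(package_name)
    return {
        name: importlib.import_module(package_name + "." + name)
        for _, name, _ in pkgutil.iter_modules(package.__path__)
    }

# e.g. procs = import_submodules("stan.proc_functions")
print(sorted(import_submodules("json")))
```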
```python
# create an example data frame
df = DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
df
```
 | a | b | c | d | e |
---|---|---|---|---|---|
0 | -1.245481 | -1.609963 | 0.442550 | -0.056406 | -0.213349 |
1 | -1.118754 | 0.116146 | -0.032579 | -0.556940 | 0.270678 |
2 | 0.864960 | -0.479118 | 2.370390 | 2.090656 | -0.475426 |
3 | 0.434934 | -2.510176 | 0.122871 | 0.077915 | 0.597477 |
4 | 0.689308 | 0.042817 | 0.217040 | -1.424120 | -0.214721 |
5 | -0.432170 | -1.344882 | -0.055934 | 1.921247 | 1.519922 |
6 | -0.837277 | 0.944802 | -0.650114 | -0.297314 | 1.432118 |
7 | 1.488292 | -1.236296 | 0.128023 | 2.886408 | -0.560200 |
8 | -0.510566 | -1.736577 | 0.066769 | -0.735257 | 0.178167 |
9 | 2.540022 | 0.034493 | -0.521496 | -2.189938 | 0.111702 |
```
%%stan
data test;
set df (drop = a);
run;
```

The magic returns the generated Python source, which `exec(_)` then runs:

```python
u"test=df.drop(['a'],1)\n"
```

```python
exec(_)
test
```
 | b | c | d | e |
---|---|---|---|---|
0 | -1.609963 | 0.442550 | -0.056406 | -0.213349 |
1 | 0.116146 | -0.032579 | -0.556940 | 0.270678 |
2 | -0.479118 | 2.370390 | 2.090656 | -0.475426 |
3 | -2.510176 | 0.122871 | 0.077915 | 0.597477 |
4 | 0.042817 | 0.217040 | -1.424120 | -0.214721 |
5 | -1.344882 | -0.055934 | 1.921247 | 1.519922 |
6 | 0.944802 | -0.650114 | -0.297314 | 1.432118 |
7 | -1.236296 | 0.128023 | 2.886408 | -0.560200 |
8 | -1.736577 | 0.066769 | -0.735257 | 0.178167 |
9 | 0.034493 | -0.521496 | -2.189938 | 0.111702 |
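The generated `df.drop(['a'],1)` uses the positional-axis form, which recent pandas versions no longer accept; the `columns` keyword is the current idiom. A minimal equivalent sketch:

```python
from pandas import DataFrame
import numpy as np

df = DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])

# explicit axis keyword (what the positional form means)
test_old = df.drop(['a'], axis=1)
# columns keyword: clearer, and the only form in pandas 2.x
test_new = df.drop(columns=['a'])

print(test_new.columns.tolist())  # ['b', 'c', 'd', 'e']
```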
```
%%stan
data df_if;
set df;
if b < 0.3 then x = 0;
else if b < 0.6 then x = 1;
else x = 2;
run;
```

```python
u"df_if=df\nfor el in ['x']:\n    if el not in df_if.columns:\n        df_if[el] = np.nan\ndf_if.ix[((df_if[u'b']<0.3)), 'x'] = (0)\nfor el in ['x']:\n    if el not in df_if.columns:\n        df_if[el] = np.nan\ndf_if.ix[((~((df_if[u'b']<0.3))) & (df_if[u'b']<0.6)), 'x'] = (1)\ndf_if.ix[((~((df_if[u'b']<0.6))) & (~((df_if[u'b']<0.3)))), 'x'] = (2)\n"
```

```python
exec(_)
df_if
```
 | a | b | c | d | e | x |
---|---|---|---|---|---|---|
0 | -1.245481 | -1.609963 | 0.442550 | -0.056406 | -0.213349 | 0 |
1 | -1.118754 | 0.116146 | -0.032579 | -0.556940 | 0.270678 | 0 |
2 | 0.864960 | -0.479118 | 2.370390 | 2.090656 | -0.475426 | 0 |
3 | 0.434934 | -2.510176 | 0.122871 | 0.077915 | 0.597477 | 0 |
4 | 0.689308 | 0.042817 | 0.217040 | -1.424120 | -0.214721 | 0 |
5 | -0.432170 | -1.344882 | -0.055934 | 1.921247 | 1.519922 | 0 |
6 | -0.837277 | 0.944802 | -0.650114 | -0.297314 | 1.432118 | 2 |
7 | 1.488292 | -1.236296 | 0.128023 | 2.886408 | -0.560200 | 0 |
8 | -0.510566 | -1.736577 | 0.066769 | -0.735257 | 0.178167 | 0 |
9 | 2.540022 | 0.034493 | -0.521496 | -2.189938 | 0.111702 | 0 |
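The generated code uses `.ix`, which has been deprecated and removed in modern pandas. The same ordered if/else-if chain can be expressed with `np.select`; this is a sketch of the equivalent logic, not what stan currently generates:

```python
import numpy as np
from pandas import DataFrame

df_if = DataFrame({'b': [-1.6, 0.12, 0.94, 0.5]})

# conditions are checked in order, like the SAS if/else-if chain;
# the first matching condition wins, and `default` plays the role of else
conditions = [df_if['b'] < 0.3, df_if['b'] < 0.6]
choices = [0, 1]
df_if['x'] = np.select(conditions, choices, default=2)

print(df_if['x'].tolist())  # [0, 0, 2, 1]
```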
Procs can be added manually; they can be thought of as Python functions. You can define your own, though the parser needs more work to make this smooth.

```python
df1 = DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)
df1
```
 | a | b |
---|---|---|
0 | True | False |
1 | False | True |
2 | True | True |
```
%%stan
proc describe data = df1 out = df2;
by a;
run;
```

```python
u"df2=describe.describe(data=df1,by='a')"
```

```python
exec(_)
df2
```
a |  | a | b |
---|---|---|---|
False | count | 1 | 1 |
 | mean | 0 | 1 |
 | std | NaN | NaN |
 | min | False | True |
 | 25% | False | True |
 | 50% | 0 | 1 |
 | 75% | False | True |
 | max | False | True |
True | count | 2 | 2 |
 | mean | 1 | 0.5 |
 | std | 0 | 0.7071068 |
 | min | True | False |
 | 25% | 1 | 0.25 |
 | 50% | 1 | 0.5 |
 | 75% | 1 | 0.75 |
 | max | True | True |
The proc actually isn't difficult to write. For the above example it is just this:

```python
def describe(data, by):
    return data.groupby(by).describe()
```

This functionality allows you to handle most of the `by` and `retain` cases. For languages like Python and R, the normal way to handle data is through the split-apply-combine methodology.
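As a concrete illustration of split-apply-combine, using pandas directly (a standalone sketch, independent of the stan magic):

```python
from pandas import DataFrame

df = DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, 3.0, 5.0]})

# split on 'key', apply the mean to each group, combine into one result
result = df.groupby('key')['val'].mean()
print(result.to_dict())  # {'a': 2.0, 'b': 5.0}
```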
Merges can be achieved in a similar way, by creating a `proc`:
```
%%stan
proc merge out = df2;
dt_left left;
dt_right right;
on = 'key';
run;
```

```python
u"df2=merge.merge(dt_left=left,dt_right=right,on='key')"
```

```python
left = DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
exec(_)
df2
```
 | key | lval | rval |
---|---|---|---|
0 | foo | 1 | 4 |
1 | foo | 1 | 5 |
2 | foo | 2 | 4 |
3 | foo | 2 | 5 |
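The generated call wraps pandas' own `merge`; the same four-row many-to-many join on `'key'` can be produced directly (a standalone sketch):

```python
from pandas import DataFrame

left = DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

# every 'foo' on the left pairs with every 'foo' on the right: 2 x 2 = 4 rows
df2 = left.merge(right, on='key')
print(len(df2))  # 4
```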
Here's an example showing how you can define your own function and run it (not a function that came with the package):

```python
def sum_mean_by(data, by):
    return data.groupby(by).agg([np.sum, np.mean])
```
```
%%stan
proc sum_mean_by data = df_if out = df_sum;
by x;
run;
```

```python
u"df_sum=sum_mean_by(data=df_if,by='x')"
```

```python
exec(_)
df_sum
```
x | a_sum | a_mean | b_sum | b_mean | c_sum | c_mean | d_sum | d_mean | e_sum | e_mean |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2.710545 | 0.301172 | -8.723557 | -0.969284 | 2.737635 | 0.304182 | 2.013566 | 0.223730 | 1.214251 | 0.134917 |
2 | -0.837277 | -0.837277 | 0.944802 | 0.944802 | -0.650114 | -0.650114 | -0.297314 | -0.297314 | 1.432118 | 1.432118 |
`proc sql` is supported through the `pandasql` library, so the above table could have been produced via SQL as well.
```python
import pandasql

q = """
select
    sum(a) as sum_a,
    sum(b) as sum_b,
    sum(c) as sum_c,
    sum(d) as sum_d,
    sum(e) as sum_e,
    avg(a) as avg_a,
    avg(b) as avg_b,
    avg(c) as avg_c,
    avg(d) as avg_d,
    avg(e) as avg_e
from
    df_if
group by x
"""

df_sum_sql = pandasql.sqldf(q, locals())
df_sum_sql
```
 | sum_a | sum_b | sum_c | sum_d | sum_e | avg_a | avg_b | avg_c | avg_d | avg_e |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2.710545 | -8.723557 | 2.737635 | 2.013566 | 1.214251 | 0.301172 | -0.969284 | 0.304182 | 0.223730 | 0.134917 |
1 | -0.837277 | 0.944802 | -0.650114 | -0.297314 | 1.432118 | -0.837277 | 0.944802 | -0.650114 | -0.297314 | 1.432118 |