Handle Vensim Subscripting #21
Comments
A few thoughts:
|
For simplicity it probably makes sense to implement subscript notation using python decorator syntax, to preserve the simplicity of the implemented python function. Something like:
We may not even need to have a fixed set of parameters in the decorator call, but broadcast whatever is passed in...
We'd need to have some clever error checking, though. We need the decorator function to be able to modify calls to other functions from within the decorated functions. We may be able to use something akin to numpy's |
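For illustration, a bare-bones sketch of the decorator idea - the subscripted name and the dims attribute are hypothetical, just to show how the implemented python function could stay plain while the framework learns which dimensions to broadcast over:

import functools
import numpy as np

def subscripted(*dims):
    # hypothetical decorator: record the subscript dimensions a component is
    # defined over, so the framework could broadcast calls across them
    def wrap(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)
        wrapper.dims = dims
        return wrapper
    return wrap

@subscripted('suba', 'subb')
def constant():
    # the implemented python function stays a plain function
    return np.array([[1, 2], [3, 4], [5, 6]])

constant.dims   # -> ('suba', 'subb')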
Next steps:
|
First step is probably to write test cases for each of the subscript features. I'm moving over to the joint test suite, and will add these tests here: https://github.com/SDXorg/test-models |
Syntax should:
|
Other complexities:
|
For this it would make some sense to have a labelled ndarray, something like xray's DataArray. Subranging, subscript mapping, broadcasting, and summary functions would practically come out of the box and have a more pandas-like syntax. What do you think? |
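For reference, this is roughly what labelled access looks like with xray's DataArray (the package has since been renamed xarray, but the calls below are the same; the dimension names are just illustrative):

import numpy as np
from xray import DataArray

stock = DataArray(np.ones((3, 2)),
                  coords={'suba': ['suba1', 'suba2', 'suba3'],
                          'subb': ['subb1', 'subb2']},
                  dims=['suba', 'subb'])

stock.loc['suba1', 'subb2']    # label-based access
stock.sum(dim='suba')          # summary function over a named dimension
scale = DataArray([1, 2], coords={'subb': ['subb1', 'subb2']}, dims=['subb'])
stock * scale                  # broadcasting/alignment by dimension name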
Patrick - I think you might be on the right track. Looking for now just at how we represent the subscripted state, it seems like we have several options for representing the subscript structure:
Flat dictionary. We could use the flat state dictionary the way it currently exists, and add suffixes to the variable names to indicate a position in the subscript structure. This is nice because it doesn't require any new unpacking/repacking machinery, but could get very unwieldy as the number of subscripts grows.
Nested dictionaries. We could nest subscript values within dictionaries within the state dictionary. This has several advantages:
Unnamed ndarrays. If we are OK storing the names of the subscript elements in a separate structure, we can keep the state itself as plain, unnamed numpy arrays. Accessing values gets rather complicated in this arrangement.
Named ndarrays. The nicest solution from an implementation perspective is probably a type of named array (using either pandas or xray or some other type of container). This would be easy to access, but might make the integration slow because of the extra overhead. Additionally, we want to make sure we don't add too many more dependencies.
The way to sort this out is probably to test some of these in a sandbox environment and do speed tests... |
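To make the comparison concrete, here is the same 3x2 stock under each of the four representations - a sketch only, not working pysd code, with made-up element names:

import numpy as np
from xray import DataArray

# 1. flat dictionary: suffixes on the variable names
state = {'stock_suba1_subb1': 1, 'stock_suba1_subb2': 1, 'stock_suba2_subb1': 1}  # ...and so on

# 2. nested dictionaries within the state dictionary
state = {'stock': {'suba1': {'subb1': 1, 'subb2': 1},
                   'suba2': {'subb1': 1, 'subb2': 1},
                   'suba3': {'subb1': 1, 'subb2': 1}}}

# 3. unnamed ndarray, with the element names kept in a separate structure
state = {'stock': np.ones((3, 2))}
suba_index = {'suba1': 0, 'suba2': 1, 'suba3': 2}
subb_index = {'subb1': 0, 'subb2': 1}

# 4. named array (xray)
state = {'stock': DataArray(np.ones((3, 2)),
                            coords={'suba': ['suba1', 'suba2', 'suba3'],
                                    'subb': ['subb1', 'subb2']},
                            dims=['suba', 'subb'])}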
The way we construct arrays in the state vector should also drive how constants (and in some cases, equations) are subscripted...
For the named array case:
etc. |
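For the named-array case, a subscripted constant might look something like this (illustrative only; the coordinate names are invented):

from xray import DataArray

def constant():
    return DataArray([[1, 2], [3, 4], [5, 6]],
                     coords={'suba': ['suba1', 'suba2', 'suba3'],
                             'subb': ['subb1', 'subb2']},
                     dims=['suba', 'subb'])

constant().loc['suba2', 'subb1']   # -> 3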
Here's another issue - in function calls, we can either enumerate the specific subscript names as separate parameters or lump them all into an array:
vs.
The second approach is cleaner, especially for long equations, but less explicit. We could alternatively try some sort of hybrid, with syntax akin to what is described in this stack exchange post. Then we could be explicit in the function definition, but implicit in the calls within the function. |
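Roughly, the two calling conventions under discussion, with throwaway numpy stand-ins for the model components:

import numpy as np

suba_index = {'suba1': 0, 'suba2': 1, 'suba3': 2}
subb_index = {'subb1': 0, 'subb2': 1}
constant_values = np.array([[1, 2], [3, 4], [5, 6]])
stock_values = np.ones((3, 2))

# a) enumerate the specific subscript names as separate parameters
def flow_explicit(suba, subb):
    return constant_values[suba_index[suba], subb_index[subb]] * \
           stock_values[suba_index[suba], subb_index[subb]]

# b) lump them all into an array and operate on the whole thing at once
def flow_implicit():
    return constant_values * stock_values

flow_explicit('suba1', 'subb2')   # one element at a time
flow_implicit()                   # the full 3x2 array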
Wow, quite a few options here! I agree, sandboxing some of these seems the way to go. I'm not sure how scalable the python nested dict or list of lists (of lists, possibly) would be... but the overhead of using pandas or xray is a really good point, especially if they need to be instantiated on every call. Maybe it's possible to store these as attributes instead... but that would be a huge change, I think. The unnamed ndarray option seems to be more flexible than named, and can still broadcast and aggregate along arbitrary dimensions/ranges. The hybrid approach for function calls/definitions seems like a good idea - then we can see the dims in the definition of anything that is subscripted. Once we get tests together we can see how these work out. |
We might be able to get away with not instantiating variables every call, but setting them up once and modifying their values. For the constants, we could define them as attributes of the function:
We could avoid re-instantiating variables within the state dictionary just by coding them properly, though that doesn't totally avoid the overhead. If we decide to go with the unnamed array, instead of doing an 'index' type lookup, we could define the index externally. Something else to check for speed.
I'm not sure if it's as elegant, but it may be faster... |
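A minimal sketch of the function-attribute idea, assuming we keep the index lookup outside the per-call work (names are illustrative):

import numpy as np

def constant(suba, subb):
    # the array and the index live on the function object, built only once
    return constant.values[constant.suba_index[suba], constant.subb_index[subb]]

constant.values = np.array([[1, 2], [3, 4], [5, 6]])
constant.suba_index = {'suba1': 0, 'suba2': 1, 'suba3': 2}
constant.subb_index = {'subb1': 0, 'subb2': 1}

constant('suba3', 'subb1')   # -> 5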
I made a super-basic demo of how we could implement this using unnamed arrays, in the sandbox. Basically I pulled out the functional pieces of pysd that are needed for the demo and put them in their own file, separate from the project. It's written for clarity, not speed, so we should refactor before comparing with other options in any speed tests, specifically the components associated with packing and unpacking the state vector. The biggest question this brought up is whether we should broadcast the subscript function calls in the model file or in the template class. Right now I have it in the dstock function. It may be good to have it there, because some stocks won't be broadcast to all subscripts, and we need a way to work that out. @pbreach - how would you feel about modifying this to make a subscript demo using xray? |
@JamesPHoughton Sure thing. Just finished adding it, but it is extremely slow. I tried to profile it, and it seems like indexing into the DataArray is where most of the time goes. It would be nice if it weren't so slow, seeing as supporting the rest of the subscripting functionality would be trivial to implement as long as the translator can pick up on it... Maybe I missed something. |
How much overhead are you getting? Profiling the current unnamed arrays version, creating the array structure inside the constant function is the biggest contributor. If I pull that out (make it an attribute of the function, as in this gist), then we're back to the case where the model functions themselves are the biggest contributor.
I'm also rebuilding the arrays when I re-pack the state dictionary in set_state. With the numpy array it doesn't seem to be a big part of the bottleneck (maybe only because it's called fewer times?), but you might have more overhead there, and refactoring to modify instead of recreate the structure might help... |
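For example, overwriting the existing array's buffer rather than building a new array on every set_state call - a sketch, not the actual pysd code:

import numpy as np

state = {'stock': np.ones((3, 2))}

def set_state(state_vector):
    # rebuilding allocates a new array on every call:
    #   state['stock'] = np.array(state_vector).reshape(3, 2)
    # modifying in place reuses the existing array's memory:
    state['stock'].flat[:] = state_vector

set_state([1, 2, 3, 4, 5, 6])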
OK, now I've added in the example. The kind of overhead I am getting is ~8s to integrate, as opposed to ~0.016s with the unnamed example above it. I tried taking a few things out to narrow it down. Check it out and let me know what you think. Personally, I think the unnamed arrays are looking pretty promising based on the example and results. |
Interesting! I pulled out the array instantiation in the constant function; that accounts for about half of the overhead. The other half seems to be some sort of repeating call cycle inside xray. Possibly a bug in xray? |
Found the problem! The regression happens when setting values in set_state. Comparing the timings from the unnamed arrays and from using xray, you can see there is a bit of overhead in using xray (it takes ~2X longer in this example). My guess is that more equations will lead to longer run times using xray compared to the unnamed arrays. There would probably be a performance benefit once the arrays get really big... but I think it's unlikely that we will be dealing with subscripted arrays that don't fit in memory, for example. The only advantage would be easier-to-implement subscripting functionality, but I think it would not be too much harder with the unnamed arrays. |
That's the same order of magnitude, which is helpful. It might make sense to sandbox out some of the other subscript operations (maybe just the aggregation functions?) before we commit to one path or the other. It would also make sense to sandbox a numba/cython/theano type speedup operation, to see how easy it is to do in each setting. I'll have a look at that. I also like the idea of keeping all of the array name values in a single nested dictionary, as opposed to spreading them around. |
Here are the updated timings, from the unnamed arrays and from using xray:
They're slightly faster because I'm on my office computer. Looks pretty similar now, but you're right it would be ideal to test out the other subscripting functionality. It will also be interesting to see how we can fit this into numba/cython/theano. |
Had some fun today sandboxing what would happen if, instead of calling each subscripted element through its own function, we do the math on whole arrays at once. In the sandbox example:
The code actually turns out to be simpler, and faster. (9.09 ms per loop vs. 16.6 ms per loop) However, I'm not sure if it'll work for some of the more complex subscripting constructs. I think we should make the sandbox examples more complex - try the aggregation or range functions, and see if that influences our decision making... |
We might be able to do the same sort of thing using xray... |
I added the equivalent xray example of the array-based operation and fixed option 2 in #43. It doesn't seem too slow, but I think things will start to get interesting once we move onto some more complicated cases. Would it make sense to come up with a simple example model using some of these aggregation functions to test? Maybe something like the teacup example, except with multiple teacups. The total cooling of the teacups could contribute to temperature of the room... Or maybe something simpler? |
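A rough sketch of what a 'multiple teacups' test model could look like, just to make the aggregation concrete - all names and numbers here are invented:

import numpy as np

characteristic_time = 10.0
teacup_temps = np.array([100.0, 90.0, 80.0])   # one subscript element per teacup
room_temp = 20.0

def heat_loss_to_room(teacups, room):
    return (teacups - room) / characteristic_time    # per-teacup (subscripted) flow

def total_heat_flow(teacups, room):
    return heat_loss_to_room(teacups, room).sum()    # aggregation into the room stock

# one euler step, just to show the two stocks interacting
dt = 0.25
flows = heat_loss_to_room(teacup_temps, room_temp)
teacup_temps = teacup_temps - flows * dt
room_temp = room_temp + flows.sum() * dt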
I can send you a couple of Vensim models (simple and moderate), if you want |
@pbreach nice! Where do you think the extra overhead comes from? I think doing some aggregation tests would be a great idea - I'm not sure they even have to be as complex as the teacup example. What about something like this test. @BaharEs - It would be great to see your models - how would you feel about including them in a set of sample models in https://github.com/SDXorg/test-models? Alternatively, if you think they're simple enough to be used as unit-test type models, we could include them directly with the test suite... |
I tried out some aggregation functions (sum, max) in the sandbox. Xray definitely wins for being explicit, and the matrix-style operations are a good boost in speed. If we're really worrying about speed at this point, though, perhaps we should look at how we can integrate with the python speedups referenced in #12. |
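For reference, the two aggregation styles being compared look roughly like this (element names are illustrative):

import numpy as np
from xray import DataArray

stock = np.array([[1, 2], [3, 4], [5, 6]])
stock.sum(axis=1)         # unnamed: you have to remember which axis is which
stock.max(axis=0)

named = DataArray(stock,
                  coords={'suba': ['suba1', 'suba2', 'suba3'],
                          'subb': ['subb1', 'subb2']},
                  dims=['suba', 'subb'])
named.sum(dim='subb')     # named: the dimension being collapsed is explicit
named.max(dim='suba')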
I think at this point we should go with option 4 - the xray case. It's more explicit, and I think it will be easier to develop/debug. If we get to the point where we need the extra speed and we can't get it out of xray, we'll go back to option 3 - they are fairly similar. I've created a branch to do this dev in, and added the functionality around packing/unpacking. The tests pass as well as they did before (some bugs in other places). I think if we hand-code some models, we should be able to test the subscript functionality. Then we can work on the translation side. |
@JamesPHoughton Well, I tried the same thing using pandas dataframes and it seems to be even slower (~3-4x) than xray. My guess is that the overhead is in the construction of the DataFrame objects. I like the aggregation test you put together. I'm not sure what the status is on incorporating subscript parsing (I see you've started bringing in ANTLR, which is great!), but some more hand-coded models would be good. If I have time I might be able to put some together, similar to how you did in the sandbox. For speed-ups, I think xray will be great for working with large data (it still has to fit on one machine); it supports using dask as a back-end for doing array ops on disk in chunks. |
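Something like the following, using the chunking API as it exists in current xarray (older xray releases may differ, and dask must be installed):

import numpy as np
from xray import DataArray

big = DataArray(np.random.rand(100000, 50), dims=['suba', 'subb'])
chunked = big.chunk({'suba': 10000})     # dask-backed; operations become lazy
total = chunked.sum(dim='subb').values   # evaluated chunk by chunk when .values is requested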
Floating questions:
|
Use numpy |
Just to make sure things don't get lost as we play more in the sandbox, here are the 4 versions mentioned above. All have the same dependencies:
from __future__ import division
import numpy as np
from pysd import functions
from scipy.integrate import odeint
import itertools

Option 1: Calling through unnamed arrays, one function evaluation per array element.
class MinModel(object):
    ########## boilerplate stuff from the existing pysd #########
    def __init__(self):
        self._stocknames = [name[:-5] for name in dir(self) if name[-5:] == '_init']
        self._stocknames.sort()  # inplace
        self._dfuncs = [getattr(self, 'd%s_dt' % name) for name in self._stocknames]
        self.state = dict(zip(self._stocknames, [None]*len(self._stocknames)))
        self.reset_state()
        self.functions = functions.Functions(self)

    def reset_state(self):
        """Sets the model state to the state described in the model file. """
        self.t = self.initial_time()  # set the initial time
        retry_flag = False
        for key in self.state.keys():
            try:
                self.state[key] = eval('self.'+key+'_init()')  # set the initial state
            except TypeError:
                retry_flag = True
        if retry_flag:
            self.reset_state()  # potential for infinite loop!

    ########### Stuff we have to modify to make subscripts work #########
    def d_dt(self, state_vector, t):
        """The primary purpose of this function is to interact with the integrator.
        It takes a state vector, sets the state of the system based on that vector,
        and returns a derivative of the state vector
        """
        self.set_state(state_vector)
        self.t = t
        derivative_vector = []
        for func in self._dfuncs:
            derivative_vector += list(func())
        return derivative_vector

    def set_state(self, state_vector):
        i = 0
        for key in self._stocknames:
            if isinstance(self.state[key], np.ndarray):
                size = self.state[key].size
                elements = state_vector[i:i+size]
                shape = self.state[key].shape
                self.state[key] = np.array(elements).reshape(shape)
                i += size
            else:
                self.state[key] = state_vector[i]
                i += 1

    def get_state(self):
        # if we keep this, we should make it fully a list comprehension
        state_vector = []
        for item in [self.state[key] for key in self._stocknames]:
            if isinstance(item, np.ndarray):
                state_vector += list(item.flatten())
            else:
                state_vector += [item]  # scalar state
        return state_vector

    ######### model specific components (that go in the model file)
    suba_list = ['suba1', 'suba2', 'suba3']
    subb_list = ['subb1', 'subb2']
    suba_index = dict(zip(suba_list, range(len(suba_list))))
    subb_index = dict(zip(subb_list, range(len(subb_list))))

    def stock(self, suba, subb):
        return self.state['stock'][self.suba_index[suba]][self.subb_index[subb]]

    def stock_init(self):
        return np.array([[1, 1], [1, 1], [1, 1]])

    def dstock_dt(self):
        return [self.flow(suba, subb)
                for suba, subb in itertools.product(self.suba_list, self.subb_list)]

    def constant(self, suba, subb):
        return self.constant.values[self.suba_index[suba]][self.subb_index[subb]]
    constant.values = np.array([[1, 2], [3, 4], [5, 6]])

    def flow(self, suba, subb):
        return self.constant(suba, subb) * self.stock(suba, subb)

    def initial_time(self):
        return 0

a = MinModel()
----
%%timeit
a.reset_state()
odeint(a.d_dt, a.get_state(), range(10))
100 loops, best of 3: 16.4 ms per loop

Option 2: Using xray, one function call per array element
from xray import DataArray
class xMinModel(object):
    ########## boilerplate stuff from the existing pysd #########
    def __init__(self):
        self._stocknames = [name[:-5] for name in dir(self) if name[-5:] == '_init']
        self._stocknames.sort()  # inplace
        self._dfuncs = [getattr(self, 'd%s_dt' % name) for name in self._stocknames]
        self.state = dict(zip(self._stocknames, [None]*len(self._stocknames)))
        self.reset_state()
        self.functions = functions.Functions(self)

    def reset_state(self):
        """Sets the model state to the state described in the model file. """
        self.t = self.initial_time()  # set the initial time
        retry_flag = False
        for key in self.state.keys():
            try:
                self.state[key] = eval('self.'+key+'_init()')  # set the initial state
            except TypeError:
                retry_flag = True
        if retry_flag:
            self.reset_state()  # potential for infinite loop!

    ########### Stuff we have to modify to make subscripts work #########
    def d_dt(self, state_vector, t):
        """The primary purpose of this function is to interact with the integrator.
        It takes a state vector, sets the state of the system based on that vector,
        and returns a derivative of the state vector
        """
        self.set_state(state_vector)
        self.t = t
        derivative_vector = []
        for func in self._dfuncs:
            derivative_vector += list(func())
        return derivative_vector

    def set_state(self, state_vector):
        i = 0
        for key in self._stocknames:
            if isinstance(self.state[key], DataArray):
                shape = self.state[key].shape
                size = self.state[key].size
                self.state[key].loc[:, :].values = np.array(state_vector[i:i+size]).reshape(shape)
                i += size
            else:
                self.state[key] = state_vector[i]
                i += 1

    def get_state(self):
        # if we keep this, we should make it fully a list comprehension
        state_vector = []
        for item in [self.state[key] for key in self._stocknames]:
            if isinstance(item, DataArray):
                state_vector += list(item.values.flatten())
            else:
                state_vector += [item]  # scalar state
        return state_vector

    ######### model specific components (that go in the model file)
    dim_dict = {'suba': ['suba1', 'suba2', 'suba3'],
                'subb': ['subb1', 'subb2']}

    def stock(self, suba, subb):
        return self.state['stock'].loc[suba, subb].values

    def stock_init(self):
        return DataArray([[1, 1], [1, 1], [1, 1]], self.dim_dict)

    def dstock_dt(self):
        return [self.flow(suba, subb)
                for suba, subb in itertools.product(*self.dim_dict.values())]

    def constant(self, suba, subb):
        # values = DataArray([[1,2],[3,4],[5,6]], self.dim_dict)
        return self.constant.values.loc[suba, subb].values
    constant.values = DataArray([[1, 2], [3, 4], [5, 6]], dim_dict)

    def flow(self, suba, subb):
        return self.constant(suba, subb) * self.stock(suba, subb)

    def initial_time(self):
        return 0

------
a = xMinModel()
%%timeit
a.reset_state()
odeint(a.d_dt, a.get_state(), range(10))
10 loops, best of 3: 35.2 ms per loop

Option 3: Unnamed arrays, matrix arithmetic
class fMinModel(object):
    ########## boilerplate stuff from the existing pysd #########
    def __init__(self):
        self._stocknames = [name[:-5] for name in dir(self) if name[-5:] == '_init']
        self._stocknames.sort()  # inplace
        self._dfuncs = [getattr(self, 'd%s_dt' % name) for name in self._stocknames]
        self.state = dict(zip(self._stocknames, [None]*len(self._stocknames)))
        self.reset_state()
        self.functions = functions.Functions(self)

    def reset_state(self):
        """Sets the model state to the state described in the model file. """
        self.t = self.initial_time()  # set the initial time
        retry_flag = False
        for key in self.state.keys():
            try:
                self.state[key] = eval('self.'+key+'_init()')  # set the initial state
            except TypeError:
                retry_flag = True
        if retry_flag:
            self.reset_state()  # potential for infinite loop!

    ########### Stuff we have to modify to make subscripts work #########
    def d_dt(self, state_vector, t):
        """The primary purpose of this function is to interact with the integrator.
        It takes a state vector, sets the state of the system based on that vector,
        and returns a derivative of the state vector
        """
        self.set_state(state_vector)
        self.t = t
        derivative_vector = []
        for func in self._dfuncs:
            res = func()
            if isinstance(res, np.ndarray):
                res = res.flatten()
            derivative_vector += list(res)
        return derivative_vector

    def set_state(self, state_vector):
        i = 0
        for key in self._stocknames:
            if isinstance(self.state[key], np.ndarray):
                size = self.state[key].size
                elements = state_vector[i:i+size]
                shape = self.state[key].shape
                self.state[key] = np.array(elements).reshape(shape)
                i += size
            else:
                self.state[key] = state_vector[i]
                i += 1

    def get_state(self):
        # if we keep this, we should make it fully a list comprehension
        state_vector = []
        for item in [self.state[key] for key in self._stocknames]:
            if isinstance(item, np.ndarray):
                state_vector += list(item.flatten())
            else:
                state_vector += [item]  # scalar state
        return state_vector

    ######### model specific components (that go in the model file)
    def stock(self):
        return self.state['stock']

    def stock_1d_sum(self):
        return self.stock().sum(axis=1)

    def stock_1d_max(self):
        return self.stock().max(axis=0)

    def stock_init(self):
        return np.array([[1, 1], [1, 1], [1, 1]])

    def dstock_dt(self):
        return self.flow()

    def constant(self):
        return np.array([[1, 2], [3, 4], [5, 6]])

    def flow(self):
        return self.constant() * self.stock()

    def initial_time(self):
        return 0

----
a = fMinModel()
%%timeit
a.reset_state()
odeint(a.d_dt, a.get_state(), range(10))
100 loops, best of 3: 8.51 ms per loop

Option 4: Xray and matrix multiplication
from xray import DataArray
class xMinModel(object):
    ########## boilerplate stuff from the existing pysd #########
    def __init__(self):
        self._stocknames = [name[:-5] for name in dir(self) if name[-5:] == '_init']
        self._stocknames.sort()  # inplace
        self._dfuncs = [getattr(self, 'd%s_dt' % name) for name in self._stocknames]
        self.state = dict(zip(self._stocknames, [None]*len(self._stocknames)))
        self.reset_state()
        self.functions = functions.Functions(self)

    def reset_state(self):
        """Sets the model state to the state described in the model file. """
        self.t = self.initial_time()  # set the initial time
        retry_flag = False
        for key in self.state.keys():
            try:
                self.state[key] = eval('self.'+key+'_init()')  # set the initial state
            except TypeError:
                retry_flag = True
        if retry_flag:
            self.reset_state()  # potential for infinite loop!

    ########### Stuff we have to modify to make subscripts work #########
    def d_dt(self, state_vector, t):
        """The primary purpose of this function is to interact with the integrator.
        It takes a state vector, sets the state of the system based on that vector,
        and returns a derivative of the state vector
        """
        self.set_state(state_vector)
        self.t = t
        derivative_vector = []
        for func in self._dfuncs:
            res = func()
            if isinstance(res, DataArray):
                res = res.values.flatten()
            derivative_vector += list(res)
        return derivative_vector

    def set_state(self, state_vector):
        i = 0
        for key in self._stocknames:
            if isinstance(self.state[key], DataArray):
                shape = self.state[key].shape
                size = self.state[key].size
                self.state[key].loc[:, :].values = np.array(state_vector[i:i+size]).reshape(shape)
                i += size
            else:
                self.state[key] = state_vector[i]
                i += 1

    def get_state(self):
        # if we keep this, we should make it fully a list comprehension
        state_vector = []
        for item in [self.state[key] for key in self._stocknames]:
            if isinstance(item, DataArray):
                state_vector += list(item.values.flatten())
            else:
                state_vector += [item]  # scalar state
        return state_vector

    ######### model specific components (that go in the model file)
    dim_dict = {'suba': ['suba1', 'suba2', 'suba3'],
                'subb': ['subb1', 'subb2']}

    def stock(self):
        return self.state['stock']

    def stock_1d_sum(self):
        return self.stock().sum(dim='suba')

    def stock_1d_max(self):
        return self.stock().max(dim='subb')

    def stock_init(self):
        return DataArray([[1, 1], [1, 1], [1, 1]], self.dim_dict)

    def dstock_dt(self):
        return self.flow()

    def constant(self, *args):
        return DataArray([[1, 2], [3, 4], [5, 6]], self.dim_dict)

    def flow(self):
        return self.constant() * self.stock()

    def initial_time(self):
        return 0

a = xMinModel()
----
%%timeit
a.reset_state()
odeint(a.d_dt, a.get_state(), range(10))
10 loops, best of 3: 21.2 ms per loop |
It turns out that the flattening and unflattening take a good chunk of time, as does setting up the call to the scipy integrator.

Option 5: matrix math, Euler integration on the raw state dictionary
class eMinModel(object):
    def __init__(self):
        self._stocknames = [name[:-5] for name in dir(self) if name[-5:] == '_init']
        self._stocknames.sort()
        self._dfuncs = {name: getattr(self, 'd%s_dt' % name) for name in self._stocknames}
        self.state = dict(zip(self._stocknames, [None]*len(self._stocknames)))
        self.reset_state()
        self.functions = functions.Functions(self)

    def reset_state(self):
        self.t = self.initial_time()  # set the initial time
        retry_flag = False
        for key in self.state.keys():
            try:
                self.state[key] = eval('self.'+key+'_init()')  # set the initial state
            except TypeError:
                retry_flag = True
        if retry_flag:
            self.reset_state()  # potential for infinite loop!

    def step(self, dt):
        newstate = {}
        for key in self._stocknames:
            newstate[key] = self._dfuncs[key]() * dt + self.state[key]
        self.state = newstate
        self.t = self.t + dt

    def integrate(self, timesteps):
        outputs = [None] * len(timesteps)
        for i, t2 in enumerate(timesteps):
            self.step(t2 - self.t)
            outputs[i] = self.state  # step() has already updated the state
        return outputs

    ######### model specific components (that go in the model file)
    def stock(self):
        return self.state['stock']

    def stock_init(self):
        return np.array([[1, 1], [1, 1], [1, 1]])

    def dstock_dt(self):
        return self.flow()

    def constant(self):
        return np.array([[1, 2], [3, 4], [5, 6]])

    def constant2(self):
        return 6

    def flow(self):
        return self.constant() * self.stock()

    def initial_time(self):
        return 0

a = eMinModel()
----
%%timeit
a.reset_state()
a.integrate(np.arange(0, 10, .1))
1000 loops, best of 3: 704 µs per loop

For such a simple model, this turns out to be significantly faster, contrary to my expectation. |
Basic subscript definition and evaluation is functional as of 08468b7, with tests of basic functionality passing. I'm going to go ahead and close this issue, as it has mostly been a discussion about how subscripting should be implemented, and I think that is by and large worked out at this point. There are two major pieces of functionality still missing, and I've created separate issues for them so we can decide how to handle them individually:
It would be great to get some help on these if you guys have ideas. |
A number of the most interesting models utilize subscripts, and so if PySD is to be useful in their analysis, it should support this feature.
What we describe as 'subscripts' encompasses a number of different features:
We should start with only a subset of these features. (An outstanding question is whether we want to ever support all of them. A good goal would be to match the functionality of the XMILE schema.)
Subscripts have different behaviors in a number of different contexts: