💡 For-loop syntax? #72

Closed
liamhuber opened this issue Nov 7, 2023 · 9 comments

@liamhuber
Member

I want to migrate our discussion on for-loop syntax from the meeting discussion to its own issue, as it's getting a bit in-depth.

I had posted an example calculation that used the current (and not at all good) for-loop syntax:

wf.ev = wf.create.meta.for_loop(
    loop_body_class=strained_energy,
    length=len(strains),  # I don't like that this is not dynamic
    iterate_on="strain"
)(
    name="Al", 
    STRAIN=strains.tolist()
)

@JNmpi replied:

Please find below some syntax suggestions which I would like to have and which (at least to me) appear to be intuitive:

Replace

wf.ev = wf.create.meta.for_loop(
    loop_body_class=strained_energy,
    length=len(strains),  # I don't like that this is not dynamic
    iterate_on="strain"
)(
    name="Al",
    STRAIN=strains.tolist()
)

by:

     wf.ev = strained_energy(name='Al', STRAIN=np.linspace(-0.05, 0.05, 5))

An alternative syntax would be:

     wf.ev = strained_energy(name='Al', strain__array=np.linspace(-0.05, 0.05, 5))

or

     wf.ev = strained_energy(name='Al', strain__list=np.linspace(-0.05, 0.05, 5))

The implementation (resolving the **kwargs argument) should be rather straightforward. In one of my code examples (iter_pool) I implemented and used such a construction. If we later get type checking, we could use the generic variable (strain) rather than one of the modified constructs (STRAIN, strain__array, etc.).

(Note that the registration aside is in its own issue, #71)

Then @samwaseda replied:

With @JNmpi's suggestion, I fear that the subsequent node might not know whether to treat the array as one item or to loop over each element, especially in this example, since in pyiron we can set a float for a hydrostatic strain and an array for uniaxial strains. I would probably suggest something like

wf.ev = strained_energy(name='Al', STRAIN=wf.create.array(np.linspace(-0.05, 0.05, 5)))

I have to look into the actual code to see how its implementation could look...

@liamhuber
Member Author

First, to Sam's comment: I agree that it is necessary to have some explicit instruction that the input should be looped over, and for exactly the reason he supplies: otherwise, any input that happens to be iterable could easily cause confusion. And anyhow, "explicit is better than implicit".
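
For illustration only, here is a minimal sketch of the kind of explicit marker Sam's wf.create.array suggestion implies; the Array class and wants_loop helper below are invented names, not the actual API:

from dataclasses import dataclass
from typing import Sequence

@dataclass
class Array:
    """Marker wrapper: "please loop over these values" rather than "pass them whole"."""
    values: Sequence

def wants_loop(value) -> bool:
    # Downstream code only loops when the user explicitly asked for it
    return isinstance(value, Array)

print(wants_loop(Array([-0.05, 0.0, 0.05])), wants_loop([-0.05, 0.0, 0.05]))  # True False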

Then I want to look at the main suggestion:

wf.ev = strained_energy(name='Al', STRAIN=np.linspace(-0.05, 0.05, 5))

Pros:

  • Extremely easy for users to leverage
  • Intent is pretty clear (could make it clearer with one of the strain__array alternatives, but I like all-caps)
  • It's familiar, since this is what we have in `ironflow`.
  • Easy to incorporate in a GUI by adding a batch button next to input (again, as in ironflow)

The cons then depend on what we're doing under the hood to actually make this behave the way it so intuitively looks like it should. I'm afraid I can't find iter_pool anywhere in #33, so I can't comment on it; maybe it already matches one of the things I'm about to suggest very well.

I see two major routes:

1) Empower Node to "batch" input

This is what we do over in ironflow. The fundamental "run" code for each and every node goes through checks to see if any of its input is batched, and if so leverages loops over its "normal" running procedure, collects up all the output, etc.

Pros (above and):

  • The node you're asking for is the node you get

Cons:

  • Every node is carrying around a bunch of extra complicated code for managing loops, even the ones that never use it. Not the end of the world, but it makes Node a lot heavier.
  • What if we want to distribute loop tasks to different executors? This too would need to be embedded right in the core Node functionality. Again, it's possible but heavy.
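
For concreteness, here is a toy, self-contained sketch of what option 1 could look like; none of the names below (Channel, BatchableNode, _run_once) are the real pyiron_workflow API, and the energy function is just a stand-in:

from dataclasses import dataclass
from typing import Any

@dataclass
class Channel:
    value: Any = None
    batched: bool = False  # flag meaning "loop over this value instead of passing it whole"

class BatchableNode:
    """Toy version of option 1: the node itself checks for batched input and loops."""

    def __init__(self, **values):
        self.inputs = {k: Channel(v) for k, v in values.items()}

    def _run_once(self):
        raise NotImplementedError  # a concrete node would do its real work here

    def run(self):
        batched = [k for k, c in self.inputs.items() if c.batched]
        if not batched:
            return self._run_once()
        # Zip the batched channels together and collect one output per element
        outputs = []
        for values in zip(*(self.inputs[k].value for k in batched)):
            for k, v in zip(batched, values):
                self.inputs[k].value = v
            outputs.append(self._run_once())
        return outputs

class StrainedEnergy(BatchableNode):
    def _run_once(self):
        return 42.0 * self.inputs["strain"].value ** 2  # stand-in for a real calculation

node = StrainedEnergy(strain=[-0.05, 0.0, 0.05])
node.inputs["strain"].batched = True
print(node.run())  # one (fake) energy per strain value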

2) Replace the node with a macro

I.e. when an all-caps input kwarg is detected, or some other instruction to loop over the input is given, do an in-place replacement of the relevant node with a for-loop macro (similar to my original example).

Pros (above and):

  • At the end of the day you wind up with a proper macro, so you can use any and all Macro tools on it (e.g. setting executors separately for children!)
  • The heavy-lifting for handling for-loops is not right in Node, but well-encapsulated somewhere else

Cons:

  • Suuuuuper implicit, just horribly, horrendously implicit. You start with some function node (or maybe macro) and then silently end with a macro that encases that node!
  • Still wind up needing to implement an adaptive structure in the underlying for-loop macro, so it can change its number of children when the length of the batched input changes. (This isn't bad, IMO, it's just work we'd need to do)
  • Only works when the node is owned by a macro or workflow -- I don't think there's a good or clean way to replace self if it's just held in a variable right in the process memory. So this creates an asymmetry between parented and un-parented nodes.

Alternative

I would suggest an alternative that is sort of a "nice" version of my original example:

wf.ev = wf.create.standard.for(strained_energy, name='Al', STRAIN=np.linspace(-0.05, 0.05, 5))

That leverages some nice interpretation of the kwargs to be totally equivalent to this:

wf.ev = wf.create.standard.for(strained_energy, loop_on="strain")
wf.ev.name = "Al"
wf.ev.STRAIN = np.linspace(-0.05, 0.05, 5)

This differs from the current setup in that (1) we create a node directly instead of first creating a class then instantiating it -- super easy change, and (2) it would need to be a dynamic macro that rebuilds its child graph according to the length of the batched input.

I don't think this is quite as simple as invoking the graph_creator at each run or each input change, but that would be the starting point. With recent changes so that the macros are "walled gardens", I think we have all the tools to do this pretty straightforwardly. It's a bit of a change in philosophy from macros being perfectly "crystallized" and static workflows, but personally it doesn't faze me as the interface to the nodes stays static. We could consider introducing some new Dynamic(Macro) class to handle this sort of thing, but I'd start by implementing it on a case-by-case basis right from Macro.
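
To make the "dynamic macro" idea concrete, here is a very rough, framework-free sketch; ForMacro, _rebuild, and the stand-in strained_energy function are invented for illustration and are not the actual Macro machinery:

class ForMacro:
    """Rebuilds its 'children' to match the length of the batched input."""

    def __init__(self, body, loop_on):
        self.body = body        # the node/function to replicate as children
        self.loop_on = loop_on  # label of the input channel being looped over
        self.children = []      # regenerated whenever the batched input changes

    def _rebuild(self, batched_values, fixed_kwargs):
        # One child per element; the macro's own IO interface stays fixed
        # even though the number of children changes.
        self.children = [
            (lambda v=v: self.body(**{self.loop_on: v}, **fixed_kwargs))
            for v in batched_values
        ]

    def run(self, batched_values, **fixed_kwargs):
        self._rebuild(batched_values, fixed_kwargs)
        return [child() for child in self.children]

def strained_energy(strain, name):  # stand-in body
    return {"name": name, "energy": 42.0 * strain**2}

ev = ForMacro(strained_energy, loop_on="strain")
print(ev.run([-0.05, 0.0, 0.05], name="Al"))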

Something that's nice is that it plays very well with distributing separate executors to each looped process. I already want to add such a shortcut to Composite for quickly distributing executors to all children, so we could just leverage that. I'm thinking wf.ev.child_executor = ..., wf.ev.set_executor_in_children(...) or similar.

In terms of a GUI interface, I've updated an old sketch in the lego-mindstorms vein to reflect this code. Note the little "+" button for indicating that you want to batch over additional channel(s), and how input can either be looped over (by connecting to the for-macro) or used directly (by connecting directly to the held node). This lines up well with the idea that the actual number of children in the "dynamic" For-macro might change, which is fine as long as the IO interface stays consistent.

(Image: for_sketch -- GUI sketch of the for-loop macro, showing looped vs. direct input connections)

Pros:

  • Completely explicit -- want a for loop, ask for a for loop
  • Comes with all the power and utility of just having a macro (e.g. distributing executors among children)
  • Still pretty easy to leverage
  • Can have an un-parented for-loop just the same as any other un-parented macro
  • Lego-mindstorms-esque GUI

Cons:

  • Not as easy to leverage as above

Side-by-side

When considering

wf.ev = strained_energy(name='Al', STRAIN=np.linspace(-0.05, 0.05, 5))

vs

wf.ev = wf.create.standard.for(strained_energy, name='Al', STRAIN=np.linspace(-0.05, 0.05, 5))

I have to admit that the top one is easier to write, but I feel that it's worth paying the price of wf.create.standard.for() in order to gain the architectural advantages of the bottom one.

@JNmpi

JNmpi commented Nov 7, 2023

Thanks for raising this point, @samwaseda. I think the example shows the danger of having input parameters that do not have a specified type. The solution I would like to follow is, rather than trying to resolve the behavior of the function from the data type, to provide separate names for the various cases (strain_scalar or strain_hydrostatic, and strain_tensor). The example would then look like:

  wf.ev = strained_energy(name='Al', HYDROSTATIC_STRAIN=wf.create.numpy.linspace(-0.05, 0.05, 5))

Note that I have replaced the vector np.linspace(-0.05, 0.05, 5) by a node, since this allows us to extract the metadata of how the vector has been created (i.e., start=-0.05, end=0.05, etc.). This syntax/construction would also provide all the necessary information to construct the workflow while keeping the number of parameters minimal.

@JNmpi

JNmpi commented Nov 7, 2023

It looks like our replies have been submitted almost at the same time, @liamhuber. I therefore addressed only @samwaseda's comment. Please find below the code snippet for the exec_pool loop I mentioned in one of the previous replies:

def exec_pool(self, max_workers=1, **kwargs):
    from pympipool import Pool

    # Take the single keyword argument as the input channel to iterate over
    key = list(kwargs.keys())[0]
    val = kwargs[key]

    def _run(y):
        # Update the looped input channel and re-run the node
        self.inputs[key] = y
        return self.run()

    # Distribute the individual runs over a pool of workers
    with Pool(max_workers=max_workers) as p:
        out = p.map(func=_run, iterable=val)

    return out

I used it in connection with @jan-janssen's pympipool library. It was meant as a proof of concept and I did not intend it to be perfect code.

I like your new syntax for the for loop, i.e.:

  wf.ev = wf.create.standard.for(strained_energy, name='Al', STRAIN=np.linspace(-0.05, 0.05, 5))

Nevertheless, I have a few points where I am not perfectly happy:

  • In my opinion, a good syntax should not show redundancies, since this could give rise to errors when the redundant elements are not consistent. In the above case, for and STRAIN both indicate that STRAIN is a vector. I would therefore be strongly in favor of dropping the for. Dropping it would reflect the same behavior we have in numpy, where np.cos(x) works with x as a vector, without having to specify an explicit for loop.

  • In my seminars, I teach that avoiding for loops with indices is a good thing and computationally more efficient. In the age of numpy etc., an explicit for-loop statement appears a bit outdated. This is of course only a personal view.

  • As mentioned in my previous response, to avoid having a vector with 'magic' elements I would prefer a node vector generator rather than an explicit vector, i.e.:

     wf.ev = strained_energy(name='Al', HYDROSTATIC_STRAIN=wf.create.numpy.linspace(min=-0.05, max=0.05, steps=5))
    

@samwaseda
Member

wf.ev = wf.create.standard.for(strained_energy, name='Al', STRAIN=np.linspace(-0.05, 0.05, 5))

With this implementation I have two concerns (or rather: one concern expressed in two different ways)

  • It is not quite clear which variable the for stands for, i.e. the connection between linspace and for is not obvious (simply because they are not written in the same place)
  • More fundamentally: how would pyiron know that the for loop is for STRAIN and not for name? OK, in this case STRAIN is a list/array and name is not, but what if there are multiple lists in one function?

@liamhuber
Member Author

Thanks for the rapid iteration of this discussion, both!

@JNmpi,

I like your new syntax for the for loop, i.e.:
...
Nevertheless, I have a few points where I am not perfectly happy:

Super, let's use it as a starting point and improve from here then! (When we actually get to it that is, as it's lower priority than storage IMO)

In my opinion, a good syntax should not show redundancies, since this could give rise to errors when the redundant elements are not consistent. In the above case, for and STRAIN both indicate that STRAIN is a vector. I would therefore be strongly in favor of dropping the for. Dropping it would reflect the same behavior we have in numpy, where np.cos(x) works with x as a vector, without having to specify an explicit for loop.

This is a very valid concern. I would say there's an interesting asymmetry in the redundancy here: given STRAIN, I agree for is redundant; but given for we only know something is being iterated over and therefore STRAIN is still necessary.

In principle we can then drop the for as you suggest, which would bring us in line with your numpy example. However, this just brings us back to exactly the first suggestion:

wf.ev = strained_energy(name='Al', STRAIN=np.linspace(-0.05, 0.05, 5))

And I still have concerns about how implicit that is.

The big difference to the numpy example is that np.cos is a function, and it's always a function whether it operates on a scalar or a vector. In contrast, in this paradigm strained_energy(strain=0.05) is a functional node, but strained_energy(STRAIN=np.linspace(-0.05, 0.05, 5)) is a macro node, even though strained_energy is being called both times!

Maybe it's just OK to do this and I'm making a mountain out of a molehill, but my gut still says this is a messy idea.

As an aside, one way to implement such behaviour may be to modify Node.__new__, so that when we call strained_energy(STRAIN=...), before ever getting to __init__, __new__ parses the kwargs and checks whether any of them are flagged as vectorized versions of a regular input field; if not, then __new__ returns an instance of the regular class, but if a request for vectorization is detected it actually returns a for-macro wrapping that class instead!

I made a super simple example of this concept:

class Foo:
    def __new__(cls, *args, **kwargs):
        return super().__new__(cls)
    
    def __init__(self, x, bar):
        self.x = x
        self.y = x - 1
        self.bar = bar  # Our "body node"
        
class Bar:
    def __new__(cls, x):
        if x < 0:
            instance = Foo(x, cls)  # Our "for macro"
        else:
            instance = super().__new__(cls)
        return instance
        
    def __init__(self, x):
        self.x = x
        self.y = x + 1
        
bar = Bar(5)
print(bar.x, bar.y, type(bar))
>>> 5 6 <class '__main__.Bar'>

foobar = Bar(-5)
print(foobar.x, foobar.y, type(foobar), foobar.bar)
>>> -5 -6 <class '__main__.Foo'> <class '__main__.Bar'>

Anyhow, like I said, I don't think this is a good idea, but at least it is possible.

In my seminars, I teach that avoiding for loops with indices is a good thing and computationally more efficient. In the age of numpy etc., an explicit for-loop statement appears a bit outdated. This is of course only a personal view.

In some ways then I feel that explicitly having the ...for() is actually preferable, since under the hood it really is not doing something vectorized but is exactly doing a traditional for-loop (over child nodes).

As mentioned in my previous response, to avoid having a vector with 'magic' elements I would prefer a node vector generator rather than an explicit vector, i.e.:

This is a nice point. In this particular example I don't think it makes a huge difference, since there are only five elements, so a human can assess start, stop, and steps trivially easily. But for more complex data, or even just for linspaces that are too big to easily count, this becomes important.

In general I guess this is a sort of "best practice" -- like how Function nodes should be functional and idempotent -- that we should start working into examples.

@samwaseda

With this implementation I have two concerns (or rather: one concern expressed in two different ways)

Indeed, I think I can answer both at once. In this case, the value being iterated over is indicated by the fact that the kwarg matches key.upper() for some key in the inputs, rather than simply matching key. Instead of all-caps we could be more explicit and use f"{key}__vector" or similar. The critical thing is just that it deviates from the regular input channel labels in some systematic and (almost surely) unique way.

As an implementation detail, right now the IO channels are not actually accessible until after a node gets instantiated. So class For.__init__(body_class: type[Node], ..., **kwargs) would probably need to do something like instantiate a copy of the body, scrape the relevant input channel labels to compare against **kwargs keys, then delete this trial node before setting up its real set of children. In the future this might be easier if IO channels (or at least labels and type hints) become available at the class level. In either case, it's an implementation detail that can be tucked away inside the definition of for-loop nodes and isn't prohibitively expensive, so I'm not too worried about it.
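
A minimal sketch of that kwarg-splitting detail, with invented names (split_loop_kwargs, inputs.labels, _ToyBody) standing in for whatever the real classes expose:

def split_loop_kwargs(body_class, **kwargs):
    # Instantiate a throwaway copy of the body just to learn its input labels...
    trial = body_class()
    labels = set(trial.inputs.labels)
    del trial  # ...and discard it before building the real children

    looped, regular = {}, {}
    for key, value in kwargs.items():
        if key == key.upper() and key.lower() in labels:
            # ALL-CAPS kwarg matching an input label -> iterate over this value
            looped[key.lower()] = value
        elif key in labels:
            regular[key] = value
        else:
            raise TypeError(f"{key!r} matches no input channel of {body_class}")
    return looped, regular

class _ToyBody:  # stand-in with the assumed `inputs.labels` attribute
    class _IO:
        labels = ("strain", "name")
    inputs = _IO()

print(split_loop_kwargs(_ToyBody, STRAIN=[-0.05, 0.0, 0.05], name="Al"))
# ({'strain': [-0.05, 0.0, 0.05]}, {'name': 'Al'})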

Note that in the examples here we have been iterating over exactly one input field, but in general I would expect to be able to iterate over multiple, perhaps assuming that they should all have the same length and get zipped together (as is the case in ironflow), but maybe also allowing a flag to be set so they're nested instead?

@JNmpi

JNmpi commented Nov 8, 2023

@liamhuber, I fully agree that this topic has lower priority than issues such as storage. Nevertheless, I think we all feel that this is a crucial point and will affect the majority of workflows.

What I realize in our discussions is that we talk mainly about a specific scenario, where we could have fully independent execution of the iterations in the loop body. With respect to pyiron, this corresponds to the concept of the parallel master. In numpy, it would correspond to a map approach, where the numpy function is applied to each element of the input vector, array, etc. This is the case where, in my opinion, we do not need an explicit statement for the for-loop. I also do not see the need for a macro - a simple construct such as my exec_pool example (which should of course be renamed) could be part of the node.py module. This is what I did with the exec_pool implementation. The nice thing about this is that it also works readily for macro nodes, without having to add a single line of extra code.

Things get trickier when implementing workflows with a for-loop that does not run the elements independently. An example would be a convergence test where output from the previous node is needed as input into the next node. In this case, we probably need the full power of a loop-node, although we may still find a more intuitive syntax. This scenario is what our SerialMaster aimed to do, but we were never happy with the syntax. Again, low priority for the moment, but it does not hurt to continue the discussion.

@liamhuber
Member Author

liamhuber commented Nov 8, 2023

@liamhuber, I fully agree that this topic has lower priority than issues such as storage. Nevertheless, I think we all feel that this is a crucial point and will affect the majority of workflows.

Super, we're on the same page here. I just wanted to make sure that the very active conversation here didn't give the impression I was also actively coding this!

... This is the case where, in my opinion, we do not need an explicit statement for the for-loop. I also do not see the need for a macro - a simple construct such as my exec_pool example (which should of course be renamed) could be part of the node.py module. ...

Yes, I see what you're saying. This is exactly in the direction of my comment above under "1) Empower Node to "batch" input". I'm not totally closed to such a solution, but I do still have concerns about (a) the added complexity to the Node class making maintenance and learning harder, and (b) the on-the-fly type changes when a user decides to batch some input (both in terms of how implicit it is and might confuse them, and how it will interact with the rest of the type-checking system, e.g. existing connections).

I'm definitely willing to give it a try, I'm just not enthusiastic about it yet.

Things get trickier when implementing workflows with a for-loop that does not run the elements independently. An example would be a convergence test where output from the previous node is needed as input into the next node. In this case, we probably need the full power of a loop-node, although we may still find a more intuitive syntax. This scenario is what our SerialMaster aimed to do, but we were never happy with the syntax. Again, low priority for the moment, but it does not hurt to continue the discussion.

Yes, for sure. As soon as there is cyclicity in the data graph things just get super complicated. We could do this with a cyclic for loop, but I probably lean towards pushing people to using the while loop tools. At any rate, definitely more complicated, almost certainly needs a macro expression, and low priority for now.

I do think we want to keep in mind how our solution will handle multiple looped inputs -- zipping or nested loops? (I.e., the number of child runs stays at the batch length when zipping, but grows as the product of the lengths when nesting N looped variables.) This can probably be handled just fine by extending the __init__ signature with something like looping_mode: Literal["zip", "nest"] (or something more complex if we want to allow mixing zipping and nesting in the case of 3+ looped variables).
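
For illustration, a small standalone helper showing the difference between the two modes (the loop_values function and its looping_mode flag are hypothetical, not existing API):

from itertools import product
from typing import Literal

def loop_values(looped: dict, looping_mode: Literal["zip", "nest"] = "zip"):
    keys = list(looped)
    if looping_mode == "zip":
        # One run per index: all looped channels are assumed to have equal length
        combos = zip(*(looped[k] for k in keys))
    else:
        # One run per combination: the run count is the product of the lengths
        combos = product(*(looped[k] for k in keys))
    return [dict(zip(keys, values)) for values in combos]

print(loop_values({"strain": [-0.05, 0.05], "name": ["Al", "Cu"]}, "zip"))
# [{'strain': -0.05, 'name': 'Al'}, {'strain': 0.05, 'name': 'Cu'}]
print(loop_values({"strain": [-0.05, 0.05], "name": ["Al", "Cu"]}, "nest"))
# four dicts, one per (strain, name) combination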

@liamhuber liamhuber changed the title For-loop syntax? 💡 For-loop syntax? Jan 12, 2024
@liamhuber
Member Author

For-loops are now implemented to my satisfaction in #309. This still needs to be extended to be a method(s) right on StaticNode (or Node if we de-parent Workflow from Node...) that lets you generate these more easily, and some multiple dispatching to support using a node instance instead of a node class wouldn't hurt.

Anyhow, I will leave this issue open mostly because I really like the GUI sketch at the top, even if the core of having a decent for-loop syntax is already achieved.

@liamhuber
Member Author

Shortcut-access on StaticNode got added in #320, so this is done.
