
Async Support? #209

Open
tshanebuckley opened this issue Mar 7, 2022 · 16 comments · May be fixed by #215

Comments

@tshanebuckley

Hello! Happy to see some significant progress/updates have been made over the past few months on this project! I was curious if there are any plans to add async support to PyCap?

I've recently been working on some code that encapsulates PyCap and uses it to at least fetch records in an asynchronous fashion. Also played around with a FastAPI middleware to allow easier access to data in a more REST-like manner to support application development. (It's not very production-worthy as of yet, though)

If there is no async work in the pipelines yet, I would be happy to help contribute! :-D

@pwildenhain
Collaborator

I'm totally down! You caught me at the perfect time: I was literally considering releasing 2.0 some time this week or next (right now it's just a pre-release on PyPI).

A couple of follow-up questions:

  1. If we do decide to add async support, could this be done without breaking the existing API? I don't want to have to release a 3.0 right after releasing a 2.0 😂. If it requires breaking the existing API, then I could see holding off until we have async implemented.
  2. I'm guessing it can be done without breaking the existing API. Do you think it could be implemented such that it's an optional dependency? Like you only need package x if you want to use async with PyCap? That's what we've done with pandas, so it would be nice to continue that.
  3. Can you sketch a few examples of what it would look like for someone to use PyCap with async functionality?
  4. Can you describe the high-level implementation details? At least how you currently imagine it.

@tshanebuckley
Author

Let me see if I can answer these coherently:

1&4.) Totally! Really all I did in my implementation was slightly modify a PyCap Project object to have the "_call_api()" method return the payloads instead of actually running the request. Then I extend the payloads by records and chunk them into groups. Each chunk is then added to a list of tasks, and I use asyncio and aiohttp to submit all of the requests instead of using the requests module. My goal in building this out was to avoid reinventing the wheel and lean on PyCap, so I do think this is very possible! (There's a rough sketch of this chunk-and-gather pattern after these answers.)

2.) Potentially. Though it would probably mean a few extra dependencies (aiohttp, aiofiles, aiopath, maybe a few more or a few less). That said, I think a better implementation is to hide the async logic under the hood of synchronous functions and leave the PyCap function signatures relatively unchanged. I'm pretty sure this is how REDCapR works.

3.) Instead of my method, it could be possible to just add an "async" variable that defaults to None, but can be given a dictionary to act as kwargs for any extra info needed to run an async request. Like I said, I think the functions and their signatures could be left relatively unchanged.
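
To illustrate the chunk-and-gather pattern from 1&4 above, here's a minimal sketch. The fetch_chunk/export_all helpers and the idea of a payloads list coming back from a modified _call_api() are my own illustration, not PyCap's actual API:

import asyncio
import aiohttp

async def fetch_chunk(session, url, payload):
    # submit one chunked API payload and return the decoded JSON records
    async with session.post(url, data=payload) as resp:
        resp.raise_for_status()
        return await resp.json()

async def export_all(url, payloads):
    # run every chunked payload concurrently and flatten the results
    async with aiohttp.ClientSession() as session:
        chunks = await asyncio.gather(*(fetch_chunk(session, url, p) for p in payloads))
    return [record for chunk in chunks for record in chunk]

# payloads would come from the modified _call_api() returning payloads
# records = asyncio.run(export_all(api_url, payloads))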

Right now, running my version looks something like this (gc is an instance of my wrapper object):
r = asyncio.run(gc.exec_request(method='POST', selection_criteria={'fields': {'field_name'}}, rc_name='my_project', func_name='export_records'))

Though I think a better implementation would be to contain that within the object like this (in this case, gc would be the new and improved PyCap Project object):
gc.export_records(<usual PyCap arguments>, async=True)
and if the async argument is set to True (or maybe the kwargs idea mentioned before), return the coroutine(s).

Not sure if I explained all of my thoughts well enough here. Happy to set up a phone call or something over email if that would be helpful!
Here's the repo I've been working out of if it helps clear things up!

@pwildenhain
Collaborator

this looks sweet 🍰 very cool stuff on GREENCap 😎

About REDCapR: I've also used that package extensively, and I don't think there's any support for asynchronous execution. The closest I can think of is that redcap_read allows you to define batch sizing for records, but that's it. Am I missing something? Or is this what you were referring to?

I think the export_records(<usual PyCap arguments>, async=True) would be very slick. Just so I understand, if I had the following code:

from redcap import Project

proj = Project(url, token)

records = proj.export_records(async=True)
users = proj.export_users(async=True)

This means that my program wouldn't need to wait for records to be defined before it could make use of users, correct? Want to make sure I have a good grasp of the benefits before we invest the time in implementing async for PyCap.

@tshanebuckley
Author

You know...now that you mention it, I think you're right that it just chunks and runs synchronously. That would explain why I can pull data faster with GREENCap!

So Python relatively recently implemented a task-based asynchronous paradigm, built around the async and await keywords. The code would look more like this (or something close to it) if you didn't want these calls to be blocking:

from redcap import Project
import asyncio

proj = Project(url, token)

async def main():
    # initialize a list of asynchronous tasks
    tasks = list()

    # get your coroutines and schedule them as tasks
    records_task = asyncio.ensure_future(proj.export_records(async=True))
    users_task = asyncio.ensure_future(proj.export_users(async=True))

    # append the tasks to your task list
    tasks.append(records_task)
    tasks.append(users_task)

    # gather the tasks (technically you could pass them to gather directly and skip the appending above)
    return await asyncio.gather(*tasks)

# run the calls
records, users = asyncio.run(main())

What I had in mind was to leave the implementation looking the same as before:
from redcap import Project

proj = Project(url, token)

records = proj.export_records()
users = proj.export_users()

TL;DR: The goal would be to have calls like proj.export_records() work almost exactly as they do now, but with a faster implementation under the hood, and to give folks who are more advanced in programming the option to work directly with the async stuff if they want.

@pwildenhain
Collaborator

Awesome, I like adding extensibility/possibility for composition for people who want to opt into it. We can write some good docs as well about how to use the async features in tandem with libraries like asyncio.

I'd say you're good to go on getting started on the work. Let me know if you have any questions/trouble at all with setting up the development environment or running the tests.

@tshanebuckley
Author

Heard! May need some help with PyTest since that'll be new for me and I'm obviously gonna wanna make sure the methods work as they did before! 😄

Will probably create the fork tomorrow and should hopefully have time Friday to dive into this!

@pwildenhain
Collaborator

Awesome. I'm on a trip this weekend, so any questions you have I'll get to them next week

@tshanebuckley
Author

Spent some time on this over the weekend getting familiar with the repo and getting some pseudocode and a plan together for the refactor to run asynchronous requests while keeping the "user facing" project class relatively unchanged. For transparency (and my own sake 😄), here are my notes:

base.py:

  • add code to chunk some requests by record
  • update _return_data() (will do everything async with json/dicts and convert back to csv or xml)

method classes:

  • change method logic to exist within a sub function so that calls may be run over a list of the payloads chunked in the update to base.py (or something similar to this)

coroutine.py:

  • new object for holding the request logic within a coroutine
  • changes to the method classes result from the addition of this class; want to use the logic from the method objects to build a task list (an asynchronous task per payload)
  • much of the logic from request to be moved here and refactored

request.py:

  • execute() method changed to either return the coroutine or use asyncio.run() to return data as usual (see the sketch after these notes)

project.py:

  • may want to update to allow async initialization of the project (not an immediate priority)
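
As a rough illustration of the request.py change above, here's a minimal sketch; the return_coroutine flag name and the _run_async() helper are hypothetical placeholders, not existing PyCap code:

import asyncio

class Request:
    def execute(self, return_coroutine=False):
        # build the coroutine that performs the (possibly chunked) API calls
        coro = self._run_async()
        if return_coroutine:
            # advanced users can schedule/await this themselves
            return coro
        # default path: run it right away and return the data as before
        return asyncio.run(coro)

    async def _run_async(self):
        # placeholder for the aiohttp request logic moved into coroutine.py
        ...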

@pwildenhain
Collaborator

Fantastic, this all makes sense

How do you feel about adding an article to the documentation site as well? mkdocs is what builds the site and it just uses markdown, so it shouldn't be too bad to have an article about how to take advantage of the asynchronous execution.

Also, let me know if you need help updating tests at all. You won't be able to run any of the integration tests locally, since you don't have a super user token for the redcapdemo site (which allows you to create projects using the API). And the unit tests have a little bit of architecture behind them using the responses package, which mocks a REDCap server for us to test against. I refactored the unit tests to make them easier to update, but it's a lot to wrap your head around at first.
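
For reference, the core pattern the responses package enables looks roughly like this; the URL and JSON body are made up for illustration, and PyCap's real tests have more structure around them:

import responses
import requests

@responses.activate
def test_export_records_mocked():
    # register a fake REDCap API endpoint that returns one record
    responses.add(
        responses.POST,
        "https://redcap.example.org/api/",
        json=[{"record_id": "1", "field_1": "value"}],
        status=200,
    )

    # any requests call to that URL now hits the mock instead of a real server
    resp = requests.post("https://redcap.example.org/api/", data={"content": "record"})
    assert resp.json()[0]["record_id"] == "1"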

On a side note, how was the project to wrap your head around? Was the architecture easy to understand? I'm open to any other refactoring during this that would improve the package in general.

@tshanebuckley
Author

I think a markdown page on how to use the package asynchronously makes a lot of sense. And yeah, totally makes sense for me to handle that. Ran the test-suite locally and noticed the super user token issue. I'm not exactly new to unit testing, but definitely new to pytest, so I'll certainly be bugging you about that! The architecture was pretty easy to follow. Was just some mental gymnastics to figure out how I wanted to refactor this stuff.

Speaking of markdown and understanding the architecture...I am so hyped to learn that mermaid works in comments as well!
So my understanding of the current architecture is this:

classDiagram
      Base <|-- Method
      Method *-- Project
      Project --o Request

and I'll probably be updating it to something like this (I include the interface because I would like to add a ProjectStream object at some point for GreenCap, but that's down the road):

classDiagram
      Base <|-- Method
      Method *-- Project
      Project --o Request
      Request --o Coroutine
      Method *-- Coroutine
      IMethod <|-- Method
      <<interface>> IMethod
      IProject <|-- Project
      <<interface>> IProject

Not 100% on the Coroutine yet. I'm contemplating having the Method object be included in the Coroutine somehow, but am not married to the idea. Just trying to think through where it best makes sense to keep the Request validation logic. A bunch of the Request logic has to be moved into the Coroutine, so that bit is really what I have left to think about before the rubber meets the road on all of this.

I think adding async is enough in general, don't want to delay your release for too long! 😄 Although perhaps loading Project classes and re-used requests from YAML could be easy enough to add, and I think it's super handy. It lets users just update a config file instead of writing a new script. I had this set up in GreenCap, where the config stuff was saved under ~/.greencap.
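
A minimal sketch of what that YAML-driven setup could look like; the file layout, keys, and load_project() helper are hypothetical, loosely modeled on the ~/.greencap idea:

from pathlib import Path

import yaml
from redcap import Project

def load_project(name, config_dir=Path.home() / ".greencap"):
    # read a per-project config file like ~/.greencap/my_project.yaml
    # containing url and token keys
    with open(config_dir / f"{name}.yaml") as f:
        cfg = yaml.safe_load(f)
    return Project(cfg["url"], cfg["token"])

# proj = load_project("my_project")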

@pwildenhain
Collaborator

> I think adding async is enough in general, don't want to delay your release for too long! 😄 Although perhaps loading Project classes and re-used requests from YAML could be easy enough to add, and I think it's super handy. It lets users just update a config file instead of writing a new script. I had this set up in GreenCap, where the config stuff was saved under ~/.greencap.

This sounds like useful stuff -- also sounds like it could be included in a 2.1 release, since they're purely new features that wouldn't break the existing architecture.

> Not 100% on the Coroutine yet. I'm contemplating having the Method object be included in the Coroutine somehow, but am not married to the idea. Just trying to think through where it best makes sense to keep the Request validation logic. A bunch of the Request logic has to be moved into the Coroutine, so that bit is really what I have left to think about before the rubber meets the road on all of this.

I'm not terribly familiar with async stuff, but one idea I had while reading this: what if, when async=False, we still use Coroutine but only run one task? And then we just return that result for the user right away, rather than returning the task? That way it's the same pathway through the code/objects regardless of the value of async. Does that track at all?

@tshanebuckley
Author

Your logic for async=False is exactly my plan...though since async is a reserved keyword, I've been planning coroutine=False.

I'm more just trying to wrap my head around chunking the API calls and recombining them, how that affects XML, JSON, and CSV returns along with binary/file returns, and making sure all of the type-checking is right.
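
For the JSON case, recombining is mostly just flattening the per-chunk lists; a tiny sketch (the chunk structure here is assumed, not PyCap's actual return type):

# each chunk comes back as a list of record dicts from one API call
chunked_results = [
    [{"record_id": "1"}, {"record_id": "2"}],
    [{"record_id": "3"}],
]

# flatten into the single list a synchronous export_records() would return
records = [record for chunk in chunked_results for record in chunk]
assert len(records) == 3

CSV and XML are trickier, since each chunk carries its own header row or root element, which is presumably where the "do everything async with json/dicts and convert back" plan from the earlier notes comes in.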

@pwildenhain
Collaborator

> I've been planning coroutine=False

Nitpick: What about return_coroutine=False? This seems more explicit to me and signals to the user that the return type of the function will change.

@tshanebuckley Are you still planning on tackling this within the next few weeks? It seems like this won't be a breaking change in terms of the package API, just an additional argument (which is great).

If you think it might be a while before you get to it, then I think I'll release 2.0 in the meantime, and this can be a feature of 2.1

However, I'm willing to hold off if you're planning to work on this, because I know this will change a lot of how the package works internally, and I don't want to add a bunch of new endpoints in the next week just to make you have to convert those over to the new internal architecture as well.

@tshanebuckley
Author

return_coroutine=False is a bit more explicit, so sounds good to me.

I have been working on this, just haven't pushed anything up to GitHub as of yet. I'm trying to keep each of the methods completely self-contained in the returned coroutine, but I think the original architecture works counter to this.

I need Coroutine to call, for example, export_records to generate the payloads, but export_files runs _call_api to instantiate the Coroutine. This is basically for chunking the payload, since I would rather chunk first and then generate each payload based on those chunks.

I definitely over-estimated how long this would take me. 😅 I think I figured out how to implement this without breaking away from the current architecture (just breaking out the code that generates the payload into a method and passing that method to _call_api). I'll push my working repo tonight and let you decide on the 2.0 release. I'm cool either way!

@pwildenhain
Collaborator

Awesome, looking forward to seeing what you've got 👍🏻

@tshanebuckley
Author

Just pushed. Started some work in the method classes, mainly files. You'll see that I took a decent amount of the code from request.py and have been replacing the requests package with aiohttp, moving it into coroutine.py. Basically, I have to get every method call to end with _call_api, where the plan is to have that return either the coroutine or the previously expected output. Still have to figure out what I'm going to do with _return_data, along with identifying errors across multiple requests...so that's still a work in progress.

@pwildenhain linked a pull request Mar 22, 2022 that will close this issue