Dataloader in multi-threaded environments #71

Open
jnak opened this issue Dec 9, 2019 · 10 comments
Assignees: Cito
Labels: discussion (Needs more discussion), help wanted (Extra attention is needed), investigate (Needs investigation or experimentation)

Comments

@jnak

jnak commented Dec 9, 2019

Hi,

I've been thinking a bit about how we could implement the Dataloader pattern in v3 while still running in multi-threaded mode. Since v3 does not support Syrus's Promise library, we need to come up with a story for batching in async mode, as well as in multi-threaded environments. There are many libraries that do not support asyncio and there are many cases where it does not make sense to go fully async.

As far as I understand, the only way to batch resolver calls from a single frame of execution would be to use loop.call_soon. But since asyncio is not thread-safe, that means we would need to run a separate event loop in each worker thread. We would need to wrap the graphql call with something like this:

def run_batched_query(...):
    # Each worker thread gets its own private event loop.
    loop = asyncio.new_event_loop()
    try:
        # graphql() returns a coroutine; drive it to completion on this loop.
        return loop.run_until_complete(graphql(...))
    finally:
        loop.close()

Is that completely crazy? If so, do you see a less hacky way? I'm not very familiar with asyncio, so I would love to get feedback.

Cheers

@Cito
Member

Cito commented Dec 10, 2019

My idea would be to port https://github.com/graphql/dataloader to Python, similarly to how GraphQL.js has been ported. It seems Syrus has already started working on this (https://github.com/syrusakbary/aiodataloader). Dataloader has seen a lot of new development recently, so this could be worthwhile.

@Cito Cito self-assigned this Dec 10, 2019
@Cito Cito added the investigate (Needs investigation or experimentation) label Dec 10, 2019
@jnak
Author

jnak commented Dec 10, 2019

Nice! I didn't realize Syrus had made another version of Dataloader specifically for asyncio. It seems he's indeed using loop.call_soon to schedule batching on the next loop cycle.

Re the recent developments of Dataloader in JS: the task scheduling in Dataloader v2 uses the same underlying mechanism as in v1. Since Syrus' repo is a straight port of Dataloader JS v1, let's use it as the basis for this conversation.
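For context, the call_soon trick looks roughly like this (an illustrative toy, not aiodataloader's actual implementation):

import asyncio

class TinyBatcher:
    """Toy batcher: loads made during one tick of the event loop are queued,
    and a single dispatch is scheduled via call_soon for the next iteration."""

    def __init__(self, batch_load_fn):
        self.batch_load_fn = batch_load_fn  # async fn: list of keys -> list of values
        self.queue = []                     # (key, future) pairs awaiting dispatch

    def load(self, key):
        loop = asyncio.get_event_loop()
        future = loop.create_future()
        if not self.queue:  # first load this tick: schedule exactly one flush
            loop.call_soon(lambda: loop.create_task(self._dispatch()))
        self.queue.append((key, future))
        return future

    async def _dispatch(self):
        batch, self.queue = self.queue, []
        values = await self.batch_load_fn([key for key, _ in batch])
        for (_, future), value in zip(batch, values):
            future.set_result(value)

Several load() calls awaited from resolvers in the same tick then result in a single batch_load_fn call on the next loop iteration.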

I see two issues with using it in multi-threaded mode:

  1. It is not thread-safe, because the Dataloader queue is not thread-scoped. Scheduled coroutines could be moved to a different thread, leading to all sorts of bugs and security issues. We could fix that by subclassing threading.local (vs object); a rough sketch of this idea follows the list.

  2. It assumes it is run inside a single global event loop. We can remove the assumption that there is a single loop by not caching the value of get_event_loop(), but this will still assume it is run inside an event loop.
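For illustration, the threading.local idea from point 1 could look roughly like this. This is a sketch only, not aiodataloader's actual code; it just shows why inheriting from threading.local rather than object makes the mutable state thread-scoped:

import threading

class ThreadScopedLoaderState(threading.local):
    """Because this inherits from threading.local, __init__ is re-run in every
    thread that touches the instance, so the queue and cache below are never
    shared between threads."""

    def __init__(self, batch_load_fn):
        self.batch_load_fn = batch_load_fn
        self.queue = []   # pending (key, future) pairs, one list per thread
        self.cache = {}   # per-thread memoization of key -> future

A single shared instance handed to every worker thread then behaves as if each thread had constructed its own loader state.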

Unless we wrap each graphql call in its own event loop (see my snippet above), I don't see how we can enable batching in multi-threaded mode. Do you have a sense of whether that's a crazy idea?

@Cito
Member

Cito commented Dec 12, 2019

Currently I don't have enough time to look deeper into this and make a qualified assessment. Maybe you should examine what they are doing in other languages, e.g. graphql/dataloader#30.

@jnak
Author

jnak commented Dec 16, 2019

Sounds good. I'll look into it when I start working on upgrading to v3 (probably not before Q2/Q3 2020). In the meantime, let's keep this issue open in case someone needs this earlier than I do, so we can discuss it here.

If someone is interested in working on this, refer to syrusakbary/promise#81 to learn more about Dataloader and thread-safety in Python.

@silas

silas commented Jan 18, 2020

Looks like it might be possible to use asgiref's sync_to_async and aiodataloader to do this now.

import asyncio
import os
import time

os.environ['ASGI_THREADS'] = '16'  # size of asgiref's worker thread pool

from aiodataloader import DataLoader
from asgiref.sync import sync_to_async


class TestLoader(DataLoader):
    @sync_to_async
    def batch_load_fn(self, keys):
        # A plain sync function; sync_to_async runs it in a worker thread and
        # returns an awaitable, which is what DataLoader expects.
        t = time.time()
        return [f'{key}-{t}' for key in keys]


async def main():
    loader = TestLoader()
    one1 = loader.load('one')
    one2 = loader.load('one')
    two = loader.load('two')
    print(await one1, await one2)
    print(await two)


if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    loop.close()

@jnak
Author

jnak commented Jan 22, 2020

@silas I'm not sure why you need asgiref's sync_to_async here since aiodataloader.DataLoader.batch_load_fn is already async. Also aiodataloader.DataLoader is not thread-safe, so you'll run into concurrency issues if you call / await load from multiple threads.

If you want to make this work in a multi-threaded environment, I recommend that you make aiodataloader.DataLoader thread-safe by using threading.local (see syrusakbary/promise#81 for an example). Then you should be able to safely wrap the graphql call with my run_batched_query snippet above.
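Putting those two pieces together could look something like the sketch below: one private event loop per worker thread, and loaders that are created per call and passed through the context so nothing is ever shared between threads. The batch_load_users function and the "user_loader" context key are made-up names:

import asyncio

from aiodataloader import DataLoader
from graphql import graphql

def run_batched_query(schema, query, batch_load_users):
    # One private event loop per worker thread / per call.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        # The loader is created after the loop is installed and is never
        # shared with another thread, so thread-safety is not an issue.
        context = {"user_loader": DataLoader(batch_load_users)}
        return loop.run_until_complete(
            graphql(schema, query, context_value=context)
        )
    finally:
        asyncio.set_event_loop(None)
        loop.close()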

@silas

silas commented Jan 22, 2020

@jnak batch_load_fn is a sync function in this case, which aiodataloader.DataLoader doesn't support (as far as I can tell).

You can't call load from the sync code (because it's an async function). You could use asgiref's async_to_sync which seems to hoist the call back up to the async thread (haven't confirmed), which should make it safe.

That said, I never call load from batch_load_fn or any of its dependents, so it's fine for my use case.

Probably not a general solution, but it seems like it's possible to make it work now.
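For anyone curious, the (unconfirmed) async_to_sync idea would look roughly like this, assuming the sync resolver runs in a worker thread managed by sync_to_async and that a DataLoader is available on the context; the resolver, the "user_loader" context key, and parent["user_id"] are made-up names:

from asgiref.sync import async_to_sync

def resolve_user(parent, info):
    # A plain sync resolver running in an asgiref worker thread.
    loader = info.context["user_loader"]

    async def load_one(key):
        # async_to_sync hops back onto the event loop that spawned this worker
        # thread, so the loader's queue is only touched from that loop's thread.
        return await loader.load(key)

    return async_to_sync(load_one)(parent["user_id"])

Note that batching only kicks in if several loads are pending on the loop at the same time (for example because the sync resolvers themselves run concurrently via sync_to_async); if they run strictly one after another, each call blocks and dispatches a batch of one.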

@miracle2k

Why isn't the promise library supported anymore - or an equivalent? I want to use the dataloader pattern in a sync application. It's not even multi-threaded. All I want is to return an object from my resolver which will be resolved in a second tick, after the initial pass through execute. The promise library seemed to handle this use case nicely.

@Cito
Member

Cito commented Apr 27, 2020

@miracle2k Async/await is the modern equivalent of promises. It's the official solution to this problem, supported by both Python syntax and the standard library, whereas promises are not part of the standard library.

@miracle2k

@Cito Ok, but I don't want to use async in this code base, yet I do want to use the data loader. I think this was a reasonable thing to have been supported previously.

So I am aware that I can do the following, and indeed this is what I am doing now and it works:

  • Use the aiodataloader library. Implement the async def batch_load_fn function, but ignore the fact that it is async, and simply do a blocking call.
  • Do blocking calls to database and network in all other resolvers of the schema as well.
  • When calling into graphql to resolve a query, wrap the whole thing in an asyncio loop:
    import asyncio
    loop = asyncio.new_event_loop()
    async def run_async():
        return await execute(
            schema,
            document,
            variable_values=params.variables,
            operation_name=params.operation_name,
            **kwargs,
        )

    return loop.run_until_complete(run_async())

Now we are using asyncio as the "promise execution engine" for the dataloader only, and otherwise the code is synchronous. I don't know how high the performance penalty is, but I would guess it's minimal and similar to what it was with the promise library. It feels pretty gross, though. If this is the recommended way to do the dataloader pattern in sync mode, ok.

Note for anyone attempting this: in my particular case this required changes to aiodataloader so that it doesn't require an event loop to exist at __init__ time, but this can be avoided by simply moving the asyncio event loop higher up in the call stack.
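For reference, the "move the loop higher up" variant might look like this sketch: the loop is created and installed before any DataLoader is instantiated, so an unmodified aiodataloader can pick it up at __init__ time. make_loaders is a made-up factory and document is assumed to be an already-parsed DocumentNode:

import asyncio
from inspect import isawaitable

from graphql import execute

def execute_sync_app(schema, document, **kwargs):
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)  # installed before any DataLoader exists
    try:
        loaders = make_loaders()  # hypothetical factory building the DataLoaders
        result = execute(schema, document, context_value={"loaders": loaders}, **kwargs)
        # execute() returns an awaitable as soon as any resolver (e.g. one
        # backed by a dataloader) is async; drive it to completion here.
        if isawaitable(result):
            result = loop.run_until_complete(result)
        return result
    finally:
        asyncio.set_event_loop(None)
        loop.close()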

@Cito Cito added the discussion (Needs more discussion) and help wanted (Extra attention is needed) labels Dec 28, 2021
pylipp added a commit to boxwise/boxtribute that referenced this issue Aug 20, 2022
- create loaders and run async query execution when dispatching GraphQL
  request
- obtain loaders from GraphQL context in resolvers
- split product resolver due to different authz enforcement
cf.
graphql-python/graphql-server#66
https://lightrun.com/answers/graphql-python-graphene-consider-supporting-promise-based-dataloaders-in-v3
graphql-python/graphql-core#71