
Add load balancing #17

Open

JBGruber opened this issue Jun 2, 2024 · 5 comments

Comments

JBGruber (Owner) commented Jun 2, 2024

The same approach implemented in #16 could also be used to send requests to multiple Ollama servers at once and process them in parallel. There are at least two approaches we could follow:

  1. naive: we distribute requests equally among servers and wait for all responses.
  2. advanced: we send a few requests to each server and then poll which instance has returned responses. As soon as a server has fewer than x open requests in the queue, we send more.

With approach 1, the total run time would be determined by the slowest instance. Approach 2 would be much more efficient in scenarios with a mix of fast and slow machines, but it is also harder to implement.
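
For illustration, here is a minimal sketch of what the naive split could look like (not rollama's implementation, just the idea; the actual Ollama calls are left out):

servers <- c("http://localhost:11434/", "http://192.168.2.45:11434/")
prompts <- paste("text", 1:10)

# approach 1: round-robin assignment, request i goes to server i mod n,
# then each chunk is sent off and we wait for all of them to finish
chunks <- split(prompts, rep_len(seq_along(servers), length(prompts)))
lengths(chunks)  # 5 requests per server; total time = slowest server

# approach 2 would instead poll each server and top up its queue as soon as
# it drops below a threshold, so fast servers end up handling more requests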

JBGruber (Owner, Author) commented
This works now in the output branch. I opted for something between naive and advanced: when you supply a vector of servers, you can assign a name to each, corresponding to the share of requests that server should fulfil. So c("0.6" = "http://localhost:11434/", "0.4" = "http://192.168.2.45:11434/") will hand 60% of requests to localhost and 40% to the remote computer. It's pretty quick:

library(rollama)
library(tidyverse)

reviews_df <- read_csv("https://raw.githubusercontent.com/AFAgarap/ecommerce-reviews-analysis/master/Womens%20Clothing%20E-Commerce%20Reviews.csv",
                       show_col_types = FALSE) |> 
  sample_n(500)
#> New names:
#> • `` -> `...1`

make_query <- function(t) {
  tribble(
    ~role,    ~content,
    "system", "You assign texts into categories. Answer with just the correct category, which is either {positive}, {neutral} or {negative}.",
    "user", t
  )
}


start <- Sys.time()
reviews_df_annotated <- reviews_df |> 
  mutate(query = map(`Review Text`, make_query),
         category = query(query, screen = FALSE,
                          model = "llama3.2:3b-instruct-q8_0", 
                          server = c("0.6" = "http://localhost:11434/", 
                                     "0.4" = "http://192.168.2.45:11434/"), 
                          output = "text"))
stop <- Sys.time()
stop - start
#> Time difference of 18.19546 secs

Created on 2024-10-18 with reprex v2.1.0

bshor commented Nov 8, 2024

I'm trying to understand (and implement) this.

Would this be the functional equivalent of having two GPUs (for example) in one system? That is, could they handle a much larger model through their combined VRAM?

Or is this merely going to split the requests at the ratios you select, so everything processes quicker, but without taking advantage of the larger combined VRAM?

Or is this an implementation of Ollama's new parallel request feature?

JBGruber (Owner, Author) commented Nov 8, 2024

AFAIK there is no way to combine the VRAM of consumer GPUs. This is indeed just using parallel requests: if you have multiple machines (or multiple GPUs running separate Ollama instances, I guess), you can divide the queue among them. E.g., say you have two PCs with a GPU and one laptop. The laptop will be slow, but it could still fulfil some of the requests if you have thousands of them.
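
To make that concrete, with the weighted server argument shown above you could give each GPU machine a large share and let the laptop mop up the rest (the addresses and shares here are made up for illustration):

library(rollama)

# hypothetical shares: two GPU machines take 45% each, the slow laptop 10%
servers <- c("0.45" = "http://192.168.2.10:11434/",
             "0.45" = "http://192.168.2.11:11434/",
             "0.10" = "http://192.168.2.45:11434/")

# `queries` would be a list of message data frames (one per text), e.g. built
# with make_query() as in the reprex above; the call itself is unchanged:
# answers <- query(queries, model = "llama3.2:3b-instruct-q8_0",
#                  server = servers, screen = FALSE, output = "text")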

bshor commented Nov 8, 2024

OK, but do you mean queues in what sense? Within the context of a single Ollama call that may just take a long time to complete, or individual calls that are kept in a list or something?

JBGruber (Owner, Author) commented
A queue would be the same annotation task, with examples and prompt, applied to, say, 1000 texts. I updated the vignette to show an example: https://jbgruber.github.io/rollama/articles/annotation.html#another-example-using-a-dataframe
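
In other words, the queue is just a list of queries, one per text, which query() then divides among the servers. Schematically, reusing make_query() and reviews_df from the reprex above:

texts <- reviews_df$`Review Text`   # the texts to annotate (hundreds or thousands)
queue <- lapply(texts, make_query)  # one prompt-plus-text message data frame each
# query() works through this list and splits it across the configured servers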
