Add load balancing #17
This works now in the output branch. I opted to do something between naive and advanced. When you supply a vector of servers, you can assign a name to each, corresponding to the share of requests that should be fulfilled by that server. So:

library(rollama)
library(tidyverse)
reviews_df <- read_csv("https://raw.githubusercontent.com/AFAgarap/ecommerce-reviews-analysis/master/Womens%20Clothing%20E-Commerce%20Reviews.csv",
show_col_types = FALSE) |>
sample_n(500)
#> New names:
#> • `` -> `...1`
make_query <- function(t) {
tribble(
~role, ~content,
"system", "You assign texts into categories. Answer with just the correct category, which is either {positive}, {neutral} or {negative}.",
"user", t
)
}
start <- Sys.time()
reviews_df_annotated <- reviews_df |>
mutate(query = map(`Review Text`, make_query),
category = query(query, screen = FALSE,
model = "llama3.2:3b-instruct-q8_0",
server = c("0.6" = "http://localhost:11434/",
"0.4" = "http://192.168.2.45:11434/"),
output = "text"))
stop <- Sys.time()
stop - start
#> Time difference of 18.19546 secs

Created on 2024-10-18 with reprex v2.1.0
I'm trying to understand (and implement) this. Would this be the functional equivalent of having two GPUs (for example) in one system, so that they could handle a much larger model through combined VRAM? Or is this merely going to split a single request at the ratios you select, so it just processes everything quicker without taking advantage of the larger combined VRAM? Or is this an implementation of Ollama's new parallel request feature?
AFAIK there is no way to combine the VRAM of consumer GPUs. This is indeed just using parallel requests: if you have multiple machines (or GPUs running multiple instances of Ollama, I guess), you can divide the queue. E.g., you have two PCs with a GPU and one laptop. The laptop will be slow, but it could still fulfill some of the requests if you have thousands.
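To make "dividing the queue" concrete, here is a minimal sketch of how a batch of texts could be split between two servers according to named shares. This is an illustration in base R, not rollama's internal implementation; the server URLs are placeholders and `reviews_df` is the data frame from the reprex above.

```r
# Illustration only: split a queue of texts between servers by share.
servers <- c("0.6" = "http://localhost:11434/",
             "0.4" = "http://192.168.2.45:11434/")
texts <- reviews_df$`Review Text`  # the queue: one prompt template, many texts

shares <- as.numeric(names(servers))
# randomly assign each text to a server, proportional to its share
assignment <- sample(servers, length(texts), replace = TRUE, prob = shares)
batches <- split(texts, assignment)  # one batch per server, processed in parallel
lengths(batches)                     # roughly 60% / 40% of the texts
```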
OK, but in what sense do you mean queues? Within the context of a single Ollama call that may just take a long time to complete, or do you mean individual calls that are kept in a list or something?
A queue would be the same annotation task, with examples and a prompt, applied to 1,000 texts. I updated the vignette to show an example: https://jbgruber.github.io/rollama/articles/annotation.html#another-example-using-a-dataframe
The same approach implemented in #16 could also be used to send requests to multiple Ollama servers at once and process them in parallel. There are at least two approaches we could follow:

1. Split the queue of requests between the servers up front (e.g., evenly or by a fixed share) and let each server work through its batch.
2. Dispatch requests dynamically, sending each new request to whichever server is currently free.

In 1., the total run time would be determined by the slowest instance. 2. would be much more efficient in scenarios with a mix of fast and slow machines, but also harder to implement.
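For illustration, here is a rough sketch of what approach 2 could look like using the base `parallel` package: one local worker per Ollama server, with `clusterApplyLB()` handing out one text at a time to whichever worker is idle. This is not what the package currently does; the server URLs are placeholders, and it assumes the `rollama_server` option can be used to point each worker at a different server.

```r
# Sketch of approach 2 (dynamic dispatch), not the current implementation.
library(parallel)

servers <- c("http://localhost:11434", "http://192.168.2.45:11434")  # placeholder URLs
texts <- reviews_df$`Review Text`

cl <- makeCluster(length(servers))  # one local worker per server

# Bind each worker to one server (assumes the rollama_server option is honoured).
clusterApply(cl, servers, function(s) {
  library(rollama)
  options(rollama_server = s)
})

# clusterApplyLB() gives each idle worker the next text, so a fast server
# simply ends up answering more of the queue than a slow one.
annotations <- clusterApplyLB(cl, texts, function(t) {
  rollama::query(t, model = "llama3.2:3b-instruct-q8_0",
                 screen = FALSE, output = "text")
})
stopCluster(cl)
```

The upside of this design is that no shares need to be chosen in advance; the downside is the overhead of managing persistent workers, which is part of why it is harder to implement.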