test: Add completion models unit & matrix integration tests #68

Open
faassen opened this issue Oct 21, 2024 · 5 comments · May be fixed by #97

faassen commented Oct 21, 2024

Given @0xMochan's comment on #66, I took the liberty of creating an issue for it. Here are some of my thoughts; I hope they're helpful.

Integration testing of external services is a tricky issue. Even without the stochastic nature of LLMs, there is no guarantee that replaying a request against the same external service is going to give the same response, as state could have changed downstream.

A generic mock service is very useful, but it only tests part of the integration. We also know that specific integrations can break, and it's difficult to test them thoroughly by hand.

How would manual testing work?

There would be a bunch of scenarios, and you'd run them, eyeballing whether the correct behavior happens. It'd be nice if those scenarios could be generic, so that specific clients (openai, anthropic, etc.) can be injected; that way most of the manual testing code can be reused.

This can also lead to at least some "smoke tests": while we don't know the details of the responses, we know the response should be successful and that the right type of response is returned (such as a request to use a tool).
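As a rough sketch of how such an injectable scenario could look (the `CompletionClient` trait, `CompletionResponse` enum, and `tool_call_scenario` function are hypothetical names for illustration, not rig's actual API):

```rust
use async_trait::async_trait;

/// Hypothetical abstraction over a provider client (OpenAI, Anthropic, ...).
#[async_trait]
pub trait CompletionClient {
    async fn complete(&self, prompt: &str) -> anyhow::Result<CompletionResponse>;
}

/// The shape of response we care about, without pinning down its exact content.
pub enum CompletionResponse {
    Message(String),
    ToolCall { name: String, arguments: String },
}

/// A generic smoke-test scenario: assert only on the *kind* of response,
/// never on the exact text, since LLM output is not deterministic.
pub async fn tool_call_scenario(client: &dyn CompletionClient) -> anyhow::Result<()> {
    let response = client
        .complete("What is the weather in Paris? Use the get_weather tool.")
        .await?;
    match response {
        CompletionResponse::ToolCall { name, .. } => {
            assert_eq!(name, "get_weather");
            Ok(())
        }
        CompletionResponse::Message(_) => anyhow::bail!("expected a tool call, got plain text"),
    }
}
```

The same scenario could then be run against each real provider client during manual testing, and against a fake or replaying client in CI.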

How to automate such smoke tests?

Once a test passes against the real API, the requests (including parameters, body, etc.) as well as the full responses can be recorded and serialized into files. So the real clients would need to grow an optional recording feature. Then you could have fake versions of the various clients (openai, anthropic, etc.) which, instead of making a real request, check whether the request matches a recording and synthesize the response from that recording as well.
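As an illustration of what such a recording might look like on disk, here's one possible serialized format (the struct and file layout are just a sketch, not an existing rig feature):

```rust
use serde::{Deserialize, Serialize};

/// One recorded exchange with a provider, serialized to JSON on disk:
/// everything needed to match an incoming request and replay its response.
#[derive(Serialize, Deserialize)]
struct RecordedExchange {
    method: String,
    url: String,
    request_headers: Vec<(String, String)>,
    request_body: serde_json::Value,
    status: u16,
    response_body: serde_json::Value,
}

/// In record mode the real client would append to this file after each call;
/// in replay mode the fake client looks up the matching entry instead of
/// touching the network.
fn load_recordings(path: &std::path::Path) -> anyhow::Result<Vec<RecordedExchange>> {
    let data = std::fs::read_to_string(path)?;
    Ok(serde_json::from_str(&data)?)
}
```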

Now that I've written all that, I actually thought "wait, this could all happen on the reqwest layer", and I found a library that does something like what I just described:

https://docs.rs/reqwest_mock/latest

It turns out it's discontinued, but its README describes a bunch of alternatives to explore.

I'll read a bit about these next.

faassen added the feat label on Oct 21, 2024
0xMochan (Contributor) commented:

Adding a recording feature to serialize real responses as part of the provider machinery would be quite interesting, and perhaps a robust way to set up a nice in-between of smoke tests (which I think of as sanity checks) and proper integration testing. Mocking is always tricky (and though most of these providers let you set a seed, it's not something to rely on).


faassen commented Oct 21, 2024

I've reviewed some of the various suggested solutions:

rvcr

https://github.com/chorusone/rvcr

Downloads: 13,289

This is centered around the record/replay feature but offers no other mocking features.

It builds on top of reqwest-middleware, which may be interesting in its own right. From grepping the source, I think it supports async requests.
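Based on its README, using rvcr looks roughly like this (treat it as a sketch: the cassette path and request body are made up, and the exact API may have changed since):

```rust
use std::path::PathBuf;

use reqwest_middleware::ClientBuilder;
use rvcr::{VCRMiddleware, VCRMode};

#[tokio::test]
async fn replay_recorded_completion() {
    // Load a previously recorded cassette; VCRMode::Record writes one instead.
    let middleware = VCRMiddleware::try_from(PathBuf::from("tests/resources/openai.vcr.json"))
        .unwrap()
        .with_mode(VCRMode::Replay);

    // Wrap the ordinary reqwest client; rig's provider clients would need to
    // accept this wrapped client for the approach to work end to end.
    let client = ClientBuilder::new(reqwest::Client::new())
        .with(middleware)
        .build();

    let resp = client
        .post("https://api.openai.com/v1/chat/completions")
        .json(&serde_json::json!({ "model": "gpt-4o-mini", "messages": [] }))
        .send()
        .await
        .unwrap();
    assert!(resp.status().is_success());
}
```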

wiremock-rs

https://github.com/lukemathwalker/wiremock-rs

Downloads: 9,591,535

This allows you to mock an HTTP service, but offers no recording/replay.

Lists mockito and httpmock as prior art, and presents a table comparing features:

https://docs.rs/wiremock/0.6.2/wiremock/#prior-art

It is not the same as, nor compatible with, Java WireMock.

confusion warning

There's wiremock-rs, and there's Java wiremock (at wiremock.org), which is something else.
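For reference, mocking a completion endpoint with wiremock-rs looks roughly like this (the endpoint path and response body are made up for illustration; the provider client would need to be pointed at the mock server's URI instead of the real base URL):

```rust
use wiremock::matchers::{method, path};
use wiremock::{Mock, MockServer, ResponseTemplate};

#[tokio::test]
async fn mocked_completion_endpoint() {
    // Spin up a local mock server and register a canned completion response.
    let server = MockServer::start().await;
    Mock::given(method("POST"))
        .and(path("/v1/chat/completions"))
        .respond_with(ResponseTemplate::new(200).set_body_json(serde_json::json!({
            "choices": [{ "message": { "role": "assistant", "content": "hello" } }]
        })))
        .mount(&server)
        .await;

    // Exercise the request/response code against the mock instead of the real API.
    let resp = reqwest::Client::new()
        .post(format!("{}/v1/chat/completions", server.uri()))
        .json(&serde_json::json!({ "model": "test", "messages": [] }))
        .send()
        .await
        .unwrap();
    assert!(resp.status().is_success());
}
```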

stubr

https://github.com/beltram/stubr

Downloads: 37,385

This is an adaptation (fork?) of wiremock-rs that supports Java WireMock JSON stubs as input. It aims to become a drop-in replacement for Java WireMock.

Hasn't been updated in over a year.

This adds recording with the record-reqwest feature

https://beltram.github.io/stubr/html/recording/reqwest.html

but async requests aren't supported, and rig uses async requests.

httpmock

https://github.com/alexliesenfeld/httpmock

Downloads: 3,873,238

Mocks an HTTP service, but no recording/replay.

mockito

https://github.com/lipanski/mockito

Downloads: 5,595,547

Mocks an HTTP service, but no recording/replay.

Analysis

By popularity, wiremock-rs would be interesting, but it requires mocking the individual APIs rather than being able to record/replay. stubr fixes that and can integrate with reqwest, but relying on a less-maintained fork of wiremock-rs seems risky; on top of that, rig requires async requests, which stubr's recording doesn't support.

rvcr seems fairly simple, and if record/replay is chosen, I think it's the way to go. If record/replay doesn't work out, we can fall back on wiremock-rs: it's the most popular, and with some work fake APIs can be built on top of it.


faassen commented Oct 21, 2024

though most of these providers let you set a seed, it's not something to rely on

Indeed! Anthropic's API docs take forever to load (today at least), but here's their note about temperature (I haven't come across a notion of a seed):

Note that even with temperature of 0.0, the results will not be fully deterministic.

https://docs.anthropic.com/en/api/messages


0xMochan commented Oct 21, 2024

I've reviewed some of the various suggested solutions:

This is some very high-quality analysis and will be super useful as we approach this soon!

mateobelanger changed the title from "feat: Integration Testing" to "test(integration): Add hybrid smoke-integration tests for external providers" on Oct 22, 2024
mateobelanger added the test label and removed the feat label on Oct 22, 2024
0xMochan mentioned this issue on Oct 29, 2024
mateobelanger added and then removed the feat label on Oct 30, 2024
mateobelanger self-assigned this on Nov 12, 2024
mateobelanger (Member) commented:

Task

  • Create a testing matrix for all completion model providers
    e.g. with/without additional params, with/without tools, ... (see the sketch after this comment)
  • Integration tests

Research

https://rust.testcontainers.org/
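
A minimal sketch of what such a matrix could look like as plain Rust tests (the provider list and the `smoke_test` helper are hypothetical placeholders):

```rust
/// Hypothetical axes of the matrix; each combination becomes one test case.
#[derive(Debug, Clone, Copy)]
struct Case {
    provider: &'static str,
    with_additional_params: bool,
    with_tools: bool,
}

#[tokio::test]
async fn completion_matrix() {
    let providers = ["openai", "anthropic", "cohere", "perplexity"];
    let bools = [false, true];

    for provider in providers {
        for with_additional_params in bools {
            for with_tools in bools {
                let case = Case { provider, with_additional_params, with_tools };
                // `smoke_test` stands in for a scenario like the one sketched
                // earlier in this thread: build the provider client (real,
                // recorded, or mocked), send a request, and assert only on the
                // shape of the response.
                smoke_test(case)
                    .await
                    .unwrap_or_else(|e| panic!("{case:?} failed: {e}"));
            }
        }
    }
}

async fn smoke_test(_case: Case) -> anyhow::Result<()> {
    // Placeholder body; a real test would dispatch on `_case.provider`.
    Ok(())
}
```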

mateobelanger added this to the v0.5 milestone on Nov 12, 2024
mateobelanger changed the title from "test(integration): Add hybrid smoke-integration tests for external providers" to "test(integration): Add completion models unit & matrix integration tests" on Nov 12, 2024
mateobelanger changed the title from "test(integration): Add completion models unit & matrix integration tests" to "test: Add completion models unit & matrix integration tests" on Nov 12, 2024
cvauclair removed this from the v0.5 milestone on Dec 9, 2024