-
Notifications
You must be signed in to change notification settings - Fork 211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test: Add completion models unit & matrix integration tests #68
Comments
Adding a recording feature to serialize real responses as apart of the provider machinery would be quite interesting and perhaps a robust way to setup a nice in-between w/ smoke tests (which I think of sanity checks) and proper integration testing. Mocking is always tricky (though most of these providers let you set a seed, it's not something to rely on). |
I've reviewed some of the various suggested solutions: rvcrhttps://github.com/chorusone/rvcr Crowds: 13,289 This is centered around the record/replay feature but offers no other mocking features. Builds on top of reqwest-middleware which may be interesting in its own right. I think from grepping the source it supports async requests. wiremock-rshttps://github.com/lukemathwalker/wiremock-rs Crowds: 9,591,535 This allows you to mock a HTTP service, but no recording/replay. Lists mockito and httpmock as prior art, and presents a table comparing features: https://docs.rs/wiremock/0.6.2/wiremock/#prior-art Not the same or compatible with Java wiremock. confusion warningThere's wiremock-rs, and there's Java wiremock (at wiremock.org), which is something else. stubrhttps://github.com/beltram/stubr Crowds: 37,385 This is an adaptation (fork?) of wiremock-rs which supports Java Wiremock json stubs as input. Aims to reach drop-in replacement status of Java Wiremock. Hasn't been updated in over a year. This adds recording with the record-reqwest feature https://beltram.github.io/stubr/html/recording/reqwest.html but async requests aren't supported and rig uses async requests. httpmockhttps://github.com/alexliesenfeld/httpmock Crowds: 3,873,238 Mock a HTTP service, but no recording/replay. mockitohttps://github.com/lipanski/mockito Crowds: 5,595,547 Mock a HTTP service, but no recording/replay. AnalysisBy popularity wiremock-rs would be interesting, but it requires mocking the individual APIs rather than being able to record/replay. stubr fixes that and can integrate with reqwest, but relying on a fork of wiremock-rs that's less maintained seems risky, plus async requests appear to be required and not supported. rvcr seems fairly simple and if record/replay is chosen, I think this is the way to go. If record/replay doesn't work out, fall back on wiremock-rs, as it's most popular and with work fake APIs can be produced. |
Indeed! Anthropic's API docs take forever to load (today?), but here's this note about temperature (I haven't come across of a notion of seed):
|
This is some very high quality analysis and will be super useful as we approach this soon! |
Task
Research |
Given @0xMochan's comment on #66 I took the liberty of creating an issue for it. Here are some of my thoughts. I hope it's helpful at all.
Integration testing of external services is a tricky issue. Even without the stochastic nature of LLMs, there is no guarantee that replaying a request against the same external service is going to give the same response, as state could have changed downstream.
A generic mock service is very useful, but will only test part of the integration, but we also know specific integrations can break and it's difficult to test them thoroughly by manually.
How would manual testing work?
There would be a bunch of scenarios, and you'd run them, eyeballing whether the correct behavior happens. It'd be nice if those scenarios could be generic and specific clients (openai, anthropic, etc) can be injected. This way most of the manual testing code can be reused.
This can also lead to at least some "smoke tests" - while we don't know the details of the responses, we know the response should be successful, and that the right type of response is returned (such as the request to use a tool).
How to automate such smoke tests?
One a test passes with the real API, the requests (including parameters, body, etc) as well as the full responses can be recorded and serialized into files. So, the real clients need to grow an optional recording feature. Then you could have fake versions of the various clients (openapi, anthropic), etc, which instead of making a real quest look whether the request matches a recording, and synthesize the response from this recording as well.
Now that I've written all that I actually thought "wait, this could all happen on the reqwest layer", and I found a library that does something like I just described:
https://docs.rs/reqwest_mock/latest
But I found out it's discontinued. But the README describes a bunch of alternatives to explore:
I'll read a bit about these next.
The text was updated successfully, but these errors were encountered: