Support schemas that include non type-level constraint #12
Can you elaborate on "serious problems if misused"? minItems/maxItems and minLength/maxLength especially would be very nice to have. Also, contains/minContains sound like they should be sort of doable if you enforce a subschema for the first or the first n elements in an array.
@turboderp The main problem is that their space complexity is quadratic. Because of how regular grammars/context-free grammars work, implementing counted repetition (i.e. minItems/maxItems, minLength/maxLength) means enumerating every possible length of the item from minItems to maxItems. This can lead to unexpectedly significant slowdowns and/or enormous CPU memory usage. In the context of TabbyAPI (or for individual users in general), however, it probably does not matter that much.

The same is true for the other features labelled with 🟡: they are useful, but if misused they can lead to quadratic/exponential/factorial time/space complexity. I definitely will support them one day, but I would also like to remind myself to add a safety section for them somewhere in the docs. Then, hopefully, no one will be left confused when they find their program has used 100 GB of system RAM and been killed by the OS.

For the case of minItems/maxItems and minLength/maxLength specifically, though, I think it is possible to "hack" the kbnf engine to handle counted repetition efficiently by embedding explicit integer counters into the Earley item state. Its effect on codebase complexity and/or efficiency is unknown, though. I will probably stick to standard context-free grammar/regex when implementing this feature in Formatron first, and see whether many users actually need very large counted repetitions.
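To make the quadratic blow-up concrete, here is a minimal Python sketch (purely illustrative, not kbnf's actual implementation) of how a counted repetition has to be written out as one grammar alternative per allowed length:

```python
def expand_counted_repetition(nonterminal: str, item: str, min_items: int, max_items: int) -> str:
    """Write `item{min,max}` as a BNF-style rule with one alternative per allowed length."""
    alternatives = []
    for length in range(min_items, max_items + 1):
        # length == 0 becomes an explicit empty alternative.
        alternatives.append(" ".join([item] * length) if length else "''")
    return f"{nonterminal} ::= " + " | ".join(alternatives)


print(expand_counted_repetition("items", "item", 2, 5))
# items ::= item item | item item item | item item item item | item item item item item

# The total number of symbols across all alternatives is sum(min..max), i.e. O(max^2):
print(sum(range(2, 5 + 1)))     # 14 symbols for {2,5}
print(sum(range(0, 1000 + 1)))  # 500500 symbols for {0,1000}
```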
Yeah, this does mean we need to modify its semantics (the array must contain at least one matching element among the first n elements), though. The more serious problem is that optional items (nullable nonterminals in kbnf) can lead to exponential time complexity. This means it is only practical where […]
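For the exponential side, a rough, implementation-agnostic illustration (this is a general property of rewriting a rule without nullable symbols, not a description of kbnf's internals): a single rule with n optional symbols expands to one alternative per subset of present symbols, i.e. up to 2^n alternatives.

```python
from itertools import combinations


def expand_optionals(symbols: list[str]) -> list[str]:
    """Rewrite `A ::= B1? B2? ... Bn?` as one alternative per subset of present symbols."""
    alternatives = []
    for r in range(len(symbols) + 1):
        for subset in combinations(symbols, r):
            alternatives.append(" ".join(subset) if subset else "''")
    return alternatives


print(len(expand_optionals(["B1", "B2", "B3"])))               # 8 == 2**3
print(len(expand_optionals([f"B{i}" for i in range(1, 21)])))  # 1048576 == 2**20
```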
I see. The main use cases I picture would be for CoT-type sampling where you might want structured generation of "1 to 5 follow-up questions" or some such. I guess quadratic complexity in that case wouldn't be a showstopper, but it would be harder for "up to 100 characters" in the case of a string. In any case, thanks for your work, and I will remain hopeful. :)
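For reference, the "1 to 5 follow-up questions" / "up to 100 characters" use cases above would look roughly like this in Pydantic v2 (whether the bounds are actually enforced during constrained generation is exactly what this issue is about):

```python
from pydantic import BaseModel, Field


class FollowUps(BaseModel):
    # min_length/max_length on a list map to minItems/maxItems in JSON Schema.
    follow_up_questions: list[str] = Field(min_length=1, max_length=5)
    # max_length on a str maps to maxLength ("up to 100 characters").
    summary: str = Field(max_length=100)
```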
Dan - Without […]. Could you tell me a bit about the difficulties that lie ahead with implementing that?
@gittb The problem is that […]. For simpler cases where we can consider each subschema as a concrete type and […]. You may also be interested in […].
I think the simple case most people need is just tool selection. I.e., the model wants to generate a tool call, but you don't know which tool's schema to apply until the model starts emitting JSON. I think typically each tool has a unique name field. But if […].
Yup - Spot on. Thank you turbo. |
For clarity - here is an example of what this would look like in Pydantic:

```python
from pydantic import BaseModel
from typing import Union, Optional, List


class Tool1ParamModel(BaseModel):
    param1: str
    param2: int


class Tool2ParamModel(BaseModel):
    example_param: bool
    optional_param: Optional[int]


class Tool3ParamModel(BaseModel):
    another_param: str
    faster_schema_gen: int
    please: Optional[str]


class ToolModel1(BaseModel):
    name: str = "tool_1"
    param: Tool1ParamModel


class ToolModel2(BaseModel):
    name: str = "tool_2"
    param: Tool2ParamModel


class ToolModel3(BaseModel):
    name: str = "tool_3"
    param: Tool3ParamModel


class TopModel(BaseModel):
    tool_calls: List[Union[ToolModel1, ToolModel2, ToolModel3]]
```
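A side note on the example above: if every tool has a unique, fixed name, Pydantic can pin it with Literal and mark the union as a discriminated (tagged) union, so the emitted JSON Schema states explicitly which subschema belongs to which name. A sketch in Pydantic v2 that reuses the Tool*ParamModel classes from the snippet above (the ToolCall* names are made up here; whether a constrained-generation engine takes advantage of the discriminator is a separate question):

```python
from typing import Annotated, List, Literal, Union

from pydantic import BaseModel, Field


class ToolCall1(BaseModel):
    # Literal pins the value, so it appears as a constant in the JSON Schema
    # rather than a default the model could overwrite.
    name: Literal["tool_1"]
    param: Tool1ParamModel


class ToolCall2(BaseModel):
    name: Literal["tool_2"]
    param: Tool2ParamModel


class ToolCall3(BaseModel):
    name: Literal["tool_3"]
    param: Tool3ParamModel


# The discriminator tells Pydantic which subschema applies as soon as "name" is known.
AnyToolCall = Annotated[
    Union[ToolCall1, ToolCall2, ToolCall3],
    Field(discriminator="name"),
]


class TopModel(BaseModel):
    tool_calls: List[AnyToolCall]
```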
@gittb I see. Yeah, for that case […].
@gittb […] supported in […].
I appreciate you, Dan.
In JSON Schema and Pydantic Fields, a lot of non type-level constraints are provided. Not all of them can be implemented with a CFG alone, but we should examine all of them and try to implement as many as possible (a few of these constraints are sketched in the example after the legend below).
✅ - already supported
🟢 - can be supported without incurring any potential problems
🟡 - can be supported, but can lead to serious problems if misused
🟠 - can be supported, but requires significant effort and functionality may be limited
❌ - impossible to support within the expressiveness of a context-free grammar
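For concreteness, here are a few of the non type-level constraints in question, written as Pydantic v2 fields; the JSON Schema keywords they map to are noted in the comments. This is only an illustration of the constraint kinds, not a statement of what is currently supported:

```python
from pydantic import BaseModel, Field


class Example(BaseModel):
    # minLength / maxLength
    username: str = Field(min_length=3, max_length=20)
    # pattern
    hex_color: str = Field(pattern=r"^#[0-9a-fA-F]{6}$")
    # minimum / exclusiveMaximum
    age: int = Field(ge=0, lt=150)
    # minItems / maxItems
    tags: list[str] = Field(min_length=1, max_length=10)


print(Example.model_json_schema())
```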