Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support schemas that include non type-level constraint #12

Open
Dan-wanna-M opened this issue Aug 28, 2024 · 12 comments
Open

Support schemas that include non type-level constraint #12

Dan-wanna-M opened this issue Aug 28, 2024 · 12 comments
Assignees
Labels
enhancement New feature or request

Comments

@Dan-wanna-M
Copy link
Owner

Dan-wanna-M commented Aug 28, 2024

In Json Schema and Pydantic Fields a lot of non type-level constraint are provided. Not all of them can be implemented only with CFG, but we should examine all of them and try to implement as much as possible.

✅- has been supported
🟢- can be supported without incurring any potential problems
🟡- can be supported but can lead to serious problems if misused
🟠- can be supported but require significant efforts and functionalities may be limited
❌- impossible to support within the expressiveness of a context free grammar

@Dan-wanna-M Dan-wanna-M self-assigned this Aug 28, 2024
@Dan-wanna-M Dan-wanna-M added the enhancement New feature or request label Aug 28, 2024
@Dan-wanna-M
Copy link
Owner Author

Dan-wanna-M commented Sep 15, 2024

Json schema

  • String
    • ✅🟡minLength/MaxLength (v0.4.6)
    • ✅🟡pattern(v0.4.6)
    • 🟠format
  • Numeric types
    • ❌multipleOf
    • 🟠minimum, exclusiveMinimum, maximum, exclusiveMaximum
      • ✅positive, negative, nonnegative,nonpositive(v0.4.6)
      • ❌others
  • Object
    • ✅properties
    • 🟠patternProperties
    • ❌additionalProperties(we always assume additionalProperties is false due to implementation limitations)
    • ❌unevaluatedProperties(we always assume unevaluatedProperties is false due to implementation limitations)
    • ✅🟡required
    • 🟠propertyNames
    • 🟠minProperties/maxProperties
  • Array
    • ✅items
    • ✅🟡prefixItems(v0.4.6)
    • ✅additionalItems(v0.4.6)
    • ❌unevaluatedItems
    • ❌contains/minContains/maxContains
    • ✅🟡minItems/maxItems(v0.4.6)
    • ❌uniqueItems
    • 🟠contentMediaType/contentEncoding
  • Schema composition
    • ❌allOf/oneOf/not
    • 🟠anyOf
      • ✅no common parts of subschemas factored out(v0.4.6)
      • 🟠with common parts of subschemas factored out
    • ❌dependentRequired/dependentSchemas/if/then/else
  • Misc
    • ✅Top-level array(v0.4.3)

Pydantic

  • str
    • ✅min_length/max_length(v0.4.6)
    • ✅pattern(v0.4.6)
  • int/float
    • ❌multiple_of
    • 🟢allow_inf_nan
    • 🟠gt, ge, lt, le
      • ✅positive, negative, nonnegative,nonpositive(v0.4.6)
      • ❌others
  • 🟢decimal
    • 🟢max_digits
    • 🟠decimal_places
  • sequence-like types
    • ✅🟡min_length/max_length(v0.4.6)
  • Misc
    • ❌lax mode format enforcement
      • I can't see why it is useful for a llm.

@turboderp
Copy link
Contributor

Can you elaborate on "serious problems if misused"? minItems/maxItems and minLength/MaxLength especially would be very nice to have.

Also, contains/minContains sound like they should be sort of doable if you enforce a subschema for the first or the first n elements in an array.

@Dan-wanna-M
Copy link
Owner Author

Can you elaborate on "serious problems if misused"? minItems/maxItems and minLength/MaxLength especially would be very nice to have.

@turboderp The main problem is that their space complexity is quadratic. Due to the way regular grammar/context free grammar works, to implement counted repetition(aka maxLength/MaxItems) means to enumerate over all the possible lengths of the item from minItems to maxItems. This could lead to unexpected significant slowdown and/or enormous CPU memory usage. In the context of TabbyAPI(or in general for individual users) however, it probably does not matter that much.

The same is true for other features labelled with 🟡. They are useful but if misused can lead to quadratic/exponential/factorial time/space complexity. I definitely will support them one day, but I would also like to remind myself to add a safety section for them somewhere in the docs. Then, hopefully someone will not be left confused when they found their program used 100GB system RAM and got killed by OS.

For the case of minItems/maxItems and minLength/MaxLength specifically though, I think it is possible to "hack" the kbnf engine to handle the counted repetition efficiently by embedding explicit integer counters into Earley item state. Its effect on the codebase complexity and/or efficiency is unknown though. I probably will stick to standard context free grammar/regex when implementing this feature in Formatron first and see if there are many users who do need very large counted repetition.

Also, contains/minContains sound like they should be sort of doable if you enforce a subschema for the first or the first n elements in an array.

Yeah, this does mean we need to modify its semantics(the array must contain at least one in the first n elements) though. The more serious problem is that optional item(nullable nonterminal in kbnf) can leads to exponential time complexity. This means it is only practical where maxItems-minItems is small(which usually is, in the context of optional object properties).

@turboderp
Copy link
Contributor

I see. The main use cases I picture would be for CoT-type sampling where you might want structured generation of "1 to 5 follow-up questions" or some such. I guess quadratic complexity in that case wouldn't be a showstopper, but it would be harder for "up to 100 characters" in the case of a string. In any case, thanks for your work, and I will remain hopeful. :)

@gittb
Copy link

gittb commented Oct 7, 2024

Dan - AnyOf is crucial for tool calling and agent workflows. I noticed you flagged it as 🟠.

Without AnyOf, we lose the ability to use unions of models. This matters for tool calling where the client provides multiple tools, and we need to guide the model in writing the correct data types and argument names for each tool. Without AnyOf, the model can’t write a list of calls where each call is AnyOf a set of models representing those client-provided tools.

Could you tell me a bit about the difficulties that lie ahead with implementing that?

@Dan-wanna-M
Copy link
Owner Author

Dan-wanna-M commented Oct 7, 2024

Could you tell me a bit about the difficulties that lie ahead with implementing that?

@gittb The problem is that AnyOf is more complex than sinple "union of schemas," which is much easier to support. See https://json-schema.org/understanding-json-schema/reference/combining#factoringschemas for an example: it allows factors out of common schema parts which do not interoperate well with context free grammar's model. This would require us to do some "common part" propagation that likely will then mess with duplicate fields&refs&anchors etc.

For simpler cases where we can consider each subschema as a concrete type and anyOf itself as typing.Union, then yes, it is much easier to implement. If you could specify what "simple case" you need(or need the most), then I can definitely accelerate the development of that case.

You may also be interested in FormatterBuilder.choice method which allows you to pass multiple FormatterBuilder.json()'s returning values, which behaves similar to AnyOf in simple cases.

@turboderp
Copy link
Contributor

I think the simple case most people need is just tool selection. I.e. the model wants to generate a tool call, but you don't know which tool to apply the schema for until the model starts emitting JSON. I think typically each tool has a unique "name" field to identify it, but I'm not sure if it's typically the first field or if some LMs might get confused if you force it to be the first field in order to dynamically swap in the corresponding subschema while generating.

But if FormatterBuilder.choice supports schemas as choices, that probably covers it well enough (?). If necessary, this could also be done one level higher in the generation pipeline by having multiple constraints evaluating in parallel and applying the union of their respective logit masks instead of the intersection. Then for every chosen token, disable those constraints for which the token would have been an invalid pick.

@gittb
Copy link

gittb commented Oct 8, 2024

Yup - Spot on. Thank you turbo.

@gittb
Copy link

gittb commented Oct 8, 2024

For clarity - here is an example of what this would look like in pydantic.

from pydantic import BaseModel
from typing import Union, Optional, List

class Tool1ParamModel(BaseModel):
    param1: str
    param2: int

class Tool2ParamModel(BaseModel):
    example_param: bool
    optional_param: Optional[int]

class Tool3ParamModel(BaseModel):
    another_param: str
    faster_schema_gen: int
    please: Optional[str]

class ToolModel1(BaseModel):
    name: str = "tool_1"
    param: Tool1ParamModel

class ToolModel2(BaseModel):
    name: str = "tool_2"
    param: Tool2ParamModel

class ToolModel3(BaseModel):
    name: str = "tool_3"
    param: Tool3ParamModel

class TopModel(BaseModel):
    tool_calls: List[Union[ToolModel1, ToolModel2, ToolModel3]]

@Dan-wanna-M
Copy link
Owner Author

For clarity - here is an example of what this would look like in pydantic.

from pydantic import BaseModel
from typing import Union, Optional, List

class Tool1ParamModel(BaseModel):
    param1: str
    param2: int

class Tool2ParamModel(BaseModel):
    example_param: bool
    optional_param: Optional[int]

class Tool3ParamModel(BaseModel):
    another_param: str
    faster_schema_gen: int
    please: Optional[str]

class ToolModel1(BaseModel):
    name: str = "tool_1"
    param: Tool1ParamModel

class ToolModel2(BaseModel):
    name: str = "tool_2"
    param: Tool2ParamModel

class ToolModel3(BaseModel):
    name: str = "tool_3"
    param: Tool3ParamModel

class TopModel(BaseModel):
    tool_calls: List[Union[ToolModel1, ToolModel2, ToolModel3]]

@gittb I see. Yeah, for that case FormatterBuilder.choice will not be very conveninent to use. Hopefully I can work out the simple case of AnyOf this week.

@Dan-wanna-M
Copy link
Owner Author

@gittb supported in v0.4.6 now.

@gittb
Copy link

gittb commented Oct 16, 2024

I appreciate you Dan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants