Support schemas that include non type-level constraint #12

Dan-wanna-M · 2024-08-28T01:22:12Z

In Json Schema and Pydantic Fields a lot of non type-level constraint are provided. Not all of them can be implemented only with CFG, but we should examine all of them and try to implement as much as possible.

✅- has been supported
🟢- can be supported without incurring any potential problems
🟡- can be supported but can lead to serious problems if misused
🟠- can be supported but require significant efforts and functionalities may be limited
❌- impossible to support within the expressiveness of a context free grammar

Dan-wanna-M · 2024-09-15T00:48:26Z

Json schema

String
- ✅🟡minLength/MaxLength (v0.4.6)
- ✅🟡pattern(v0.4.6)
- 🟠format
Numeric types
- ❌multipleOf
- 🟠minimum, exclusiveMinimum, maximum, exclusiveMaximum
  - ✅positive, negative, nonnegative,nonpositive(v0.4.6)
  - ❌others
Object
- ✅properties
- 🟠patternProperties
- ❌additionalProperties(we always assume additionalProperties is false due to implementation limitations)
- ❌unevaluatedProperties(we always assume unevaluatedProperties is false due to implementation limitations)
- ✅🟡required
- 🟠propertyNames
- 🟠minProperties/maxProperties
Array
- ✅items
- ✅🟡prefixItems(v0.4.6)
- ✅additionalItems(v0.4.6)
- ❌unevaluatedItems
- ❌contains/minContains/maxContains
- ✅🟡minItems/maxItems(v0.4.6)
- ❌uniqueItems
- 🟠contentMediaType/contentEncoding
Schema composition
- ❌allOf/oneOf/not
- 🟠anyOf
  - ✅no common parts of subschemas factored out(v0.4.6)
  - 🟠with common parts of subschemas factored out
- ❌dependentRequired/dependentSchemas/if/then/else
Misc
- ✅Top-level array(v0.4.3)

Pydantic

str
- ✅min_length/max_length(v0.4.6)
- ✅pattern(v0.4.6)
int/float
- ❌multiple_of
- 🟢allow_inf_nan
- 🟠gt, ge, lt, le
  - ✅positive, negative, nonnegative,nonpositive(v0.4.6)
  - ❌others
🟢decimal
- 🟢max_digits
- 🟠decimal_places
sequence-like types
- ✅🟡min_length/max_length(v0.4.6)
Misc
- ❌lax mode format enforcement
  - I can't see why it is useful for a llm.

turboderp · 2024-10-01T12:49:25Z

Can you elaborate on "serious problems if misused"? minItems/maxItems and minLength/MaxLength especially would be very nice to have.

Also, contains/minContains sound like they should be sort of doable if you enforce a subschema for the first or the first n elements in an array.

Dan-wanna-M · 2024-10-01T21:50:21Z

Can you elaborate on "serious problems if misused"? minItems/maxItems and minLength/MaxLength especially would be very nice to have.

@turboderp The main problem is that their space complexity is quadratic. Due to the way regular grammar/context free grammar works, to implement counted repetition(aka maxLength/MaxItems) means to enumerate over all the possible lengths of the item from minItems to maxItems. This could lead to unexpected significant slowdown and/or enormous CPU memory usage. In the context of TabbyAPI(or in general for individual users) however, it probably does not matter that much.

The same is true for other features labelled with 🟡. They are useful but if misused can lead to quadratic/exponential/factorial time/space complexity. I definitely will support them one day, but I would also like to remind myself to add a safety section for them somewhere in the docs. Then, hopefully someone will not be left confused when they found their program used 100GB system RAM and got killed by OS.

For the case of minItems/maxItems and minLength/MaxLength specifically though, I think it is possible to "hack" the kbnf engine to handle the counted repetition efficiently by embedding explicit integer counters into Earley item state. Its effect on the codebase complexity and/or efficiency is unknown though. I probably will stick to standard context free grammar/regex when implementing this feature in Formatron first and see if there are many users who do need very large counted repetition.

Also, contains/minContains sound like they should be sort of doable if you enforce a subschema for the first or the first n elements in an array.

Yeah, this does mean we need to modify its semantics(the array must contain at least one in the first n elements) though. The more serious problem is that optional item(nullable nonterminal in kbnf) can leads to exponential time complexity. This means it is only practical where maxItems-minItems is small(which usually is, in the context of optional object properties).

turboderp · 2024-10-01T22:53:52Z

I see. The main use cases I picture would be for CoT-type sampling where you might want structured generation of "1 to 5 follow-up questions" or some such. I guess quadratic complexity in that case wouldn't be a showstopper, but it would be harder for "up to 100 characters" in the case of a string. In any case, thanks for your work, and I will remain hopeful. :)

gittb · 2024-10-07T19:21:19Z

Dan - AnyOf is crucial for tool calling and agent workflows. I noticed you flagged it as 🟠.

Without AnyOf, we lose the ability to use unions of models. This matters for tool calling where the client provides multiple tools, and we need to guide the model in writing the correct data types and argument names for each tool. Without AnyOf, the model can’t write a list of calls where each call is AnyOf a set of models representing those client-provided tools.

Could you tell me a bit about the difficulties that lie ahead with implementing that?

Dan-wanna-M · 2024-10-07T20:01:38Z

Could you tell me a bit about the difficulties that lie ahead with implementing that?

@gittb The problem is that AnyOf is more complex than sinple "union of schemas," which is much easier to support. See https://json-schema.org/understanding-json-schema/reference/combining#factoringschemas for an example: it allows factors out of common schema parts which do not interoperate well with context free grammar's model. This would require us to do some "common part" propagation that likely will then mess with duplicate fields&refs&anchors etc.

For simpler cases where we can consider each subschema as a concrete type and anyOf itself as typing.Union, then yes, it is much easier to implement. If you could specify what "simple case" you need(or need the most), then I can definitely accelerate the development of that case.

You may also be interested in FormatterBuilder.choice method which allows you to pass multiple FormatterBuilder.json()'s returning values, which behaves similar to AnyOf in simple cases.

turboderp · 2024-10-08T12:51:06Z

I think the simple case most people need is just tool selection. I.e. the model wants to generate a tool call, but you don't know which tool to apply the schema for until the model starts emitting JSON. I think typically each tool has a unique "name" field to identify it, but I'm not sure if it's typically the first field or if some LMs might get confused if you force it to be the first field in order to dynamically swap in the corresponding subschema while generating.

But if FormatterBuilder.choice supports schemas as choices, that probably covers it well enough (?). If necessary, this could also be done one level higher in the generation pipeline by having multiple constraints evaluating in parallel and applying the union of their respective logit masks instead of the intersection. Then for every chosen token, disable those constraints for which the token would have been an invalid pick.

gittb · 2024-10-08T13:24:25Z

Yup - Spot on. Thank you turbo.

gittb · 2024-10-08T13:32:04Z

For clarity - here is an example of what this would look like in pydantic.

from pydantic import BaseModel
from typing import Union, Optional, List

class Tool1ParamModel(BaseModel):
    param1: str
    param2: int

class Tool2ParamModel(BaseModel):
    example_param: bool
    optional_param: Optional[int]

class Tool3ParamModel(BaseModel):
    another_param: str
    faster_schema_gen: int
    please: Optional[str]

class ToolModel1(BaseModel):
    name: str = "tool_1"
    param: Tool1ParamModel

class ToolModel2(BaseModel):
    name: str = "tool_2"
    param: Tool2ParamModel

class ToolModel3(BaseModel):
    name: str = "tool_3"
    param: Tool3ParamModel

class TopModel(BaseModel):
    tool_calls: List[Union[ToolModel1, ToolModel2, ToolModel3]]

Dan-wanna-M · 2024-10-08T18:19:12Z

For clarity - here is an example of what this would look like in pydantic.

from pydantic import BaseModel
from typing import Union, Optional, List

class Tool1ParamModel(BaseModel):
    param1: str
    param2: int

class Tool2ParamModel(BaseModel):
    example_param: bool
    optional_param: Optional[int]

class Tool3ParamModel(BaseModel):
    another_param: str
    faster_schema_gen: int
    please: Optional[str]

class ToolModel1(BaseModel):
    name: str = "tool_1"
    param: Tool1ParamModel

class ToolModel2(BaseModel):
    name: str = "tool_2"
    param: Tool2ParamModel

class ToolModel3(BaseModel):
    name: str = "tool_3"
    param: Tool3ParamModel

class TopModel(BaseModel):
    tool_calls: List[Union[ToolModel1, ToolModel2, ToolModel3]]

@gittb I see. Yeah, for that case FormatterBuilder.choice will not be very conveninent to use. Hopefully I can work out the simple case of AnyOf this week.

Dan-wanna-M · 2024-10-12T23:10:53Z

@gittb supported in v0.4.6 now.

gittb · 2024-10-16T11:36:33Z

I appreciate you Dan.

Dan-wanna-M self-assigned this Aug 28, 2024

Dan-wanna-M added the enhancement New feature or request label Aug 28, 2024

Dan-wanna-M mentioned this issue Sep 10, 2024

Native support for Json Schema #11

Closed

bdashore3 mentioned this issue Sep 27, 2024

[REQUEST] Add support for new output formatter Formatron theroyallab/tabbyAPI#212

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support schemas that include non type-level constraint #12

Support schemas that include non type-level constraint #12

Dan-wanna-M commented Aug 28, 2024 •

edited

Loading

Dan-wanna-M commented Sep 15, 2024 •

edited

Loading

turboderp commented Oct 1, 2024

Dan-wanna-M commented Oct 1, 2024

turboderp commented Oct 1, 2024

gittb commented Oct 7, 2024 •

edited

Loading

Dan-wanna-M commented Oct 7, 2024 •

edited

Loading

turboderp commented Oct 8, 2024

gittb commented Oct 8, 2024

gittb commented Oct 8, 2024 •

edited

Loading

Dan-wanna-M commented Oct 8, 2024

Dan-wanna-M commented Oct 12, 2024

gittb commented Oct 16, 2024

Support schemas that include non type-level constraint #12

Support schemas that include non type-level constraint #12

Comments

Dan-wanna-M commented Aug 28, 2024 • edited Loading

Dan-wanna-M commented Sep 15, 2024 • edited Loading

Json schema

Pydantic

turboderp commented Oct 1, 2024

Dan-wanna-M commented Oct 1, 2024

turboderp commented Oct 1, 2024

gittb commented Oct 7, 2024 • edited Loading

Dan-wanna-M commented Oct 7, 2024 • edited Loading

turboderp commented Oct 8, 2024

gittb commented Oct 8, 2024

gittb commented Oct 8, 2024 • edited Loading

Dan-wanna-M commented Oct 8, 2024

Dan-wanna-M commented Oct 12, 2024

gittb commented Oct 16, 2024

Dan-wanna-M commented Aug 28, 2024 •

edited

Loading

Dan-wanna-M commented Sep 15, 2024 •

edited

Loading

gittb commented Oct 7, 2024 •

edited

Loading

Dan-wanna-M commented Oct 7, 2024 •

edited

Loading

gittb commented Oct 8, 2024 •

edited

Loading