Document `u8`s that represent ASCII characters in config fields as characters in the docs #18669

jszwedko · 2023-09-25T19:18:32Z

For example, #18320, introduced a few new fields that take u8s that are supposed to be characters. These end up being documented as unsigned integers like so:

Even though in configuration they would be specified as character literals (i.e. , rather than 44).

The text was updated successfully, but these errors were encountered:

scMarkus · 2023-11-30T14:43:58Z

Is there a good way of representing u8 as characters in the documentation? I can not think of anything else then changing the datatype in the config to strings an converting them to u8 later when creating the actual serializer. In any case i can take over this task.

scMarkus · 2024-04-29T19:21:00Z

@jszwedko any thoughts on how to best solve this? I am by no means an rust or vector expert but I am happy to put in the work given some thought through guidance :)

jszwedko · 2024-04-30T18:49:49Z

I discovered that the delimiter is actually already marked as an "ASCII character" in the code:

vector/lib/codecs/src/encoding/format/csv.rs

Lines 77 to 83 in 056c2df

    
               /// The field delimiter to use when writing CSV. 
        
               #[serde( 
        
                   default = "default_delimiter", 
        
                   with = "vector_core::serde::ascii_char", 
        
                   skip_serializing_if = "vector_core::serde::is_default" 
        
               )] 
        
               pub delimiter: u8,

However this is not correctly translated in the cue definition:

vector/website/cue/reference/components/sinks/base/socket.cue

Lines 153 to 156 in 056c2df

    
           delimiter: { 
        
           	description: "The field delimiter to use when writing CSV." 
        
           	required:    false 
        
           	type: uint: default: 44

It should be using the ascii_char type mentioned here:

vector/website/cue/reference.cue

Line 500 in 056c2df

"ascii_char"?: #TypeAsciiChar & {_args: required: Args.required}

So I think the issue might lie in the translation from the configurable definition to the cue files (and potentially, past that, to the website rendering of the config options). The translation happens in a couple of places:

https://github.com/vectordotdev/vector/tree/master/lib/vector-config-macros is the macro that generates JSON schema (via vector generate-schema)
https://github.com/vectordotdev/vector/blob/master/scripts/generate-component-docs.rb is what takes the JSON schema and writes out the cue files

This is a nuanced issue 😅 If you are interested in digging into it, those would be some places to start though. Let me know if you have additional questions!

scMarkus · 2024-05-04T19:11:33Z

@jszwedko thanks a lot for those hints. what I found out so far (in the end I will ask for your intuition), is the generated json schema file, which is then fed into the ruby script does contain the undesired type already. so the faulty behavior happens before that.

regarding the generated json schema I have not yet identified the code doing the conversion but I found this list of valid types for the json schema (I could imagine based on that assumptions are made about u8 not being a string but a number):

vector/lib/vector-config-common/src/schema/json_schema.rs

Lines 537 to 550 in c2765f4

    
           /// The possible types of values in JSON Schema documents. 
        
           /// 
        
           /// See [JSON Schema 4.2.1. Instance Data Model](https://tools.ietf.org/html/draft-handrews-json-schema-02#section-4.2.1). 
        
           #[derive(Clone, Copy, Debug, Deserialize, Eq, Hash, Ord, PartialEq, PartialOrd, Serialize)] 
        
           #[serde(rename_all = "camelCase")] 
        
           pub enum InstanceType { 
        
               Null, 
        
               Boolean, 
        
               Object, 
        
               Array, 
        
               Number, 
        
               String, 
        
               Integer, 
        
           }

looking through the quite well written documentation I found this peace of insight, which basically hints on the type conversion being just inherently complicated...:

vector/lib/vector-config-macros/src/ast/field.rs

Lines 63 to 95 in c2765f4

    
               /// Type of the field. 
        
               /// 
        
               /// This is the as-defined type, and may not necessarily match the type we use for generating 
        
               /// the schema: see `delegated_ty` for more information. 
        
               pub fn ty(&self) -> &syn::Type { 
        
                   &self.original.ty 
        
               } 
        
               /// Delegated type of the field, if any. 
        
               /// 
        
               /// In some cases, helper types may be used to provide (de)serialization of types that cannot 
        
               /// have `Deserialize`/`Serialize`, such as types in the standard library, or may be used to 
        
               /// provide customized (de)serialization, such (de)serializing to and from a more human-readable 
        
               /// version of a type, like time strings that let you specify `1s` or `1 hour`, and don't just 
        
               /// force you to always specify the total number of seconds and nanoseconds, and so on. 
        
               /// 
        
               /// When these helper types are in use, we need to be able to understand what _they_ look like 
        
               /// when serialized so that our generated schema accurately reflects what we expect to get 
        
               /// during deserialization. Even though we may end up with a `T` in our configuration type, if 
        
               /// we're (de)serializing it like a `U`, then we care about `U` when generating the schema, not `T`. 
        
               /// 
        
               /// We currently scope this type to helper types defined with `serde_with`: the reason is 
        
               /// slightly verbose to explain (see `find_delegated_serde_deser_ty` for the details), but 
        
               /// unless `serde_with` is being used, specifically the `#[serde_as(as = "...")]` helper 
        
               /// attribute, then this will generally return `None`. 
        
               /// 
        
               /// If `#[serde_as(as = "...")]` _is_ being used, then `Some` is returned containing a reference 
        
               /// to the delegated (de)serialization type. Again, see `find_delegated_serde_deser_ty` for more 
        
               /// details about exactly what we look for to figure out if delegation is occurring, and the 
        
               /// caveats around our approach. 
        
               pub fn delegated_ty(&self) -> Option<&syn::Type> { 
        
                   self.attrs.delegated_ty.as_ref() 
        
               }

going one step further finding this hint which explicetly states the #[serde(with = "...")] syntax. maybe line 278 contains some type casting magic? I have to admit not fully understating what is happening there:

vector/lib/vector-config-macros/src/ast/util.rs

Lines 243 to 281 in c2765f4

    
           /// Tries to find a delegated (de)serialization type from attributes. 
        
           /// 
        
           /// In some cases, the `serde_with` crate, more specifically the `serde_as` attribute macro, may be 
        
           /// used to help (de)serialize a field/container with type A via a (de)implementation on type B, in order to 
        
           /// provide more ergonomic (de)serialization of values that can represent type A without needing to 
        
           /// explicitly match type A when (de)serialized. This is similar to `serde`'s existing support for 
        
           /// "remote" types but is taken further with a more generic and extensible approach. 
        
           /// 
        
           /// This, however, presents an issue because while normally we can handle scenarios like 
        
           /// `#[serde(from = "...")]` and its siblings, `serde_as` depends on `#[serde(with = "...")]` and 
        
           /// the fact that it simply constructs a path to the (de)serialize methods, rather than always 
        
           /// needing to explicitly reference a type. This means that we cannot simply grab the value of the 
        
           /// `with` name/value pair blindly, and assume if there's a value that a delegated/remote type is in 
        
           /// play... it could be a module path, too. 
        
           /// 
        
           /// This method looks for two indicators to understand when it should be able to extract the 
        
           /// delegated type: 
        
           /// 
        
           /// - `#[serde(with = "...")]` is present 
        
           /// - `#[serde_as(as = "...")]` is present 
        
           /// 
        
           /// When both of these are true, we can rely on the fact that the value of `with` will be a valid 
        
           /// type path, and usable like a virtual newtype, which is where we use the type specified for 
        
           /// `try_from`/`from`/`into` for the delegated (de)serialization type of a container itself. 
        
           /// 
        
           /// If we find both of those attribute name/value pairs, and the value of `with` can be parsed 
        
           /// successfully as a type path, `Some(...)` is returned, contained the type. Otherwise, `None` is 
        
           /// returned. 
        
           pub fn find_delegated_serde_deser_ty(attributes: &[syn::Attribute]) -> Option<syn::Type> { 
        
               // Make sure `#[serde_as(as = "...")]` is present. 
        
               find_name_value_attribute(attributes, "serde_as", "r#as") 
        
                   // Make sure `#[serde(with = "...")]` is present, and grab its value. 
        
                   .and_then(|_| find_name_value_attribute(attributes, "serde", "with")) 
        
                   // Try and parse the value as a type path. 
        
                   .and_then(|with| match with { 
        
                       Lit::Str(s) => s.parse::<syn::Type>().ok(), 
        
                       _ => None, 
        
                   }) 
        
           }

I would be willing to look into this more deeply but first i like to check back if I am on the right track?

tobz · 2024-05-14T13:26:52Z

@scMarkus 👋🏻 I'm the engineer who originally wrote all of the Configurable macro machinery and Jesse asked me to take a look at this issue.

Overall, while delegated types are sort of related to the problem, there's more of a fundamental underlying issue here: both Configurable and the documentation generation script don't actually handle ascii_char, or have a concept of it, really.

Essentially, there's no type in the standard library that represents an ASCII character. UTF-8 characters? Sure, char. Those may or may not be ASCII, though. We have two approaches we can take: simply annotate all these u8 fields being used as a way to handle an ASCII character, or create a newtype wrapper to enforce it.

The simpler approach is just annotating the u8 fields, because it solves half of the problem which is getting the right type into the documentation generation script. We can do this by adding the following annotation to the field: #[configurable(metadata(docs::type_override = "ascii_char"))].

This adds metadata that tells the documentation generation script to ignore determining the field type itself, and to instead just use the type we've given it. That gets us generating these fields as being ascii_char, to follow the logic in the Cue reference schema.

The second part we have to solve, likely, is actually having the documentation render that default value -- e.g. 44 -- as an ASCII character. My guess is that it may just end up rendering it as an integer, or maybe not at all, even after making the change to override the field type to ascii_char.

That's something you'll likely need to test, but we can also work through that as part of any PR you open.

scMarkus · 2024-05-14T14:53:10Z

Hi @tobz
thanks a lot for your input. you seam to be quite right. I naively added your idea.

    /// The field delimiter to use when writing CSV.
    #[configurable(metadata(docs::type_override = "ascii_char"))]
    #[serde(
        default = "default_delimiter",
        with = "vector_core::serde::ascii_char",
        skip_serializing_if = "vector_core::serde::is_default"
    )]
    pub delimiter: u8,

and looked at the intermediate json output. which changes from

            "delimiter": {
              "description": "The field delimiter to use when writing CSV.",
              "default": 44,
              "type": "integer",
              "maximum": 255.0,
              "minimum": 0.0,
              "_metadata": {
                "docs::numeric_type": "uint",
                "docs::human_name": "Delimiter"
              }
            },

to

            "delimiter": {
              "description": "The field delimiter to use when writing CSV.",
              "default": 44,
              "type": "integer",
              "maximum": 255.0,
              "minimum": 0.0,
              "_metadata": {
                "docs::numeric_type": "uint",
                "docs::type_override": "ascii_char",
                "docs::human_name": "Delimiter"
              }
            },

which makes the sinks cue file change but not website/cue/reference/remap/functions/parse_csv.cue file

diff --git a/website/cue/reference/components/sinks/base/aws_s3.cue b/website/cue/reference/components/sinks/base/aws_s3.cue
index b8cb9be0d..769c2d4d7 100644
--- a/website/cue/reference/components/sinks/base/aws_s3.cue
+++ b/website/cue/reference/components/sinks/base/aws_s3.cue
@@ -421,7 +421,7 @@ base: components: sinks: aws_s3: configuration: {
                                        delimiter: {
                                                description: "The field delimiter to use when writing CSV."
                                                required:    false
-                                               type: uint: default: 44
+                                               type: ascii_char: {}
                                        }
                                        double_quote: {
                                                description: """

I do agree to opening and working on a PR.

EDIT:
rereading previous comment I noticed there being a hint of maybe not just use ascii chars but support utf8 as well in the documentation. in my opinion utf8 would in general be nice for vector but then probably should be string anyway? in case of the csv parsing the underlying lib asks for a u8 specifically.

tobz · 2024-05-20T14:40:11Z

Ah, yeah. The documentation for remap functions is not generated automatically through all of the Configurable stuff, so it would have to be edited manually no matter what.

scMarkus · 2024-08-05T07:36:41Z

@jszwedko is this still open? I am asking because #20498 was merged.

jszwedko · 2024-08-05T14:09:41Z

@scMarkus thanks for noticing! I'll close this.

jszwedko added the domain: external docs Anything related to Vector's external, public documentation label Sep 25, 2023

scMarkus mentioned this issue Oct 31, 2023

CSV encoder docu shows numbers instead of characters as defaults #18999

Closed

scMarkus mentioned this issue May 15, 2024

fix(codecs) csv documentation defaults as ascii_char #20498

Merged

jszwedko closed this as completed Aug 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Document `u8`s that represent ASCII characters in config fields as characters in the docs #18669

Document `u8`s that represent ASCII characters in config fields as characters in the docs #18669

jszwedko commented Sep 25, 2023

scMarkus commented Nov 30, 2023 •

edited

Loading

scMarkus commented Apr 29, 2024

jszwedko commented Apr 30, 2024

scMarkus commented May 4, 2024

tobz commented May 14, 2024

scMarkus commented May 14, 2024 •

edited

Loading

tobz commented May 20, 2024

scMarkus commented Aug 5, 2024

jszwedko commented Aug 5, 2024

Document u8s that represent ASCII characters in config fields as characters in the docs #18669

Document u8s that represent ASCII characters in config fields as characters in the docs #18669

Comments

jszwedko commented Sep 25, 2023

scMarkus commented Nov 30, 2023 • edited Loading

scMarkus commented Apr 29, 2024

jszwedko commented Apr 30, 2024

scMarkus commented May 4, 2024

tobz commented May 14, 2024

scMarkus commented May 14, 2024 • edited Loading

tobz commented May 20, 2024

scMarkus commented Aug 5, 2024

jszwedko commented Aug 5, 2024

Document `u8`s that represent ASCII characters in config fields as characters in the docs #18669

Document `u8`s that represent ASCII characters in config fields as characters in the docs #18669

scMarkus commented Nov 30, 2023 •

edited

Loading

scMarkus commented May 14, 2024 •

edited

Loading