Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document u8s that represent ASCII characters in config fields as characters in the docs #18669

Closed
jszwedko opened this issue Sep 25, 2023 · 9 comments
Labels
domain: external docs Anything related to Vector's external, public documentation

Comments

@jszwedko
Copy link
Member

For example, #18320, introduced a few new fields that take u8s that are supposed to be characters. These end up being documented as unsigned integers like so:

Screenshot 2023-09-25 at 12 17 52 PM

Even though in configuration they would be specified as character literals (i.e. , rather than 44).

@jszwedko jszwedko added the domain: external docs Anything related to Vector's external, public documentation label Sep 25, 2023
@scMarkus
Copy link
Contributor

scMarkus commented Nov 30, 2023

Is there a good way of representing u8 as characters in the documentation? I can not think of anything else then changing the datatype in the config to strings an converting them to u8 later when creating the actual serializer. In any case i can take over this task.

@scMarkus
Copy link
Contributor

@jszwedko any thoughts on how to best solve this? I am by no means an rust or vector expert but I am happy to put in the work given some thought through guidance :)

@jszwedko
Copy link
Member Author

I discovered that the delimiter is actually already marked as an "ASCII character" in the code:

/// The field delimiter to use when writing CSV.
#[serde(
default = "default_delimiter",
with = "vector_core::serde::ascii_char",
skip_serializing_if = "vector_core::serde::is_default"
)]
pub delimiter: u8,

However this is not correctly translated in the cue definition:

delimiter: {
description: "The field delimiter to use when writing CSV."
required: false
type: uint: default: 44

It should be using the ascii_char type mentioned here:

"ascii_char"?: #TypeAsciiChar & {_args: required: Args.required}

So I think the issue might lie in the translation from the configurable definition to the cue files (and potentially, past that, to the website rendering of the config options). The translation happens in a couple of places:

This is a nuanced issue 😅 If you are interested in digging into it, those would be some places to start though. Let me know if you have additional questions!

@scMarkus
Copy link
Contributor

scMarkus commented May 4, 2024

@jszwedko thanks a lot for those hints. what I found out so far (in the end I will ask for your intuition), is the generated json schema file, which is then fed into the ruby script does contain the undesired type already. so the faulty behavior happens before that.

regarding the generated json schema I have not yet identified the code doing the conversion but I found this list of valid types for the json schema (I could imagine based on that assumptions are made about u8 not being a string but a number):

/// The possible types of values in JSON Schema documents.
///
/// See [JSON Schema 4.2.1. Instance Data Model](https://tools.ietf.org/html/draft-handrews-json-schema-02#section-4.2.1).
#[derive(Clone, Copy, Debug, Deserialize, Eq, Hash, Ord, PartialEq, PartialOrd, Serialize)]
#[serde(rename_all = "camelCase")]
pub enum InstanceType {
Null,
Boolean,
Object,
Array,
Number,
String,
Integer,
}

looking through the quite well written documentation I found this peace of insight, which basically hints on the type conversion being just inherently complicated...:

/// Type of the field.
///
/// This is the as-defined type, and may not necessarily match the type we use for generating
/// the schema: see `delegated_ty` for more information.
pub fn ty(&self) -> &syn::Type {
&self.original.ty
}
/// Delegated type of the field, if any.
///
/// In some cases, helper types may be used to provide (de)serialization of types that cannot
/// have `Deserialize`/`Serialize`, such as types in the standard library, or may be used to
/// provide customized (de)serialization, such (de)serializing to and from a more human-readable
/// version of a type, like time strings that let you specify `1s` or `1 hour`, and don't just
/// force you to always specify the total number of seconds and nanoseconds, and so on.
///
/// When these helper types are in use, we need to be able to understand what _they_ look like
/// when serialized so that our generated schema accurately reflects what we expect to get
/// during deserialization. Even though we may end up with a `T` in our configuration type, if
/// we're (de)serializing it like a `U`, then we care about `U` when generating the schema, not `T`.
///
/// We currently scope this type to helper types defined with `serde_with`: the reason is
/// slightly verbose to explain (see `find_delegated_serde_deser_ty` for the details), but
/// unless `serde_with` is being used, specifically the `#[serde_as(as = "...")]` helper
/// attribute, then this will generally return `None`.
///
/// If `#[serde_as(as = "...")]` _is_ being used, then `Some` is returned containing a reference
/// to the delegated (de)serialization type. Again, see `find_delegated_serde_deser_ty` for more
/// details about exactly what we look for to figure out if delegation is occurring, and the
/// caveats around our approach.
pub fn delegated_ty(&self) -> Option<&syn::Type> {
self.attrs.delegated_ty.as_ref()
}

going one step further finding this hint which explicetly states the #[serde(with = "...")] syntax. maybe line 278 contains some type casting magic? I have to admit not fully understating what is happening there:

/// Tries to find a delegated (de)serialization type from attributes.
///
/// In some cases, the `serde_with` crate, more specifically the `serde_as` attribute macro, may be
/// used to help (de)serialize a field/container with type A via a (de)implementation on type B, in order to
/// provide more ergonomic (de)serialization of values that can represent type A without needing to
/// explicitly match type A when (de)serialized. This is similar to `serde`'s existing support for
/// "remote" types but is taken further with a more generic and extensible approach.
///
/// This, however, presents an issue because while normally we can handle scenarios like
/// `#[serde(from = "...")]` and its siblings, `serde_as` depends on `#[serde(with = "...")]` and
/// the fact that it simply constructs a path to the (de)serialize methods, rather than always
/// needing to explicitly reference a type. This means that we cannot simply grab the value of the
/// `with` name/value pair blindly, and assume if there's a value that a delegated/remote type is in
/// play... it could be a module path, too.
///
/// This method looks for two indicators to understand when it should be able to extract the
/// delegated type:
///
/// - `#[serde(with = "...")]` is present
/// - `#[serde_as(as = "...")]` is present
///
/// When both of these are true, we can rely on the fact that the value of `with` will be a valid
/// type path, and usable like a virtual newtype, which is where we use the type specified for
/// `try_from`/`from`/`into` for the delegated (de)serialization type of a container itself.
///
/// If we find both of those attribute name/value pairs, and the value of `with` can be parsed
/// successfully as a type path, `Some(...)` is returned, contained the type. Otherwise, `None` is
/// returned.
pub fn find_delegated_serde_deser_ty(attributes: &[syn::Attribute]) -> Option<syn::Type> {
// Make sure `#[serde_as(as = "...")]` is present.
find_name_value_attribute(attributes, "serde_as", "r#as")
// Make sure `#[serde(with = "...")]` is present, and grab its value.
.and_then(|_| find_name_value_attribute(attributes, "serde", "with"))
// Try and parse the value as a type path.
.and_then(|with| match with {
Lit::Str(s) => s.parse::<syn::Type>().ok(),
_ => None,
})
}

I would be willing to look into this more deeply but first i like to check back if I am on the right track?

@tobz
Copy link
Contributor

tobz commented May 14, 2024

@scMarkus 👋🏻 I'm the engineer who originally wrote all of the Configurable macro machinery and Jesse asked me to take a look at this issue.

Overall, while delegated types are sort of related to the problem, there's more of a fundamental underlying issue here: both Configurable and the documentation generation script don't actually handle ascii_char, or have a concept of it, really.

Essentially, there's no type in the standard library that represents an ASCII character. UTF-8 characters? Sure, char. Those may or may not be ASCII, though. We have two approaches we can take: simply annotate all these u8 fields being used as a way to handle an ASCII character, or create a newtype wrapper to enforce it.

The simpler approach is just annotating the u8 fields, because it solves half of the problem which is getting the right type into the documentation generation script. We can do this by adding the following annotation to the field: #[configurable(metadata(docs::type_override = "ascii_char"))].

This adds metadata that tells the documentation generation script to ignore determining the field type itself, and to instead just use the type we've given it. That gets us generating these fields as being ascii_char, to follow the logic in the Cue reference schema.

The second part we have to solve, likely, is actually having the documentation render that default value -- e.g. 44 -- as an ASCII character. My guess is that it may just end up rendering it as an integer, or maybe not at all, even after making the change to override the field type to ascii_char.

That's something you'll likely need to test, but we can also work through that as part of any PR you open.

@scMarkus
Copy link
Contributor

scMarkus commented May 14, 2024

Hi @tobz
thanks a lot for your input. you seam to be quite right. I naively added your idea.

    /// The field delimiter to use when writing CSV.
    #[configurable(metadata(docs::type_override = "ascii_char"))]
    #[serde(
        default = "default_delimiter",
        with = "vector_core::serde::ascii_char",
        skip_serializing_if = "vector_core::serde::is_default"
    )]
    pub delimiter: u8,

and looked at the intermediate json output. which changes from

            "delimiter": {
              "description": "The field delimiter to use when writing CSV.",
              "default": 44,
              "type": "integer",
              "maximum": 255.0,
              "minimum": 0.0,
              "_metadata": {
                "docs::numeric_type": "uint",
                "docs::human_name": "Delimiter"
              }
            },

to

            "delimiter": {
              "description": "The field delimiter to use when writing CSV.",
              "default": 44,
              "type": "integer",
              "maximum": 255.0,
              "minimum": 0.0,
              "_metadata": {
                "docs::numeric_type": "uint",
                "docs::type_override": "ascii_char",
                "docs::human_name": "Delimiter"
              }
            },

which makes the sinks cue file change but not website/cue/reference/remap/functions/parse_csv.cue file

diff --git a/website/cue/reference/components/sinks/base/aws_s3.cue b/website/cue/reference/components/sinks/base/aws_s3.cue
index b8cb9be0d..769c2d4d7 100644
--- a/website/cue/reference/components/sinks/base/aws_s3.cue
+++ b/website/cue/reference/components/sinks/base/aws_s3.cue
@@ -421,7 +421,7 @@ base: components: sinks: aws_s3: configuration: {
                                        delimiter: {
                                                description: "The field delimiter to use when writing CSV."
                                                required:    false
-                                               type: uint: default: 44
+                                               type: ascii_char: {}
                                        }
                                        double_quote: {
                                                description: """

I do agree to opening and working on a PR.

EDIT:
rereading previous comment I noticed there being a hint of maybe not just use ascii chars but support utf8 as well in the documentation. in my opinion utf8 would in general be nice for vector but then probably should be string anyway? in case of the csv parsing the underlying lib asks for a u8 specifically.

@tobz
Copy link
Contributor

tobz commented May 20, 2024

Ah, yeah. The documentation for remap functions is not generated automatically through all of the Configurable stuff, so it would have to be edited manually no matter what.

@scMarkus
Copy link
Contributor

scMarkus commented Aug 5, 2024

@jszwedko is this still open? I am asking because #20498 was merged.

@jszwedko
Copy link
Member Author

jszwedko commented Aug 5, 2024

@scMarkus thanks for noticing! I'll close this.

@jszwedko jszwedko closed this as completed Aug 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: external docs Anything related to Vector's external, public documentation
Projects
None yet
Development

No branches or pull requests

3 participants