
Data transformation before serialization #471

Closed · joscha opened this issue Sep 11, 2024 · 8 comments

joscha (Contributor) commented Sep 11, 2024

I am generating both an Avro schema and a client from the same OpenAPI spec.
The spec contains a whole bunch of enum values that are hyphenated, see https://github.com/planet-a-ventures/affinity-node/blob/364a0b2ac2f86d04b128f8ebb0ac8560f36ab8dd/openapi/2024-09-05.json#L2892-L2910

I opened a pull request for the Avro schema generator, OpenAPITools/openapi-generator#19549, which transforms these enum values into valid ones that adhere to the Avro spec.

One thing I haven't solved ideally yet, however, is how to transform the data contents of each POJO that I want to write to a file via avsc.

I am currently doing something like this:

  const encoder = Avro.createFileEncoder('companies.avro', registry['model.Company']);
  for (const company of companies) {
    if (company.fields) {
      company.fields = company.fields.map((field: Field) => {
        // Replace every hyphen, not just the first one, so the values
        // become valid Avro enum symbols.
        if (field.enrichmentSource) {
          field.enrichmentSource = field.enrichmentSource.replace(/-/g, '_') as any;
        }
        if (field.type) {
          field.type = field.type.replace(/-/g, '_') as any;
        }
        return field;
      });
    }
    encoder.write(company);
  }
  encoder.end();

which, even if I factor out the transform and detect enum fields automatically, is not ideal, especially in a complex schema like the one linked above.
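
For illustration, the factored-out version would look roughly like this (a sketch only; sanitizeEnumSymbol and the hard-coded key list are placeholders):

// Sketch: a reusable transform; the helper name and the list of
// enum-valued keys are placeholders, not part of the real code.
const sanitizeEnumSymbol = (val: string): string => val.replace(/-/g, "_");

function sanitizeField(field: Field): Field {
  for (const key of ["enrichmentSource", "type"] as const) {
    const val = field[key];
    if (typeof val === "string") {
      (field as any)[key] = sanitizeEnumSymbol(val);
    }
  }
  return field;
}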

So I was wondering what your thoughts are on introducing an additional optional callback property, sanitizeEnumValue = (property: string) => string or similar, on the BlockEncoder class that is called for each serialized enum value and can transform the value passed to the encoder, similar to

parseHook: (schema: Schema) => Type

joscha (Contributor, Author) commented Sep 12, 2024

I'd actually like to upgrade this feature suggestion to a generic transformer callback, which receives a value just before it is written and can transform and return it.
Something that resembles this monkey patch but with a proper API:

  const origStringWrite = Avro.types.StringType.prototype._write;
  Avro.types.StringType.prototype._write = function (tap, val) {
    if (typeof val === 'object' && val !== null) {
      if (val.link && val.text) {
        // text value is actually a JSON object representing a link; transform to corresponding markdown
        val = `[${val.text.replaceAll(']', '\\]')}](${val.link})`;
      }
    }
    // preserve `this` so the original implementation still sees its type instance
    origStringWrite.call(this, tap, val);
  };

i.e.:

Avro.createFileEncoder('companies.avro', mySchema, {
  valueTransformer: (val, type) => val, // sample identity implementation
});

joscha changed the title from "Enum transformation" to "Data transformation before serialization" on Sep 12, 2024
mtth (Owner) commented Sep 13, 2024

You may be able to do what you want using logical types. For example, here is a basic link implementation:

const avro = require('avsc');

const linkPattern = /\[([^\]]+)\]\(([^)]+)\)/;

class LinkType extends avro.types.LogicalType {
  _fromValue(val) {
    const match = linkPattern.exec(val);
    if (!match) {
      throw new Error(`invalid link: ${val}`);
    }
    return {text: match[1], url: match[2]};
  }
  _toValue(arg) { return `[${arg.text}](${arg.url})`; }
}

You could use it as follows:

const schema = {
  type: 'record',
  name: 'Example',
  fields: [
    {name: 'ref', type: {type: 'string', logicalType: 'link'}},
  ],
}

const type = avro.Type.forSchema(schema, {logicalTypes: {link: LinkType}});
const val = {ref: {text: 'cool text', url: 'http://example.com'}};
const str = type.toString(val);
console.log(str); // {"ref":"[cool text](http://example.com)"}
console.log(type.fromString(str)); // val

Logical types can be applied to any Avro type, allowing you to use custom data types across your entire schema by adding corresponding logicalType annotations.

If you don't have control over the schema to add the annotations, you can also apply them dynamically using a type hook. This would also give you more flexibility, including the ability to decorate unions. See #329 (comment) for a related example.
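
For illustration, a rough sketch of such a hook (assumption: it tags every object-form string schema with the link logical type from above; real logic would be more selective):

const type = avro.Type.forSchema(schema, {
  logicalTypes: {link: LinkType},
  typeHook(s, opts) {
    // Mutate object-form string schemas to carry the logical type.
    // Returning undefined lets the default type be used for this schema.
    if (typeof s === 'object' && s.type === 'string' && !s.logicalType) {
      s.logicalType = 'link';
    }
  },
});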

joscha (Contributor, Author) commented Sep 13, 2024

> If you don't have control over the schema to add the annotations, you can also apply them dynamically using a type hook.

I have only little control over the schema: I am using https://openapi-generator.tech/docs/generators/avro-schema/ to generate the Avro schemas. Theoretically I can patch each generated type to add a logicalType, but that would make the automatic generation somewhat tedious, as I'd need to apply these patches every time I regenerate.

> This would also give you more flexibility, including the ability to decorate unions. See #329 (comment) for a related example.

I see. This seems like it requires a lot more implementation than a transformer function; I am basically extending the schema just for the sake of calling the same transformer, packaged in a logical type extension.
I can give this a try, however I am wondering: does the logicalType become part of the generated schema when using this in a file encoder? I intend to feed the resulting Avro file to Snowflake via COPY INTO and wouldn't have any control there over the decoder implementation.

mtth (Owner) commented Sep 16, 2024

> I can give this a try, however I am wondering: does the logicalType become part of the generated schema when using this in a file encoder? I intend to feed the resulting Avro file to Snowflake via COPY INTO and wouldn't have any control there over the decoder implementation.

You can choose whether to include the logical type or not in the encoder's schema. Snowflake may not support them, however.

joscha (Contributor, Author) commented Sep 16, 2024

> You can choose whether to include the logical type or not in the encoder's schema.

How do I do that, please? Sorry if I missed it somewhere, but I can find it neither in the types nor in the logical type docs.

> Snowflake may not support them, however.

Is there documentation anywhere on how this behaves with regard to portability? I don't believe any logical types are supported by Snowflake, so if I used them to encode but didn't include them in the exported schema, would Snowflake just use the underlying type to interpret the data?

mtth (Owner) commented Sep 18, 2024

> How do I do that, please? Sorry if I missed it somewhere, but I can find it neither in the types nor in the logical type docs.

Via the file encoder's schema argument. However, it doesn't look like it's currently possible to strip the logical types if they are used to encode the value. We could add this if needed (though it probably isn't useful, see just below).

> Is there documentation anywhere on how this behaves with regard to portability?

The Avro spec mandates that implementations ignore any unknown logical type (see here). In this case, the underlying type is used.
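
For example (a sketch reusing the Example record from above): decoding with a type that doesn't have link registered simply yields the underlying string:

// No logicalTypes option here, so the `link` annotation is unknown,
// is ignored, and the field decodes as a plain string.
const plainType = avro.Type.forSchema({
  type: 'record',
  name: 'Example',
  fields: [{name: 'ref', type: {type: 'string', logicalType: 'link'}}],
});
console.log(plainType.fromString('{"ref":"[cool text](http://example.com)"}'));
// {ref: '[cool text](http://example.com)'}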

joscha (Contributor, Author) commented Sep 18, 2024

> The Avro spec mandates that implementations ignore any unknown logical type (see here). In this case, the underlying type is used.

Okay, this is good. I missed this when reading the logical type spec; I was too eager to look at the specific logical types below. I tried your suggestion from above with:

class StringifiedJsonType extends Avro.types.LogicalType {
  _fromValue(val: any) {
    try {
      // try to parse as JSON
      return JSON.parse(val);
    } catch (e) {
      return val;
    }
  }
  _toValue(val: any) {
    return typeof val === "object" && val !== null && !Array.isArray(val)
      ? JSON.stringify(val)
      : val;
  }
  _resolve(type: any) {
    if (
      Avro.Type.isType(type, "string", "logical:string-or-stringified-json")
    ) {
      return this._fromValue;
    }
  }
}

and adding logicalType: "string-or-stringified-json" to my records.
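
For reference, I register it the same way as in your example above (sketch):

// Sketch: register the logical type when resolving the generated schema.
const type = Avro.Type.forSchema(schema, {
  logicalTypes: {
    "string-or-stringified-json": StringifiedJsonType,
  },
});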

I am not completely convinced this is better than transforming the value on the fly, as I know that my target system won't use the logical type anyway, but it does work for string-typed fields whose values are not strings.

For the enum case I am not yet sure; I still get the error that the enum symbol is not valid due to the hyphens. Will give it one more debugging pass.

joscha (Contributor, Author) commented Sep 19, 2024

> For the enum case I am not yet sure; I still get the error that the enum symbol is not valid due to the hyphens. Will give it one more debugging pass.

Okay, I was able to make this work, thank you @mtth.

It's actually pretty good:

class SanitizedEnumLogicalType extends Avro.types.LogicalType {
  static readonly NAME = "sanitized-enum";

  _fromValue(val: any) {
    return val.replace(/_/g, "-");
  }
  _toValue(val: any) {
    return val.replace(/-/g, "_");
  }
  _resolve(type: any) {
    if (
      Avro.Type.isType(
        type,
        "string",
        `logical:${SanitizedEnumLogicalType.NAME}`,
      )
    ) {
      return this._fromValue;
    }
  }
}

function isEnumType(
  schema: Avro.schema.AvroSchema | Avro.Type,
): schema is Avro.schema.EnumType {
  return (
    typeof schema === "object" && "type" in schema && schema.type === "enum"
  );
}

const typeHook: Avro.ForSchemaOptions["typeHook"] = (schema, opts) => {
  if (isEnumType(schema)) {
    (
      schema as Avro.schema.EnumType & Avro.schema.LogicalTypeExtension
    ).logicalType = SanitizedEnumLogicalType.NAME;
  }
};

const opts: Partial<Avro.ForSchemaOptions> = {
  typeHook,
  logicalTypes: {
    [SanitizedEnumLogicalType.NAME]: SanitizedEnumLogicalType,
  },
};
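
For reference, wiring this into the file encoder then looks like this (a sketch; enumSchema is a placeholder for one of the generated schemas):

// Sketch: resolve the generated schema with the hook and logical type
// registered, then write values that still use the hyphenated symbols.
const type = Avro.Type.forSchema(enumSchema, opts); // enumSchema: placeholder
const encoder = Avro.createFileEncoder("companies.avro", type);
encoder.write({ enrichmentSource: "third-party" }); // stored as "third_party"
encoder.end();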
