
Data transformation before serialization #471

Closed · joscha opened this issue Sep 11, 2024 · 8 comments

joscha (Contributor) commented Sep 11, 2024

I am generating both an Avro schema and a client from the same OpenAPI spec.
The spec contains a whole bunch of enum values that are hyphenated, see https://github.com/planet-a-ventures/affinity-node/blob/364a0b2ac2f86d04b128f8ebb0ac8560f36ab8dd/openapi/2024-09-05.json#L2892-L2910

I opened a pull request for the Avro schema generator, OpenAPITools/openapi-generator#19549, which transforms these enum values into valid ones that adhere to the Avro spec.

One thing I haven't solved ideally yet, however, is how to transform the data contents of each POJO that I want to write to a file via avsc.

I am currently doing something like this:

  const encoder = Avro.createFileEncoder('companies.avro', registry['model.Company']);
  for (const company of companies) {
    if (company.fields) {
      company.fields = company.fields.map((field: Field) => {
        // Replace every hyphen, not just the first one, so the values
        // become valid Avro enum symbols.
        if (field.enrichmentSource) {
          field.enrichmentSource = field.enrichmentSource.replace(/-/g, '_') as any;
        }
        if (field.type) {
          field.type = field.type.replace(/-/g, '_') as any;
        }
        return field;
      });
    }
    encoder.write(company);
  }
  encoder.end();

which, even if I factor out the transform and detect enum fields automatically, is not ideal, especially in a complex schema like the one linked above.
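
For illustration, the factored-out version would look roughly like this (a sketch only; sanitizeEnumSymbol and the hard-coded key list are placeholders):

// Sketch: a reusable transform; the helper name and the list of
// enum-valued keys are placeholders, not part of the real code.
const sanitizeEnumSymbol = (val: string): string => val.replace(/-/g, "_");

function sanitizeField(field: Field): Field {
  for (const key of ["enrichmentSource", "type"] as const) {
    const val = field[key];
    if (typeof val === "string") {
      (field as any)[key] = sanitizeEnumSymbol(val);
    }
  }
  return field;
}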

So I was wondering what your thoughts are on introducing an additional optional callback property, sanitizeEnumValue = (property: string) => string or similar, on the BlockEncoder class that is called for each serialized enum value and can transform the value passed to the encoder, similar to

parseHook: (schema: Schema) => Type

joscha (Contributor, Author) commented Sep 12, 2024

I'd actually like to upgrade this feature suggestion to a generic transformer callback, which receives a value just before it is written and can transform and return it.
Something that resembles this monkey patch but with a proper API:

  const origStringWrite = Avro.types.StringType.prototype._write;
  Avro.types.StringType.prototype._write = function (tap, val) {
    if (typeof val === 'object' && val !== null) {
      if (val.link && val.text) {
        // text value is actually a JSON object representing a link; transform to corresponding markdown
        val = `[${val.text.replaceAll(']', '\\]')}](${val.link})`;
      }
    }
    // preserve `this` so the original implementation still sees its type instance
    origStringWrite.call(this, tap, val);
  };

i.e.:

Avro.createFileEncoder('companies.avro', mySchema, {
  valueTransformer: (val, type) => val, // sample identity implementation
});

joscha changed the title from "Enum transformation" to "Data transformation before serialization" on Sep 12, 2024
mtth (Owner) commented Sep 13, 2024

You may be able to do what you want using logical types. For example, here is a basic link implementation:

const avro = require('avsc');

const linkPattern = /\[([^\]]+)\]\(([^)]+)\)/;

class LinkType extends avro.types.LogicalType {
  _fromValue(val) {
    const match = linkPattern.exec(val);
    if (!match) {
      throw new Error(`invalid link: ${val}`);
    }
    return {text: match[1], url: match[2]};
  }
  _toValue(arg) { return `[${arg.text}](${arg.url})`; }
}

You could use it as follows:

const schema = {
  type: 'record',
  name: 'Example',
  fields: [
    {name: 'ref', type: {type: 'string', logicalType: 'link'}},
  ],
}

const type = avro.Type.forSchema(schema, {logicalTypes: {link: LinkType}});
const val = {ref: {text: 'cool text', url: 'http://example.com'}};
const str = type.toString(val);
console.log(str); // {"ref":"[cool text](http://example.com)"}
console.log(type.fromString(str)); // val

Logical types can be applied to any Avro type, allowing you to use custom data types across your entire schema by adding corresponding logicalType annotations.

If you don't have control over the schema to add the annotations, you can also apply them dynamically using a type hook. This would also give you more flexibility, including the ability to decorate unions. See #329 (comment) for a related example.
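
For illustration, a rough sketch of such a hook (assumption: it tags every object-form string schema with the link logical type from above; real logic would be more selective):

const type = avro.Type.forSchema(schema, {
  logicalTypes: {link: LinkType},
  typeHook(s, opts) {
    // Mutate object-form string schemas to carry the logical type.
    // Returning undefined lets the default type be used for this schema.
    if (typeof s === 'object' && s.type === 'string' && !s.logicalType) {
      s.logicalType = 'link';
    }
  },
});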

joscha (Contributor, Author) commented Sep 13, 2024

> If you don't have control over the schema to add the annotations, you can also apply them dynamically using a type hook.

I have only little control over the schema: I am using https://openapi-generator.tech/docs/generators/avro-schema/ to generate the Avro schemas. Theoretically I can patch each generated type to add a logicalType, but that would make the automatic generation somewhat tedious, as I'd need to apply these patches every time I regenerate.

> This would also give you more flexibility, including the ability to decorate unions. See #329 (comment) for a related example.

I see. This seems like it requires a lot more implementation than a transformer function; I am basically extending the schema just for the sake of calling the same transformer, packaged in a logical type extension.
I can give this a try, however I am wondering: does the logicalType become part of the generated schema when using this in a file encoder? I intend to feed the resulting Avro file to Snowflake via COPY INTO and wouldn't have any control there over the decoder implementation.

mtth (Owner) commented Sep 16, 2024

> I can give this a try, however I am wondering: does the logicalType become part of the generated schema when using this in a file encoder? I intend to feed the resulting Avro file to Snowflake via COPY INTO and wouldn't have any control there over the decoder implementation.

You can choose whether to include the logical type or not in the encoder's schema. Snowflake may not support them, however.

joscha (Contributor, Author) commented Sep 16, 2024

> You can choose whether to include the logical type or not in the encoder's schema.

How do I do that, please? Sorry if I missed it somewhere, but I can find it neither in the types nor in the logical type docs.

> Snowflake may not support them, however.

Is there documentation anywhere on how this behaves with regard to portability? I don't believe any logical types are supported by Snowflake, so if I used them to encode but didn't include them in the exported schema, would Snowflake just use the underlying type to interpret the data?

mtth (Owner) commented Sep 18, 2024

> How do I do that, please? Sorry if I missed it somewhere, but I can find it neither in the types nor in the logical type docs.

Via the file encoder's schema argument. However, it doesn't look like it's currently possible to strip the logical types if they are used to encode the value. We could add this if needed (though it probably isn't useful, see just below).

> Is there documentation anywhere on how this behaves with regard to portability?

The Avro spec mandates that implementations ignore any unknown logical type (see here). In this case, the underlying type is used.
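
For example (a sketch reusing the Example record from above): decoding with a type that doesn't have link registered simply yields the underlying string:

// No logicalTypes option here, so the `link` annotation is unknown,
// is ignored, and the field decodes as a plain string.
const plainType = avro.Type.forSchema({
  type: 'record',
  name: 'Example',
  fields: [{name: 'ref', type: {type: 'string', logicalType: 'link'}}],
});
console.log(plainType.fromString('{"ref":"[cool text](http://example.com)"}'));
// {ref: '[cool text](http://example.com)'}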

joscha (Contributor, Author) commented Sep 18, 2024

> The Avro spec mandates that implementations ignore any unknown logical type (see here). In this case, the underlying type is used.

Okay, this is good. I missed this when reading the logical type spec; I was too eager to look at the specific logical types below. I tried your suggestion from above with:

class StringifiedJsonType extends Avro.types.LogicalType {
  _fromValue(val: any) {
    try {
      // try to parse as JSON
      return JSON.parse(val);
    } catch (e) {
      return val;
    }
  }
  _toValue(val: any) {
    return typeof val === "object" && val !== null && !Array.isArray(val)
      ? JSON.stringify(val)
      : val;
  }
  _resolve(type: any) {
    if (
      Avro.Type.isType(type, "string", "logical:string-or-stringified-json")
    ) {
      return this._fromValue;
    }
  }
}

and adding logicalType: "string-or-stringified-json" to my records.
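
For reference, I register it the same way as in your example above (sketch):

// Sketch: register the logical type when resolving the generated schema.
const type = Avro.Type.forSchema(schema, {
  logicalTypes: {
    "string-or-stringified-json": StringifiedJsonType,
  },
});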

I am not completely convinced this is better than transforming the value on the fly, as I know that my target system won't use the logical type anyway, but it does work for string-typed fields whose values are not strings.

For the enum case I am not yet sure; I still get the error that the enum symbol is not valid due to the hyphens. Will give it one more debugging pass.

joscha (Contributor, Author) commented Sep 19, 2024

> For the enum case I am not yet sure; I still get the error that the enum symbol is not valid due to the hyphens. Will give it one more debugging pass.

Okay, I was able to make this work, thank you @mtth.

It's actually pretty good:

class SanitizedEnumLogicalType extends Avro.types.LogicalType {
  static readonly NAME = "sanitized-enum";

  _fromValue(val: any) {
    return val.replace(/_/g, "-");
  }
  _toValue(val: any) {
    return val.replace(/-/g, "_");
  }
  _resolve(type: any) {
    if (
      Avro.Type.isType(
        type,
        "string",
        `logical:${SanitizedEnumLogicalType.NAME}`,
      )
    ) {
      return this._fromValue;
    }
  }
}

function isEnumType(
  schema: Avro.schema.AvroSchema | Avro.Type,
): schema is Avro.schema.EnumType {
  return (
    typeof schema === "object" && "type" in schema && schema.type === "enum"
  );
}

const typeHook: Avro.ForSchemaOptions["typeHook"] = (schema, opts) => {
  if (isEnumType(schema)) {
    (
      schema as Avro.schema.EnumType & Avro.schema.LogicalTypeExtension
    ).logicalType = SanitizedEnumLogicalType.NAME;
  }
};

const opts: Partial<Avro.ForSchemaOptions> = {
  typeHook,
  logicalTypes: {
    [SanitizedEnumLogicalType.NAME]: SanitizedEnumLogicalType,
  },
};
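
For reference, wiring this into the file encoder then looks like this (a sketch; enumSchema is a placeholder for one of the generated schemas):

// Sketch: resolve the generated schema with the hook and logical type
// registered, then write values that still use the hyphenated symbols.
const type = Avro.Type.forSchema(enumSchema, opts); // enumSchema: placeholder
const encoder = Avro.createFileEncoder("companies.avro", type);
encoder.write({ enrichmentSource: "third-party" }); // stored as "third_party"
encoder.end();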
