Data transformation before serialization #471
I'd actually like to upgrade this feature suggestion to a generic transformer callback, which receives a value just before it is written and can transform and return it.

```js
const origStringWrite = Avro.types.StringType.prototype._write;
Avro.types.StringType.prototype._write = function (tap, val) {
  if (typeof val === 'object' && val !== null) {
    if (val.link && val.text) {
      // The value is actually a JSON object representing a link;
      // transform it to the corresponding markdown.
      val = `[${val.text.replaceAll(']', '\\]')}](${val.link})`;
    }
  }
  origStringWrite.call(this, tap, val);
};
```

i.e.:

```js
Avro.createFileEncoder('companies.avro', mySchema, {
  valueTransformer: (val, type) => val, // sample identity implementation
});
```
You may be able to do what you want using logical types. For example, here is a basic link implementation:

```js
const linkPattern = /\[([^\]]+)\]\(([^)]+)\)/;

class LinkType extends avro.types.LogicalType {
  _fromValue(val) {
    const match = linkPattern.exec(val);
    return {text: match[1], url: match[2]};
  }
  _toValue(arg) {
    return `[${arg.text}](${arg.url})`;
  }
}
```

You could use it as follows:

```js
const schema = {
  type: 'record',
  name: 'Example',
  fields: [
    {name: 'ref', type: {type: 'string', logicalType: 'link'}},
  ],
};

const type = avro.Type.forSchema(schema, {logicalTypes: {link: LinkType}});
const val = {ref: {text: 'cool text', url: 'http://example.com'}};
const str = type.toString(val);
console.log(str); // {"ref":"[cool text](http://example.com)"}
console.log(type.fromString(str)); // val
```

Logical types can be applied to any Avro type, allowing you to use custom data types across your entire schema by adding the corresponding `logicalType` annotations. If you don't have control over the schema to add the annotations, you can also apply them dynamically using a type hook. This would also give you more flexibility, including the ability to decorate unions. See #329 (comment) for a related example.
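Since the `_toValue`/`_fromValue` pair above is plain string manipulation, its round-trip behavior can be checked without avsc at all. A minimal standalone sketch (the guard for non-link strings is an addition for illustration, not part of the original snippet):

```javascript
// Standalone sketch of LinkType's conversion logic, no avsc required.
const linkPattern = /\[([^\]]+)\]\(([^)]+)\)/;

// Equivalent of _fromValue: markdown link string -> {text, url}.
function fromValue(val) {
  const match = linkPattern.exec(val);
  if (!match) {
    // Guard added for illustration; the original would throw on match[1].
    throw new Error(`not a markdown link: ${val}`);
  }
  return {text: match[1], url: match[2]};
}

// Equivalent of _toValue: {text, url} -> markdown link string.
function toValue(arg) {
  return `[${arg.text}](${arg.url})`;
}

const link = {text: 'cool text', url: 'http://example.com'};
console.log(toValue(link)); // [cool text](http://example.com)
console.log(fromValue(toValue(link))); // { text: 'cool text', url: 'http://example.com' }
```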
I have little control over the schema; I am using https://openapi-generator.tech/docs/generators/avro-schema/ to generate the corresponding Avro schemas. Theoretically I could patch each generated type with a `logicalType` annotation.
I see. This seems like it requires a lot more implementation effort than a transformer function. I am basically extending the schema just for the sake of then calling the same transformer, packaged in a logical type extension.
You can choose whether or not to include the logical type in the encoder's schema. Snowflake may not support them, however.
How do I do that, please? Sorry if I missed it somewhere, but I can find it neither in the types nor in the logical type docs.
Is there documentation anywhere on how this behaves with regard to portability? I don't believe any logical types are supported by Snowflake, so if I used them to encode but didn't include them in the exported schema, would Snowflake just use the underlying type to interpret the data?
Via the file encoder's
The Avro spec mandates that implementations ignore any unknown logical type (see here). In this case, the underlying type is used.
Okay, this is good. I missed this when reading the logical type spec; I was too eager to look at the specific logical types below. I tried your suggestion from above with:

```ts
class StringifiedJsonType extends Avro.types.LogicalType {
  _fromValue(val: any) {
    try {
      // Try to parse as JSON.
      return JSON.parse(val);
    } catch (e) {
      return val;
    }
  }
  _toValue(val: any) {
    return typeof val === "object" && val !== null && !Array.isArray(val)
      ? JSON.stringify(val)
      : val;
  }
  _resolve(type: any) {
    if (
      Avro.Type.isType(type, "string", "logical:string-or-stringified-json")
    ) {
      return this._fromValue;
    }
  }
}
```

and adding

I am not completely convinced this is better than transforming the value on the fly, as I know that my target system won't use the logical type anyway, but it does work for the string-typed targets that are not strings. For the enum case I am still not yet sure; I still get the error that the enum symbol is not valid due to the
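The conversion logic itself can be exercised standalone. This is a sketch mirroring the `_toValue`/`_fromValue` bodies above, without the avsc plumbing:

```javascript
// Standalone sketch of StringifiedJsonType's conversion, no avsc required.
// toValue: plain objects are stringified before being written as an Avro string.
function toValue(val) {
  return typeof val === 'object' && val !== null && !Array.isArray(val)
    ? JSON.stringify(val)
    : val;
}

// fromValue: try to parse the string back into JSON, fall back to the raw string.
function fromValue(val) {
  try {
    return JSON.parse(val);
  } catch (e) {
    return val;
  }
}

console.log(toValue({a: 1}));         // {"a":1}
console.log(fromValue('{"a":1}'));    // { a: 1 }
console.log(fromValue('plain text')); // plain text
```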
Okay. I was able to make this work, thank you @mtth. It's actually pretty good:

```ts
class SanitizedEnumLogicalType extends Avro.types.LogicalType {
  static readonly NAME = "sanitized-enum";

  _fromValue(val: any) {
    return val.replace(/_/g, "-");
  }
  _toValue(val: any) {
    return val.replace(/-/g, "_");
  }
  _resolve(type: any) {
    if (
      Avro.Type.isType(
        type,
        "string",
        `logical:${SanitizedEnumLogicalType.NAME}`,
      )
    ) {
      return this._fromValue;
    }
  }
}

function isEnumType(
  schema: Avro.schema.AvroSchema | Avro.Type,
): schema is Avro.schema.EnumType {
  return (
    typeof schema === "object" && "type" in schema && schema.type === "enum"
  );
}

const typeHook: Avro.ForSchemaOptions["typeHook"] = (schema, opts) => {
  if (isEnumType(schema)) {
    (
      schema as Avro.schema.EnumType & Avro.schema.LogicalTypeExtension
    ).logicalType = SanitizedEnumLogicalType.NAME;
  }
};

const opts: Partial<Avro.ForSchemaOptions> = {
  typeHook,
  logicalTypes: {
    [SanitizedEnumLogicalType.NAME]: SanitizedEnumLogicalType,
  },
};
```
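The underscore/hyphen mapping at the core of `SanitizedEnumLogicalType` is simple enough to verify in isolation (a sketch; the sample symbol is made up):

```javascript
// Standalone sketch of the sanitized-enum mapping, no avsc required.
// Avro enum symbols must match [A-Za-z_][A-Za-z0-9_]*, so hyphenated
// values are mapped to underscored symbols on write and back on read.
const toSymbol = (val) => val.replace(/-/g, '_');   // mirrors _toValue
const fromSymbol = (val) => val.replace(/_/g, '-'); // mirrors _fromValue

console.log(toSymbol('private-equity'));             // private_equity
console.log(fromSymbol(toSymbol('private-equity'))); // private-equity
```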
I am generating both an Avro schema as well as a client from the same OpenAPI spec.
The spec contains a whole bunch of enum values that are hyphenated, see https://github.com/planet-a-ventures/affinity-node/blob/364a0b2ac2f86d04b128f8ebb0ac8560f36ab8dd/openapi/2024-09-05.json#L2892-L2910
I opened a pull request for the Avro schema generator here OpenAPITools/openapi-generator#19549, which transforms these enum values into valid ones that adhere to the Avro spec.
One thing that I haven't solved ideally yet, however, is how to transform the data contents for each POJO that I want to write to a file via avsc. I am currently doing something like this:
which, even if I factor out the transform and detect enum fields automatically, is not ideal, especially in a complex schema like the one linked above.
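For reference, factoring out such a transform usually means a recursive walk over the POJO. A hedged sketch (the `ENUM_FIELDS` allow-list and field names are assumptions for illustration, not derived from the real schema):

```javascript
// Hypothetical sketch: recursively sanitize enum-valued fields of a POJO
// before handing it to the encoder. ENUM_FIELDS is an assumed allow-list;
// a real implementation could derive the enum fields from the schema instead.
const ENUM_FIELDS = new Set(['status', 'category']); // assumption, not from the source

function sanitizeEnums(value, key) {
  if (Array.isArray(value)) {
    // Array elements keep the enclosing field's key.
    return value.map((v) => sanitizeEnums(v, key));
  }
  if (value !== null && typeof value === 'object') {
    // Recurse into each property, tracking its field name.
    return Object.fromEntries(
      Object.entries(value).map(([k, v]) => [k, sanitizeEnums(v, k)]),
    );
  }
  if (typeof value === 'string' && ENUM_FIELDS.has(key)) {
    return value.replace(/-/g, '_'); // make the symbol Avro-legal
  }
  return value;
}

console.log(sanitizeEnums({status: 'in-progress', note: 'a-b'}));
// { status: 'in_progress', note: 'a-b' }
```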
So I was wondering what your thoughts were on introducing an additional optional callback property `sanitizeEnumValue = (property: string) => string` or similar on the `BlockEncoder` class, which is called for each serialized enum value and can transform the value that is passed to the encoder (similar to avsc/types/index.d.ts, line 88 at 7cb76a6).