
Re-implement mapeo-schema with typescript types & a binary storage encoding #20

Closed · sethvincent opened this issue Oct 27, 2022 · 17 comments · Fixed by #22

@sethvincent (Contributor) commented Oct 27, 2022

The mapeo-schema package is responsible for defining the schemas of core data types used in mapeo.

To start we'll focus on the observation schema.

Ideally this package:

  • generates typescript types for each data schema
  • provides encode, decode, and validate functions for each data schema (the validate function may be less essential than the other two)
  • assigns "magic bytes" to each data schema that are prepended to each hypercore block and then used by mapeo-core to determine which data schema a block uses (see the sketch after this list)
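
A minimal sketch of what that surface could look like, assuming a two-byte header of schema type + schema version and using JSON as a stand-in for the real body encoding (which is discussed under the options below):

// Minimal sketch only: the byte values and the JSON body are placeholders, not the real design.
const MAGIC_OBSERVATION = Buffer.from([0x01, 0x04]) // [schema type, schema version]

function encode (doc) {
  const body = Buffer.from(JSON.stringify(doc)) // stand-in for the real binary encoding
  return Buffer.concat([MAGIC_OBSERVATION, body])
}

function decode (block) {
  const magic = block.subarray(0, 2)
  if (!magic.equals(MAGIC_OBSERVATION)) throw new Error('unknown schema')
  return JSON.parse(block.subarray(2).toString())
}

function validate (doc) {
  // stand-in: the real implementation would check the doc against its JSON schema
  return typeof doc === 'object' && doc !== null
}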

Implementation options:

  • encoding
    • the primary goal is to reduce storage size as much as possible; a secondary goal is fast runtime speed for encoding/decoding
    • based on this benchmark it looks like protocol buffers will be the best choice
    • other encoding formats we've considered:
  • generate types from protocol buffers
    • protobuf-es: this one looks a little easier to use than ts-proto
    • ts-proto may be useful for generating types if we go with protocol buffers

This CLI tool looks useful for working with protocol buffers: https://github.com/bufbuild/buf

@tomasciccola (Contributor) commented

So, I've started working on this feature on this branch.

I'm using buf to work with .proto files and protobuf-es as a typescript implementation.
I'm basically wrapping the Observer class generated by protobuf-es and adding encode, decode, and validate methods.

For now I'm modelling the schema according to this reference and the actual schema on master.

protobuf3 doesn't have required fields, so every field has an initial default value.
Additionally, ListValue and NullValue, which are well-known types provided by protobuf, aren't being detected by buf, so I'll see what's up with that.

@sethvincent (Contributor, Author) commented

Cool! A few ideas reading through that branch:

  • I think I neglected to mention that we're generally using javascript rather than writing typescript files directly, but still using typescript types defined either by packages we use or by jsdoc. For this repo that'll mean generating typescript types as part of the build step (see the JSDoc sketch after this list).
  • Ideally we end up with an Observation typescript type that reflects the protobuf schema that we're able to reference as arguments, return values, and similar within mapeo-core-next. Running tsc on the generated code in that branch gets us pretty close, but doesn't quite get us all the way there.
  • Maybe there's a better way to have nested objects for the metadata property and others instead of using nested messages like I had it in the benchmark repo. I'm learning protobuf as we work on this so I'm not sure.
  • Not having required properties might be ok. One of the unusual challenges is that in older Mapeo projects not all existing data will fit the schema exactly, so required fields might actually get in the way in that case.
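
To illustrate the first point, consuming the generated types from plain JavaScript could look something like this (the import path and the field names are assumptions for the example, not the actual generated output):

// Hypothetical consumer in mapeo-core-next: plain JS, types via JSDoc only.
// 'mapeo-schema' stands in for wherever the generated Observation type ends up.

/** @typedef {import('mapeo-schema').Observation} Observation */

/**
 * @param {Observation} observation
 * @returns {string}
 */
function describeObservation (observation) {
  return `observation at ${observation.lat}, ${observation.lon}`
}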

@tomasciccola (Contributor) commented

Thanks for the guidance!

  1. I'll start using js directly.
  2. I made a change where I tested using ts-proto instead of protobuf-es, since the latter generates a class instead of an interface/type. Every member of the class (field in the proto) has a default value according to the protobuf spec, so one could initialize the schema with an empty object and get a bunch of empty strings as field values, which I don't think is the right approach. ts-proto actually generates an interface, which means we can enforce type checking.
    (more on this in the last commit messages!)
  3. I'll see about a better way of nesting objects
  4. How does this track with point 2?

@tomasciccola (Contributor) commented Nov 2, 2022

Talking with @gmaclennan , there are some fields in the schema that can be derived from things like the core key and sequence number, or that should be put in the magic bytes.
Omitting these fields from the protobuf schema means we'll probably need both a protobuf schema and a JSON schema. Quoting Gregor:

Writing this down I think we need to maintain two schema definitions: a protobuf definition for the storage on the hypercore, and a schema for the data that will be presented to the client (buffers converted to hex-encoded strings, versions being strings of concatenated hypercore ID and block index, plus the derived data). I think JSON schema makes sense for the client schema. I suggest using typebox or zod for writing this client schema. Maybe typebox because you get both json Schema and typescript out of it. Then we will need a custom encode/decode function that uses the type derived from the protobuf as input, and the type defined by typebox as the output.

Comments on fields:

id -> 32-byte random number that is going to be stored raw and presented as a hex-encoded string in the client schema
deviceId -> derived from coreKey + identityKey (identity is our own invention, and I don't fully grasp yet how it is generated)
creatorId -> coreKey of the original record. We can trace links to obtain this, or store it on every write.
timestamp -> see the protobuf timestamp format
version -> this shouldn't be stored in the schema, since it is the seq number of the core
links -> to handle forked feeds (relates to kappa-core and our own strategy for multiwriter)
deleted -> not currently in the schema; it is a boolean
userId -> to support multiple users on the same device; that feature is not currently on the roadmap, so I'm on the fence about including it
type -> should be stored as magic bytes; relates to the type of record (i.e. Observation)
schemaVersion -> also stored as magic bytes
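
To make the split concrete, here's a rough illustration of the two shapes: what gets protobuf-encoded into the block versus what decode() hands to the client. The field names, the hex encoding, and the version separator are illustrative assumptions, not settled decisions:

import { randomBytes } from 'node:crypto'

// Context the schema itself doesn't store but that decode() has access to.
const coreKey = randomBytes(32) // public key of the hypercore the block lives in
const seq = 7                   // index of the block within that core

// Storage shape (what gets protobuf-encoded): raw buffers, nothing derivable.
const storageDoc = {
  id: randomBytes(32), // stored raw
  lat: -12.3,
  lon: -65.2,
  // no version, deviceId, type or schemaVersion (derived, or carried in the magic bytes)
}

// Client shape (JSON schema): strings plus the derived fields.
const clientDoc = {
  id: storageDoc.id.toString('hex'),
  version: `${coreKey.toString('hex')}/${seq}`, // hypercore ID + block index; the separator is an assumption
  lat: storageDoc.lat,
  lon: storageDoc.lon,
}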

@tomasciccola (Contributor) commented

A question that has come up:
given the following API for the module

const encode = (obj) => { // obj is a js object that matches the protobuf schema
  return buf // buf is the protobuf-encoded record with prepended magic bytes
}

const decode = (buf) => { // buf is a buffer returned by core.get(seq)
  return obj // obj is the js object that matches the json client schema
}

How am I able to derive the rest of the fields without getting access to the core?
Should I instead pass something like (core, seq) to the decode fn and let it get the record from the core, so we can get the extra fields like version and deviceId?
Also, should the id of the record, which is a random 32-byte buffer, be generated automatically in the encode function, or should it be the responsibility of the consumer of the schema to pass that field? The same goes for fields like timestamp (last modified). Should the Date be instantiated inside the encode fn?

@gmaclennan (Member) commented

  • I think this should be symmetrical (e.g. the type of obj for encode should be the json client schema)
  • For decode, I think we need to have a second arg opts: { key: Buffer /** public key of core where the data is stored */, index: number /** index (seq) of buf in source core */ }
    Not sure if opts.key should be a string or a Buffer? Thoughts? I think the source of this info will be multi-core-indexer, so maybe Buffer, so that encoding the Buffer to a string can be the responsibility of this module (rather than expecting the consumer of this API to encode it). A rough decode is sketched below.
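
Putting those two points together, decode might look roughly like this. The protobuf decoder is stubbed out, and the hex encoding and version separator are assumptions:

/**
 * @param {Buffer} buf block read from the core: magic bytes + protobuf body
 * @param {{ key: Buffer, index: number }} opts public key of the source core and the block's seq
 * @returns {object} document matching the JSON client schema
 */
function decode (buf, { key, index }) {
  const doc = decodeProtobufBody(buf.subarray(2)) // skip the magic bytes
  return {
    ...doc,
    id: doc.id.toString('hex'),                 // raw buffer converted to a hex string for the client
    version: `${key.toString('hex')}/${index}`, // derived, never stored in the block
  }
}

// Stand-in for the generated protobuf decoder; assumed to return raw-buffer fields.
function decodeProtobufBody (body) {
  return { id: body.subarray(0, 32), lat: 0, lon: 0 }
}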

@tomasciccola (Contributor) commented

A follow up on the work I've been doing. I'm leaving a list of questions that we can start to address sometime next week, mostly to see if what I've been doing makes sense! @gmaclennan @sethvincent

  • Is the json schema up to date? Should we revise it? (are there other record types? are all the record types used?)
    e.g. there's a mix of snake_case and camelCase. Protobuf expects snake_case for field names and camelCase for message types. ts-proto (which turns .proto into types) can be configured to produce either convention, so we can go either way.
  • Schema version for records, i.e. observation demands schema version 4. When should this be increased?
  • General review of the code. Does the API and code make sense? In the issue there was the expectation of exposing a validate fn; shouldn't we take care of that directly and validate on encoding/decoding? (This is what I'm doing now.)
    The only thing that may be weird is that in src/index.js there's a map (actually an object) that maps record type to the corresponding magic byte, jsonSchema and protobufSchema. Does this approach work? (See the sketch after this list.)
  • I'm reusing all the schemas and generating types from them using json-schema-to-typescript. Does this scan? I tried typebox to do it the other way around, but it seemed better to take advantage of the work already being done. (Also, the signatures produced by typebox didn't seem that useful.)
  • Getting in-editor documentation for the jsonSchema doesn't seem to work. If part of this work is to have a better DX, what are the expectations for that?
  • For now I'm using one byte for recordType and another for schemaVersion, mostly as a way of testing the encoding/decoding of buffers. @gregor mentioned calculating hash collisions with the birthday paradox and I don't know what to make of this. Probably the idea is that we have something like 20 possible types and schemaVersions before the probability of a collision rises to about 50%? Should these header bytes be generated randomly? How is this going to be documented?
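
For reference, the kind of map mentioned in the code-review bullet above might look like this. The byte values, schema names and paths are illustrative only, not what the branch actually contains:

// Illustrative only: maps each record type to its magic byte, schema version and schemas.
const recordTypes = {
  observation: {
    magicByte: 0x01,
    schemaVersion: 4,
    protobufSchema: 'Observation',  // name of the generated protobuf message
    jsonSchema: 'observation.json', // JSON schema used for validation
  },
}

// Look up the record type of a block from its first byte.
function recordTypeOf (block) {
  return Object.keys(recordTypes).find((name) => recordTypes[name].magicByte === block[0])
}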

@sethvincent (Contributor, Author) commented

@tomasciccola

Is the json schema up to date?

I expect we'll need to change the schema a couple of times over the coming weeks no matter what, so don't worry about that too much. I think the main thing that needs to be done in this branch is just the system for generating the types and code for encoding/decoding. We can perfect the schema in follow-up PRs.

observation demands schema version 4. When should this be increased?

🤷 maybe we bump that version before doing a new release? Not sure what's best.

generating types from them using json-schema-to-typescript. Does this scan?

Yeah, I think this makes sense. I also tried using typebox and found it really challenging to even make sense of it.

Getting in-editor documentation for the jsonSchema doesn't seem to work

At a glance it seems like this should work. I'll try using the protobufsTypescript branch in mapeo-core-next and let you know how it goes.

Should these header bytes be generated randomly? How is this going to be documented?

My impression was that the bytes that indicate the schema type/version are something we would decide ahead of time and wouldn't need to be randomized at all. But yeah the fixed byte length would need to be enough to support some maximum number of schemas. What that maximum number should be seems somewhat arbitrary. Maybe 100?
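
For what it's worth, here's a quick check of the birthday-paradox intuition from the question above. It only applies if the bytes were assigned randomly; with bytes decided ahead of time there are simply 256 values per header byte and no collisions by construction:

// Probability that n randomly chosen 1-byte identifiers contain at least one collision.
function collisionProbability (n, space = 256) {
  let pNoCollision = 1
  for (let i = 0; i < n; i++) pNoCollision *= (space - i) / space
  return 1 - pNoCollision
}

console.log(collisionProbability(20).toFixed(2)) // ≈ 0.53, roughly the "20 types for ~50%" estimate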

@sethvincent (Contributor, Author) commented Nov 22, 2022

I looked into getting types available in vs code and decided to go ahead and make a pr into the protobufsTypescript branch: https://github.com/digidem/mapeo-schema/pull/21

I also realized that this is a case where it might be helpful to publish to npm via a github action. That way we can do the build step there and publish the results instead of it needing to happen as a prepublish or prepublishOnly script. If we do that we could gitignore the docs/api/ and other generated folders.

@tomasciccola (Contributor) commented

Awesome, thanks for adding typedoc, and I agree about GitHub Actions. We can generate the types from the protobufs and schemas, and the documentation, directly in GitHub Actions. I'll look into it, add some tests and then create a PR.

@tomasciccola tomasciccola linked a pull request Nov 29, 2022 that will close this issue
@tomasciccola tomasciccola moved this from In Progress to Code Review in Mapeo - Sprint 2022 Nov 30, 2022
@tomasciccola tomasciccola moved this from Code Review to In Progress in Mapeo - Sprint 2022 Jan 19, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Mapeo - Sprint 2022 Feb 14, 2023
@github-actions (bot) commented

🎉 This issue has been resolved in version 3.0.0-next.0 🎉

The release is available on:

Your optic bot 📦🚀

@github-actions (bot) commented

🎉 This issue has been resolved in version 3.0.0-next.1 🎉

The release is available on:

Your optic bot 📦🚀

@github-actions (bot) commented

🎉 This issue has been resolved in version 3.0.0-next.2 🎉

The release is available on:

Your optic bot 📦🚀

@github-actions (bot) commented

🎉 This issue has been resolved in version 3.0.0-next.3 🎉

The release is available on:

Your optic bot 📦🚀

@github-actions (bot) commented

🎉 This issue has been resolved in version 3.0.0-next.4 🎉

The release is available on:

Your optic bot 📦🚀

@github-actions (bot) commented Aug 3, 2023

🎉 This issue has been resolved in version 3.0.0-next.5 🎉

The release is available on:

Your optic bot 📦🚀

@github-actions (bot) commented

🎉 This issue has been resolved in version 3.0.0-next.6 🎉

The release is available on:

Your optic bot 📦🚀
