failed to create schema from tag map: type : not a valid Type string #599

Open
janpfeifer opened this issue Dec 7, 2024 · 1 comment
Comments

@janpfeifer

Hi all,

I'm using the library for the first time (*) and I got a parquet file from a HuggingFace dataset, with the following schema:

$ parquet-tools -cmd schema -file 000_00000.parquet 
{
  "Tag": "name=Schema, repetitiontype=REQUIRED",
  "Fields": [
    {
      "Tag": "name=Text, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"
    },
    {
      "Tag": "name=Id, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"
    },
...

And I'm trying to create a reader for the following struct:

type FineWebEntry struct {
	Text string `parquet:"name=Text, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"`
	ID   string `parquet:"name=Id, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"`
	Dump string `parquet:"name=Dump, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"`
	URL  string `parquet:"name=Url, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"`
}

Then I get the error in the subject, when executing NewParquetReader(...):

failed to create schema from tag map: type : not a valid Type string

Any pointers to what could be wrong?

I'm trying to parse what it is saying, but the error is uninformative: what are the first "type" and the second "Type" in the error message about? Which field is it talking about? Where is the string coming from? Is it the type of the field in my struct, or is it from the parquet file itself?

Thanks!

@wojtess

wojtess commented Dec 7, 2024

I have the same problem with https://huggingface.co/datasets/allenai/ai2_arc
I tried many different things but I can't make it work.
Here is my latest code:

type Choice struct {
	Text  []string `parquet:"name=Text, type=BYTE_ARRAY, repetitiontype=OPTIONAL"`
	Label []string `parquet:"name=Label, type=BYTE_ARRAY, repetitiontype=OPTIONAL"`
}

type Dataset struct {
	Id        string   `parquet:"name=Id, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"`
	Question  string   `parquet:"name=Question, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"`
	Choices   []Choice `parquet:"name=Choices, repetitiontype=OPTIONAL"`
	AnswerKey string   `parquet:"name=AnswerKey, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=OPTIONAL"`
}
