[TypeError: Couldn't cast array of type] Can only load a subset of the dataset #5596

Closed
loubnabnl opened this issue Mar 1, 2023 · 5 comments

loubnabnl commented Mar 1, 2023

Describe the bug

I'm trying to load this dataset which consists of jsonl files and I get the following error:

casted_values = _c(array.values, feature[0])
  File "/opt/conda/lib/python3.7/site-packages/datasets/table.py", line 1839, in wrapper
    return func(array, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/datasets/table.py", line 2132, in cast_array_to_feature
    raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{feature}")
TypeError: Couldn't cast array of type
struct<type: string, action: string, datetime: timestamp[s], author: string, title: string, description: string, comment_id: int64, comment: string, labels: list<item: string>>
to
{'type': Value(dtype='string', id=None), 'action': Value(dtype='string', id=None), 'datetime': Value(dtype='timestamp[s]', id=None), 'author': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'description': Value(dtype='string', id=None), 'comment_id': Value(dtype='int64', id=None), 'comment': Value(dtype='string', id=None)}

But I can successfully load a subset of the dataset. For example, this works:

ds = load_dataset('bigcode-data/the-stack-gh-issues', split="train", data_files=[f"data/data-{x}.jsonl" for x in range(10)])

and ds.features returns:

{'repo': Value(dtype='string', id=None),
 'org': Value(dtype='string', id=None),
 'issue_id': Value(dtype='int64', id=None),
 'issue_number': Value(dtype='int64', id=None),
 'pull_request': {'user_login': Value(dtype='string', id=None),
  'repo': Value(dtype='string', id=None),
  'number': Value(dtype='int64', id=None)},
 'events': [{'type': Value(dtype='string', id=None),
   'action': Value(dtype='string', id=None),
   'datetime': Value(dtype='timestamp[s]', id=None),
   'author': Value(dtype='string', id=None),
   'title': Value(dtype='string', id=None),
   'description': Value(dtype='string', id=None),
   'comment_id': Value(dtype='int64', id=None),
   'comment': Value(dtype='string', id=None)}]}

So I'm not sure if there's an issue with just some of the files. Grateful if you have any suggestions to fix the issue.

Side note:
I saw this related issue and tried to write a loading script that declares events as a Sequence rather than a list here (the script was renamed). It worked with a subset locally but fails for the remote dataset, because it can't find https://huggingface.co/datasets/bigcode-data/the-stack-gh-issues/resolve/main/data.

Steps to reproduce the bug

from datasets import load_dataset

ds = load_dataset('bigcode-data/the-stack-gh-issues', split="train")

Expected behavior

Load the entire dataset successfully.

Environment info

  • datasets version: 2.10.1
  • Platform: Linux-4.19.0-23-cloud-amd64-x86_64-with-debian-10.13
  • Python version: 3.7.12
  • PyArrow version: 9.0.0
  • Pandas version: 1.3.4

lhoestq commented Mar 1, 2023

Apparently some JSON objects have a "labels" field. Since this field is not present in every object, you must specify all the field types in the README.md.

EDIT: actually specifying the feature types doesn’t solve the issue, it raises an error because “labels” is missing in the data
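
To find which files carry the extra field, one option is to scan the raw JSONL files and group them by the set of keys seen inside events. The following is a minimal sketch, not part of the datasets library. It assumes the files have been downloaded locally and that each line is a JSON object with an events list of objects, as in the ds.features output above. The helper name event_keys_by_file is hypothetical.

```python
import json
from collections import defaultdict


def event_keys_by_file(paths):
    """Group files by the set of keys found in their `events` objects.

    Returns a dict mapping frozenset(keys) -> list of file paths.
    Assumes each non-blank line is a JSON object with an optional
    `events` list of objects (an assumption based on the thread).
    """
    files_by_keys = defaultdict(list)
    for path in paths:
        keys = set()
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                record = json.loads(line)
                for event in record.get("events") or []:
                    keys.update(event.keys())
        files_by_keys[frozenset(keys)].append(str(path))
    return dict(files_by_keys)
```

Calling event_keys_by_file on the list of local file paths returns one group per distinct key set. If more than one group comes back, the files in the smaller groups are the likely source of the cast error.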

@loubnabnl

We've updated the dataset to remove the extra labels field from some files, closing this issue. Thanks!
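
For reference, a cleanup of this kind can be sketched as follows. This is not the script used for the dataset (the thread does not show it). It assumes locally stored JSONL files whose records have an events list of objects, and the function name drop_event_field is hypothetical.

```python
import json


def drop_event_field(src_path, dst_path, field="labels"):
    """Copy a JSONL file, removing `field` from every object in `events`.

    Returns the number of event objects that were modified.
    """
    removed = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            for event in record.get("events") or []:
                if field in event:
                    del event[field]
                    removed += 1
            # Write one JSON object per line, keeping non-ASCII text readable.
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
    return removed
```

Writing to a separate destination file keeps the original data untouched, so the result can be checked before replacing anything.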

@surya-narayanan

A similar error occurs with the Pile dataset (EleutherAI/the_pile). Loading it produces the following error:

TypeError: Couldn't cast array of type
struct<file: string, id: string>
to
{'id': Value(dtype='string', id=None)}

lhoestq commented Apr 19, 2023

jingenyan commented Dec 5, 2023

I have the same problem. How can I solve it?
raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{feature}")
TypeError: Couldn't cast array of type
list<item: string>
to
{'content': Value(dtype='string', id=None), 'role': Value(dtype='string', id=None)}
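
The thread does not show the data behind this last error. One plausible reading is that some rows store a list of plain strings where the expected features describe objects with content and role keys. Assuming the input is JSONL, a rough diagnostic is to count the distinct nested type shapes across records. The helpers signature and count_signatures below are hypothetical names for this sketch and are not part of datasets or PyArrow.

```python
import json
from collections import Counter


def signature(value):
    """Return a hashable description of the nested JSON types in `value`."""
    if isinstance(value, dict):
        # Keys are unique, so sorting the (key, signature) pairs is safe.
        return ("struct", tuple(sorted((k, signature(v)) for k, v in value.items())))
    if isinstance(value, list):
        # Collapse element signatures so that list length does not matter.
        return ("list", tuple(sorted({signature(v) for v in value}, key=repr)))
    if value is None:
        return "null"
    return type(value).__name__  # "str", "int", "float" or "bool"


def count_signatures(path):
    """Count how many records in a JSONL file share each type signature."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                counts[signature(json.loads(line))] += 1
    return counts
```

If count_signatures returns more than one entry for a file, the rare signatures point at the records whose structure differs from the rest.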
