Problem with reading dataset #4

Open
HerrKrishna opened this issue Jun 24, 2021 · 4 comments

@HerrKrishna

I tried to follow the training section of the README and got the following error:

Traceback (most recent call last):
  File "C:\Users\Christoph.Schneider\PycharmProjects\SentBertHelpDesk\try_reranker.py", line 22, in <module>
    train_dataset = GroupedTrainDataset(
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\reranker\data.py", line 31, in __init__
    self.nlp_dataset = datasets.load_dataset(
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\load.py", line 742, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 574, in download_and_prepare
    self._download_and_prepare(
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 652, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 1041, in _prepare_split
    for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\tqdm\std.py", line 1133, in __iter__
    for obj in iterable:
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\packaged_modules\json\json.py", line 96, in _generate_tables
    pa_table = pa_table.cast(self.config.schema)
  File "pyarrow\table.pxi", line 1409, in pyarrow.lib.Table.cast
ValueError: Target schema's field names are not matching the table's field names: ['qry', 'pos', 'neg'], ['neg', 'pos', 'qry']
train.zip

I've attached the training file that I use. It follows the format described in the README.

@luyug
Owner

luyug commented Jun 24, 2021

What version of datasets are you using?

@luyug
Owner

luyug commented Jun 24, 2021

Check huggingface/datasets#2548
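
For readers hitting the same field-order mismatch: the two lists in the ValueError contain the same names in a different order, so one workaround (the one the author reports trying later in this thread) is to rewrite every record with its keys in a fixed order. Below is a minimal sketch of that step, assuming the training file is JSON Lines with top-level keys qry, pos and neg. The function names and the default key order are illustrative assumptions, not part of Reranker or datasets, and this only addresses key order, not the structure of the nested values.

```python
import json


def reorder_record(record, key_order):
    """Return a copy of `record` with keys in `key_order`; extra keys follow."""
    missing = [k for k in key_order if k not in record]
    if missing:
        raise KeyError(f"record is missing keys: {missing}")
    ordered = {k: record[k] for k in key_order}
    # Keep any keys not named in key_order, after the ordered ones.
    for k, v in record.items():
        if k not in ordered:
            ordered[k] = v
    return ordered


def reorder_jsonl_lines(lines, key_order=("neg", "pos", "qry")):
    """Yield each non-empty JSON line re-serialized with keys in `key_order`.

    The default order is an assumption taken from the order the issue
    author reports using; adjust it to whatever your schema expects.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield json.dumps(reorder_record(json.loads(line), key_order))
```

To apply it to a file, open the input and output paths yourself and write each yielded string followed by a newline.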

@HerrKrishna
Author

Thank you for helping. I'm using datasets 1.8.0.
I've reordered neg, pos and qry. Now I get this error:

Traceback (most recent call last):
  File "C:\Users\Christoph.Schneider\PycharmProjects\SentBertHelpDesk\try_reranker.py", line 25, in <module>
    train_dataset = GroupedTrainDataset(
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\reranker\data.py", line 31, in __init__
    self.nlp_dataset = datasets.load_dataset(
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\load.py", line 742, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 574, in download_and_prepare
    self._download_and_prepare(
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 652, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 1041, in _prepare_split
    for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\tqdm\std.py", line 1133, in __iter__
    for obj in iterable:
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\packaged_modules\json\json.py", line 96, in _generate_tables
    pa_table = pa_table.cast(self.config.schema)
  File "pyarrow\table.pxi", line 1414, in pyarrow.lib.Table.cast
  File "pyarrow\table.pxi", line 277, in pyarrow.lib.ChunkedArray.cast
  File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\pyarrow\compute.py", line 281, in cast
    return call_function("cast", [arr], options)
  File "pyarrow\_compute.pyx", line 465, in pyarrow._compute.call_function
  File "pyarrow\_compute.pyx", line 294, in pyarrow._compute.Function.call
  File "pyarrow\error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from struct<qid: string, passage: list<item: int64>> to struct using function cast_struct

Can you help with that?

@luyug
Owner

luyug commented Jun 25, 2021

Please first try our tested environment setup (torch==1.6.0, transformers==4.2.0, datasets==1.1.3, plus pyarrow==2.0.0) to see where the regression comes from. Meanwhile, your data does not seem to be in the correct format.
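
A quick way to see how an environment differs from the tested setup listed above is to compare installed package versions against those pins. The sketch below uses only the standard library (importlib.metadata, Python 3.8+). The helper name find_mismatches and the choice to ignore local build suffixes such as "+cpu" are assumptions made for illustration; the version numbers are the ones quoted in the comment above.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions quoted in the maintainer's comment above.
TESTED = {
    "torch": "1.6.0",
    "transformers": "4.2.0",
    "datasets": "1.1.3",
    "pyarrow": "2.0.0",
}


def find_mismatches(expected, get_version=version):
    """Map package name -> (installed, expected) for every package that differs.

    `installed` is None when the package is not installed. A local build
    suffix (the part after "+") is ignored when comparing.
    """
    mismatches = {}
    for name, want in expected.items():
        try:
            have = get_version(name).split("+", 1)[0]
        except PackageNotFoundError:
            have = None
        if have != want:
            mismatches[name] = (have, want)
    return mismatches


if __name__ == "__main__":
    for name, (have, want) in find_mismatches(TESTED).items():
        print(f"{name}: installed {have}, tested with {want}")
```

An empty result means all four packages match the pinned versions.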
