Problem with reading dataset #4

HerrKrishna · 2021-06-24T12:13:44Z

I tried to follow the training section of the readme.
I get the following error:

Traceback (most recent call last):
File "C:\Users\Christoph.Schneider\PycharmProjects\SentBertHelpDesk\try_reranker.py", line 22, in <module>
train_dataset = GroupedTrainDataset(
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\reranker\data.py", line 31, in __init__
self.nlp_dataset = datasets.load_dataset(
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\load.py", line 742, in load_dataset
builder_instance.download_and_prepare(
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 574, in download_and_prepare
self._download_and_prepare(
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 652, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 1041, in _prepare_split
for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\tqdm\std.py", line 1133, in __iter__
for obj in iterable:
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\packaged_modules\json\json.py", line 96, in _generate_table
pa_table = pa_table.cast(self.config.schema)
File "pyarrow\table.pxi", line 1409, in pyarrow.lib.Table.cast
ValueError: Target schema's field names are not matching the table's field names: ['qry', 'pos', 'neg'], ['neg', 'pos', 'qry']
train.zip

I've attached the training file that I use. It follows the format described in the readme.

luyug · 2021-06-24T13:41:39Z

What version of datasets are you using?

luyug · 2021-06-24T18:23:02Z

Check huggingface/datasets#2548
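For readers hitting the same "field names are not matching" error, a minimal sketch of one workaround is to rewrite the training file so every record lists its top-level keys in the order the schema expects. This assumes the file is JSON Lines (one object per line); the helper names and the default `KEY_ORDER` are illustrative and not part of Reranker or datasets. The order shown is one of the two listed in the error message, so swap it if the target schema uses the other.

```python
import json

# Assumed order; take the real one from the target schema in the error message.
KEY_ORDER = ("neg", "pos", "qry")


def reorder_record(record, key_order=KEY_ORDER):
    """Return a copy of record with the listed keys first, in that order.

    Keys not named in key_order are kept and appended in their original order.
    """
    ordered = {key: record[key] for key in key_order if key in record}
    ordered.update((k, v) for k, v in record.items() if k not in ordered)
    return ordered


def reorder_lines(lines, key_order=KEY_ORDER):
    """Yield re-serialized JSON lines with a consistent key order, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield json.dumps(reorder_record(json.loads(line), key_order))


def reorder_file(src_path, dst_path, key_order=KEY_ORDER):
    """Write a copy of a JSON Lines file with reordered top-level keys."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for out in reorder_lines(src, key_order):
            dst.write(out + "\n")
```

Only top-level keys are reordered here; nested objects are left untouched.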

HerrKrishna · 2021-06-25T09:10:01Z

Thank you for helping. I'm using datasets 1.8.0.
I've reordered neg, pos and qry. Now I get this error:

Traceback (most recent call last):
File "C:\Users\Christoph.Schneider\PycharmProjects\SentBertHelpDesk\try_reranker.py", line 25, in <module>
train_dataset = GroupedTrainDataset(
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\reranker\data.py", line 31, in __init__
self.nlp_dataset = datasets.load_dataset(
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\load.py", line 742, in load_dataset
builder_instance.download_and_prepare(
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 574, in download_and_prepare
self._download_and_prepare(
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 652, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\builder.py", line 1041, in _prepare_split
for key, table in utils.tqdm(generator, unit=" tables", leave=False, disable=not_verbose):
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\tqdm\std.py", line 1133, in __iter__
for obj in iterable:
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\datasets\packaged_modules\json\json.py", line 96, in _generate_table
pa_table = pa_table.cast(self.config.schema)
File "pyarrow\table.pxi", line 1414, in pyarrow.lib.Table.cast
File "pyarrow\table.pxi", line 277, in pyarrow.lib.ChunkedArray.cast
File "C:\Users\Christoph.Schneider\Anaconda3\envs\SentBertHelpDesk\lib\site-packages\pyarrow\compute.py", line 281, in cast
return call_function("cast", [arr], options)
File "pyarrow\_compute.pyx", line 465, in pyarrow._compute.call_function
File "pyarrow\_compute.pyx", line 294, in pyarrow._compute.Function.call
File "pyarrow\error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow\error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from struct<qid: string, passage: list<item: int64>> to struct using function cast_struct

Can you help with that?

luyug · 2021-06-25T13:56:11Z

Please first try our tested environment setup (torch==1.6.0, transformers==4.2.0, datasets==1.1.3, plus pyarrow==2.0.0) to see where the regression comes from. Meanwhile, your data does not seem to be in the correct format.
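To see quickly how an environment differs from the tested setup quoted above, a small check along these lines may help. It is a sketch only: the pins are copied from the comment, while `installed_version` and `find_mismatches` are hypothetical helper names built on the standard library's `importlib.metadata` (Python 3.8+).

```python
from importlib import metadata

# Pins quoted in the comment above.
TESTED = {
    "torch": "1.6.0",
    "transformers": "4.2.0",
    "datasets": "1.1.3",
    "pyarrow": "2.0.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(expected, lookup=installed_version):
    """Map each package that differs from its pin to (pinned, installed)."""
    mismatches = {}
    for package, pinned in expected.items():
        found = lookup(package)
        if found != pinned:
            mismatches[package] = (pinned, found)
    return mismatches


if __name__ == "__main__":
    for package, (pinned, found) in find_mismatches(TESTED).items():
        print(f"{package}: tested with {pinned}, found {found or 'not installed'}")
```

The comparison is an exact string match, so a build with a local suffix (for example a CUDA-tagged torch wheel) is reported as a mismatch even when the base version agrees.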
