-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for categorical/dictionary types #6892
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome ! I'll add a change to this PR to fix this if you don't mind:
import pyarrow as pa
from datasets import Dataset
animals = pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"]).cast(
pa.dictionary(pa.int32(), pa.string())
)
t = pa.Table.from_arrays([animals], names=["animal"])
ds = Dataset(t) # has 'string' feature type
ds.add_item({"animal": "Pikachu"})
currently it raises ArrowTypeError: Unable to merge: Field animal has incompatible types: dictionary<values=string, indices=int32, ordered=0> vs string
because the ds.data
is still the dictionary type
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
@lhoestq Thanks a ton for helping this get merged! |
Arrow has a very useful dictionary/categorical type (https://arrow.apache.org/docs/python/generated/pyarrow.dictionary.html). This data type has significant speed, memory and disk benefits over pa.string() when there are only a few unique text strings in a column.
Unfortunately, huggingface datasets currently does not support this type. So huggingface datasets cannot natively read many parquet files that use this datatype .This PR adds support for Huggingface Datasets to read categorical/dictionary data.
Note: This PR functions by simply converting those dictionary/categorical types to strings. This means that huggingface datasets cannot take advantage of the compute benefits of categoricals, but it significantly simplifies logic. At this time, I do not think it makes sense to optimize categorical support within huggingface datasets and that we should only try to optimize later, if necessary.
Closes #5706