Describe the bug
The JSON reader doesn't seem to infer the dtype of a column that holds floating-point values; the column is inferred as object dtype instead. The only difference I see with respect to the other float columns, which are inferred correctly, is the size of the exponent in scientific notation, i.e., 34.3e+09 is inferred correctly but 34.3e+304 is not.
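As a quick sanity check on the two literals above, the following sketch uses only Python's standard library (not cuDF) to confirm that both parse as ordinary finite double-precision values. It says nothing about how cuDF's reader works internally; it only shows that the larger exponent is still well inside the float64 range.

```python
import json
import math
import sys

# The two literals quoted in the report, parsed as JSON numbers.
small = json.loads("34.3e+09")
large = json.loads("34.3e+304")

# Both come back as Python floats (IEEE 754 doubles).
print(type(small).__name__, type(large).__name__)

# The large value is finite and below the float64 maximum (about 1.797e+308).
print(math.isfinite(large), large < sys.float_info.max)
```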
Steps/Code to reproduce bug
Attachment (just rename the file to temp.parquet and load): temp.parquet.zip
>>> import pandas as pd
>>> import cudf
>>> pdf = pd.read_parquet('temp.parquet')
>>> pdf
              0               1              2               3      4    5              6
0  2.250622e+09  -6.494446e+307  -9.164551e+12   6735604693236 -119.0  1.0  -3.826755e+12
1  2.035766e+09             NaN  -2.369047e+12   7574969090271    NaN  NaN  -5.743424e+12
2  1.939400e+09  -1.145175e+308   7.534900e+12   3628042478826   28.0  NaN  -5.231790e+12
3  1.871836e+09   1.556428e+307   2.206686e+12    767786704906  -62.0  NaN            NaN
4  9.807541e+07             NaN   1.257964e+12  -7919323897153  -96.0  NaN            NaN
5  3.067252e+09  -1.678511e+308   4.968256e+12   5588109998021  -37.0  0.0            NaN
6  3.540629e+09  -1.599954e+307   1.356411e+12   8960674970810    NaN  1.0  -2.679076e+12
7           NaN             NaN            NaN  -8605861346425   82.0  NaN   8.262327e+12
8  1.528666e+09             NaN            NaN  -3517479000461   40.0  0.0            NaN
9           NaN   9.992153e+307  -5.273377e+12   7975279208357   13.0  1.0  -2.294980e+12
>>> pdf.dtypes
0    float64
1    float64
2    float64
3      int64
4    float64
5    float64
6    float64
dtype: object
>>> pdf.to_json('a', orient='records', lines=True)
>>> new_pdf = pd.read_json('a', orient='records', lines=True)
>>> new_pdf
              0               1              2               3      4    5              6
0  2.250622e+09  -6.494446e+307  -9.164551e+12   6735604693236 -119.0  1.0  -3.826755e+12
1  2.035766e+09             NaN  -2.369047e+12   7574969090271    NaN  NaN  -5.743424e+12
2  1.939400e+09  -1.145175e+308   7.534900e+12   3628042478826   28.0  NaN  -5.231790e+12
3  1.871836e+09   1.556428e+307   2.206686e+12    767786704906  -62.0  NaN            NaN
4  9.807541e+07             NaN   1.257964e+12  -7919323897153  -96.0  NaN            NaN
5  3.067252e+09  -1.678511e+308   4.968256e+12   5588109998021  -37.0  0.0            NaN
6  3.540629e+09  -1.599954e+307   1.356411e+12   8960674970810    NaN  1.0  -2.679076e+12
7           NaN             NaN            NaN  -8605861346425   82.0  NaN   8.262327e+12
8  1.528666e+09             NaN            NaN  -3517479000461   40.0  0.0            NaN
9           NaN   9.992153e+307  -5.273377e+12   7975279208357   13.0  1.0  -2.294980e+12
>>> new_pdf.dtypes
0    float64
1    float64
2    float64
3      int64
4    float64
5    float64
6    float64
dtype: object
>>> gdf = cudf.read_json('a', engine='cudf', orient='records', lines=True)
>>> gdf
                 0                  1                 2               3      4     5                 6
0  2.250621731e+09  -6.494445507e+307  -9.164551369e+12   6735604693236 -119.0   1.0  -3.826754862e+12
1   2.03576588e+09               <NA>    -2.3690472e+12   7574969090271   <NA>  <NA>  -5.743423727e+12
2  1.939399618e+09  -1.145175127e+308   7.534900233e+12   3628042478826   28.0  <NA>  -5.231789981e+12
3  1.871835886e+09   1.556428026e+307   2.206686314e+12    767786704906  -62.0  <NA>              <NA>
4       98075413.0               <NA>   1.257964087e+12  -7919323897153  -96.0  <NA>              <NA>
5  3.067251544e+09  -1.678511417e+308    4.96825619e+12   5588109998021  -37.0   0.0              <NA>
6  3.540629185e+09  -1.599954352e+307   1.356410998e+12   8960674970810   <NA>   1.0  -2.679076194e+12
7             <NA>               <NA>              <NA>  -8605861346425   82.0  <NA>    8.26232733e+12
8  1.528665908e+09               <NA>              <NA>  -3517479000461   40.0   0.0              <NA>
9             <NA>   9.992152856e+307  -5.273376883e+12   7975279208357   13.0   1.0  -2.294979586e+12
>>> gdf.dtypes
0    float64
1     object   # <---- This should be float64, see column 1 values & dtypes below
2    float64
3      int64
4    float64
5    float64
6    float64
dtype: object
>>> gdf['1']
0    -6.494445507e+307
1                 <NA>
2    -1.145175127e+308
3     1.556428026e+307
4                 <NA>
5    -1.678511417e+308
6    -1.599954352e+307
7                 <NA>
8                 <NA>
9     9.992152856e+307
Name: 1, dtype: object
>>> new_pdf[1]
0   -6.494446e+307
1              NaN
2   -1.145175e+308
3    1.556428e+307
4              NaN
5   -1.678511e+308
6   -1.599954e+307
7              NaN
8              NaN
9    9.992153e+307
Name: 1, dtype: float64
Expected behavior
Correctly infer the float dtype.
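To make the expectation concrete, here is a minimal sketch of an inference rule that classifies a column purely by the JSON number grammar, so the magnitude of the exponent never matters. This is an illustration only, not cuDF's implementation: the function name `infer_column_dtype` and the token-list input are hypothetical, nulls are represented as `None`, and int64 overflow is not checked.

```python
import re

# JSON number grammar: optional minus, integer part, optional fraction,
# optional exponent. An integer is the same grammar without fraction/exponent.
_INT = re.compile(r"-?(0|[1-9]\d*)\Z")
_FLOAT = re.compile(r"-?(0|[1-9]\d*)(\.\d+)?([eE][+-]?\d+)?\Z")


def infer_column_dtype(tokens):
    """Hypothetical helper: return 'int64', 'float64', or 'object'.

    `tokens` is a list of raw JSON value strings; None marks a null and
    is ignored when deciding the dtype.
    """
    seen_value = False
    seen_float = False
    for tok in tokens:
        if tok is None:
            continue
        seen_value = True
        if _INT.match(tok):
            continue
        if _FLOAT.match(tok):
            seen_float = True
            continue
        # Anything that is not a JSON number makes the column non-numeric.
        return "object"
    if not seen_value:
        # All-null column: nothing to infer from in this sketch.
        return "object"
    return "float64" if seen_float else "int64"


# Column 1 from the report, as raw tokens with nulls.
col1 = ["-6.494445507e+307", None, "-1.145175127e+308", "1.556428026e+307",
        None, "-1.678511417e+308", "-1.599954352e+307", None, None,
        "9.992152856e+307"]
print(infer_column_dtype(col1))  # float64
```

Under this rule, column 1 is classified the same way as the columns with small exponents, which is the behavior the report asks for.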
Environment overview (please complete the following information)
Environment location: [Bare-metal]
Method of cuDF install: [from source]
Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details
This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.
Additional context
Surfaced while running fuzz tests: #6001