Fix infer_schema_length: nil option #600

awerment · 2023-05-23T22:14:44Z

Currently, setting infer_schema_length: nil when creating a DataFrame from CSV has no effect, because there is a redundant fallback to the default value of 1000 rows if the max_rows option is not set (or set to nil).

Effectively, it is not possible to use all rows for schema inference without setting infer_schema_length (or max_rows) to a large enough "guess". I believe this does not match the documented behavior:

:max_rows - Maximum number of lines to read. (default: nil)
:infer_schema_length - Maximum number of rows read for schema inference. Setting this to nil will do a full table scan and will be slow (default: 1000).

This PR removes the fallback to the default.
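To make the before/after behavior concrete, here is a minimal sketch of the option resolution as described above. It is not the actual Explorer source: the module name SchemaLengthSketch and the functions resolve_before/1 and resolve_after/1 are hypothetical, and the only facts taken from this PR are the default of 1000, the fallback to max_rows, and the removal of the extra fallback to the default.

```elixir
# Hypothetical sketch, NOT the Explorer implementation.
# It only models how the effective schema-inference length is chosen.
defmodule SchemaLengthSketch do
  @default_infer_schema_length 1000

  # Behaviour described as the bug: an explicit nil falls back to
  # max_rows and then, redundantly, to the default of 1000.
  def resolve_before(opts) do
    max_rows = Keyword.get(opts, :max_rows)

    case Keyword.get(opts, :infer_schema_length, @default_infer_schema_length) do
      nil -> max_rows || @default_infer_schema_length
      length -> length
    end
  end

  # Behaviour after the change: an explicit nil falls back to max_rows only,
  # so nil is preserved (full scan) when max_rows is not set either.
  def resolve_after(opts) do
    max_rows = Keyword.get(opts, :max_rows)

    case Keyword.get(opts, :infer_schema_length, @default_infer_schema_length) do
      nil -> max_rows
      length -> length
    end
  end
end
```

Under these assumptions, resolve_before(infer_schema_length: nil) yields 1000, while resolve_after(infer_schema_length: nil) yields nil, which is the value that requests a full scan. Omitting the option entirely still yields the default of 1000 in both versions.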

I'm not sure whether the changes here are actually preferable to changing the docs, since scanning all rows will be slow for larger files (as stated), but I ran into an issue myself in the following situation:

The files' line count is not known upfront (guessing a large enough infer_schema_length / max_rows value felt wrong).
The first Nk lines contain 0, while later lines contain float values (an issue with the source of the files, but fixing that is not possible).

Remove additional fallback to the default value of 1000

awerment · 2023-05-23T22:17:26Z

test/explorer/data_frame/csv_test.exs

+      assert_raise RuntimeError, ~r/from_csv failed:/, fn ->
+        DF.from_csv!(csv, infer_schema_length: nil, max_rows: 10)
+      end


This second case raises a RuntimeError because there are no guarantees with max_rows, as documented here.

josevalim · 2023-05-24T10:19:18Z

💚 💙 💜 💛 ❤️

Fix infer_schema_length: nil option

be67a5a


josevalim merged commit 75207a8 into elixir-explorer:main May 24, 2023

awerment deleted the fix/nil_infer_schema_length branch May 24, 2023 13:32
