Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix infer_schema_length: nil option #600

Merged

Conversation

awerment
Copy link
Contributor

@awerment awerment commented May 23, 2023

Currently, setting infer_schema_length: nil when creating a DataFrame from CSV has no effect, because there is a redundant fallback to the default value of 1000 rows if the max_rows option is not set (or set to nil).

Effectively, it is not possible to use all rows for schema inference without setting infer_schema_length (or max_rows) to a large enough "guess". I believe this does not match the documented behavior:

:max_rows - Maximum number of lines to read. (default: nil)
:infer_schema_length Maximum number of rows read for schema inference. Setting this to nil will do a full table scan and will be slow (default: 1000).

This PR removes the fallback to the default.

I'm not sure if the changes here are actually preferable to changing the docs, as scanning all rows will be slow for larger files (as stated), but ran into an issue myself with the following situation:

  • the files' line count is not known upfront (guessing a large enough infer_schema_length / max_rows value felt wrong)
  • first Nk lines contain 0, later lines contain float values. (an issue with the source of the files, but fixing that is not possible)

Remove additional fallback to the default value of 1000
Comment on lines +289 to +291
assert_raise RuntimeError, ~r/from_csv failed:/, fn ->
DF.from_csv!(csv, infer_schema_length: nil, max_rows: 10)
end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why this second case causes a RuntimeError is that there are no guarantees with max_rows, as documented here.

@josevalim josevalim merged commit 75207a8 into elixir-explorer:main May 24, 2023
@josevalim
Copy link
Member

💚 💙 💜 💛 ❤️

@awerment awerment deleted the fix/nil_infer_schema_length branch May 24, 2023 13:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants