
[WISH] Choosing columns in LOAD-CSV #167

Open
endo64 opened this issue Sep 10, 2024 · 2 comments

endo64 commented Sep 10, 2024

I usually work with relatively big CSV files with many columns (over 1000), exported from other systems and then processed with Red.

Even though it is not difficult to add an intermediate step that deletes unwanted columns from a CSV file, it would be nice to have a refinement for choosing which columns get loaded. That way, loading big files would also be faster. For example:

columns: [1 5 27]
load-csv/columns data columns

; or, selecting by column title:

columns: ["id" "firstname" "lastname"]
load-csv/header/columns data columns

hiiamboris commented Sep 10, 2024

I don't see how it would be faster. You would still have to load the whole row and then remove the unused columns from it. That has the same computational complexity as loading all of the rows and then removing the unused columns from all of them; the only difference is the higher peak RAM usage of the latter. In fact, such per-row filtering would even be slower in /as-columns mode, as removing each whole column once would be faster than doing it for every row.
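To make the /as-columns point concrete, here is a minimal sketch (the literal map below just stands in for whatever load-csv/as-columns would return for such data): dropping whole columns from column-oriented data costs one operation per column, regardless of the row count:

data: #("id" [1 2 3] "firstname" ["a" "b" "c"] "junk" ["x" "y" "z"])
remove/key data "junk"		;) one removal per dropped column, however many rows

Per-row filtering would instead have to rebuild every single row.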

Besides, I don't think we should bake features into load that are orthogonal to it; our goal is to reduce complexity, not increase it. If we need this way of removing columns from data, let it be a separate function.

If this is such a common thing for you to do, why not simply wrap the load at the mezz level?

multi-pick: function [data indices] [map-each/only i indices [:data/:i]]	;) ideally
multi-pick: function [data indices] [										;) faster currently
	buf: clear []								;) reuse a static buffer between calls
	foreach i indices [append/only buf :data/:i]
	copy buf
]
load-only: function [source columns /header /as-columns] [
	data: load-csv/:header/:as-columns source
	either any [header as-columns] [
		unless string? columns/1 [				;) given indices, translate them into titles
			headers: keys-of data
			columns: map-each i columns [headers/:i]
		]
		remove-each [title column] data [not find columns title]
	][
		if string? columns/1 [					;) given titles, translate them into indices
			headers: keys-of data
			columns: map-each c columns [index? find headers c]
		]
		map-each/self/only row data [multi-pick row columns]
	]
]
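For illustration, calls would look like this (hypothetical file name; note that map-each itself is a mezz-level helper rather than core Red):

load-only %data.csv [1 5 27]								;) pick columns by index
load-only/header %data.csv ["id" "firstname" "lastname"]	;) pick columns by title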

Ultimately we want the codecs to be incremental, so you would also be able to filter data out as it appears; that would also eliminate the issue of handling the multiple output formats a decoder can produce or an encoder can accept.

This also ties into the idea of having a table! datatype, where row/column operations would be a given.


endo64 commented Sep 13, 2024

You are right. Somehow I thought load-csv actually loads the values (dates, integers, etc.); that's why I said it would be faster.
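A quick console check (assuming the default codec behavior) shows everything indeed comes back as plain strings:

>> load-csv "1,2024-09-10^/2,2024-09-11"
== [["1" "2024-09-10"] ["2" "2024-09-11"]]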
