
[WISH] Choosing columns in LOAD-CSV #167

Open
endo64 opened this issue Sep 10, 2024 · 2 comments

endo64 commented Sep 10, 2024

I usually work with relatively big CSV files with many columns (over 1000), exported from other systems and then processed with Red.

Even though it is not difficult to add an intermediate step that deletes unwanted columns from a CSV file, it would be nice to have a refinement for choosing which columns get loaded. That way, loading big files would also be faster. For example:

columns: [1 5 27]
load-csv/columns data columns

; or, selecting by column title:

columns: ["id" "firstname" "lastname"]
load-csv/header/columns data columns

hiiamboris commented Sep 10, 2024

I don't see how it would be faster. You would still have to load the whole row and then remove the unused columns from it. That has the same computational complexity as loading all of the rows and then removing the unused columns from all of them; the only difference is the higher peak RAM usage of the latter. In fact, such per-row filtering would even be slower in /as-columns mode, as removing each whole column once would be faster than doing it for every row.
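To make the /as-columns point concrete, here is a minimal sketch (the literal map below just stands in for whatever load-csv/as-columns would return for such data): dropping whole columns from column-oriented data costs one operation per column, regardless of the row count:

data: #("id" [1 2 3] "firstname" ["a" "b" "c"] "junk" ["x" "y" "z"])
remove/key data "junk"		;) one removal per dropped column, however many rows

Per-row filtering would instead have to rebuild every single row.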

Besides, I don't think we should bake features into load that are orthogonal to it; our goal is to reduce complexity, not increase it. If we need this way of removing columns from data, let it be a separate function.

If this is such a common thing for you to do, why not simply wrap the load at the mezz level?

multi-pick: function [data indices] [map-each/only i indices [:data/:i]]	;) ideally
multi-pick: function [data indices] [										;) faster currently
	buf: clear []								;) reuse a static buffer between calls
	foreach i indices [append/only buf :data/:i]
	copy buf
]
load-only: function [source columns /header /as-columns] [
	data: load-csv/:header/:as-columns source
	either any [header as-columns] [
		unless string? columns/1 [				;) given indices, translate them into titles
			headers: keys-of data
			columns: map-each i columns [headers/:i]
		]
		remove-each [title column] data [not find columns title]
	][
		if string? columns/1 [					;) given titles, translate them into indices
			headers: keys-of data
			columns: map-each c columns [index? find headers c]
		]
		map-each/self/only row data [multi-pick row columns]
	]
]
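For illustration, calls would look like this (hypothetical file name; note that map-each itself is a mezz-level helper rather than core Red):

load-only %data.csv [1 5 27]								;) pick columns by index
load-only/header %data.csv ["id" "firstname" "lastname"]	;) pick columns by title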

Ultimately we want the codecs to be incremental, so you would also be able to filter data out as it appears; that would also eliminate the issue of handling the multiple output formats a decoder can produce or an encoder can accept.

This also ties into the idea of having a table! datatype, where row/column operations would be a given.


endo64 commented Sep 13, 2024

You are right. Somehow I thought load-csv actually loads the values (dates, integers, etc.); that's why I said it would be faster.
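A quick console check (assuming the default codec behavior) shows everything indeed comes back as plain strings:

>> load-csv "1,2024-09-10^/2,2024-09-11"
== [["1" "2024-09-10"] ["2" "2024-09-11"]]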
