xread: an "extract+read" #1950
Implementation note

Whereas the standard

Another crucial property of the xread-frame is that the principal driver of the materialization process is the ordered read loop. Thus, the process can be visualized as follows:
The main ordered loop thus pulls the data from the "raw data source", puts it into an internal buffer, and then pushes that data through the "filter/transform" stage, finally storing it in the resulting frame.

Acquire data

In this step, the data is retrieved from some "raw data source", stored in the intermediate read buffer, and passed on to the main ordered loop. Crucially, the main loop is the driver here: it commands how much of the data to retrieve.
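To make the "main loop is the driver" idea concrete, here is a minimal sketch in plain Python (standard library only). It is not the proposed implementation: the names `acquire`, `ordered_read_loop`, `row_filter`, `max_rows` and `chunk_lines` are hypothetical, and a list of rows stands in for the resulting frame. The point is only that the source is read on demand, so the loop can stop before the input is exhausted.

```python
import csv
import io
import itertools


def acquire(source, chunk_lines):
    """Yield chunks of raw lines; nothing is read until the consumer asks."""
    while True:
        chunk = list(itertools.islice(source, chunk_lines))
        if not chunk:
            return
        yield chunk


def ordered_read_loop(source, row_filter, max_rows, chunk_lines=2):
    """Pull chunks in order, parse them, filter them, and store the result."""
    result = []  # stand-in for the resulting frame
    for raw in acquire(source, chunk_lines):
        buffer = list(csv.reader(raw))  # parse the chunk into a buffer
        for row in buffer:              # filter/transform stage
            if row_filter(row):
                result.append(row)
                if len(result) == max_rows:
                    return result       # stop pulling: the loop is in charge
    return result


source = io.StringIO("1,a\n2,b\n3,a\n4,a\n5,b\n6,a\n")
rows = ordered_read_loop(source, lambda r: r[1] == "a", max_rows=2)
remaining = source.read()  # lines that were never requested by the loop
```

Because the loop returns as soon as it has enough rows, the last chunk of the source is never pulled. Note that the row `4,a` was read as part of the second chunk but not used, which is the price of reading in chunks.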
Multiple data sources can be chained: for example, a file loaded from a URL may need to be decompressed, and then perhaps decoded from a legacy encoding into UTF-8. After acquiring each data chunk, we parse it as CSV into a table of values, which is stored internally as a buffer. This buffer must then pass through the filter/transform stage.

Filter/transform data

The buffer obtained from parsing a CSV chunk is instantiated into a virtual Frame. This Frame has virtual columns pointing directly to the underlying data buffer. At this point we:
Special notes about handling slice selectors
This is a proposal for implementing a new function `xread()`, which would be conceptually similar to `fread()`, but much lazier. In particular, `xread()` would parse only the first `n_sample_lines=100` lines of the file, detecting general information such as the parse settings, the number of columns, and their names and types. After that, `xread()` returns a "lazy frame" object, which can be used with the standard `[i,j]` notation:

- `[:n, :]` returns just the first `n` rows of the dataset (equivalent to the `max_nrows` parameter);
- `[1000:2000, :]` returns rows from 1000 to 2000. Generally, we should allow the user to request consecutive ranges on the same lazy frame. This will be equivalent to "chunked reading", which is a popular request;
- `[-100:, :]` returns the last 100 rows. For "file object" sources, this would require that the file is read in its entirety. For files on disk (or in memory), we could try to parse from the end of the file.
- `[:, :5]` returns only the first 5 columns.
- `[:, ["A", "B", "C"]]` returns the columns named `A`, `B` and `C`.
- `[:, f + {"wage_per_hour": f.salary/f.hours}]` returns all columns plus an additional column containing the `salary` divided by `hours`.
- `[f.status != '', :]` returns only the rows where the field `status` is not empty.
- `[dt.random.randu() < 0.2, :]` randomly samples 20% of the rows.

These are just some examples of what could be possible. Obviously, the `i` and `j` selectors can be combined into a single `[i, j]` selector too. It should even be possible to add join operations and groupbys into the mix (provided that we use single-pass hash-based grouping).
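As a rough illustration of how the row selectors could behave on a forward-only stream, here is a sketch in plain Python. It does not use or represent any datatable API: `LazyRows` is a hypothetical stand-in for the proposed lazy frame, it handles only contiguous row slices, and it reads just the header up front rather than sampling `n_sample_lines` lines.

```python
import csv
import io
import itertools
from collections import deque


class LazyRows:
    """Hypothetical, rows-only stand-in for the proposed lazy frame."""

    def __init__(self, lines):
        self._reader = csv.reader(lines)
        self.names = next(self._reader)  # only the header is read eagerly
        self._pos = 0                    # number of data rows consumed so far

    def __getitem__(self, rows):
        if not isinstance(rows, slice) or rows.step not in (None, 1):
            raise TypeError("this sketch supports contiguous row slices only")
        start, stop = rows.start, rows.stop
        if start is not None and start < 0 and stop is None:
            # [-k:] on a stream: read to the end, keeping only the last k rows.
            # Only rows that have not been consumed yet can be seen here.
            tail = deque(maxlen=-start)
            for row in self._reader:
                tail.append(row)
                self._pos += 1
            return list(tail)
        start = 0 if start is None else start
        if start < self._pos or (stop is not None and stop < start):
            raise ValueError("a forward-only stream cannot go backwards")
        skip = start - self._pos
        end = None if stop is None else skip + (stop - start)
        out = list(itertools.islice(self._reader, skip, end))
        self._pos = start + len(out)
        return out


data = io.StringIO("id,status\n0,ok\n1,\n2,ok\n3,ok\n4,\n5,ok\n")
frame = LazyRows(data)
first = frame[:2]    # like max_nrows=2
second = frame[2:4]  # the next consecutive range ("chunked reading")
last = frame[-1:]    # forces the rest of the stream to be read
```

Consecutive ranges work because the object remembers how many rows it has already consumed. A request for an earlier range raises an error in this sketch, whereas a real implementation over a file on disk could seek back instead.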