-
Here is a great reference to the Consortium for Python Data API, which also discusses some of these issues.
-
Thanks for this @xdssio! A few initial comments:
We actually run directly on Arrow instead of relying on Polars. Our current codebase does use certain functionality from PyArrow/Polars/Pandas, but we are moving to a Rust implementation, which will allow us to write our own kernels and leverage Arrow2 where possible. At some point, PyArrow/Polars/Pandas will all be optional dependencies.
You're absolutely correct! Some things are trivial in a non-distributed world but, when dealing with a distributed dataframe, may involve expensive shuffles and be prohibitively expensive. Additionally, being lazy can also lead to some unexpected behavior, for example in the simple APIs discussed below.

Discussions

Here are my comments for the DataFrame APIs. I'll also start converting some of these into issues/separate discussions that we can chip away at!

head and tail

Interestingly, in the distributed world, Spark actually has different behavior than Pandas/Polars for these methods: instead of returning a new DataFrame, it blocks on computation and then returns a list of Rows. Additionally, PostgreSQL LIMITs do not support "tailing from the back"; instead, an offset can be supplied when calling LIMIT.

Proposed Features:
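To make the behavioral difference above concrete (this is just an illustration, not one of the proposed features), compare eager Pandas with PySpark; the Spark part is shown as comments since it needs a running session:

```python
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, 3, 4]})
small = pdf.head(2)   # Pandas: returns a new 2-row DataFrame

# PySpark: head() blocks on computation and returns plain Python objects,
# a list of Row, rather than a new (lazy) DataFrame:
#
#   sdf = spark.createDataFrame(pdf)
#   rows = sdf.head(2)   # -> [Row(a=1), Row(a=2)]
#
# PostgreSQL, similarly, has no "tail": it only supports LIMIT with an
# OFFSET, e.g. `SELECT * FROM t ORDER BY a LIMIT 2 OFFSET 2`.
```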
len

Using Polars and Spark as references, they do not support calling `len()` on a lazy/distributed DataFrame. Understandably, this may cause some confusion, as users are used to calling `len()` on Pandas DataFrames.

Proposed Features:
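For illustration, here is the eager vs. lazy behavior in Polars (one of the references above); the lazy frame has to be materialized before a row count is available:

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})
print(len(df))                 # eager DataFrame: knows its length

ldf = df.lazy()
# len(ldf)                     # TypeError: a LazyFrame has no length
print(ldf.collect().height)    # lazy: materialize first, then count rows
```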
collect
NOTE: Unlike a PySpark `collect`, which blocks and returns results to the driver, Daft's `collect` materializes the lazy DataFrame itself.

In an eager dataframe such as Pandas/eager-Polars, every operation on the DataFrame is collected, since every step is eagerly executed and materialized. Daft's lazy DataFrame only materializes when `.collect()` is called:

```python
# Eager DataFrame: blocks and materializes on every line.
# Lazy DataFrame: only blocks and materializes on `.collect()`, so that
# `print(df.count_rows())` and `df.write_csv(...)` can avoid recomputation.
df = df.with_column(...)
df = df.where(...)
# For a lazy dataframe, call df.collect() here.
print("Writing number of rows:", df.count_rows())
df.write_csv(...)
```

Perhaps we should rename `.collect()`.

Proposed Features:
getitem
We actually don't yet have a Series abstraction, so there is no way to access a Series.

setitem

We have discussed this before internally within the team - this is actually surprisingly difficult to implement and get right. Happy to talk through it more if this is a hotly requested feature.

select columns
Daft does not have indices like Pandas, so columns are selected purely by name.

select rows

We like that dataframe indexing is reserved for columns. Selection of rows can be achieved with filter expressions such as `df.where(...)`.

unique, num_unique, value_counts, countna

These make sense, and should make it into the API. We currently only have a subset of these.

Proposed Features:
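For concreteness, a rough sketch of how the selection semantics described above might look (`select` is an assumed method name here; `where` and `daft.col` appear elsewhere in this thread, and the exact API may differ by Daft version):

```python
import daft

df = daft.from_pydict({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Column selection is purely by name; there is no positional or label index.
cols = df.select(daft.col("b"))        # assumed method name

# Row "selection" is expressed as a filter predicate rather than by index.
rows = df.where(daft.col("a") < 3)
```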
-
Column Operations

We currently have a very limited set of operations, but would like to start adding a more comprehensive set of operations for:
Image operations are coming soon - we have plans to write Arrow extension types for representing images, and can start to define kernels on these types. For now, UDFs are the main way of interacting with images.

I/O

We have a basic set of I/O operations, but are open to supporting more on a use-case-driven basis.

Data manipulations
-
Rolling/Window

I will add a Discussion post on rolling/window operations, but we don't currently have any fleshed-out specs around this. Would love to start discussing the use-cases and thinking about potential implementations though: #484

Aggregations

I added a discussion thread on aggregations: #485. We can also discuss custom aggregations there; we'd like to understand the use-cases before designing an API around that.

Combining

Concat: #486

ML

Repartition is actually different from a split. We will probably want a dedicated split function for train/val sets that produces two separate dataframes based on some ratio of a split: #487
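Until a dedicated split API exists, one way a ratio-based train/val split could be emulated is by attaching a random number per row and filtering twice. This is only a sketch under assumptions: it uses a hypothetical `rand_per_row` UDF and a recent Daft UDF API (`@daft.udf`, `Series.to_pylist`), which may differ from the version under discussion:

```python
import random
import daft

@daft.udf(return_dtype=daft.DataType.float64())
def rand_per_row(xs):
    # Hypothetical helper: one uniform random number per input row.
    return [random.random() for _ in xs.to_pylist()]

df = daft.from_pydict({"a": list(range(10))})
df = df.with_column("_split", rand_per_row(daft.col("a")))

# Materialize first: on a lazy frame, `_split` could otherwise be recomputed
# (with fresh random values) independently for each branch below.
df = df.collect()

train = df.where(daft.col("_split") < 0.8)   # ~80% of rows
val = df.where(daft.col("_split") >= 0.8)    # ~20% of rows
```

A native `split(ratio)` could presumably do this in a single pass and guarantee that the two resulting frames are disjoint and exhaustive.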
-
It is evident at this point that pandas has led the way in how people interact with DataFrames and expect them to behave, and it can serve as the baseline and starting point for an API for a new age of more scalable, feature-rich, and machine-learning-oriented DataFrames.
Since PySpark set out to build a DataFrame for distributed work, we can learn from its design and adoption to see another approach. I will reference Koalas as an attempt to back-track some of those decisions; currently, it does not integrate as expected with PySpark itself, and the conversions between them are cumbersome.
The last relevant reference I'll mention is Polars, which is the backbone engine for Daft but shies away from distributed execution, and which currently addresses machine-learning needs (and, specifically, machine-learning pipelines) only through extensions and Ray, the distributed backbone for Daft.
Whenever deciding on an API, we need to remember the development-time cost of a distributed implementation and the goal of keeping the Daft APIs concise but expressive.
Another essential element to consider is the use of lazy evaluation in implementations and the use of UDFs, which can cover many transformation needs so we can avoid implementing them natively.
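For example, a transformation that has no built-in kernel yet can usually be expressed as a UDF rather than waiting for a native implementation. A minimal sketch, assuming a recent Daft UDF API (`@daft.udf` with `return_dtype` and `Series.to_pylist`), which may not match the version discussed here:

```python
import daft

@daft.udf(return_dtype=daft.DataType.string())
def normalize_text(texts):
    # Hypothetical UDF: strip and lowercase a column of strings.
    return [t.strip().lower() for t in texts.to_pylist()]

df = daft.from_pydict({"title": ["  Hello ", "WORLD  "]})
df = df.with_column("title_clean", normalize_text(daft.col("title")))
print(df.collect())
```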
Reference
Let's review some standard APIs and open the discussion to new and missing ones.
Exploration
DataFrame
How should a column be referenced: `daft.col('a')` vs `df['a']` or `df.a` (which can be messed up by problematic column names but gives you auto-completion)? Do we want to have this distinction between a series and an expression like Polars, or always use lazy expressions like Vaex?
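To ground the question, here is how the two styles coexist in Polars, which keeps both a concrete Series (`df['a']`) and a lazy expression (`pl.col('a')`):

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "my col!": [4, 5, 6]})

s = df["a"]               # a concrete Series, evaluated immediately
expr = pl.col("a") * 2    # a lazy expression, only evaluated inside a query
out = df.select(expr)

# Attribute access (df.a, as in Pandas) auto-completes nicely but cannot
# express awkward names like "my col!"; bracket/col() access always can.
```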
Column (pandas collection)
IO
Data Manipulation
- [ ] explode
Time-series
GroupBy aggregations
Combining
ML