-
Here is a great reference to the Consortium for Python Data API, which also discusses some of these issues.
-
Thanks for this @xdssio! A few initial comments:
We actually run directly on Arrow instead of relying on Polars. Our current codebase does use certain functionality from PyArrow/Polars/Pandas, but we are moving to a Rust implementation, which will allow us to write our own kernels and leverage Arrow2 where possible. At some point, PyArrow/Polars/Pandas will all be optional dependencies.
You're absolutely correct! Some things are trivial in a non-distributed world but, when dealing with a distributed dataframe, may involve expensive shuffles and be prohibitively expensive. Additionally, being lazy can also lead to some unexpected behavior, for example in the simple APIs discussed below.

Discussions

Here are my comments for the DataFrame APIs. I'll also start converting some of these into issues/separate discussions that we can chip away at!

head and tail

Interestingly, in the distributed world, Spark actually has different behavior than Pandas/Polars for these methods: instead of returning a new DataFrame, it blocks on computation and then returns a list of Rows. Additionally, PostgreSQL LIMITs do not support "tailing from the back"; instead, an offset can be supplied when calling LIMIT.

Proposed Features:
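To make the behavioral difference above concrete (this is just an illustration, not one of the proposed features), compare eager Pandas with PySpark; the Spark part is shown as comments since it needs a running session:

```python
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2, 3, 4]})
small = pdf.head(2)   # Pandas: returns a new 2-row DataFrame

# PySpark: head() blocks on computation and returns plain Python objects,
# a list of Row, rather than a new (lazy) DataFrame:
#
#   sdf = spark.createDataFrame(pdf)
#   rows = sdf.head(2)   # -> [Row(a=1), Row(a=2)]
#
# PostgreSQL, similarly, has no "tail": it only supports LIMIT with an
# OFFSET, e.g. `SELECT * FROM t ORDER BY a LIMIT 2 OFFSET 2`.
```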
len

Using Polars and Spark as references, they do not support calling `len()` on a lazy/distributed DataFrame. Understandably, this may cause some confusion, as users are used to calling `len()` on Pandas DataFrames.

Proposed Features:
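For illustration, here is the eager vs. lazy behavior in Polars (one of the references above); the lazy frame has to be materialized before a row count is available:

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3]})
print(len(df))                 # eager DataFrame: knows its length

ldf = df.lazy()
# len(ldf)                     # TypeError: a LazyFrame has no length
print(ldf.collect().height)    # lazy: materialize first, then count rows
```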
collect
NOTE: Unlike a PySpark `collect`, which blocks and returns results to the driver, Daft's `collect` materializes the lazy DataFrame itself.

In an eager dataframe such as Pandas/eager-Polars, every operation on the DataFrame is collected, since every step is eagerly executed and materialized. Daft's lazy DataFrame only materializes when `.collect()` is called:

```python
# Eager DataFrame: blocks and materializes on every line.
# Lazy DataFrame: only blocks and materializes on `.collect()`, so that
# `print(df.count_rows())` and `df.write_csv(...)` can avoid recomputation.
df = df.with_column(...)
df = df.where(...)
# For a lazy dataframe, call df.collect() here.
print("Writing number of rows:", df.count_rows())
df.write_csv(...)
```

Perhaps we should rename `.collect()`.

Proposed Features:
getitem
We actually don't yet have a Series abstraction, so there is no way to access a Series.

setitem

We have discussed this before internally within the team - this is actually surprisingly difficult to implement and get right. Happy to talk through it more if this is a hotly requested feature.

select columns
Daft does not have indices like Pandas, so columns are selected purely by name.

select rows

We like that dataframe indexing is reserved for columns. Selection of rows can be achieved with filter expressions such as `df.where(...)`.

unique, num_unique, value_counts, countna

These make sense, and should make it into the API. We currently only have a subset of these.

Proposed Features:
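For concreteness, a rough sketch of how the selection semantics described above might look (`select` is an assumed method name here; `where` and `daft.col` appear elsewhere in this thread, and the exact API may differ by Daft version):

```python
import daft

df = daft.from_pydict({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Column selection is purely by name; there is no positional or label index.
cols = df.select(daft.col("b"))        # assumed method name

# Row "selection" is expressed as a filter predicate rather than by index.
rows = df.where(daft.col("a") < 3)
```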
-
Column Operations

We currently have a very limited set of operations, but would like to start adding a more comprehensive set of operations for:
Image operations are coming soon - we have plans to write Arrow extension types for representing images, and can start to define kernels on these types. For now, UDFs are the main way of interacting with images.

I/O

We have a basic set of I/O operations, but are open to supporting more on a use-case-driven basis.

Data manipulations
-
Rolling/Window

I will add a Discussion post on rolling/window operations, but we don't currently have any fleshed-out specs around this. Would love to start discussing the use-cases and thinking about potential implementations though: #484

Aggregations

I added a discussion thread on aggregations: #485. We can also discuss custom aggregations there; we'd like to understand the use-cases before designing an API around that.

Combining

Concat: #486

ML

Repartition is actually different from a split. We will probably want a dedicated split function for train/val sets that produces two separate dataframes based on some ratio of a split: #487
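Until a dedicated split API exists, one way a ratio-based train/val split could be emulated is by attaching a random number per row and filtering twice. This is only a sketch under assumptions: it uses a hypothetical `rand_per_row` UDF and a recent Daft UDF API (`@daft.udf`, `Series.to_pylist`), which may differ from the version under discussion:

```python
import random
import daft

@daft.udf(return_dtype=daft.DataType.float64())
def rand_per_row(xs):
    # Hypothetical helper: one uniform random number per input row.
    return [random.random() for _ in xs.to_pylist()]

df = daft.from_pydict({"a": list(range(10))})
df = df.with_column("_split", rand_per_row(daft.col("a")))

# Materialize first: on a lazy frame, `_split` could otherwise be recomputed
# (with fresh random values) independently for each branch below.
df = df.collect()

train = df.where(daft.col("_split") < 0.8)   # ~80% of rows
val = df.where(daft.col("_split") >= 0.8)    # ~20% of rows
```

A native `split(ratio)` could presumably do this in a single pass and guarantee that the two resulting frames are disjoint and exhaustive.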
-
It is evident at this point that pandas has led the way in how people interact with DataFrames and expect them to behave, and it can serve as the baseline and starting point for an API for a new age of more scalable, feature-rich, and machine-learning-oriented DataFrames.
Since PySpark set out to build a DataFrame for distributed work, we can learn from its design and adoption to see another approach. I will reference Koalas as an attempt to back-track some of those decisions; currently, it does not integrate as expected with PySpark itself, and the conversions between them are cumbersome.
The last relevant reference I'll mention is Polars, which is the backbone engine for Daft but shies away from distributed execution, and which currently addresses machine-learning needs (and, specifically, machine-learning pipelines) only through extensions and Ray, the distributed backbone for Daft.
Whenever deciding on an API, we need to remember the development-time cost of a distributed implementation and the goal of keeping the Daft APIs concise but expressive.
Another essential element to consider is the use of lazy evaluation in implementations and the use of UDFs, which can cover many transformation needs so we can avoid implementing them natively.
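For example, a transformation that has no built-in kernel yet can usually be expressed as a UDF rather than waiting for a native implementation. A minimal sketch, assuming a recent Daft UDF API (`@daft.udf` with `return_dtype` and `Series.to_pylist`), which may not match the version discussed here:

```python
import daft

@daft.udf(return_dtype=daft.DataType.string())
def normalize_text(texts):
    # Hypothetical UDF: strip and lowercase a column of strings.
    return [t.strip().lower() for t in texts.to_pylist()]

df = daft.from_pydict({"title": ["  Hello ", "WORLD  "]})
df = df.with_column("title_clean", normalize_text(daft.col("title")))
print(df.collect())
```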
Reference
Let's review some standard APIs and open the discussion to new and missing ones.
Exploration
DataFrame
How should a column be referenced: `daft.col('a')` vs `df['a']` or `df.a` (which can be messed up by problematic column names but gives you auto-completion)? Do we want to have this distinction between a series and an expression like Polars, or always use lazy expressions like Vaex?
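To ground the question, here is how the two styles coexist in Polars, which keeps both a concrete Series (`df['a']`) and a lazy expression (`pl.col('a')`):

```python
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "my col!": [4, 5, 6]})

s = df["a"]               # a concrete Series, evaluated immediately
expr = pl.col("a") * 2    # a lazy expression, only evaluated inside a query
out = df.select(expr)

# Attribute access (df.a, as in Pandas) auto-completes nicely but cannot
# express awkward names like "my col!"; bracket/col() access always can.
```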
Column (pandas collection)
IO
Data Manipulation
- [ ] explode
Time-series
GroupBy aggregations
Combining
ML