From 32174c6b6305ff87f266fe6dcc9e2a020513add0 Mon Sep 17 00:00:00 2001
From: dataframe-api-bot
DataFrame.is_null()
DataFrame.iter_columns()
+ DataFrame.join()
DataFrame.max()
@@ -695,6 +697,8 @@
DataFrame.is_null()
DataFrame.iter_columns()
+ DataFrame.join()
DataFrame.max()
@@ -1331,6 +1335,11 @@
but note that the Standard makes no guarantees about them.
Return iterator over columns.
+Join with other dataframe.
@@ -1393,22 +1402,19 @@ at most once per dataframe, and as late as possible in the pipeline. For example, do this
df: DataFrame
-features = []
result = df.std() > 0
result = result.persist()
-for column_name in df.column_names:
- if result.col(column_name).get_value(0):
- features.append(column_name)
+features = [col.name for col in df.iter_columns() if col.get_value(0)]
instead of this:
df: DataFrame
-features = []
-for column_name in df.column_names:
- # Do NOT call `persist` on a `DataFrame` within a for-loop!
- # This may re-trigger the same computation multiple times
- if df.persist().col(column_name).std() > 0:
- features.append(column_name)
+result = df.std() > 0
+features = [
+ # Do NOT do this! This will trigger execution of the entire
+ # pipeline for each element in the for-loop!
+ col.name for col in df.iter_columns() if col.get_value(0).persist()
+]
DataFrame.group_by()
DataFrame.is_nan()
DataFrame.is_null()
DataFrame.iter_columns()
DataFrame.join()
DataFrame.max()
DataFrame.mean()
df: DataFrame
-features = []
-for column_name in df.column_names:
- if df.col(column_name).std() > 0:
- features.append(column_name)
-return features
+features = [col.name for col in df.iter_columns() if col.std() > 0]
If df
is a lazy dataframe, then the call df.col(column_name).std() > 0
returns
+
If df
is a lazy dataframe, then the call col.std() > 0
returns
a (ducktyped) Python boolean scalar. No issues so far. Problem is,
-what happens when if df.col(column_name).std() > 0
is called?
Under the hood, Python will call (df.col(column_name).std() > 0).__bool__()
in
+what happens when if col.std() > 0
is called?
Under the hood, Python will call (col.std() > 0).__bool__()
in
order to extract a Python boolean. This is a problem for “lazy” implementations,
as the laziness needs breaking in order to evaluate the above.
Dask and Polars both require that .compute
(resp. .collect
) be called beforehand
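The `__bool__` behaviour described in this hunk can be illustrated with a small, self-contained sketch. `LazyScalar` and `select_features` are hypothetical names invented for this illustration and are not part of the Standard, Dask, or Polars; a simple max-minus-min spread stands in for `std()` so that no third-party library is needed.

```python
class LazyScalar:
    """Hypothetical stand-in for a lazy scalar (not the Standard's API)."""

    def __init__(self, thunk, log):
        self._thunk = thunk  # deferred computation
        self._log = log      # records every forced evaluation

    def __gt__(self, other):
        # The comparison stays lazy: it wraps the pending work in a new scalar.
        return LazyScalar(lambda: self._thunk() > other, self._log)

    def __bool__(self):
        # `if scalar:` lands here, so the deferred work has to run now.
        self._log.append("executed")
        return bool(self._thunk())


def select_features(columns):
    """Return names of columns with a positive spread, plus the execution log."""
    log = []
    features = []
    for name, values in columns.items():
        # Assumption: max - min plays the role of `col.std()` in the docs.
        spread = LazyScalar(lambda v=values: max(v) - min(v), log)
        if spread > 0:  # implicitly calls (spread > 0).__bool__()
            features.append(name)
    return features, log


features, log = select_features({"a": [1, 2, 3], "b": [5, 5, 5]})
print(features)  # ['a']
print(len(log))  # 2
```

In this sketch every `if` forces one evaluation, so the log grows by one entry per column. That is the cost the documentation asks users to control by persisting once, as late as possible.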
diff --git a/draft/genindex.html b/draft/genindex.html
index 68898d4b..8815617a 100644
--- a/draft/genindex.html
+++ b/draft/genindex.html
@@ -692,6 +692,8 @@