[Feature] Data exploration of unique rows #477

jaychia · 2023-01-19T21:58:55Z

jaychia
Jan 19, 2023
Maintainer

Summary

An important part of data exploration is retrieving the unique values within one or more columns and counting how often they occur.

In SQL, this is achieved with:

SELECT DISTINCT x, y, z FROM ... - retrieves distinct tuples of (x, y, z)
SELECT x, y, z, COUNT(*) FROM ... GROUP BY x, y, z - retrieves distinct tuples of (x, y, z) and counts of their occurrences
SELECT COUNT(DISTINCT x), COUNT(DISTINCT y), COUNT(DISTINCT z) FROM ... - retrieves counts of distinct occurrences of x, y, and z
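
As a rough illustration of what these three queries return, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `t` and the sample rows are made up for this example, and the `ORDER BY` clauses are added only to make the result order predictable.

```python
import sqlite3

# In-memory database with a small, made-up table `t` (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER, y TEXT, z REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, "a", 0.5), (1, "a", 0.5), (2, "b", 0.5), (2, "c", 1.5)],
)

# Distinct tuples of (x, y, z).
distinct_rows = conn.execute(
    "SELECT DISTINCT x, y, z FROM t ORDER BY x, y, z"
).fetchall()

# Distinct tuples of (x, y, z) with the number of times each occurs.
tuple_counts = conn.execute(
    "SELECT x, y, z, COUNT(*) FROM t GROUP BY x, y, z ORDER BY x, y, z"
).fetchall()

# Number of distinct values in each column, computed independently.
distinct_counts = conn.execute(
    "SELECT COUNT(DISTINCT x), COUNT(DISTINCT y), COUNT(DISTINCT z) FROM t"
).fetchone()

print(distinct_rows)
print(tuple_counts)
print(distinct_counts)
```

Note that the third query counts each column on its own, so its result cannot be derived from the number of distinct tuples.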

In other DataFrames such as Pandas, there are dedicated methods such as:

df.drop_duplicates
df.groupby(["x", "y", "z"]).size()
df.value_counts
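
For comparison, a short sketch of how these Pandas methods behave on a small, made-up DataFrame. The sample data is an assumption for illustration, and `nunique` is not mentioned above; it is included only because it is the closest Pandas counterpart to `COUNT(DISTINCT ...)`.

```python
import pandas as pd

# Small, made-up DataFrame (illustrative only).
df = pd.DataFrame(
    {
        "x": [1, 1, 2, 2],
        "y": ["a", "a", "b", "c"],
        "z": [0.5, 0.5, 0.5, 1.5],
    }
)

# Distinct rows, comparable to SELECT DISTINCT x, y, z.
distinct_rows = df.drop_duplicates(subset=["x", "y", "z"])

# Occurrences of each distinct (x, y, z) tuple, indexed by the tuple.
tuple_counts = df.groupby(["x", "y", "z"]).size()

# Same counts, sorted by frequency in descending order.
value_counts = df.value_counts(subset=["x", "y", "z"])

# Number of distinct values per column, comparable to COUNT(DISTINCT ...).
distinct_counts = df[["x", "y", "z"]].nunique()

print(distinct_rows)
print(tuple_counts)
print(value_counts)
print(distinct_counts)
```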

Proposal

# 1: Retrieve a DataFrame of distinct rows - equivalent to a SQL `SELECT DISTINCT` or `pd.DataFrame.drop_duplicates`
df = df.distinct(df["x"], df["y"], df["z"])

# 2: Count occurrences of distinct tuples
# Alias for: df.groupby(df["x"], df["y"], df["z"]).agg([("*", "count")])
df.groupby(df["x"], df["y"], df["z"]).count()

# 3: Retrieve counts of distinct occurrences
df.agg([
    (df["x"], "count_distinct"),
    (df["y"], "count_distinct"),
    (df["z"], "count_distinct"),
])
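
The intended semantics of the three proposed operations can be sketched in plain Python, independent of any DataFrame library. This is only a reference model under stated assumptions: the function names `distinct`, `groupby_count`, and `count_distinct` are hypothetical helpers for this sketch, rows are modelled as dicts, and `distinct` is assumed to return only the key columns, as `SELECT DISTINCT x, y, z` would.

```python
from collections import Counter


def distinct(rows, keys):
    """Return one row per distinct key tuple, in first-seen order.

    Assumption: only the key columns are kept, mirroring SELECT DISTINCT.
    """
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(keys, key)))
    return result


def groupby_count(rows, keys):
    """Map each distinct key tuple to its number of occurrences."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


def count_distinct(rows, keys):
    """Count distinct values per column, each column independently.

    None is counted as a value here, unlike SQL's COUNT(DISTINCT ...),
    which skips NULLs.
    """
    return {k: len({row[k] for row in rows}) for k in keys}


# Made-up sample data (illustrative only).
rows = [
    {"x": 1, "y": "a", "z": 0.5},
    {"x": 1, "y": "a", "z": 0.5},
    {"x": 2, "y": "b", "z": 0.5},
    {"x": 2, "y": "c", "z": 1.5},
]
keys = ["x", "y", "z"]

print(distinct(rows, keys))
print(groupby_count(rows, keys))
print(count_distinct(rows, keys))
```

How nulls should be treated by a real `"count_distinct"` aggregation is left open here; the sketch simply picks one behaviour and states it.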

Work Breakdown

Concretely, the work needed to achieve the above is:

Allow .distinct() to take in columns as arguments
Implement aggregations on groupby keys
Implement a `"count_distinct"` aggregation
