PPL fieldsummary command (opensearch-project#766)
* add support for FieldSummary
 - antlr syntax
 - ast expression builder
 - ast node builder
 - catalyst ast builder

Signed-off-by: YANGDB <[email protected]>

* add support for FieldSummary
 - antlr syntax
 - ast expression builder
 - ast node builder
 - catalyst ast builder

Signed-off-by: YANGDB <[email protected]>

* update sample query
fix scala style format

Signed-off-by: YANGDB <[email protected]>

* support spark prior to 3.5 with its extended table identifier (existing table identifier only has 2 parts)

Signed-off-by: YANGDB <[email protected]>

* update union queries based summary

Signed-off-by: YANGDB <[email protected]>

* update scala fmt style

Signed-off-by: YANGDB <[email protected]>

* update scala fmt style

Signed-off-by: YANGDB <[email protected]>

* update query with where clause predicate

Signed-off-by: YANGDB <[email protected]>

* update command and remove the topvalues

Signed-off-by: YANGDB <[email protected]>

* update command docs

Signed-off-by: YANGDB <[email protected]>

* update with comments feedback

Signed-off-by: YANGDB <[email protected]>

* update `FIELD SUMMARY` symbols to the keywordsCanBeId bag of words

Signed-off-by: YANGDB <[email protected]>

---------

Signed-off-by: YANGDB <[email protected]>
YANG-DB authored and 14yapkc1 committed Dec 11, 2024
1 parent a038e58 commit 6c9aa09
Showing 15 changed files with 1,963 additions and 14 deletions.
6 changes: 6 additions & 0 deletions docs/ppl-lang/PPL-Example-Commands.md
@@ -33,6 +33,12 @@ _- **Limitation: new field added by eval command with a function cannot be dropp
- `source = table | eval b1 = b + 1 | fields - b1,c` (Field `b1` cannot be dropped caused by SPARK-49782)
- `source = table | eval b1 = lower(b) | fields - b1,c` (Field `b1` cannot be dropped caused by SPARK-49782)

**Field-Summary**
[See additional command details](ppl-fieldsummary-command.md)
- `source = t | fieldsummary includefields=status_code nulls=false`
- `source = t | fieldsummary includefields= id, status_code, request_path nulls=true`
- `source = t | where status_code != 200 | fieldsummary includefields= status_code nulls=true`

**Nested-Fields**
- `source = catalog.schema.table1, catalog.schema.table2 | fields A.nested1, B.nested1`
- `source = catalog.table | where struct_col2.field1.subfield > 'valueA' | sort int_col | fields int_col, struct_col.field1.subfield, struct_col2.field1.subfield`
83 changes: 83 additions & 0 deletions docs/ppl-lang/ppl-fieldsummary-command.md
@@ -0,0 +1,83 @@
## PPL `fieldsummary` command

**Description**
Use the `fieldsummary` command to:
- Calculate basic statistics for each field (count, distinct count, min, max, avg, mean, stddev)
- Determine the data type of each field

**Syntax**

`... | fieldsummary includefields=<field-list> (nulls=true/false)`

* The command accepts any preceding pipe commands before the terminal `fieldsummary` command and takes them into account.
* `includefields`: the list of columns whose statistics are collected into a unified result set
* `nulls`: optional; if true, null values are included in the aggregation calculations (nulls are replaced with zero for numeric values)

### Example 1:

PPL query:

os> source = t | where status_code != 200 | fieldsummary includefields= status_code nulls=true
+---------------+-------+----------------+-----+-----+-------+-------+-------------------+-------+--------+
| Fields        | COUNT | COUNT_DISTINCT | MIN | MAX | AVG   | MEAN  | STDDEV            | Nulls | TYPEOF |
|---------------+-------+----------------+-----+-----+-------+-------+-------------------+-------+--------|
| "status_code" | 2     | 2              | 301 | 403 | 352.0 | 352.0 | 72.12489168102785 | 0     | "int"  |
+---------------+-------+----------------+-----+-----+-------+-------+-------------------+-------+--------+
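
The AVG and STDDEV values in Example 1 can be reproduced by hand. The following is a minimal sketch in Python, not part of the command itself. It assumes the two matching rows hold the status codes 301 and 403 (implied by COUNT, MIN and MAX), and it assumes STDDEV is the sample standard deviation (dividing by n - 1), which is the variant the reported value matches:

```python
import statistics

# Assumed for illustration: the two rows that pass `status_code != 200`,
# as implied by COUNT=2, MIN=301 and MAX=403 in the table above.
status_codes = [301, 403]

avg = sum(status_codes) / len(status_codes)

# Sample standard deviation: divides the squared deviations by n - 1.
sample_stddev = statistics.stdev(status_codes)

# Population standard deviation: divides by n. Shown only for contrast.
population_stddev = statistics.pstdev(status_codes)

print(f"AVG={avg} STDDEV(sample)={sample_stddev} STDDEV(population)={population_stddev}")
```

The sample variant gives roughly 72.12, as in the table, while the population variant gives 51.0.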

### Example 2:

PPL query:

os> source = t | fieldsummary includefields= id, status_code, request_path nulls=true
+----------------+-------+----------------+--------+-------+-------+-------+--------------------+-------+----------+
| Fields         | COUNT | COUNT_DISTINCT | MIN    | MAX   | AVG   | MEAN  | STDDEV             | Nulls | TYPEOF   |
|----------------+-------+----------------+--------+-------+-------+-------+--------------------+-------+----------|
| "id"           | 6     | 6              | 1      | 6     | 3.5   | 3.5   | 1.8708286933869707 | 0     | "int"    |
+----------------+-------+----------------+--------+-------+-------+-------+--------------------+-------+----------+
| "status_code"  | 4     | 3              | 200    | 403   | 184.0 | 184.0 | 161.16699413961905 | 2     | "int"    |
+----------------+-------+----------------+--------+-------+-------+-------+--------------------+-------+----------+
| "request_path" | 2     | 2              | /about | /home | 0.0   | 0.0   | 0                  | 2     | "string" |
+----------------+-------+----------------+--------+-------+-------+-------+--------------------+-------+----------+
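
The `status_code` row shows the effect of `nulls=true`: COUNT is 4, yet AVG is 184.0 because the two null rows enter the average as zero. The sketch below imitates that behaviour in plain Python. The helper `summarize` and the row values are hypothetical, chosen to be consistent with the table. That MIN, MAX and COUNT look only at non-null values is an assumption read off the table, not taken from the implementation:

```python
import statistics

# Hypothetical rows consistent with the table: four non-null status codes
# and two nulls (None). The concrete values are an assumption.
status_codes = [200, 200, 301, 403, None, None]

def summarize(values, nulls):
    """Imitate the fieldsummary columns for one numeric field."""
    non_null = [v for v in values if v is not None]
    # nulls=True: replace nulls with zero before AVG/STDDEV.
    # nulls=False: aggregate over the non-null values only.
    agg = [0 if v is None else v for v in values] if nulls else non_null
    return {
        "COUNT": len(non_null),
        "COUNT_DISTINCT": len(set(non_null)),
        "MIN": min(non_null),
        "MAX": max(non_null),
        "AVG": sum(agg) / len(agg),
        "STDDEV": statistics.stdev(agg),  # sample standard deviation
        "Nulls": len(values) - len(non_null),
    }

with_nulls = summarize(status_codes, nulls=True)
without_nulls = summarize(status_codes, nulls=False)
print(with_nulls)
print(without_nulls)
```

With `nulls=False` the same rows average 276.0 instead of 184.0, so the option changes the numeric aggregates noticeably when a field has many nulls.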

### Additional Info
The actual query is translated into the following SQL-like statement:

```sql
SELECT
    'id' AS Field,
    COUNT(id) AS COUNT,
    COUNT(DISTINCT id) AS COUNT_DISTINCT,
    MIN(id) AS MIN,
    MAX(id) AS MAX,
    AVG(id) AS AVG,
    MEAN(id) AS MEAN,
    STDDEV(id) AS STDDEV,
    (COUNT(1) - COUNT(id)) AS Nulls,
    TYPEOF(id) AS TYPEOF
FROM
    t
GROUP BY
    TYPEOF(id)
UNION
SELECT
    'status_code' AS Field,
    COUNT(status_code) AS COUNT,
    COUNT(DISTINCT status_code) AS COUNT_DISTINCT,
    MIN(status_code) AS MIN,
    MAX(status_code) AS MAX,
    AVG(status_code) AS AVG,
    MEAN(status_code) AS MEAN,
    STDDEV(status_code) AS STDDEV,
    (COUNT(1) - COUNT(status_code)) AS Nulls,
    TYPEOF(status_code) AS TYPEOF
FROM
    t
GROUP BY
    TYPEOF(status_code);
```
Each included column (here `id` and `status_code`) gets its own statement, and the results for all fields are presented together using the `UNION` operator.
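
The per-field pattern can be sketched as a small string generator. This is an illustration only: the helper `field_summary_sql` is hypothetical, and the actual translation happens in the Catalyst AST builder rather than through string templates. Emitting the field name as a string literal and grouping by `TYPEOF` alone are assumptions of this sketch, chosen so that each field yields a single summary row:

```python
def field_summary_sql(table, fields):
    """Build one aggregate SELECT per field and join them with UNION."""
    selects = []
    for f in fields:
        selects.append(
            f"SELECT '{f}' AS Field, "
            f"COUNT({f}) AS COUNT, "
            f"COUNT(DISTINCT {f}) AS COUNT_DISTINCT, "
            f"MIN({f}) AS MIN, MAX({f}) AS MAX, "
            f"AVG({f}) AS AVG, MEAN({f}) AS MEAN, "
            f"STDDEV({f}) AS STDDEV, "
            f"(COUNT(1) - COUNT({f})) AS Nulls, "
            f"TYPEOF({f}) AS TYPEOF "
            f"FROM {table} GROUP BY TYPEOF({f})"
        )
    # One statement per field, combined into a single result set.
    return "\nUNION\n".join(selects)

sql = field_summary_sql("t", ["id", "status_code"])
print(sql)
```

Adding a field to `includefields` adds one more `SELECT` branch, so the cost of the command grows with the number of fields listed.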


### Limitation:
- The `topvalues` option was removed from this command due to the possible performance impact of such a sub-query. As an alternative, use the `top` command directly, as shown [here](ppl-top-command.md).
