-
Notifications
You must be signed in to change notification settings - Fork 33
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* add support for FieldSummary - antlr syntax - ast expression builder - ast node builder - catalyst ast builder Signed-off-by: YANGDB <[email protected]> * add support for FieldSummary - antlr syntax - ast expression builder - ast node builder - catalyst ast builder Signed-off-by: YANGDB <[email protected]> * update sample query fix scala style format Signed-off-by: YANGDB <[email protected]> * support spark prior to 3.5 with its extended table identifier (existing table identifier only has 2 parts) Signed-off-by: YANGDB <[email protected]> * update union queries based summary Signed-off-by: YANGDB <[email protected]> * update scala fmt style Signed-off-by: YANGDB <[email protected]> * update scala fmt style Signed-off-by: YANGDB <[email protected]> * update query with where clause predicate Signed-off-by: YANGDB <[email protected]> * update command and remove the topvalues Signed-off-by: YANGDB <[email protected]> * update command docs Signed-off-by: YANGDB <[email protected]> * update with comments feedback Signed-off-by: YANGDB <[email protected]> * update `FIELD SUMMARY` symbols to the keywordsCanBeId bag of words Signed-off-by: YANGDB <[email protected]> --------- Signed-off-by: YANGDB <[email protected]>
- Loading branch information
Showing
17 changed files
with
1,993 additions
and
14 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
## PPL `fieldsummary` command | ||
|
||
**Description** | ||
Using `fieldsummary` command to : | ||
- Calculate basic statistics for each field (count, distinct count, min, max, avg, stddev, mean ) | ||
- Determine the data type of each field | ||
|
||
**Syntax** | ||
|
||
`... | fieldsummary <field-list> (nulls=true/false)` | ||
|
||
* command accepts any preceding pipe before the terminal `fieldsummary` command and will take them into account. | ||
* `includefields`: list of all the columns to be collected with statistics into a unified result set | ||
* `nulls`: optional; if the true, include the null values in the aggregation calculations (replace null with zero for numeric values) | ||
|
||
### Example 1: | ||
|
||
PPL query: | ||
|
||
os> source = t | where status_code != 200 | fieldsummary includefields= status_code nulls=true | ||
+------------------+-------------+------------+------------+------------+------------+------------+------------+----------------| | ||
| Fields | COUNT | COUNT_DISTINCT | MIN | MAX | AVG | MEAN | STDDEV | NUlls | TYPEOF | | ||
|------------------+-------------+------------+------------+------------+------------+------------+------------+----------------| | ||
| "status_code" | 2 | 2 | 301 | 403 | 352.0 | 352.0 | 72.12489168102785 | 0 | "int" | | ||
+------------------+-------------+------------+------------+------------+------------+------------+------------+----------------| | ||
|
||
### Example 2: | ||
|
||
PPL query: | ||
|
||
os> source = t | fieldsummary includefields= id, status_code, request_path nulls=true | ||
+------------------+-------------+------------+------------+------------+------------+------------+------------+----------------| | ||
| Fields | COUNT | COUNT_DISTINCT | MIN | MAX | AVG | MEAN | STDDEV | NUlls | TYPEOF | | ||
|------------------+-------------+------------+------------+------------+------------+------------+------------+----------------| | ||
| "id" | 6 | 6 | 1 | 6 | 3.5 | 3.5 | 1.8708286933869707 | 0 | "int" | | ||
+------------------+-------------+------------+------------+------------+------------+------------+------------+----------------| | ||
| "status_code" | 4 | 3 | 200 | 403 | 184.0 | 184.0 | 161.16699413961905 | 2 | "int" | | ||
+------------------+-------------+------------+------------+------------+------------+------------+------------+----------------| | ||
| "request_path" | 2 | 2 | /about| /home | 0.0 | 0.0 | 0 | 2 |"string"| | ||
+------------------+-------------+------------+------------+------------+------------+------------+------------+----------------| | ||
|
||
### Additional Info | ||
The actual query is translated into the following SQL-like statement: | ||
|
||
```sql | ||
SELECT | ||
id AS Field, | ||
COUNT(id) AS COUNT, | ||
COUNT(DISTINCT id) AS COUNT_DISTINCT, | ||
MIN(id) AS MIN, | ||
MAX(id) AS MAX, | ||
AVG(id) AS AVG, | ||
MEAN(id) AS MEAN, | ||
STDDEV(id) AS STDDEV, | ||
(COUNT(1) - COUNT(id)) AS Nulls, | ||
TYPEOF(id) AS TYPEOF | ||
FROM | ||
t | ||
GROUP BY | ||
TYPEOF(status_code), status_code; | ||
UNION | ||
SELECT | ||
status_code AS Field, | ||
COUNT(status_code) AS COUNT, | ||
COUNT(DISTINCT status_code) AS COUNT_DISTINCT, | ||
MIN(status_code) AS MIN, | ||
MAX(status_code) AS MAX, | ||
AVG(status_code) AS AVG, | ||
MEAN(status_code) AS MEAN, | ||
STDDEV(status_code) AS STDDEV, | ||
(COUNT(1) - COUNT(status_code)) AS Nulls, | ||
TYPEOF(status_code) AS TYPEOF | ||
FROM | ||
t | ||
GROUP BY | ||
TYPEOF(status_code), status_code; | ||
``` | ||
For each such columns (id, status_code) there will be a unique statement and all the fields will be presented togather in the result using a UNION operator | ||
|
||
|
||
### Limitation: | ||
- `topvalues` option was removed from this command due the possible performance impact of such sub-query. As an alternative one can use the `top` command directly as shown [here](ppl-top-command.md). | ||
|
Oops, something went wrong.