The DuckDB step enables you to use DuckDB, a column-oriented in-process SQL OLAP database management system, to access data. It is designed with a focus on high-performance analytics. DuckDB’s architecture is optimized for vectorized query execution, which allows it to process large amounts of data efficiently, providing impressive query performance. It has particular strength working with Parquet files and csv files. Depending on the use case running analytical SQL queries (using SQL aggregate functions) can be very efficient when using DuckDB.
The Usage section below explains how to check whether the custom step is finding the Python DuckDB libraries. Examples of using this custom step, including some sample data, can be found in extras/README.md.
DuckDB is very simple to install. For this Custom Step all it needs is to install its Python libraries via “pip“.
In this version the following data sources are supported:
- You can use the step either as a start step or in the middle of a flow. When used in the middle of a flow, you can reference the input table in a SQL statement as _flow_table.
For example:SELECT * FROM _flow_table
- You can build SQL statements referencing tables from different data stores like Parquet, CSV, etc.
For example:SELECT * FROM taxi.parquet t JOIN driver.csv d ON t.driver_id = d.driver_number
For documentation on supported SQL see DuckDB SQL documentation.
There are 5 tab on the user interface for the custom step:
In the tab SQL you can submit one or more SQL statements.
Standalone mode | Flow mode |
---|---|
UI Field | Comment |
---|---|
SQL Pass Through (Ignore file options) |
When ticked: - All options set in tab Parquet and CSV will be ignored. - SQL will be sent as is to DuckDB (SQL Pass Through). - You can use SQL with DuckDB functions. E.g.: SELECT * FROM read_parquet('test.parq'); |
SQL statement | Field for SQL statement. For allowed SQL syntax see DuckDB SQL documentation You can submit several SQL statements in this field. SQL statements need to delimited with a semicolon (;) when submitting more than one statement. Example - Load table into DuckDB: CREATE SCHEMA IF NOT EXISTS asd; --create taxt tabke based on mytaxidata; CREATE TABLE taxi as ( SELECT * FROM mytaxidata.csv); Note: When using a single line comment in the SQL statement (starting with double hyphen ( -- )) you need to end the comment with a semicolon (;). |
This tab is to set general options for DuckDB.
DuckDB Options |
---|
UI Filed | Comment |
---|---|
File search path | Set all paths as comma separated list where DuckDb can find Parquet or CSV files |
Use persistent database | When ticked you can set a persistent database for DuckDB |
Set DuckDB folder and name | Set the folder and name for the database. The database has to reside on SAS Server. The database cannot reside in SAS Content. |
Configuration settings | Set DuckDB configuration settings. Each command must be delimited with a semicolon (;). For available parameters see DuckDB documentation |
This tab is to set parameters for reading Parquet files. The options in this tab will only take effect on selecting data from one Parquet file. The options will not take effect when joining tables.
Parquet files |
---|
UI Filed | Comment | DuckDB Parameter |
---|---|---|
Add row number column | A column containing the row number will be added to the output table. | file_row_number |
Add file name column | A column containing the the Parquet file name will be added to the output table. | filename |
Union multiple files by column name instead of column position | You can read several files with the same file structure in one SQL statement. E.g.: SELECT * FROM yellow*.parquet. If this option is ticked DuckDB will union the files by column name. If the option is not ticked DuckDb will union the files by column position. | union_by_name |
This tab is to set parameters for reading csv files. You can set General settings valid for all csv files or you can overwrite settings for up to 5 names files.
CSV - General settings | CSV - File settings |
---|---|
UI Filed | Comment | DuckDB Parameter |
---|---|---|
Number of sample rows to read to detect column type | Number of sample rows to be read to detect column type. | sample_size |
Number of lines at the top to skip | Number of rows from the top not to be read. | skip |
Make all columns of type VARCHAR | Makes all columns in the file of type varchar even if they only hold numeric values. | all_varchar |
Union multiple files by column name instead of column position | You can read several files with the same file structure in one SQL statement. E.g.: SELECT * FROM yellow*.csv. If this option is ticked DuckDB will union the files by column name. If the option is not ticked DuckDb will union the files by column position. | union_by_name |
Add file name column | A column containing the the csv file name will be added to the output table. The option only take effect on selecting data from a single file. The options will not take effect when joining tables. | filename |
Column delimiter | Delimiter to separate columns in the csv file. | delim |
Quoting string when a data value is quoted | Specify the quoting string to be used when a data value is quoted. Default is double quotes. | quote |
Date format to use when parsing dates | Specify the date format to use when parsing dates. E.g. use %d/%m/%Y to show the date like 15/05/2023. For more format information see DuckDB documentation | dateformat |
Decimal separator of numbers | Set decimal separator of numbers. Default is dot (.) | decimal_separator |
String that represents a NULL value | Specify a string that represents a NULL value | nullstr |
UI Field | Comment |
---|---|
Number of csv files to overwrite general options | Set the number of named file to overwrite general options. You can set up to 5 overwrite settings. |
In Overwrite settings you have the same settings as in the General settings section plus some file specific settings. If an overwrite setting for a file is set, all settings for the named files will be picked up from the appropriate overwrite section. In the table below only the file specific settings are listed:
UI Filed | Comment | DuckDB Parameter |
---|---|---|
File name | The name of the csv file to overwrite the general settings for. The file name is case sensitive and must include the file suffix '.csv' | |
Column names | Set or overwrite the file column names. The column names are supplied as a string comma separated list. E.g.: 'A1','B2','C3' will set the column names for columns 1 - 3 in the respective order. | names |
Column names and types | Set or overwrite the column names and types. The format is: 'column name':'column type' as a comma separated list. E.g.: 'Col1':'INTEGER','Col2':'VARCHAR'. For a list of available data types see DuckDB documentation | columns |
This tab is to set connection parameters for PostgreSQL. When connected with Postgres, DuckDb will create views for all tables in the Postgres schema. No data is uploaded into DuckDB. DuckDb is using its Postgres Scanner to receive data from Postgres.
Tip: Depending on data access frequency, data size and the required performance it may be worth uploading data into DuckDB to gain better performance.
PostgreSQL |
---|
UI Filed | Comment |
---|---|
Connect Postgres database | When ticked a connection to Postgres will be established using the below connection settings |
Postgres server | Numeric IP address or name of host to connect to. |
Postgres port | Port number to connect to at the server host. |
User | PostgreSQL user name to connect as. |
Password | Password to be used. |
Postgres database | The Postgres database name to connect to. |
Postgres schema | The Postgres database schema to connect to. |
DuckDB Schema | The database schema in DuckDB where the views for the Postgres tables are written to. If the schema name does not exist it will be created. |
2023.01 or later
- Python needs to be installed and configured to work with SAS Studio.
- The Python package for DuckDB (version 0.8.1 or later) to be installed
pip install duckdb
When you have installed the custom step, test it to ensure it is working with your Python DuckDB.
- Right mouse click on the step.
- Choose "Open in a tab".
- Set output table to WORK.duckout
- Set SQL statement
SELECT 42
. - Run step.
If DuckDB is not installed or Python cannot find it you will receive an error: "Unhandled Python exception. No module named 'duckdb'"
For examples using the DuckDB step in a flow in SAS Studio see here
Version 1.0 (12JUL2023)
- Initial version