Improved schema and catalog support from SQL #1877

matthewmturner · 2022-02-24T16:46:05Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
(This section helps Arrow developers understand the context and why for this feature, in addition to the what)

Right now there is limited schema and catalog support in sql. You have to be using language bindings and register the catalog / schema to the ExecutionContext before being able to select. I would like to be able to create schemas and catalogs from sql, without language bindings, so that its easier to manage ddl and do data analysis just using datafusion-cli.

DataFusion CLI v7.0.0
❯ CREATE TABLE sch.tbl AS VALUES (1,2,3);
0 rows in set. Query took 0.032 seconds.
❯ show tables;
+---------------+--------------------+------------+------------+
| table_catalog | table_schema       | table_name | table_type |
+---------------+--------------------+------------+------------+
| datafusion    | public             | sch.tbl    | BASE TABLE |
| datafusion    | information_schema | tables     | VIEW       |
| datafusion    | information_schema | columns    | VIEW       |
+---------------+--------------------+------------+------------+
3 rows in set. Query took 0.015 seconds.
❯ CREATE TABLE cat.sch.tbl AS VALUES (1,2,3);
0 rows in set. Query took 0.003 seconds.
❯ show tables;
+---------------+--------------------+-------------+------------+
| table_catalog | table_schema       | table_name  | table_type |
+---------------+--------------------+-------------+------------+
| datafusion    | public             | cat.sch.tbl | BASE TABLE |
| datafusion    | public             | sch.tbl     | BASE TABLE |
| datafusion    | information_schema | tables      | VIEW       |
| datafusion    | information_schema | columns     | VIEW       |
+---------------+--------------------+-------------+------------+
4 rows in set. Query took 0.006 seconds.

❯ CREATE SCHEMA abc;
NotImplemented("Unsupported SQL statement: CreateSchema { schema_name: ObjectName([Ident { value: \"abc\", quote_style: None }]), if_not_exists: false }")

CREATE CATALOG xyz;  🤔 Invalid statement: sql parser error: Expected an object type after CREATE, found: CATALOG

Describe the solution you'd like
A clear and concise description of what you want to happen.
Support the following sql statements:

CREATE CATALOG abc
CREATE SCHEMA xyz
Parse catalog and schema from CREATE TABLE abc.xyz.tbl instead of creating table `
Create Schema support
Create Catalog / Database support
/c command on datafusion cli to set active db / catalog

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.
Postgres reference here
https://www.postgresql.org/docs/9.3/sql-createschema.html

The text was updated successfully, but these errors were encountered:

matthewmturner · 2022-02-24T17:21:40Z

@houqp @jimexist FYI - so you have an idea where im trying to go...im trying to make datafusion-cli a really quick and powerful tool for doing ad hoc data analysis / query testing - and i think improved catalog & schema support will help with that. Through a combination of this, #1872, and im hoping to also eventually add an s3 feature (but i have plenty to work on before that) i hope to achieve that. I just wanted to give you both some additional color about my longer term objectives.

Let me know if any concerns.

bobtins · 2022-03-12T01:11:59Z

I made a comment prematurely but I looked at code more and now I have some context.
It looks like part of catalog/schema has already been implemented...and you're just trying to make it work the rest of the way. In the ExecutionContext there are already catalogs and schemas, which is why "show tables" works the way it does; however, to fully implement these, much more is needed:

table refs now have to be parsed into [catalog.[schema.]]table (you found that this is broken)
what if existing tables have "."? how do you escape? will you break things?
current catalog and schema need to be maintained in ExecutionContext
1, 2, and 3-part names have to be interpreted correctly based on current context
add/update/delete of all these entities need to be implemented

As I said in my previous rant, a lot of times you don't need this...and it can add extra complexity for developers to maintain, and for users to keep track of.
How much would it take to finish this? Is there an alternate path?
Just because something's in information_schema doesn't mean it has to be fully supported; it is just meant to be a cross-database metadata standard which maps down to existing functionality.
Hope this perspective is helpful!

matthewmturner · 2022-03-12T16:43:37Z

Fortunately, we already have the ability to parse table refs into catalog, schema, and table with TableReference::resolve(default_catalog, default_schema) -> ResolvedTableReference. I havent seen any issues with this yet.

Definitely agree on maintaining current active catalog / schema. From a pure sql / datafusion-cli perspective i was thinking of using the \c command like psql for connecting to a database / catalog. Then for schema maybe support for the SET command with options like search_path. Personally, im ok with those being follow on steps as im fine with using fully qualified names in queries for now.

i believe postgres supports putting databases names, schemas, and tables in double quotes so i would think we could support something similar for resolving names with ".".

In my current PR #1959 i thought i was actually quite close with adding CREATE SCHEMA support however im currently having an issue with getting the schema to register to the catalog. Still need to figure that out.

bobtins · 2022-03-15T02:04:04Z

In this example, it seemed like parsing the table refs wasn't working:

❯ CREATE TABLE sch.tbl AS VALUES (1,2,3);
0 rows in set. Query took 0.032 seconds.
❯ show tables;
+---------------+--------------------+------------+------------+
| table_catalog | table_schema       | table_name | table_type |
+---------------+--------------------+------------+------------+
| datafusion    | public             | sch.tbl    | BASE TABLE |

Then I saw a change in datafusion/src/catalog/mod.rs to split the table reference; assuming that fixes it.

You said

Personally, im ok with those being follow on steps as im fine with using fully qualified names in queries for now.

I'm assuming we will still be able to use simple table names, right?
Take what I say with a grain of salt--I'm a little obsessive about keeping things simple, but I don't know the whole history. It sounds like you are making this feature more robust, and that's always good :)

matthewmturner · 2022-03-15T12:32:19Z

ah yeah, i was referring to in my PR (#1959) which was leveraging on what looked like an incomplete feature for properly resolving catalog / schema / table. You can still query for single table names in the currently active catalog / schema - which right now is only 'datafusion' and 'public' - but i plan on adding a feature (such as psql \c to connect to database) to help with that.

I made significant progress on the above referenced PR, just working on the ballista serialization now. Would be good to get another set of eyes on it if youre interested. Below is example from that PR.

❯ CREATE SCHEMA abc;
0 rows in set. Query took 0.001 seconds.
❯ CREATE TABLE abc.y AS VALUES (1,2,3);
0 rows in set. Query took 0.002 seconds.
❯ SHOW TABLES;
+---------------+--------------------+----------------+------------+
| table_catalog | table_schema       | table_name     | table_type |
+---------------+--------------------+----------------+------------+
| datafusion    | public             | alltypes_plain | BASE TABLE |
| datafusion    | public             | x              | BASE TABLE |
| datafusion    | abc                | y              | BASE TABLE |
| datafusion    | information_schema | tables         | VIEW       |
| datafusion    | information_schema | columns        | VIEW       |
+---------------+--------------------+----------------+------------+
5 rows in set. Query took 0.016 seconds.
❯ SELECT * FROM x;
+---------+---------+
| column1 | column2 |
+---------+---------+
| 1       | 2       |
+---------+---------+
1 row in set. Query took 0.004 seconds.
❯ SELECT * FROM y;
Plan("Table or CTE with name 'y' not found")
❯ SELECT * FROM abc.y;
+---------+---------+---------+
| column1 | column2 | column3 |
+---------+---------+---------+
| 1       | 2       | 3       |
+---------+---------+---------+
1 row in set. Query took 0.005 seconds.

alamb · 2023-06-12T10:51:54Z

I think this is now complete. Let's file additional tickets for follow on work

DataFusion CLI v26.0.0
❯ create schema foo;
0 rows in set. Query took 0.019 seconds.
❯ create database bar;
0 rows in set. Query took 0.000 seconds.

matthewmturner added the enhancement New feature or request label Feb 24, 2022

matthewmturner mentioned this issue Mar 8, 2022

Add Create Schema functionality in SQL #1959

Merged

matthewmturner mentioned this issue Mar 25, 2022

Add CREATE DATABASE command to SQL #2094

Merged

alamb closed this as completed Jun 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improved schema and catalog support from SQL #1877

Improved schema and catalog support from SQL #1877

matthewmturner commented Feb 24, 2022 •

edited

Loading

matthewmturner commented Feb 24, 2022

bobtins commented Mar 12, 2022 •

edited

Loading

matthewmturner commented Mar 12, 2022

bobtins commented Mar 15, 2022

matthewmturner commented Mar 15, 2022 •

edited

Loading

alamb commented Jun 12, 2023

Improved schema and catalog support from SQL #1877

Improved schema and catalog support from SQL #1877

Comments

matthewmturner commented Feb 24, 2022 • edited Loading

matthewmturner commented Feb 24, 2022

bobtins commented Mar 12, 2022 • edited Loading

matthewmturner commented Mar 12, 2022

bobtins commented Mar 15, 2022

matthewmturner commented Mar 15, 2022 • edited Loading

alamb commented Jun 12, 2023

matthewmturner commented Feb 24, 2022 •

edited

Loading

bobtins commented Mar 12, 2022 •

edited

Loading

matthewmturner commented Mar 15, 2022 •

edited

Loading