Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved schema and catalog support from SQL #1877

Closed
2 of 3 tasks
matthewmturner opened this issue Feb 24, 2022 · 6 comments
Closed
2 of 3 tasks

Improved schema and catalog support from SQL #1877

matthewmturner opened this issue Feb 24, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@matthewmturner
Copy link
Contributor

matthewmturner commented Feb 24, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
(This section helps Arrow developers understand the context and why for this feature, in addition to the what)

Right now there is limited schema and catalog support in sql. You have to be using language bindings and register the catalog / schema to the ExecutionContext before being able to select. I would like to be able to create schemas and catalogs from sql, without language bindings, so that its easier to manage ddl and do data analysis just using datafusion-cli.

DataFusion CLI v7.0.0
❯ CREATE TABLE sch.tbl AS VALUES (1,2,3);
0 rows in set. Query took 0.032 seconds.
❯ show tables;
+---------------+--------------------+------------+------------+
| table_catalog | table_schema       | table_name | table_type |
+---------------+--------------------+------------+------------+
| datafusion    | public             | sch.tbl    | BASE TABLE |
| datafusion    | information_schema | tables     | VIEW       |
| datafusion    | information_schema | columns    | VIEW       |
+---------------+--------------------+------------+------------+
3 rows in set. Query took 0.015 seconds.
❯ CREATE TABLE cat.sch.tbl AS VALUES (1,2,3);
0 rows in set. Query took 0.003 seconds.
❯ show tables;
+---------------+--------------------+-------------+------------+
| table_catalog | table_schema       | table_name  | table_type |
+---------------+--------------------+-------------+------------+
| datafusion    | public             | cat.sch.tbl | BASE TABLE |
| datafusion    | public             | sch.tbl     | BASE TABLE |
| datafusion    | information_schema | tables      | VIEW       |
| datafusion    | information_schema | columns     | VIEW       |
+---------------+--------------------+-------------+------------+
4 rows in set. Query took 0.006 seconds.
❯ CREATE SCHEMA abc;
NotImplemented("Unsupported SQL statement: CreateSchema { schema_name: ObjectName([Ident { value: \"abc\", quote_style: None }]), if_not_exists: false }")
CREATE CATALOG xyz;  🤔 Invalid statement: sql parser error: Expected an object type after CREATE, found: CATALOG

Describe the solution you'd like
A clear and concise description of what you want to happen.
Support the following sql statements:

  • CREATE CATALOG abc

  • CREATE SCHEMA xyz

  • Parse catalog and schema from CREATE TABLE abc.xyz.tbl instead of creating table `

  • Create Schema support

  • Create Catalog / Database support

  • /c command on datafusion cli to set active db / catalog

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.
Postgres reference here
https://www.postgresql.org/docs/9.3/sql-createschema.html

@matthewmturner matthewmturner added the enhancement New feature or request label Feb 24, 2022
@matthewmturner
Copy link
Contributor Author

@houqp @jimexist FYI - so you have an idea where im trying to go...im trying to make datafusion-cli a really quick and powerful tool for doing ad hoc data analysis / query testing - and i think improved catalog & schema support will help with that. Through a combination of this, #1872, and im hoping to also eventually add an s3 feature (but i have plenty to work on before that) i hope to achieve that. I just wanted to give you both some additional color about my longer term objectives.

Let me know if any concerns.

@bobtins
Copy link

bobtins commented Mar 12, 2022

I made a comment prematurely but I looked at code more and now I have some context.
It looks like part of catalog/schema has already been implemented...and you're just trying to make it work the rest of the way. In the ExecutionContext there are already catalogs and schemas, which is why "show tables" works the way it does; however, to fully implement these, much more is needed:

  • table refs now have to be parsed into [catalog.[schema.]]table (you found that this is broken)
  • what if existing tables have "."? how do you escape? will you break things?
  • current catalog and schema need to be maintained in ExecutionContext
  • 1, 2, and 3-part names have to be interpreted correctly based on current context
  • add/update/delete of all these entities need to be implemented

As I said in my previous rant, a lot of times you don't need this...and it can add extra complexity for developers to maintain, and for users to keep track of.
How much would it take to finish this? Is there an alternate path?
Just because something's in information_schema doesn't mean it has to be fully supported; it is just meant to be a cross-database metadata standard which maps down to existing functionality.
Hope this perspective is helpful!

@matthewmturner
Copy link
Contributor Author

Fortunately, we already have the ability to parse table refs into catalog, schema, and table with TableReference::resolve(default_catalog, default_schema) -> ResolvedTableReference. I havent seen any issues with this yet.

Definitely agree on maintaining current active catalog / schema. From a pure sql / datafusion-cli perspective i was thinking of using the \c command like psql for connecting to a database / catalog. Then for schema maybe support for the SET command with options like search_path. Personally, im ok with those being follow on steps as im fine with using fully qualified names in queries for now.

i believe postgres supports putting databases names, schemas, and tables in double quotes so i would think we could support something similar for resolving names with ".".

In my current PR #1959 i thought i was actually quite close with adding CREATE SCHEMA support however im currently having an issue with getting the schema to register to the catalog. Still need to figure that out.

@bobtins
Copy link

bobtins commented Mar 15, 2022

In this example, it seemed like parsing the table refs wasn't working:

❯ CREATE TABLE sch.tbl AS VALUES (1,2,3);
0 rows in set. Query took 0.032 seconds.
❯ show tables;
+---------------+--------------------+------------+------------+
| table_catalog | table_schema       | table_name | table_type |
+---------------+--------------------+------------+------------+
| datafusion    | public             | sch.tbl    | BASE TABLE |

Then I saw a change in datafusion/src/catalog/mod.rs to split the table reference; assuming that fixes it.

You said

Personally, im ok with those being follow on steps as im fine with using fully qualified names in queries for now.

I'm assuming we will still be able to use simple table names, right?
Take what I say with a grain of salt--I'm a little obsessive about keeping things simple, but I don't know the whole history. It sounds like you are making this feature more robust, and that's always good :)

@matthewmturner
Copy link
Contributor Author

matthewmturner commented Mar 15, 2022

ah yeah, i was referring to in my PR (#1959) which was leveraging on what looked like an incomplete feature for properly resolving catalog / schema / table. You can still query for single table names in the currently active catalog / schema - which right now is only 'datafusion' and 'public' - but i plan on adding a feature (such as psql \c to connect to database) to help with that.

I made significant progress on the above referenced PR, just working on the ballista serialization now. Would be good to get another set of eyes on it if youre interested. Below is example from that PR.

❯ CREATE SCHEMA abc;
0 rows in set. Query took 0.001 seconds.
❯ CREATE TABLE abc.y AS VALUES (1,2,3);
0 rows in set. Query took 0.002 seconds.
❯ SHOW TABLES;
+---------------+--------------------+----------------+------------+
| table_catalog | table_schema       | table_name     | table_type |
+---------------+--------------------+----------------+------------+
| datafusion    | public             | alltypes_plain | BASE TABLE |
| datafusion    | public             | x              | BASE TABLE |
| datafusion    | abc                | y              | BASE TABLE |
| datafusion    | information_schema | tables         | VIEW       |
| datafusion    | information_schema | columns        | VIEW       |
+---------------+--------------------+----------------+------------+
5 rows in set. Query took 0.016 seconds.
❯ SELECT * FROM x;
+---------+---------+
| column1 | column2 |
+---------+---------+
| 1       | 2       |
+---------+---------+
1 row in set. Query took 0.004 seconds.
❯ SELECT * FROM y;
Plan("Table or CTE with name 'y' not found")
❯ SELECT * FROM abc.y;
+---------+---------+---------+
| column1 | column2 | column3 |
+---------+---------+---------+
| 1       | 2       | 3       |
+---------+---------+---------+
1 row in set. Query took 0.005 seconds.

@alamb
Copy link
Contributor

alamb commented Jun 12, 2023

I think this is now complete. Let's file additional tickets for follow on work

DataFusion CLI v26.0.0
❯ create schema foo;
0 rows in set. Query took 0.019 seconds.
❯ create database bar;
0 rows in set. Query took 0.000 seconds.

@alamb alamb closed this as completed Jun 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants