Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Guidelines to handle schemes and views #24

Closed
zozlak opened this issue Dec 9, 2014 · 40 comments · Fixed by #220
Closed

Guidelines to handle schemes and views #24

zozlak opened this issue Dec 9, 2014 · 40 comments · Fixed by #220

Comments

@zozlak
Copy link

zozlak commented Dec 9, 2014

I rise my issue here, because I think it would be valuable to decide such things on the "top level", so implementations of the DBI API can be compiliant one with the other and provide users with a really unified API.

Wide variety of modern database systems supports:

  • many schemes per database;
  • views;

but there are no official guidelines how to handle that in packages implementing the DBI API.

There is a nottice in dbGetTables() description that "This should, where possible, include temporary tables.". Is it possible to add a similar notice regarding views? And maybe also in dbReadTable() description?
I am asking about it because views play a vital role for both:

  • assuring user privileages;
  • making complicated database structures available for "everyday users";

and ommiting them in DBI API implementations dramatically reduce their functionality.
This is even more painful as dplyr uses DBI database connectivity backends.

The schemes issue is more complicated. DBI wanted a table to be identified by simple a name. For database systems which support multiple schemes per database a clarification is needed:

  • Does a schema name (if other the default scheme) should be the part of the "DBI table name"? (which by the way is simply wrong, because we can not assure proper escaping, anyway it works like that in some DBI implementations)
  • Or maybe DBI drivers are by definition limited to use only a default scheme?
  • Or maybe DBI should define other (but common for all DBI compiliant drivers) way to deal with schemes?

I checked RPostgreSQL, RMySQL and ROracle and they deal with scheme in different way:

  • RMySQL pretends that there is no such things like schemes even MySQL supports schemes (in its own way, by mapping them to databases, but anyway);
  • ROracle functions take an additional and optional "scheme" parameter;
  • RPostgreSQL functions accept the "name" parameter to be a character vector; in such a case they take the first element of the vector as schema name and the second as table name.

And checking onother DBI API implementations would probably make this list longer.
This inconsistency limits the usage of the DBI API when it cames to handle schemes.

@krlmlr
Copy link
Member

krlmlr commented Nov 11, 2015

  • Views and temporary tables should be returned with dbListTables()
  • DBI should officially support multi-component table names (mostly for dbReadTable() and dbWriteTable()) via the new Table class that are combined to a single identifier with the help of a new dbBuildIdentifier(Connection, character); the latter defaults to quoting the individual names and combining them with a "."

I think everything can be tested in a generic fashion in the DBItest package

@krlmlr
Copy link
Member

krlmlr commented Nov 11, 2015

Need schema parameter in dbListTables(), default to "current schema"

@krlmlr
Copy link
Member

krlmlr commented Nov 11, 2015

dbBuildIdentifier() or dbQuoteIdentifier()? Double-check if anyone uses it, probably not (at least not dplyr) --> then it's dbQuoteIdentifier() with character vector

@zozlak
Copy link
Author

zozlak commented Nov 11, 2015

Need schema parameter in dbListTables(), default to "current schema"

On one hand I fully support your request but on the other this will be backward-incompatible change in the API. It is against CRAN guidelines and as DBI is quite widely used it is a really serious thing.

Although I can imagine solutions which will not break backward compatibility (e.g. passing schemas vector as an attribute of the returned vector) it will be rather inconvenient for the user.

Taking this into account maybe it would be better to introduce a new function and mark dbListTables() as deprecated...

@hadley
Copy link
Member

hadley commented Nov 11, 2015

@zozlak how will it be a backward incompatible change?

(And backward incompatible changes aren't against CRAN policy as long as you give everyone adequate notice about the change)

@krlmlr
Copy link
Member

krlmlr commented Nov 11, 2015

From the policies:

If an update will change the package’s API and hence affect packages depending on it, it is expected that you will contact the maintainers of affected packages and suggest changes, and give them time to prepare updates before submitting your updated package. Do mention in the submission email which packages are affected and that their maintainers have been informed. In order to derive the reverse dependencies of a package including the addresses of maintainers who have to be notified upon changes, the function reverse_dependencies_with_maintainers is available from the developer website.

@zozlak
Copy link
Author

zozlak commented Nov 11, 2015

@zozlak how will it be a backward incompatible change?

If I want to keep current return type (character vector, one element per table), then I should either join schema name and table name (probably properly escaped) either do some strange things (like assign schema names to attribute).
The problem with returning joined schema and table name is that:

  • if I return them unescaped, then it may be impossible to distinguish table name from schema name;
  • it I want to return them in escaped form I have a problem because different database systems use different escapes for objects identifiers and that is something I would prefer to hide from R users.

And anyway if we agree that other functions (dbReadTable(), etc.) should take schema and table name separated from each other (e.g. as a two-element character vector), to be able to escape them properly, then I would expect dbListTables() to return schema and table name separately.

Now if I want to return schema name separated from table name and do it in a natural way it would mean for me to return them as a character matrix (or data frame) with schema and table name in separate columns but this would change return type of dbListTables() (and so will be backward incompatible).

Or am I missing something?

@krlmlr
Copy link
Member

krlmlr commented Nov 11, 2015

I think dbListTables() should keep returning a plain character vector with unquoted names. The schema should be specified as input parameter.

@zozlak
Copy link
Author

zozlak commented Nov 12, 2015

@kiril I like your idea. It will make change totally transparent for those who don't care about schemes and at the same time allow to use schemes for those who need them.

Then we would also need dbListSchemes()

@krlmlr
Copy link
Member

krlmlr commented Jan 13, 2016

  • Views should be treated the same as tables
    • No special handling for trying to write a view
  • Multi-component table names: Two options:
    • Character vector -> several problems
    • Preferred: New function dbTableName(Connection, character) -> SQL
      • dbTableName(Connection, SQL) -> SQL returns argument unchanged

@hadley
Copy link
Member

hadley commented Jan 13, 2016

dbReadTable() and dbWriteTable() would take character vectors and would call dbTableName() for you. This doesn't change existing behaviour but allows you to do (e.g.) dbReadTable(c("schema", "test")).

What should dbReadTable(SQL("SELECT * FROM a WHERE x = 1")) do? What does it do now?

@krlmlr
Copy link
Member

krlmlr commented Jan 28, 2017

dbReadTable(SQL("SELECT * FROM a WHERE x = 1")) should look for a table named SELECT * FROM a WHERE x = 1. This has given us headaches with RSQLite, because some packages did the quoting themselves.

Currently, the design for schema support looks like this:

  • The dbQuoteIdentifier() generic gains an optional namespace argument that allows prepending namespace components. This expects a character vector that will be passed to dbQuoteIdentifier(..., namespace = NULL) (again) before connecting the components. If length 0, no namespace is prepended
  • The dbReadTable(), dbWriteTable(), dbExistsTable() and dbRemoveTable() gain a default implementation that allows passing a list as name argument; this will then be used in a call to dbQuoteIdentifier() to simplify passing the namespace to these methods
  • There will be a new dbListNamespaces() generic, TBD
  • The dbListTables() method also gains an optional namespace argument, which can also be a list

@krlmlr
Copy link
Member

krlmlr commented Jan 29, 2017

Another option: dbReadTable() et al. are expected to always pass ... along to dbQuoteIdentifier(). This would allow arbitrary optional signatures to dbQuoteIdentifier().

@krlmlr
Copy link
Member

krlmlr commented Jan 29, 2017

Next iteration of the design:

  • dbQuoteIdentifier() methods use ... to extend an identifier with namespace components. DBI will give suggestions but not enforce anything. The default should not add any namespace components.
  • All methods that take a table name (dbReadTable(), dbWriteTable(), dbExistsTable(), dbRemoveTable(), ...) call dbQuoteIdentifier(conn, x = name, ...), this allows specifying namespace components as parts of a call to these methods.
  • There will be a new dbListNamespaces() generic, this will return a data frame with a list column that contains the schema specification; this will have to be used in the dbQuoteIdentifier() call to access this schema. The default empty schema is always part of this list.
  • dbListTables() methods use ... to specify the namespace just like dbQuoteIdentifier() does.

Example:

dbQuoteIdentifier(con, "name", namespace = c("database", "schema"))
## <SQL> "database"."schema"."name"
dbReadTable(con, name, namespace = c("database", "schema"))
## ...
dbListNamespaces(con)$namespaces
[[1]]
list()

[[2]]
list(namespace = c("database", "schema"))
...

dbListTables(con, name, namespace = c("database", "schema"))

@iangow
Copy link

iangow commented Apr 6, 2017

OK. These would all work for my wish list.

tbl(pg, dbId("some_schema", "some_table"))
dbListTables(dbId("some_schema"))
compute(dbId("some_schema", "some_table"))

(Actually, compute(dbId("some_schema", "some_table"), overwrite = TRUE) would be great, but that's a different issue.)

@krlmlr
Copy link
Member

krlmlr commented Apr 11, 2017

Thanks for your input. I'm still confident that we can make do without a dbId() constructor method (which would really need to be called as dbId(con, ...)), but I'll come back to the discussion if it turns out that we do need a dbId().

The approach I'm anticipating is described in #24 (comment), I'm repeating it below (slightly modified). We will not be able to decompose fully qualified table names into their components (cluster, database, schema, ...), but I'd argue we don't need to.


I think we need two components:

  1. A way to pass through table/view identifiers with database/schema/... qualification
  2. A way to list databases, schemas, ...

For the first, we already have the SQL class, but even the DBI specification doesn't enforce but only only suggests using it -- instead it specifies that quoted identifiers are immune to calls to dbQuoteIdentifier(), i.e., that dbQuoteIdentifier(con, x) is identical to x if it is already quoted. dbWriteTable() et al. are specified so that they accept a quoted identifier.

For the second, I'd like to define a generic dbListObjects() that returns a data frame with (at least) a name column and a is_table column. The data frame can contain other columns, but the names should be prefixed with a dot, so that we can extend the spec later. The name column is expected to contain a quoted identifier (e.g., of class SQL), the is_table column is TRUE if the row refers to a table or view. dbListObjects(con) lists top-level objects (such as databases), and dbListObjects(con, name) list objects in the name database/schema/... (whatever applies to your DBMS). We could also define that dbListObjects(con, NA) lists the tables/views that are accessible without qualification.

@stephlocke
Copy link

Schema support is vital for working with big/enterprise databases.

In terms of what to expose as metadata - it'd be nifty if you could support the range of information_schema exposed concepts. information_schema is an ansi compliant metadata schema that can be leveraged for getting consistent results across databases.

dbIdentifier() as a way to provide schema and database names would be an OK way to go - it doesn't add too much typing. It'd also be good if you did something like dbId(dbConn,"[dbo].[orders]") and have that end up with the correct reference to the table.

@zozlak
Copy link
Author

zozlak commented Jun 21, 2017

"[dbo].[orders]"

example provided by @stephlocke makes me thinking about the approach proposed by @krlmlr .
I think the SQL class is not enough to pass table/view identifiers with database/schema/... qualification because my.tab can mean table "tab" in schema (or database) "my" as well as table "my.tab" in current context and the SQL class is unable to distinguish between these usecases. My impression is that the dbId() approach is really needed to solve such issue.

By the way while reading discussion about thedbId class I see two different approaches:

  1. The dbId object itself being "not aware" of the database entity type it is representing, e.g.:
    • in dbReadTable(con, dbId('foo')) foo denotes a table/view named foo
    • in dbListTables(dbId('foo')) foo denotes container for tables named foo (in most RDBMSs the foo schema, in MySQL the foo database)
    • in dbListSchemas(dbId('foo')) foo denotes container for schemas named foo (in all RDBMSs I know the foo database)
  2. The dbId object storing information about the entity type it is representing.

While the first approach provides users with short and intuitive syntax only the latter one will work with the dbListObjects() function proposed by @krlmlr. In the first case the dbListObjects() won't know what kind of database entities to list. Alternatively we can assume that the dbListObjects() lists all kinds of database entities (which doesn't sound good to me) but then looking at the dbId objects returned by it we can't tell, what kind of database entity they represent which is similarly problematic.

@jimhester
Copy link
Contributor

I just implemented a simple version of this for odbc, https://github.com/rstats-db/odbc/compare/SQLTable?expand=1#diff-ada2ba640dbb79b8d3a115369830f718. It would be nice to settle on something in DBI as well, as this generic should really be defined there.

I think reversing the order of the arguments like I did is pretty important; i.e. the highest level argument should the furthest right. This approach would also allow the generic to skip names for schema and catalog all together if desired and just take additional arguments as ....

Also I think you need to pass the connection object to dbId so that you can properly escape the identifiers and potentially use a different schema separator if needed.

@ldfreight
Copy link

ldfreight commented Mar 6, 2018

Hello,

thank you for the recent addition. Support for schema is great.

Using DBI 0.7.15 from github and trying to send to dbo.mytable using:

DBI::dbWriteTable(conn = con, name = DBI::Id(schema="dbo", name="mytable"), value = df, append=TRUE)
however this returns

unable to find an inherited method for function 'dbWriteTable' for signature '"Microsoft SQL Server", "SQL", "missing"'

Is this the correct usage of Id ?

thank you

@mmastand
Copy link

mmastand commented Mar 7, 2018

That looks like the right usage to me. I'm also having this same problem with Id and DBI 0.7-15 (installed from github). The following:

res <- DBI::dbWriteTable(conn = con,
                         name = DBI::Id(catalog = "testing", schema = "dbo", name = "test_table"),
                         value = v,
                         append = TRUE)

returns

Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘dbWriteTable’ for signature ‘"Microsoft SQL Server", "SQL", "missing"’

@hadley
Copy link
Member

hadley commented Mar 9, 2018

Commenting on a closed issue is not a good way to get help. You probably need to file an issue with whatever package you are using to connect to SQL server.

@krlmlr
Copy link
Member

krlmlr commented Mar 9, 2018

Thanks. Id() is called Table() in the most recent CRAN version. Please file a new issue, preferably with your DBI backend package, as @hadley suggested.

@krlmlr
Copy link
Member

krlmlr commented Mar 9, 2018

Oh, I'm confused, it's Id now. The usage looks correct, but the DBI backend needs to support it.

@mmastand
Copy link

mmastand commented Mar 9, 2018

Thanks, I'll move this to odbc.

@github-actions github-actions bot locked and limited conversation to collaborators Oct 9, 2020
krlmlr pushed a commit that referenced this issue Oct 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants