
Allow creating multiple views or derived tables on the same physical HBase table #296

Closed
ivarley opened this issue Jun 26, 2013 · 9 comments

@ivarley (Contributor) commented Jun 26, 2013

A single HBase table can comfortably grow to any size, and there are good reasons to keep the number of tables on a cluster relatively small (not more than a few thousand). Additionally, because HBase allows an arbitrary set of column qualifiers in every row, one physical table need not be limited to a single schema with a fixed set of columns. The 1-to-1 correspondence of schema <-> table in Phoenix is somewhat artificially limiting; it'd be nice if apps could share large physical tables across many logical Phoenix schemas. (This is especially common in multi-tenant situations, which Phoenix already supports by virtue of accepting a tenant id in the connection properties.)

I'd like to suggest giving Phoenix the ability to define one or more views (or tenant-specific tables) that are actually backed by a single common physical HBase table. So when a tenant issues the query "SELECT * FROM foo_view", Phoenix translates this to really mean "SELECT c1, c2, c3 FROM foo_table".
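As a rough sketch of the idea (the view names and mapping here are hypothetical, not existing Phoenix syntax), a tenant-specific foo_view would expose only that view's columns of the shared foo_table:

```sql
-- Hypothetical: a tenant queries its own logical view...
SELECT * FROM foo_view WHERE c1 = 'x';

-- ...and Phoenix translates it into a projection of that view's columns
-- against the shared physical table (and, if ownership is enforced, adds
-- conditions restricting the scan to that tenant's rows):
SELECT c1, c2, c3 FROM foo_table WHERE c1 = 'x';
```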

The structure of the primary key (row key) would probably need to be the same across all the views. I can think of a couple of ways to structure this from a syntax point of view (a rough sketch of both options follows the list):

  • Add a way of specifying the backing HBase table, either via a WITH clause or a TABLESPACE clause in the CREATE TABLE statement, along the lines of what Postgres allows: http://www.postgresql.org/docs/9.1/static/sql-createtable.html
  • A tenant could have the ability to only alter an existing table, not create a new one. The base table would need to be created beforehand on a connection that doesn't specify a tenant id. This would define the row key columns and any base key value columns. Tenants would then be able to add/remove their own key value columns, and Phoenix would take care of combining the "base" columns with the "tenant" columns based on the tenantId specified at connection time.
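A rough sketch of both options (all of this syntax is hypothetical; neither a TABLESPACE/WITH clause nor tenant-restricted DDL exists in Phoenix today):

```sql
-- Option 1 (hypothetical): name the backing HBase table directly in
-- CREATE TABLE, along the lines of Postgres' TABLESPACE clause
CREATE TABLE foo_view (
    tenant_id VARCHAR NOT NULL,
    c1 VARCHAR NOT NULL,
    c2 VARCHAR,
    c3 DECIMAL,
    CONSTRAINT pk PRIMARY KEY (tenant_id, c1))
    TABLESPACE foo_table;       -- or: WITH (HBASE_TABLE='FOO_TABLE')

-- Option 2 (hypothetical): a connection with no tenant id creates the
-- base table and its row key up front...
CREATE TABLE foo_table (
    tenant_id VARCHAR NOT NULL,
    c1 VARCHAR NOT NULL,
    base_col VARCHAR,
    CONSTRAINT pk PRIMARY KEY (tenant_id, c1));

-- ...and a connection opened with a tenant id may only add/remove its own
-- key value columns, never redefine the row key
ALTER TABLE foo_table ADD my_tenant_col DECIMAL;
```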

We'd also have to decide whether to automatically enforce row ownership by virtual table (for example, by transparently including the tenant id in the rowkey) or just leave that up to the clients. (It might be weird to get back a bunch of rows from your query that someone else inserted into another logical table, with all nulls for the columns you're expecting...) This could also be done by including a hidden column in the row, or a bit in the rowkey, that indicates which virtual table the rows are part of.
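For instance (column names here are purely hypothetical), enforcing ownership in the row key could mean the shared physical table leads its primary key with a tenant id and a virtual table id, so each view's rows occupy a contiguous key range:

```sql
-- Hypothetical layout of the shared physical table: every row is owned by
-- a (tenant_id, view_id) pair, so one logical table's rows never show up
-- in another logical table's queries
CREATE TABLE foo_table (
    tenant_id VARCHAR NOT NULL,
    view_id   VARCHAR NOT NULL,  -- identifies the virtual/logical table
    id        VARCHAR NOT NULL,  -- the view's own primary key columns
    c1 VARCHAR,
    c2 VARCHAR,
    c3 DECIMAL,
    CONSTRAINT pk PRIMARY KEY (tenant_id, view_id, id));
```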

@ryang-sfdc (Contributor)

+1

@apurtell (Contributor)

> The 1-to-1 correspondence of schema <-> table in Phoenix is somewhat artificially limiting; it'd be nice if apps could share large physical tables across many logical Phoenix schemas. (This is especially common in multi-tenant situations, which Phoenix already supports by virtue of accepting a tenant id in the connection properties.)

This is a nice idea. If tenants can be segregated into single physical tables, it also opens up possibilities for pinning those tables to regionserver groups (HBASE-6721), a feature I expect to see in an upcoming 0.94 release.

@ivarley (Contributor, Author) commented Jun 27, 2013

Ooh, nice suggestion Andrew. +1

@elilevine (Contributor)

Unless there are objections, I'm planning on taking a stab at this in the next couple of days. @jtaylor-sfdc, any suggestions on implementation approaches? Thanks.

@jtaylor-sfdc (Contributor)

Here's my first cut at how this could be implemented:

  • Add TENANT_ID as the leading part of the SYSTEM.TABLE row key. It could be defined as a nullable VARCHAR.
  • Create a conversion script that prepends a null byte to the row key of every existing row of SYSTEM.TABLE. Rows with a null tenant id would define the "base table".
  • Modify MetaDataEndPointImpl code to use the TenantId connection property to scan SYSTEM.TABLE. You'll need to pass this through as a new/additional argument to the endpoint coprocessor methods.
  • Add an HBASE_TABLE_NAME VARCHAR key value column to SYSTEM.TABLE. This would be null for the "base table" and populated for "derived" tables.
  • Optionally add a TABLE_ID key value column to SYSTEM.TABLE. This would be null for the "base table" and populated for "derived" tables with the "key prefix". If we're OK including the "logical" schema name plus table name in the row key of the physical table, we don't necessarily need this, but TABLE_ID would be the condensed version of that - we might even just generate it, given that we'll have sequence support shortly (Support SEQUENCE creation and usage #18).
  • Make the above-mentioned grammar changes so that the "physical" HBase table can be named in a CREATE TABLE statement. It could be as simple as a new property (like we've done with salting and immutable tables), in which case no grammar changes would be necessary; a rough sketch follows this list.
  • Persist this data into SYSTEM.TABLE by passing it through the UPSERT command.
  • Cache this data in the PTable returned by the MetaDataEndPointImpl methods.
  • Modify MetaDataClient.createTable to check for this new property and enforce that no primary key is defined for these new types of derived tables. Note that we might be able to allow additional nullable row key columns to be added. We should check here that the base table row key conforms to what we need:
    • declares a tenant id column (as mentioned in Tenant data isolation for DML and SELECT. #356) and a logical table id column (same idea as with tenant id so we don't have "hard coded" column names)
    • leads the row key with the tenant id column followed by logical table id column
  • Modify WhereOptimizer.pushKeyExpressionsToScan to automatically insert AND conditions for tenant id (based on the TenantId connection property plus the column name identified by Tenant data isolation for DML and SELECT. #356) and logical table id (based on the table id that would be cached in the PTable plus again another column name defined at DDL time). This would be the "runtime" glue that'll cause your queries to run against the right set of data.
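Pulling the DDL and runtime pieces above together, here's a rough sketch; the property names (MULTI_TENANT, BASE_TABLE), column names, and tenant id are hypothetical placeholders, not settled syntax:

```sql
-- Base table, created on a connection with no TenantId property; its row
-- key leads with the tenant id column followed by the logical table id
CREATE TABLE base_table (
    tenant_id VARCHAR NOT NULL,
    table_id  VARCHAR NOT NULL,
    id        VARCHAR NOT NULL,
    created_date DATE,
    CONSTRAINT pk PRIMARY KEY (tenant_id, table_id, id))
    MULTI_TENANT=true;           -- hypothetical property, no grammar change needed

-- Derived table, created on a connection opened with TenantId='acme'; it
-- names its physical table through another hypothetical property and
-- defines no primary key of its own (it would inherit the base table's
-- columns, e.g. created_date)
CREATE TABLE acme_orders (
    order_total DECIMAL)
    BASE_TABLE='BASE_TABLE';     -- hypothetical property

-- At runtime, a query issued on that tenant connection, such as
SELECT id, order_total FROM acme_orders WHERE created_date > TO_DATE('2013-01-01');
-- would have the leading row key conditions pushed into the scan for it,
-- effectively: tenant_id = 'acme' AND table_id = '<id assigned to ACME_ORDERS>'
```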

@ghost assigned elilevine Aug 1, 2013
@elilevine (Contributor)

Tenant-specific indexes built on top of these "views" are the next logical step.
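For example (hypothetical, assuming tenant DDL runs on a connection carrying the TenantId property), a tenant could then build an index covering only its own rows:

```sql
-- Hypothetical: issued on a connection opened with TenantId='acme'
CREATE INDEX acme_orders_by_total ON acme_orders (order_total) INCLUDE (created_date);
```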

@elilevine (Contributor)

@jtaylor-sfdc, how about using the same table name for both regular and tenant-specific tables? The presence or absence of the leading TENANT_ID row key part in SYSTEM.TABLE would distinguish one from the other. That way we can omit the extra HBASE_TABLE_NAME column and assume the name of a tenant-specific table is shared with its base table (which would have no TENANT_ID).

@elilevine (Contributor)

Err... scratch that last comment. We have to have the extra column pointing back to the base table to support multiple "views" on the same base table per tenant. Thanks @ivarley for pointing it out.

@jtaylor-sfdc (Contributor)

Implemented by @elilevine and pulled into master. Fantastic job! I'll close this issue, as the bulk of the functionality is in and we have more specific issues filed for the follow-up work.
