
Allow creating multiple views or derived tables on the same physical HBase table #296

Closed
ivarley opened this issue Jun 26, 2013 · 9 comments

@ivarley (Contributor) commented Jun 26, 2013

A single HBase table can comfortably grow to any size, and there are good reasons to keep the number of tables on a cluster relatively small (not more than a few thousand). Additionally, because HBase allows an arbitrary set of column qualifiers in every row, one physical table need not be limited to a single schema with a fixed set of columns. The 1-to-1 correspondence of schema <-> table in Phoenix is somewhat artificially limiting; it'd be nice if apps could share large physical tables across many logical Phoenix schemas. (This is especially common in multi-tenant situations, which Phoenix already supports by virtue of accepting a tenant id in the connection properties.)

I'd like to suggest giving Phoenix the ability to define one or more views (or tenant-specific tables) that are actually backed by a single common physical HBase table. So when a tenant issues the query "SELECT * FROM foo_view", Phoenix translates this to really mean "SELECT c1, c2, c3 FROM foo_table".
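As a rough sketch of the idea (the view names and mapping here are hypothetical, not existing Phoenix syntax), a tenant-specific foo_view would expose only that view's columns of the shared foo_table:

```sql
-- Hypothetical: a tenant queries its own logical view...
SELECT * FROM foo_view WHERE c1 = 'x';

-- ...and Phoenix translates it into a projection of that view's columns
-- against the shared physical table (and, if ownership is enforced, adds
-- conditions restricting the scan to that tenant's rows):
SELECT c1, c2, c3 FROM foo_table WHERE c1 = 'x';
```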

The structure of the primary key (row key) would probably need to be the same across all the views. I can think of a couple of ways to structure this from a syntax point of view (a rough sketch of both options follows the list):

  • Add a way of specifying the backing HBase table, either via a WITH clause or a TABLESPACE clause in the CREATE TABLE statement, along the lines of what Postgres allows: http://www.postgresql.org/docs/9.1/static/sql-createtable.html
  • A tenant could have the ability to only alter an existing table, not create a new one. The base table would need to be created beforehand on a connection that doesn't specify a tenant id. This would define the row key columns and any base key value columns. Tenants would then be able to add/remove their own key value columns, and Phoenix would take care of combining the "base" columns with the "tenant" columns based on the tenantId specified at connection time.
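A rough sketch of both options (all of this syntax is hypothetical; neither a TABLESPACE/WITH clause nor tenant-restricted DDL exists in Phoenix today):

```sql
-- Option 1 (hypothetical): name the backing HBase table directly in
-- CREATE TABLE, along the lines of Postgres' TABLESPACE clause
CREATE TABLE foo_view (
    tenant_id VARCHAR NOT NULL,
    c1 VARCHAR NOT NULL,
    c2 VARCHAR,
    c3 DECIMAL,
    CONSTRAINT pk PRIMARY KEY (tenant_id, c1))
    TABLESPACE foo_table;       -- or: WITH (HBASE_TABLE='FOO_TABLE')

-- Option 2 (hypothetical): a connection with no tenant id creates the
-- base table and its row key up front...
CREATE TABLE foo_table (
    tenant_id VARCHAR NOT NULL,
    c1 VARCHAR NOT NULL,
    base_col VARCHAR,
    CONSTRAINT pk PRIMARY KEY (tenant_id, c1));

-- ...and a connection opened with a tenant id may only add/remove its own
-- key value columns, never redefine the row key
ALTER TABLE foo_table ADD my_tenant_col DECIMAL;
```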

We'd also have to decide whether to automatically enforce row ownership by virtual table (for example, by transparently including the tenant id in the rowkey) or just leave that up to the clients. (It might be weird to get back a bunch of rows from your query that someone else inserted into another logical table, with all nulls for the columns you're expecting...) This could also be done by including a hidden column in the row, or a bit in the rowkey, that indicates which virtual table the rows are part of.
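For instance (column names here are purely hypothetical), enforcing ownership in the row key could mean the shared physical table leads its primary key with a tenant id and a virtual table id, so each view's rows occupy a contiguous key range:

```sql
-- Hypothetical layout of the shared physical table: every row is owned by
-- a (tenant_id, view_id) pair, so one logical table's rows never show up
-- in another logical table's queries
CREATE TABLE foo_table (
    tenant_id VARCHAR NOT NULL,
    view_id   VARCHAR NOT NULL,  -- identifies the virtual/logical table
    id        VARCHAR NOT NULL,  -- the view's own primary key columns
    c1 VARCHAR,
    c2 VARCHAR,
    c3 DECIMAL,
    CONSTRAINT pk PRIMARY KEY (tenant_id, view_id, id));
```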

@ryang-sfdc (Contributor)

+1

@apurtell (Contributor)

> The 1-to-1 correspondence of schema <-> table in Phoenix is somewhat artificially limiting; it'd be nice if apps could share large physical tables across many logical Phoenix schemas. (This is especially common in multi-tenant situations, which Phoenix already supports by virtue of accepting a tenant id in the connection properties.)

This is a nice idea. If tenants can be segregated into single physical tables, it also opens up possibilities for pinning those tables to regionserver groups (HBASE-6721), a feature I expect to see in an upcoming 0.94 release.

@ivarley (Contributor, Author) commented Jun 27, 2013

Ooh, nice suggestion Andrew. +1

@elilevine (Contributor)

Unless there are objections, I'm planning on taking a stab at this in the next couple of days. @jtaylor-sfdc, any suggestions on implementation approaches? Thanks.

@jtaylor-sfdc (Contributor)

Here's my first cut at how this could be implemented:

  • Add TENANT_ID as the leading part of the SYSTEM.TABLE row key. It could be defined as a nullable VARCHAR.
  • Create a conversion script that prepends a null byte to the row key of every existing row of SYSTEM.TABLE. Rows with a null tenant id would define the "base table".
  • Modify MetaDataEndPointImpl code to use the TenantId connection property to scan SYSTEM.TABLE. You'll need to pass this through as a new/additional argument to the endpoint coprocessor methods.
  • Add an HBASE_TABLE_NAME VARCHAR key value column to SYSTEM.TABLE. This would be null for the "base table" and populated for "derived" tables.
  • Optionally add a TABLE_ID key value column to SYSTEM.TABLE. This would be null for the "base table" and populated for "derived" tables with the "key prefix". If we're OK including the "logical" schema name plus table name in the row key of the physical table, we don't necessarily need this, but TABLE_ID would be the condensed version of that - we might even just generate it, given that we'll have sequence support shortly (Support SEQUENCE creation and usage #18).
  • Make the above-mentioned grammar changes so that the "physical" HBase table can be named in a CREATE TABLE statement. It could be as simple as a new property (like we've done with salting and immutable tables), in which case no grammar changes would be necessary; a rough sketch follows this list.
  • Persist this data into SYSTEM.TABLE by passing it through the UPSERT command.
  • Cache this data in the PTable returned by the MetaDataEndPointImpl methods.
  • Modify MetaDataClient.createTable to check for this new property and enforce that no primary key is defined for these new types of derived tables. Note that we might be able to allow additional nullable row key columns to be added. We should check here that the base table row key conforms to what we need:
    • declares a tenant id column (as mentioned in Tenant data isolation for DML and SELECT. #356) and a logical table id column (same idea as with tenant id so we don't have "hard coded" column names)
    • leads the row key with the tenant id column followed by logical table id column
  • Modify WhereOptimizer.pushKeyExpressionsToScan to automatically insert AND conditions for tenant id (based on the TenantId connection property plus the column name identified by Tenant data isolation for DML and SELECT. #356) and logical table id (based on the table id that would be cached in the PTable plus again another column name defined at DDL time). This would be the "runtime" glue that'll cause your queries to run against the right set of data.
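Pulling the DDL and runtime pieces above together, here's a rough sketch; the property names (MULTI_TENANT, BASE_TABLE), column names, and tenant id are hypothetical placeholders, not settled syntax:

```sql
-- Base table, created on a connection with no TenantId property; its row
-- key leads with the tenant id column followed by the logical table id
CREATE TABLE base_table (
    tenant_id VARCHAR NOT NULL,
    table_id  VARCHAR NOT NULL,
    id        VARCHAR NOT NULL,
    created_date DATE,
    CONSTRAINT pk PRIMARY KEY (tenant_id, table_id, id))
    MULTI_TENANT=true;           -- hypothetical property, no grammar change needed

-- Derived table, created on a connection opened with TenantId='acme'; it
-- names its physical table through another hypothetical property and
-- defines no primary key of its own (it would inherit the base table's
-- columns, e.g. created_date)
CREATE TABLE acme_orders (
    order_total DECIMAL)
    BASE_TABLE='BASE_TABLE';     -- hypothetical property

-- At runtime, a query issued on that tenant connection, such as
SELECT id, order_total FROM acme_orders WHERE created_date > TO_DATE('2013-01-01');
-- would have the leading row key conditions pushed into the scan for it,
-- effectively: tenant_id = 'acme' AND table_id = '<id assigned to ACME_ORDERS>'
```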

@ghost assigned elilevine Aug 1, 2013
@elilevine (Contributor)

Tenant-specific indexes built on top of these "views" are the next logical step.
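For example (hypothetical, assuming tenant DDL runs on a connection carrying the TenantId property), a tenant could then build an index covering only its own rows:

```sql
-- Hypothetical: issued on a connection opened with TenantId='acme'
CREATE INDEX acme_orders_by_total ON acme_orders (order_total) INCLUDE (created_date);
```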

@elilevine (Contributor)

@jtaylor-sfdc, how about using the same table name for both regular and tenant-specific tables? The presence or absence of the leading TENANT_ID row key part in SYSTEM.TABLE would distinguish one from the other. That way we can omit the extra HBASE_TABLE_NAME column and assume the name of a tenant-specific table is shared with its base table (which would have no TENANT_ID).

@elilevine (Contributor)

Err... scratch that last comment. We have to have the extra column pointing back to the base table to support multiple "views" on the same base table per tenant. Thanks @ivarley for pointing it out.

@jtaylor-sfdc (Contributor)

Implemented by @elilevine and pulled into master. Fantastic job! I'll close this issue, as the bulk of the functionality is in and we have more specific issues filed for the follow-up work.
