feat(glue): add L2 resources for Database and Table #1988

Merged on Mar 14, 2019 (25 commits)
Changes from 13 commits
Commits
3551d75
Add glue database and table
Mar 8, 2019
3de3b03
Add unit tests for database and schema
Mar 9, 2019
e4df159
Stash
Mar 9, 2019
502b0cb
Improve test coverage of table
Mar 11, 2019
a08dc37
Add integration tests and README
Mar 11, 2019
f4db178
Update README with types
Mar 11, 2019
0d10989
Add validation for name uniqueness and at least one column
Mar 11, 2019
cf6d56d
Use strongly named references
Mar 11, 2019
9dc8e57
Update StorageType enums to be enum-like classes
Mar 11, 2019
c7f62d7
Add SSE-S3 and SSE-KMS encryption support
Mar 12, 2019
e0043a9
Restrict s3 grants to only objects containing the table's prefix
Mar 12, 2019
231c36a
Add tsdocs for Type
Mar 12, 2019
cee2e46
Add Encryption to README
Mar 12, 2019
4595fde
Add CSE encryption and distinguish SSE-KMS from SSE-KMS-MANAGED
Mar 12, 2019
b8f886b
Minor fixes to the README
Mar 12, 2019
db1960b
Some more minor fixes to the README
Mar 12, 2019
d245b1c
Merge branch 'master' into samgood/glue
Mar 13, 2019
f98d5fd
Rename prefix to s3Prefix and use haveResource in tests
Mar 13, 2019
d3ccd53
Use string concatentation
Mar 13, 2019
538cff9
Add docs and fix string concatenation
Mar 13, 2019
0c2f7fa
Rename StorageType to DataFormat
Mar 13, 2019
1cb7c4f
Improve docs and make the TableEncryption enum more consistent with B…
Mar 13, 2019
a5d45f0
Refactor s3 bucket creation into separate function and support unencr…
Mar 13, 2019
30f8a3c
add test for CSE-KMS with an explicit bucket
Mar 13, 2019
45434fe
minor fixes to README
Mar 13, 2019
129 changes: 129 additions & 0 deletions packages/@aws-cdk/aws-glue/README.md
@@ -1,2 +1,131 @@
## The CDK Construct Library for AWS Glue
This module is part of the [AWS Cloud Development Kit](https://github.com/awslabs/aws-cdk) project.

### Database

A `Database` is a logical grouping of `Tables` in the Glue Catalog.

```ts
new glue.Database(stack, 'MyDatabase', {
  databaseName: 'my_database'
});
```

By default, an S3 bucket is created and the Database is stored under `s3://<bucket-name>/<database-name>`, but you can manually specify another location:

```ts
new glue.Database(stack, 'MyDatabase', {
  databaseName: 'my_database',
  locationUri: 's3://explicit-bucket/some-path/'
});
```

### Table

A Glue table describes the structure (column names and types), the location of the data (S3 objects with a common prefix in an S3 bucket), and the format of the files (JSON, Avro, Parquet, etc.):

```ts
new glue.Table(stack, 'MyTable', {
  database: myDatabase,
  tableName: 'my_table',
  columns: [{
    name: 'col1',
    type: glue.Schema.string
  }],
  storageType: glue.StorageType.Json
});
```

By default, an S3 bucket will be created to store the table's data, but you can manually pass the `bucket` and `prefix`:

```ts
new glue.Table(stack, 'MyTable', {
  bucket: myBucket,
  prefix: 'my-table/',
  ...
});
```

#### Partitions

To improve query performance, a table can specify `partitionKeys` on which data is stored and queried separately. For example, you might partition a table by `year` and `month` to optimize queries based on a time window:

```ts
new glue.Table(stack, 'MyTable', {
  database: myDatabase,
  tableName: 'my_table',
  columns: [{
Contributor

I am wondering, if name is unique, why not use a hash?

Contributor Author

@sam-goodwin sam-goodwin Mar 11, 2019

There are two semantics we want to model as strictly as we can: column uniqueness and ordering.

  • A hash models uniqueness well, but it does not model ordering. In Node.js, an object's properties are iterated in the order in which they were added, but that is not the case in other languages such as Java, where a developer would have to know to use a LinkedHashMap.
  • An array explicitly and intuitively defines the ordering in all languages, but it doesn't model column uniqueness.

I chose to model the ordering property statically with an array and check uniqueness at runtime because then, at least, the experience is consistent for all consumers. Using a hash might create confusion for consumers: they would not receive an error; the layout of their columns could just change arbitrarily.

    name: 'col1',
    type: glue.Schema.string
  }],
  partitionKeys: [{
    name: 'year',
    type: glue.Schema.smallint
  }, {
    name: 'month',
    type: glue.Schema.smallint
  }],
  storageType: glue.StorageType.Json
});
```
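
The review thread above settles on an ordered array of columns with uniqueness checked at runtime. A minimal sketch of such a check follows; the `Column` shape, the function name, and the error messages are hypothetical and are not taken from this PR's implementation:

```ts
// Hypothetical column shape for illustration; not the module's actual interface.
interface Column {
  name: string;
  type: string;
}

// Rejects an empty column list and duplicate names. On success the input
// array is returned unchanged, so the declared column order is preserved.
function validateColumns(columns: Column[]): Column[] {
  if (columns.length === 0) {
    throw new Error('a table must have at least one column');
  }
  const seen = new Set<string>();
  for (const column of columns) {
    if (seen.has(column.name)) {
      throw new Error(`column names must be unique, but '${column.name}' is duplicated`);
    }
    seen.add(column.name);
  }
  return columns;
}
```

Because the check runs when the table is defined, a duplicate name fails loudly in every target language rather than silently reordering or overwriting a column.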

### [Encryption](https://docs.aws.amazon.com/athena/latest/ug/encryption.html)

You can enable encryption on an S3 bucket:
* [SSE-S3](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingServerSideEncryption.html) - Server-side encryption (SSE) with an Amazon S3-managed key.
```ts
new glue.Table(stack, 'MyTable', {
  encryption: glue.TableEncryption.SSE_S3
Contributor

Enum names should be consistent with BucketEncryption

Contributor Author

I would argue the other way around - the enum values are consistent with the S3, Athena, Glue and EMR documentation. What would I name CSE-KMS if I were copying BucketEncryption?

Contributor

Fair enough, but I think we have a problem with ALL_CAPS when converting those member names to other languages. Can we find names that are PascalCase?

...
});
```
* [SSE-KMS](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingKMSEncryption.html) - Server-side encryption (SSE) with an AWS Key Management Service customer-managed key.

```ts
// with a KMS-managed key
new glue.Table(stack, 'MyTable', {
  encryption: glue.TableEncryption.SSE_KMS,
  ...
});

// with a customer-managed KMS key
new glue.Table(stack, 'MyTable', {
  encryption: glue.TableEncryption.SSE_KMS,
  encryptionKey: new kms.EncryptionKey(stack, 'MyKey'),
  ...
});
```

### Types

A table's schema is a collection of columns, each of which has a `name` and a `type`. Types are recursive structures, consisting of primitive and complex types:

#### Primitive

Numeric:
* `bigint`
* `decimal`
* `float`
* `integer`
* `smallint`
* `tinyint`

Date and Time:
* `date`
* `timestamp`

String:
* `string`
* `char`
* `varchar`

Misc:
* `boolean`
* `binary`

#### Complex

* `array` - array of some other type.
* `map` - map of some primitive key type to any value type.
* `struct` - nested structure containing individually named and typed columns.
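
To make the recursion concrete, here is a small sketch of how nested types could be modeled and rendered into Hive-style type strings such as `array<string>`. The `GlueType` union and the `render` function are hypothetical names chosen for illustration; they are not the API added in this PR:

```ts
// Hypothetical model of a recursive type; not the module's actual Schema API.
type GlueType =
  | { kind: 'primitive'; name: string }
  | { kind: 'array'; itemType: GlueType }
  | { kind: 'map'; keyType: GlueType; valueType: GlueType }
  | { kind: 'struct'; columns: Array<{ name: string; type: GlueType }> };

// Renders a type into its Hive-style string form, recursing into nested types.
function render(type: GlueType): string {
  switch (type.kind) {
    case 'primitive':
      return type.name;
    case 'array':
      return `array<${render(type.itemType)}>`;
    case 'map':
      // Map keys are restricted to primitive types, as described above.
      if (type.keyType.kind !== 'primitive') {
        throw new Error('map keys must be a primitive type');
      }
      return `map<${render(type.keyType)},${render(type.valueType)}>`;
    case 'struct':
      return `struct<${type.columns.map(c => `${c.name}:${render(c.type)}`).join(',')}>`;
  }
}

const str: GlueType = { kind: 'primitive', name: 'string' };
const nested: GlueType = {
  kind: 'array',
  itemType: {
    kind: 'struct',
    columns: [
      { name: 'id', type: { kind: 'primitive', name: 'bigint' } },
      { name: 'tags', type: { kind: 'map', keyType: str, valueType: str } },
    ],
  },
};
const rendered = render(nested); // array<struct<id:bigint,tags:map<string,string>>>
```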
143 changes: 143 additions & 0 deletions packages/@aws-cdk/aws-glue/lib/database.ts
@@ -0,0 +1,143 @@
import s3 = require('@aws-cdk/aws-s3');
import cdk = require('@aws-cdk/cdk');
import { CfnDatabase } from './glue.generated';

export interface IDatabase extends cdk.IConstruct {
  /**
   * The ARN of the catalog.
   */
  readonly catalogArn: string;

  /**
   * The catalog id of the database (usually, the AWS account id)
   */
  readonly catalogId: string;

  /**
   * The ARN of the database.
   */
  readonly databaseArn: string;

  /**
   * The name of the database.
   */
  readonly databaseName: string;

  /**
   * The location of the database (for example, an HDFS path).
   */
  readonly locationUri: string;

  export(): DatabaseImportProps;
}

export interface DatabaseImportProps {
  catalogArn: string;
  catalogId: string;
  databaseArn: string;
  databaseName: string;
  locationUri: string;
}

export interface DatabaseProps {
  /**
   * The name of the database.
   */
  databaseName: string;

  /**
   * The location of the database (for example, an HDFS path).
   *
   * @default a bucket is created and the database is stored under s3://<bucket-name>/<database-name>
   */
  locationUri?: string;
}

/**
 * A Glue database.
 */
export class Database extends cdk.Construct implements IDatabase {
  /**
   * Creates a Database construct that represents an external database.
   *
   * @param scope The scope creating construct (usually `this`).
   * @param id The construct's id.
   * @param props A `DatabaseImportProps` object. Can be obtained from a call to `database.export()` or manually created.
   */
  public static import(scope: cdk.Construct, id: string, props: DatabaseImportProps): IDatabase {
    return new ImportedDatabase(scope, id, props);
  }

  public readonly catalogArn: string;
  public readonly catalogId: string;
  public readonly databaseArn: string;
  public readonly databaseName: string;
  public readonly locationUri: string;

  constructor(scope: cdk.Construct, id: string, props: DatabaseProps) {
    super(scope, id);

    if (props.locationUri) {
      this.locationUri = props.locationUri;
    } else {
      const bucket = new s3.Bucket(this, 'Bucket');
      // the '/' separator is required to produce s3://<bucket-name>/<database-name>
      this.locationUri = cdk.Fn.join('', ['s3://', bucket.bucketName, '/', props.databaseName]);
    }

    this.catalogId = this.node.stack.accountId;
    const resource = new CfnDatabase(this, 'Resource', {
      catalogId: this.catalogId,
      databaseInput: {
        name: props.databaseName,
        locationUri: this.locationUri
      }
    });

    // see https://docs.aws.amazon.com/glue/latest/dg/glue-specifying-resource-arns.html#data-catalog-resource-arns
    this.databaseName = resource.databaseName;
    this.databaseArn = this.node.stack.formatArn({
      service: 'glue',
      resource: 'database',
      resourceName: this.databaseName
    });
    // catalogId is implicitly the accountId, which is why we don't pass the catalogId here
    this.catalogArn = this.node.stack.formatArn({
      service: 'glue',
      resource: 'catalog'
    });
  }

  /**
   * Exports this database from the stack.
   */
  public export(): DatabaseImportProps {
    return {
      catalogArn: new cdk.Output(this, 'CatalogArn', { value: this.catalogArn }).makeImportValue().toString(),
      catalogId: new cdk.Output(this, 'CatalogId', { value: this.catalogId }).makeImportValue().toString(),
      databaseArn: new cdk.Output(this, 'DatabaseArn', { value: this.databaseArn }).makeImportValue().toString(),
      databaseName: new cdk.Output(this, 'DatabaseName', { value: this.databaseName }).makeImportValue().toString(),
      locationUri: new cdk.Output(this, 'LocationURI', { value: this.locationUri }).makeImportValue().toString()
    };
  }
}

class ImportedDatabase extends cdk.Construct implements IDatabase {
  public readonly catalogArn: string;
  public readonly catalogId: string;
  public readonly databaseArn: string;
  public readonly databaseName: string;
  public readonly locationUri: string;

  constructor(parent: cdk.Construct, name: string, private readonly props: DatabaseImportProps) {
    super(parent, name);
    this.catalogArn = props.catalogArn;
    this.catalogId = props.catalogId;
    this.databaseArn = props.databaseArn;
    this.databaseName = props.databaseName;
    this.locationUri = props.locationUri;
  }

  public export() {
    return this.props;
  }
}
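
For readers following the `formatArn` calls in `database.ts`, the resulting values should match the Glue Data Catalog ARN layout referenced in the code comment. The standalone sketch below spells that layout out; the helper names are hypothetical, and the `aws` partition default is an assumption:

```ts
// Hypothetical helpers for illustration; the construct itself uses stack.formatArn.
// The partition defaults to 'aws' here as an assumption.
function glueCatalogArn(region: string, accountId: string, partition: string = 'aws'): string {
  return `arn:${partition}:glue:${region}:${accountId}:catalog`;
}

function glueDatabaseArn(
  region: string,
  accountId: string,
  databaseName: string,
  partition: string = 'aws'
): string {
  return `arn:${partition}:glue:${region}:${accountId}:database/${databaseName}`;
}

const exampleDatabaseArn = glueDatabaseArn('us-east-1', '123456789012', 'my_database');
const exampleCatalogArn = glueCatalogArn('us-east-1', '123456789012');
```

Note that the catalog ARN carries no catalog id of its own: the account id segment identifies it, which is why the constructor does not pass `catalogId` to `formatArn`.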
5 changes: 5 additions & 0 deletions packages/@aws-cdk/aws-glue/lib/index.ts
@@ -1,2 +1,7 @@
// AWS::Glue CloudFormation Resources:
export * from './glue.generated';

export * from './database';
export * from './schema';
export * from './storage-type';
export * from './table';