Skip to content

Commit

Permalink
feat(protobuf) Additional protobuf ingestion features
Browse files Browse the repository at this point in the history
* Domains
* Ownership
* Tags from comma delimited strings
* Custom proy lists
* Additional documentation
  • Loading branch information
leifker committed Mar 24, 2022
1 parent ec29242 commit b9627e7
Show file tree
Hide file tree
Showing 24 changed files with 552 additions and 97 deletions.
Binary file not shown.
237 changes: 232 additions & 5 deletions metadata-integration/java/datahub-protobuf/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -128,9 +128,12 @@ import "google/protobuf/descriptor.proto";
*/
enum DataHubMetadataType {
PROPERTY = 0; // Datahub Custom Property
TAG = 1; // Datahub Tag
TERM = 2; // Datahub Term
PROPERTY = 0; // Datahub Custom Property
TAG = 1; // Datahub Tag
TAG_LIST = 2; // Datahub Tags from comma delimited string
TERM = 3; // Datahub Term
OWNER = 4; // Datahub Owner
DOMAIN = 5; // Datahub Domain
}
/*
Expand Down Expand Up @@ -201,16 +204,240 @@ message msg {
// Attach these Message/Dataset options as a tag and property.
string product = 5001 [(meta.fld.type) = TAG, (meta.fld.type) = PROPERTY];
string project = 5002 [(meta.fld.type) = TAG, (meta.fld.type) = PROPERTY];
string team = 5003 [(meta.fld.type) = TAG, (meta.fld.type) = PROPERTY];
string team = 5003 [(meta.fld.type) = OWNER, (meta.fld.type) = PROPERTY];
string domain = 60003 [(meta.fld.type) = TAG, (meta.fld.type) = PROPERTY];
string domain = 60003 [(meta.fld.type) = DOMAIN, (meta.fld.type) = PROPERTY];
meta.MetaEnumExample type = 60004 [(meta.fld.type) = TAG, (meta.fld.type) = PROPERTY];
bool bool_feature = 60005 [(meta.fld.type) = TAG];
string alert_channel = 60007 [(meta.fld.type) = PROPERTY];
}
}
```

#### DataHubMetadataType

| DataHubMetadataType | String | Bool | Enum | Repeated |
|---------------------|--------|------|------|----------|
| PROPERTY | X | X | X | X |
| TAG | X | X | X | |
| TAG_LIST | X | | | |
| TERM | X | | X | |
| OWNER | X | | | X |
| DOMAIN | X | | | X |

##### PROPERTY

Custom properties can be captured as key/value pairs where the protobuf option name is the key and the option value is the option's value.

For example, generating a custom property with key `prop1` and value `value1`.

```protobuf
message msg {
extend google.protobuf.MessageOptions {
string prop1 = 5000 [(meta.fld.type) = PROPERTY];
}
}
message Message {
option(meta.msg.prop1) = "value1";
}
```

Booleans are converted to a value of either `true` or `false`.

```protobuf
message msg {
extend google.protobuf.MessageOptions {
bool prop1 = 5000 [(meta.fld.type) = PROPERTY];
}
}
message Message {
option(meta.msg.prop1) = true;
}
```

Enum values are similarly converted to their string representation.

```protobuf
enum MetaEnumExample {
UNKNOWN = 0;
ENTITY = 1;
EVENT = 2;
}
message msg {
extend google.protobuf.MessageOptions {
MetaEnumExample prop1 = 5000 [(meta.fld.type) = PROPERTY];
}
}
message Message {
option(meta.msg.prop1) = ENTITY;
}
```

Repeated values will be collected and the value will be stored as a serialized json array. The following example would result in the value of `["a","b","c"]`.

```protobuf
message msg {
extend google.protobuf.MessageOptions {
repeated string prop1 = 5000 [(meta.fld.type) = PROPERTY];
}
}
message Message {
option(meta.msg.prop1) = "a";
option(meta.msg.prop1) = "b";
option(meta.msg.prop1) = "c";
}
```

##### TAG & TAG_LIST

The tag list assumes a string that contains the comma delimited values of the tags. In the example below, tags would be added as `a`, `b`, `c`.

```protobuf
message msg {
extend google.protobuf.MessageOptions {
string tags = 5000 [(meta.fld.type) = TAG_LIST];
}
}
message Message {
option(meta.msg.tags) = "a, b, c";
}
```

Tags could also be represented as separate boolean options. Only the `true` options result in tags. In this example, a single tag of `tagA` would be added to the dataset.

```protobuf
message msg {
extend google.protobuf.MessageOptions {
bool tagA = 5000 [(meta.fld.type) = TAG];
bool tagB = 5001 [(meta.fld.type) = TAG];
}
}
message Message {
option(meta.msg.tagA) = true;
option(meta.msg.tagB) = false;
}
```

Alternatively, tags can be separated into different fields with the option name as a dot delimited prefix. The following would produce two tags with values of `tagA.a` and `tagB.a`.

```protobuf
message msg {
extend google.protobuf.MessageOptions {
string tagA = 5000 [(meta.fld.type) = TAG];
string tagB = 5001 [(meta.fld.type) = TAG];
}
}
message Message {
option(meta.msg.tagA) = "a";
option(meta.msg.tagB) = "a";
}
```

The dot delimited prefix also works with enum types where the prefix is the enum type name. In this example two tags are created, `MetaEnumExample.ENTITY`.

```protobuf
enum MetaEnumExample {
UNKNOWN = 0;
ENTITY = 1;
EVENT = 2;
}
message msg {
extend google.protobuf.MessageOptions {
MetaEnumExample tag = 5000 [(meta.fld.type) = TAG];
}
}
message Message {
option(meta.msg.tag) = ENTITY;
}
```

##### TERM

Terms are specified by either a fully qualified string value or an enum where the enum type's name is the first element in the fully qualified term name.

The following example shows both methods, either of which would result in the term `Classification.HighlyConfidential` being applied.

```protobuf
enum Classification {
HighlyConfidential = 0;
Confidential = 1;
Sensitive = 2;
}
message msg {
extend google.protobuf.MessageOptions {
Classification term = 5000 [(meta.fld.type) = TERM];
string class = 5001 [(meta.fld.type) = TERM];
}
}
message Message {
option(meta.msg.term) = HighlyConfidential;
option(meta.msg.class) = "Classification.HighlyConfidential";
}
```

##### OWNER

One or more owners can be specified and can be any combination of `corpUser` and `corpGroup` entities. The default entity type is `corpGroup`. By default, the ownership type is set to `producer`, see the second example for setting the ownership type.

The following example assigns the ownership to a group of `myGroup` and a user called `myName`.

```protobuf
message msg {
extend google.protobuf.MessageOptions {
repeated string owner = 5000 [(meta.fld.type) = OWNER];
}
}
message Message {
option(meta.msg.owner) = "corpUser:myName";
option(meta.msg.owner) = "myGroup";
}
```

In this example, the option name determines the ownership type. User `myName` is assigned as the Technical Owner and `myGroup` as the Data Steward.

```protobuf
message msg {
extend google.protobuf.MessageOptions {
repeated string technical_owner = 5000 [(meta.fld.type) = OWNER];
repeated string data_steward = 5001 [(meta.fld.type) = OWNER];
}
}
message Message {
option(meta.msg.technical_owner) = "corpUser:myName";
option(meta.msg.data_steward) = "myGroup";
}
```

##### DOMAIN

Set the domain id for the dataset. The domain should exist already. Note that the *id* of the domain is the value. If not specified during domain creation it is likely a random string.

```protobuf
message msg {
extend google.protobuf.MessageOptions {
string domain = 5000 [(meta.fld.type) = DOMAIN];
}
}
message Message {
option(meta.msg.domain) = "engineering";
}
```

## Gradle Integration

An example application is included which works with the `protobuf-gradle-plugin`, see the standalone [example project](../datahub-protobuf-example).
Expand Down
8 changes: 5 additions & 3 deletions metadata-integration/java/datahub-protobuf/build.gradle
Original file line number Diff line number Diff line change
Expand Up @@ -17,21 +17,23 @@ dependencies {

implementation externalDependency.protobuf
implementation externalDependency.jgrapht
implementation externalDependency.gson

compileOnly externalDependency.lombok
annotationProcessor externalDependency.lombok
testImplementation externalDependency.junitJupiterApi
testRuntimeOnly externalDependency.junitJupiterEngine
}

import java.nio.file.Path
import java.nio.file.Paths
task compileProtobuf {
doLast {
def basePath = Path.of("${projectDir}/src/test/resources")
def basePath = Paths.get("${projectDir}/src/test/resources")
[
fileTree("${projectDir}/src/test/resources/protobuf") { include "*.proto" },
fileTree("${projectDir}/src/test/resources/extended_protobuf") { include "*.proto" }
].collectMany { it.collect() }.each { f ->
def input = basePath.relativize(Path.of(f.getAbsolutePath()))
def input = basePath.relativize(Paths.get(f.getAbsolutePath()))
println(input.toString() + " => " + input.toString().replace(".proto", ".protoc"))
exec {
workingDir "${projectDir}/src/test/resources"
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -17,8 +17,10 @@
import datahub.protobuf.visitors.ProtobufModelVisitor;
import datahub.protobuf.visitors.VisitContext;
import datahub.protobuf.visitors.dataset.DatasetVisitor;
import datahub.protobuf.visitors.dataset.DomainVisitor;
import datahub.protobuf.visitors.dataset.InstitutionalMemoryVisitor;
import datahub.protobuf.visitors.dataset.KafkaTopicPropertyVisitor;
import datahub.protobuf.visitors.dataset.OwnershipVisitor;
import datahub.protobuf.visitors.dataset.ProtobufExtensionPropertyVisitor;
import datahub.protobuf.visitors.dataset.ProtobufExtensionTagAssocVisitor;
import datahub.protobuf.visitors.dataset.ProtobufExtensionTermAssocVisitor;
Expand Down Expand Up @@ -147,6 +149,16 @@ public ProtobufDataset build() throws IOException {
new ProtobufExtensionTermAssocVisitor()
)
)
.ownershipVisitors(
List.of(
new OwnershipVisitor()
)
)
.domainVisitors(
List.of(
new DomainVisitor()
)
)
.build()
);
}
Expand Down
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
package datahub.protobuf.visitors;

import com.google.protobuf.ByteString;
import com.google.protobuf.DescriptorProtos;
import com.google.protobuf.Descriptors;
import com.google.protobuf.ExtensionRegistry;
Expand All @@ -8,6 +9,7 @@
import com.linkedin.common.urn.GlossaryTermUrn;
import com.linkedin.tag.TagProperties;

import java.util.Arrays;
import java.util.Collection;
import java.util.Map;
import java.util.Objects;
Expand All @@ -28,7 +30,7 @@ public static DescriptorProtos.FieldDescriptorProto extendProto(DescriptorProtos
}

public enum DataHubMetadataType {
PROPERTY, TAG, TERM;
PROPERTY, TAG, TAG_LIST, TERM, OWNER, DOMAIN;

public static final String PROTOBUF_TYPE = "DataHubMetadataType";
}
Expand Down Expand Up @@ -57,8 +59,16 @@ public static Map<Descriptors.FieldDescriptor, Object> filterByDataHubType(Map<D
.collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
}

public static Stream<Map.Entry<String, String>> getProperties(Descriptors.FieldDescriptor field, DescriptorProtos.DescriptorProto value) {
return value.getUnknownFields().asMap().entrySet().stream().map(unknown -> {
Descriptors.FieldDescriptor fieldDesc = field.getMessageType().findFieldByNumber(unknown.getKey());
String fieldValue = unknown.getValue().getLengthDelimitedList().stream().map(ByteString::toStringUtf8).collect(Collectors.joining(""));
return Map.entry(String.join(".", field.getFullName(), fieldDesc.getName()), fieldValue);
});
}

public static Stream<TagProperties> extractTagPropertiesFromOptions(Map<Descriptors.FieldDescriptor, Object> options, ExtensionRegistry registry) {
return filterByDataHubType(options, registry, DataHubMetadataType.TAG).entrySet().stream()
Stream<TagProperties> tags = filterByDataHubType(options, registry, DataHubMetadataType.TAG).entrySet().stream()
.filter(e -> e.getKey().isExtension())
.map(entry -> {
switch (entry.getKey().getJavaType()) {
Expand All @@ -85,6 +95,22 @@ public static Stream<TagProperties> extractTagPropertiesFromOptions(Map<Descript
return null;
}
}).filter(Objects::nonNull);

Stream<TagProperties> tagListTags = filterByDataHubType(options, registry, DataHubMetadataType.TAG_LIST).entrySet().stream()
.filter(e -> e.getKey().isExtension())
.flatMap(entry -> {
switch (entry.getKey().getJavaType()) {
case STRING:
return Arrays.stream(entry.getValue().toString().split(","))
.map(t -> new TagProperties()
.setName(t.trim())
.setDescription(entry.getKey().getFullName()));
default:
return Stream.empty();
}
}).filter(Objects::nonNull);

return Stream.concat(tags, tagListTags);
}

public static Stream<GlossaryTermAssociation> extractTermAssociationsFromOptions(Map<Descriptors.FieldDescriptor, Object> options,
Expand Down
Loading

0 comments on commit b9627e7

Please sign in to comment.