
fix(Ingestor/Glue): Implement column upstream lineage between S3 and Glue #10895

Conversation

sagar-salvi-apptware
Contributor

@sagar-salvi-apptware sagar-salvi-apptware commented Jul 12, 2024

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Summary by CodeRabbit

  • New Features

    • Introduced column-level lineage extraction from S3 data for Glue sources.
    • Enhanced metadata representation for datasets within AWS Glue, including lineage and schema details.
  • Tests

    • Added new tests for column and table lineage extraction in Glue sources.
    • Expanded testing framework to validate new lineage functionalities and schema metadata retrieval.
  • Chores

    • Updated configurations and methods to support fine-grained lineage information.

Contributor

coderabbitai bot commented Jul 12, 2024

Walkthrough

The recent changes enhance AWS Glue data sources in DataHub by introducing fine-grained column-level lineage extraction. A new configuration option, include_column_lineage, allows users to enable this feature. Updated methods support detailed lineage extraction, while new methods facilitate the processing of schema metadata. These enhancements are validated by comprehensive tests and a new JSON file detailing Glue datasets and their lineage connections.
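For context, enabling the new behaviour in an ingestion recipe might look like the sketch below. Only include_column_lineage is the option added by this PR; the other fields (aws_region, emit_s3_lineage, the sink block) are standard recipe scaffolding shown here with placeholder values.

```yaml
source:
  type: glue
  config:
    aws_region: us-east-1          # placeholder region
    emit_s3_lineage: true          # table-level lineage to S3 (pre-existing option)
    include_column_lineage: true   # new option introduced by this PR

sink:
  type: datahub-rest
  config:
    server: http://localhost:8080  # placeholder DataHub endpoint
```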

Changes

File(s) Change Summary
metadata-ingestion/src/datahub/ingestion/source/aws/glue.py Added include_column_lineage field in GlueSourceConfig, modified lineage handling, and introduced get_fine_grained_lineages method.
metadata-ingestion/tests/unit/glue/glue_mces_golden_table_column_lineage.json New JSON file representing Glue dataset metadata, including column-level lineage.
metadata-ingestion/tests/unit/test_glue_source.py Enhanced tests with new functions and parameters for lineage and schema metadata retrieval.
metadata-ingestion/tests/unit/test_glue_source_stubs.py Added data structures for database and table lineage responses.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant GlueSource
    participant GlueClient
    participant S3
    participant DataHubGraph

    User->>GlueSource: Initiate ingestion with column lineage enabled
    GlueSource->>GlueClient: Fetch database and table metadata
    GlueClient-->>GlueSource: Return metadata
    GlueSource->>S3: Fetch S3 lineage data for columns
    S3-->>GlueSource: Return column lineage data
    GlueSource->>DataHubGraph: Emit lineage and metadata to DataHub
    DataHubGraph-->>GlueSource: Acknowledge receipt
    GlueSource-->>User: Ingestion complete

Poem

In the world of data's weave and twine,
Glue now tracks each column line.
From S3's depths to schema's crest,
Lineage flows at its best. 🌟
DataHub sings with newfound glee,
Thanks to fine-grained lineage, free! 🐰


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Jul 12, 2024
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 89bda5b and 55c38db.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (7 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/aws/glue.py

27-27: datahub.emitter.mce_builder imported but unused

Remove unused import: datahub.emitter.mce_builder

(F401)

Additional comments not posted (3)
metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (3)

183-186: LGTM! New field include_column_lineage added to GlueSourceConfig.

The new field include_column_lineage has been added to enable column-level lineage extraction.


732-768: LGTM! Updates to get_lineage_if_enabled method.

The updates to the get_lineage_if_enabled method include logic for extracting schema metadata and creating fine-grained lineage if include_column_lineage is enabled.


790-828: LGTM! New method get_fine_grained_lineages added.

The new method get_fine_grained_lineages creates fine-grained lineage between schema fields of Glue and S3 datasets.
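As a rough, dependency-free sketch of what such fine-grained lineage amounts to: each Glue column is paired with an S3 column and both sides are expressed as DataHub schemaField URNs. The helper names and the match-by-field-name rule below are illustrative, not the PR's exact code; only the URN layout follows DataHub's schemaField convention.

```python
def make_schema_field_urn(dataset_urn: str, field_path: str) -> str:
    # DataHub schemaField URNs embed the dataset URN and the field path
    return f"urn:li:schemaField:({dataset_urn},{field_path})"

def sketch_fine_grained_lineages(glue_urn, s3_urn, glue_fields, s3_fields):
    # Pair columns that share a field name; each pair becomes one
    # upstream (S3) -> downstream (Glue) fine-grained lineage entry.
    lineages = []
    for field in glue_fields:
        if field in s3_fields:
            lineages.append(
                {
                    "upstreams": [make_schema_field_urn(s3_urn, field)],
                    "downstreams": [make_schema_field_urn(glue_urn, field)],
                }
            )
    return lineages

# The Glue URN appears in the golden file; the S3 URN here is made up.
glue_urn = "urn:li:dataset:(urn:li:dataPlatform:glue,flights-database.avro,PROD)"
s3_urn = "urn:li:dataset:(urn:li:dataPlatform:s3,my-bucket/avro,PROD)"
print(sketch_fine_grained_lineages(glue_urn, s3_urn, ["yr", "quarter"], ["yr"]))
```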

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 55c38db and daabcfb.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (6 hunks)
Files skipped from review as they are similar to previous changes (1)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py

@sagar-salvi-apptware force-pushed the fix/ING-660/column-lineage-s3-glue branch 2 times, most recently from e687f8b to 16d80dd on July 12, 2024 08:57
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between daabcfb and 16d80dd.

Files selected for processing (5)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (6 hunks)
  • metadata-ingestion/tests/unit/glue/glue_mces_golden_table_column_lineage.json (1 hunks)
  • metadata-ingestion/tests/unit/glue/glue_mces_golden_table_lineage.json (1 hunks)
  • metadata-ingestion/tests/unit/test_glue_source.py (5 hunks)
  • metadata-ingestion/tests/unit/test_glue_source_stubs.py (1 hunks)
Files not summarized due to errors (1)
  • metadata-ingestion/tests/unit/glue/glue_mces_golden_table_lineage.json: Error: Message exceeds token limit
Files skipped from review as they are similar to previous changes (1)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py
Additional comments not posted (29)
metadata-ingestion/tests/unit/glue/glue_mces_golden_table_column_lineage.json (1)

1-373: LGTM! JSON structure is well-formed and consistent.

The JSON data correctly represents the various aspects related to the Glue platform, including container properties, status, data platform instance, subtypes, dataset properties, schema metadata, ownership, upstream lineage, and fine-grained lineages.

metadata-ingestion/tests/unit/test_glue_source.py (6)

70-91: Verify the integration of new parameters in glue_source function.

The function includes new parameters: mock_datahub_graph, emit_s3_lineage, include_column_lineage, and extract_transforms. Ensure these parameters are correctly integrated and used within the function.


444-536: Verify new test cases in test_glue_ingest_include_table_lineage.

The function includes a new test case for table lineage with different platform instances and MCE files. Ensure the test cases are correctly defined and cover all necessary scenarios.


543-643: Verify new test function test_glue_ingest_include_column_lineage.

The function is a new test function for testing column lineage. Ensure the function is correctly defined and covers all necessary scenarios for column lineage.


Line range hint 291-303:
Verify stateful ingestion tests in test_glue_stateful.

The function includes stateful ingestion tests for Glue. Ensure the tests are correctly defined and cover all necessary scenarios.


Line range hint 401-419:
Verify delta schema ingestion tests in test_glue_with_delta_schema_ingest.

The function includes tests for ingesting delta schemas. Ensure the tests are correctly defined and cover all necessary scenarios.


Line range hint 421-439:
Verify malformed delta schema ingestion tests in test_glue_with_malformed_delta_schema_ingest.

The function includes tests for ingesting malformed delta schemas. Ensure the tests are correctly defined and cover all necessary scenarios.

metadata-ingestion/tests/unit/test_glue_source_stubs.py (2)

883-901: Ensure database parameters are accurate.

The dictionary get_databases_response_for_lineage appears to be well-structured. Verify that each parameter, particularly LocationUri and Parameters, accurately reflects the intended database setup for lineage purposes.


903-973: Ensure table parameters and lineage details are accurate.

The list tables_lineage_1 and the dictionary get_tables_lineage_response_1 are well-structured. Verify that each parameter, particularly Columns, Location, and Parameters, accurately reflects the intended table setup for lineage purposes. Consider adding more realistic comments and metadata to better simulate real-world scenarios.

metadata-ingestion/tests/unit/glue/glue_mces_golden_table_lineage.json (20)

2-22: Field entityUrn and aspectName values are consistent and correct.

The entityUrn and aspectName fields match the expected values for a container entity.


23-33: Field entityUrn and aspectName values are consistent and correct.

The entityUrn and aspectName fields match the expected values for a container entity with a status aspect.


34-44: Field entityUrn and aspectName values are consistent and correct.

The entityUrn and aspectName fields match the expected values for a container entity with a dataPlatformInstance aspect.


45-57: Field entityUrn and aspectName values are consistent and correct.

The entityUrn and aspectName fields match the expected values for a container entity with a subTypes aspect.


58-75: Field entityUrn and aspectName values are consistent and correct.

The entityUrn and aspectName fields match the expected values for a container entity with various aspects.


76-86: Field entityUrn and aspectName values are consistent and correct.

The entityUrn and aspectName fields match the expected values for a container entity with a status aspect.


87-97: Field entityUrn and aspectName values are consistent and correct.

The entityUrn and aspectName fields match the expected values for a container entity with a dataPlatformInstance aspect.


98-110: Field entityUrn and aspectName values are consistent and correct.

The entityUrn and aspectName fields match the expected values for a container entity with a subTypes aspect.


111-366: Field urn and various aspect fields are consistent and correct.

The urn and various aspect fields match the expected values for a DatasetSnapshot entity.


367-380: Field entityUrn and aspectName values are consistent and correct.

The entityUrn and aspectName fields match the expected values for a dataset entity with a subTypes aspect.


381-390: Field entityUrn and aspectName values are consistent and correct.

The entityUrn and aspectName fields match the expected values for a dataset entity with a container aspect.


391-410: Field entityUrn and aspectName values are consistent and correct.

The entityUrn and aspectName fields match the expected values for a dataset entity with an upstreamLineage aspect.


411-572: Field urn and various aspect fields are consistent and correct.

The urn and various aspect fields match the expected values for a DatasetSnapshot entity.


573-585: Field entityUrn and aspectName values are consistent and correct.

The entityUrn and aspectName fields match the expected values for a dataset entity with a subTypes aspect.


586-595: Field entityUrn and aspectName values are consistent and correct.

The entityUrn and aspectName fields match the expected values for a dataset entity with a container aspect.


596-615: Field entityUrn and aspectName values are consistent and correct.

The entityUrn and aspectName fields match the expected values for a dataset entity with an upstreamLineage aspect.


616-778: Field urn and various aspect fields are consistent and correct.

The urn and various aspect fields match the expected values for a DatasetSnapshot entity.


779-790: Field entityUrn and aspectName values are consistent and correct.

The entityUrn and aspectName fields match the expected values for a dataset entity with a subTypes aspect.


791-801: Field entityUrn and aspectName values are consistent and correct.

The entityUrn and aspectName fields match the expected values for a dataset entity with a container aspect.


802-821: Field entityUrn and aspectName values are consistent and correct.

The entityUrn and aspectName fields match the expected values for a dataset entity with an upstreamLineage aspect.

(
    f
    for f in schema_metadata_for_s3.fields
    if f.fieldPath.split(".")[-1] == field.fieldPath.split(".")[-1]
)
Contributor

Will this work for nested fields?
For example:
a.b.c != a.c
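The concern can be made concrete with a tiny sketch of the last-segment comparison used in the snippet above (purely illustrative):

```python
# Comparing only the last dot-separated segment treats distinct
# nested fields as equal:
glue_path = "a.b.c"
s3_path = "a.c"
match = glue_path.split(".")[-1] == s3_path.split(".")[-1]
print(match)  # both paths end in "c", so they would be connected
```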

Contributor

Here is an example fieldPath for the above:

unique_v2_field_paths = [
 "[version=2.0].[type=union].[type=A].[type=string].f",
 "[version=2.0].[type=union].[type=B].[type=string].f"
]
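Taking the two v2 field paths above, a last-segment comparison collapses both union branches to the same simple name, which is exactly the ambiguity being pointed out (sketch only):

```python
unique_v2_field_paths = [
    "[version=2.0].[type=union].[type=A].[type=string].f",
    "[version=2.0].[type=union].[type=B].[type=string].f",
]
# Two distinct union branches reduce to the same trailing segment,
# so a last-segment match cannot tell them apart.
simple_names = {p.split(".")[-1] for p in unique_v2_field_paths}
print(simple_names)  # {'f'}
```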

Contributor Author

@sagar-salvi-apptware commented Jul 17, 2024

Hi @treff7es, it will work only for the top level, as in S3. Based on my understanding, S3 itself doesn't handle nested levels directly.

Here is the output for a nested level:

[screenshot: nested-level lineage output]

Contributor

That is true, but if I'm not mistaken you will connect fields which you shouldn't.
Let's say we have this column for an S3 file:
[version=2.0].[type=int].yr

and we have this column for the Glue table:
[version=2.0].[type=union].[type=A].[type=string].yr

Contributor Author

@sagar-salvi-apptware commented Jul 18, 2024

Hi @treff7es,
I have tested the scenario, and it seems that the lineage will only connect the top-level fields. The lineage mapping does not work for nested column levels.

For example:
Glue Schema
The field path in Glue at the nested level is:
[version=2.0].[type=struct].[type=union].[type=struct0].union_type.[type=string].mapping

S3 Schema
The top-level field path in S3 is:
"fieldPath": "mapping"

Nested Field Details

{
    "fieldPath": "union_type",
    "nullable": false,
    "type": {
        "type": {
            "com.linkedin.schema.RecordType": {}
        }
    },
    "nativeDataType": "struct<int_value: int32, double_value: double, array_value: list<element: string>, struct_value: struct<employee_id: int32, mapping: string>>",
    "recursive": false,
    "isPartOfKey": false
}

The lineage does not correctly link the field "mapping" because there is no corresponding top-level field in Glue. As a result, the lineage mapping fails.

Output of this:

[screenshots: lineage output]

Thanks !!!

Contributor Author

Hi @treff7es, yes, the issue is related to the v2 vs. v1 field paths only. I'll use the above to verify the lineage.
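One hedged way to reconcile v2 Glue paths with v1 S3 paths is to strip the bracketed type annotations before comparing; this helper is a sketch under that assumption, not the PR's actual implementation:

```python
import re

def simple_name(field_path: str) -> str:
    # Drop "[...]" annotations (they may contain dots, e.g. "[version=2.0]"),
    # then rejoin the remaining non-empty dotted segments.
    stripped = re.sub(r"\[[^\]]*\]", "", field_path)
    return ".".join(seg for seg in stripped.split(".") if seg)

print(simple_name("[version=2.0].[type=union].[type=A].[type=string].f"))  # f
print(simple_name(
    "[version=2.0].[type=struct].[type=union].[type=struct0]"
    ".union_type.[type=string].mapping"
))  # union_type.mapping
```

Normalizing both sides this way keeps the nested Glue path "union_type.mapping" distinct from S3's top-level "mapping", avoiding the false match discussed above.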

Contributor Author

Hi @treff7es, I implemented the changes mentioned above.

Output:

[screenshots: lineage output]

Let me know if anything is required

Thanks !!!

Contributor

Thanks. It's much better and safer now. I left one small comment, and I think it will be fine.

Contributor Author

I have fixed that, thanks!

Contributor

Thanks, it looks good now.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 16d80dd and 6c45be8.

Files selected for processing (5)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (7 hunks)
  • metadata-ingestion/tests/unit/glue/glue_mces_golden_table_column_lineage.json (1 hunks)
  • metadata-ingestion/tests/unit/glue/glue_mces_golden_table_lineage.json (1 hunks)
  • metadata-ingestion/tests/unit/test_glue_source.py (5 hunks)
  • metadata-ingestion/tests/unit/test_glue_source_stubs.py (1 hunks)
Files skipped from review as they are similar to previous changes (2)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py
  • metadata-ingestion/tests/unit/glue/glue_mces_golden_table_column_lineage.json
Additional comments not posted (37)
metadata-ingestion/tests/unit/test_glue_source.py (4)

443-536: New test function added for table lineage inclusion.

The new test function test_glue_ingest_include_table_lineage validates the inclusion of table lineage during the ingestion process.


540-643: New test function added for column lineage inclusion.

The new test function test_glue_ingest_include_column_lineage validates the inclusion of column lineage during the ingestion process.


565-618: New helper function added for schema metadata.

The new helper function fake_schema_metadata provides a mock implementation for schema metadata, which is useful for testing the column lineage feature.


70-91: LGTM! But verify the function usage in the codebase.

The code changes are approved.

However, ensure that all function calls to glue_source match the new signature.

Verification successful

Verified: All function calls to glue_source match the new signature.

The function calls in metadata-ingestion/tests/unit/test_glue_source.py have been updated to include the new parameters, ensuring consistency with the modified function signature.

  • metadata-ingestion/tests/unit/test_glue_source.py: Verified multiple instances of glue_source calls with the updated parameters.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all function calls to `glue_source` match the new signature.

# Test: Search for the function usage. Expect: Only occurrences of the new signature.
rg --type python -A 5 $'glue_source'

Length of output: 66


Script:

#!/bin/bash
# Description: Verify all function calls to `glue_source` match the new signature.

# Test: Search for the function usage. Expect: Only occurrences of the new signature.
rg --type py -A 5 $'glue_source'

Length of output: 11046


Script:

#!/bin/bash
# Description: Extract and display all `glue_source` function calls with additional context lines.

rg --type py -A 10 -B 2 $'glue_source('

Length of output: 115


Script:

#!/bin/bash
# Description: Extract and display all `glue_source` function calls with additional context lines.

rg --type py -A 10 -B 2 'glue_source\('

Length of output: 6689

metadata-ingestion/tests/unit/test_glue_source_stubs.py (2)

883-901: New mock response added for database lineage.

The new mock response get_databases_response_for_lineage provides a mock implementation for database lineage, which is useful for testing the lineage feature.


903-973: New mock response added for table lineage.

The new mock response get_tables_lineage_response_1 provides a mock implementation for table lineage, which is useful for testing the lineage feature.

metadata-ingestion/tests/unit/glue/glue_mces_golden_table_lineage.json (31)

7-21: Check for completeness of custom properties.

Ensure that all necessary custom properties are included and correctly populated.


27-32: Ensure correct status aspect.

The status aspect indicates that the container is not removed. Verify that this status is correct.


38-43: Ensure correct data platform instance.

The data platform instance is set to urn:li:dataPlatform:glue. Verify that this is correct.


3-21: Ensure consistent URN usage.

The URN urn:li:container:0b9f1f731ecf6743be6207fec3dc9cba is used consistently across multiple aspects. Verify that this URN is correct and consistent with other parts of the project.

Verification successful

URN usage is consistent across the project.

The URN urn:li:container:0b9f1f731ecf6743be6207fec3dc9cba is used consistently across multiple files and aspects within the project.

  • metadata-ingestion/tests/unit/glue/glue_mces_golden_table_lineage.json
  • metadata-ingestion/tests/unit/glue/glue_deleted_actor_mces_golden.json
  • metadata-ingestion/tests/unit/glue/glue_mces_golden.json
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the URN usage consistency across the project.

# Test: Search for the URN usage. Expect: Only occurrences of the specified URN.
rg --type json -A 5 $'urn:li:container:0b9f1f731ecf6743be6207fec3dc9cba'

Length of output: 6426


49-56: Ensure correct subTypes.

The subTypes aspect indicates that the container is of type Database. Verify that this is correct.


63-74: Check for completeness of custom properties.

Ensure that all necessary custom properties are included and correctly populated.


80-85: Ensure correct status aspect.

The status aspect indicates that the container is not removed. Verify that this status is correct.


91-96: Ensure correct data platform instance.

The data platform instance is set to urn:li:dataPlatform:glue. Verify that this is correct.


102-109: Ensure correct subTypes.

The subTypes aspect indicates that the container is of type Database. Verify that this is correct.


59-74: Ensure consistent URN usage.

The URN urn:li:container:bdf4342ea6899d162eae685bfe9074a7 is used consistently across multiple aspects. Verify that this URN is correct and consistent with other parts of the project.

Verification successful

URN usage is consistent and correct.

The URN urn:li:container:bdf4342ea6899d162eae685bfe9074a7 is used consistently across multiple files and aspects within the project. No discrepancies were found.

  • metadata-ingestion/tests/unit/glue/glue_mces_golden_table_lineage.json
  • metadata-ingestion/tests/unit/glue/glue_mces_golden.json
  • metadata-ingestion/tests/unit/glue/glue_deleted_actor_mces_golden.json
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the URN usage consistency across the project.

# Test: Search for the URN usage. Expect: Only occurrences of the specified URN.
rg --type json -A 5 $'urn:li:container:bdf4342ea6899d162eae685bfe9074a7'

Length of output: 12703


115-127: Check for completeness of custom properties.

Ensure that all necessary custom properties are included and correctly populated.


133-138: Ensure correct status aspect.

The status aspect indicates that the container is not removed. Verify that this status is correct.


144-149: Ensure correct data platform instance.

The data platform instance is set to urn:li:dataPlatform:glue. Verify that this is correct.


155-162: Ensure correct subTypes.

The subTypes aspect indicates that the container is of type Database. Verify that this is correct.


170-173: Ensure correct status aspect.

The status aspect indicates that the dataset is not removed. Verify that this status is correct.


175-201: Check for completeness of custom properties.

Ensure that all necessary custom properties are included and correctly populated.


333-335: Ensure correct data platform instance.

The data platform instance is set to urn:li:dataPlatform:glue. Verify that this is correct.


353-361: Ensure correct global tags.

The global tags aspect includes tags associated with the dataset. Verify that all tags are correctly defined.


369-378: Ensure correct subTypes.

The subTypes aspect indicates that the dataset is of type Table. Verify that this is correct.


383-390: Ensure correct container aspect.

The container aspect indicates the container URN. Verify that this is correct.


418-421: Ensure correct status aspect.

The status aspect indicates that the dataset is not removed. Verify that this status is correct.


423-448: Check for completeness of custom properties.

Ensure that all necessary custom properties are included and correctly populated.


538-540: Ensure correct data platform instance.

The data platform instance is set to urn:li:dataPlatform:glue. Verify that this is correct.


113-127: Ensure consistent URN usage.

The URN urn:li:container:110bc08849d1c1bde5fc345dab5c3ae7 is used consistently across multiple aspects. Verify that this URN is correct and consistent with other parts of the project.

Verification successful

URN usage is consistent.

The URN urn:li:container:110bc08849d1c1bde5fc345dab5c3ae7 is used consistently across multiple aspects in the project.

  • metadata-ingestion/tests/unit/glue/glue_mces_golden_table_lineage.json
  • metadata-ingestion/tests/unit/glue/glue_mces_golden.json
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the URN usage consistency across the project.

# Test: Search for the URN usage. Expect: Only occurrences of the specified URN.
rg --type json -A 5 $'urn:li:container:110bc08849d1c1bde5fc345dab5c3ae7'

Length of output: 4784


338-350: Ensure correct ownership.

The ownership aspect includes owners and their types. Verify that all owners and types are correctly defined.


543-555: Ensure correct ownership.

The ownership aspect includes owners and their types. Verify that all owners and types are correctly defined.

Verification successful

Ownership aspect verified.

The ownership structure is consistent across all instances in the JSON file, with correctly defined owners and types. The "ownerTypes" object is empty, which may be acceptable depending on the expected data.

  • Instances found: 3
  • Owners and types: Correctly defined
  • ownerTypes: Empty but consistent
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for all instances of "Ownership" in the JSON file and display their structure.

# Search for "Ownership" and display the surrounding lines for context.
rg '"com.linkedin.pegasus2avro.common.Ownership":' -A 10 metadata-ingestion/tests/unit/glue/glue_mces_golden_table_lineage.json

Length of output: 1620


204-329: Ensure correct schema metadata.

The schema metadata includes fields and their properties. Verify that all fields and properties are correctly defined.

Verification successful

Schema metadata verification completed.

The schema metadata in the specified JSON file appears to be correctly defined. Each field includes all necessary attributes, and the values are consistent with the expected schema.

  • schemaName, platform, version, created, lastModified, hash, platformSchema, and fields are all present and correctly structured.
  • Each field within fields contains fieldPath, nullable, description (where applicable), type, nativeDataType, recursive, isPartOfKey, and jsonProps.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the schema metadata correctness.

# Test: Search for the schema metadata. Expect: All fields and properties are correctly defined.
rg --type json -A 5 $'schemaMetadata'

Length of output: 81650


Script:

#!/bin/bash
# Description: Extract and display the relevant portion of the JSON file to verify the schema metadata correctness.
# Test: Extract lines 204-329 from the specified JSON file and display them.

# Extract and display lines 204-329 from the JSON file
sed -n '204,329p' metadata-ingestion/tests/unit/glue/glue_mces_golden_table_lineage.json

Length of output: 6828


165-364: Ensure consistent URN usage.

The URN urn:li:dataset:(urn:li:dataPlatform:glue,flights-database.avro,PROD) is used consistently across multiple aspects. Verify that this URN is correct and consistent with other parts of the project.


393-410: Ensure correct upstream lineage.

The upstream lineage aspect includes upstream datasets. Verify that all upstream datasets are correctly defined.

Verification successful

Let's run a script to extract the "dataset" URNs from the "upstreams" arrays in the JSON files and validate their format.


Upstream lineage datasets are correctly defined.

The extracted dataset URNs from the JSON files follow the expected format and correctly represent the upstream datasets.

  • Example URNs:
    • urn:li:dataset:(urn:li:dataPlatform:dbt,pagila.pagila_dev.customer_details,PROD)
    • urn:li:dataset:(urn:li:dataPlatform:bigquery,bigquery-public-data.covid19_geotab_mobility_impact.commercial_traffic_by_industry,PROD)
    • urn:li:dataset:(urn:li:dataPlatform:postgres,pagila.public.customer,PROD)
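The URN shape being verified here follows the pattern `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)`. A loose validity check over the extracted URNs might look like this (the regex is an illustrative approximation, not DataHub's canonical URN grammar):

```python
import re

# Loose pattern for dataset URNs of the form
# urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)
DATASET_URN_RE = re.compile(
    r"^urn:li:dataset:\(urn:li:dataPlatform:[^,]+,[^,]+,[A-Z]+\)$"
)

urns = [
    "urn:li:dataset:(urn:li:dataPlatform:dbt,pagila.pagila_dev.customer_details,PROD)",
    "urn:li:dataset:(urn:li:dataPlatform:postgres,pagila.public.customer,PROD)",
]
print(all(DATASET_URN_RE.match(u) for u in urns))  # True
```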
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the upstream lineage correctness.

# Test: Search for the upstream lineage aspect. Expect: All upstream datasets are correctly defined.
rg --type json -A 5 $'upstreamLineage'

Length of output: 193876


Script:

#!/bin/bash
# Description: Extract and validate dataset URNs from upstreamLineage aspects.

# Extract dataset URNs from upstreamLineage aspects in JSON files.
rg --type json -o -r '$1' '"dataset":\s*"([^"]+)"' | sort | uniq

Length of output: 127687


451-535: Ensure correct schema metadata.

The schema metadata includes fields and their properties. Verify that all fields and properties are correctly defined.

Verification successful

Schema metadata verification successful.

The schema metadata in metadata-ingestion/tests/unit/glue/glue_mces_golden_table_lineage.json is correctly defined and includes all necessary fields and properties.

  • schemaName: Correctly specified.
  • platform: Correctly specified.
  • version: Correctly specified.
  • created and lastModified: Correctly specified with time and actor.
  • hash: Correctly specified.
  • platformSchema: Correctly specified with tableSchema.
  • fields: Correctly specified with fieldPath, nullable, type, nativeDataType, recursive, isPartOfKey, and jsonProps.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the schema metadata correctness.

# Test: Search for the schema metadata. Expect: All fields and properties are correctly defined.
rg --type json -A 5 $'schemaMetadata'

Length of output: 81650


Script:

#!/bin/bash
# Description: Search for the schema metadata and its detailed contents in the specified file.

# Test: Search for the schema metadata object and its fields and properties in the specified file.
rg --type json -A 20 $'schemaMetadata' metadata-ingestion/tests/unit/glue/glue_mces_golden_table_lineage.json

Length of output: 108


Script:

#!/bin/bash
# Description: Search for the schema metadata and its detailed contents in the specified file.

# Test: Search for the term `SchemaMetadata` in the specified file.
rg --type json -A 20 $'SchemaMetadata' metadata-ingestion/tests/unit/glue/glue_mces_golden_table_lineage.json

Length of output: 3102


413-567: Ensure consistent URN usage.

The URN urn:li:dataset:(urn:li:dataPlatform:glue,test-database.test_jsons_markers,PROD) is used consistently across multiple aspects. Verify that this URN is correct and consistent with other parts of the project.

Verification successful

Ensure consistent URN usage.

The URN urn:li:dataset:(urn:li:dataPlatform:glue,test-database.test_jsons_markers,PROD) is used consistently across multiple test files in the metadata-ingestion/tests/unit/glue directory. No inconsistencies were found in its usage.

  • metadata-ingestion/tests/unit/glue/glue_mces_golden_table_lineage.json
  • metadata-ingestion/tests/unit/glue/glue_deleted_actor_mces_golden.json
  • metadata-ingestion/tests/unit/glue/glue_mces_golden.json
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the URN usage consistency across the project.

# Test: Search for the URN usage. Expect: Only occurrences of the specified URN.
rg --type json -A 5 $'urn:li:dataset:(urn:li:dataPlatform:glue,test-database.test_jsons_markers,PROD)'

Length of output: 103


Script:

#!/bin/bash
# Description: Verify the URN usage consistency across the entire codebase.

# Test: Search for the URN usage. Expect: Occurrences of the specified URN in various contexts.
rg -A 5 $'urn:li:dataset:(urn:li:dataPlatform:glue,test-database.test_jsons_markers,PROD)'

Length of output: 91


Script:

#!/bin/bash
# Description: Verify the URN usage consistency across the entire codebase, including all file types.

# Test: Search for the URN usage. Expect: Occurrences of the specified URN in various contexts.
rg -A 5 'urn:li:dataset:(urn:li:dataPlatform:glue,test-database.test_jsons_markers,PROD)'

Length of output: 91


Script:

#!/bin/bash
# Description: Verify the URN usage consistency across the entire codebase using fd and grep.

# List all files in the codebase and search for the URN within those files.
fd -t f -x grep -Hn 'urn:li:dataset:(urn:li:dataPlatform:glue,test-database.test_jsons_markers,PROD)' {}

Length of output: 2018
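The empty results from the earlier `rg` invocations have a simple cause: the URN's parentheses are regex metacharacters, so the pattern never matches the literal string. A minimal sketch of the failure mode, using Python's `re` module as a stand-in for ripgrep's regex engine:

```python
import re

# The Glue dataset URN contains '(' and ')', which regex engines treat
# as group delimiters, so an unescaped pattern silently matches nothing.
urn = "urn:li:dataset:(urn:li:dataPlatform:glue,test-database.test_jsons_markers,PROD)"
line = f'"dataset": "{urn}"'

# Unescaped: the parentheses are interpreted as a group, so no match.
print(re.search(urn, line))  # None

# Escaped (the fixed-string equivalent of `rg -F` / `grep -F`): matches.
print(re.search(re.escape(urn), line) is not None)  # True
```

The `fd -x grep` fallback worked because `grep` defaults to basic regular expressions, where an unescaped `(` is literal; with `rg`, passing `-F` for fixed-string matching would have had the same effect.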

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 6c45be8 and 50e1f92.

Files selected for processing (1)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (7 hunks)
Files skipped from review as they are similar to previous changes (1)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py

@anshbansal anshbansal merged commit 348d449 into datahub-project:master Jul 19, 2024
58 checks passed
aviv-julienjehannet pushed a commit to aviv-julienjehannet/datahub that referenced this pull request Jul 25, 2024
(
    f
    for f in schema_metadata_for_s3.fields
    if simplify_field_path(f.fieldPath) == field_path_v1
Collaborator

All of this downgrading to V1 field paths. Is that the best practice going forward here?

arosanda added a commit to infobip/datahub that referenced this pull request Sep 23, 2024
* feat(forms) Handle deleting forms references when hard deleting forms (datahub-project#10820)

* refactor(ui): Misc improvements to the setup ingestion flow (ingest uplift 1/2)  (datahub-project#10764)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: John Joyce <[email protected]>

* fix(ingestion/airflow-plugin): pipeline tasks discoverable in search (datahub-project#10819)

* feat(ingest/transformer): tags to terms transformer (datahub-project#10758)

Co-authored-by: Aseem Bansal <[email protected]>

* fix(ingestion/unity-catalog): fixed issue with profiling with GE turned on (datahub-project#10752)

Co-authored-by: Aseem Bansal <[email protected]>

* feat(forms) Add java SDK for form entity PATCH + CRUD examples (datahub-project#10822)

* feat(SDK) Add java SDK for structuredProperty entity PATCH + CRUD examples (datahub-project#10823)

* feat(SDK) Add StructuredPropertyPatchBuilder in python sdk and provide sample CRUD files (datahub-project#10824)

* feat(forms) Add CRUD endpoints to GraphQL for Form entities (datahub-project#10825)

* add flag for includeSoftDeleted in scroll entities API (datahub-project#10831)

* feat(deprecation) Return actor entity with deprecation aspect (datahub-project#10832)

* feat(structuredProperties) Add CRUD graphql APIs for structured property entities (datahub-project#10826)

* add scroll parameters to openapi v3 spec (datahub-project#10833)

* fix(ingest): correct profile_day_of_week implementation (datahub-project#10818)

* feat(ingest/glue): allow ingestion of empty databases from Glue (datahub-project#10666)

Co-authored-by: Harshal Sheth <[email protected]>

* feat(cli): add more details to get cli (datahub-project#10815)

* fix(ingestion/glue): ensure date formatting works on all platforms for aws glue (datahub-project#10836)

* fix(ingestion): fix datajob patcher (datahub-project#10827)

* fix(smoke-test): add suffix in temp file creation (datahub-project#10841)

* feat(ingest/glue): add helper method to permit user or group ownership (datahub-project#10784)

* feat(): Show data platform instances in policy modal if they are set on the policy (datahub-project#10645)

Co-authored-by: Hendrik Richert <[email protected]>

* docs(patch): add patch documentation for how implementation works (datahub-project#10010)

Co-authored-by: John Joyce <[email protected]>

* fix(jar): add missing custom-plugin-jar task (datahub-project#10847)

* fix(): also check exceptions/stack trace when filtering log messages (datahub-project#10391)

Co-authored-by: John Joyce <[email protected]>

* docs(): Update posts.md (datahub-project#9893)

Co-authored-by: Hyejin Yoon <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* chore(ingest): update acryl-datahub-classify version (datahub-project#10844)

* refactor(ingest): Refactor structured logging to support infos, warnings, and failures structured reporting to UI (datahub-project#10828)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: Harshal Sheth <[email protected]>

* fix(restli): log aspect-not-found as a warning rather than as an error (datahub-project#10834)

* fix(ingest/nifi): remove duplicate upstream jobs (datahub-project#10849)

* fix(smoke-test): test access to create/revoke personal access tokens (datahub-project#10848)

* fix(smoke-test): missing test for move domain (datahub-project#10837)

* ci: update usernames to not considered for community (datahub-project#10851)

* env: change defaults for data contract visibility (datahub-project#10854)

* fix(ingest/tableau): quote special characters in external URL (datahub-project#10842)

* fix(smoke-test): fix flakiness of auto complete test

* ci(ingest): pin dask dependency for feast (datahub-project#10865)

* fix(ingestion/lookml): liquid template resolution and view-to-view cll (datahub-project#10542)

* feat(ingest/audit): add client id and version in system metadata props (datahub-project#10829)

* chore(ingest): Mypy 1.10.1 pin (datahub-project#10867)

* docs: use acryl-datahub-actions as expected python package to install (datahub-project#10852)

* docs: add new js snippet (datahub-project#10846)

* refactor(ingestion): remove company domain for security reason (datahub-project#10839)

* fix(ingestion/spark): Platform instance and column level lineage fix (datahub-project#10843)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat(ingestion/tableau): optionally ingest multiple sites and create site containers (datahub-project#10498)

Co-authored-by: Yanik Häni <[email protected]>

* fix(ingestion/looker): Add sqlglot dependency and remove unused sqlparser (datahub-project#10874)

* fix(manage-tokens): fix manage access token policy (datahub-project#10853)

* Batch get entity endpoints (datahub-project#10880)

* feat(system): support conditional write semantics (datahub-project#10868)

* fix(build): upgrade vercel builds to Node 20.x (datahub-project#10890)

* feat(ingest/lookml): shallow clone repos (datahub-project#10888)

* fix(ingest/looker): add missing dependency (datahub-project#10876)

* fix(ingest): only populate audit stamps where accurate (datahub-project#10604)

* fix(ingest/dbt): always encode tag urns (datahub-project#10799)

* fix(ingest/redshift): handle multiline alter table commands (datahub-project#10727)

* fix(ingestion/looker): column name missing in explore (datahub-project#10892)

* fix(lineage) Fix lineage source/dest filtering with explored per hop limit (datahub-project#10879)

* feat(conditional-writes): misc updates and fixes (datahub-project#10901)

* feat(ci): update outdated action (datahub-project#10899)

* feat(rest-emitter): adding async flag to rest emitter (datahub-project#10902)

Co-authored-by: Gabe Lyons <[email protected]>

* feat(ingest): add snowflake-queries source (datahub-project#10835)

* fix(ingest): improve `auto_materialize_referenced_tags_terms` error handling (datahub-project#10906)

* docs: add new company to adoption list (datahub-project#10909)

* refactor(redshift): Improve redshift error handling with new structured reporting system (datahub-project#10870)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: Harshal Sheth <[email protected]>

* feat(ui) Finalize support for all entity types on forms (datahub-project#10915)

* Index ExecutionRequestResults status field (datahub-project#10811)

* feat(ingest): grafana connector (datahub-project#10891)

Co-authored-by: Shirshanka Das <[email protected]>
Co-authored-by: Harshal Sheth <[email protected]>

* fix(gms) Add Form entity type to EntityTypeMapper (datahub-project#10916)

* feat(dataset): add support for external url in Dataset (datahub-project#10877)

* docs(saas-overview) added missing features to observe section (datahub-project#10913)

Co-authored-by: John Joyce <[email protected]>

* fix(ingest/spark): Fixing Micrometer warning (datahub-project#10882)

* fix(structured properties): allow application of structured properties without schema file (datahub-project#10918)

* fix(data-contracts-web) handle other schedule types (datahub-project#10919)

* fix(ingestion/tableau): human-readable message for PERMISSIONS_MODE_SWITCHED error (datahub-project#10866)

Co-authored-by: Harshal Sheth <[email protected]>

* Add feature flag for view defintions (datahub-project#10914)

Co-authored-by: Ethan Cartwright <[email protected]>

* feat(ingest/BigQuery): refactor+parallelize dataset metadata extraction (datahub-project#10884)

* fix(airflow): add error handling around render_template() (datahub-project#10907)

* feat(ingestion/sqlglot): add optional `default_dialect` parameter to sqlglot lineage (datahub-project#10830)

* feat(mcp-mutator): new mcp mutator plugin (datahub-project#10904)

* fix(ingest/bigquery): changes helper function to decode unicode scape sequences (datahub-project#10845)

* feat(ingest/postgres): fetch table sizes for profile (datahub-project#10864)

* feat(ingest/abs): Adding azure blob storage ingestion source (datahub-project#10813)

* fix(ingest/redshift): reduce severity of SQL parsing issues (datahub-project#10924)

* fix(build): fix lint fix web react (datahub-project#10896)

* fix(ingest/bigquery): handle quota exceeded for project.list requests (datahub-project#10912)

* feat(ingest): report extractor failures more loudly (datahub-project#10908)

* feat(ingest/snowflake): integrate snowflake-queries into main source (datahub-project#10905)

* fix(ingest): fix docs build (datahub-project#10926)

* fix(ingest/snowflake): fix test connection (datahub-project#10927)

* fix(ingest/lookml): add view load failures to cache (datahub-project#10923)

* docs(slack) overhauled setup instructions and screenshots (datahub-project#10922)

Co-authored-by: John Joyce <[email protected]>

* fix(airflow): Add comma parsing of owners to DataJobs (datahub-project#10903)

* fix(entityservice): fix merging sideeffects (datahub-project#10937)

* feat(ingest): Support System Ingestion Sources, Show and hide system ingestion sources with Command-S (datahub-project#10938)

Co-authored-by: John Joyce <[email protected]>

* chore() Set a default lineage filtering end time on backend when a start time is present (datahub-project#10925)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: John Joyce <[email protected]>

* Added relationships APIs to V3. Added these generic APIs to V3 swagger doc. (datahub-project#10939)

* docs: add learning center to docs (datahub-project#10921)

* doc: Update hubspot form id (datahub-project#10943)

* chore(airflow): add python 3.11 w/ Airflow 2.9 to CI (datahub-project#10941)

* fix(ingest/Glue): column upstream lineage between S3 and Glue (datahub-project#10895)

* fix(ingest/abs): split abs utils into multiple files (datahub-project#10945)

* doc(ingest/looker): fix doc for sql parsing documentation (datahub-project#10883)

Co-authored-by: Harshal Sheth <[email protected]>

* fix(ingest/bigquery): Adding missing BigQuery types (datahub-project#10950)

* fix(ingest/setup): feast and abs source setup (datahub-project#10951)

* fix(connections) Harden adding /gms to connections in backend (datahub-project#10942)

* feat(siblings) Add flag to prevent combining siblings in the UI (datahub-project#10952)

* fix(docs): make graphql doc gen more automated (datahub-project#10953)

* feat(ingest/athena): Add option for Athena partitioned profiling (datahub-project#10723)

* fix(spark-lineage): default timeout for future responses (datahub-project#10947)

* feat(datajob/flow): add environment filter using info aspects (datahub-project#10814)

* fix(ui/ingest): correct privilege used to show tab (datahub-project#10483)

Co-authored-by: Kunal-kankriya <[email protected]>

* feat(ingest/looker): include dashboard urns in browse v2 (datahub-project#10955)

* add a structured type to batchGet in OpenAPI V3 spec (datahub-project#10956)

* fix(ui): scroll on the domain sidebar to show all domains (datahub-project#10966)

* fix(ingest/sagemaker): resolve incorrect variable assignment for SageMaker API call (datahub-project#10965)

* fix(airflow/build): Pinning mypy (datahub-project#10972)

* Fixed a bug where the OpenAPI V3 spec was incorrect. The bug was introduced in datahub-project#10939. (datahub-project#10974)

* fix(ingest/test): Fix for mssql integration tests (datahub-project#10978)

* fix(entity-service) exist check correctly extracts status (datahub-project#10973)

* fix(structuredProps) casing bug in StructuredPropertiesValidator (datahub-project#10982)

* bugfix: use anyOf instead of allOf when creating references in openapi v3 spec (datahub-project#10986)

* fix(ui): Remove ant less imports (datahub-project#10988)

* feat(ingest/graph): Add get_results_by_filter to DataHubGraph (datahub-project#10987)

* feat(ingest/cli): init does not actually support environment variables (datahub-project#10989)

* fix(ingest/graph): Update get_results_by_filter graphql query (datahub-project#10991)

* feat(ingest/spark): Promote beta plugin (datahub-project#10881)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat(ingest): support domains in meta -> "datahub" section (datahub-project#10967)

* feat(ingest): add `check server-config` command (datahub-project#10990)

* feat(cli): Make consistent use of DataHubGraphClientConfig (datahub-project#10466)

Deprecates get_url_and_token() in favor of a more complete option: load_graph_config() that returns a full DatahubClientConfig.
This change was then propagated across previous usages of get_url_and_token so that connections to DataHub server from the client respect the full breadth of configuration specified by DatahubClientConfig.

I.e: You can now specify disable_ssl_verification: true in your ~/.datahubenv file so that all cli functions to the server work when ssl certification is disabled.

Fixes datahub-project#9705

* fix(ingest/s3): Fixing container creation when there is no folder in path (datahub-project#10993)

* fix(ingest/looker): support platform instance for dashboards & charts (datahub-project#10771)

* feat(ingest/bigquery): improve handling of information schema in sql parser (datahub-project#10985)

* feat(ingest): improve `ingest deploy` command (datahub-project#10944)

* fix(backend): allow excluding soft-deleted entities in relationship-queries; exclude soft-deleted members of groups (datahub-project#10920)

- allow excluding soft-deleted entities in relationship-queries
- exclude soft-deleted members of groups

* fix(ingest/looker): downgrade missing chart type log level (datahub-project#10996)

* doc(acryl-cloud): release docs for 0.3.4.x (datahub-project#10984)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: RyanHolstien <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Pedro Silva <[email protected]>

* fix(protobuf/build): Fix protobuf check jar script (datahub-project#11006)

* fix(ui/ingest): Support invalid cron jobs (datahub-project#10998)

* fix(ingest): fix graph config loading (datahub-project#11002)

Co-authored-by: Pedro Silva <[email protected]>

* feat(docs): Document __DATAHUB_TO_FILE_ directive (datahub-project#10968)

Co-authored-by: Harshal Sheth <[email protected]>

* fix(graphql/upsertIngestionSource): Validate cron schedule; parse error in CLI (datahub-project#11011)

* feat(ece): support custom ownership type urns in ECE generation (datahub-project#10999)

* feat(assertion-v2): changed Validation tab to Quality and created new Governance tab (datahub-project#10935)

* fix(ingestion/glue): Add support for missing config options for profiling in Glue (datahub-project#10858)

* feat(propagation): Add models for schema field docs, tags, terms (datahub-project#2959) (datahub-project#11016)

Co-authored-by: Chris Collins <[email protected]>

* docs: standardize terminology to DataHub Cloud (datahub-project#11003)

* fix(ingestion/transformer): replace the externalUrl container (datahub-project#11013)

* docs(slack) troubleshoot docs (datahub-project#11014)

* feat(propagation): Add graphql API (datahub-project#11030)

Co-authored-by: Chris Collins <[email protected]>

* feat(propagation):  Add models for Action feature settings (datahub-project#11029)

* docs(custom properties): Remove duplicate from sidebar (datahub-project#11033)

* feat(models): Introducing Dataset Partitions Aspect (datahub-project#10997)

Co-authored-by: John Joyce <[email protected]>
Co-authored-by: John Joyce <[email protected]>

* feat(propagation): Add Documentation Propagation Settings (datahub-project#11038)

* fix(models): chart schema fields mapping, add dataHubAction entity, t… (datahub-project#11040)

* fix(ci): smoke test lint failures (datahub-project#11044)

* docs: fix learning center color scheme & typo (datahub-project#11043)

* feat: add cloud main page (datahub-project#11017)

Co-authored-by: Jay <[email protected]>

* feat(restore-indices): add additional step to also clear system metadata service (datahub-project#10662)

Co-authored-by: John Joyce <[email protected]>

* docs: fix typo (datahub-project#11046)

* fix(lint): apply spotless (datahub-project#11050)

* docs(airflow): example query to get datajobs for a dataflow (datahub-project#11034)

* feat(cli): Add run-id option to put sub-command (datahub-project#11023)

Adds an option to assign run-id to a given put command execution. 
This is useful when transformers do not exist for a given ingestion payload, we can follow up with custom metadata and assign it to an ingestion pipeline.

* fix(ingest): improve sql error reporting calls (datahub-project#11025)

* fix(airflow): fix CI setup (datahub-project#11031)

* feat(ingest/dbt): add experimental `prefer_sql_parser_lineage` flag (datahub-project#11039)

* fix(ingestion/lookml): enable stack-trace in lookml logs (datahub-project#10971)

* (chore): Linting fix (datahub-project#11015)

* chore(ci): update deprecated github actions (datahub-project#10977)

* Fix ALB configuration example (datahub-project#10981)

* chore(ingestion-base): bump base image packages (datahub-project#11053)

* feat(cli): Trim report of dataHubExecutionRequestResult to max GMS size (datahub-project#11051)

* fix(ingestion/lookml): emit dummy sql condition for lookml custom condition tag (datahub-project#11008)

Co-authored-by: Harshal Sheth <[email protected]>

* fix(ingestion/powerbi): fix issue with broken report lineage (datahub-project#10910)

* feat(ingest/tableau): add retry on timeout (datahub-project#10995)

* change generate kafka connect properties from env (datahub-project#10545)

Co-authored-by: david-leifker <[email protected]>

* fix(ingest): fix oracle cronjob ingestion (datahub-project#11001)

Co-authored-by: david-leifker <[email protected]>

* chore(ci): revert update deprecated github actions (datahub-project#10977) (datahub-project#11062)

* feat(ingest/dbt-cloud): update metadata_endpoint inference (datahub-project#11041)

* build: Reduce size of datahub-frontend-react image by 50-ish% (datahub-project#10878)

Co-authored-by: david-leifker <[email protected]>

* fix(ci): Fix lint issue in datahub_ingestion_run_summary_provider.py (datahub-project#11063)

* docs(ingest): update developing-a-transformer.md (datahub-project#11019)

* feat(search-test): update search tests from datahub-project#10408 (datahub-project#11056)

* feat(cli): add aspects parameter to DataHubGraph.get_entity_semityped (datahub-project#11009)

Co-authored-by: Harshal Sheth <[email protected]>

* docs(airflow): update min version for plugin v2 (datahub-project#11065)

* doc(ingestion/tableau): doc update for derived permission (datahub-project#11054)

Co-authored-by: Pedro Silva <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Harshal Sheth <[email protected]>

* fix(py): remove dep on types-pkg_resources (datahub-project#11076)

* feat(ingest/mode): add option to exclude restricted (datahub-project#11081)

* fix(ingest): set lastObserved in sdk when unset (datahub-project#11071)

* doc(ingest): Update capabilities (datahub-project#11072)

* chore(vulnerability): Log Injection (datahub-project#11090)

* chore(vulnerability): Information exposure through a stack trace (datahub-project#11091)

* chore(vulnerability): Comparison of narrow type with wide type in loop condition (datahub-project#11089)

* chore(vulnerability): Insertion of sensitive information into log files (datahub-project#11088)

* chore(vulnerability): Risky Cryptographic Algorithm (datahub-project#11059)

* chore(vulnerability): Overly permissive regex range (datahub-project#11061)

Co-authored-by: Harshal Sheth <[email protected]>

* fix: update customer data (datahub-project#11075)

* fix(models): fixing the datasetPartition models (datahub-project#11085)

Co-authored-by: John Joyce <[email protected]>

* fix(ui): Adding view, forms GraphQL query, remove showing a fallback error message on unhandled GraphQL error (datahub-project#11084)

Co-authored-by: John Joyce <[email protected]>

* feat(docs-site): hiding learn more from cloud page (datahub-project#11097)

* fix(docs): Add correct usage of orFilters in search API docs (datahub-project#11082)

Co-authored-by: Jay <[email protected]>

* fix(ingest/mode): Regexp in mode name matcher didn't allow underscore (datahub-project#11098)

* docs: Refactor customer stories section (datahub-project#10869)

Co-authored-by: Jeff Merrick <[email protected]>

* fix(release): fix full/slim suffix on tag (datahub-project#11087)

* feat(config): support alternate hashing algorithm for doc id (datahub-project#10423)

Co-authored-by: david-leifker <[email protected]>
Co-authored-by: John Joyce <[email protected]>

* fix(emitter): fix typo in get method of java kafka emitter (datahub-project#11007)

* fix(ingest): use correct native data type in all SQLAlchemy sources by compiling data type using dialect (datahub-project#10898)

Co-authored-by: Harshal Sheth <[email protected]>

* chore: Update contributors list in PR labeler (datahub-project#11105)

* feat(ingest): tweak stale entity removal messaging (datahub-project#11064)

* fix(ingestion): enforce lastObserved timestamps in SystemMetadata (datahub-project#11104)

* fix(ingest/powerbi): fix broken lineage between chart and dataset (datahub-project#11080)

* feat(ingest/lookml): CLL support for sql set in sql_table_name attribute of lookml view (datahub-project#11069)

* docs: update graphql docs on forms & structured properties (datahub-project#11100)

* test(search): search openAPI v3 test (datahub-project#11049)

* fix(ingest/tableau): prevent empty site content urls (datahub-project#11057)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat(entity-client): implement client batch interface (datahub-project#11106)

* fix(snowflake): avoid reporting warnings/info for sys tables (datahub-project#11114)

* fix(ingest): downgrade column type mapping warning to info (datahub-project#11115)

* feat(api): add AuditStamp to the V3 API entity/aspect response (datahub-project#11118)

* fix(ingest/redshift): replace r'\n' with '\n' to avoid token error redshift serverless… (datahub-project#11111)

* fix(entiy-client): handle null entityUrn case for restli (datahub-project#11122)

* fix(sql-parser): prevent bad urns from alter table lineage (datahub-project#11092)

* fix(ingest/bigquery): use small batch size if use_tables_list_query_v2 is set (datahub-project#11121)

* fix(graphql): add missing entities to EntityTypeMapper and EntityTypeUrnMapper (datahub-project#10366)

* feat(ui): Changes to allow editable dataset name (datahub-project#10608)

Co-authored-by: Jay Kadambi <[email protected]>

* fix: remove saxo (datahub-project#11127)

* feat(mcl-processor): Update mcl processor hooks (datahub-project#11134)

* fix(openapi): fix openapi v2 endpoints & v3 documentation update

* Revert "fix(openapi): fix openapi v2 endpoints & v3 documentation update"

This reverts commit 573c1cb.

* docs(policies): updates to policies documentation (datahub-project#11073)

* fix(openapi): fix openapi v2 and v3 docs update (datahub-project#11139)

* feat(auth): grant type and acr values custom oidc parameters support (datahub-project#11116)

* fix(mutator): mutator hook fixes (datahub-project#11140)

* feat(search): support sorting on multiple fields (datahub-project#10775)

* feat(ingest): various logging improvements (datahub-project#11126)

* fix(ingestion/lookml): fix for sql parsing error (datahub-project#11079)

Co-authored-by: Harshal Sheth <[email protected]>

* feat(docs-site) cloud page spacing and content polishes (datahub-project#11141)

* feat(ui) Enable editing structured props on fields (datahub-project#11042)

* feat(tests): add md5 and last computed to testResult model (datahub-project#11117)

* test(openapi): openapi regression smoke tests (datahub-project#11143)

* fix(airflow): fix tox tests + update docs (datahub-project#11125)

* docs: add chime to adoption stories (datahub-project#11142)

* fix(ingest/databricks): Updating code to work with Databricks sdk 0.30 (datahub-project#11158)

* fix(kafka-setup): add missing script to image (datahub-project#11190)

* fix(config): fix hash algo config (datahub-project#11191)

* test(smoke-test): updates to smoke-tests (datahub-project#11152)

* fix(elasticsearch): refactor idHashAlgo setting (datahub-project#11193)

* chore(kafka): kafka version bump (datahub-project#11211)

* readd UsageStatsWorkUnit

* fix merge problems

* change logo

---------

Co-authored-by: Chris Collins <[email protected]>
Co-authored-by: John Joyce <[email protected]>
Co-authored-by: John Joyce <[email protected]>
Co-authored-by: John Joyce <[email protected]>
Co-authored-by: dushayntAW <[email protected]>
Co-authored-by: sagar-salvi-apptware <[email protected]>
Co-authored-by: Aseem Bansal <[email protected]>
Co-authored-by: Kevin Chun <[email protected]>
Co-authored-by: jordanjeremy <[email protected]>
Co-authored-by: skrydal <[email protected]>
Co-authored-by: Harshal Sheth <[email protected]>
Co-authored-by: david-leifker <[email protected]>
Co-authored-by: sid-acryl <[email protected]>
Co-authored-by: Julien Jehannet <[email protected]>
Co-authored-by: Hendrik Richert <[email protected]>
Co-authored-by: Hendrik Richert <[email protected]>
Co-authored-by: RyanHolstien <[email protected]>
Co-authored-by: Felix Lüdin <[email protected]>
Co-authored-by: Pirry <[email protected]>
Co-authored-by: Hyejin Yoon <[email protected]>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: cburroughs <[email protected]>
Co-authored-by: ksrinath <[email protected]>
Co-authored-by: Mayuri Nehate <[email protected]>
Co-authored-by: Kunal-kankriya <[email protected]>
Co-authored-by: Shirshanka Das <[email protected]>
Co-authored-by: ipolding-cais <[email protected]>
Co-authored-by: Tamas Nemeth <[email protected]>
Co-authored-by: Shubham Jagtap <[email protected]>
Co-authored-by: haeniya <[email protected]>
Co-authored-by: Yanik Häni <[email protected]>
Co-authored-by: Gabe Lyons <[email protected]>
Co-authored-by: Gabe Lyons <[email protected]>
Co-authored-by: 808OVADOZE <[email protected]>
Co-authored-by: noggi <[email protected]>
Co-authored-by: Nicholas Pena <[email protected]>
Co-authored-by: Jay <[email protected]>
Co-authored-by: ethan-cartwright <[email protected]>
Co-authored-by: Ethan Cartwright <[email protected]>
Co-authored-by: Nadav Gross <[email protected]>
Co-authored-by: Patrick Franco Braz <[email protected]>
Co-authored-by: pie1nthesky <[email protected]>
Co-authored-by: Joel Pinto Mata (KPN-DSH-DEX team) <[email protected]>
Co-authored-by: Ellie O'Neil <[email protected]>
Co-authored-by: Ajoy Majumdar <[email protected]>
Co-authored-by: deepgarg-visa <[email protected]>
Co-authored-by: Tristan Heisler <[email protected]>
Co-authored-by: Andrew Sikowitz <[email protected]>
Co-authored-by: Davi Arnaut <[email protected]>
Co-authored-by: Pedro Silva <[email protected]>
Co-authored-by: amit-apptware <[email protected]>
Co-authored-by: Sam Black <[email protected]>
Co-authored-by: Raj Tekal <[email protected]>
Co-authored-by: Steffen Grohsschmiedt <[email protected]>
Co-authored-by: jaegwon.seo <[email protected]>
Co-authored-by: Renan F. Lima <[email protected]>
Co-authored-by: Matt Exchange <[email protected]>
Co-authored-by: Jonny Dixon <[email protected]>
Co-authored-by: Pedro Silva <[email protected]>
Co-authored-by: Pinaki Bhattacharjee <[email protected]>
Co-authored-by: Jeff Merrick <[email protected]>
Co-authored-by: skrydal <[email protected]>
Co-authored-by: AndreasHegerNuritas <[email protected]>
Co-authored-by: jayasimhankv <[email protected]>
Co-authored-by: Jay Kadambi <[email protected]>
Co-authored-by: David Leifker <[email protected]>