feat(ingest): add bigquery-queries source #10994
Conversation
Note: Review skipped. Auto reviews are disabled on this repository; check the settings in the CodeRabbit UI.

Walkthrough

The recent changes significantly enhance metadata ingestion processes for Google BigQuery and Snowflake.
Force-pushed from 505911c to 9e1b62b (Compare)
Actionable comments posted: 2
Outside diff range, codebase verification and nitpick comments (5)
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py (1)
138-142: Ensure email domain validation.

The email_domain field is used in various parts of the codebase, but there is no explicit validation to ensure it is in a valid format. Please add validation logic to ensure that the email domain, if provided, is valid.

Files to review:
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py
- Other related configuration and usage files where email_domain is used.

Analysis chain
Ensure email domain validation.
The email_domain field is optional and used for user display purposes. Ensure that the domain is valid if provided.

Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify that email domains are valid if provided.
# Test: Search for email domain usage. Expect: Valid domain format.
rg --type python -A 5 $'email_domain'
```
Length of output: 67
Script:
```bash
#!/bin/bash
# Description: Verify that email domains are valid if provided.
# Test: Search for email domain usage. Expect: Valid domain format.
rg --type py -A 5 $'email_domain'
```
Length of output: 22863
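For reference, a minimal sketch of what such a check could look like, assuming a pydantic v1 style validator like the rest of this config code; the regex and class name here are illustrative, not the repository's actual implementation.

```python
import re
from typing import Optional

from pydantic import BaseModel, validator

# Rough shape of a DNS domain such as "acme.com"; adjust as needed.
_DOMAIN_REGEX = re.compile(r"^[A-Za-z0-9](?:[A-Za-z0-9.-]*[A-Za-z0-9])?\.[A-Za-z]{2,}$")


class UsageConfigSketch(BaseModel):
    # Hypothetical stand-in for the real config class; only the field under discussion is shown.
    email_domain: Optional[str] = None

    @validator("email_domain")
    def email_domain_is_well_formed(cls, v: Optional[str]) -> Optional[str]:
        # The field is optional, so only validate when a value is provided.
        if v is not None and not _DOMAIN_REGEX.match(v):
            raise ValueError(f"email_domain does not look like a valid domain: {v!r}")
        return v
```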
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py (3)
61-65: Deprecation warning for sharded_table_pattern.

The sharded_table_pattern field is deprecated. Ensure that users are aware of this and provide alternatives if necessary.

```diff
- sharded_table_pattern: str = Field(
-     deprecated=True,
-     default=_BIGQUERY_DEFAULT_SHARDED_TABLE_REGEX,
-     description="The regex pattern to match sharded tables and group as one table. This is a very low level config parameter, only change if you know what you are doing, ",
- )
+ sharded_table_pattern: str = Field(
+     deprecated=True,
+     default=_BIGQUERY_DEFAULT_SHARDED_TABLE_REGEX,
+     description="The regex pattern to match sharded tables and group as one table. This is a very low level config parameter, only change if you know what you are doing. Please use the new configuration options provided.",
+ )
```
232-244: Handle deprecated schema_pattern field.

The schema_pattern field is deprecated in favor of dataset_pattern. Ensure that users are aware of this and provide alternatives if necessary.

```diff
- if (
-     dataset_pattern == AllowDenyPattern.allow_all()
-     and schema_pattern != AllowDenyPattern.allow_all()
- ):
-     logging.warning(
-         "dataset_pattern is not set but schema_pattern is set, using schema_pattern as dataset_pattern. schema_pattern will be deprecated, please use dataset_pattern instead."
-     )
-     values["dataset_pattern"] = schema_pattern
-     dataset_pattern = schema_pattern
+ if (
+     dataset_pattern == AllowDenyPattern.allow_all()
+     and schema_pattern != AllowDenyPattern.allow_all()
+ ):
+     logging.warning(
+         "dataset_pattern is not set but schema_pattern is set, using schema_pattern as dataset_pattern. schema_pattern will be deprecated, please use dataset_pattern instead. Please update your configuration to use dataset_pattern."
+     )
+     values["dataset_pattern"] = schema_pattern
+     dataset_pattern = schema_pattern
```
140-145: Ensure secure handling of temporary credential files.

The create_credential_temp_file method creates a temporary file for credentials, but there is no evidence of secure deletion of these files after use. Please ensure that the temporary credential files are securely deleted after they are no longer needed.

- File: metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py
- Lines: 140-145

Analysis chain

Ensure secure handling of temporary credential files.

The create_credential_temp_file method creates a temporary file for credentials. Ensure that the file is securely deleted after use.

Let's correct the rg command to properly search for Python files and check for secure deletion of temporary files created by the create_credential_temp_file method.

Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify secure handling of temporary credential files.
# Test: Search for secure deletion of temporary credential files. Expect: Secure deletion after use.
rg --type python -A 5 $'create_credential_temp_file'
```
Length of output: 82
Script:
```bash
#!/bin/bash
# Description: Verify secure handling of temporary credential files.
# Test: Search for secure deletion of temporary credential files. Expect: Secure deletion after use.
rg -t py -A 5 'create_credential_temp_file'
```
Length of output: 1545
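One way to make that guarantee is to scope the file's lifetime with a context manager so deletion happens even on error paths. This is an illustrative sketch; the function name and credential shape are assumptions, not the repository's create_credential_temp_file.

```python
import json
import os
import tempfile
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def temp_credential_file(credentials: Dict[str, str]) -> Iterator[str]:
    # mkstemp creates the file with owner-only permissions on POSIX systems.
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(credentials, f)
        yield path
    finally:
        # Best-effort deletion so the key material does not linger on disk.
        if os.path.exists(path):
            os.remove(path)
```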
metadata-ingestion/setup.py (1)
662-662: Potential Issue: Unused Entry Point

The bigquery-queries entry point is defined in setup.py and the BigQueryQueriesSource class is implemented in bigquery_queries.py, but there are no instances of this class being used elsewhere in the codebase.

- metadata-ingestion/setup.py
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_queries.py

Please review and ensure that the BigQueryQueriesSource class is utilized correctly within the codebase.

Analysis chain
LGTM! But verify the usage of the entry point in the codebase.
The addition of the bigquery-queries entry point is approved. However, ensure that the entry point is used correctly in the codebase.
Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify the usage of the `bigquery-queries` entry point.
# Test: Search for the entry point usage. Expect: Correct usage of the added entry point.
rg --type python -A 5 $'bigquery-queries'
```
Length of output: 71
Script:
```bash
#!/bin/bash
# Description: Verify the usage of the `bigquery-queries` entry point.
# Test: Search for the entry point usage. Expect: Correct usage of the added entry point.
rg --type py -A 5 'bigquery-queries'
```
Length of output: 1223
Script:
```bash
#!/bin/bash
# Description: Verify the usage of the BigQueryQueriesSource class in the codebase.
# Test: Search for the usage of BigQueryQueriesSource. Expect: Correct usage of the added entry point.
rg --type py -A 5 'BigQueryQueriesSource'
```
Length of output: 4378
Script:
```bash
#!/bin/bash
# Description: Verify the usage of the BigQueryQueriesSource class in the codebase.
# Test: Search for the usage of BigQueryQueriesSource. Expect: Correct usage of the added entry point.
rg --type py -A 5 'BigQueryQueriesSource' | grep -v 'bigquery_queries.py'
```
Length of output: 754
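For context, entry points like this are resolved at runtime by plugin name rather than by direct imports, so the absence of in-repo references to BigQueryQueriesSource is not necessarily a problem. Below is a rough sketch of how such a lookup works on Python 3.10+; the group string is an assumption, so check setup.py for the exact value.

```python
from importlib.metadata import entry_points

# Assumed plugin group name; confirm against the entry_points section of setup.py.
SOURCE_PLUGIN_GROUP = "datahub.ingestion.source.plugins"


def load_source_class(name: str):
    # Look up a source class (e.g. "bigquery-queries") by its registered entry point name.
    for ep in entry_points(group=SOURCE_PLUGIN_GROUP):
        if ep.name == name:
            return ep.load()
    raise KeyError(f"no ingestion source registered under {name!r}")
```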
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (22)
- metadata-ingestion/setup.py (2 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py (9 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_audit.py (5 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py (8 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_queries.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_report.py (3 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_schema.py (4 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_schema_gen.py (16 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_test_connection.py (3 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/common.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py (4 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/usage.py (7 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py (2 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (1 hunks)
- metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py (2 hunks)
- metadata-ingestion/tests/integration/fivetran/test_fivetran.py (1 hunks)
- metadata-ingestion/tests/performance/bigquery/test_bigquery_usage.py (2 hunks)
- metadata-ingestion/tests/unit/test_bigquery_lineage.py (3 hunks)
- metadata-ingestion/tests/unit/test_bigquery_source.py (11 hunks)
- metadata-ingestion/tests/unit/test_bigquery_usage.py (21 hunks)
- metadata-ingestion/tests/unit/test_bigqueryv2_usage_source.py (2 hunks)
Files skipped from review due to trivial changes (1)
- metadata-ingestion/tests/integration/fivetran/test_fivetran.py
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py
72-74: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling. (B904)
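For reference, the pattern B904 asks for looks like the following generic example (not the actual lines flagged in bigquery_config.py):

```python
def parse_port(raw: str) -> int:
    try:
        return int(raw)
    except ValueError as err:
        # "from err" chains the original exception so the root cause stays visible in tracebacks.
        raise ValueError(f"invalid port value: {raw!r}") from err
```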
Gitleaks
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py
109-109: Identified a Private Key, which may compromise cryptographic security and sensitive data encryption.
(private-key)
Additional comments not posted (82)
metadata-ingestion/tests/performance/bigquery/test_bigquery_usage.py (2)
14-14: Import BigQueryIdentifierBuilder

The import of BigQueryIdentifierBuilder is necessary for the changes made in the run_test function.

52-52: Utilize BigQueryIdentifierBuilder for identifier generation

The usage_extractor now uses BigQueryIdentifierBuilder for generating identifiers, which improves the clarity and maintainability of the code.

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_queries.py (7)
1-7: New imports

The imports are necessary for the new classes and methods introduced in this file.

34-40: Define BigQueryQueriesSourceReport class

The BigQueryQueriesSourceReport class extends SourceReport and includes additional fields specific to BigQuery queries.

43-48: Define BigQueryQueriesSourceConfig class

The BigQueryQueriesSourceConfig class extends multiple configuration classes and includes a connection configuration field.

51-72: Define BigQueryQueriesSource class

The BigQueryQueriesSource class initializes various components, including the BigQueryQueriesExtractor, and sets up the configuration and connection.

73-76: Implement create method

The create method parses the configuration dictionary and returns an instance of BigQueryQueriesSource.

78-83: Implement get_workunits_internal method

The get_workunits_internal method retrieves work units from the queries_extractor.

85-86: Implement get_report method

The get_report method returns the report generated during the ingestion process.

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/common.py (10)
1-2: New imports

The imports are necessary for the new classes and methods introduced in this file.

30-42: Define BigQueryIdentifierBuilder class

The BigQueryIdentifierBuilder class encapsulates methods for generating various URNs related to BigQuery datasets and users.

43-56: Implement gen_dataset_urn method

The gen_dataset_urn method generates a dataset URN based on the provided parameters.

57-63: Implement gen_dataset_urn_from_raw_ref method

The gen_dataset_urn_from_raw_ref method generates a dataset URN from a raw reference.

65-67: Implement gen_user_urn method

The gen_user_urn method generates a user URN based on the provided email.

68-70: Implement make_data_platform_urn method

The make_data_platform_urn method generates a data platform URN.

71-76: Implement make_dataplatform_instance_urn method

The make_dataplatform_instance_urn method generates a data platform instance URN based on the project ID and configuration.
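To make the shape of BigQueryIdentifierBuilder concrete, here is a simplified, illustrative sketch of a URN builder of this kind. The URN string formats follow DataHub's usual conventions, but the class itself is not the actual BigQueryIdentifierBuilder.

```python
from dataclasses import dataclass


@dataclass
class IdentifierBuilderSketch:
    env: str = "PROD"

    def gen_dataset_urn(self, project: str, dataset: str, table: str) -> str:
        # BigQuery tables are addressed as project.dataset.table on the bigquery platform.
        name = f"{project}.{dataset}.{table}"
        return f"urn:li:dataset:(urn:li:dataPlatform:bigquery,{name},{self.env})"

    def gen_user_urn(self, email: str) -> str:
        # Drop the domain in this sketch so the same person maps to a single corpuser URN.
        username = email.split("@", 1)[0]
        return f"urn:li:corpuser:{username}"
```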
79-85: Define BigQueryFilter class

The BigQueryFilter class handles filtering logic for tables based on the provided configuration.

86-98: Implement is_allowed method

The is_allowed method determines if a table is allowed based on project IDs and dataset patterns.

100-103: Implement is_project_allowed method

The is_project_allowed method determines if a project is allowed based on the provided configuration.

metadata-ingestion/tests/unit/test_bigquery_lineage.py (2)
111-111: LGTM! Verify the usage of BigQueryIdentifierBuilder.

The change to use BigQueryIdentifierBuilder enhances readability and maintainability. Ensure that BigQueryIdentifierBuilder is correctly utilized across the codebase.

12-12: LGTM! Verify the usage of BigQueryIdentifierBuilder.

The change to use BigQueryIdentifierBuilder enhances readability and maintainability. Ensure that BigQueryIdentifierBuilder is correctly utilized across the codebase. Also applies to: 86-86

Verification successful

Verified: Consistent usage of BigQueryIdentifierBuilder

The BigQueryIdentifierBuilder is used consistently across multiple files in the codebase, indicating that the change is well-integrated and should not cause any issues.
metadata-ingestion/tests/unit/test_bigquery_usage.py
metadata-ingestion/tests/unit/test_bigqueryv2_usage_source.py
metadata-ingestion/tests/unit/test_bigquery_lineage.py
metadata-ingestion/tests/performance/bigquery/test_bigquery_usage.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/common.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/usage.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_schema_gen.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_test_connection.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_queries.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py
Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify the usage of `BigQueryIdentifierBuilder` across the codebase.
# Test: Search for the usage of `BigQueryIdentifierBuilder`. Expect: Consistent usage across the codebase.
rg --type python -A 5 $'BigQueryIdentifierBuilder'
```
Length of output: 80
Script:
```bash
#!/bin/bash
# Description: Verify the usage of `BigQueryIdentifierBuilder` across the codebase.
# Test: Search for the usage of `BigQueryIdentifierBuilder`. Expect: Consistent usage across the codebase.
rg --type py -A 5 $'BigQueryIdentifierBuilder'
```
Length of output: 18757
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_report.py (2)
34-35: LGTM!

The renaming of performance timers enhances readability and understanding of their purpose.

173-174: LGTM!

The addition of the sql_aggregator attribute enhances the reporting features related to SQL data ingestion.

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_test_connection.py (2)
15-15: LGTM!

The change to use BigQueryIdentifierBuilder enhances the identifier handling mechanism. Also applies to: 138-138

162-162: LGTM!

The change to use BigQueryIdentifierBuilder enhances the identifier handling mechanism.

metadata-ingestion/tests/unit/test_bigqueryv2_usage_source.py (1)
121-126: Improved readability and structure.

The changes improve the readability and structure by defining the report variable separately before it is passed as an argument. This ensures that both the extractor and the identifier builder share the same report instance.

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py (2)

123-124: Improved modularity and separation of concerns.

The changes improve modularity and separation of concerns by instantiating BigQueryFilter and BigQueryIdentifierBuilder in the constructor.

234-238: Improved efficiency and clarity in project retrieval.

The changes improve efficiency and clarity by using the new get_projects function, which integrates filtering logic into the project retrieval process.

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py (7)
1-3: Ensure proper import ordering and usage.

The imports are correctly ordered and necessary for the functionality.

78-88: Ensure proper configuration handling.

The BigQueryQueriesExtractorConfig class handles various configurations, including time windows and local paths for audit logs. Ensure that all configurations are properly documented and validated.

110-122: Well-structured class initialization.

The BigQueryQueriesExtractor class initialization is well-structured, ensuring all necessary components are properly instantiated.

186-220: Efficient work unit generation.

The get_workunits_internal method efficiently generates metadata work units by leveraging cached audit logs and fetching query logs as needed.

221-260: Robust query log fetching.

The fetch_query_log method robustly fetches query logs from BigQuery, handling errors and logging progress.

261-282: Accurate audit log row parsing.

The _parse_audit_log_row method accurately parses audit log rows into ObservedQuery or PreparsedQuery objects, ensuring proper handling of timestamps and user information.

285-323: Comprehensive query log query building.

The _build_enriched_query_log_query function comprehensively builds query log queries, filtering by creation time and excluding information schema queries.

metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py (1)
143-145: Clarify the behavior of apply_view_usage_to_tables.

The apply_view_usage_to_tables field determines whether usage statistics from views should be applied to their base tables. Clarify if this behavior is consistent across different use cases.

Verification successful

Consistent behavior of apply_view_usage_to_tables confirmed.

The apply_view_usage_to_tables field is used across different configurations and tests, indicating consistent behavior. It determines whether usage statistics from views should be applied to their base tables, and this behavior is documented and tested in both Snowflake and BigQuery configurations.

- Files and lines:
  - metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py: Lines 143-145
  - metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py: Similar configuration for BigQuery
- Tests and usages:
  - metadata-ingestion/tests/unit/test_bigquery_usage.py
  - metadata-ingestion/tests/performance/bigquery/test_bigquery_usage.py
  - metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/usage.py
  - metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_usage_v2.py
Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify the behavior of `apply_view_usage_to_tables`.
# Test: Search for the usage of `apply_view_usage_to_tables`. Expect: Consistent behavior across use cases.
rg --type python -A 5 $'apply_view_usage_to_tables'
```
Length of output: 81
Script:
```bash
#!/bin/bash
# Description: Verify the behavior of `apply_view_usage_to_tables`.
# Test: Search for the usage of `apply_view_usage_to_tables`. Expect: Consistent behavior across use cases.
rg --type-add 'py:*.py' -A 5 'apply_view_usage_to_tables'
```
Length of output: 4777
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (1)
222-224: Enhance logging for better observability.

The logging statement provides progress updates every 1000 queries. Ensure that this granularity is appropriate for the expected query volume.
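For reference, this kind of periodic progress logging usually reduces to a modulo check inside the fetch loop; a generic sketch with the interval pulled out as a tunable constant (not the exact code under review):

```python
import logging

logger = logging.getLogger(__name__)

PROGRESS_LOG_INTERVAL = 1000  # tune to the expected query volume


def process_queries(rows):
    for i, row in enumerate(rows, start=1):
        if i % PROGRESS_LOG_INTERVAL == 0:
            logger.info("Processed %d query log rows so far", i)
        yield row
```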
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py (4)
47-50: Ensure rate limiting is handled correctly.

The rate_limit field controls whether API requests should be rate-limited. Ensure that the rate limiting logic is implemented correctly.

Verification successful

Let's correct the rg command and re-run the script to search for the rate_limit logic in Python files.

Rate limiting logic is correctly implemented.

The rate_limit field in BigQueryBaseConfig is used to control API request rates, and the logic for rate limiting is correctly implemented using the RateLimiter class across various files.
bigquery_schema_gen.py
bigquery_schema.py
Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify rate limiting logic.
# Test: Search for rate limiting logic. Expect: Correct implementation of rate limiting.
rg --type python -A 5 $'rate_limit'
```
Length of output: 65
Script:
```bash
#!/bin/bash
# Description: Verify rate limiting logic.
# Test: Search for rate limiting logic. Expect: Correct implementation of rate limiting.
rg --type-add 'python:*.py' -A 5 'rate_limit'
```
Length of output: 17093
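As background for the verification above, a rate limiter of this kind can be as small as a context manager that spaces out successive calls. This is a generic sketch, not DataHub's actual RateLimiter class:

```python
import threading
import time


class SimpleRateLimiter:
    """Allow at most `max_calls` entries per `period` seconds, sleeping otherwise."""

    def __init__(self, max_calls: int, period: float = 1.0) -> None:
        self.min_interval = period / max_calls
        self._lock = threading.Lock()
        self._last_call = 0.0

    def __enter__(self) -> "SimpleRateLimiter":
        with self._lock:
            wait = self._last_call + self.min_interval - time.monotonic()
            if wait > 0:
                time.sleep(wait)
            self._last_call = time.monotonic()
        return self

    def __exit__(self, *exc) -> None:
        pass


# Usage: wrap each API request so calls are spaced out when rate_limit is enabled.
# limiter = SimpleRateLimiter(max_calls=5)
# with limiter:
#     client.list_datasets(...)
```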
202-210: Ensure proper filtering of project IDs.

The project_ids field specifies project IDs to be ingested. Ensure that the filtering logic is correctly implemented.

Verification successful

Verified proper filtering of project IDs.

The project_ids field is correctly integrated into the filtering logic across multiple components, including unit tests and schema generation.

- Unit tests: test_bigquery_source.py
- Schema generation: bigquery_schema_gen.py
- Common filtering logic: common.py
- Configuration handling: bigquery_config.py
- Connection testing: bigquery_test_connection.py
Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify filtering logic for project IDs.
# Test: Search for filtering logic for project IDs. Expect: Correct implementation of filtering logic.
rg --type python -A 5 $'project_ids'
```
Length of output: 66
Script:
```bash
#!/bin/bash
# Description: Verify filtering logic for project IDs.
# Test: Search for filtering logic for project IDs. Expect: Correct implementation of filtering logic.
rg --type py -A 5 $'project_ids'
```
Length of output: 16739
303-304: Ensure proper validation of bigquery_audit_metadata_datasets.

The validate_bigquery_audit_metadata_datasets method validates the bigquery_audit_metadata_datasets field. Ensure that the validation logic is correctly implemented.

Verification successful

Validation logic for bigquery_audit_metadata_datasets is correctly implemented.

The validate_bigquery_audit_metadata_datasets method ensures that when use_exported_bigquery_audit_metadata is set to True, bigquery_audit_metadata_datasets must be a non-empty list. This validation logic is correctly implemented.

- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py: Lines 303-304
Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify validation logic for `bigquery_audit_metadata_datasets`.
# Test: Search for validation logic for `bigquery_audit_metadata_datasets`. Expect: Correct implementation of validation logic.
rg --type python -A 5 $'validate_bigquery_audit_metadata_datasets'
```
Length of output: 96
Script:
```bash
#!/bin/bash
# Description: Verify validation logic for `bigquery_audit_metadata_datasets`.
# Test: Search for validation logic for `bigquery_audit_metadata_datasets`. Expect: Correct implementation of validation logic.
rg --type py -A 5 'def validate_bigquery_audit_metadata_datasets'
```
Length of output: 789
Script:
```bash
#!/bin/bash
# Description: Retrieve the complete implementation of the `validate_bigquery_audit_metadata_datasets` method.
# Test: Search for the complete implementation of `validate_bigquery_audit_metadata_datasets`.
rg --type py -A 20 'def validate_bigquery_audit_metadata_datasets' metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py
```
Length of output: 757
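Based on the behavior described above, the validator presumably resembles the following pydantic v1 sketch; the field names match the review, but the class and exact logic are illustrative rather than the repository's code:

```python
from typing import List, Optional

from pydantic import BaseModel, root_validator


class AuditMetadataConfigSketch(BaseModel):
    use_exported_bigquery_audit_metadata: bool = False
    bigquery_audit_metadata_datasets: Optional[List[str]] = None

    @root_validator(skip_on_failure=True)
    def validate_bigquery_audit_metadata_datasets(cls, values):
        # Exported audit metadata can only be read if at least one dataset is configured.
        if values.get("use_exported_bigquery_audit_metadata") and not values.get(
            "bigquery_audit_metadata_datasets"
        ):
            raise ValueError(
                "bigquery_audit_metadata_datasets must be a non-empty list when "
                "use_exported_bigquery_audit_metadata is True"
            )
        return values
```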
303-304: Consolidate filtering capabilities.

The BigQueryV2Config class now inherits from BigQueryFilterConfig, consolidating filtering capabilities. Ensure that the new filtering logic is correctly implemented.

metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_audit.py (3)
379-398: LGTM! But verify the usage of the method in the codebase.

The code changes improve readability and are approved. However, ensure that the method from_exported_bigquery_audit_metadata is used correctly in the codebase.

Verification successful
The usage of the
from_exported_bigquery_audit_metadata
method has been verified in the codebase.The method is used in the following files:
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/usage.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py
The changes improve readability and are consistent with the updated method implementation.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of the `from_exported_bigquery_audit_metadata` method. # Test: Search for the method usage. Expect: Correct usage of the updated method. rg --type python -A 5 $'from_exported_bigquery_audit_metadata'Length of output: 92
Script:
#!/bin/bash # Description: Verify the usage of the `from_exported_bigquery_audit_metadata` method. # Test: Search for the method usage. Expect: Correct usage of the updated method. rg --type py -A 5 $'from_exported_bigquery_audit_metadata'Length of output: 3598
289-309
: LGTM! But verify the usage of the method in the codebase.The code changes improve readability and are approved.
However, ensure that the method
from_entry
is used correctly in the codebase.
456-475
: LGTM! But verify the usage of the method in the codebase.The code changes improve readability and are approved.
However, ensure that the method
from_entry_v2
is used correctly in the codebase.Verification successful
The usage of the
from_entry_v2
method is correct and consistent with the recent changes.The method is invoked appropriately in both
lineage.py
andusage.py
files, ensuring that the readability improvements do not impact functionality.
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/usage.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of the `from_entry_v2` method. # Test: Search for the method usage. Expect: Correct usage of the updated method. rg --type python -A 5 $'from_entry_v2'Length of output: 68
Script:
#!/bin/bash # Description: Verify the usage of the `from_entry_v2` method. # Test: Search for the method usage. Expect: Correct usage of the updated method. rg --type py -A 5 from_entry_v2Length of output: 1980
Script:
#!/bin/bash # Description: Extract relevant lines from `lineage.py` and `usage.py` to verify the usage of `from_entry_v2`. # Extract lines around the usage of `from_entry_v2` in `lineage.py` rg --type py -A 10 -B 5 'from_entry_v2' metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py # Extract lines around the usage of `from_entry_v2` in `usage.py` rg --type py -A 10 -B 5 'from_entry_v2' metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/usage.pyLength of output: 1447
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_schema.py (2)
586-616
: LGTM! But verify the usage of the function in the codebase.The code changes improve error handling and robustness and are approved.
However, ensure that the function
query_project_list
is used correctly in the codebase.Verification successful
The function
query_project_list
is used correctly within the codebase.The usage aligns with the intended functionality and the changes made to the function.
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_schema.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of the `query_project_list` function. # Test: Search for the function usage. Expect: Correct usage of the updated function. rg --type python -A 5 $'query_project_list'Length of output: 73
Script:
#!/bin/bash # Description: Verify the usage of the `query_project_list` function. # Test: Search for the function usage. Expect: Correct usage of the updated function. rg --type py -A 5 $'query_project_list'Length of output: 1277
Script:
#!/bin/bash # Description: Extract surrounding lines of the `query_project_list` usage to verify correctness. # Test: Search for the function usage and extract additional context. rg --type py -A 10 -B 10 $'query_project_list' metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_schema.pyLength of output: 1304
618-636
: LGTM! But verify the usage of the function in the codebase.The code changes improve modularity and reusability and are approved.
However, ensure that the function
get_projects
is used correctly in the codebase.Verification successful
The
get_projects
function is used correctly in the codebase.The function is integrated and tested in multiple places, indicating its correct usage.
metadata-ingestion/tests/unit/test_bigquery_source.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_schema.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of the `get_projects` function. # Test: Search for the function usage. Expect: Correct usage of the updated function. rg --type python -A 5 $'get_projects'Length of output: 67
Script:
#!/bin/bash # Description: Verify the usage of the `get_projects` function. # Test: Search for the function usage. Expect: Correct usage of the updated function. rg --type py -A 5 $'get_projects'Length of output: 15417
metadata-ingestion/setup.py (1)
354-354: LGTM! But verify the usage of the dependency in the codebase.

The addition of the bigquery-queries dependency is approved. However, ensure that the dependency is used correctly in the codebase.
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py (3)
231-236
: Verify the correctness of the newidentifiers
parameter.Ensure that the
BigQueryIdentifierBuilder
class is correctly instantiated and used throughout the class methods.Verification successful
The new
identifiers
parameter is correctly instantiated and used.The
BigQueryIdentifierBuilder
class is properly instantiated and utilized across various files, including tests and source files, confirming the correctness of the changes.
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_queries.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/common.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the instantiation and usage of the `BigQueryIdentifierBuilder` class. # Test: Search for the instantiation of `BigQueryIdentifierBuilder`. Expect: Correct instantiation. rg --type python -A 5 $'BigQueryIdentifierBuilder'Length of output: 80
Script:
#!/bin/bash # Description: Verify the instantiation and usage of the `BigQueryIdentifierBuilder` class. # Test: Search for the instantiation of `BigQueryIdentifierBuilder`. Expect: Correct instantiation. rg 'BigQueryIdentifierBuilder' -t py -A 5Length of output: 18753
433-433
: Verify the correctness of the new URN generation method.Ensure that
self.identifiers.gen_dataset_urn_from_raw_ref
correctly replaces the old callable function.Verification successful
The new URN generation method is correctly integrated and used.
The method
self.identifiers.gen_dataset_urn_from_raw_ref
is consistently used across the codebase and correctly replaces the old callable function.
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/usage.py
metadata-ingestion/tests/unit/test_bigquery_usage.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of `gen_dataset_urn_from_raw_ref` method. # Test: Search for the usage of `gen_dataset_urn_from_raw_ref`. Expect: Correct usage. rg --type python -A 5 $'gen_dataset_urn_from_raw_ref'Length of output: 83
Script:
#!/bin/bash # Description: Verify the usage of `gen_dataset_urn_from_raw_ref` method. # Test: Search for the usage of `gen_dataset_urn_from_raw_ref`. Expect: Correct usage. rg 'gen_dataset_urn_from_raw_ref' --glob '*.py' -A 5Length of output: 7930
876-878
: Verify the correctness of the new URN generation method.Ensure that
self.identifiers.gen_dataset_urn_from_raw_ref
correctly replaces the old callable function.Verification successful
Let's correct the file type and re-run the search for the
gen_dataset_urn_from_raw_ref
method usage.
To ensure the correctness of the new URN generation method, let's verify the definition and usage of the
gen_dataset_urn
method incommon.py
.
Verification successful for the new URN generation method.
The method
self.identifiers.gen_dataset_urn_from_raw_ref
correctly replaces the old callable function and is used appropriately across the codebase.
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/common.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of `gen_dataset_urn_from_raw_ref` method. # Test: Search for the usage of `gen_dataset_urn_from_raw_ref`. Expect: Correct usage. rg --type python -A 5 $'gen_dataset_urn_from_raw_ref'Length of output: 83
Script:
#!/bin/bash # Description: Verify the usage of `gen_dataset_urn_from_raw_ref` method. # Test: Search for the usage of `gen_dataset_urn_from_raw_ref`. Expect: Correct usage. rg --type py -A 5 'gen_dataset_urn_from_raw_ref'Length of output: 7926
Script:
#!/bin/bash # Description: Verify the definition and usage of `gen_dataset_urn` method. # Test: Search for the definition and usage of `gen_dataset_urn`. Expect: Correct definition and usage. rg --type py -A 5 'def gen_dataset_urn'Length of output: 4331
metadata-ingestion/tests/unit/test_bigquery_usage.py (4)
171-171
: Verify the correctness of the new URN generation method.Ensure that
identifiers.gen_dataset_urn_from_raw_ref
is correctly used to generate URNs.Verification successful
The
gen_dataset_urn_from_raw_ref
method is defined and used in multiple places within the codebase. Here are the key points of interest:
Definition:
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/common.py: def gen_dataset_urn_from_raw_ref(self, ref: BigQueryTableRef) -> str:
- This method generates a dataset URN from a raw reference.
Usages:
metadata-ingestion/tests/unit/test_bigquery_usage.py
(multiple lines)metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/usage.py
The method appears to be correctly defined and used across the codebase. Based on the gathered information, the usage of
gen_dataset_urn_from_raw_ref
intest_bigquery_usage.py
seems consistent with its intended purpose.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of `gen_dataset_urn_from_raw_ref` method in `make_usage_workunit`. # Test: Search for the usage of `gen_dataset_urn_from_raw_ref`. Expect: Correct usage. rg --type python -A 5 $'gen_dataset_urn_from_raw_ref'Length of output: 83
Script:
#!/bin/bash # Description: Verify the usage of `gen_dataset_urn_from_raw_ref` method in `make_usage_workunit`. # Test: Search for the usage of `gen_dataset_urn_from_raw_ref`. Expect: Correct usage. rg --type py -A 5 'gen_dataset_urn_from_raw_ref'Length of output: 7926
178-178
: Ensure the newresource_urn
parameter is correctly used.Verify that the
resource_urn
parameter is correctly passed and used in the function.Verification successful
The
resource_urn
parameter is correctly used in themake_operational_workunit
function.
- The
resource_urn
is correctly passed as theentityUrn
in theMetadataChangeProposalWrapper
.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of the `resource_urn` parameter in `make_operational_workunit`. # Test: Search for the usage of `resource_urn` in `make_operational_workunit`. Expect: Correct usage. rg --type python -A 5 $'def make_operational_workunit'Length of output: 86
Script:
#!/bin/bash # Description: Verify the usage of the `resource_urn` parameter in `make_operational_workunit`. # Test: Search for the usage of `resource_urn` in `make_operational_workunit`. Expect: Correct usage. rg --type py -A 5 $'def make_operational_workunit'Length of output: 594
214-217
: Ensure the newidentifiers
parameter is correctly used.Verify that the
identifiers
parameter is correctly passed and used in the function.Verification successful
The new
identifiers
parameter is correctly used in themake_zero_usage_workunit
function.
- The
identifiers
parameter is passed to themake_usage_workunit
function, ensuring proper usage.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of the `identifiers` parameter in `make_zero_usage_workunit`. # Test: Search for the usage of `identifiers` in `make_zero_usage_workunit`. Expect: Correct usage. rg --type python -A 5 $'def make_zero_usage_workunit'Length of output: 85
Script:
#!/bin/bash # Description: Verify the usage of the `identifiers` parameter in `make_zero_usage_workunit`. # Test: Search for the usage of `identifiers` in `make_zero_usage_workunit`. Expect: Correct usage. rg --type py -A 5 $'def make_zero_usage_workunit'Length of output: 561
Script:
#!/bin/bash # Description: Verify the usage of the `identifiers` parameter in `make_zero_usage_workunit`. # Test: Search for the usage of `identifiers` in `make_zero_usage_workunit` and capture more lines of context. rg --type py -A 50 $'def make_zero_usage_workunit'Length of output: 4302
165-167
: Ensure the newidentifiers
parameter is correctly used.Verify that the
identifiers
parameter is correctly passed and used in the function.Verification successful
The new
identifiers
parameter is correctly used.The
identifiers
parameter is utilized to generate theentityUrn
within themake_usage_workunit
function, ensuring correct and consistent behavior.
metadata-ingestion/tests/unit/test_bigquery_usage.py
, lines 165-167Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of the `identifiers` parameter in `make_usage_workunit`. # Test: Search for the usage of `identifiers` in `make_usage_workunit`. Expect: Correct usage. rg --type python -A 5 $'def make_usage_workunit'Length of output: 80
Script:
#!/bin/bash # Description: Verify the usage of the `identifiers` parameter in `make_usage_workunit`. # Test: Search for the usage of `identifiers` in `make_usage_workunit`. Expect: Correct usage. rg --type py -A 5 $'def make_usage_workunit'Length of output: 1853
Script:
#!/bin/bash # Description: Verify the usage of the `identifiers` parameter within the `make_usage_workunit` function. # Test: Search for the usage of `identifiers` within the `make_usage_workunit` function in `metadata-ingestion/tests/unit/test_bigquery_usage.py`. rg --type py -A 20 'def make_usage_workunit' metadata-ingestion/tests/unit/test_bigquery_usage.pyLength of output: 850
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/usage.py (6)
409-411
: Verify the correctness of the new URN generation method.Ensure that
self.identifiers.gen_dataset_urn_from_raw_ref
correctly replaces the old callable function.Verification successful
The new URN generation method
gen_dataset_urn_from_raw_ref
is correctly used and consistent across the codebase.The method is defined in
common.py
and is used appropriately inusage.py
and other relevant files.
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/common.py
: Method definition.metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/usage.py
: Multiple usages.metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py
: Multiple usages.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of `gen_dataset_urn_from_raw_ref` method. # Test: Search for the usage of `gen_dataset_urn_from_raw_ref`. Expect: Correct usage. rg --type python -A 5 $'gen_dataset_urn_from_raw_ref'Length of output: 83
Script:
#!/bin/bash # Description: Verify the usage of `gen_dataset_urn_from_raw_ref` method. # Test: Search for the usage of `gen_dataset_urn_from_raw_ref`. Expect: Correct usage. rg --type py -A 5 $'gen_dataset_urn_from_raw_ref'Length of output: 7926
542-542
: Verify the correctness of the new URN generation method.Ensure that
self.identifiers.gen_dataset_urn_from_raw_ref
correctly replaces the old callable function.Verification successful
Verify the correctness of the new URN generation method.
The method
self.identifiers.gen_dataset_urn_from_raw_ref
is used correctly and consistently across the codebase, confirming that it correctly replaces the old callable function.
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/usage.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/common.py
metadata-ingestion/tests/unit/test_bigquery_usage.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of `gen_dataset_urn_from_raw_ref` method. # Test: Search for the usage of `gen_dataset_urn_from_raw_ref`. Expect: Correct usage. rg --type python -A 5 $'gen_dataset_urn_from_raw_ref'Length of output: 83
Script:
#!/bin/bash # Description: Verify the usage of `gen_dataset_urn_from_raw_ref` method. # Test: Search for the usage of `gen_dataset_urn_from_raw_ref`. Expect: Correct usage. rg -A 5 'gen_dataset_urn_from_raw_ref'Length of output: 7916
Line range hint
1117-1117
: Verify the correctness of the new method usage.Ensure that
_get_parsed_bigquery_log_events
correctly replaces the oldget_exported_log_entries
method.Verification successful
Verification successful: The
_get_parsed_bigquery_log_events
method is used correctly in themetadata-ingestion/src/datahub/ingestion/source/bigquery_v2/usage.py
file, replacing the oldget_exported_log_entries
method. The method is defined and utilized properly within the context.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of `_get_parsed_bigquery_log_events` method. # Test: Search for the usage of `_get_parsed_bigquery_log_events`. Expect: Correct usage. rg --type python -A 5 $'_get_parsed_bigquery_log_events'Length of output: 86
Script:
#!/bin/bash # Description: Verify the usage of `_get_parsed_bigquery_log_events` method. # Test: Search for the usage of `_get_parsed_bigquery_log_events`. Expect: Correct usage. rg -A 5 '_get_parsed_bigquery_log_events'Length of output: 2177
718-719
: Verify the correctness of the new URN generation method.Ensure that
self.identifiers.gen_dataset_urn_from_raw_ref
correctly replaces the old callable function.Verification successful
The method
gen_dataset_urn_from_raw_ref
is correctly replacing the old callable function.The method is defined in
bigquery_v2/common.py
and is used consistently across the codebase, including in unit tests, indicating proper integration.
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/common.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/usage.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py
metadata-ingestion/tests/unit/test_bigquery_usage.py
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of `gen_dataset_urn_from_raw_ref` method. # Test: Search for the usage of `gen_dataset_urn_from_raw_ref`. Expect: Correct usage. rg --type python -A 5 $'gen_dataset_urn_from_raw_ref'Length of output: 83
Script:
#!/bin/bash # Description: Verify the usage of `gen_dataset_urn_from_raw_ref` method. # Test: Search for the usage of `gen_dataset_urn_from_raw_ref`. Expect: Correct usage. rg -A 5 $'gen_dataset_urn_from_raw_ref'Length of output: 7916
318-324
: Verify the correctness of the newidentifiers
parameter.Ensure that the
BigQueryIdentifierBuilder
class is correctly instantiated and used throughout the class methods.Verification successful
The new
identifiers
parameter is correctly instantiated and used throughout the class methods.The
BigQueryIdentifierBuilder
class is consistently instantiated and utilized across various files in the codebase.
- Files:
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_test_connection.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/usage.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/common.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_schema_gen.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_queries.py
- Various test files
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the instantiation and usage of the `BigQueryIdentifierBuilder` class. # Test: Search for the instantiation of `BigQueryIdentifierBuilder`. Expect: Correct instantiation. rg --type python -A 5 $'BigQueryIdentifierBuilder'Length of output: 80
Script:
#!/bin/bash # Description: Verify the instantiation and usage of the `BigQueryIdentifierBuilder` class. # Test: Search for the instantiation of `BigQueryIdentifierBuilder`. Expect: Correct instantiation. rg --type py -A 5 'BigQueryIdentifierBuilder'Length of output: 18757
724-724
: Verify the correctness of the new user URN generation method.Ensure that
self.identifiers.gen_user_urn
correctly replaces the oldmake_user_urn
function.Verification successful
Verified the correctness of the new user URN generation method.
The
gen_user_urn
methods in bothunity/source.py
andbigquery_v2/common.py
correctly preprocess the input and call themake_user_urn
function to generate the user URN.
gen_user_urn
inmetadata-ingestion/src/datahub/ingestion/source/unity/source.py
gen_user_urn
inmetadata-ingestion/src/datahub/ingestion/source/bigquery_v2/common.py
Both methods appropriately replace the old
make_user_urn
function by adding necessary preprocessing steps before callingmake_user_urn
.Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of `gen_user_urn` method. # Test: Search for the usage of `gen_user_urn`. Expect: Correct usage. rg --type python -A 5 $'gen_user_urn'Length of output: 67
Script:
#!/bin/bash # Description: Find the implementation of `gen_user_urn` method in the codebase. # Search for the definition of `gen_user_urn` method ast-grep --lang python --pattern 'def gen_user_urn($_$) { $$$ }'Length of output: 518
Script:
#!/bin/bash # Description: Find the implementation of `make_user_urn` function in the codebase. # Search for the definition of `make_user_urn` function ast-grep --lang python --pattern 'def make_user_urn($_$) { $$$ }'Length of output: 520
Script:
#!/bin/bash # Description: Find the implementation of `gen_user_urn` method in the codebase. # Search for the definition of `gen_user_urn` method ast-grep --lang python --pattern 'def gen_user_urn($_$): $$$'Length of output: 63
Script:
#!/bin/bash # Description: Find the implementation of `make_user_urn` function in the codebase. # Search for the definition of `make_user_urn` function ast-grep --lang python --pattern 'def make_user_urn($_$): $$$'Length of output: 64
Script:
#!/bin/bash # Description: Find the implementation of `gen_user_urn` method in the codebase. # Search for the definition of `gen_user_urn` method rg 'def gen_user_urn' --type py -A 5Length of output: 1305
Script:
#!/bin/bash # Description: Find the implementation of `make_user_urn` function in the codebase. # Search for the definition of `make_user_urn` function rg 'def make_user_urn' --type py -A 5Length of output: 540
metadata-ingestion/tests/unit/test_bigquery_source.py (10)
183-187
: LGTM!The function correctly calls
get_projects
and asserts the expected output.
197-201
: LGTM!The function correctly calls
get_projects
and asserts the expected output for the overridden project ID pattern.
219-223
: LGTM!The function correctly calls
get_projects
and asserts the expected output.
236-239
: LGTM!The function correctly initializes
BigQueryV2Config
and asserts the expected output.
286-290
: LGTM!The function correctly calls
get_projects
and asserts the expected output.
322-326
: LGTM!The function correctly calls
get_projects
and asserts the expected output.
347-351
: LGTM!The function correctly calls
get_projects
and asserts the expected output.
366-370
: LGTM!The function correctly calls
get_projects
and asserts the expected output.
391-395
: LGTM!The function correctly calls
get_projects
and asserts the expected output.
411-415
: LGTM!The function correctly calls
get_projects
and asserts the expected output.metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_schema_gen.py (10)
160-168
: LGTM!The
__init__
method is correctly updated to useBigQueryIdentifierBuilder
, improving code clarity and maintainability.
204-205
: LGTM!The function correctly uses
BigQueryIdentifierBuilder
for generating URNs, improving code clarity and maintainability.
215-215
: LGTM!The function correctly uses
BigQueryIdentifierBuilder
for generating platform URNs, improving code clarity and maintainability.
223-223
: LGTM!The function correctly uses
BigQueryIdentifierBuilder
for generating platform URNs, improving code clarity and maintainability.
738-740
: LGTM!The function correctly uses
BigQueryIdentifierBuilder
for generating dataset URNs, improving code clarity and maintainability.
779-781
: LGTM!The function correctly uses
BigQueryIdentifierBuilder
for generating dataset URNs, improving code clarity and maintainability.
949-949
: LGTM!The function correctly uses
BigQueryIdentifierBuilder
for generating platform URNs, improving code clarity and maintainability.
Line range hint
994-1004
:
LGTM!The function correctly uses
BigQuerySchemaApi
for retrieving tables, improving code clarity and maintainability.
1023-1023
: LGTM!The function correctly uses
BigQuerySchemaApi
for listing tables, improving code clarity and maintainability.
280-281: LGTM!

The function correctly uses BigQuerySchemaApi for retrieving datasets, improving code clarity and maintainability.

metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py (2)
86-86: LGTM! The alias ObservedQuery improves code clarity.

The alias ObservedQuery for LoggedQuery enhances readability and simplifies type annotations.
475-494: LGTM! The add method now supports ObservedQuery.

The modification to accept ObservedQuery and handle it by invoking add_observed_query is consistent with the existing pattern. Verify the correctness of add_observed_query handling.
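The dispatch being reviewed here typically reduces to an isinstance check that routes each input type to its handler. A simplified sketch of that pattern (class and method bodies are placeholders, not the aggregator's real implementation):

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class ObservedQuery:
    query: str


@dataclass
class PreparsedQuery:
    query: str


class AggregatorSketch:
    def add(self, item: Union[ObservedQuery, PreparsedQuery]) -> None:
        # Route each supported input type to its dedicated handler.
        if isinstance(item, ObservedQuery):
            self.add_observed_query(item)
        elif isinstance(item, PreparsedQuery):
            self.add_preparsed_query(item)
        else:
            raise TypeError(f"unsupported item type: {type(item).__name__}")

    def add_observed_query(self, item: ObservedQuery) -> None:
        ...

    def add_preparsed_query(self, item: PreparsedQuery) -> None:
        ...
```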
Resolved review thread: metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py
Resolved review thread (outdated): metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (2 hunks)
Files skipped from review due to trivial changes (1)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py
Files skipped from review as they are similar to previous changes (1)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py
Actionable comments posted: 2
Outside diff range, codebase verification and nitpick comments (1)
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py (1)
12-16: Consider grouping related imports together.

Group related imports for better readability.

```diff
- from datahub.configuration.time_window_config import (
-     BaseTimeWindowConfig,
-     get_time_bucket,
- )
+ from datahub.configuration.time_window_config import BaseTimeWindowConfig, get_time_bucket
```
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (3)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py (1 hunks)
- metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py (5 hunks)
- metadata-ingestion/tests/unit/sql_parsing/test_sql_aggregator.py (1 hunks)
Additional comments not posted (20)
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py (14)
54-57: LGTM!

The BigQueryTableReference class definition looks good.

60-63: LGTM!

The DMLJobStatistics class definition looks good.

66-84: LGTM!

The BigQueryJob class definition looks good.

87-113: LGTM!

The BigQueryQueriesExtractorConfig class definition looks good.

116-126: LGTM!

The BigQueryQueriesExtractorReport class definition looks good.

128-174: Ensure proper handling of missing configurations.

The constructor initializes various attributes but does not handle potential missing configurations, which could lead to runtime errors. Ensure that all required configurations are provided and handle missing configurations appropriately.

177-180: LGTM!

The structured_report property looks good.

192-199: LGTM!

The is_temp_table method looks good.

201-209: LGTM!

The is_allowed_table method looks good.
211-257: Ensure proper handling of stale audit logs.

The get_workunits_internal method mentions a TODO comment about checking if the cached audit log is stale. This should be addressed to avoid potential issues with stale data; ensure that logic is added to check whether the cached audit log is stale, along the lines of the sketch below.
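One possible shape for that staleness check, assuming the cached audit log is a local file and the configured time window is available (illustrative only):

```python
import os
from datetime import datetime, timezone


def cached_audit_log_is_stale(path: str, window_end: datetime, max_age_hours: float = 24.0) -> bool:
    # Treat a missing file as stale so the caller falls back to fetching fresh logs.
    if not os.path.exists(path):
        return True
    mtime = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    # Stale if the cache predates the requested window end by more than the allowed age.
    return (window_end - mtime).total_seconds() > max_age_hours * 3600
```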
259-287: LGTM!

The deduplicate_queries method looks good.

289-324: LGTM!

The fetch_query_log method looks good.

326-346: LGTM!

The _parse_audit_log_row method looks good.

349-389: LGTM!

The _build_enriched_query_log_query function looks good.

metadata-ingestion/tests/unit/sql_parsing/test_sql_aggregator.py (2)
Line range hint
1-13
: LGTM!The imports look good.
504-526
: LGTM!The
test_create_table_query_mcps
function looks good.metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py (4)
86-90
: LGTM!The
ObservedQuery
class definition looks good.
478-499
: LGTM!The
add
method implementation looks good.
Line range hint 642-686: LGTM! The `add_observed_query` method implementation looks good.
1158-1160: LGTM! The guard clause in the `_gen_lineage_for_downstream` method looks good.
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py
cdfb94f to fa2db0a
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (4)
- metadata-ingestion/setup.py (2 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py (9 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_audit.py (5 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py (8 hunks)
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py
72-74: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling. (B904)
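For reference, a minimal example of the pattern B904 asks for, using generic names rather than the actual validator:

```python
import re

def compile_pattern(pattern: str):
    try:
        return re.compile(pattern)
    except re.error as e:
        # Chaining with "from e" keeps the original re.error in the traceback,
        # so the root cause is not masked by the new exception.
        raise ValueError(f"Invalid regex pattern: {pattern!r}") from e
```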
Gitleaks
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py
109-109: Identified a Private Key, which may compromise cryptographic security and sensitive data encryption.
(private-key)
Additional comments not posted (14)
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py (5)
123-124: Initialization of `BigQueryFilter` and `BigQueryIdentifierBuilder` looks good. The changes are consistent and correctly initialize the new objects.
234-238: Call to `get_projects` and handling of the projects list looks good. The changes correctly call `get_projects` and handle the projects list as intended.
Line range hint 15-15: Verify the removal of `gen_dataset_urn`. The method `gen_dataset_urn` has been removed. Ensure that this change is intentional and does not break any functionality.
Line range hint 15-15: Verify the removal of `gen_dataset_urn_from_raw_ref`. The method `gen_dataset_urn_from_raw_ref` has been removed. Ensure that this change is intentional and does not break any functionality.
Line range hint 15-15: Verify the removal of `_get_projects` and `_query_project_list`. The methods `_get_projects` and `_query_project_list` have been removed. Ensure that this change is intentional and does not break any functionality.
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py (4)
47-75: New fields and validator method in `BigQueryBaseConfig` look good. The new fields are correctly defined and the validator method is correctly implemented.
Tools
Ruff
72-74: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling. (B904)
105-145: New fields and methods in `BigQueryCredential` look good. The new fields are correctly defined and the methods are correctly implemented.
Tools
Gitleaks
109-109: Identified a Private Key, which may compromise cryptographic security and sensitive data encryption.
(private-key)
202-281: New fields and root validator method in `BigQueryFilterConfig` look good. The new fields are correctly defined and the root validator method is correctly implemented.
Line range hint 299-372: New fields and root validator method in `BigQueryV2Config` look good. The new fields are correctly defined and the root validator method is correctly implemented.
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_audit.py (3)
289-309: Changes to parameter assignments in `from_entry` look good. The changes are correctly implemented and maintain the intended functionality.
379-398: Changes to parameter assignments in `from_exported_bigquery_audit_metadata` look good. The changes are correctly implemented and maintain the intended functionality.
456-475: Changes to parameter assignments in `from_entry_v2` look good. The changes are correctly implemented and maintain the intended functionality.
metadata-ingestion/setup.py (2)
354-354: Approved: Addition of `bigquery-queries` plugin. The addition of the `bigquery-queries` plugin with the appropriate dependencies (`sql_common`, `bigquery_common`, `sqlglot_lib`) is consistent with the goal of enhancing BigQuery query handling.
662-662: Approved: Addition of `bigquery-queries` entry point. The addition of the `bigquery-queries` entry point, linking to `BigQueryQueriesSource`, is consistent with the goal of enhancing BigQuery query handling.
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (4)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py (1 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_create_table_query_mcps.json (1 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_lineage_via_temp_table_disordered_add.json (1 hunks)
- metadata-ingestion/tests/unit/sql_parsing/test_sql_aggregator.py (1 hunks)
Files skipped from review as they are similar to previous changes (1)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py
Additional comments not posted (4)
metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_create_table_query_mcps.json (1)
1-21: JSON structure is valid and well-formed. The JSON structure for the metadata change proposal (MCP) is correctly formatted and aligns with the expected schema for representing dataset operations. It includes necessary fields such as `entityType`, `entityUrn`, `changeType`, `aspectName`, and `aspect`.
metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_lineage_via_temp_table_disordered_add.json (1)
1-78: JSON structure is valid and well-formed. The JSON structure for the metadata change proposals (MCPs) is correctly formatted and aligns with the expected schema for representing lineage and query properties. It includes necessary fields such as `entityType`, `entityUrn`, `changeType`, `aspectName`, and `aspect`.
metadata-ingestion/tests/unit/sql_parsing/test_sql_aggregator.py (2)
505-526: New test function `test_create_table_query_mcps` is well-structured. The test function validates the `SqlParsingAggregator` for processing a SQL query that creates a table in a BigQuery environment. It correctly initializes the aggregator, adds an observed query, and checks the generated metadata against a golden file. The test enhances coverage for create table operations.
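Schematically, such a test follows the sketch below; the constructor arguments, `ObservedQuery` fields, and the golden-file helper are assumptions for illustration rather than the exact APIs used in the suite:

```python
from datahub.sql_parsing.sql_parsing_aggregator import ObservedQuery, SqlParsingAggregator

def test_create_table_query_mcps_sketch():
    # Hypothetical sketch: build the aggregator, feed it one observed query,
    # and diff the emitted MCPs against a checked-in golden file.
    aggregator = SqlParsingAggregator(platform="bigquery")  # arguments assumed
    aggregator.add(
        ObservedQuery(  # field names assumed
            query="create table `proj.dataset.foo` as select * from `proj.dataset.bar`",
            default_db="proj",
        )
    )
    mcps = list(aggregator.gen_metadata())  # method name assumed
    assert_matches_golden(  # hypothetical golden-file comparison helper
        mcps, "aggregator_goldens/test_create_table_query_mcps.json"
    )
```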
530-559: New test function `test_lineage_via_temp_table_disordered_add` is well-structured. The test function validates the `SqlParsingAggregator` for handling lineage through temporary tables in a Redshift environment. It correctly sets up schema information, adds observed queries, and checks the generated metadata against a golden file. The test enhances coverage for lineage tracking through temporary tables.
- update filtering in bigquery-queries source to use allow deny patterns
otherwise an exception is thrown during parsing, e.g. "list index out of range"
a569bca to d997bb6
Actionable comments posted: 1
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (25)
- metadata-ingestion/setup.py (2 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py (9 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_audit.py (5 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py (8 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_queries.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_report.py (3 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_schema.py (4 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_schema_gen.py (16 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_test_connection.py (3 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/common.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py (4 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py (1 hunks)
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/usage.py (7 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py (2 hunks)
- metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (2 hunks)
- metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py (5 hunks)
- metadata-ingestion/tests/integration/fivetran/test_fivetran.py (1 hunks)
- metadata-ingestion/tests/performance/bigquery/test_bigquery_usage.py (2 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_create_table_query_mcps.json (1 hunks)
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_lineage_via_temp_table_disordered_add.json (1 hunks)
- metadata-ingestion/tests/unit/sql_parsing/test_sql_aggregator.py (1 hunks)
- metadata-ingestion/tests/unit/test_bigquery_lineage.py (3 hunks)
- metadata-ingestion/tests/unit/test_bigquery_source.py (11 hunks)
- metadata-ingestion/tests/unit/test_bigquery_usage.py (21 hunks)
- metadata-ingestion/tests/unit/test_bigqueryv2_usage_source.py (2 hunks)
Files skipped from review due to trivial changes (2)
- metadata-ingestion/tests/integration/fivetran/test_fivetran.py
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_lineage_via_temp_table_disordered_add.json
Files skipped from review as they are similar to previous changes (5)
- metadata-ingestion/setup.py
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_audit.py
- metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py
- metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_create_table_query_mcps.json
- metadata-ingestion/tests/unit/sql_parsing/test_sql_aggregator.py
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py
72-74: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling. (B904)
Gitleaks
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py
109-109: Identified a Private Key, which may compromise cryptographic security and sensitive data encryption.
(private-key)
Additional comments not posted (55)
metadata-ingestion/tests/performance/bigquery/test_bigquery_usage.py (1)
52-52
: Refactor approved: Use ofBigQueryIdentifierBuilder
.The refactor enhances maintainability and readability by encapsulating identifier logic within
BigQueryIdentifierBuilder
.Ensure that the integration of
BigQueryIdentifierBuilder
is consistent and correct throughout the codebase.Verification successful
Integration of
BigQueryIdentifierBuilder
is consistent and correct.The
BigQueryIdentifierBuilder
is widely used across multiple files, including both test and source files, indicating a well-integrated refactor.
- Verified files include:
metadata-ingestion/tests/unit/test_bigquery_usage.py
metadata-ingestion/tests/unit/test_bigquery_lineage.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/usage.py
- And several others.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the integration of `BigQueryIdentifierBuilder` across the codebase. # Test: Search for the class usage. Expect: Consistent and correct instantiation and usage. rg --type python -A 5 $'BigQueryIdentifierBuilder'Length of output: 80
Script:
#!/bin/bash # Description: Verify the integration of `BigQueryIdentifierBuilder` across the codebase. # Test: Search for the class usage. Expect: Consistent and correct instantiation and usage. rg --glob '*.py' -A 5 'BigQueryIdentifierBuilder'Length of output: 18036
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_queries.py (2)
34-40
: Class structure approved:BigQueryQueriesSourceReport
.The class is well-structured for encapsulating query extraction and schema performance metrics.
51-55
: Class design approved:BigQueryQueriesSource
.The class effectively manages the extraction process with clear separation of concerns.
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/common.py (2)
30-55
: Class approved:BigQueryIdentifierBuilder
.The class provides a comprehensive approach to URN generation, ensuring consistent identifier handling.
79-103
: Class approved:BigQueryFilter
.The filtering logic is robust, effectively utilizing patterns and configurations to enforce rules.
Ensure that all necessary filtering criteria are covered and correctly implemented.
metadata-ingestion/tests/unit/test_bigquery_lineage.py (2)
86-86
: Use ofBigQueryIdentifierBuilder
improves maintainability.The introduction of
BigQueryIdentifierBuilder
encapsulates URN generation, enhancing clarity and maintainability.
111-111
: Consistent use ofBigQueryIdentifierBuilder
enhances code clarity.The change aligns with the goal of encapsulating identifier logic within a dedicated class.
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_report.py (2)
34-35
: Renaming attributes improves clarity.Renaming
list_projects
andlist_datasets
tolist_projects_timer
andlist_datasets_timer
clarifies their purpose as performance timers.
174-174
: Addition ofsql_aggregator
enhances reporting capabilities.The
sql_aggregator
attribute is a valuable addition for SQL query aggregation and reporting.metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_test_connection.py (2)
138-138
: Use ofBigQueryIdentifierBuilder
improves maintainability.The introduction of
BigQueryIdentifierBuilder
encapsulates identifier logic, enhancing clarity and maintainability.
162-162
: Consistent use ofBigQueryIdentifierBuilder
enhances code clarity.The change aligns with the goal of encapsulating identifier logic within a dedicated class.
metadata-ingestion/tests/unit/test_bigqueryv2_usage_source.py (1)
121-126
: Improved readability and functionality in test setup.The refactoring to reuse the
report
instance and update theidentifiers
parameter withBigQueryIdentifierBuilder
enhances the readability and functionality of the test setup.metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery.py (2)
123-124
: Centralized filtering and identification logic.The introduction of
BigQueryFilter
andBigQueryIdentifierBuilder
centralizes the logic for filtering and identification, enhancing code clarity and maintainability.
234-238
: Simplified project retrieval logic.The direct invocation of
get_projects
streamlines the project retrieval process, reducing redundancy and improving code clarity.metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py (1)
138-146
: Enhanced configurability withSnowflakeUsageConfig
.The addition of
SnowflakeUsageConfig
with fieldsemail_domain
andapply_view_usage_to_tables
enhances configurability for Snowflake usage settings, allowing for more tailored tracking.metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_queries.py (2)
222-224
: Improved logging granularity inget_workunits_internal
.The logging now tracks the number of query log entries added to the SQL aggregator every 1000 entries, providing better visibility into the process.
280-281
: Modify logging condition infetch_query_log
.The logging now starts after the first row, reducing unnecessary log entries and focusing on the progress of subsequent rows.
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py (6)
38-44
: New regex pattern for sharded tables.The
_BIGQUERY_DEFAULT_SHARDED_TABLE_REGEX
provides a pattern to identify sharded tables by checking for valid date suffixes. This improves the detection of sharded tables.
47-75
: Enhance exception handling insharded_table_pattern_is_a_valid_regexp
.Consider using
raise ... from err
to distinguish exceptions from errors in exception handling.Tools
Ruff
72-74: Within an
except
clause, raise exceptions withraise ... from err
orraise ... from None
to distinguish them from errors in exception handling(B904)
105-110
: Security concern: Identified a private key.Ensure that the
private_key
field is handled securely and not exposed in logs or error messages.Tools
Gitleaks
109-109: Identified a Private Key, which may compromise cryptographic security and sensitive data encryption.
(private-key)
202-231
: IntroduceBigQueryFilterConfig
for flexible filtering.The
BigQueryFilterConfig
class provides regex patterns for filtering projects, datasets, and table snapshots, enhancing flexibility in data ingestion configurations.
284-297
: AddBigQueryIdentifierConfig
for identifier management.The class introduces fields for managing data platform instances and legacy sharded table support, improving identifier configuration.
299-303
: UpdateBigQueryV2Config
to include new configurations.The
BigQueryV2Config
class now includesBigQueryFilterConfig
andBigQueryIdentifierConfig
, enhancing its configuration capabilities.metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_schema.py (2)
586-616
: Addquery_project_list
function for project retrieval.This function retrieves a list of projects with error handling and filtering based on project ID patterns. It enhances the robustness of project data retrieval.
618-636
: Introduceget_projects
function for simplified project access.The function provides a straightforward interface to obtain projects, either by specific IDs or by querying the project list, centralizing project retrieval logic.
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/lineage.py (3)
231-236
: Constructor Update: UseBigQueryIdentifierBuilder
.The constructor now uses
BigQueryIdentifierBuilder
for generating identifiers. This change improves modularity and encapsulates identifier logic, enhancing maintainability.
433-433
: Useidentifiers
for URN generation.The line now uses
identifiers.gen_dataset_urn_from_raw_ref(table_ref)
to generate dataset URNs. This centralizes URN generation logic, improving consistency and readability.
876-878
: Useidentifiers
for upstream table URN generation.The use of
identifiers.gen_dataset_urn_from_raw_ref(upstream_table)
ensures consistent URN generation for upstream tables, aligning with the new identifier management approach.metadata-ingestion/tests/unit/test_bigquery_usage.py (7)
165-167
: Addidentifiers
parameter tomake_usage_workunit
.The function now requires
identifiers
to generate URNs, enhancing consistency with the new identifier management approach.
178-181
: Updatemake_operational_workunit
to useresource_urn
.The function now takes
resource_urn
directly, simplifying the interface and aligning with the new URN generation strategy.
214-217
: Addidentifiers
parameter tomake_zero_usage_workunit
.The function now includes
identifiers
for URN generation, ensuring consistency with the updated URN management approach.
209-209
: InstantiateBigQueryIdentifierBuilder
inusage_extractor
.The
usage_extractor
fixture now includesidentifiers
, aligning with the new URN generation strategy and ensuring consistent identifier management.
301-304
: Update test case to useidentifiers
.The test case now includes
identifiers
when callingmake_usage_workunit
, reflecting the updated function signature and ensuring proper URN generation.
385-385
: Update test cases to useidentifiers
.Test cases are updated to include
identifiers
when callingmake_usage_workunit
, ensuring consistency with the new function signature.Also applies to: 413-413, 445-445, 490-490, 511-511, 545-545, 636-636, 679-679, 729-729, 781-781, 811-811, 890-890, 1010-1010
1056-1060
: Useidentifiers
for URN generation in operational stats test.The test now uses
identifiers.gen_dataset_urn_from_raw_ref
to generate URNs, aligning with the new identifier management strategy.Also applies to: 1077-1081, 1088-1090
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/usage.py (4)
318-324
: Constructor Update: UseBigQueryIdentifierBuilder
.The constructor now uses
BigQueryIdentifierBuilder
, centralizing URN generation logic and improving maintainability.
409-411
: Useidentifiers
for URN generation in_get_workunits_internal
.The method now generates dataset URNs using
identifiers.gen_dataset_urn_from_raw_ref
, ensuring consistency and readability.
717-719
: Useidentifiers
for URN generation in_create_operation_workunit
.The method now uses
identifiers
for generating URNs for affected datasets and destination tables, aligning with the new identifier management strategy.Also applies to: 738-738
724-724
: Useidentifiers
for user URN generation.The method now uses
identifiers.gen_user_urn
for generating user URNs, centralizing identifier logic and improving consistency.metadata-ingestion/tests/unit/test_bigquery_source.py (8)
183-187: LGTM! The test correctly verifies the behavior of `get_projects` with `project_ids`, ensuring no unnecessary API calls are made.
219-223: LGTM! The test accurately checks the override behavior of `project_ids` over `project_id_pattern`.
236-239: LGTM! The test correctly verifies the backward compatibility of `project_ids` with `project_id`.
286-290: LGTM! The test correctly verifies the behavior of `get_projects` with a single `project_id`.
322-326: LGTM! The test correctly verifies the behavior of `get_projects` with a paginated list of projects.
347-351: LGTM! The test accurately checks the filtering behavior of `get_projects` using `project_id_pattern`.
366-370: LGTM! The test correctly verifies the behavior of `get_projects` when no projects are returned.
391-395: LGTM! The test correctly verifies the error handling behavior of `get_projects` during API call failures.
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_schema_gen.py (4)
160-168: LGTM! The refactoring to use an `identifiers` object centralizes identifier generation, improving maintainability and readability.
204-205: LGTM! The use of the `identifiers` object for URN generation aligns with the refactoring goals, maintaining functionality and improving clarity.
215-215: LGTM! The method's use of the `identifiers` object for platform information is consistent with the refactoring goals.
223-223: LGTM! The method's use of the `identifiers` object for platform information is consistent with the refactoring goals.
metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py (4)
86-89: LGTM! The `ObservedQuery` class is well-defined. The class correctly extends `LoggedQuery` and introduces new attributes with appropriate defaults.
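In outline, the relationship looks like the sketch below; the extra fields are illustrative rather than the exact attribute list:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class LoggedQuerySketch:
    query: str
    session_id: Optional[str] = None
    timestamp: Optional[datetime] = None
    user: Optional[str] = None
    default_db: Optional[str] = None
    default_schema: Optional[str] = None

@dataclass
class ObservedQuerySketch(LoggedQuerySketch):
    # Additional fields with defaults, in the spirit of the reviewed class.
    query_hash: Optional[str] = None
    usage_multiplier: int = 1
```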
478-499: LGTM! The `add` method handles `ObservedQuery` instances appropriately. The method correctly delegates processing to `add_observed_query` for `ObservedQuery` instances.
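The delegation is simple type-based dispatch; a self-contained toy version of the pattern:

```python
class ObservedQueryStub:
    """Stand-in for the real ObservedQuery class."""

class PreparsedQueryStub:
    """Stand-in for another supported input type."""

class AggregatorSketch:
    def add(self, item) -> None:
        # Route each supported input to its dedicated handler.
        if isinstance(item, ObservedQueryStub):
            self.add_observed_query(item)
        elif isinstance(item, PreparsedQueryStub):
            self.add_preparsed_query(item)
        else:
            raise ValueError(f"Unsupported item type: {type(item)}")

    def add_observed_query(self, query: ObservedQueryStub) -> None:
        print("handling observed query")

    def add_preparsed_query(self, query: PreparsedQueryStub) -> None:
        print("handling preparsed query")

AggregatorSketch().add(ObservedQueryStub())
```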
642-642: LGTM! The `add_observed_query` method is effectively optimized. The inclusion of `query_hash` for conditional assignment of the query fingerprint enhances the method's efficiency.
1158-1160: LGTM! The guard clause in `_gen_lineage_for_downstream` enhances robustness. The addition prevents unnecessary processing when there are no upstream aspects or fine-grained lineages.
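The guard clause is just an early return before any aspect is emitted; schematically (attribute names assumed):

```python
def gen_lineage_sketch(upstreams: list, fine_grained_lineages: list):
    # Early return: nothing to emit when there is neither table-level
    # nor column-level lineage for this downstream.
    if not upstreams and not fine_grained_lineages:
        return
    yield {"upstreams": upstreams, "fineGrainedLineages": fine_grained_lineages}
```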
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_queries.py
Overall looking pretty good.
I would like to think about how we might effectively test this code. I suspect the `local_temp_path` might come in handy.
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/bigquery_config.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py
metadata-ingestion/src/datahub/ingestion/source/snowflake/snowflake_config.py
metadata-ingestion/src/datahub/ingestion/source/bigquery_v2/queries_extractor.py
metadata-ingestion/tests/unit/sql_parsing/aggregator_goldens/test_create_table_query_mcps.json
metadata-ingestion/tests/unit/sql_parsing/test_sql_aggregator.py
metadata-ingestion/src/datahub/sql_parsing/sql_parsing_aggregator.py
6d14175 to 4153ac5
metadata-ingestion/tests/integration/bigquery_v2/audit_log.sqlite
- address review comments
- support adding extra_info for debugging with queries
- fix usage issue, add unit test for sql aggregator usage
Uses queries from `INFORMATION_SCHEMA.JOBS`, along with the `SqlParsingAggregator`, to generate the "Query" entity and its aspects, plus each Dataset's `datasetUsageStatistics`, lineage, and operation aspects.
Checklist
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests