BigQuery: deprecate client.dataset()
part 1
#9032
Conversation
sync forks
*.rst + *.py + test + conf
Methods were divided into 3 files: add label, get labels, delete labels; *.rst docs updated; tests passed successfully
minor corrections; 'dataset_exists' moved to the 'Getting a Dataset' section
grammar fix
deleted extra 'samples/'
changed quotes + added dots
import updates
changed assertion unit
corrected the test asserts
* deleted unnecessary schema (add_empty_column)
* added 'get_dataset' call to check that the dataset actually was updated (label_dataset & delete_dataset_labels)
* renamed the fixture to something more suitable (conftest & test_delete_dataset_labels & test_get_dataset_labels & list_datasets_by_label)
* deleted unnecessary 'labels' variable (test_label_dataset)
All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are OK with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only `@googlebot I consent.` Note to project maintainer: there may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s).
bigquery/samples/label_dataset.py
Outdated
dataset = client.update_dataset(dataset, ["labels"])

dataset = client.get_dataset(dataset_id)
This line was added to check that the changes were actually applied on the server.
`update_dataset` returns a full dataset resource, so this should be unnecessary in the sample. I agree that it makes sense to have this in the test for the sample.
Suggested change:
- dataset = client.get_dataset(dataset_id)
dataset = client.update_dataset(dataset, ["labels"])

dataset = client.get_dataset(dataset_id)
Let's move this to the sample test. Since `update_dataset` returns the full resource, `get_dataset` after `update_dataset` is unnecessary for normal use.
Suggested change:
- dataset = client.get_dataset(dataset_id)
@googlebot I consent
CLAs look good, thanks!
Thank you for your contribution. Looking pretty good, but I'd like to make some improvements to the samples while we are making these changes.
Thank you for your patience regarding the code review.
bigquery/samples/label_dataset.py
Outdated
# dataset_id = "your-project.your_dataset"


dataset = client.get_dataset(dataset_id)
This is too much whitespace for a code sample.
bigquery/samples/label_dataset.py
Outdated
dataset = client.get_dataset(dataset_id)

dataset.labels = {"color": "green"}
I think it makes more sense to group all these lines together.
# Load all rows from a table
rows = client.list_rows(table)
if len(list(rows)) == table.num_rows:
Suggested change:
- if len(list(rows)) == table.num_rows:
+ # Iterate over rows to make the API requests to fetch row data.
+ rows = list(rows_iter)
I know we had asserts in these samples before, but I've changed my mind regarding these statements. I think this causes more confusion than it explains and should be removed from the sample.
def test_get_dataset_labels(capsys, client, dataset_id, dataset_with_labels_id):

    get_dataset_labels.get_dataset_labels(client, dataset_id)
Shouldn't this be `dataset_with_labels_id`? The `dataset_id` fixture is not needed, right?
bigquery/samples/tests/conftest.py
Outdated
@@ -78,6 +87,13 @@ def table_id(client, dataset_id):
    client.delete_table(table, not_found_ok=True)


@pytest.fixture
def table_w_data(client):
Let's call this `table_with_data`.
I changed this fixture to return the `table_id` instead of the table reference. That's why I assume a better name would be `table_with_data_id`.
bigquery/samples/tests/conftest.py
Outdated
@@ -65,6 +65,15 @@ def dataset_id(client):
    client.delete_dataset(dataset, delete_contents=True, not_found_ok=True)


@pytest.fixture
def dataset_with_labels_id(client, dataset_id):
Optional: Rather than make this fixture, we could combine all the "labels" tests into a common test file. See the models samples for an example.
I decided to follow your advice and combined all `dataset_label` tests into one.
dataset = client.get_dataset(dataset_id)

print("Dataset ID: {}".format(dataset_id))
I wonder if we really want to be showing how to print the labels every time? It feels redundant. Let's instead make a call to `get_dataset` in the tests for this sample and make sure the color label has been removed.
print("First {} rows of the table are loaded".format(number_of_rows))

# Specify selected fields to limit the results to certain columns
fields = table.schema[:2]  # first two columns
@tswast Most of the changes are clear to me, but I assume there's going to be a problem here. Since I removed `table = client.get_table(table_id)`, I can't find a way to select only two columns to pass to the `selected_fields` option.
Makes sense. Yes, we can keep the `get_table` call for this part of the sample.
Looking good. Thank you for making those updates. Just a few more suggestions.
bigquery/samples/tests/conftest.py
Outdated
def table_with_data_id(client):
    dataset = client.get_dataset("bigquery-public-data.samples")
    table = dataset.table("shakespeare")
    return "{}.{}.{}".format(table.project, table.dataset_id, table.table_id)
Suggested change:
- return "{}.{}.{}".format(table.project, table.dataset_id, table.table_id)
+ return "bigquery-public-data.samples.shakespeare"
bigquery/samples/tests/conftest.py
Outdated
@pytest.fixture
def table_with_data_id(client):
    dataset = client.get_dataset("bigquery-public-data.samples")
    table = dataset.table("shakespeare")
Suggested change:
- table = dataset.table("shakespeare")
bigquery/samples/tests/conftest.py
Outdated
@@ -78,6 +78,13 @@ def table_id(client, dataset_id):
    client.delete_table(table, not_found_ok=True)


@pytest.fixture
def table_with_data_id(client):
    dataset = client.get_dataset("bigquery-public-data.samples")
The API call to `get_dataset` is unnecessary.

Suggested change:
- dataset = client.get_dataset("bigquery-public-data.samples")
if datasets:
    print("Datasets filtered by {}:".format(label_filter))
    for dataset in datasets:  # API request(s)
Since you've already called `list` on the return value, this actually doesn't make any additional API requests.

Suggested change:
- for dataset in datasets:  # API request(s)
+ for dataset in datasets:
* rewrote table_with_data_id fixture
* removed extra "# API request" comments
dataset = client.update_dataset(dataset, ["labels"])
print("Labels deleted from {}".format(dataset_id))
# [END bigquery_delete_label_dataset]
Let's have this function return `dataset` after the `# [END ...]` line, so it's outside the sample. Then update the test for this sample to verify that `dataset.labels.get("color") == None`.
Looks great! Thanks for your patience with the review.
You might have to update to master to get the fix for the failing unit tests (due to a resumable upload package release).
@tswast, all checks have finished OK, so I think it can be merged
Deprecate `client.dataset()` part 1
Towards #8989
This PR contains five snippets: