Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for estimated_cell_count in project.json (#3299) #3361

Merged

Conversation

dsotirho-ucsc
Copy link
Contributor

@dsotirho-ucsc dsotirho-ucsc commented Aug 23, 2021

#3299

Author

  • PR title references issue
  • Title of main commit references issue
  • PR is connected to Zenhub issue and description links to issue

Author (reindex)

  • Added r tag to commit title or this PR does not require reindexing
  • Added reindex label to PR or this PR does not require reindexing

Author (freebies & chains)

  • Freebies are blocked on this PR or there are no freebies in this PR
  • Freebies are referenced in commit titles or there are no freebies in this PR
  • This PR is blocked by previous PR in the chain or this PR is not chained to another PR
  • Added chain label to the blocking PR or this PR is not chained to another PR

Author (upgrading)

  • Documented upgrading of deployments in UPGRADING.rst or this PR does not require upgrading
  • Added u tag to commit title or this PR does not require upgrading
  • Added upgrade label to PR or this PR does not require upgrading
  • Added announcement to PR description or this PR does not require announcement

Author (requirements, before every review)

  • Ran make requirements_update or this PR leaves requirements*.txt, common.mk and Makefile untouched
  • Added R tag to commit title or this PR leaves requirements*.txt untouched
  • Added reqs label to PR or this PR leaves requirements*.txt untouched

Author (before every review)

  • make integration_test passes in personal deployment or this PR does not touch functionality that could break the IT
  • Rebased branch on develop, squashed old fixups

Primary reviewer (after approval)

  • Commented in issue about demo expectations or labelled issue as no demo
  • Decided if PR can be labeled no sandbox
  • PR title is appropriate as title of merge commit
  • Moved ticket to Approved column
  • Assigned PR to an operator

Operator (before pushing merge the commit)

  • Checked reindex label and r commit title tag
  • Checked that demo expectations are clear or issue is labeled as no demo
  • Rebased and squashed branch
  • Sanity-checked history
  • Pushed PR branch to Github
  • Branch pushed to Gitlab and added sandbox label or PR is labeled no sandbox
  • Build passed in sandbox or PR is labeled no sandbox
  • Started reindex in sandbox or this PR does not require reindexing sandbox
  • Checked for failures in sandbox or this PR does not require reindexing sandbox
  • Added PR reference to merge commit title
  • Collected commit title tags in merge commit title
  • Moved linked issue to Merged column
  • Pushed merge commit to Github

Operator (after pushing the merge commit)

  • Made announcement requested by author or PR description does not contain an announcement
  • Moved freebies to Merged column or there are no freebies in this PR
  • Shortened the PR chain or this PR is not the base of another PR
  • Verified that N reviews labelling is accurate
  • Pushed merge commit to Gitlab or this changes can be pushed later, together with another PR
  • Deleted PR branch from Github and Gitlab

Operator (reindex)

  • Started reindex in dev or this PR does not require reindexing or does not target dev
  • Checked for failures in dev or this PR does not require reindexing or does not target dev
  • Started reindex in prod or this PR does not require reindexing or does not target prod
  • Checked for failures in prod or this PR does not require reindexing or does not target prod

Operator

  • Unassigned PR
  • Assigned PR back to author

Author

  • Verify demoability in dev before stand-up on 10/11/2021
  • Report PR landing on dev to Slack thread in dcp-2 channel

@github-actions github-actions bot added the orange [process] Done by the Azul team label Aug 23, 2021
@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/danielsotirhos/3299-project-estimated-cell-count branch from 89e6522 to 4f0b34c Compare August 23, 2021 17:39
@dsotirho-ucsc dsotirho-ucsc added the reindex:dev [process] PR requires reindexing dev label Aug 23, 2021
@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/danielsotirhos/3299-project-estimated-cell-count branch 2 times, most recently from 9ed8787 to b28ca61 Compare August 23, 2021 18:18
@codecov
Copy link

codecov bot commented Aug 23, 2021

Codecov Report

Merging #3361 (dca5464) into develop (614f551) will increase coverage by 0.21%.
The diff coverage is 100.00%.

❗ Current head dca5464 differs from pull request most recent head 75eeb03. Consider uploading reports for the commit 75eeb03 to get more accurate results
Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #3361      +/-   ##
===========================================
+ Coverage    82.18%   82.39%   +0.21%     
===========================================
  Files          124      123       -1     
  Lines        14456    14192     -264     
===========================================
- Hits         11880    11693     -187     
+ Misses        2576     2499      -77     
Impacted Files Coverage Δ
src/azul/plugins/metadata/hca/__init__.py 100.00% <ø> (ø)
src/azul/plugins/metadata/hca/transform.py 98.62% <ø> (-0.06%) ⬇️
src/azul/service/index_query_service.py 89.69% <ø> (ø)
test/azul_test_case.py 74.50% <ø> (-2.53%) ⬇️
test/service/test_repository_projects.py 100.00% <ø> (ø)
src/azul/plugins/metadata/hca/aggregate.py 97.82% <100.00%> (-0.07%) ⬇️
src/azul/service/avro_pfb.py 96.92% <100.00%> (+0.82%) ⬆️
src/azul/service/elasticsearch_service.py 82.75% <100.00%> (-0.47%) ⬇️
src/azul/service/hca_response_v5.py 92.54% <100.00%> (ø)
test/indexer/test_hca_indexer.py 99.31% <100.00%> (+0.24%) ⬆️
... and 44 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 614f551...75eeb03. Read the comment docs.

@coveralls
Copy link

coveralls commented Aug 23, 2021

Coverage Status

Coverage increased (+0.08%) to 82.501% when pulling 75eeb03 on issues/danielsotirhos/3299-project-estimated-cell-count into 614f551 on develop.

@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/danielsotirhos/3299-project-estimated-cell-count branch 3 times, most recently from f578358 to aa29734 Compare August 25, 2021 06:37
Copy link
Contributor

@jessebrennan jessebrennan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, just one question

src/azul/plugins/metadata/hca/aggregate.py Show resolved Hide resolved
@jessebrennan jessebrennan removed their assignment Aug 25, 2021
@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/danielsotirhos/3299-project-estimated-cell-count branch from aa29734 to 950392c Compare August 25, 2021 23:59
Copy link
Member

@hannes-ucsc hannes-ucsc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you rename the can? Is the bundle UUID a hash of something?

lambdas/service/app.py Show resolved Hide resolved
@@ -163,6 +163,8 @@ def _get_accumulator(self, field) -> Optional[Accumulator]:
'contributors',
'publications'):
return None
elif field == 'estimated_cell_count':
return SumAccumulator()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ticket specifies max for this. If two bundles contribute the same project entity with this property set to 1000, we want the result to be 1000 not 2000.

# Add a project cell count aggregate
es_search.aggs.metric(
'projectEstimatedCellCount',
'max',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would we want the maximum when aggregating over multiple projects? Here we want the sum.

@@ -165,6 +165,7 @@ class SummaryRepresentation(JsonObject):
donorCount = IntegerProperty()
labCount = IntegerProperty()
totalCellCount = FloatProperty()
projectEstimatedCellCount = FloatProperty() # 'max' aggregations use floats
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future reference, quirks like this can't be exposed on the API.


@classmethod
def bundles(cls) -> List[BundleFQID]:
return super().bundles() + [
Copy link
Member

@hannes-ucsc hannes-ucsc Aug 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you index the default bundles when you later filter them out of the response? Either assert the new property to be None for the entities from the default bundles or don't index the default bundles.

test/service/test_response.py Show resolved Hide resolved
@@ -246,8 +246,9 @@ def _shared_file_bundle(self, bundle):
"ontology_label": "lung"
}
assert isinstance(manifest, list)
return DSSBundle(fqid=self.bundle_fqid(uuid=old_to_new[bundle.uuid],
version=bundle.version),
new_bundle_fqid = self.bundle_fqid(uuid='adc92f89-b2e9-467b-af54-577d084a9eec',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't understand this change. Why is the old bundle UUID still mentioned in old_to_new even though the can was renamed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you rename the can? Is the bundle UUID a hash of something?

The bundle UUID was changed as part of the 30/07/2021 schema-test-data release (when the project estimated_cell_count field was added).
See latest @ https://github.com/HumanCellAtlas/schema-test-data/tree/883e8c19ccad4c630b62d869df6f671628ce03fc/tests/links
vs prior @ https://github.com/HumanCellAtlas/schema-test-data/tree/2c7fbd0e0b7804abeb21ddb0ae89724950218257/tests/links

Don't understand this change. Why is the old bundle UUID still mentioned in old_to_new even though the can was renamed?

The bundle UUID (both prior to the rename and after) is also the UUID of a process in the bundle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, In the last fixup I've also added a canned bundle of the 2nd bundle in the canned staging area. It also re-uses a UUID from one of the processes in the bundle for the bundle UUID (d7b8cbff).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please bring this up in stand-up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parking lot, that is.

@hannes-ucsc hannes-ucsc removed their assignment Aug 26, 2021
@hannes-ucsc hannes-ucsc added the 1 review [process] Lead requested changes once label Aug 26, 2021
@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/danielsotirhos/3299-project-estimated-cell-count branch from 950392c to abd2729 Compare August 26, 2021 17:28
@hannes-ucsc hannes-ucsc removed their assignment Aug 31, 2021
@hannes-ucsc hannes-ucsc added 1 review [process] Lead requested changes once and removed 1 review [process] Lead requested changes once labels Aug 31, 2021
Copy link
Member

@hannes-ucsc hannes-ucsc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/danielsotirhos/3299-project-estimated-cell-count branch 3 times, most recently from 95ce607 to ff14f8e Compare September 7, 2021 23:05
@hannes-ucsc hannes-ucsc removed their assignment Oct 4, 2021
@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/danielsotirhos/3299-project-estimated-cell-count branch from 858e280 to 3e164eb Compare October 5, 2021 16:13
for aggregate in False, True:
for entity_type in self.index_service.entity_types(self.catalog):
expected[aggregate][entity_type] = {
'projects': 10000 if aggregate and entity_type == 'projects' else 20000,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re #3361 (comment)

Ahh, so the leaves of the fingerprint should be lists

Index: test/indexer/test_hca_indexer.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/test/indexer/test_hca_indexer.py b/test/indexer/test_hca_indexer.py
--- a/test/indexer/test_hca_indexer.py	(revision 3e164ebd69e9a0609fca603750bd23d3bfbadda2)
+++ b/test/indexer/test_hca_indexer.py	(date 1633480153563)
@@ -1,3 +1,4 @@
+from bisect import insort
 from collections import (
     Counter,
     defaultdict,
@@ -1278,22 +1279,24 @@
             ('cell_suspensions', 'total_estimated_cells'),
             ('files', 'matrix_cell_count')
         ]
-        actual = NestedDict(2, int)
+        actual = NestedDict(2, list)
         for hit in sorted(hits, key=lambda d: d['_id']):
             entity_type, aggregate = self._parse_index_name(hit)
             contents = hit['_source']['contents']
             for inner_entity_type, field_name in field_paths:
                 for inner_entity in contents[inner_entity_type]:
                     value = inner_entity[field_name]
-                    actual[aggregate][entity_type][inner_entity_type] += value
+                    insort(actual[aggregate][entity_type][inner_entity_type], value)
 
         expected = NestedDict(1, dict)
         for aggregate in False, True:
             for entity_type in self.index_service.entity_types(self.catalog):
+                is_project_aggregate = aggregate and entity_type == 'projects'
                 expected[aggregate][entity_type] = {
-                    'projects': 10000 if aggregate and entity_type == 'projects' else 20000,
-                    'cell_suspensions': 40000,
-                    'files': 17100
+                    # estimated_cell_count is aggregated using max, not sum
+                    'projects': [10000] if is_project_aggregate else [10000, 10000],
+                    'cell_suspensions': [40000] if is_project_aggregate else [20000, 20000],
+                    'files': [17100] if is_project_aggregate else [2100, 15000]
                 }
 
         self.assertEqual(expected.to_dict(), actual.to_dict())

@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/danielsotirhos/3299-project-estimated-cell-count branch from 3e164eb to a92992d Compare October 6, 2021 16:30
Copy link
Member

@hannes-ucsc hannes-ucsc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conflicts.

@hannes-ucsc hannes-ucsc removed their assignment Oct 8, 2021
@dsotirho-ucsc dsotirho-ucsc force-pushed the issues/danielsotirhos/3299-project-estimated-cell-count branch from a92992d to c8edbe0 Compare October 8, 2021 16:30
@hannes-ucsc
Copy link
Member

@danielsotirhos, note the new item at the bottom of the checklist.

@jessebrennan jessebrennan force-pushed the issues/danielsotirhos/3299-project-estimated-cell-count branch from c8edbe0 to 75eeb03 Compare October 8, 2021 19:44
@melainalegaspi melainalegaspi added the sandbox [process] Resolution is being verified in sandbox deployment label Oct 8, 2021
@jessebrennan jessebrennan merged commit efd7b2a into develop Oct 8, 2021
@jessebrennan jessebrennan deleted the issues/danielsotirhos/3299-project-estimated-cell-count branch October 8, 2021 21:52
@dsotirho-ucsc dsotirho-ucsc removed their assignment Oct 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4+ reviews [process] Lead requested changes four times or more orange [process] Done by the Azul team reindex:dev [process] PR requires reindexing dev sandbox [process] Resolution is being verified in sandbox deployment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants