Add support for estimated_cell_count in project.json (#3299) #3361

dsotirho-ucsc · 2021-08-23T17:38:20Z

#3299

PR title references issue
Title of main commit references issue
PR is connected to Zenhub issue and description links to issue

Author (reindex)

Added r tag to commit title _{or this PR does not require reindexing}
Added reindex label to PR _{or this PR does not require reindexing}

Author (freebies & chains)

Freebies are blocked on this PR _{or there are no freebies in this PR}
Freebies are referenced in commit titles _{or there are no freebies in this PR}
This PR is blocked by previous PR in the chain _{or this PR is not chained to another PR}
Added chain label to the blocking PR _{or this PR is not chained to another PR}

Author (upgrading)

Documented upgrading of deployments in UPGRADING.rst _{or this PR does not require upgrading}
Added u tag to commit title _{or this PR does not require upgrading}
Added upgrade label to PR _{or this PR does not require upgrading}
Added announcement to PR description _{or this PR does not require announcement}

Author (requirements, before every review)

Ran make requirements_update _{or this PR leaves requirements*.txt, common.mk and Makefile untouched}
Added R tag to commit title _{or this PR leaves requirements*.txt untouched}
Added reqs label to PR _{or this PR leaves requirements*.txt untouched}

Author (before every review)

make integration_test passes in personal deployment _{or this PR does not touch functionality that could break the IT}
Rebased branch on develop, squashed old fixups

Primary reviewer (after approval)

Commented in issue about demo expectations _{or labelled issue as no demo}
Decided if PR can be labeled no sandbox
PR title is appropriate as title of merge commit
Moved ticket to Approved column
Assigned PR to an operator

Operator (before pushing the merge commit)

Operator (after pushing the merge commit)

Made announcement requested by author _{or PR description does not contain an announcement}
Moved freebies to Merged column _{or there are no freebies in this PR}
Shortened the PR chain _{or this PR is not the base of another PR}
Verified that N reviews labelling is accurate
Pushed merge commit to Gitlab _{or these changes can be pushed later, together with another PR}
Deleted PR branch from Github and Gitlab

Operator (reindex)

Started reindex in dev _{or this PR does not require reindexing or does not target dev}
Checked for failures in dev _{or this PR does not require reindexing or does not target dev}
Started reindex in prod _{or this PR does not require reindexing or does not target prod}
Checked for failures in prod _{or this PR does not require reindexing or does not target prod}

Operator

~~Unassigned PR~~
Assigned PR back to author

Verify demoability in dev before stand-up on 10/11/2021
Report PR landing on dev to Slack thread in dcp-2 channel

codecov · 2021-08-23T18:35:01Z

Codecov Report

Merging #3361 (dca5464) into develop (614f551) will increase coverage by 0.21%.
The diff coverage is 100.00%.

❗ Current head dca5464 differs from pull request most recent head 75eeb03. Consider uploading reports for the commit 75eeb03 to get more accurate results

@@             Coverage Diff             @@
##           develop    #3361      +/-   ##
===========================================
+ Coverage    82.18%   82.39%   +0.21%     
===========================================
  Files          124      123       -1     
  Lines        14456    14192     -264     
===========================================
- Hits         11880    11693     -187     
+ Misses        2576     2499      -77

Impacted Files	Coverage Δ
src/azul/plugins/metadata/hca/__init__.py	`100.00% <ø> (ø)`
src/azul/plugins/metadata/hca/transform.py	`98.62% <ø> (-0.06%)`	⬇️
src/azul/service/index_query_service.py	`89.69% <ø> (ø)`
test/azul_test_case.py	`74.50% <ø> (-2.53%)`	⬇️
test/service/test_repository_projects.py	`100.00% <ø> (ø)`
src/azul/plugins/metadata/hca/aggregate.py	`97.82% <100.00%> (-0.07%)`	⬇️
src/azul/service/avro_pfb.py	`96.92% <100.00%> (+0.82%)`	⬆️
src/azul/service/elasticsearch_service.py	`82.75% <100.00%> (-0.47%)`	⬇️
src/azul/service/hca_response_v5.py	`92.54% <100.00%> (ø)`
test/indexer/test_hca_indexer.py	`99.31% <100.00%> (+0.24%)`	⬆️
... and 44 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 614f551...75eeb03.

coveralls · 2021-08-23T18:36:19Z

Coverage increased (+0.08%) to 82.501% when pulling 75eeb03 on issues/danielsotirhos/3299-project-estimated-cell-count into 614f551 on develop.

jessebrennan

Looks good to me, just one question

src/azul/plugins/metadata/hca/aggregate.py

hannes-ucsc

Why did you rename the can? Is the bundle UUID a hash of something?

lambdas/service/app.py

hannes-ucsc · 2021-08-26T00:47:00Z

src/azul/plugins/metadata/hca/aggregate.py

@@ -163,6 +163,8 @@ def _get_accumulator(self, field) -> Optional[Accumulator]:
                       'contributors',
                       'publications'):
            return None
+        elif field == 'estimated_cell_count':
+            return SumAccumulator()


The ticket specifies max for this. If two bundles contribute the same project entity with this property set to 1000, we want the result to be 1000 not 2000.
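To make the difference concrete, here is a minimal sketch using simplified stand-ins for the two accumulators. The class names `SumAccumulatorSketch` and `MaxAccumulatorSketch`, their `accumulate`/`get` interface, and the `aggregate` helper are assumptions made for illustration; they are not Azul's actual accumulator classes.

```python
from typing import Iterable, Optional


class SumAccumulatorSketch:
    """Hypothetical stand-in: adds up every value it is given."""

    def __init__(self) -> None:
        self.value: Optional[int] = None

    def accumulate(self, value: int) -> None:
        self.value = value if self.value is None else self.value + value

    def get(self) -> Optional[int]:
        return self.value


class MaxAccumulatorSketch:
    """Hypothetical stand-in: keeps only the largest value it is given."""

    def __init__(self) -> None:
        self.value: Optional[int] = None

    def accumulate(self, value: int) -> None:
        self.value = value if self.value is None else max(self.value, value)

    def get(self) -> Optional[int]:
        return self.value


def aggregate(accumulator, values: Iterable[int]) -> Optional[int]:
    """Feed all values into the accumulator and return its result."""
    for value in values:
        accumulator.accumulate(value)
    return accumulator.get()


# Two bundles contribute the same project entity, each with the
# property set to 1000.
contributions = [1000, 1000]

sum_result = aggregate(SumAccumulatorSketch(), contributions)
max_result = aggregate(MaxAccumulatorSketch(), contributions)

print(sum_result)  # 2000, the project is counted twice
print(max_result)  # 1000, the value the ticket asks for
```

With a sum, every additional bundle that references the project inflates the count; with a max, repeated contributions of the same value collapse to that value.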

hannes-ucsc · 2021-08-26T00:48:43Z

src/azul/service/elasticsearch_service.py

+        # Add a project cell count aggregate
+        es_search.aggs.metric(
+            'projectEstimatedCellCount',
+            'max',


Why would we want the maximum when aggregating over multiple projects? Here we want the sum.
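The two levels of aggregation can be sketched in plain Python: max within a project (across the bundles contributing it), then sum across projects for the summary. The project IDs, the counts, and the function name `summary_cell_count` below are made up for illustration; the real summary is computed by an Elasticsearch aggregation, not by code like this.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical contributions, one (project ID, estimated_cell_count)
# pair per contributing bundle.
contributions = [
    ('project-a', 1000),
    ('project-a', 1000),  # same project entity, contributed by a second bundle
    ('project-b', 500),
]


def summary_cell_count(pairs: Iterable[Tuple[str, int]]) -> int:
    # Level 1: within a project, repeated contributions collapse via max.
    per_project: Dict[str, int] = {}
    for project_id, count in pairs:
        per_project[project_id] = max(per_project.get(project_id, count), count)
    # Level 2: across distinct projects, the counts add up.
    return sum(per_project.values())


total = summary_cell_count(contributions)
print(total)  # 1500
```

Using max at the second level would report 1000 here, which ignores project-b entirely.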

hannes-ucsc · 2021-08-26T00:49:31Z

src/azul/service/hca_response_v5.py

@@ -165,6 +165,7 @@ class SummaryRepresentation(JsonObject):
    donorCount = IntegerProperty()
    labCount = IntegerProperty()
    totalCellCount = FloatProperty()
+    projectEstimatedCellCount = FloatProperty()  # 'max' aggregations use floats


For future reference, quirks like this can't be exposed on the API.
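One way to keep this quirk out of the API is to normalize the aggregated value before it reaches the response. Elasticsearch metric aggregations such as `max` and `sum` report their result as a JSON floating-point number, so a whole-number count arrives as, say, `1000.0`. The helper name `to_api_count` below is hypothetical and not part of Azul; this is only a sketch of the idea.

```python
from typing import Optional


def to_api_count(aggregated: Optional[float]) -> Optional[int]:
    """Convert a float-valued aggregation result to an integer count.

    Returns None when the aggregation produced no value, and refuses
    to silently truncate a value that is not a whole number.
    """
    if aggregated is None:
        return None
    if not float(aggregated).is_integer():
        raise ValueError(f'Expected a whole number, got {aggregated!r}')
    return int(aggregated)


print(to_api_count(1000.0))  # 1000
print(to_api_count(None))  # None
```

With a conversion like this in the service layer, the response property could be declared as an integer regardless of how the backing aggregation represents numbers.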

hannes-ucsc · 2021-08-26T00:58:34Z

test/service/test_response.py

+
+    @classmethod
+    def bundles(cls) -> List[BundleFQID]:
+        return super().bundles() + [


Why do you index the default bundles when you later filter them out of the response? Either assert the new property to be None for the entities from the default bundles or don't index the default bundles.

test/service/test_response.py

hannes-ucsc · 2021-08-26T01:17:15Z

test/service/test_manifest.py

@@ -246,8 +246,9 @@ def _shared_file_bundle(self, bundle):
            "ontology_label": "lung"
        }
        assert isinstance(manifest, list)
-        return DSSBundle(fqid=self.bundle_fqid(uuid=old_to_new[bundle.uuid],
-                                               version=bundle.version),
+        new_bundle_fqid = self.bundle_fqid(uuid='adc92f89-b2e9-467b-af54-577d084a9eec',


Don't understand this change. Why is the old bundle UUID still mentioned in old_to_new even though the can was renamed?

> Why did you rename the can? Is the bundle UUID a hash of something?

The bundle UUID was changed as part of the 30/07/2021 schema-test-data release (when the project estimated_cell_count field was added).
See latest @ https://github.com/HumanCellAtlas/schema-test-data/tree/883e8c19ccad4c630b62d869df6f671628ce03fc/tests/links
vs prior @ https://github.com/HumanCellAtlas/schema-test-data/tree/2c7fbd0e0b7804abeb21ddb0ae89724950218257/tests/links

> Don't understand this change. Why is the old bundle UUID still mentioned in old_to_new even though the can was renamed?

The bundle UUID (both before and after the rename) is also the UUID of a process in the bundle.

FYI, in the last fixup I've also added a canned bundle for the 2nd bundle in the canned staging area. It also reuses the UUID of one of the processes in the bundle as the bundle UUID (d7b8cbff).

Please bring this up in stand-up.

Parking lot, that is.

hannes-ucsc

#3361 (comment)

hannes-ucsc · 2021-10-06T00:33:02Z

test/indexer/test_hca_indexer.py

+        for aggregate in False, True:
+            for entity_type in self.index_service.entity_types(self.catalog):
+                expected[aggregate][entity_type] = {
+                    'projects': 10000 if aggregate and entity_type == 'projects' else 20000,


Re #3361 (comment)

Ahh, so the leaves of the fingerprint should be lists

Index: test/indexer/test_hca_indexer.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/test/indexer/test_hca_indexer.py b/test/indexer/test_hca_indexer.py
--- a/test/indexer/test_hca_indexer.py	(revision 3e164ebd69e9a0609fca603750bd23d3bfbadda2)
+++ b/test/indexer/test_hca_indexer.py	(date 1633480153563)
@@ -1,3 +1,4 @@
+from bisect import insort
 from collections import (
     Counter,
     defaultdict,
@@ -1278,22 +1279,24 @@
             ('cell_suspensions', 'total_estimated_cells'),
             ('files', 'matrix_cell_count')
         ]
-        actual = NestedDict(2, int)
+        actual = NestedDict(2, list)
         for hit in sorted(hits, key=lambda d: d['_id']):
             entity_type, aggregate = self._parse_index_name(hit)
             contents = hit['_source']['contents']
             for inner_entity_type, field_name in field_paths:
                 for inner_entity in contents[inner_entity_type]:
                     value = inner_entity[field_name]
-                    actual[aggregate][entity_type][inner_entity_type] += value
+                    insort(actual[aggregate][entity_type][inner_entity_type], value)
         expected = NestedDict(1, dict)
         for aggregate in False, True:
             for entity_type in self.index_service.entity_types(self.catalog):
+                is_project_aggregate = aggregate and entity_type == 'projects'
                 expected[aggregate][entity_type] = {
-                    'projects': 10000 if aggregate and entity_type == 'projects' else 20000,
-                    'cell_suspensions': 40000,
-                    'files': 17100
+                    # estimated_cell_count is aggregated using max, not sum
+                    'projects': [10000] if is_project_aggregate else [10000, 10000],
+                    'cell_suspensions': [40000] if is_project_aggregate else [20000, 20000],
+                    'files': [17100] if is_project_aggregate else [2100, 15000]
                 }
         self.assertEqual(expected.to_dict(), actual.to_dict())
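A short sketch of why list-valued leaves make a stronger fingerprint than summed ones. The helper names `sum_fingerprint` and `list_fingerprint` and the sample values are illustrative assumptions, not code from the test suite; only the use of `bisect.insort` mirrors the patch above.

```python
from bisect import insort
from typing import Iterable, List


def sum_fingerprint(values: Iterable[int]) -> int:
    """Collapse all values into one number, as the old test did."""
    return sum(values)


def list_fingerprint(values: Iterable[int]) -> List[int]:
    """Keep every value, sorted so that hit order does not matter."""
    result: List[int] = []
    for value in values:
        insort(result, value)
    return result


# Two documents carrying 10000 each versus one document carrying 20000:
# a summed fingerprint cannot tell these apart.
two_documents = [10000, 10000]
one_document = [20000]

same_sum = sum_fingerprint(two_documents) == sum_fingerprint(one_document)
same_list = list_fingerprint(two_documents) == list_fingerprint(one_document)

print(same_sum)  # True
print(same_list)  # False

# The sorted list is still independent of the order of the hits.
print(list_fingerprint([15000, 2100]))  # [2100, 15000]
```

A summed leaf would accept an aggregate that wrongly doubled a value, whereas the sorted list pins down both the number of values and each individual value.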

hannes-ucsc

Conflicts.

Update canned bundle from schema-test-data

hannes-ucsc · 2021-10-08T19:36:09Z

@danielsotirhos, note the new item at the bottom of the checklist.

github-actions bot added the orange [process] Done by the Azul team label Aug 23, 2021

dsotirho-ucsc force-pushed the issues/danielsotirhos/3299-project-estimated-cell-count branch from 89e6522 to 4f0b34c Compare August 23, 2021 17:39

dsotirho-ucsc added the reindex:dev [process] PR requires reindexing dev label Aug 23, 2021

dsotirho-ucsc force-pushed the issues/danielsotirhos/3299-project-estimated-cell-count branch 2 times, most recently from 9ed8787 to b28ca61 Compare August 23, 2021 18:18

dsotirho-ucsc force-pushed the issues/danielsotirhos/3299-project-estimated-cell-count branch 3 times, most recently from f578358 to aa29734 Compare August 25, 2021 06:37

dsotirho-ucsc requested a review from jessebrennan August 25, 2021 15:54

dsotirho-ucsc assigned jessebrennan Aug 25, 2021

jessebrennan approved these changes Aug 25, 2021

jessebrennan removed their assignment Aug 25, 2021

dsotirho-ucsc force-pushed the issues/danielsotirhos/3299-project-estimated-cell-count branch from aa29734 to 950392c Compare August 25, 2021 23:59

dsotirho-ucsc requested a review from hannes-ucsc August 26, 2021 00:32

dsotirho-ucsc assigned hannes-ucsc Aug 26, 2021

hannes-ucsc requested changes Aug 26, 2021

hannes-ucsc removed their assignment Aug 26, 2021

hannes-ucsc added the 1 review [process] Lead requested changes once label Aug 26, 2021

dsotirho-ucsc force-pushed the issues/danielsotirhos/3299-project-estimated-cell-count branch from 950392c to abd2729 Compare August 26, 2021 17:28

dsotirho-ucsc requested a review from hannes-ucsc August 26, 2021 18:30

dsotirho-ucsc assigned hannes-ucsc Aug 26, 2021

hannes-ucsc removed their assignment Aug 31, 2021

hannes-ucsc added 1 review [process] Lead requested changes once and removed 1 review [process] Lead requested changes once labels Aug 31, 2021

hannes-ucsc reviewed Aug 31, 2021

dsotirho-ucsc force-pushed the issues/danielsotirhos/3299-project-estimated-cell-count branch 3 times, most recently from 95ce607 to ff14f8e Compare September 7, 2021 23:05

hannes-ucsc removed their assignment Oct 4, 2021

dsotirho-ucsc force-pushed the issues/danielsotirhos/3299-project-estimated-cell-count branch from 858e280 to 3e164eb Compare October 5, 2021 16:13

dsotirho-ucsc requested a review from hannes-ucsc October 5, 2021 16:45

dsotirho-ucsc assigned hannes-ucsc Oct 5, 2021

hannes-ucsc requested changes Oct 6, 2021

hannes-ucsc assigned dsotirho-ucsc and unassigned hannes-ucsc Oct 6, 2021

dsotirho-ucsc force-pushed the issues/danielsotirhos/3299-project-estimated-cell-count branch from 3e164eb to a92992d Compare October 6, 2021 16:30

dsotirho-ucsc requested a review from hannes-ucsc October 6, 2021 17:33

dsotirho-ucsc assigned hannes-ucsc and unassigned dsotirho-ucsc Oct 6, 2021

hannes-ucsc requested changes Oct 8, 2021

hannes-ucsc removed their assignment Oct 8, 2021

[1/2] [r] Add support for estimated_cell_count in project.json (#3299)

00a5910

Update canned bundle from schema-test-data

dsotirho-ucsc force-pushed the issues/danielsotirhos/3299-project-estimated-cell-count branch from a92992d to c8edbe0 Compare October 8, 2021 16:30

dsotirho-ucsc requested a review from hannes-ucsc October 8, 2021 17:01

dsotirho-ucsc assigned hannes-ucsc Oct 8, 2021

hannes-ucsc approved these changes Oct 8, 2021

hannes-ucsc assigned jessebrennan and unassigned hannes-ucsc Oct 8, 2021

dsotirho-ucsc added 2 commits October 8, 2021 12:41

[2/2] [r] Add support for estimated_cell_count in project.json (#3299)

b91dbe0

Make PFB manifest deterministic

75eeb03

jessebrennan force-pushed the issues/danielsotirhos/3299-project-estimated-cell-count branch from c8edbe0 to 75eeb03 Compare October 8, 2021 19:44

melainalegaspi added the sandbox [process] Resolution is being verified in sandbox deployment label Oct 8, 2021

jessebrennan merged commit efd7b2a into develop Oct 8, 2021

jessebrennan deleted the issues/danielsotirhos/3299-project-estimated-cell-count branch October 8, 2021 21:52

jessebrennan assigned dsotirho-ucsc and unassigned jessebrennan Oct 11, 2021

dsotirho-ucsc removed their assignment Oct 11, 2021

jessebrennan left a comment

hannes-ucsc left a comment

hannes-ucsc Aug 26, 2021

hannes-ucsc Aug 26, 2021

hannes-ucsc Aug 26, 2021

hannes-ucsc Aug 26, 2021

hannes-ucsc Aug 26, 2021

dsotirho-ucsc Aug 26, 2021

dsotirho-ucsc Aug 26, 2021

hannes-ucsc Aug 31, 2021

hannes-ucsc Aug 31, 2021

hannes-ucsc left a comment

hannes-ucsc Oct 6, 2021

hannes-ucsc left a comment

hannes-ucsc commented Oct 8, 2021
