Add support for estimated_cell_count in project.json (#3299) #3361
Force-pushed from 89e6522 to 4f0b34c
Force-pushed from 9ed8787 to b28ca61
Codecov Report
@@             Coverage Diff             @@
##           develop    #3361      +/-   ##
===========================================
+ Coverage    82.18%   82.39%   +0.21%
===========================================
  Files          124      123       -1
  Lines        14456    14192     -264
===========================================
- Hits         11880    11693     -187
+ Misses        2576     2499      -77
Force-pushed from f578358 to aa29734
Looks good to me, just one question
Force-pushed from aa29734 to 950392c
Why did you rename the can? Is the bundle UUID a hash of something?
@@ -163,6 +163,8 @@ def _get_accumulator(self, field) -> Optional[Accumulator]:
                       'contributors',
                       'publications'):
            return None
        elif field == 'estimated_cell_count':
            return SumAccumulator()
The ticket specifies max for this. If two bundles contribute the same project entity with this property set to 1000, we want the result to be 1000, not 2000.
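To illustrate the point, here is a minimal sketch with simplified stand-in accumulators. The class names echo the diff, but the bodies and the `aggregate` helper are hypothetical and are not Azul's implementation.

```python
from typing import Iterable, Optional


class SumAccumulator:
    """Hypothetical stand-in: adds up every contributed value."""

    def __init__(self) -> None:
        self.value: Optional[int] = None

    def accumulate(self, value: Optional[int]) -> None:
        if value is not None:
            self.value = value if self.value is None else self.value + value


class MaxAccumulator:
    """Hypothetical stand-in: keeps the largest contributed value."""

    def __init__(self) -> None:
        self.value: Optional[int] = None

    def accumulate(self, value: Optional[int]) -> None:
        if value is not None:
            self.value = value if self.value is None else max(self.value, value)


def aggregate(accumulator, values: Iterable[Optional[int]]) -> Optional[int]:
    """Feed every contribution to the accumulator and return the result."""
    for value in values:
        accumulator.accumulate(value)
    return accumulator.value


# Two bundles contribute the same project entity, each with 1000.
contributions = [1000, 1000]
summed = aggregate(SumAccumulator(), contributions)  # double-counts the project
maxed = aggregate(MaxAccumulator(), contributions)   # counts the project once
```

With identical contributions from two bundles, summing counts the project twice, whereas taking the maximum leaves the project's own estimate intact.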
# Add a project cell count aggregate
es_search.aggs.metric(
    'projectEstimatedCellCount',
    'max',
Why would we want the maximum when aggregating over multiple projects? Here we want the sum.
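A rough sketch of the two levels involved, in plain Python rather than an Elasticsearch query: within one project the estimate is the maximum over its bundle contributions, while the summary adds up the per-project estimates. The function names, project keys, and numbers are made up for illustration.

```python
from typing import Dict, List, Optional


def project_estimate(contributions: List[Optional[int]]) -> Optional[int]:
    """Per-project value: the maximum over all bundle contributions."""
    present = [c for c in contributions if c is not None]
    return max(present) if present else None


def summary_estimate(projects: Dict[str, List[Optional[int]]]) -> int:
    """Summary value: the sum of the per-project estimates."""
    estimates = (project_estimate(c) for c in projects.values())
    return sum(e for e in estimates if e is not None)


# Hypothetical contributions, keyed by project.
contributions = {
    'project-a': [1000, 1000],  # same project seen in two bundles
    'project-b': [500],
    'project-c': [None],        # property not set
}
total = summary_estimate(contributions)
```

Using max at the summary level would report only the largest project instead of the combined estimate across projects.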
src/azul/service/hca_response_v5.py
Outdated
@@ -165,6 +165,7 @@ class SummaryRepresentation(JsonObject):
    donorCount = IntegerProperty()
    labCount = IntegerProperty()
    totalCellCount = FloatProperty()
    projectEstimatedCellCount = FloatProperty()  # 'max' aggregations use floats
For future reference, quirks like this can't be exposed on the API.
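One possible way to keep the quirk out of the API, sketched under the assumption that the aggregation value arrives as a float (or None when no document has the field). The helper name `to_cell_count` is hypothetical and not part of the code under review.

```python
from typing import Optional


def to_cell_count(aggregate_value: Optional[float]) -> Optional[int]:
    """Convert a float-valued aggregation result to an integer count."""
    if aggregate_value is None:
        return None
    if not float(aggregate_value).is_integer():
        # A fractional cell count would indicate bad input, so fail loudly
        # instead of silently truncating.
        raise ValueError(f'Not a whole number: {aggregate_value!r}')
    return int(aggregate_value)


count = to_cell_count(1000.0)
```

Converting at the boundary lets the response declare an integer property while the float stays an internal detail of the aggregation.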
test/service/test_response.py
Outdated
@classmethod
def bundles(cls) -> List[BundleFQID]:
    return super().bundles() + [
Why do you index the default bundles when you later filter them out of the response? Either assert the new property to be None for the entities from the default bundles or don't index the default bundles.
test/service/test_manifest.py
Outdated
@@ -246,8 +246,9 @@ def _shared_file_bundle(self, bundle):
            "ontology_label": "lung"
        }
        assert isinstance(manifest, list)
        return DSSBundle(fqid=self.bundle_fqid(uuid=old_to_new[bundle.uuid],
                                               version=bundle.version),
        new_bundle_fqid = self.bundle_fqid(uuid='adc92f89-b2e9-467b-af54-577d084a9eec',
Don't understand this change. Why is the old bundle UUID still mentioned in old_to_new even though the can was renamed?
Why did you rename the can? Is the bundle UUID a hash of something?
The bundle UUID was changed as part of the 30/07/2021 schema-test-data release (when the project estimated_cell_count field was added).
See latest @ https://github.com/HumanCellAtlas/schema-test-data/tree/883e8c19ccad4c630b62d869df6f671628ce03fc/tests/links
vs prior @ https://github.com/HumanCellAtlas/schema-test-data/tree/2c7fbd0e0b7804abeb21ddb0ae89724950218257/tests/links
Don't understand this change. Why is the old bundle UUID still mentioned in old_to_new even though the can was renamed?
The bundle UUID (both prior to the rename and after) is also the UUID of a process in the bundle.
FYI, in the last fixup I've also added a canned bundle for the 2nd bundle in the canned staging area. It also reuses the UUID of one of the processes in the bundle as the bundle UUID (d7b8cbff).
Please bring this up in stand-up.
Parking lot, that is.
Force-pushed from 950392c to abd2729
Force-pushed from 95ce607 to ff14f8e
Force-pushed from 858e280 to 3e164eb
test/indexer/test_hca_indexer.py
Outdated
for aggregate in False, True:
    for entity_type in self.index_service.entity_types(self.catalog):
        expected[aggregate][entity_type] = {
            'projects': 10000 if aggregate and entity_type == 'projects' else 20000,
Ahh, so the leaves of the fingerprint should be lists
Index: test/indexer/test_hca_indexer.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/test/indexer/test_hca_indexer.py b/test/indexer/test_hca_indexer.py
--- a/test/indexer/test_hca_indexer.py (revision 3e164ebd69e9a0609fca603750bd23d3bfbadda2)
+++ b/test/indexer/test_hca_indexer.py (date 1633480153563)
@@ -1,3 +1,4 @@
+from bisect import insort
 from collections import (
     Counter,
     defaultdict,
@@ -1278,22 +1279,24 @@
             ('cell_suspensions', 'total_estimated_cells'),
             ('files', 'matrix_cell_count')
         ]
-        actual = NestedDict(2, int)
+        actual = NestedDict(2, list)
         for hit in sorted(hits, key=lambda d: d['_id']):
             entity_type, aggregate = self._parse_index_name(hit)
             contents = hit['_source']['contents']
             for inner_entity_type, field_name in field_paths:
                 for inner_entity in contents[inner_entity_type]:
                     value = inner_entity[field_name]
-                    actual[aggregate][entity_type][inner_entity_type] += value
+                    insort(actual[aggregate][entity_type][inner_entity_type], value)
         expected = NestedDict(1, dict)
         for aggregate in False, True:
             for entity_type in self.index_service.entity_types(self.catalog):
+                is_project_aggregate = aggregate and entity_type == 'projects'
                 expected[aggregate][entity_type] = {
-                    'projects': 10000 if aggregate and entity_type == 'projects' else 20000,
-                    'cell_suspensions': 40000,
-                    'files': 17100
+                    # estimated_cell_count is aggregated using max, not sum
+                    'projects': [10000] if is_project_aggregate else [10000, 10000],
+                    'cell_suspensions': [40000] if is_project_aggregate else [20000, 20000],
+                    'files': [17100] if is_project_aggregate else [2100, 15000]
                 }
         self.assertEqual(expected.to_dict(), actual.to_dict())
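The idea behind the patch can be tried in isolation. The sketch below uses a nested `defaultdict` as a stand-in for `NestedDict` and a flat tuple per hit; both are simplifying assumptions, and `fingerprint` is a hypothetical name.

```python
from bisect import insort
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (is_aggregate, inner_entity_type, value): a simplified, hypothetical hit
Hit = Tuple[bool, str, int]


def fingerprint(hits: Iterable[Hit]) -> Dict[bool, Dict[str, List[int]]]:
    """Collect values into sorted lists instead of adding them up."""
    actual = defaultdict(lambda: defaultdict(list))
    for aggregate, inner_entity_type, value in hits:
        # insort keeps each leaf sorted, so the order of hits doesn't matter
        insort(actual[aggregate][inner_entity_type], value)
    return {aggregate: dict(inner) for aggregate, inner in actual.items()}


contributions = fingerprint([(False, 'files', 15000), (False, 'files', 2100)])
aggregates = fingerprint([(True, 'files', 17100)])
```

With integer leaves both fingerprints would collapse to 17100, so the test could not tell two separate values from one combined value. List leaves keep that distinction visible.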
Force-pushed from 3e164eb to a92992d
Conflicts.
Update canned bundle from schema-test-data
Force-pushed from a92992d to c8edbe0
@danielsotirhos, note the new item at the bottom of the checklist.
Force-pushed from c8edbe0 to 75eeb03
#3299
Author
Author (reindex)
r tag to commit title or this PR does not require reindexing
reindex label to PR or this PR does not require reindexing
Author (freebies & chains)
chain label to the blocking PR or this PR is not chained to another PR
Author (upgrading)
u tag to commit title or this PR does not require upgrading
upgrade label to PR or this PR does not require upgrading
Author (requirements, before every review)
make requirements_update or this PR leaves requirements*.txt, common.mk and Makefile untouched
R tag to commit title or this PR leaves requirements*.txt untouched
reqs label to PR or this PR leaves requirements*.txt untouched
Author (before every review)
make integration_test passes in personal deployment or this PR does not touch functionality that could break the IT
develop, squashed old fixups
Primary reviewer (after approval)
no demo
no sandbox
Operator (before pushing the merge commit)
reindex label and r commit title tag
no demo
sandbox label or PR is labeled no sandbox
no sandbox
sandbox or this PR does not require reindexing sandbox
sandbox or this PR does not require reindexing sandbox
Operator (after pushing the merge commit)
N reviews labelling is accurate
Operator (reindex)
dev or this PR does not require reindexing or does not target dev
dev or this PR does not require reindexing or does not target dev
prod or this PR does not require reindexing or does not target prod
prod or this PR does not require reindexing or does not target prod
Operator
Unassigned PR
Author
dev before stand-up on 10/11/2021
dev to Slack thread in dcp-2 channel