-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New export management commands #804
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## develop #804 +/- ##
=========================================
Coverage 97.39% 97.40%
=========================================
Files 231 233 +2
Lines 13241 13349 +108
Branches 81 81
=========================================
+ Hits 12896 13002 +106
- Misses 342 344 +2
Partials 3 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the way you're leveraging the existing member export and the overlap in functionality with that and the creator export we need. This seems like a really good direction.
I think all the member-specific fields should be omitted from the creator export; although it might be nice to have an is_member
boolean or something like that.
What if you override the field list in your creator export and then in get_object_data
you can check whether a field should be populated based on that list? It's not perfect encapsulation; the other option is a base class with all the common fields used in both, but that seems like that might be more complicated than we need.
creator_ids = {c.person_id for c in Creator.objects.all()} | ||
return Person.objects.filter(pk__in=creator_ids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be able to get this in a single query; try this:
creator_ids = {c.person_id for c in Creator.objects.all()} | |
return Person.objects.filter(pk__in=creator_ids) | |
return Person.objects.filter(creator__isnull=False).distinct().count() |
The distinct is needed because without it we get one person record for every time they have a creator entry
@@ -22,7 +22,7 @@ class Command(BaseExport): | |||
model = Person | |||
|
|||
csv_fields = [ | |||
"uri", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to keep the uri for the member export for backwards compatibility.
I'm not sure "slug" is meaningful to an outside audience... Inclined to use URIs for creators even though they don't resolve, but I don't feel strongly about that. Do you want to use slug for creators export with a different label?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on second thought - because there's overlap between members and creators, I recommend we stick with URIs so the two data exports can easily be used together
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I make my own URIs for the creators then / those that don't resolve, using the slug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah! I'd forgotten that we don't generate urls for persons without accounts (and it wouldn't make sense to create member uris for them anyways)
is it weird to include a member uri in the creator export if the person is a member?
I think that yes, we should create uris for creators - this means we should create a super-minimal /creator/slug/ view - maybe it could return brief json response only? (I feel like I did this somewhere else for URIs that were only referenced in a data export... I looked quickly at mep-django urls and didn't see it.)
The additional value of a view like that it will give us a way to check for renamed slugs, assuming we adapt the same redirect logic for old slugs that we have for persons.
However, it does increase the work required to complete this export! I'd be ok with slug (but label as id) for creators.
@@ -85,8 +86,10 @@ def get_object_data(self, obj): | |||
if obj.death_year: | |||
data["death_year"] = obj.death_year | |||
# set for unique, list for json serialization | |||
data["membership_years"] = list( | |||
set(d.year for d in obj.account_set.first().event_dates) | |||
data["membership_years"] = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of making this conditional on whether the person has an account, can we just skip based on whether membership years is in the list of fields to be exported? that isn't relevant for the creator export, is it? (although I know there is some overlap)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this revision looks good to me!
"sort_name", | ||
"title", | ||
"gender", | ||
"is_organization", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have any creator orgs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like one:
In [2]: Person.objects.filter(creator__isnull=False, is_organization=True)
Out[2]: <PersonQuerySet [<Person pk:10524 Fabian Society>]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow, fascinating; project full of edge cases
|
||
def get_queryset(self): | ||
"""filter to creators""" | ||
creator_ids = {c.person_id for c in Creator.objects.all()} | ||
return Person.objects.filter(pk__in=creator_ids) | ||
return Person.objects.filter(creator__isnull=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pretty sure you need the .distinct()
; did you test with this and see what you get?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woops, I forgot to include that, thanks. It's def needed: with distinct
we have 2532 creators from queryset to export and without it we have 6479.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These exports are looking really good, congratulations. I have some questions and suggestions but hopefully it's just minor cleanup and finishing the unit tests
total = len( | ||
objects | ||
) # fewer assumptions, allows other (multi model/class) objects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was this needed? are we exporting anything other than database content in the new export scripts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I see it's due to the address person/account issue; I'd like to resolve it there instead
["uri", "title"] | ||
# including "id" to store slug for exports, | ||
# given not all exported entities have a URI | ||
["id", "uri", "title"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pretty sure all books should have uris; did you encounter any that did not?
Generates a CSV and JSON file including details on every creator | ||
database, with summary details and coordinates for associated addresses. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like this is slightly out of date.
maybe worth briefly clarifying how we define creator?
model = Address | ||
|
||
csv_fields = [ | ||
"id", # including "id" to store slug for exports, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
locations/addresses don't have IDs, as far as I can remember (other than db id, which we should not include in the exports)
do you think we need public ids for these? I was assuming we would not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we don't have URIs for locations or creators, I am including the slug, which we do have, for all 4 exports, but calling it "id" instead of slug. This allows csv-users to easily merge the datasets. That or something like it would be my recommendation but up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we could leave out an "id" field for locations since Location/Address doesn't have a slug, and since the output csv should still be mergeable with the others on id
in the members.csv to member_id
in the locations.csv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I agree that locations / address data export does not need IDs, just needs to have a member uri to join with the members dataset
- I can't remember if I suggested: I think we should include member_uri in the creators dataset, so that any authors who are also library members can be easily identified and the author and member records can be connected (this seems better than a boolean indicating membership)
- We need to keep URIs where we have them (which is everywhere but creators, right?), and merging/joining can always be done based on the URIs. I always figured short ids could be generated from URIs downstream where they are useful; not sure of the balance between redundancy and convenience in the published datasets
- How are you proposing to support merging book and creator data? We're using authorized names where we have them, so it should be possible to merge based on those, which I think we do include in both datasets. What do you think about a csv explicitly documenting the relationships? I don't love it personally but it's also more explicit, and would be more useful for some kinds of questions.
for addr in addresses.all(): | ||
persons = [addr.person] if addr.person else addr.account.persons.all() | ||
for person in persons: | ||
res.append((person, addr)) | ||
return res |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a lingering possibility in the db structure that we never finished cleaning up which allows an address to be directly associated with a person instead of an account but in practice we should only ever use account.
Did you find otherwise in the actual data when you were working on the exports? (maybe something changed since the last time I worked on this part)
I remember working on the cleanup but I don't remember where we left it; if need be I can look at the history and figure out more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found cases where only a person relation existed and instances where only an account relation existed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having trouble finding the github issue, but I checked the admin interface and it's not possible to edit or add relationships directly between people and addresses. I also checked with the python console in production and I am not finding any records with that relationship - can you confirm?
FWIW, we removed it because we decided it was out of scope for the project, so even if they were there they would arguably be out of scope for this export also.
I can't remember why we removed it from the interface and not from the underlying data structure, maybe that was meant to be a secondary step once the relationships were migrated. I apologize for the confusion in the code, it's certainly not obvious there that you can ignore it.
"sort_name", | ||
"title", | ||
"gender", | ||
"is_organization", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow, fascinating; project full of edge cases
Co-authored-by: Rebecca Sutton Koeser <[email protected]>
Co-authored-by: Rebecca Sutton Koeser <[email protected]>
Co-authored-by: Rebecca Sutton Koeser <[email protected]>
"birth_year", | ||
"death_year", | ||
"viaf_url", | ||
"wikipedia_url", | ||
# related country | ||
"nationalities", | ||
# generic | ||
"notes", | ||
"updated", | ||
] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In #684 Josh requested/suggested creator types and associated items - creator types seems like it would be useful. I don't know if you already have a solution for associating books and creators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the code changes look good, I think we mainly have questions about finalizing field specifics. I'm inclined to get this in front of Josh sooner rather than later while we decide the picky details. Should be straightforward to make remaining field tweaks after that, I think.
The one change I'd like to see before we merge for a first round of testing is to simplify the location export code to use an Address queryset and only use related accounts.
If you find any instances of addresses associated with person instead of account in production data, let's treat it as a data cleanup problem.
In this case, since we want a row per person per address, and if we return simply one Address per one call of |
There are also 3 addresses with 0 persons in the account:
|
Here's what happens if we use get_object_data per address object: def get_object_data(self, obj):
"""
Generate dictionary of data to export for a single
:class:`~mep.people.models.Person`
"""
addr = obj
loc = addr.location
persons = addr.account.persons.all()
# required properties
return dict(
# Member
member_id=[person.slug for person in persons],
member_uri=[absolutize_url(person.get_absolute_url()) for person in persons],
# Address data
start_date=addr.partial_start_date,
end_date=addr.partial_end_date,
care_of_person_id=addr.care_of_person.slug if addr.care_of_person else None,
# Location data
street_address=loc.street_address,
city=loc.city,
postal_code=loc.postal_code,
latitude=float(loc.latitude) if loc.latitude is not None else None,
longitude=float(loc.longitude) if loc.longitude is not None else None,
country=loc.country.name if loc.country else None,
arrondissement=loc.arrondissement(),
) -> that uses semicolons to join the lists:
Which is fine but makes it not possible to easily join the csv from members to locations. |
Thanks for finding. Please run by Josh and check if he knows whether they should be deleted or need to be associated with an account; hopefully he'll remember or be able to figure out. Maybe they were removed from an account but not deleted, or maybe they were duplicates. If need be, we can check if there's anything useful in the admin log entries. |
It occurred to me after our last back and forth about this, addresses are like events - they are linked to accounts, not to people, and as you've noted we have a subset of accounts shared by people. My solution in the events dataset was a |
Good point—we already are committed to joining multiple people in events table so for consistency let's just keep doing that. My latest commit is an example of that happening. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your last revision looks great to me. Let's get this merged and deliver for testing!
"latitude", | ||
] | ||
|
||
# def get_queryset(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so with the revised logic, no queryset customization is needed? or would prefetching on persons still be useful?
We need to customize/add to the JSON+CSV export commands we now have (
export_members
,export_books
,export_events
) to reflect the following:A new genre field called
category
on books, doable by modifyingexport_books
. No issue yetNew nationality and gender info on authors/creators, and indeed all info on creators, doable by writing a
export_creators
command that ideally subclasses/inherits most of the logic from theexport_members
command, since both draw on the samePerson
table. See As a content admin, I want to export information about creators from the database so I can work with the data using other tools. #684New time-sensitive location information about who lived where when, doable by a new separate command,
export_locations
. See As a content admin, I want to export member addresses with more details in a separate file so I can more easily work with member locations. #790This PR has a minimal working version of (2) using the approach described there.