Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New export management commands #804

Merged
merged 24 commits into from
Feb 20, 2024
Merged

New export management commands #804

merged 24 commits into from
Feb 20, 2024

Conversation

quadrismegistus
Copy link
Contributor

@quadrismegistus quadrismegistus commented Jan 30, 2024

We need to customize/add to the JSON+CSV export commands we now have (export_members, export_books, export_events) to reflect the following:

  1. A new genre field called category on books, doable by modifying export_books. No issue yet

  2. New nationality and gender info on authors/creators, and indeed all info on creators, doable by writing a export_creators command that ideally subclasses/inherits most of the logic from the export_members command, since both draw on the same Person table. See As a content admin, I want to export information about creators from the database so I can work with the data using other tools. #684

  3. New time-sensitive location information about who lived where when, doable by a new separate command, export_locations. See As a content admin, I want to export member addresses with more details in a separate file so I can more easily work with member locations. #790

This PR has a minimal working version of (2) using the approach described there.

Copy link

codecov bot commented Jan 30, 2024

Codecov Report

Merging #804 (f70e772) into develop (b3d034b) will increase coverage by 0.00%.
The diff coverage is 98.60%.

Additional details and impacted files
@@            Coverage Diff            @@
##           develop     #804    +/-   ##
=========================================
  Coverage    97.39%   97.40%            
=========================================
  Files          231      233     +2     
  Lines        13241    13349   +108     
  Branches        81       81            
=========================================
+ Hits         12896    13002   +106     
- Misses         342      344     +2     
  Partials         3        3            

@quadrismegistus quadrismegistus marked this pull request as draft February 1, 2024 16:50
Copy link
Contributor

@rlskoeser rlskoeser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the way you're leveraging the existing member export and the overlap in functionality with that and the creator export we need. This seems like a really good direction.

I think all the member-specific fields should be omitted from the creator export; although it might be nice to have an is_member boolean or something like that.

What if you override the field list in your creator export and then in get_object_data you can check whether a field should be populated based on that list? It's not perfect encapsulation; the other option is a base class with all the common fields used in both, but that seems like that might be more complicated than we need.

Comment on lines 19 to 20
creator_ids = {c.person_id for c in Creator.objects.all()}
return Person.objects.filter(pk__in=creator_ids)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be able to get this in a single query; try this:

Suggested change
creator_ids = {c.person_id for c in Creator.objects.all()}
return Person.objects.filter(pk__in=creator_ids)
return Person.objects.filter(creator__isnull=False).distinct().count()

The distinct is needed because without it we get one person record for every time they have a creator entry

@@ -22,7 +22,7 @@ class Command(BaseExport):
model = Person

csv_fields = [
"uri",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to keep the uri for the member export for backwards compatibility.

I'm not sure "slug" is meaningful to an outside audience... Inclined to use URIs for creators even though they don't resolve, but I don't feel strongly about that. Do you want to use slug for creators export with a different label?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on second thought - because there's overlap between members and creators, I recommend we stick with URIs so the two data exports can easily be used together

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I make my own URIs for the creators then / those that don't resolve, using the slug?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah! I'd forgotten that we don't generate urls for persons without accounts (and it wouldn't make sense to create member uris for them anyways)

is it weird to include a member uri in the creator export if the person is a member?

I think that yes, we should create uris for creators - this means we should create a super-minimal /creator/slug/ view - maybe it could return brief json response only? (I feel like I did this somewhere else for URIs that were only referenced in a data export... I looked quickly at mep-django urls and didn't see it.)

The additional value of a view like that it will give us a way to check for renamed slugs, assuming we adapt the same redirect logic for old slugs that we have for persons.

However, it does increase the work required to complete this export! I'd be ok with slug (but label as id) for creators.

mep/people/management/commands/export_creators.py Outdated Show resolved Hide resolved
@@ -85,8 +86,10 @@ def get_object_data(self, obj):
if obj.death_year:
data["death_year"] = obj.death_year
# set for unique, list for json serialization
data["membership_years"] = list(
set(d.year for d in obj.account_set.first().event_dates)
data["membership_years"] = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of making this conditional on whether the person has an account, can we just skip based on whether membership years is in the list of fields to be exported? that isn't relevant for the creator export, is it? (although I know there is some overlap)

mep/people/management/commands/export_members.py Outdated Show resolved Hide resolved
Copy link
Contributor

@rlskoeser rlskoeser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this revision looks good to me!

"sort_name",
"title",
"gender",
"is_organization",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have any creator orgs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like one:

In [2]: Person.objects.filter(creator__isnull=False, is_organization=True)
Out[2]: <PersonQuerySet [<Person pk:10524 Fabian Society>]>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, fascinating; project full of edge cases


def get_queryset(self):
"""filter to creators"""
creator_ids = {c.person_id for c in Creator.objects.all()}
return Person.objects.filter(pk__in=creator_ids)
return Person.objects.filter(creator__isnull=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty sure you need the .distinct(); did you test with this and see what you get?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woops, I forgot to include that, thanks. It's def needed: with distinct we have 2532 creators from queryset to export and without it we have 6479.

Copy link
Contributor

@rlskoeser rlskoeser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These exports are looking really good, congratulations. I have some questions and suggestions but hopefully it's just minor cleanup and finishing the unit tests

Comment on lines +151 to +153
total = len(
objects
) # fewer assumptions, allows other (multi model/class) objects
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was this needed? are we exporting anything other than database content in the new export scripts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I see it's due to the address person/account issue; I'd like to resolve it there instead

mep/common/management/export.py Outdated Show resolved Hide resolved
["uri", "title"]
# including "id" to store slug for exports,
# given not all exported entities have a URI
["id", "uri", "title"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty sure all books should have uris; did you encounter any that did not?

Comment on lines 4 to 5
Generates a CSV and JSON file including details on every creator
database, with summary details and coordinates for associated addresses.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this is slightly out of date.

maybe worth briefly clarifying how we define creator?

model = Address

csv_fields = [
"id", # including "id" to store slug for exports,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

locations/addresses don't have IDs, as far as I can remember (other than db id, which we should not include in the exports)

do you think we need public ids for these? I was assuming we would not

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we don't have URIs for locations or creators, I am including the slug, which we do have, for all 4 exports, but calling it "id" instead of slug. This allows csv-users to easily merge the datasets. That or something like it would be my recommendation but up to you.

Copy link
Contributor Author

@quadrismegistus quadrismegistus Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we could leave out an "id" field for locations since Location/Address doesn't have a slug, and since the output csv should still be mergeable with the others on id in the members.csv to member_id in the locations.csv

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I agree that locations / address data export does not need IDs, just needs to have a member uri to join with the members dataset
  • I can't remember if I suggested: I think we should include member_uri in the creators dataset, so that any authors who are also library members can be easily identified and the author and member records can be connected (this seems better than a boolean indicating membership)
  • We need to keep URIs where we have them (which is everywhere but creators, right?), and merging/joining can always be done based on the URIs. I always figured short ids could be generated from URIs downstream where they are useful; not sure of the balance between redundancy and convenience in the published datasets
  • How are you proposing to support merging book and creator data? We're using authorized names where we have them, so it should be possible to merge based on those, which I think we do include in both datasets. What do you think about a csv explicitly documenting the relationships? I don't love it personally but it's also more explicit, and would be more useful for some kinds of questions.

Comment on lines 56 to 60
for addr in addresses.all():
persons = [addr.person] if addr.person else addr.account.persons.all()
for person in persons:
res.append((person, addr))
return res
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a lingering possibility in the db structure that we never finished cleaning up which allows an address to be directly associated with a person instead of an account but in practice we should only ever use account.

Did you find otherwise in the actual data when you were working on the exports? (maybe something changed since the last time I worked on this part)

I remember working on the cleanup but I don't remember where we left it; if need be I can look at the history and figure out more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found cases where only a person relation existed and instances where only an account relation existed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having trouble finding the github issue, but I checked the admin interface and it's not possible to edit or add relationships directly between people and addresses. I also checked with the python console in production and I am not finding any records with that relationship - can you confirm?

FWIW, we removed it because we decided it was out of scope for the project, so even if they were there they would arguably be out of scope for this export also.

I can't remember why we removed it from the interface and not from the underlying data structure, maybe that was meant to be a secondary step once the relationships were migrated. I apologize for the confusion in the code, it's certainly not obvious there that you can ignore it.

mep/people/management/commands/export_locations.py Outdated Show resolved Hide resolved
mep/people/management/commands/export_locations.py Outdated Show resolved Hide resolved
mep/people/management/commands/export_locations.py Outdated Show resolved Hide resolved
"sort_name",
"title",
"gender",
"is_organization",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, fascinating; project full of edge cases

@quadrismegistus quadrismegistus marked this pull request as ready for review February 9, 2024 16:36
Comment on lines +23 to +33
"birth_year",
"death_year",
"viaf_url",
"wikipedia_url",
# related country
"nationalities",
# generic
"notes",
"updated",
]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In #684 Josh requested/suggested creator types and associated items - creator types seems like it would be useful. I don't know if you already have a solution for associating books and creators.

Copy link
Contributor

@rlskoeser rlskoeser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the code changes look good, I think we mainly have questions about finalizing field specifics. I'm inclined to get this in front of Josh sooner rather than later while we decide the picky details. Should be straightforward to make remaining field tweaks after that, I think.

The one change I'd like to see before we merge for a first round of testing is to simplify the location export code to use an Address queryset and only use related accounts.

If you find any instances of addresses associated with person instead of account in production data, let's treat it as a data cleanup problem.

@quadrismegistus
Copy link
Contributor Author

All the code changes look good, I think we mainly have questions about finalizing field specifics. I'm inclined to get this in front of Josh sooner rather than later while we decide the picky details. Should be straightforward to make remaining field tweaks after that, I think.

The one change I'd like to see before we merge for a first round of testing is to simplify the location export code to use an Address queryset and only use related accounts.

If you find any instances of addresses associated with person instead of account in production data, let's treat it as a data cleanup problem.

In this case, since we want a row per person per address, and if we return simply one Address per one call of get_export_data() -> dict, and since addresses can contain more than one person, I'm not sure how we can guarantee to get one person per exported dict. There are 53 addresses with more than one person in the associated account.

@quadrismegistus
Copy link
Contributor Author

There are also 3 addresses with 0 persons in the account:


In [8]: for addr in Address.objects.all():
   ...:     if len(addr.account.persons.all())==0:
   ...:         print(f'{len(addr.account.persons.all())} persons in account for address {addr}')
   ...: 
0 persons in account for address 41 avenue du Maine, Paris — Account #320: 41 avenue du Maine
0 persons in account for address 220 boulevard Raspail, Paris — Account #7544: 220 boulevard Raspail
0 persons in account for address 41 rue Censier, Paris — Account #7652: 41 rue Censier

@quadrismegistus
Copy link
Contributor Author

quadrismegistus commented Feb 12, 2024

In this case, since we want a row per person per address, and if we return simply one Address per one call of get_export_data() -> dict, and since addresses can contain more than one person, I'm not sure how we can guarantee to get one person per exported dict. There are 53 addresses with more than one person in the associated account.

Here's what happens if we use get_object_data per address object:

def get_object_data(self, obj):
        """
        Generate dictionary of data to export for a single
        :class:`~mep.people.models.Person`
        """
        addr = obj
        loc = addr.location
        persons = addr.account.persons.all()

        # required properties
        return dict(
            # Member
            member_id=[person.slug for person in persons],
            member_uri=[absolutize_url(person.get_absolute_url()) for person in persons],
            # Address data
            start_date=addr.partial_start_date,
            end_date=addr.partial_end_date,
            care_of_person_id=addr.care_of_person.slug if addr.care_of_person else None,
            # Location data
            street_address=loc.street_address,
            city=loc.city,
            postal_code=loc.postal_code,
            latitude=float(loc.latitude) if loc.latitude is not None else None,
            longitude=float(loc.longitude) if loc.longitude is not None else None,
            country=loc.country.name if loc.country else None,
            arrondissement=loc.arrondissement(),
        )

-> that uses semicolons to join the lists:

member_id	member_uri	care_of_person_id	street_address	postal_code	city	arrondissement	country	start_date	end_date	longitude	latitude
coyle-kathleen;coyle-kestrel-2	https://shakespeareandco.princeton.edu/members/coyle-kathleen/;https://shakespeareandco.princeton.edu/members/coyle-kestrel-2/		79 rue Denfert-Rochereau	75014	Paris	14	France	1/26/33	2/23/33	2.3343	48.8359

Which is fine but makes it not possible to easily join the csv from members to locations.

@rlskoeser
Copy link
Contributor

There are also 3 addresses with 0 persons in the account:

Thanks for finding. Please run by Josh and check if he knows whether they should be deleted or need to be associated with an account; hopefully he'll remember or be able to figure out. Maybe they were removed from an account but not deleted, or maybe they were duplicates. If need be, we can check if there's anything useful in the admin log entries.

@rlskoeser
Copy link
Contributor

In this case, since we want a row per person per address, and if we return simply one Address per one call of get_export_data() -> dict, and since addresses can contain more than one person, I'm not sure how we can guarantee to get one person per exported dict. There are 53 addresses with more than one person in the associated account.

It occurred to me after our last back and forth about this, addresses are like events - they are linked to accounts, not to people, and as you've noted we have a subset of accounts shared by people. My solution in the events dataset was a member_uris field that included delimited member uris for multi-person accounts - my intent was that labeling this field should make it clear you have to choose how to handle shared accounts. Whatever we do, the events and addresses datasets should work the same way. I think Josh and I documented this pretty clearly in the dataset essay (both events dataset and the fact of shared accounts).

@quadrismegistus
Copy link
Contributor Author

In this case, since we want a row per person per address, and if we return simply one Address per one call of get_export_data() -> dict, and since addresses can contain more than one person, I'm not sure how we can guarantee to get one person per exported dict. There are 53 addresses with more than one person in the associated account.

It occurred to me after our last back and forth about this, addresses are like events - they are linked to accounts, not to people, and as you've noted we have a subset of accounts shared by people. My solution in the events dataset was a member_uris field that included delimited member uris for multi-person accounts - my intent was that labeling this field should make it clear you have to choose how to handle shared accounts. Whatever we do, the events and addresses datasets should work the same way. I think Josh and I documented this pretty clearly in the dataset essay (both events dataset and the fact of shared accounts).

Good point—we already are committed to joining multiple people in events table so for consistency let's just keep doing that. My latest commit is an example of that happening.

Copy link
Contributor

@rlskoeser rlskoeser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your last revision looks great to me. Let's get this merged and deliver for testing!

"latitude",
]

# def get_queryset(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so with the revised logic, no queryset customization is needed? or would prefetching on persons still be useful?

@quadrismegistus quadrismegistus merged commit b46021a into develop Feb 20, 2024
11 checks passed
@quadrismegistus quadrismegistus deleted the new_exports branch February 20, 2024 22:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants