New export management commands #804

quadrismegistus · 2024-01-30T16:32:59Z

We need to customize/add to the JSON+CSV export commands we now have (export_members, export_books, export_events) to reflect the following:

1. A new genre field called category on books, doable by modifying export_books. No issue yet.
2. New nationality and gender info on authors/creators, and indeed all info on creators, doable by writing an export_creators command that ideally subclasses/inherits most of the logic from the export_members command, since both draw on the same Person table. See #684, "As a content admin, I want to export information about creators from the database so I can work with the data using other tools."
3. New time-sensitive location information about who lived where when, doable by a new separate command, export_locations. See #790, "As a content admin, I want to export member addresses with more details in a separate file so I can more easily work with member locations."

This PR has a minimal working version of (2) using the approach described there.

codecov · 2024-01-30T16:33:41Z

Codecov Report

Merging #804 (f70e772) into develop (b3d034b) will increase coverage by 0.00%.
The diff coverage is 98.60%.

Additional details and impacted files

@@            Coverage Diff            @@
##           develop     #804    +/-   ##
=========================================
  Coverage    97.39%   97.40%            
=========================================
  Files          231      233     +2     
  Lines        13241    13349   +108     
  Branches        81       81            
=========================================
+ Hits         12896    13002   +106     
- Misses         342      344     +2     
  Partials         3        3

rlskoeser

I really like the way you're leveraging the existing member export and the overlap in functionality with that and the creator export we need. This seems like a really good direction.

I think all the member-specific fields should be omitted from the creator export; although it might be nice to have an is_member boolean or something like that.

What if you override the field list in your creator export and then in get_object_data you can check whether a field should be populated based on that list? It's not perfect encapsulation; the other option is a base class with all the common fields used in both, but that seems like that might be more complicated than we need.
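
A minimal sketch of the field-list idea described above, written with plain Python stand-ins rather than the project's actual export base class and Person model. The names `FakePerson`, `MemberExport`, `CreatorExport`, and the `include` helper are hypothetical; only the approach (override `csv_fields` in the subclass, then consult that list inside `get_object_data`) comes from the comment.

```python
from dataclasses import dataclass, field


@dataclass
class FakePerson:
    """Hypothetical stand-in for a Person record."""

    slug: str
    name: str
    membership_years: list = field(default_factory=list)


class MemberExport:
    """Hypothetical stand-in for the member export command."""

    csv_fields = ["id", "name", "membership_years"]

    def include(self, field_name):
        # a field is populated only if this export lists it
        return field_name in self.csv_fields

    def get_object_data(self, obj):
        data = {"id": obj.slug, "name": obj.name}
        if self.include("membership_years"):
            # set for unique values, sorted list for stable serialization
            data["membership_years"] = sorted(set(obj.membership_years))
        # drop anything a subclass has removed from csv_fields
        return {k: v for k, v in data.items() if self.include(k)}


class CreatorExport(MemberExport):
    """Creator export: same logic, member-only fields omitted."""

    csv_fields = ["id", "name"]


person = FakePerson("example-person", "Example Person", [1921, 1921, 1922])
member_row = MemberExport().get_object_data(person)
creator_row = CreatorExport().get_object_data(person)
```

The subclass changes only the field list; the shared `get_object_data` decides what to compute from that list, which is the imperfect-but-simple encapsulation the comment weighs against a common base class.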

rlskoeser · 2024-02-01T20:25:33Z

mep/people/management/commands/export_creators.py

+        creator_ids = {c.person_id for c in Creator.objects.all()}
+        return Person.objects.filter(pk__in=creator_ids)


should be able to get this in a single query; try this:

Suggested change

-        creator_ids = {c.person_id for c in Creator.objects.all()}
-        return Person.objects.filter(pk__in=creator_ids)
+        return Person.objects.filter(creator__isnull=False).distinct()

The distinct is needed because without it we get one person record for every time they have a creator entry
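
To illustrate why filtering across a one-to-many relation repeats rows, here is a small sketch using the standard-library `sqlite3` module instead of the Django ORM. The table and column names are invented for the illustration and are not the project's schema; the point is only that a join yields one row per related entry until `DISTINCT` is applied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE creator (id INTEGER PRIMARY KEY, person_id INTEGER, work TEXT);
    INSERT INTO person VALUES (1, 'A'), (2, 'B'), (3, 'C');
    -- person 1 is credited on two works, person 2 on one, person 3 on none
    INSERT INTO creator VALUES (1, 1, 'w1'), (2, 1, 'w2'), (3, 2, 'w3');
    """
)

join_sql = "SELECT {cols} FROM person JOIN creator ON creator.person_id = person.id"

# one row per creator entry, so person 1 appears twice
with_duplicates = conn.execute(join_sql.format(cols="person.id")).fetchall()

# DISTINCT collapses the repeats to one row per person
deduplicated = conn.execute(join_sql.format(cols="DISTINCT person.id")).fetchall()
```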

rlskoeser · 2024-02-01T20:29:00Z

mep/people/management/commands/export_members.py

@@ -22,7 +22,7 @@ class Command(BaseExport):
    model = Person

    csv_fields = [
-        "uri",


We need to keep the uri for the member export for backwards compatibility.

I'm not sure "slug" is meaningful to an outside audience... Inclined to use URIs for creators even though they don't resolve, but I don't feel strongly about that. Do you want to use slug for creators export with a different label?

on second thought - because there's overlap between members and creators, I recommend we stick with URIs so the two data exports can easily be used together

Should I make my own URIs for the creators then / those that don't resolve, using the slug?

ah! I'd forgotten that we don't generate urls for persons without accounts (and it wouldn't make sense to create member uris for them anyways)

is it weird to include a member uri in the creator export if the person is a member?

I think that yes, we should create uris for creators - this means we should create a super-minimal /creator/slug/ view - maybe it could return brief json response only? (I feel like I did this somewhere else for URIs that were only referenced in a data export... I looked quickly at mep-django urls and didn't see it.)

The additional value of a view like that is that it will give us a way to check for renamed slugs, assuming we adapt the same redirect logic for old slugs that we have for persons.

However, it does increase the work required to complete this export! I'd be ok with slug (but label as id) for creators.


rlskoeser · 2024-02-01T20:31:20Z

mep/people/management/commands/export_members.py

@@ -85,8 +86,10 @@ def get_object_data(self, obj):
        if obj.death_year:
            data["death_year"] = obj.death_year
        # set for unique, list for json serialization
-        data["membership_years"] = list(
-            set(d.year for d in obj.account_set.first().event_dates)
+        data["membership_years"] = (


instead of making this conditional on whether the person has an account, can we just skip based on whether membership years is in the list of fields to be exported? that isn't relevant for the creator export, is it? (although I know there is some overlap)


rlskoeser

this revision looks good to me!

rlskoeser · 2024-02-01T21:51:50Z

mep/people/management/commands/export_creators.py

+        "sort_name",
+        "title",
+        "gender",
+        "is_organization",


do we have any creator orgs?

Looks like one:

In [2]: Person.objects.filter(creator__isnull=False, is_organization=True)
Out[2]: <PersonQuerySet [<Person pk:10524 Fabian Society>]>

wow, fascinating; project full of edge cases

rlskoeser · 2024-02-01T21:53:10Z

mep/people/management/commands/export_creators.py


    def get_queryset(self):
        """filter to creators"""
-        creator_ids = {c.person_id for c in Creator.objects.all()}
-        return Person.objects.filter(pk__in=creator_ids)
+        return Person.objects.filter(creator__isnull=False)


pretty sure you need the .distinct(); did you test with this and see what you get?

woops, I forgot to include that, thanks. It's def needed: with distinct we have 2532 creators from queryset to export and without it we have 6479.

rlskoeser

These exports are looking really good, congratulations. I have some questions and suggestions but hopefully it's just minor cleanup and finishing the unit tests

rlskoeser · 2024-02-08T21:23:04Z

mep/common/management/export.py

+        total = len(
+            objects
+        )  # fewer assumptions, allows other (multi model/class) objects


was this needed? are we exporting anything other than database content in the new export scripts?

ah, I see it's due to the address person/account issue; I'd like to resolve it there instead


rlskoeser · 2024-02-08T21:24:41Z

mep/books/management/commands/export_books.py

-        ["uri", "title"]
+        # including "id" to store slug for exports,
+        # given not all exported entities have a URI
+        ["id", "uri", "title"]


pretty sure all books should have uris; did you encounter any that did not?

rlskoeser · 2024-02-08T21:25:35Z

mep/people/management/commands/export_creators.py

+Generates a CSV and JSON file including details on every creator
+database, with summary details and coordinates for associated addresses.


looks like this is slightly out of date.

maybe worth briefly clarifying how we define creator?

rlskoeser · 2024-02-08T21:27:06Z

mep/people/management/commands/export_locations.py

+    model = Address
+
+    csv_fields = [
+        "id",  # including "id" to store slug for exports,


locations/addresses don't have IDs, as far as I can remember (other than db id, which we should not include in the exports)

do you think we need public ids for these? I was assuming we would not

Because we don't have URIs for locations or creators, I am including the slug, which we do have, for all 4 exports, but calling it "id" instead of slug. This allows csv-users to easily merge the datasets. That or something like it would be my recommendation but up to you.

But we could leave out an "id" field for locations since Location/Address doesn't have a slug, and since the output csv should still be mergeable with the others on id in the members.csv to member_id in the locations.csv

I agree that locations / address data export does not need IDs, just needs to have a member uri to join with the members dataset

I can't remember if I suggested: I think we should include member_uri in the creators dataset, so that any authors who are also library members can be easily identified and the author and member records can be connected (this seems better than a boolean indicating membership)

We need to keep URIs where we have them (which is everywhere but creators, right?), and merging/joining can always be done based on the URIs. I always figured short ids could be generated from URIs downstream where they are useful; not sure of the balance between redundancy and convenience in the published datasets

How are you proposing to support merging book and creator data? We're using authorized names where we have them, so it should be possible to merge based on those, which I think we do include in both datasets. What do you think about a csv explicitly documenting the relationships? I don't love it personally but it's also more explicit, and would be more useful for some kinds of questions.
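
As a rough sketch of the downstream workflow discussed in this thread (joining on URIs and deriving short ids from them where convenient), the snippet below uses only the standard library. The helper `short_id_from_uri`, the `example.org` URIs, and the sample rows are all hypothetical; the `member_uri` column on creators is the field proposed above, not a confirmed part of the export.

```python
from urllib.parse import urlparse


def short_id_from_uri(uri):
    """Return the last non-empty path segment of a URI (hypothetical helper)."""
    segments = [s for s in urlparse(uri).path.split("/") if s]
    return segments[-1] if segments else None


# invented sample rows standing in for the members and creators exports
members = [
    {"uri": "https://example.org/members/person-a/", "name": "Person A"},
    {"uri": "https://example.org/members/person-b/", "name": "Person B"},
]
creators = [
    {"name": "Person A", "member_uri": "https://example.org/members/person-a/"},
    {"name": "Person C", "member_uri": ""},
]

# join creators to members on the URI, then derive a short id for display
members_by_uri = {m["uri"]: m for m in members}
creators_who_are_members = [
    (c["name"], short_id_from_uri(c["member_uri"]))
    for c in creators
    if c["member_uri"] in members_by_uri
]
```

Keeping the URI as the join key and computing the short id only when needed avoids publishing two redundant identifier columns, which is the trade-off raised in the comment.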

rlskoeser · 2024-02-08T21:33:54Z

mep/people/management/commands/export_locations.py

+        for addr in addresses.all():
+            persons = [addr.person] if addr.person else addr.account.persons.all()
+            for person in persons:
+                res.append((person, addr))
+        return res


I think there is a lingering possibility in the db structure that we never finished cleaning up which allows an address to be directly associated with a person instead of an account but in practice we should only ever use account.

Did you find otherwise in the actual data when you were working on the exports? (maybe something changed since the last time I worked on this part)

I remember working on the cleanup but I don't remember where we left it; if need be I can look at the history and figure out more.

I found cases where only a person relation existed and instances where only an account relation existed

I'm having trouble finding the github issue, but I checked the admin interface and it's not possible to edit or add relationships directly between people and addresses. I also checked with the python console in production and I am not finding any records with that relationship - can you confirm?

FWIW, we removed it because we decided it was out of scope for the project, so even if they were there they would arguably be out of scope for this export also.

I can't remember why we removed it from the interface and not from the underlying data structure, maybe that was meant to be a secondary step once the relationships were migrated. I apologize for the confusion in the code, it's certainly not obvious there that you can ignore it.



rlskoeser · 2024-02-09T20:23:23Z

mep/people/management/commands/export_creators.py

+        "birth_year",
+        "death_year",
+        "viaf_url",
+        "wikipedia_url",
+        # related country
+        "nationalities",
+        # generic
+        "notes",
+        "updated",
+    ]
+


In #684 Josh requested/suggested creator types and associated items - creator types seems like it would be useful. I don't know if you already have a solution for associating books and creators.

rlskoeser

All the code changes look good, I think we mainly have questions about finalizing field specifics. I'm inclined to get this in front of Josh sooner rather than later while we decide the picky details. Should be straightforward to make remaining field tweaks after that, I think.

The one change I'd like to see before we merge for a first round of testing is to simplify the location export code to use an Address queryset and only use related accounts.

If you find any instances of addresses associated with person instead of account in production data, let's treat it as a data cleanup problem.

quadrismegistus · 2024-02-12T15:29:23Z

> All the code changes look good, I think we mainly have questions about finalizing field specifics. I'm inclined to get this in front of Josh sooner rather than later while we decide the picky details. Should be straightforward to make remaining field tweaks after that, I think.
>
> The one change I'd like to see before we merge for a first round of testing is to simplify the location export code to use an Address queryset and only use related accounts.
>
> If you find any instances of addresses associated with person instead of account in production data, let's treat it as a data cleanup problem.

In this case, we want a row per person per address. If each call of get_object_data() returns one dict per Address, and an address can be associated with more than one person, I'm not sure how we can guarantee one person per exported dict. There are 53 addresses with more than one person in the associated account.

quadrismegistus · 2024-02-12T15:37:21Z

There are also 3 addresses with 0 persons in the account:


In [8]: for addr in Address.objects.all():
   ...:     if len(addr.account.persons.all())==0:
   ...:         print(f'{len(addr.account.persons.all())} persons in account for address {addr}')
   ...: 
0 persons in account for address 41 avenue du Maine, Paris — Account #320: 41 avenue du Maine
0 persons in account for address 220 boulevard Raspail, Paris — Account #7544: 220 boulevard Raspail
0 persons in account for address 41 rue Censier, Paris — Account #7652: 41 rue Censier

quadrismegistus · 2024-02-12T15:45:25Z

> In this case, we want a row per person per address. If each call of get_object_data() returns one dict per Address, and an address can be associated with more than one person, I'm not sure how we can guarantee one person per exported dict. There are 53 addresses with more than one person in the associated account.

Here's what happens if we use get_object_data per address object:

def get_object_data(self, obj):
    """
    Generate dictionary of data to export for a single address;
    members of a shared account are exported as lists.
    """
    addr = obj
    loc = addr.location
    # an account can be shared, so one address may map to several persons
    persons = addr.account.persons.all()

    # required properties
    return dict(
        # Member
        member_id=[person.slug for person in persons],
        member_uri=[absolutize_url(person.get_absolute_url()) for person in persons],
        # Address data
        start_date=addr.partial_start_date,
        end_date=addr.partial_end_date,
        care_of_person_id=addr.care_of_person.slug if addr.care_of_person else None,
        # Location data
        street_address=loc.street_address,
        city=loc.city,
        postal_code=loc.postal_code,
        latitude=float(loc.latitude) if loc.latitude is not None else None,
        longitude=float(loc.longitude) if loc.longitude is not None else None,
        country=loc.country.name if loc.country else None,
        arrondissement=loc.arrondissement(),
    )

-> that uses semicolons to join the lists:

member_id	member_uri	care_of_person_id	street_address	postal_code	city	arrondissement	country	start_date	end_date	longitude	latitude
coyle-kathleen;coyle-kestrel-2	https://shakespeareandco.princeton.edu/members/coyle-kathleen/;https://shakespeareandco.princeton.edu/members/coyle-kestrel-2/		79 rue Denfert-Rochereau	75014	Paris	14	France	1/26/33	2/23/33	2.3343	48.8359

Which is fine, but it makes it impossible to join the members csv to the locations csv easily.

rlskoeser · 2024-02-12T17:06:25Z

> There are also 3 addresses with 0 persons in the account:

Thanks for finding. Please run by Josh and check if he knows whether they should be deleted or need to be associated with an account; hopefully he'll remember or be able to figure out. Maybe they were removed from an account but not deleted, or maybe they were duplicates. If need be, we can check if there's anything useful in the admin log entries.

rlskoeser · 2024-02-12T17:15:59Z

> In this case, we want a row per person per address. If each call of get_object_data() returns one dict per Address, and an address can be associated with more than one person, I'm not sure how we can guarantee one person per exported dict. There are 53 addresses with more than one person in the associated account.

It occurred to me after our last back and forth about this, addresses are like events - they are linked to accounts, not to people, and as you've noted we have a subset of accounts shared by people. My solution in the events dataset was a member_uris field that included delimited member uris for multi-person accounts - my intent was that labeling this field should make it clear you have to choose how to handle shared accounts. Whatever we do, the events and addresses datasets should work the same way. I think Josh and I documented this pretty clearly in the dataset essay (both events dataset and the fact of shared accounts).

quadrismegistus · 2024-02-12T18:21:44Z

> In this case, we want a row per person per address. If each call of get_object_data() returns one dict per Address, and an address can be associated with more than one person, I'm not sure how we can guarantee one person per exported dict. There are 53 addresses with more than one person in the associated account.
>
> It occurred to me after our last back and forth about this, addresses are like events - they are linked to accounts, not to people, and as you've noted we have a subset of accounts shared by people. My solution in the events dataset was a member_uris field that included delimited member uris for multi-person accounts - my intent was that labeling this field should make it clear you have to choose how to handle shared accounts. Whatever we do, the events and addresses datasets should work the same way. I think Josh and I documented this pretty clearly in the dataset essay (both events dataset and the fact of shared accounts).

Good point: we are already committed to joining multiple people in the events table, so for consistency let's keep doing that. My latest commit is an example of that happening.
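
For anyone consuming the delimited field, a possible downstream step is to split it back out into one row per member. The sketch below assumes a tab-separated file with a `member_uris` column joined by semicolons; the column name, the `example.org` URIs, and the `one_row_per_member` helper are illustrative assumptions, not the export's confirmed layout.

```python
import csv
import io

# hypothetical two-column excerpt; the real export has more fields
sample = (
    "member_uris\tstreet_address\n"
    "https://example.org/members/a/;https://example.org/members/b/\t1 rue Exemple\n"
    "https://example.org/members/c/\t2 rue Exemple\n"
)


def one_row_per_member(rows, field="member_uris", delimiter=";"):
    """Yield a copy of each row for every member URI it lists."""
    for row in rows:
        # filter(None, ...) skips empty strings from blank or trailing values
        for uri in filter(None, row[field].split(delimiter)):
            yield {**row, field: uri}


reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
exploded = list(one_row_per_member(reader))
```

Duplicating the address for each member is only one way to handle shared accounts; the choice is deliberately left to the dataset user, as noted above.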

rlskoeser

Your last revision looks great to me. Let's get this merged and deliver for testing!

rlskoeser · 2024-02-15T15:13:05Z

mep/accounts/management/commands/export_locations.py

+        "latitude",
+    ]
+
+    # def get_queryset(self):


so with the revised logic, no queryset customization is needed? or would prefetching on persons still be useful?

minimalist export_creators

ca5ba7c

minor

ed2385c

quadrismegistus marked this pull request as draft February 1, 2024 16:50

quadrismegistus requested a review from rlskoeser February 1, 2024 16:50

rlskoeser reviewed Feb 1, 2024


quadrismegistus added 3 commits February 1, 2024 16:45

more flexibly subclassing

901a9b9

more flexibly subclassing (m)

5bf80b4

more flexibly subclassing (m)

bb362ff

rlskoeser reviewed Feb 1, 2024


quadrismegistus added 5 commits February 1, 2024 17:05

more flexibly subclassing (m)

046db90

forgot distinct in queryset

cbc3578

books

7fc8279

these 3 working

0db71e6

locations working

7f207a5

quadrismegistus requested a review from rlskoeser February 7, 2024 20:25

rlskoeser reviewed Feb 8, 2024


quadrismegistus and others added 7 commits February 8, 2024 17:19

Update mep/people/management/commands/export_locations.py

46da0d8

Co-authored-by: Rebecca Sutton Koeser

Update mep/people/management/commands/export_locations.py

3eb1121

Co-authored-by: Rebecca Sutton Koeser

Update mep/people/management/commands/export_locations.py

bf3cdec

Co-authored-by: Rebecca Sutton Koeser

cleanup

5c3ccff

cleanup 2

3cb1c64

book tests

88c2d0e

other tests

f11f0c2

quadrismegistus marked this pull request as ready for review February 9, 2024 16:36

quadrismegistus requested a review from rlskoeser February 9, 2024 16:36

other tests (cleanup)

7fd0853

rlskoeser reviewed Feb 9, 2024


version with default get_object_data logic

5c7600c

rlskoeser approved these changes Feb 15, 2024


quadrismegistus added 5 commits February 20, 2024 16:42

fixes to a few errors and loose ends

c1cbbbe

forgot .distinct()

cfa6f33

updated tests to match new export_locations logic

1381093

Merge branch 'develop' into new_exports

5e3eb4c

adapting fixtures and tests to plural categories

f70e772

quadrismegistus merged commit b46021a into develop Feb 20, 2024
11 checks passed

quadrismegistus deleted the new_exports branch February 20, 2024 22:39
