Maui Nui thoughts #3

Open
sunray1 opened this issue Feb 20, 2025 · 9 comments

@sunray1

sunray1 commented Feb 20, 2025

I've been chatting about the state of the current vocabularies with some folks I'm working with out in Maui Nui and getting a sense of why they need to retract data and when - here are some thoughts from them:

  • Generally, it needs to be possible to redact any field (CE: I think this is possible already, but just a confirmation)
  • Coordinates can be fuzzed per usual, but also location may need to be redacted
  • Often landholders don't want folks to know what is on their land or even that their land has been surveyed
  • Habitat type often also needs to be redacted (i.e., it's pretty easy to work out where the one bog on Kauai is)
  • Redacting common names as well as scientific names (CE: I don't remember if there's a term for common names in DwC)
  • Collector names often need to be redacted - they've had concerns with being targeted by some groups (think cat colonies)
  • The use of redacting based on dates is useful (i.e., a data embargo on tracking live organisms - think introduced species)

And some questions:

  • Is it possible to reduce the resolution of the reason? i.e., not publish subcategories but just the categories?
  • Is it possible to fuzz extension data or just core? You'd be able to figure out what a species is based on DNA or something.
  • Is there a difference between extreme/high or high/moderate and how you treat the data? Or is that just a division for internal use?
@ben-norton
Member

ben-norton commented Feb 20, 2025

@sunray1
A common traditional workflow for redacting information for sensitive species in a DWC dataset is as follows. Note: this is done prior to publication by the publishing institution.

  1. A set of sensitive species is identified by scientific name and stored as a list.
  2. The sensitive species list is checked against a DWC compliant dataset prior to publication.
  3. DWC records with scientific names that match an item on the sensitive species list are flagged.
  4. For flagged records, information in a specific set of fields is set to null.
  5. A notification is added to the informationWithheld field that states something similar to the following: "Location information for this record has been removed to protect a sensitive species. Please contact xxx for location information".
  6. The dataset is then published via an IPT.

Which fields are set to null is up to the individual institution. Null is not a requirement. It's just a common practice. For coordinates, fuzzy generalization is a common course of action. Ultimately, it's up to the provider.

For datasets that I've published, we would delete decimalLatitude, decimalLongitude, county, locality, localityRemarks, verbatimLocality, and, sometimes, stateProvince for sensitive species.
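
As a rough illustration of the workflow above, here is a minimal Python sketch. It assumes records are plain dictionaries keyed by DwC term names; the species names, the field set, and the function name `redact_records` are hypothetical, and matching on the exact `scientificName` string is a simplification.

```python
# Hypothetical sensitive-species list; the names are made up for illustration.
SENSITIVE_SPECIES = {"Examplus rarus"}

# Fields to blank for flagged records. Which fields to use is the provider's call.
FIELDS_TO_NULL = [
    "decimalLatitude",
    "decimalLongitude",
    "county",
    "locality",
    "verbatimLocality",
]

NOTICE = (
    "Location information for this record has been removed to protect "
    "a sensitive species. Please contact xxx for location information."
)


def redact_records(records, sensitive_species, fields_to_null, notice):
    """Return copies of the records with the chosen fields blanked for flagged species."""
    redacted = []
    for record in records:
        row = dict(record)  # copy, so the unredacted source data stays untouched
        name = row.get("scientificName", "").strip()
        if name in sensitive_species:
            for field in fields_to_null:
                if field in row:
                    row[field] = ""
            row["informationWithheld"] = notice
        redacted.append(row)
    return redacted
```

The redacted copies are what would go on to publication; the originals stay with the publishing institution.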

Determining which species are deemed "sensitive" is trickier than it may seem. Many default to IUCN, but that often omits local sensitivity. Again, this is ultimately up to the provider. For this reason (and a few others), global automated redaction has proven problematic.

It's also worth noting that redaction has different implications for different types of datasets. Camera trap data without coordinates is often not suitable for ecological modelling, which is its primary use. Coordinates are fuzzed instead of removed for this reason. Also, images (and the associated metadata) of park rangers are never published. It can be a matter of life or death for those individuals.

@TaniaGLaity
Collaborator

Hi Chandra,
Thanks very much for passing this info on to us!

Multiple attributes in a record will be able to be obfuscated / withheld using the proposed model. I think that answers a few of the questions / use cases above.

DWC uses vernacularName for common names.

Regarding the questions:

  1. For various reasons, we don't want to let users select just a category by default, as we may get fewer detailed reasons that way. A workaround would be to add an extra subcategory, e.g. "species regarded as sensitive because of threat but taxa not individually assessed" (i.e. a catch-all subcategory for the category). This has been brought up by other task group members - will discuss at the March meeting.
  2. I believe so - another one for discussion, but I don't think it should be an issue. In theory we could apply this to any attribute in DWC and extensions.
  3. Generally yes in Australia - Extreme would be withheld, High would be obfuscated to 1 decimal place / ~10 km, and Medium would be obfuscated to 2 decimal places / ~1 km.
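
To make the category-to-treatment mapping in point 3 concrete, here is a minimal Python sketch. The function name, the dictionary, and the choice to pass unknown categories through unchanged are assumptions for illustration, not part of any agreed vocabulary.

```python
# Decimal places kept per sensitivity category, following the Australian
# practice described above. "extreme" is handled separately (withheld).
DECIMAL_PLACES = {"high": 1, "medium": 2}


def generalize_coordinate(value, category):
    """Withhold, round, or pass through a decimal-degree coordinate."""
    category = category.lower()
    if category == "extreme":
        return None  # withheld entirely
    places = DECIMAL_PLACES.get(category)
    if places is None:
        return value  # assumption: unrecognized categories are left as-is
    return round(value, places)
```

Rounding to 1 decimal place of a degree is about 11 km in latitude, and 2 decimal places about 1.1 km, which lines up with the ~10 km and ~1 km figures.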

@TaniaGLaity
Collaborator

Thanks Ben.
For context, we put out a call for case studies to test our draft sensitivity treatments and reasons at our next Task Group meeting. Thanks for outlining the traditional workflow - that's useful for the Task Group to understand!

@sunray1
Author

sunray1 commented Feb 21, 2025

> Hi Chandra,
> Thanks very much for passing this info on to us!
>
> Multiple attributes in a record will be able to be obfuscated / withheld using the proposed model. I think that answers a few of the questions / use cases above.

That's what I thought, perfect! It definitely allows for all of the use cases, I think.

> DWC uses vernacularName for common names.
>
> Regarding the questions:
>
>   1. For various reasons, we don't want to let users select just a category by default, as we may get fewer detailed reasons that way. A workaround would be to add an extra subcategory, e.g. "species regarded as sensitive because of threat but taxa not individually assessed" (i.e. a catch-all subcategory for the category). This has been brought up by other task group members - will discuss at the March meeting.

I think the concern here was more that adding a more specific reason might unintentionally draw more attention to those records, rather than about not having enough information to distinguish a subcategory.

>   2. I believe so - another one for discussion, but I don't think it should be an issue. In theory we could apply this to any attribute in DWC and extensions.

Perfect

>   3. Generally yes in Australia - Extreme would be withheld, High would be obfuscated to 1 decimal place / ~10 km, and Medium would be obfuscated to 2 decimal places / ~1 km.

Interesting, did not know that! Perhaps this could be noted somewhere in the docs.

@ArthurChapman

Of course, using a taxonomic name alone to identify a sensitive species is problematic. It must also include a region of sensitivity. Some taxa may be highly sensitive in one area (e.g. a species of Hakea in Australia) but may be an invasive species in another (e.g. South Africa). In the area where it is invasive, one would need to know precise localities, etc.
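
One way to picture this is a sensitivity list keyed on name and region together rather than on name alone. The sketch below is a minimal Python illustration; the species epithet is made up, and using the DwC `countryCode` value as the region is an assumption (real lists may use finer regions).

```python
# Hypothetical list of (scientificName, countryCode) pairs. A taxon is only
# treated as sensitive inside the regions it is listed for.
SENSITIVE_IN_REGION = {("Hakea exemplaris", "AU")}


def is_sensitive(scientific_name, country_code, sensitive_pairs):
    """Return True only if the taxon is listed as sensitive for that region."""
    return (scientific_name, country_code) in sensitive_pairs
```

With this shape, the same name can be redacted for Australian records while South African records keep their precise localities.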

@TaniaGLaity
Collaborator

@ArthurChapman agreed. I'm guessing that when we get to the discussion about how we represent what changes have been made to the data, we will have to indicate that the record is sensitive according to XX list. That's roughly how we do it in the ALA, but a standard set of words or examples would be a good thing to have, I think.

@tucotuco
Member

I haven't heard mention of dwc:dataGeneralizations in the conversation so far, but it is relevant.

@TaniaGLaity
Collaborator

> I haven't heard mention of dwc:dataGeneralizations in the conversation so far, but it is relevant.

Definitely agree - we haven't got to the actual implementation yet. We're just trying to start to define the vocabularies first and test that they meet most scenarios using case studies. We might need to enlist your help once we get to that John!

@tucotuco
Member

Ready and willing to be helpful when needed.
