Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

201 bugfix remove references to sites where data originated from #229

Conversation

pcw24601
Copy link
Collaborator

@pcw24601 pcw24601 commented Apr 7, 2022

Remove references to site locations from file/directory names and anonymises all dicom files in tests\ directory.

@pcw24601 pcw24601 self-assigned this Apr 7, 2022
@pcw24601 pcw24601 linked an issue Apr 7, 2022 that may be closed by this pull request
Previous commit kept original files as well as anonymised versions.
@laurencejackson
Copy link
Collaborator

Hi @pcw24601, thanks for this contribution - could you please change the target branch of this PR to release/0.7.0? This is the branch where we will stage releases and will merge the changes together into the main branch as each release is finalised.

@pcw24601 pcw24601 changed the base branch from main to release/0.7.0 April 14, 2022 14:57
@pcw24601
Copy link
Collaborator Author

Hi @laurencejackson, I think I've done that now--let me know if it's OK (I'm off for a week now so there may be a delay in getting a reply).

@github-actions
Copy link

github-actions bot commented May 3, 2022

Coverage

Coverage Report
FileStmtsMissCoverMissing
hazenlib
   ACRObject.py1011090%59–61, 66–69, 107, 123–124
   HazenTask.py25388%32–34
   __init__.py541769%125–133, 145, 179–181, 184–186, 193–196, 207
   exceptions.py21481%17–21
   relaxometry.py2858670%182–200, 377, 423–425, 492, 566–588, 606–621, 1037–1040, 1046–1052, 1085–1130
   utils.py1894377%61, 65, 75, 80, 117, 124–129, 140, 143–150, 170–172, 190–192, 211–213, 222, 227, 233, 284, 287, 295–300, 303, 346, 355, 371
hazenlib/tasks
   acr_geometric_accuracy.py1216447%41–73, 121–146, 160–194
   acr_ghosting.py1094361%34–49, 75–77, 107–109, 145–186
   acr_slice_position.py1354765%46–61, 187–233
   acr_slice_thickness.py1345857%34–48, 157–216
   acr_snr.py1356353%43–83, 93, 163–173, 207–222, 255–272
   acr_spatial_resolution.py2097166%58–83, 128, 171, 184–193, 275–326
   acr_uniformity.py833360%35–51, 108–133
   ghosting.py1525365%18–35, 50, 112–113, 117, 127–128, 154–156, 173–175, 221–259
   relaxometry.py770%1–11
   slice_position.py1172380%28, 37–38, 49, 103–104, 130, 210, 217–234
   slice_width.py3575385%34–37, 41, 109, 168–188, 453, 458–459, 465, 470, 532–533, 782–823
   snr.py1666760%51, 68–73, 167–185, 200–209, 227–237, 264–274, 279–289, 320–333, 338–346, 375–388
   snr_map.py104199%291
   spatial_resolution.py2474582%36–39, 43, 64, 149, 208, 334–370
   uniformity.py792075%42–45, 51, 93–94, 101, 135–149
TOTAL285581172% 

Tests Skipped Failures Errors Time
201 0 💤 0 ❌ 0 🔥 2m 46s ⏱️

@tomaroberts
Copy link
Collaborator

@pcw24601

Thanks for this. Did you anonymise every DICOM in tests/data? What did you use to anonymise them and which specific tags did you alter?

@pcw24601
Copy link
Collaborator Author

pcw24601 commented May 9, 2022

@tomaroberts

Yes--I anonymised every file using the algorithm below (very slightly adapted from Dicognito). A couple of points worth noting:

  1. Files no longer strictly meet the DICOM standard as I created UI prefixes which I don't own and made no effort to maintain consistency with them--if two fields contained the same UI in the original files, they will have different UIs in the anonymised versions.

  2. These original DICOM files are all currently in the public domain in the git history, so anonymising now is fairly weak, although better than nothing.

Algorithm

Files are anonymised by:
    
    1. Replacing [UI] fields (except 'ClassUID' and 'TransferSyntaxUID')
        with new values of the form '2.datetime.counter.ramdom'.
        
    2. Setting all [PN] fields to 'Anon'
    
    3. Removing the following fields (if they exist): 
        "BranchOfService",
        "Occupation",
        "MedicalRecordLocator",
        "MilitaryRank",
        "PatientInsurancePlanCodeSequence",
        "PatientReligiousPreference",
        "PatientTelecomInformation",
        "PatientTelephoneNumbers",
        "ReferencedPatientPhotoSequence",
        "ResponsibleOrganization"
        
    4. Set the following fields to 'Anon.random_string' (if they exist):
        "AccessionNumber",
        "OtherPatientIDs",
        "FillerOrderNumberImagingServiceRequest",
        "FillerOrderNumberImagingServiceRequestRetired",
        "FillerOrderNumberProcedure",
        "PatientID",
        "PerformedProcedureStepID",
        "PlacerOrderNumberImagingServiceRequest",
        "PlacerOrderNumberImagingServiceRequestRetired",
        "PlacerOrderNumberProcedure",
        "RequestedProcedureID",
        "ScheduledProcedureStepID",
        "StudyID"
        
    5. Set the following fields to 'Anon' (if they exist):
        "RequestingService",
        "CurrentPatientLocation",
        "PatientAddress",
        "RegionOfResidence",
        "CountryOfResidence",
        "InstitutionName",
        "InstitutionAddress",
        "InstitutionalDepartmentName",
        "StationName"
        "DeviceSerialNumber"
        "ProtocolName"
        "StudyDescription"
        "PerformedProcedureStepDescription"
        "PerformedStationName"
        "PerformedStationAETitle"
        
    6. Set "DeidentificationMethod" to 'DICOGNITO_BHSCT'
        (or append if value exists).
        
    7. DICOMDIR files are skipped

@tomaroberts
Copy link
Collaborator

Thanks for info Paul.

Nice, I've never seen dicognito before. Just used pydicom in the past when anonymising my own stuff.

True re: Github history, but definitely better to sort before hazen becomes more widely adopted so fewer people don't have the files on their machines.

@pcw24601
Copy link
Collaborator Author

There are definite advantages to anonymising by hand, for example, you can hunt through the private fields. However, it's a slow process especially for phantom data were it's unlikely patient data will be exposed.

@tomaroberts tomaroberts changed the base branch from release/0.7.0 to main July 17, 2023 13:13
@tomaroberts tomaroberts force-pushed the 201-bugfix-remove-references-to-sites-where-data-originated-from branch from 820ffbe to 992574c Compare July 17, 2023 13:39
@tomaroberts
Copy link
Collaborator

tomaroberts commented Jul 17, 2023

Error on my part – merged main into here, forgetting that Paul had updated the files in this PR. Hence, the files he changed were overwritten with newer files. I've force-pushed back to his previous most recent commit.

@tomaroberts tomaroberts changed the base branch from main to release/0.7.0 July 17, 2023 13:42
@tomaroberts
Copy link
Collaborator

@tomaroberts – reminder to self, merge all the other PRs into main for the next release, then merge this one last. Should be easier to confirm that file changes are stable.

@tomaroberts tomaroberts changed the base branch from release/0.7.0 to main July 28, 2023 13:54
@tomaroberts tomaroberts requested review from tomaroberts and removed request for Lucrezia-Cester July 28, 2023 15:52
Copy link
Collaborator

@tomaroberts tomaroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked across repo to ensure any old filenames have been adjusted correctly. Tested locally with pytest. GHA tests passing. LGTM.

@tomaroberts tomaroberts merged commit 72aace8 into main Jul 28, 2023
@tomaroberts tomaroberts deleted the 201-bugfix-remove-references-to-sites-where-data-originated-from branch July 28, 2023 15:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bugfix: remove references to sites where data originated from
3 participants