Create shelves files for pds4 #57

Draft: wants to merge 57 commits into base: main

juzen2003
Collaborator

@juzen2003 juzen2003 commented Sep 23, 2024

Current status of creating shelves files for pds4:

  • Create files in checksums-* directory (pds4checksums.py)

    • Modification made:
      • update BUNDLENAME_REGEX
    • Example command:
      • python holdings_maintenance/pds4/pds4checksums.py --init /Volumes/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased
      • python holdings_maintenance/pds4/pds4checksums.py --init /Volumes/rms-holdings/pds4-holdings/metadata/uranus_occs_earthbased
      • python holdings_maintenance/pds4/pds4checksums.py --init /Volumes/rms-holdings/pds4-holdings/diagrams/uranus_occs_earthbased
  • Create files in _infoshelf-* directory (pds4infoshelf.py); the corresponding checksums files from the step above are required

    • Modification made:
      • properly import pds4checksums
    • Example command:
      • python holdings_maintenance/pds4/pds4infoshelf.py --init /Volumes/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased
      • python holdings_maintenance/pds4/pds4infoshelf.py --init /Volumes/rms-holdings/pds4-holdings/metadata/uranus_occs_earthbased
      • python holdings_maintenance/pds4/pds4infoshelf.py --init /Volumes/rms-holdings/pds4-holdings/diagrams/uranus_occs_earthbased
  • Create files in _indexshelf-metadata (pds4indexshelf.py)

    • Modification made:
      • Move BUNDLENAME_REGEX into the Pds3File & Pds4File classes, since the pattern differs between pds3 & pds4
      • Add IDX_EXT and LBL_EXT to Pds3File & Pds4File to replace '.tab' & '.lbl' in pdsfile.py
        • pds4 label extension is .xml and idx extension is .csv
        • pds3 label extension is .lbl and idx extension is .tab
    • Pending items:
      • Wait for label files (.xml) in metadata
  • Create files in _linkshelf-* directory (pds4linkshelf.py)

    • Modification made:
      • Remove .TXT from EXTS_WO_LABELS, since a .TXT file can have a label in pds4
      • Add logic to link a file to its corresponding label if the file appears in that label's file_name tags.
      • Add logic to identify files such as errata.txt or checksum files that appear in neither the label nor the csv. They are not part of the archive, so they have no labels.
    • Example command:
      • python holdings_maintenance/pds4/pds4linkshelf.py --init /Volumes/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased
    • Pending items:
      • To create linkshelf-metadata, wait for label files (.xml) in metadata
  • Create archive files (pds4archives.py)

    • Modification made:
      • Add ARCHIVE_PATHS and ARCHIVE_DIRS rules to determine the archive file names and the included directories for each archive file. (each bundle set has its own rules)
        • ARCHIVE_PATHS: map a bundle set or a bundle to a list of logical paths of the archive file names.
        • ARCHIVE_DIRS: map a logical path of an archive file name to a list of logical paths of the included directories.
      • Example command:
        • python holdings_maintenance/pds4/pds4archives.py --init /Volumes/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased
        • python holdings_maintenance/pds4/pds4archives.py --init /Volumes/rms-holdings/pds4-holdings/bundles/cassini_iss/cassini_iss_cruise
      • Pending items:
        • Current archive files or info are uploaded to Dropbox Pds4FileTest/archive-bundles for review
      • Note:
        • Uranus occs has one archive file uranus_occs_earthbased.tar.gz with bundle set as the file name, and all bundles are included in the archive file.
        • Cassini iss cruise has:
          • bundle.xml and non data_raw and browse_raw files in one archive(bundle_xml_non_data_browse_collections.tar.gz)
          • All browse directories of different sclk in browse_raw collection have their own archive files with the same file name as the sclk directories (browse_raw_.*.tar.gz)
            • collection_browse_raw.csv/xml exist in every .tar.gz file
          • All data directories of different sclk in data_raw collection have their own archive files with the same file name as the sclk directories (data_raw_.*.tar.gz)
            • collection_data_raw.csv/xml exist in every .tar.gz file
        • Cassini iss saturn has:
          • bundle.xml and non data_raw and browse_raw files in one archive(bundle_xml_non_data_browse_collections.tar.gz)
          • All browse directories of different sclk in browse_raw collection have their own archive files with the same file name as the sclk directories (browse_raw_.*.tar.gz)
            • collection_browse_raw.csv/xml exist in every .tar.gz file
          • All data directories of different sclk in data_raw collection have their own archive files with the same file name as the sclk directories (data_raw_.*.tar.gz)
            • collection_data_raw.csv/xml exist in every .tar.gz file
        • Cassini vims cruise has one archive file cassini_vims_cruise.tar.gz, and all files are included in the archive file.
        • Cassini vims saturn has:
          • bundle.xml and non data_raw and browse_raw files in one archive(bundle_xml_non_data_browse_collections.tar.gz)
          • All browse directories of different sclk in browse_raw collection have their own archive files with the same file name as the sclk directories (browse_raw_.*.tar.gz)
            • collection_browse_raw.csv/xml exist in every .tar.gz file
          • All data directories of different sclk in data_raw collection have their own archive files with the same file name as the sclk directories (data_raw_.*.tar.gz)
            • collection_data_raw.csv/xml exist in every .tar.gz file
        • Cassini uvis solarocc beckerjarmak2023 has one archive file cassini_uvis_solarocc_beckerjarmak2023.tar.gz, and all files are included in the archive file.
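
As a rough illustration of the two rule tables described in the pds4archives.py item above, the sketch below maps a bundle set's logical path to its archive file names, and each archive file to the directories it includes. The plain-dictionary layout and the helper `archive_dirs_for` are assumptions made for illustration only; the actual rules under pds4file/rules/ may be structured differently.

```python
# Hypothetical, simplified stand-ins for the ARCHIVE_PATHS / ARCHIVE_DIRS rules.
# The real rules live under pds4file/rules/ and may use a different structure.

# Bundle set (logical path) -> logical paths of its archive files.
ARCHIVE_PATHS = {
    "bundles/uranus_occs_earthbased": [
        "archives-bundles/uranus_occs_earthbased/uranus_occs_earthbased.tar.gz",
    ],
}

# Archive file (logical path) -> logical paths of the directories it contains.
ARCHIVE_DIRS = {
    "archives-bundles/uranus_occs_earthbased/uranus_occs_earthbased.tar.gz": [
        "bundles/uranus_occs_earthbased",
    ],
}


def archive_dirs_for(bundle_logical_path):
    """Return {archive path: [included dirs]} for one bundle set (illustrative helper)."""
    return {
        archive: ARCHIVE_DIRS.get(archive, [])
        for archive in ARCHIVE_PATHS.get(bundle_logical_path, [])
    }
```

A bundle set with a single archive, such as uranus occs, needs only one entry in each table; a bundle split by sclk directory would list one archive per directory.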

Pending items:

  • Check rms-pdstable (pds3, pdstable) repo, create a pds4 version of it to read the pds4 table.
  • Work on rules for pds4 archive (first draft of archive file & rules are ready for review)
    • .tar.gz files are uploaded to Dropbox Pds4FileTest/archive-bundles for review
    • Rules related to archive files are under pds4file/rules/
  • Add a rule to map a file path to its corresponding .tar.gz file for viewmaster & validation
  • Once metadata labels are added, update pds4indexshelf.py to create _indexshelf-metadata for pds4
  • Once metadata labels are added, update pds4linkshelf.py to create _linkshelf-metadata for pds4
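
One possible shape for the pending file-to-archive rule is a reverse lookup over an archive-to-directories mapping: pick the archive whose included directory is the deepest ancestor of the file. This is only a sketch under that assumption; the function name `archive_for_file` and the mapping layout are hypothetical and are not part of pds4file.

```python
from pathlib import PurePosixPath


def archive_for_file(file_logical_path, archive_dirs):
    """Return the archive whose included directory most specifically contains the file.

    archive_dirs: {archive logical path: [included directory logical paths]}
    Returns None when no listed directory contains the file.
    """
    target = PurePosixPath(file_logical_path)
    best_archive, best_depth = None, -1
    for archive, dirs in archive_dirs.items():
        for d in dirs:
            d_path = PurePosixPath(d)
            # A directory "contains" the file if it is the path itself or an ancestor.
            if d_path == target or d_path in target.parents:
                depth = len(d_path.parts)
                if depth > best_depth:
                    best_archive, best_depth = archive, depth
    return best_archive
```

Files that are deliberately stored in several archives (for example the collection .csv/.xml files) would need a rule that returns a list rather than a single archive.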

Note:

  • These directories are generated using full holdings and uploaded to Dropbox
  • We don't bypass any directories now; ring_models and _support are included when running the scripts.
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/cassini_iss
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/cassini_uvis_solarocc_beckerjarmak2023
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/cassini_vims
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/checksums-diagrams/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/checksums-metadata/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/cassini_iss
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/cassini_uvis_solarocc_beckerjarmak2023
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/cassini_vims
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-diagrams/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-metadata/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/cassini_iss
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/cassini_uvis_solarocc_beckerjarmak2023
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/cassini_vims
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/uranus_occs_earthbased
  • In pds4file, we can instantiate with use_shelves_only set to True now.
(venv) yu-jenchang create_shelves_files_for_pds4 rms-pdsfile $ ipython
Python 3.9.6 (default, Feb  3 2024, 15:58:27)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.18.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import pdsfile
   ...: pdsfile.pds4file.Pds4File.use_shelves_only(True)
   ...: pdsfile.pds4file.Pds4File.preload('/Users/yu-jenchang/Dropbox (SETI Institute)/Pds4FileTest/pds4-holdings')
In [2]: b = pdsfile.pds4file.Pds4File.from_abspath('/Users/yu-jenchang/Dropbox (SETI Institute)/Pds4FileTest/pds4-holdings/bundles/uranus_occs_earthbased/uranus_occ_u2_teide_155cm/data/rings/u2_teide_155cm_880nm_radius_delta_ingress_100m.tab')
In [3]: import os
In [4]: os.path.exists(b.abspath)
Out[4]: False
  • opus_products output is in alphabetical order now (work with 1384 sort filenames on details tab rms-opus#1396)
    • for the same opus type (header), combining different lists of the same version to one sublist
    • sorting each sublist by filepath (alphabetical order)
    • sorting the list of sublists by version (in the order of decreasing version)
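
The three sorting steps above can be sketched as follows. This is a simplified illustration, not the code in pdsfile.py: it assumes each product is reduced to a hypothetical `(version_rank, filepath)` pair for a single opus type, where a larger `version_rank` means a newer version.

```python
from collections import defaultdict


def group_and_sort(products):
    """Group (version_rank, filepath) pairs for one opus type.

    Returns a list of sublists: one sublist per version, newest version first,
    with each sublist sorted alphabetically by filepath.
    """
    by_version = defaultdict(list)
    for version_rank, filepath in products:
        # Combine every file of the same version into one sublist.
        by_version[version_rank].append(filepath)
    # Decreasing version order for the outer list, alphabetical order within.
    return [sorted(by_version[v]) for v in sorted(by_version, reverse=True)]
```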

@juzen2003 juzen2003 marked this pull request as draft September 23, 2024 22:53
@rfrenchseti
Collaborator

The --volume argument really is supposed to be just volume so that you put the positional argument on the command line without specifying any flag in front of it.
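
For illustration, the difference between a flagged and a positional argument in argparse looks like this. The parser below is a hypothetical stand-in, not the actual parser in the maintenance scripts; only the `--init` flag and the idea of a positional `volume` come from this thread, and `nargs="+"` is an assumption.

```python
import argparse

parser = argparse.ArgumentParser(description="Hypothetical maintenance tool")
parser.add_argument("--init", action="store_true", help="Create new files")
# Declared without leading dashes, so it is positional: the path is typed
# directly on the command line with no flag in front of it.
parser.add_argument("volume", nargs="+", help="Path(s) to process")

# Equivalent to: tool.py --init /path/to/bundles/uranus_occs_earthbased
args = parser.parse_args(["--init", "/path/to/bundles/uranus_occs_earthbased"])
```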

when trying to parse each entry to get the basename of a file in the
archive.
@juzen2003
Collaborator Author

Status update on 10/22/24 (top comments are also updated):

  • Update maintenance tools under holdings_maintenance/pds4 to generate checksums, infoshelf, and linkshelf for PDS4 bundles
  • These newly generated shelf files are uploaded to Dropbox
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/cassini_iss
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/cassini_uvis_solarocc_beckerjarmak2023
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/cassini_vims
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/checksums-diagrams/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/checksums-metadata/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/cassini_iss
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/cassini_uvis_solarocc_beckerjarmak2023
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/cassini_vims
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-diagrams/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-metadata/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/cassini_iss
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/cassini_uvis_solarocc_beckerjarmak2023
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/cassini_vims
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/uranus_occs_earthbased

- volset_abspath to bundleset_abspath
- volset_pdsfile to bundleset_pdsfile
- volume_abspath to bundle_abspath
- volume_pdsfile to bundle_pdsfile
- voltype_ to bundletype_
- volume_publication_date to bundle_publication_date
- volume_version_id to bundle_version_id
names and directories included in one archive by doing these:
- Add rules ARCHIVE_PATHS & ARCHIVE_DIRS in pds4 to specify the mapping
  for a bundle set path to its corresponding archive files, and the
  mapping for an archive file to its included directories. (line 602-614
  in rules/__init__.py)
- Add bundle set specific archive_paths & archive_dirs rules for
  uranus occs (line 430-447 in rules/uranus_occs_earthbased.py)
- Add pds4 specific functions: (line 189-218 in pds4file/__init__.py)
    - archive_paths: return the absolute path to the archive file
      associated with this bundleset.
    - archive_dirs: Return a dictionary that is keyed by an archive
      path and the list of directories included in that archive path as
      the value.
- Modify write_archive to get tarpath and its included dirs from
  archive_paths & archive_dirs (line 181-203 in pds4/pds4archives.py)
functions in pds4 to use glob_glob to properly get the abspath for
included archive dirs.
cassini_iss_saturn. (line 292-376 in rules/cassini_iss.py)
@juzen2003
Collaborator Author

Status update on 11/04/24 (top comments are updated):

  • ARCHIVE_PATHS and ARCHIVE_DIRS rules are added to determine the archive file names and the included directories for each archive file. (each bundle set has its own rules)
    • ARCHIVE_PATHS: map a bundle set or a bundle to a list of logical paths of the archive file names.
    • ARCHIVE_DIRS: map a logical path of an archive file name to a list of logical paths of the included directories.
  • Archive files for cassini_iss_cruise and uranus_occs_earthbased are uploaded to Dropbox Pds4FileTest/archive-bundles for review
  • Uranus occs has one archive file uranus_occs_earthbased.tar.gz with bundle set as the file name, and all bundles are included in the archive file.
    • This archive bundle file is given in the full holdings disk, so instead of having multiple archive files for each bundle inside the bundle set, we only have one archive file. The rules are added correspondingly.
    • For reference here is the original size of each directory and the size of each .tar.gz
yu-jenchang  uranus_occs_earthbased $ du -sch */               
1.8M	checksums_uranus_occs_earthbased/
580M	superseded/
 13M	uranus_occ_support/
388M	uranus_occ_u0201_palomar_508cm/
524M	uranus_occ_u0_kao_91cm/
475M	uranus_occ_u102a_irtf_320cm/
463M	uranus_occ_u102b_irtf_320cm/
374M	uranus_occ_u103_eso_220cm/
307M	uranus_occ_u103_palomar_508cm/
191M	uranus_occ_u1052_irtf_320cm/
132M	uranus_occ_u11_ctio_400cm/
429M	uranus_occ_u12_ctio_400cm/
221M	uranus_occ_u12_eso_360cm/
274M	uranus_occ_u12_lco_250cm/
577M	uranus_occ_u134_saao_188cm/
 97M	uranus_occ_u137_hst_fos/
606M	uranus_occ_u137_irtf_320cm/
102M	uranus_occ_u138_hst_fos/
302M	uranus_occ_u138_palomar_508cm/
206M	uranus_occ_u13_sso_390cm/
 92M	uranus_occ_u144_caha_123cm/
206M	uranus_occ_u144_saao_188cm/
124M	uranus_occ_u149_irtf_320cm/
195M	uranus_occ_u149_lowell_180cm/
509M	uranus_occ_u14_ctio_150cm/
532M	uranus_occ_u14_ctio_400cm/
139M	uranus_occ_u14_eso_104cm/
284M	uranus_occ_u14_lco_100cm/
365M	uranus_occ_u14_lco_250cm/
 43M	uranus_occ_u14_opmt_106cm/
233M	uranus_occ_u14_opmt_200cm/
344M	uranus_occ_u14_teide_155cm/
267M	uranus_occ_u15_mso_190cm/
771M	uranus_occ_u16_palomar_508cm/
263M	uranus_occ_u17b_saao_188cm/
1.2G	uranus_occ_u23_ctio_400cm/
112M	uranus_occ_u23_mcdonald_270cm/
109M	uranus_occ_u23_teide_155cm/
175M	uranus_occ_u25_ctio_400cm/
220M	uranus_occ_u25_mcdonald_270cm/
1.1G	uranus_occ_u25_palomar_508cm/
1.3G	uranus_occ_u28_irtf_320cm/
 31M	uranus_occ_u2_teide_155cm/
1.9G	uranus_occ_u34_irtf_320cm/
1.1G	uranus_occ_u36_ctio_400cm/
5.4G	uranus_occ_u36_irtf_320cm/
1.4G	uranus_occ_u36_maunakea_380cm/
 70M	uranus_occ_u36_sso_230cm/
307M	uranus_occ_u36_sso_390cm/
187M	uranus_occ_u5_lco_250cm/
1.3G	uranus_occ_u65_irtf_320cm/
2.8G	uranus_occ_u83_irtf_320cm/
2.6G	uranus_occ_u84_irtf_320cm/
487M	uranus_occ_u9539_ctio_400cm/
134M	uranus_occ_u9_lco_250cm/
 32G	total
yu-jenchang  uranus_occs_earthbased $ ls -hl | cut -d' ' -f6-  

  6.0G Nov  4 16:12 uranus_occs_earthbased.tar.gz
  • Cassini iss cruise has:
    • bundle.xml and non data_raw and browse_raw files in one archive (bundle_xml_non_data_browse_collections.tar.gz)
    • All files in browse raw collection in one archive (browse_raw.tar.gz)
    • All data directories of different sclk in data_raw collection have their own archive files with the same file name as the sclk directories (data_raw_.*.tar.gz)
    • For reference here is the original size of each directory and the size of each .tar.gz
yu-jenchang  cassini_iss_cruise $ du -sch */                       
7.8G	browse_raw/
 64K	context/
 37G	data_raw/
7.2M	document/
 64K	xml_schema/
 45G	total
yu-jenchang  cassini_iss_cruise $ ls -hl | cut -d' ' -f6-

  6.1G Nov  2 13:32 browse_raw.tar.gz
  6.2M Nov  2 13:26 bundle_xml_non_data_browse_collections.tar.gz
   33M Nov  2 13:36 data_raw_129xxxxxxx.tar.gz
  4.8M Nov  2 13:37 data_raw_130xxxxxxx.tar.gz
   14M Nov  2 13:38 data_raw_131xxxxxxx.tar.gz
   49M Nov  2 13:45 data_raw_132xxxxxxx.tar.gz
  141M Nov  2 13:59 data_raw_133xxxxxxx.tar.gz
  167M Nov  2 14:19 data_raw_134xxxxxxx.tar.gz
  3.7G Nov  2 18:13 data_raw_135xxxxxxx.tar.gz
  747M Nov  2 19:32 data_raw_136xxxxxxx.tar.gz
  147M Nov  2 19:52 data_raw_137xxxxxxx.tar.gz
   35M Nov  2 19:55 data_raw_138xxxxxxx.tar.gz
   32M Nov  2 19:57 data_raw_139xxxxxxx.tar.gz
  103M Nov  2 20:02 data_raw_140xxxxxxx.tar.gz
  189M Nov  2 20:26 data_raw_141xxxxxxx.tar.gz
   42M Nov  2 20:32 data_raw_142xxxxxxx.tar.gz
  161M Nov  2 20:50 data_raw_143xxxxxxx.tar.gz
  480M Nov  2 21:48 data_raw_144xxxxxxx.tar.gz
   13M Nov  2 21:50 data_raw_145xxxxxxx.tar.gz
   20M Nov  2 21:50 data_raw_col_xml_csv_metadata.tar.gz
yu-jenchang  cassini_iss_cruise $
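
To compare the archive sizes in a listing like the one above against the original directory sizes, the human-readable values can be summed with a small helper. This is a minimal sketch: the function names are hypothetical, and it assumes the K/M/G suffixes are powers of 1024, as is usual for `ls -h` and `du -h`.

```python
# Size suffix -> multiplier to MiB (assumes binary units, as ls -h / du -h use).
UNITS = {"K": 1 / 1024, "M": 1.0, "G": 1024.0}


def size_to_mib(text):
    """Convert a size string such as '6.1G' or '33M' to MiB."""
    text = text.strip()
    return float(text[:-1]) * UNITS[text[-1].upper()]


def total_gib(sizes):
    """Sum a list of human-readable sizes and return the total in GiB."""
    return sum(size_to_mib(s) for s in sizes) / 1024
```

Applied to the twenty .tar.gz sizes listed above for cassini_iss_cruise, this comes to roughly 12 GiB, against 45G for the original directories. The listed sizes are rounded, so the total is approximate.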

@juzen2003
Collaborator Author

Updates on 11/12/24:

  • A first draft of the archive file info for all current bundle sets is uploaded to Dropbox (SETI Institute)/Pds4FileTest/archives-bundles for review. (Note: the archive files are too large in total, so only the archive files of cassini iss cruise and uranus occs are uploaded. The naming and size of the archive files, along with the original bundle sizes, are recorded in *_info files for every bundle set.)
yu-jenchang  archives-bundles $ ls -l |grep ._info
-rw-r--r--@ 1 yu-jenchang  staff  2025 Nov  8 11:53 cassini_iss_cruise_archive_info
-rw-r--r--@ 1 yu-jenchang  staff  5984 Nov  8 11:54 cassini_iss_satrun_archive_info
-rw-r--r--@ 1 yu-jenchang  staff   994 Nov 12 09:48 cassini_uvis_solarocc_beckerjarmak2023_archive_info
-rw-r--r--@ 1 yu-jenchang  staff   846 Nov 12 09:42 cassini_vims_cruise_archive_info
-rw-r--r--@ 1 yu-jenchang  staff  6013 Nov 12 09:46 cassini_vims_saturn_archive_info
-rw-r--r--@ 1 yu-jenchang  staff  2546 Nov  8 11:55 uranus_occs_earthbased_archive_info
yu-jenchang  archives-bundles $ 
  • The latest archive_paths and archive_dirs rules are updated in the pull.

@matthewtiscareno
Collaborator

Here is what I'm seeing:

  • cassini_uvis_solarocc_beckerjarmak2023 has one archive file. This is fine.
  • uranus_occs_earthbased has one archive file. This is fine.
  • The file cassini_iss_satrun_archive_info contains a typo in the filename (saturn, not satrun)
  • For cassini_iss_cruise, why break down data_raw at second-level directories but not browse_raw? Is this specified manually, or is there a rule? Either way, what is the criterion used?
  • For cassini_iss_cruise, the raw data seems to have a very high compression ratio. Even though the data adds up to 37G, all of the archive files together add up to only 6 TB, about the same as browse_raw.tar.gz. Perhaps we shouldn't break up the data_raw archives for cassini_iss_cruise?
  • For cassini_vims_cruise, the entire bundle is in one archive, only 4 TB in size. This is fine.
  • For cassini_iss_saturn and cassini_vims_saturn, both browse_raw and data_raw are broken down at second-level directories. This is also fine.
  • The file data_raw_col_xml_csv_metadata.tar.gz is not described at the top of the text file for any of the bundles
  • Will _metadata be removed from the filename of data_raw_col_xml_csv_metadata.tar.gz, now that we understand that data_raw/metadata directories are to be eliminated?

@rfrenchseti
Collaborator

  • The file cassini_iss_satrun_archive_info contains a typo in the filename (saturn, not satrun)

This is just a text file Dave created by hand for our review. It's not part of the tool chain.

  • For cassini_iss_cruise, why break down data_raw at second-level directories but not browse_raw? Is this specified manually, or is there a rule? Either way, what is the criterion used?

We discussed this in the group meeting and agreed to break the browse products in the same way as the data products.

  • For cassini_iss_cruise, the raw data seems to have a very high compression ratio. Even though the data adds up to 37G, all of the archive files together add up to only 6 TB, about the same as browse_raw.tar.gz. Perhaps we shouldn't break up the data_raw archives for cassini_iss_cruise?

I'm assuming you mean 6 GB, not 6 TB? If the entire collection compresses to only 6 GB, then I agree one file is a good choice.

  • For cassini_vims_cruise, the entire bundle is in one archive, only 4 TB in size. This is fine.

I'm assuming you mean 4 GB, not 4 TB? If the entire collection compresses to only 4 GB, then I agree one file is a good choice.

  • For cassini_iss_saturn and cassini_vims_saturn, both browse_raw and data_raw are broken down at second-level directories. This is also fine.
  • The file data_raw_col_xml_csv_metadata.tar.gz is not described at the top of the text file for any of the bundles

The _info files are just for our internal review. They are not for public consumption. But perhaps Dave can add the requested details to prevent confusion.

  • Will _metadata be removed from the filename of data_raw_col_xml_csv_metadata.tar.gz, now that we understand that data_raw/metadata directories are to be eliminated?

Seems like a good idea.

@juzen2003
Collaborator Author

  • The file cassini_iss_satrun_archive_info contains a typo in the filename (saturn, not satrun)

This is just a text file Dave created by hand for our review. It's not part of the tool chain.

The *_info files are created manually and record the .tar.gz file info of the different bundles/bundle sets for review. I'll fix the file name typo.

  • For cassini_iss_cruise, why break down data_raw at second-level directories but not browse_raw? Is this specified manually, or is there a rule? Either way, what is the criterion used?

We discussed this in the group meeting and agreed to break the browse products in the same way as the data products.

I'll break up browse_raw the same way as data_raw, and put collection_browse_raw.csv/xml inside every browse_raw .tar.gz file, as discussed in the meeting.

  • For cassini_iss_cruise, the raw data seems to have a very high compression ratio. Even though the data adds up to 37G, all of the archive files together add up to only 6 TB, about the same as browse_raw.tar.gz. Perhaps we shouldn't break up the data_raw archives for cassini_iss_cruise?

I'm assuming you mean 6 GB, not 6 TB? If the entire collection compresses to only 6 GB, then I agree one file is a good choice.

The total size of the archive files for cassini_iss_cruise is around 12-13GB. Do we want just one archive file for cassini_iss_cruise? It would be around 12-13GB.

  • For cassini_vims_cruise, the entire bundle is in one archive, only 4 TB in size. This is fine.

I'm assuming you mean 4 GB, not 4 TB? If the entire collection compresses to only 4 GB, then I agree one file is a good choice.

The entire cassini_vims_cruise only has one archive file, cassini_vims_cruise.tar.gz, which is 1.9GB

  • For cassini_iss_saturn and cassini_vims_saturn, both browse_raw and data_raw are broken down at second-level directories. This is also fine.
  • The file data_raw_col_xml_csv_metadata.tar.gz is not described at the top of the text file for any of the bundles

The _info files are just for our internal review. They are not for public consumption. But perhaps Dave can add the requested details to prevent confusion.

I'll update *_info files later

  • Will _metadata be removed from the filename of data_raw_col_xml_csv_metadata.tar.gz, now that we understand that data_raw/metadata directories are to be eliminated?

Seems like a good idea.

Got it, I'll remove metadata directories, and update the rule by removing metadata from the .tar.gz file name

@matthewtiscareno
Collaborator

  • For cassini_iss_cruise, why break down data_raw at second-level directories but not browse_raw? Is this specified manually, or is there a rule? Either way, what is the criterion used?

We discussed this in the group meeting and agreed to break the browse products in the same way as the data products.

That's fine, but part of what I was trying to figure out at the group meeting is how and where these decisions are encoded. Can you point me to the file(s)?

  • The file data_raw_col_xml_csv_metadata.tar.gz is not described at the top of the text file for any of the bundles

The _info files are just for our internal review. They are not for public consumption. But perhaps Dave can add the requested details to prevent confusion.

I understand that. I just want to have some place where I can be informed about what rules are being set up. If it's not these files, then don't bother changing these files.

Got it, I'll remove metadata directories, and update the rule by removing metadata from the .tar.gz file name

We will be removing metadata directories from these places in the PDS4 holdings, so you should not spend time setting up a special rule to ignore them, right?

@juzen2003
Collaborator Author

  • For cassini_iss_cruise, why break down data_raw at second-level directories but not browse_raw? Is this specified manually, or is there a rule? Either way, what is the criterion used?

We discussed this in the group meeting and agreed to break the browse products in the same way as the data products.

That's fine, but part of what I was trying to figure out at the group meeting is how and where these decisions are encoded. Can you point me to the file(s)?

They are encoded in these two variables under the Archives area of the rules files:
archive_paths: Map a bundle set or a bundle to a list of logical paths of the archive file names.
archive_dirs: Map a logical path of an archive file name to a list of logical paths of the included directories.
They are in these files: (you can click the Files changed tab to see these changes in the pull request as well)

pdsfile/pds4file/rules/cassini_iss.py
pdsfile/pds4file/rules/cassini_vims.py
pdsfile/pds4file/rules/uranus_occs_earthbased.py
pdsfile/pds4file/rules/cassini_uvis_solarocc_beckerjarmak2023.py

For reference, we can also call .archive_paths and .archive_dirs on a pds4file instance to get the corresponding archive info, here is the log from ipython:

In [1]: import pdsfile
   ...: pdsfile.pds4file.Pds4File.use_shelves_only(False)
   ...: pdsfile.pds4file.Pds4File.preload('/Volumes/rms-holdings/pds4-holdings')
   ...: pdsfile.pds4file.Pds4File.preload('/Users/yu-jenchang/Dropbox (SETI Institute)/Shared-OPUS/pdsdata/pds4-holdings')

In [2]: b = pdsfile.pds4file.Pds4File.from_abspath('/Volumes/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased/')

In [3]: b.archive_paths()
Out[3]: ['/Volumes/rms-holdings/pds4-holdings/archives-bundles/uranus_occs_earthbased/uranus_occs_earthbased.tar.gz']

In [4]: b.archive_dirs()
Out[4]: {'/Volumes/rms-holdings/pds4-holdings/archives-bundles/uranus_occs_earthbased/uranus_occs_earthbased.tar.gz': ['/Volumes/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased']}

In [5]: 
  • The file data_raw_col_xml_csv_metadata.tar.gz is not described at the top of the text file for any of the bundles

The _info files are just for our internal review. They are not for public consumption. But perhaps Dave can add the requested details to prevent confusion.

I understand that. I just want to have some place where I can be informed about what rules are being set up. If it's not these files, then don't bother changing these files.

Got it, I'll remove metadata directories, and update the rule by removing metadata from the .tar.gz file name

We will be removing metadata directories from these places in the PDS4 holdings, so you should not spend time setting up a special rule to ignore them, right?

- Split browse_raw into multiple archive files based on sclk for
  cassini_iss_cruise
- Include collection_*_raw.csv/xml in every browse_raw & data_raw
  .tar.gz files based on sclk
- Remove data_raw_col_xml_csv_metadata.tar.gz since metadata directory
  will be removed and all collection_*_raw.csv/xml are included in
  every browse_raw & data_raw archive files based on sclk
@juzen2003
Collaborator Author

Updates:

  • Update *_info files on dropbox
  • Update the archive rules by:
    • Split browse_raw into multiple archive files based on sclk for cassini_iss_cruise.
    • Include collection_*_raw.csv/xml in every browse_raw & data_raw .tar.gz files based on sclk.
    • Remove data_raw_col_xml_csv_metadata.tar.gz since the metadata directory will be removed and collection_*_raw.csv/xml are included in every browse_raw & data_raw archive files based on sclk.
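
The per-sclk layout described in these updates can be sketched with the standard tarfile module: one .tar.gz per sclk subdirectory, with the collection-level .csv/.xml files added to each one. This is an illustration under assumptions, not the write_archive code in pds4archives.py; the function name, the output naming scheme, and the member paths inside the tarball are hypothetical.

```python
import tarfile
from pathlib import Path


def write_sclk_archives(collection_dir, out_dir):
    """Write one .tar.gz per subdirectory of a collection (e.g. data_raw/129xxxxxxx),
    adding the collection-level .csv/.xml files to every archive."""
    collection_dir, out_dir = Path(collection_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Collection inventory and label, e.g. collection_data_raw.csv/.xml.
    shared = sorted(p for p in collection_dir.iterdir()
                    if p.is_file() and p.suffix in (".csv", ".xml"))
    written = []
    for sub in sorted(p for p in collection_dir.iterdir() if p.is_dir()):
        tar_path = out_dir / f"{collection_dir.name}_{sub.name}.tar.gz"
        with tarfile.open(tar_path, "w:gz") as tar:
            for f in shared:
                tar.add(f, arcname=f"{collection_dir.name}/{f.name}")
            # Adds the sclk directory and everything beneath it.
            tar.add(sub, arcname=f"{collection_dir.name}/{sub.name}")
        written.append(tar_path)
    return written
```

Because the collection files are repeated in every archive, each tarball can be unpacked on its own and still contain the collection's label and inventory.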

opus_products function (line 4792-4795, pdsfile.py)
- Sort the list of sublists by version and filepath (in the order of
  decreasing version, or reversed alphabetical order if version is the
  same)
- Sort the sublist by filepath (alphabetical order)
0 when there is 'REDO', 'TIRETRACK', or 'REPAIRED' substring in one of
the path in a sublist so that prioritizer can be properly sorted.
@matthewtiscareno
Collaborator

That's fine, but part of what I was trying to figure out at the group meeting is how and where these decisions are encoded. Can you point me to the file(s)?

They are encoded in these two variables under the Archives area of the rules files:

Okay, thanks for that. I see that everything seems to be hardcoded in an ad hoc manner. Maybe that's what we want to do. A more systematic approach could conceivably be attractive but might be more trouble than it's worth.

- for the same opus type (header), combining different lists of the
  same version to one sublist
- sorting each sublist by filepath (alphabetical order)
- sorting the list of sublists by version (in the order of decreasing
  version)
@rfrenchseti rfrenchseti marked this pull request as ready for review December 12, 2024 22:37
@rfrenchseti rfrenchseti self-requested a review December 12, 2024 22:37
@rfrenchseti rfrenchseti marked this pull request as draft December 12, 2024 22:39