Create shelves files for pds4 #57

juzen2003 · 2024-09-23T22:53:23Z

Current status of creating shelves files for pds4:

Create files in checksums-* directory (pds4checksums.py)
- Modification made:
  - update BUNDLENAME_REGEX
- Example command:
  - python holdings_maintenance/pds4/pds4checksums.py --init /Volumes/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased
  - python holdings_maintenance/pds4/pds4checksums.py --init /Volumes/rms-holdings/pds4-holdings/metadata/uranus_occs_earthbased
  - python holdings_maintenance/pds4/pds4checksums.py --init /Volumes/rms-holdings/pds4-holdings/diagrams/uranus_occs_earthbased
Create files in _infoshelf-* directory (pds4infoshelf.py); the corresponding checksums files from the step above are required
- Modification made:
  - properly import pds4checksums
- Example command:
  - python holdings_maintenance/pds4/pds4infoshelf.py --init /Volumes/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased
  - python holdings_maintenance/pds4/pds4infoshelf.py --init /Volumes/rms-holdings/pds4-holdings/metadata/uranus_occs_earthbased
  - python holdings_maintenance/pds4/pds4infoshelf.py --init /Volumes/rms-holdings/pds4-holdings/diagrams/uranus_occs_earthbased
Create files in _indexshelf-metadata (pds4indexshelf.py)
- Modification made:
  - Move BUNDLENAME_REGEX into the Pds3File & Pds4File classes, since the pattern differs between pds3 & pds4
  - Add IDX_EXT and LBL_EXT to Pds3File & Pds4File to replace '.tab' & '.lbl' in pdsfile.py
    - pds4 label extension is .xml and idx extension is .csv
    - pds3 label extension is .lbl and idx extension is .tab
- Pending items:
  - Wait for label files (.xml) in metadata
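
As a rough illustration of the IDX_EXT / LBL_EXT change, the sketch below shows per-class extension constants standing in for hard-coded '.tab' and '.lbl' strings. Only the constant names and the extension values come from the notes above; the class bodies, the `index_label_name` helper, and the file names are hypothetical and are not the actual pdsfile code.

```python
import os


class Pds3File:
    """Stand-in for the real class; only the two constants matter here."""
    IDX_EXT = '.tab'   # pds3 index extension
    LBL_EXT = '.lbl'   # pds3 label extension


class Pds4File:
    """Stand-in for the real class; only the two constants matter here."""
    IDX_EXT = '.csv'   # pds4 index extension
    LBL_EXT = '.xml'   # pds4 label extension


def index_label_name(cls, index_path):
    """Hypothetical helper: return the label path that pairs with an index file.

    Raises ValueError if the file does not carry the class's index extension.
    """
    root, ext = os.path.splitext(index_path)
    if ext.lower() != cls.IDX_EXT:
        raise ValueError(f'{index_path} is not a {cls.IDX_EXT} index file')
    return root + cls.LBL_EXT


# Made-up file names, used only to exercise the helper.
print(index_label_name(Pds3File, 'example_index.tab'))
print(index_label_name(Pds4File, 'example_index.csv'))
```

Shared code that reads `cls.IDX_EXT` and `cls.LBL_EXT` can then serve both pds3 and pds4 without branching on the file type.
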
Create files in _linkshelf-* directory (pds4linkshelf.py)
- Modification made:
  - remove .TXT in EXTS_WO_LABELS, .TXT could have a label in pds4
  - Add logic to link a file to its corresponding label if the file is listed in that label's file_name tags.
  - Add logic to identify files such as errata.txt or checksum files that appear in neither a label nor the csv. They are not part of the archive, so they don't have labels.
- Example command:
  - python holdings_maintenance/pds4/pds4linkshelf.py --init /Volumes/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased
- Pending items:
  - To create linkshelf-metadata, wait for label files (.xml) in metadata
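
The label-linking idea above can be sketched as follows. This is a minimal illustration, not the pds4linkshelf.py implementation: the sample label, the function names, and the file names are made up, and the csv check mentioned above is left out. It assumes only that a PDS4 label lists its files in `file_name` elements.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily trimmed label used only for this example.
SAMPLE_LABEL = """
<Product_Observational xmlns="http://pds.nasa.gov/pds4/pds/v1">
  <File_Area_Observational>
    <File>
      <file_name>example_data.tab</file_name>
    </File>
  </File_Area_Observational>
</Product_Observational>
"""


def label_file_names(label_xml):
    """Return the set of basenames found in <file_name> elements."""
    root = ET.fromstring(label_xml.strip())
    names = set()
    for elem in root.iter():
        # Tags look like '{namespace}file_name'; compare the local part only.
        local = elem.tag.rsplit('}', 1)[-1]
        if local == 'file_name' and elem.text:
            names.add(elem.text.strip())
    return names


def link_files_to_labels(basenames, labels):
    """Map each basename to the label that lists it, or None if no label does.

    `labels` maps a label basename to its XML text.
    """
    owner = {}
    for label_name, xml_text in labels.items():
        for name in label_file_names(xml_text):
            owner.setdefault(name, label_name)
    return {name: owner.get(name) for name in basenames}


links = link_files_to_labels(['example_data.tab', 'errata.txt'],
                             {'example_data.xml': SAMPLE_LABEL})
print(links)
```

A file that maps to None here corresponds to the errata.txt case: nothing references it, so it is treated as having no label.
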
Create archive files (pds4archives.py)
- Modification made:
  - Add ARCHIVE_PATHS and ARCHIVE_DIRS rules to determine the archive file names and the included directories for each archive file. (each bundle set has its own rules)
    - ARCHIVE_PATHS: map a bundle set or a bundle to a list of logical paths of the archive file names.
    - ARCHIVE_DIRS: map a logical path of an archive file name to a list of logical paths of the included directories.
  - Example command:
    - python holdings_maintenance/pds4/pds4archives.py --init /Volumes/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased
    - python holdings_maintenance/pds4/pds4archives.py --init /Volumes/rms-holdings/pds4-holdings/bundles/cassini_iss/cassini_iss_cruise
  - Pending items:
    - Current archive files or info are uploaded to Dropbox Pds4FileTest/archive-bundles for review
  - Note:
    - Uranus occs has one archive file uranus_occs_earthbased.tar.gz with bundle set as the file name, and all bundles are included in the archive file.
    - Cassini iss cruise has:
      - bundle.xml and non data_raw and browse_raw files in one archive(bundle_xml_non_data_browse_collections.tar.gz)
      - All browse directories of different sclk in browse_raw collection have their own archive files with the same file name as the sclk directories (browse_raw_.*.tar.gz)
        
        collection_browse_raw.csv/xml exist in every .tar.gz file
      - All data directories of different sclk in data_raw collection have their own archive files with the same file name as the sclk directories (data_raw_.*.tar.gz)
        
        collection_data_raw.csv/xml exist in every .tar.gz file
    - Cassini iss saturn has:
      - bundle.xml and non data_raw and browse_raw files in one archive(bundle_xml_non_data_browse_collections.tar.gz)
      - All browse directories of different sclk in browse_raw collection have their own archive files with the same file name as the sclk directories (browse_raw_.*.tar.gz)
        
        collection_browse_raw.csv/xml exist in every .tar.gz file
      - All data directories of different sclk in data_raw collection have their own archive files with the same file name as the sclk directories (data_raw_.*.tar.gz)
        
        collection_data_raw.csv/xml exist in every .tar.gz file
    - Cassini vims cruise has one archive file cassini_vims_cruise.tar.gz, and all files are included in the archive file.
    - Cassini vims saturn has:
      - bundle.xml and non data_raw and browse_raw files in one archive(bundle_xml_non_data_browse_collections.tar.gz)
      - All browse directories of different sclk in browse_raw collection have their own archive files with the same file name as the sclk directories (browse_raw_.*.tar.gz)
        
        collection_browse_raw.csv/xml exist in every .tar.gz file
      - All data directories of different sclk in data_raw collection have their own archive files with the same file name as the sclk directories (data_raw_.*.tar.gz)
        
        collection_data_raw.csv/xml exist in every .tar.gz file
    - Cassini uvis solarocc beckerjarmak2023 has one archive file cassini_uvis_solarocc_beckerjarmak2023.tar.gz, and all files are included in the archive file.
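
To make the two rule tables concrete, here is a small sketch of how an ARCHIVE_PATHS / ARCHIVE_DIRS pair could behave for the single-archive uranus occs case. It is an assumption-laden illustration: the list-of-regex layout, the `translate` and `archive_dirs_for` helpers, and the exact logical paths are hypothetical and may not match how the rules are written in pds4file/rules/.

```python
import re

# Hypothetical rule tables; the patterns and paths are illustrative only.
ARCHIVE_PATHS = [
    # bundle set logical path -> logical paths of its archive files
    (re.compile(r'bundles/(uranus_occs_earthbased)$'),
     [r'archives-bundles/\1/\1.tar.gz']),
]

ARCHIVE_DIRS = [
    # archive file logical path -> logical paths of the included directories
    (re.compile(r'archives-bundles/(uranus_occs_earthbased)/\1\.tar\.gz$'),
     [r'bundles/\1']),
]


def translate(rules, logical_path):
    """Return the expansions of the first rule whose pattern matches."""
    for pattern, templates in rules:
        match = pattern.search(logical_path)
        if match:
            return [match.expand(template) for template in templates]
    return []


def archive_dirs_for(logical_path):
    """Map each archive file of a bundle set to the directories it includes."""
    return {tar: translate(ARCHIVE_DIRS, tar)
            for tar in translate(ARCHIVE_PATHS, logical_path)}


print(archive_dirs_for('bundles/uranus_occs_earthbased'))
```

A bundle set that is split into several archives would list several templates in its ARCHIVE_PATHS entry, with one ARCHIVE_DIRS entry per archive file.
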

Pending items:

Check rms-pdstable (pds3, pdstable) repo, create a pds4 version of it to read the pds4 table.
Work on rules for pds4 archive (first draft of archive file & rules are ready for review)
- .tar.gz files are uploaded to Dropbox Pds4FileTest/archive-bundles for review
- Rules related to archive files are under pds4file/rules/
Add a rule to map a file path to its corresponding .tar.gz file for viewmaster & validation
Once metadata labels are added, update pds4indexshelf.py to create _indexshelf-metadata for pds4
Once metadata labels are added, update pds4linkshelf.py to create _linkshelf-metadata for pds4

Note:

We no longer bypass any directories; ring_models and _support are included when running the scripts.
The following directories were generated using the full holdings and uploaded to Dropbox:

Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/cassini_iss
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/cassini_uvis_solarocc_beckerjarmak2023
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/cassini_vims
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/checksums-diagrams/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/checksums-metadata/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/cassini_iss
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/cassini_uvis_solarocc_beckerjarmak2023
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/cassini_vims
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-diagrams/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-metadata/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/cassini_iss
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/cassini_uvis_solarocc_beckerjarmak2023
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/cassini_vims
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/uranus_occs_earthbased

In pds4file, we can instantiate with use_shelves_only set to True now.

(venv) yu-jenchang create_shelves_files_for_pds4 rms-pdsfile $ ipython
Python 3.9.6 (default, Feb  3 2024, 15:58:27)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.18.1 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import pdsfile
   ...: pdsfile.pds4file.Pds4File.use_shelves_only(True)
   ...: pdsfile.pds4file.Pds4File.preload('/Users/yu-jenchang/Dropbox (SETI Institute)/Pds4FileTest/pds4
   ...: -holdings')
In [2]: b = pdsfile.pds4file.Pds4File.from_abspath('/Users/yu-jenchang/Dropbox (SETI Institute)/Pds4File
   ...: Test/pds4-holdings/bundles/uranus_occs_earthbased/uranus_occ_u2_teide_155cm/data/rings/u2_teide_
   ...: 155cm_880nm_radius_delta_ingress_100m.tab')
In [3]: import os
In [4]: os.path.exists(b.abspath)
Out[4]: False

opus_products output is in alphabetical order now (work with 1384 sort filenames on details tab rms-opus#1396)
- for the same opus type (header), combining different lists of the same version to one sublist
- sorting each sublist by filepath (alphabetical order)
- sorting the list of sublists by version (in the order of decreasing version)
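
The three sorting steps above can be sketched as below. This is a simplified stand-in rather than the opus_products code: it assumes each sublist is represented as a `(version, paths)` pair with a larger integer meaning a newer version, which is a hypothetical simplification of whatever structure opus_products actually returns for one opus type.

```python
from collections import defaultdict


def sort_sublists(sublists):
    """Merge sublists that share a version, then sort.

    `sublists` is a list of (version, paths) pairs for one opus type.
    Returns a list of path lists, newest version first, with each list
    in alphabetical order.
    """
    merged = defaultdict(list)
    for version, paths in sublists:
        # Step 1: combine different lists of the same version.
        merged[version].extend(paths)
    # Step 2: sort each sublist by filepath.
    # Step 3: order the sublists by decreasing version.
    return [sorted(merged[version]) for version in sorted(merged, reverse=True)]


# Made-up file names and version numbers.
products = [
    (2, ['b.img', 'a.img']),
    (1, ['old_b.img']),
    (2, ['c.img']),
    (1, ['old_a.img']),
]
print(sort_sublists(products))
```
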


rfrenchseti · 2024-09-23T22:57:25Z

The --volume argument really is supposed to be just volume so that you put the positional argument on the command line without specifying any flag in front of it.


juzen2003 · 2024-10-21T19:39:45Z

Latest status update (the top comment is also updated, 10/22/24):

Update maintenance tools under holdings_maintenance/pds4 to generate checksums, infoshelf, and linkshelf files for PDS4 bundles.
These newly generated shelf files are uploaded to Dropbox:

Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/cassini_iss
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/cassini_uvis_solarocc_beckerjarmak2023
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/cassini_vims
Shared-OPUS/pdsdata/pds4-holdings/checksums-bundles/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/checksums-diagrams/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/checksums-metadata/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/cassini_iss
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/cassini_uvis_solarocc_beckerjarmak2023
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/cassini_vims
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-bundles/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-diagrams/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_infoshelf-metadata/uranus_occs_earthbased
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/cassini_iss
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/cassini_uvis_solarocc_beckerjarmak2023
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/cassini_vims
Shared-OPUS/pdsdata/pds4-holdings/_linkshelf-bundles/uranus_occs_earthbased

pds4checksums.py

- volset_abspath to bundleset_abspath
- volset_pdsfile to bundleset_pdsfile
- volume_abspath to bundle_abspath
- volume_pdsfile to bundle_pdsfile
- voltype_ to bundletype_
- volume_publication_date to bundle_publication_date
- volume_version_id to bundle_version_id

names and directories included in one archive by doing these:
- Add rules ARCHIVE_PATHS & ARCHIVE_DIRS in pds4 to specify the mapping from a bundle set path to its corresponding archive files, and the mapping from an archive file to its included directories. (line 602-614 in rules/__init__.py)
- Add bundle set specific archive_paths & archive_dirs rules for uranus occs (line 430-447 in rules/uranus_occs_earthbased.py)
- Add pds4 specific functions: (line 189-218 in pds4file/__init__.py)
  - archive_paths: return the absolute path to the archive file associated with this bundleset.
  - archive_dirs: return a dictionary keyed by archive path, with the list of directories included in that archive as the value.
- Modify write_archive to get tarpath and its included dirs from archive_paths & archive_dirs (line 181-203 in pds4/pds4archives.py)


juzen2003 · 2024-11-04T18:19:50Z

Status update on 11/04/24 (top comments are updated):

ARCHIVE_PATHS and ARCHIVE_DIRS rules are added to determine the archive file names and the included directories for each archive file. (each bundle set has its own rules)
- ARCHIVE_PATHS: map a bundle set or a bundle to a list of logical paths of the archive file names.
- ARCHIVE_DIRS: map a logical path of an archive file name to a list of logical paths of the included directories.
Archive files for cassini_iss_cruise and uranus_occs_earthbased are uploaded to Dropbox Pds4FileTest/archive-bundles for review
Uranus occs has one archive file uranus_occs_earthbased.tar.gz with bundle set as the file name, and all bundles are included in the archive file.
- This archive bundle file is given in the full holdings disk, so instead of having multiple archive files for each bundle inside the bundle set, we only have one archive file. The rules are added correspondingly.
- For reference here is the original size of each directory and the size of each .tar.gz

yu-jenchang  uranus_occs_earthbased $ du -sch */               
1.8M	checksums_uranus_occs_earthbased/
580M	superseded/
 13M	uranus_occ_support/
388M	uranus_occ_u0201_palomar_508cm/
524M	uranus_occ_u0_kao_91cm/
475M	uranus_occ_u102a_irtf_320cm/
463M	uranus_occ_u102b_irtf_320cm/
374M	uranus_occ_u103_eso_220cm/
307M	uranus_occ_u103_palomar_508cm/
191M	uranus_occ_u1052_irtf_320cm/
132M	uranus_occ_u11_ctio_400cm/
429M	uranus_occ_u12_ctio_400cm/
221M	uranus_occ_u12_eso_360cm/
274M	uranus_occ_u12_lco_250cm/
577M	uranus_occ_u134_saao_188cm/
 97M	uranus_occ_u137_hst_fos/
606M	uranus_occ_u137_irtf_320cm/
102M	uranus_occ_u138_hst_fos/
302M	uranus_occ_u138_palomar_508cm/
206M	uranus_occ_u13_sso_390cm/
 92M	uranus_occ_u144_caha_123cm/
206M	uranus_occ_u144_saao_188cm/
124M	uranus_occ_u149_irtf_320cm/
195M	uranus_occ_u149_lowell_180cm/
509M	uranus_occ_u14_ctio_150cm/
532M	uranus_occ_u14_ctio_400cm/
139M	uranus_occ_u14_eso_104cm/
284M	uranus_occ_u14_lco_100cm/
365M	uranus_occ_u14_lco_250cm/
 43M	uranus_occ_u14_opmt_106cm/
233M	uranus_occ_u14_opmt_200cm/
344M	uranus_occ_u14_teide_155cm/
267M	uranus_occ_u15_mso_190cm/
771M	uranus_occ_u16_palomar_508cm/
263M	uranus_occ_u17b_saao_188cm/
1.2G	uranus_occ_u23_ctio_400cm/
112M	uranus_occ_u23_mcdonald_270cm/
109M	uranus_occ_u23_teide_155cm/
175M	uranus_occ_u25_ctio_400cm/
220M	uranus_occ_u25_mcdonald_270cm/
1.1G	uranus_occ_u25_palomar_508cm/
1.3G	uranus_occ_u28_irtf_320cm/
 31M	uranus_occ_u2_teide_155cm/
1.9G	uranus_occ_u34_irtf_320cm/
1.1G	uranus_occ_u36_ctio_400cm/
5.4G	uranus_occ_u36_irtf_320cm/
1.4G	uranus_occ_u36_maunakea_380cm/
 70M	uranus_occ_u36_sso_230cm/
307M	uranus_occ_u36_sso_390cm/
187M	uranus_occ_u5_lco_250cm/
1.3G	uranus_occ_u65_irtf_320cm/
2.8G	uranus_occ_u83_irtf_320cm/
2.6G	uranus_occ_u84_irtf_320cm/
487M	uranus_occ_u9539_ctio_400cm/
134M	uranus_occ_u9_lco_250cm/
 32G	total

yu-jenchang  uranus_occs_earthbased $ ls -hl | cut -d' ' -f6-  

  6.0G Nov  4 16:12 uranus_occs_earthbased.tar.gz

Cassini iss cruise has:
- bundle.xml and non data_raw and browse_raw files in one archive (bundle_xml_non_data_browse_collections.tar.gz)
- All files in browse raw collection in one archive (browse_raw.tar.gz)
- All data directories of different sclk in data_raw collection have their own archive files with the same file name as the sclk directories (data_raw_.*.tar.gz)
- For reference here is the original size of each directory and the size of each .tar.gz

yu-jenchang  cassini_iss_cruise $ du -sch */                       
7.8G	browse_raw/
 64K	context/
 37G	data_raw/
7.2M	document/
 64K	xml_schema/
 45G	total

yu-jenchang  cassini_iss_cruise $ ls -hl | cut -d' ' -f6-

  6.1G Nov  2 13:32 browse_raw.tar.gz
  6.2M Nov  2 13:26 bundle_xml_non_data_browse_collections.tar.gz
   33M Nov  2 13:36 data_raw_129xxxxxxx.tar.gz
  4.8M Nov  2 13:37 data_raw_130xxxxxxx.tar.gz
   14M Nov  2 13:38 data_raw_131xxxxxxx.tar.gz
   49M Nov  2 13:45 data_raw_132xxxxxxx.tar.gz
  141M Nov  2 13:59 data_raw_133xxxxxxx.tar.gz
  167M Nov  2 14:19 data_raw_134xxxxxxx.tar.gz
  3.7G Nov  2 18:13 data_raw_135xxxxxxx.tar.gz
  747M Nov  2 19:32 data_raw_136xxxxxxx.tar.gz
  147M Nov  2 19:52 data_raw_137xxxxxxx.tar.gz
   35M Nov  2 19:55 data_raw_138xxxxxxx.tar.gz
   32M Nov  2 19:57 data_raw_139xxxxxxx.tar.gz
  103M Nov  2 20:02 data_raw_140xxxxxxx.tar.gz
  189M Nov  2 20:26 data_raw_141xxxxxxx.tar.gz
   42M Nov  2 20:32 data_raw_142xxxxxxx.tar.gz
  161M Nov  2 20:50 data_raw_143xxxxxxx.tar.gz
  480M Nov  2 21:48 data_raw_144xxxxxxx.tar.gz
   13M Nov  2 21:50 data_raw_145xxxxxxx.tar.gz
   20M Nov  2 21:50 data_raw_col_xml_csv_metadata.tar.gz
yu-jenchang  cassini_iss_cruise $
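
For a quick comparison against the 45G of original data, the sizes in the listing above can be totaled with a short script. This is only a convenience sketch: the sizes are copied from the `ls -hl` output, which is already rounded, and the K/M/G suffixes are assumed to be powers of 1024, so the totals are approximate.

```python
import re

# Sizes copied from the `ls -hl` listing above (archive name -> rounded size).
ARCHIVES = {
    'browse_raw.tar.gz': '6.1G',
    'bundle_xml_non_data_browse_collections.tar.gz': '6.2M',
    'data_raw_129xxxxxxx.tar.gz': '33M',
    'data_raw_130xxxxxxx.tar.gz': '4.8M',
    'data_raw_131xxxxxxx.tar.gz': '14M',
    'data_raw_132xxxxxxx.tar.gz': '49M',
    'data_raw_133xxxxxxx.tar.gz': '141M',
    'data_raw_134xxxxxxx.tar.gz': '167M',
    'data_raw_135xxxxxxx.tar.gz': '3.7G',
    'data_raw_136xxxxxxx.tar.gz': '747M',
    'data_raw_137xxxxxxx.tar.gz': '147M',
    'data_raw_138xxxxxxx.tar.gz': '35M',
    'data_raw_139xxxxxxx.tar.gz': '32M',
    'data_raw_140xxxxxxx.tar.gz': '103M',
    'data_raw_141xxxxxxx.tar.gz': '189M',
    'data_raw_142xxxxxxx.tar.gz': '42M',
    'data_raw_143xxxxxxx.tar.gz': '161M',
    'data_raw_144xxxxxxx.tar.gz': '480M',
    'data_raw_145xxxxxxx.tar.gz': '13M',
    'data_raw_col_xml_csv_metadata.tar.gz': '20M',
}

# Assumption: the suffixes printed by `ls -h` are binary (powers of 1024).
UNIT = {'K': 1024, 'M': 1024 ** 2, 'G': 1024 ** 3}


def to_bytes(size):
    """Convert a string such as '6.1G' or '33M' to a byte count."""
    number, unit = re.fullmatch(r'([\d.]+)([KMG])', size).groups()
    return float(number) * UNIT[unit]


def total_gib(prefix=''):
    """Sum the sizes of all archives whose name starts with `prefix`, in GiB."""
    total = sum(to_bytes(size) for name, size in ARCHIVES.items()
                if name.startswith(prefix))
    return total / UNIT['G']


print(f"data_raw archives: {total_gib('data_raw'):.1f} GiB")
print(f"all archives:      {total_gib():.1f} GiB")
```

With these rounded inputs, the data_raw archives come to about 6 GiB and all archives together to about 12 GiB.
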


juzen2003 · 2024-11-12T02:19:50Z

Updates on 11/12/24:

A first draft of the archive file info for all current bundle sets is uploaded to Dropbox (SETI Institute)/Pds4FileTest/archives-bundles for review. (Note: The archive files are too large in total, so only the archive files of cassini iss cruise and uranus occs are uploaded. The names and sizes of the archive files, along with the original bundle sizes, are recorded in *_info files for all bundle sets.)

yu-jenchang  archives-bundles $ ls -l |grep ._info
-rw-r--r--@ 1 yu-jenchang  staff  2025 Nov  8 11:53 cassini_iss_cruise_archive_info
-rw-r--r--@ 1 yu-jenchang  staff  5984 Nov  8 11:54 cassini_iss_satrun_archive_info
-rw-r--r--@ 1 yu-jenchang  staff   994 Nov 12 09:48 cassini_uvis_solarocc_beckerjarmak2023_archive_info
-rw-r--r--@ 1 yu-jenchang  staff   846 Nov 12 09:42 cassini_vims_cruise_archive_info
-rw-r--r--@ 1 yu-jenchang  staff  6013 Nov 12 09:46 cassini_vims_saturn_archive_info
-rw-r--r--@ 1 yu-jenchang  staff  2546 Nov  8 11:55 uranus_occs_earthbased_archive_info
yu-jenchang  archives-bundles $

The latest archive_paths and archive_dirs rules are updated in the pull.

matthewtiscareno · 2024-11-13T06:39:33Z

Here is what I'm seeing:

cassini_uvis_solarocc_beckerjarmak2023 has one archive file. This is fine.
uranus_occs_earthbased has one archive file. This is fine.
The file cassini_iss_satrun_archive_info contains a typo in the filename (saturn, not satrun)
For cassini_iss_cruise, why break down data_raw at second-level directories but not browse_raw? Is this specified manually, or is there a rule? Either way, what is the criterion used?
For cassini_iss_cruise, the raw data seems to have a very high compression ratio. Even though the data adds up to 37G, all of the archive files together add up to only 6 TB, about the same as browse_raw.tar.gz. Perhaps we shouldn't break up the data_raw archives for cassini_iss_cruise?
For cassini_vims_cruise, the entire bundle is in one archive, only 4 TB in size. This is fine.
For cassini_iss_saturn and cassini_vims_saturn, both browse_raw and data_raw are broken down at second-level directories. This is also fine.
The file data_raw_col_xml_csv_metadata.tar.gz is not described at the top of the text file for any of the bundles
Will _metadata be removed from the filename of data_raw_col_xml_csv_metadata.tar.gz, now that we understand that data_raw/metadata directories are to be eliminated?

rfrenchseti · 2024-11-13T18:42:07Z

The file cassini_iss_satrun_archive_info contains a typo in the filename (saturn, not satrun)

This is just a text file Dave created by hand for our review. It's not part of the tool chain.

For cassini_iss_cruise, why break down data_raw at second-level directories but not browse_raw? Is this specified manually, or is there a rule? Either way, what is the criterion used?

We discussed this in the group meeting and agreed to break the browse products in the same way as the data products.

For cassini_iss_cruise, the raw data seems to have a very high compression ratio. Even though the data adds up to 37G, all of the archive files together add up to only 6 TB, about the same as browse_raw.tar.gz. Perhaps we shouldn't break up the data_raw archives for cassini_iss_cruise?

I'm assuming you mean 6 GB, not 6 TB? If the entire collection compresses to only 6 GB, then I agree one file is a good choice.

For cassini_vims_cruise, the entire bundle is in one archive, only 4 TB in size. This is fine.

I'm assuming you mean 4 GB, not 4 TB? If the entire collection compresses to only 4 GB, then I agree one file is a good choice.

For cassini_iss_saturn and cassini_vims_saturn, both browse_raw and data_raw are broken down at second-level directories. This is also fine.

The file data_raw_col_xml_csv_metadata.tar.gz is not described at the top of the text file for any of the bundles

The _info files are just for our internal review. They are not for public consumption. But perhaps Dave can add the requested details to prevent confusion.

Will _metadata be removed from the filename of data_raw_col_xml_csv_metadata.tar.gz, now that we understand that data_raw/metadata directories are to be eliminated?

Seems like a good idea.

juzen2003 · 2024-11-14T00:22:35Z

The file cassini_iss_satrun_archive_info contains a typo in the filename (saturn, not satrun)

This is just a text file Dave created by hand for our review. It's not part of the tool chain.

The *_info files are created manually and list the .tar.gz file info of the different bundles/bundle sets for review. I'll fix the file name typo.

For cassini_iss_cruise, why break down data_raw at second-level directories but not browse_raw? Is this specified manually, or is there a rule? Either way, what is the criterion used?

We discussed this in the group meeting and agreed to break the browse products in the same way as the data products.

I'll break up browse_raw the same way as data_raw, and put collection_browse_raw.csv/xml inside all of the browse_raw .tar.gz files, as discussed in the meeting.

For cassini_iss_cruise, the raw data seems to have a very high compression ratio. Even though the data adds up to 37G, all of the archive files together add up to only 6 TB, about the same as browse_raw.tar.gz. Perhaps we shouldn't break up the data_raw archives for cassini_iss_cruise?

I'm assuming you mean 6 GB, not 6 TB? If the entire collection compresses to only 6 GB, then I agree one file is a good choice.

The total size of the archive files for cassini_iss_cruise is around 12-13GB. Do we want just one archive file for cassini_iss_cruise? It would be around 12-13GB.

For cassini_vims_cruise, the entire bundle is in one archive, only 4 TB in size. This is fine.

I'm assuming you mean 4 GB, not 4 TB? If the entire collection compresses to only 4 GB, then I agree one file is a good choice.

The entire cassini_vims_cruise only has one archive file, cassini_vims_cruise.tar.gz, which is 1.9GB

For cassini_iss_saturn and cassini_vims_saturn, both browse_raw and data_raw are broken down at second-level directories. This is also fine.

The file data_raw_col_xml_csv_metadata.tar.gz is not described at the top of the text file for any of the bundles

The _info files are just for our internal review. They are not for public consumption. But perhaps Dave can add the requested details to prevent confusion.

I'll update *_info files later

Will _metadata be removed from the filename of data_raw_col_xml_csv_metadata.tar.gz, now that we understand that data_raw/metadata directories are to be eliminated?

Seems like a good idea.

Got it, I'll remove metadata directories, and update the rule by removing metadata from the .tar.gz file name

matthewtiscareno · 2024-11-14T14:28:21Z

For cassini_iss_cruise, why break down data_raw at second-level directories but not browse_raw? Is this specified manually, or is there a rule? Either way, what is the criterion used?

We discussed this in the group meeting and agreed to break the browse products in the same way as the data products.

That's fine, but part of what I was trying to figure out at the group meeting is how and where these decisions are encoded. Can you point me to the file(s)?

The file data_raw_col_xml_csv_metadata.tar.gz is not described at the top of the text file for any of the bundles

The _info files are just for our internal review. They are not for public consumption. But perhaps Dave can add the requested details to prevent confusion.

I understand that. I just want to have some place where I can be informed about what rules are being set up. If it's not these files, then don't bother changing these files.

Got it, I'll remove metadata directories, and update the rule by removing metadata from the .tar.gz file name

We will be removing metadata directories from these places in the PDS4 holdings, so you should not spend time setting up a special rule to ignore them, right?

juzen2003 · 2024-11-14T18:53:07Z

For cassini_iss_cruise, why break down data_raw at second-level directories but not browse_raw? Is this specified manually, or is there a rule? Either way, what is the criterion used?

We discussed this in the group meeting and agreed to break the browse products in the same way as the data products.

That's fine, but part of what I was trying to figure out at the group meeting is how and where these decisions are encoded. Can you point me to the file(s)?

They are encoded in these two variables under the Archives area of the rules files:
archive_paths: Map a bundle set or a bundle to a list of logical paths of the archive file names.
archive_dirs: Map a logical path of an archive file name to a list of logical paths of the included directories.
They are in these files: (you can click the Files changed tab to see these changes in the pull request as well)

pdsfile/pds4file/rules/cassini_iss.py
pdsfile/pds4file/rules/cassini_vims.py
pdsfile/pds4file/rules/uranus_occs_earthbased.py
pdsfile/pds4file/rules/cassini_uvis_solarocc_beckerjarmak2023.py

For reference, we can also call .archive_paths and .archive_dirs on a pds4file instance to get the corresponding archive info, here is the log from ipython:

In [1]: import pdsfile
   ...: pdsfile.pds4file.Pds4File.use_shelves_only(False)
   ...: pdsfile.pds4file.Pds4File.preload('/Volumes/rms-holdings/pds4-holdings')
   ...: pdsfile.pds4file.Pds4File.preload('/Users/yu-jenchang/Dropbox (SETI Institute)/Shared-OPUS/pdsda
   ...: ta/pds4-holdings')

In [2]: b = pdsfile.pds4file.Pds4File.from_abspath('/Volumes/rms-holdings/pds4-holdings/bundles/uranus_o
   ...: ccs_earthbased/')

In [3]: b.archive_paths()
Out[3]: ['/Volumes/rms-holdings/pds4-holdings/archives-bundles/uranus_occs_earthbased/uranus_occs_earthbased.tar.gz']

In [4]: b.archive_dirs()
Out[4]: {'/Volumes/rms-holdings/pds4-holdings/archives-bundles/uranus_occs_earthbased/uranus_occs_earthbased.tar.gz': ['/Volumes/rms-holdings/pds4-holdings/bundles/uranus_occs_earthbased']}

In [5]:

The file data_raw_col_xml_csv_metadata.tar.gz is not described at the top of the text file for any of the bundles

The _info files are just for our internal review. They are not for public consumption. But perhaps Dave can add the requested details to prevent confusion.

I understand that. I just want to have some place where I can be informed about what rules are being set up. If it's not these files, then don't bother changing these files.

Got it, I'll remove metadata directories, and update the rule by removing metadata from the .tar.gz file name

We will be removing metadata directories from these places in the PDS4 holdings, so you should not spend time setting up a special rule to ignore them, right?


juzen2003 · 2024-11-18T18:43:59Z

Updates:

Update *_info files on dropbox
Update the archive rules by:
- Split browse_raw into multiple archive files based on sclk for cassini_iss_cruise.
- Include collection_*_raw.csv/xml in every browse_raw & data_raw .tar.gz files based on sclk.
- Remove data_raw_col_xml_csv_metadata.tar.gz since the metadata directory will be removed and collection_*_raw.csv/xml are included in every browse_raw & data_raw archive files based on sclk.


matthewtiscareno · 2024-11-22T02:36:35Z

That's fine, but part of what I was trying to figure out at the group meeting is how and where these decisions are encoded. Can you point me to the file(s)?

They are encoded in these two variables under the Archives area of the rules files:

Okay, thanks for that. I see that everything seems to be hardcoded in an ad hoc manner. Maybe that's what we want to do. A more systematic approach could conceivably be attractive but might be more trouble than it's worth.


juzen2003 added 28 commits December 6, 2023 13:25

Add validation directory (moved from rms-webtools repo)

2e94c7b

Merge remote-tracking branch 'origin/main'

4be35d4

Merge remote-tracking branch 'origin/main'

5e5fe97

Merge remote-tracking branch 'origin/main'

fd70eb6

Merge remote-tracking branch 'origin/main'

9f9c733

Merge remote-tracking branch 'origin/main'

077c359

Merge remote-tracking branch 'origin/main'

291555e

Merge remote-tracking branch 'origin/main'

d195a79

Merge remote-tracking branch 'origin/main'

1f762cb

Merge remote-tracking branch 'origin/main'

92bd65c

Merge remote-tracking branch 'origin/main'

205b9df

Add command line tool to show opus products output with the given absolute path of a filespec.

ab669f9

Merge remote-tracking branch 'origin/main'

c7d84e5

Merge remote-tracking branch 'origin/main'

a075bbc

Merge remote-tracking branch 'origin/main'

428d727

Merge remote-tracking branch 'origin/main' into create_shelves_files_for_pds4

fd4c7b6

Merge remote-tracking branch 'origin/main'

756547c

Merge remote-tracking branch 'origin/main'

d2d3803

Merge branch 'main' into create_shelves_files_for_pds4

a875899

Create pds4 directory to store maintenance files for pds4.

4ffbc45

Merge remote-tracking branch 'origin/main' into create_shelves_files_for_pds4

4b9b154

Update pds4/pds4checksums.py so that checksum files can be created under checksums-bundles directory for pds4

cd66002

Merge remote-tracking branch 'origin/main' into create_shelves_files_for_pds4

fe7c687

Update pds4/pds4infoshelf.py to create info shelves files under _infoshelf-bundles

ddbbb2a

Removed the debug print statement in the pdsfile.py

69bfd7a

Merge remote-tracking branch 'origin/main' into create_shelves_files_for_pds4

cb08184

Create pds4/pds4indexshelf.py & pds4/pds4linkshelf.py

bc74047

Update pds4linkshelf.py to make sure we can create files in _linkshelf-* directory

5509b5a

juzen2003 marked this pull request as draft September 23, 2024 22:53

At line 156-158 of pds4linkshelf.py, skip the empty entry of a csv file when trying to parse each entry to get the basename of a file in the archive.

39639c6

juzen2003 added 7 commits October 22, 2024 04:54

Replace all volumes/volset/volume with bundles/bundleset/bundle in pds4checksums.py

425ee74

Rename volume and volset to bundle and bundleset in pds4archives.py

1e389cb

Clean up the code style (commented out code), and update archive_dirs functions in pds4 to use glob_glob to properly get the abspath for included archive dirs.

d95afe5

Add archive_dirs and archive_paths rules for cassini_iss_cruise and cassini_iss_saturn. (line 292-376 in rules/cassini_iss.py)

b961091

Merge remote-tracking branch 'origin/main' into create_shelves_files_for_pds4

cf318d7

juzen2003 added 2 commits November 8, 2024 10:57

Update archive_paths and archive_dirs rules for cassini_iss_cruise & cassini_iss_saturn

427c3e4

Add cassini_uvis_solarocc_beckerjarmak2023.py to include archive_paths and archive_dirs rules for cassini_uvis_solarocc_beckerjarmak2023

cb7bf18

juzen2003 added 4 commits November 19, 2024 11:56

Sort the paths of each sublist of the opus products output by modifying opus_products function (line 4792-4795, pdsfile.py)

0348fed

Sort the return of opus_products (line 4792-4799, pdsfile.py) by:
- Sort the list of sublists by version and filepath (in the order of decreasing version, or reversed alphabetical order if version is the same)
- Sort the sublist by filepath (alphabetical order)

66b01d5

Fixed the sort of each sublist for the return of opus_products by using the abspath of element in the sublist.

22e22fb

Update opus_prioritizer in rules/GO_0xxx.py to make sure prio is set to 0 when there is 'REDO', 'TIRETRACK', or 'REPAIRED' substring in one of the path in a sublist so that prioritizer can be properly sorted.

5e0bf09

Sort the opus_products return by:
- for the same opus type (header), combining different lists of the same version to one sublist
- sorting each sublist by filepath (alphabetical order)
- sorting the list of sublists by version (in the order of decreasing version)

9119f81

juzen2003 mentioned this pull request Dec 12, 2024

1384 sort filenames on details tab SETI/rms-opus#1396

Merged

rfrenchseti marked this pull request as ready for review December 12, 2024 22:37

rfrenchseti self-requested a review December 12, 2024 22:37

rfrenchseti marked this pull request as draft December 12, 2024 22:39
