New records harvested in the registry don't have the expected Node value #186

tloubrieu-jpl · 2024-09-17T15:30:08Z

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

In the registry, we can find documents in the *-registry indexes with property ops:Harvest_Info/ops:node_name = geo or en

🕵️ Expected behavior

I expected to have values PDS_GEO, or PDS_ENG.

📜 To Reproduce

Harvest data as described in the manual https://docs.google.com/document/d/12W1DyaRYh4yYnw4p_-qFLQ0kmn_w14Sr/edit

🖥 Environment Info

No response

📚 Version of Software Used

Using harvest v4.0.1

🩺 Test Data / Additional context

The node value expected are:

PDS_ATM  - Planetary Data System: Atmospheres Node

PDS_ENG  - Planetary Data System: Engineering Node

PDS_GEO  - Planetary Data System: Geosciences Node

PDS_IMG  - Planetary Data System: Imaging Node

PDS_NAIF - Planetary Data System: NAIF Node

PDS_PPI  - Planetary Data System: Planetary Plasma Interactions Node

PDS_RMS  - Planetary Data System: Rings Node

PDS_SBN  - Planetary Data System: Small Bodies Node

```
PSA      - Planetary Science Archive
```

JAXA     - Japan Aerospace Exploration Agency

🦄 Related requirements

🦄 #187

⚙️ Engineering Details

It sounds related to the change in the code which made the node name in the harvest configuration obsolete.

🎉 Integration & Test

No response

The text was updated successfully, but these errors were encountered:

jordanpadams · 2024-09-19T12:53:18Z

Created a clear requirement to tie to this as well:

#187

al-niessner · 2024-09-19T15:26:52Z

@jordanpadams @tloubrieu-jpl

Yes, this is to be expected. As discussed previously, moved the node name to be identified through the registry name. This is done here:

harvest/src/main/java/gov/nasa/pds/harvest/cfg/ConfigManager.java

Lines 60 to 64 in ae58da7

    
           static public String exchangeIndexForNode (String indexName) { 
        
             if (indexNodeMap.containsKey (indexName)) return indexNodeMap.get(indexName); 
        
             if (0 < indexName.indexOf("-registry")) return indexName.substring(0, indexName.indexOf("-registry")); 
        
             return "development"; 
        
           }

It always returns a name. Do you prefer to throw an error when not found in this table?

harvest/src/main/java/gov/nasa/pds/harvest/cfg/ConfigManager.java

Lines 21 to 34 in ae58da7

    
           private static HashMap<String,String> indexNodeMap = new HashMap<String,String>(); 
        
           static { 
        
             indexNodeMap.put("pds-atm-registry", "PDS_ATM"); 
        
             indexNodeMap.put("pds-eng-registry", "PDS_ENG"); 
        
             indexNodeMap.put("pds-geo-registry", "PDS_GEO"); 
        
             indexNodeMap.put("pds-img-registry", "PDS_IMG"); 
        
             indexNodeMap.put("pds-naif-registry", "PDS_NAIF"); 
        
             indexNodeMap.put("pds-ppi-registry", "PDS_PPI"); 
        
             indexNodeMap.put("pds-rms-registry", "PDS_RMS"); 
        
             indexNodeMap.put("pds-sbn-registry", "pds_SBN"); 
        
             indexNodeMap.put("psa-registry", "PSA"); 
        
             indexNodeMap.put("jaxa-registry", "JAXA"); 
        
             indexNodeMap.put("roscosmos", "ROSCOSMOS"); 
        
           }

As you can see from the function, it is mangling your registry name to get the node name since you are not following the convention you gave me. That convention was used to create the connections here:
https://github.com/NASA-PDS/registry-common/tree/main/src/main/resources/connections/cognito

We can do a couple of thing if this is a set of integration test nodes (maybe for some users initial review) but not full deployment nodes (ones that follow the original naming rules):

throw error if not recognized rather than make best guess
add the integration nodes to table above
change the naming rules and update all of the code
use the app://connections/cognito/*.xml instead of your own so that naming conventions are consistent

I would suggest 2 and 4. We can and should add more connections, both cognito and direct, to support more registries. We can then add them to the table above as needed with whatever name is desired.

nutjob4life · 2024-09-19T21:16:15Z

Need answering for outstanding questions

tloubrieu-jpl · 2024-09-19T22:15:10Z

Hi @al-niessner ,

Sorry, after taking a bit more time to think of it, I believe you are right, that does not make sense to keep a hardcoded mapping in the code, because we would need to update the harvest code if we add a new discipline node. Although that might never happen that is an unnecessary hassle.

Besides, I am realizing we don't want to redundantly manage that harvesintg node in harvest since we know who the document has been harvested by from the index in which the document is. In the documents it shows as _index = geo-registry (although that might not be searchable)

Unless we have cases where someone would like to ingest into an index, e.g. geo-registry but be authored as a different node, for example PDS_SBN or a more specific entity, for example PDS_GEO_MARS ... I don't believe we need that, but I am leaving that decision to @jordanpadams.
If we actually need that, we would go for option 4.

If we confirm we don't need that flexibility of having different harvesting nodes in the same index, I would completely remove this updates from the client side, in harvest, and move it servers ide, in sweeper or by an opensearch configuration (I am not finding which though).

@al-niessner , @alexdunnjpl , @jordanpadams , let me know what are your thoughts.

Thanks,

"

jordanpadams · 2024-09-20T08:09:05Z

@tloubrieu-jpl @al-niessner @alexdunnjpl I agree that it is not “clean” to have this mapping hardcoded in the client, and we can go with a server side solution, as long as it will not take a significant amount of time or effort. The creation of new indexes / nodes should not happen often (albeit we will have some international partners being added soon :-)). In the event that it is updated and a new node/index is needed, the user that requesting a new index can be required to install a new version of harvest.

al-niessner · 2024-09-20T15:29:10Z

@tloubrieu-jpl

Because of the solution you are outlined above, it is quite obvious I cannot communicate my ideas to you. I am basically saying the opposite.

The argument about the hard coded registry name <-> node name mapping is empty. No matter what, because naming rules for both registry and node names is neither defined nor consistent, the mapping has to be hard coded somewhere - client or server - in either a table or algorithm or both.

Putting the hard code in the client is long term cheapest as once it is right fewer and fewer corrections are needed. If the ability of forced updates existed (possible), then this would not be a problem - sweeper could be run once. Typos in the index name is a serious problem since the user can use create DB then harvest to fill up pda-registry instead of psa-registry. Hence, users should have only used app://. Doing so also ties the mapping to a fixed set of registry indices. The table and app:// provided in registry-common was self consistent as well preventing this issue altogether.

Go back to the purpose of the node name to drive the design. The purpose of node name seems to be to reduce the search space to a specific index. In other words, a user can say I just want to search the node name == PDS_SBN stuff and not the universe, which, for us, maps to an index. If it has more purpose or a different purpose, then we should work with its intended purpose. For instance, if the intent is who authored it, then we should embed the cognito credentials not some enum that cannot be verified; which is to say mars author can claim to be an sbn author while putting data into psa node.

@jordanpadams

"clean" is not possible. See above in this comment for why hard coding must happen.

You should write all of the tools to look and see if a new version is available before running and then refuse to run until versions match. Getting this out would save you so much money in development time and database sweeping. There may still be a bit of open window to slip it in with this release but that window is barely open and closing.

Anyway, I doubt I did any better communicating my position so just tell me what you want implemented and I will do it.

jordanpadams · 2024-09-21T14:21:32Z

@al-niessner per the new requirement now developed, we need a very consistent node_name value for a particular index. I don't necessarily care what that value is, but PDS_*, etc., was something we found to drive it for now.

You should write all of the tools to look and see if a new version is available before running and then refuse to run until versions match. Getting this out would save you so much money in development time and database sweeping. There may still be a bit of open window to slip it in with this release but that window is barely open and closing.

I 100% agree we need something like this. We are spending way too much time fixing issues with older/bad versions of tools. The problem I foresee here is (1) the tool should key off of the major version, and ignore the minor. (2) we have to use semantic versioning correctly when it comes to changes to metadata.

tloubrieu-jpl · 2024-09-24T21:07:00Z

@tloubrieu-jpl will describe the table to be hardcoded. Anything out of the table should through an error.

create dev name.

tloubrieu-jpl · 2024-09-24T22:43:18Z

Hi @al-niessner ,

The expected mapping is:

  atm-registry  --> PDS_ATM
  en-registry --> PDS_ENG
  geo-registry --> PDS_GEO
  img-registry --> PDS_IMG
  naif-registry --> PDS_NAIF
  ppi-registry --> PDS_PPI
  rms-registry --> PDS_RMS
  sbnpsi-registry --> PDS_SBN
  sbnumd-registry --> PDS_SBN
  psa-registry --> PSA
  jaxa-registry --> JAXA
  dev-registry --> PDS_ENG_DEV

Anything different, should raise an error saying: "Index not supported: either fix it in your configuration by using one of the supported or request an upgrade of harvest to support your new index by submitting a ticket on https://github.com/NASA-PDS/harvest/issues".

Thanks,

tloubrieu-jpl added bug Something isn't working needs:triage labels Sep 17, 2024

tloubrieu-jpl assigned jordanpadams Sep 17, 2024

tloubrieu-jpl added this to EN Portfolio Backlog Sep 17, 2024

github-project-automation bot moved this to ToDo in EN Portfolio Backlog Sep 17, 2024

tloubrieu-jpl assigned al-niessner and unassigned jordanpadams Sep 17, 2024

tloubrieu-jpl added sprint-backlog B15.1 s.critical s.high and removed needs:triage s.critical labels Sep 17, 2024

al-niessner mentioned this issue Sep 19, 2024

As a user, I want to include my organization name in the harvest metadata (ops:Harvest_Info.ops:node_name) #187

Closed

al-niessner mentioned this issue Sep 19, 2024

Properly set the ops:Harvest_Info/ops:node_name values, remove obsolete node_name in configuration file #188

Merged

jordanpadams added the B15.0 label Sep 24, 2024

al-niessner moved this from ToDo to ⚙ Review / QA in EN Portfolio Backlog Sep 25, 2024

al-niessner moved this from ⚙ Review / QA to ToDo in EN Portfolio Backlog Sep 25, 2024

jordanpadams removed the B15.0 label Sep 30, 2024

tloubrieu-jpl closed this as completed in #188 Oct 8, 2024

github-project-automation bot moved this from ToDo to 🏁 Done in EN Portfolio Backlog Oct 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New records harvested in the registry don't have the expected Node value #186

New records harvested in the registry don't have the expected Node value #186

tloubrieu-jpl commented Sep 17, 2024 •

edited by jordanpadams

Loading

jordanpadams commented Sep 19, 2024

al-niessner commented Sep 19, 2024 •

edited

Loading

nutjob4life commented Sep 19, 2024

tloubrieu-jpl commented Sep 19, 2024

jordanpadams commented Sep 20, 2024

al-niessner commented Sep 20, 2024 •

edited

Loading

jordanpadams commented Sep 21, 2024

tloubrieu-jpl commented Sep 24, 2024 •

edited

Loading

tloubrieu-jpl commented Sep 24, 2024

New records harvested in the registry don't have the expected Node value #186

New records harvested in the registry don't have the expected Node value #186

Comments

tloubrieu-jpl commented Sep 17, 2024 • edited by jordanpadams Loading

Checked for duplicates

🐛 Describe the bug

🕵️ Expected behavior

📜 To Reproduce

🖥 Environment Info

📚 Version of Software Used

🩺 Test Data / Additional context

🦄 Related requirements

⚙️ Engineering Details

🎉 Integration & Test

jordanpadams commented Sep 19, 2024

al-niessner commented Sep 19, 2024 • edited Loading

nutjob4life commented Sep 19, 2024

tloubrieu-jpl commented Sep 19, 2024

jordanpadams commented Sep 20, 2024

al-niessner commented Sep 20, 2024 • edited Loading

jordanpadams commented Sep 21, 2024

tloubrieu-jpl commented Sep 24, 2024 • edited Loading

tloubrieu-jpl commented Sep 24, 2024

tloubrieu-jpl commented Sep 17, 2024 •

edited by jordanpadams

Loading

al-niessner commented Sep 19, 2024 •

edited

Loading

al-niessner commented Sep 20, 2024 •

edited

Loading

tloubrieu-jpl commented Sep 24, 2024 •

edited

Loading