Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New records harvested in the registry don't have the expected Node value #186

Closed
tloubrieu-jpl opened this issue Sep 17, 2024 · 9 comments Β· Fixed by #188
Closed

New records harvested in the registry don't have the expected Node value #186

tloubrieu-jpl opened this issue Sep 17, 2024 · 9 comments Β· Fixed by #188
Assignees
Labels

Comments

@tloubrieu-jpl
Copy link
Member

tloubrieu-jpl commented Sep 17, 2024

Checked for duplicates

Yes - I've already checked

πŸ› Describe the bug

In the registry, we can find documents in the *-registry indexes with property ops:Harvest_Info/ops:node_name = geo or en

πŸ•΅οΈ Expected behavior

I expected to have values PDS_GEO, or PDS_ENG.

πŸ“œ To Reproduce

Harvest data as described in the manual https://docs.google.com/document/d/12W1DyaRYh4yYnw4p_-qFLQ0kmn_w14Sr/edit

πŸ–₯ Environment Info

No response

πŸ“š Version of Software Used

Using harvest v4.0.1

🩺 Test Data / Additional context

The node value expected are:

  • PDS_ATM  - Planetary Data System: Atmospheres Node
    
  • PDS_ENG  - Planetary Data System: Engineering Node
    
  • PDS_GEO  - Planetary Data System: Geosciences Node
    
  • PDS_IMG  - Planetary Data System: Imaging Node
    
  • PDS_NAIF - Planetary Data System: NAIF Node
    
  • PDS_PPI  - Planetary Data System: Planetary Plasma Interactions Node
    
  • PDS_RMS  - Planetary Data System: Rings Node
    
  • PDS_SBN  - Planetary Data System: Small Bodies Node
    
  • PSA      - Planetary Science Archive
    
  • JAXA     - Japan Aerospace Exploration Agency
    

πŸ¦„ Related requirements

πŸ¦„ #187

βš™οΈ Engineering Details

It sounds related to the change in the code which made the node name in the harvest configuration obsolete.

πŸŽ‰ Integration & Test

No response

@jordanpadams
Copy link
Member

Created a clear requirement to tie to this as well:

#187

@al-niessner
Copy link
Contributor

al-niessner commented Sep 19, 2024

@jordanpadams @tloubrieu-jpl

Yes, this is to be expected. As discussed previously, moved the node name to be identified through the registry name. This is done here:

static public String exchangeIndexForNode (String indexName) {
if (indexNodeMap.containsKey (indexName)) return indexNodeMap.get(indexName);
if (0 < indexName.indexOf("-registry")) return indexName.substring(0, indexName.indexOf("-registry"));
return "development";
}

It always returns a name. Do you prefer to throw an error when not found in this table?

private static HashMap<String,String> indexNodeMap = new HashMap<String,String>();
static {
indexNodeMap.put("pds-atm-registry", "PDS_ATM");
indexNodeMap.put("pds-eng-registry", "PDS_ENG");
indexNodeMap.put("pds-geo-registry", "PDS_GEO");
indexNodeMap.put("pds-img-registry", "PDS_IMG");
indexNodeMap.put("pds-naif-registry", "PDS_NAIF");
indexNodeMap.put("pds-ppi-registry", "PDS_PPI");
indexNodeMap.put("pds-rms-registry", "PDS_RMS");
indexNodeMap.put("pds-sbn-registry", "pds_SBN");
indexNodeMap.put("psa-registry", "PSA");
indexNodeMap.put("jaxa-registry", "JAXA");
indexNodeMap.put("roscosmos", "ROSCOSMOS");
}

As you can see from the function, it is mangling your registry name to get the node name since you are not following the convention you gave me. That convention was used to create the connections here:
https://github.com/NASA-PDS/registry-common/tree/main/src/main/resources/connections/cognito

We can do a couple of thing if this is a set of integration test nodes (maybe for some users initial review) but not full deployment nodes (ones that follow the original naming rules):

  1. throw error if not recognized rather than make best guess
  2. add the integration nodes to table above
  3. change the naming rules and update all of the code
  4. use the app://connections/cognito/*.xml instead of your own so that naming conventions are consistent

I would suggest 2 and 4. We can and should add more connections, both cognito and direct, to support more registries. We can then add them to the table above as needed with whatever name is desired.

@nutjob4life
Copy link
Member

Need answering for outstanding questions

@tloubrieu-jpl
Copy link
Member Author

Hi @al-niessner ,

Sorry, after taking a bit more time to think of it, I believe you are right, that does not make sense to keep a hardcoded mapping in the code, because we would need to update the harvest code if we add a new discipline node. Although that might never happen that is an unnecessary hassle.

Besides, I am realizing we don't want to redundantly manage that harvesintg node in harvest since we know who the document has been harvested by from the index in which the document is. In the documents it shows as _index = geo-registry (although that might not be searchable)

Unless we have cases where someone would like to ingest into an index, e.g. geo-registry but be authored as a different node, for example PDS_SBN or a more specific entity, for example PDS_GEO_MARS ... I don't believe we need that, but I am leaving that decision to @jordanpadams.
If we actually need that, we would go for option 4.

If we confirm we don't need that flexibility of having different harvesting nodes in the same index, I would completely remove this updates from the client side, in harvest, and move it servers ide, in sweeper or by an opensearch configuration (I am not finding which though).

@al-niessner , @alexdunnjpl , @jordanpadams , let me know what are your thoughts.

Thanks,

"

@jordanpadams
Copy link
Member

@tloubrieu-jpl @al-niessner @alexdunnjpl I agree that it is not β€œclean” to have this mapping hardcoded in the client, and we can go with a server side solution, as long as it will not take a significant amount of time or effort. The creation of new indexes / nodes should not happen often (albeit we will have some international partners being added soon :-)). In the event that it is updated and a new node/index is needed, the user that requesting a new index can be required to install a new version of harvest.

@al-niessner
Copy link
Contributor

al-niessner commented Sep 20, 2024

@tloubrieu-jpl

Because of the solution you are outlined above, it is quite obvious I cannot communicate my ideas to you. I am basically saying the opposite.

The argument about the hard coded registry name <-> node name mapping is empty. No matter what, because naming rules for both registry and node names is neither defined nor consistent, the mapping has to be hard coded somewhere - client or server - in either a table or algorithm or both.

Putting the hard code in the client is long term cheapest as once it is right fewer and fewer corrections are needed. If the ability of forced updates existed (possible), then this would not be a problem - sweeper could be run once. Typos in the index name is a serious problem since the user can use create DB then harvest to fill up pda-registry instead of psa-registry. Hence, users should have only used app://. Doing so also ties the mapping to a fixed set of registry indices. The table and app:// provided in registry-common was self consistent as well preventing this issue altogether.

Go back to the purpose of the node name to drive the design. The purpose of node name seems to be to reduce the search space to a specific index. In other words, a user can say I just want to search the node name == PDS_SBN stuff and not the universe, which, for us, maps to an index. If it has more purpose or a different purpose, then we should work with its intended purpose. For instance, if the intent is who authored it, then we should embed the cognito credentials not some enum that cannot be verified; which is to say mars author can claim to be an sbn author while putting data into psa node.

@jordanpadams

"clean" is not possible. See above in this comment for why hard coding must happen.

You should write all of the tools to look and see if a new version is available before running and then refuse to run until versions match. Getting this out would save you so much money in development time and database sweeping. There may still be a bit of open window to slip it in with this release but that window is barely open and closing.

Anyway, I doubt I did any better communicating my position so just tell me what you want implemented and I will do it.

@jordanpadams
Copy link
Member

@al-niessner per the new requirement now developed, we need a very consistent node_name value for a particular index. I don't necessarily care what that value is, but PDS_*, etc., was something we found to drive it for now.

You should write all of the tools to look and see if a new version is available before running and then refuse to run until versions match. Getting this out would save you so much money in development time and database sweeping. There may still be a bit of open window to slip it in with this release but that window is barely open and closing.

I 100% agree we need something like this. We are spending way too much time fixing issues with older/bad versions of tools. The problem I foresee here is (1) the tool should key off of the major version, and ignore the minor. (2) we have to use semantic versioning correctly when it comes to changes to metadata.

@tloubrieu-jpl
Copy link
Member Author

tloubrieu-jpl commented Sep 24, 2024

@tloubrieu-jpl will describe the table to be hardcoded. Anything out of the table should through an error.

create dev name.

@tloubrieu-jpl
Copy link
Member Author

Hi @al-niessner ,

The expected mapping is:

  atm-registry  --> PDS_ATM
  en-registry --> PDS_ENG
  geo-registry --> PDS_GEO
  img-registry --> PDS_IMG
  naif-registry --> PDS_NAIF
  ppi-registry --> PDS_PPI
  rms-registry --> PDS_RMS
  sbnpsi-registry --> PDS_SBN
  sbnumd-registry --> PDS_SBN
  psa-registry --> PSA
  jaxa-registry --> JAXA
  dev-registry --> PDS_ENG_DEV

Anything different, should raise an error saying: "Index not supported: either fix it in your configuration by using one of the supported or request an upgrade of harvest to support your new index by submitting a ticket on https://github.com/NASA-PDS/harvest/issues".

Thanks,

@al-niessner al-niessner moved this from ToDo to βš™ Review / QA in EN Portfolio Backlog Sep 25, 2024
@al-niessner al-niessner moved this from βš™ Review / QA to ToDo in EN Portfolio Backlog Sep 25, 2024
@github-project-automation github-project-automation bot moved this from ToDo to 🏁 Done in EN Portfolio Backlog Oct 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: 🏁 Done
4 participants