Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't download files when file already exists locally and message checksum is set to arbitrary #1351

Closed
andreleblanc11 opened this issue Dec 18, 2024 · 2 comments · Fixed by #1362
Assignees
Labels
bug Something isn't working Discussion_Needed developers should discuss this issue. Priority 4 - Strategic would benefit multiple use cases if resolved regression Broke something that was working before. v3 issue deferred to or affects version 3 work-around a work-around is provided, mitigating the issue.

Comments

@andreleblanc11
Copy link
Member

andreleblanc11 commented Dec 18, 2024

When a message comes in with an arbitrary checksum, and we want to download this file in a location where

  1. The filename already exists, but
  2. The checksum is different from that of the incoming message.

The file gets labeled as a duplicate and is rejected.

2024-12-18 16:01:21,372 [DEBUG] sarracenia.flow file_should_be_downloaded checksum in message: {'method': 'arbitrary', 'value': 'f47363343f9330bc36652458255f17ab'} vs. local: {'method': 'a
rbitrary', 'value': 'f47363343f9330bc36652458255f17ab'}
2024-12-18 16:01:21,372 [DEBUG] sarracenia.flow file_should_be_downloaded checksum in message: {'method': 'arbitrary', 'value': '3fb870ca02dd4040c0c3c6ba0237a9e8'} vs. local: {'method': 'a
rbitrary', 'value': '3fb870ca02dd4040c0c3c6ba0237a9e8'}
2024-12-18 16:01:21,372 [INFO] sarracenia.flowcb.log after_work rejected: 304 same checksum /apps/sarra/public_data/20241218/WXO-DD-STAGE/observations/swob-ml/latest/CPCI-AUTO-minute-swob.
xml
2024-12-18 16:01:21,372 [INFO] sarracenia.flowcb.log after_work rejected: 304 same checksum /apps/sarra/public_data/20241218/WXO-DD-STAGE/observations/swob-ml/latest/CWEW-AUTO-minute-swob.
xml
2024-12-18 16:01:21,372 [INFO] sarracenia.flowcb.log after_work rejected: 304 same checksum /apps/sarra/public_data/20241218/WXO-DD-STAGE/observations/swob-ml/latest/CZGH-AUTO-minute-swob.
xml
2024-12-18 16:01:21,372 [INFO] sarracenia.flowcb.log after_work rejected: 304 same checksum /apps/sarra/public_data/20241218/WXO-DD-STAGE/observations/swob-ml/latest/CESB-AUTO-minute-swob.
xml
2024-12-18 16:01:21,372 [INFO] sarracenia.flowcb.log after_work rejected: 304 same checksum /apps/sarra/public_data/20241218/WXO-DD-STAGE/observations/swob-ml/latest/CWZW-AUTO-minute-swob.
xml
2024-12-18 16:01:21,372 [INFO] sarracenia.flowcb.log after_work rejected: 304 same checksum /apps/sarra/public_data/20241218/WXO-DD-STAGE/observations/swob-ml/latest/CWOK-AUTO-minute-swob.
xml
2024-12-18 16:01:21,372 [INFO] sarracenia.flowcb.log after_work rejected: 304 same checksum /apps/sarra/public_data/20241218/WXO-DD-STAGE/observations/swob-ml/latest/CVOP-AUTO-minute-swob.
xml

For my use case, I found, that xattr wasn't installed when I did a sr3 features.
I installed it and see'd if that helped, but some files were still being labeled as duplicates.

I think something might be wrong in compute_local_checksum

def compute_local_checksum(self, msg) -> None:
if sarracenia.filemetadata.supports_extended_attributes:
try:
x = sarracenia.filemetadata.FileMetadata(msg['new_path'])
s = x.get('identity')
if s:
metadata_cached_mtime = x.get('mtime')
if ((metadata_cached_mtime >= msg['mtime'])):
# file has not been modified since checksum value was stored.
if (( 'identity' in msg ) and ( 'method' in msg['identity'] ) and \
( msg['identity']['method'] == s['method'] )) or \
( s['method'] == self.o.identity_method ) :
# file
# cache good.
msg['local_identity'] = s
msg['_deleteOnPost'] |= set(['local_identity'])
b = x.get('blocks')
msg['local_blocks'] = b
msg['_deleteOnPost'] |= set(['local_blocks'])
return
except:
pass

I've added debugging in the source and found that line 1439 returns as False.
When this happens, this line of code is executed and will always set an arbitrary checksum as a duplicate.

if msg['identity']['method'] == 'arbitrary':
local_identity.value = msg['identity']['value']

I think we might need to change the >= to a =<? If the file at the destination already exists, the mtime should be lesser than that of the file we're trying to download.

Work around

We can work around this by overriding the checksum with identity cod,sha512 however this isn't good practice, because we can't truely validate the integrity of the file that comes from the source.

@andreleblanc11 andreleblanc11 added work-around a work-around is provided, mitigating the issue. v3 issue deferred to or affects version 3 Discussion_Needed developers should discuss this issue. Priority 4 - Strategic would benefit multiple use cases if resolved labels Dec 18, 2024
@petersilva petersilva added the bug Something isn't working label Dec 18, 2024
@petersilva petersilva self-assigned this Jan 6, 2025
@petersilva
Copy link
Contributor

This problem occurs, when:

  • file with same name exists on local tree.
  • file checksum is arbitrary.
  • calculate_local_checksum... is called... the xattr python module was missing... so it always returns the checksum from the message because there is no way to calculate local checksum without xattr.
  • installing the xattr routine should fix the problem.

oh... found a bug in flow/compute_local_checksum... working on it.

@petersilva petersilva added the regression Broke something that was working before. label Jan 6, 2025
@petersilva
Copy link
Contributor

regression vs. v2.

petersilva added a commit that referenced this issue Jan 6, 2025
Need to check against local file age, not message age to
understand whether to use local file xattr storage.

Also, if you don't have a stored xattr, and you have arbitrary,
you can't calculate it, so return without it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Discussion_Needed developers should discuss this issue. Priority 4 - Strategic would benefit multiple use cases if resolved regression Broke something that was working before. v3 issue deferred to or affects version 3 work-around a work-around is provided, mitigating the issue.
Projects
None yet
2 participants