Use SHA for BLOB update instead of modification time #3697

Merged
merged 7 commits into from
Oct 4, 2024
Changes from 3 commits
@@ -444,7 +444,7 @@ public BlobStoreFileInputStream(BlobStoreFile part) throws IOException {

         @Override
         public long getVersion() throws IOException {
-            return part.getModTime();
+            return part.getVersion();
         }

         @Override
@@ -42,6 +42,10 @@ public abstract class BlobStoreFile {

     public abstract long getModTime() throws IOException;

+    public long getVersion() throws IOException {
+        return getModTime();
+    }
+
     public abstract InputStream getInputStream() throws IOException;

     public abstract OutputStream getOutputStream() throws IOException;
@@ -291,7 +291,7 @@ public ReadableBlobMeta getBlobMeta(String key, Subject who) throws Authorizatio
         rbm.set_settable(meta);
         try {
             LocalFsBlobStoreFile pf = fbs.read(DATA_PREFIX + key);
-            rbm.set_version(pf.getModTime());
+            rbm.set_version(pf.getVersion());
         } catch (IOException e) {
             throw new RuntimeException(e);
         }
@@ -20,8 +20,10 @@
 import java.io.OutputStream;
 import java.nio.file.Files;
 import java.nio.file.StandardCopyOption;
+import java.util.Arrays;
 import java.util.regex.Matcher;
 import org.apache.storm.generated.SettableBlobMeta;
+import org.apache.storm.shade.org.apache.commons.codec.digest.DigestUtils;

 public class LocalFsBlobStoreFile extends BlobStoreFile {

@@ -72,6 +74,14 @@ public String getKey() {
         return key;
     }

+    @Override
+    public long getVersion() throws IOException {
+        try (FileInputStream fis = new FileInputStream(path)) {
+            byte[] bytes = DigestUtils.sha1(fis);
Contributor

What do you think about using something such as SHA-256 or SHA-512 to avoid (unlikely) collisions?

Contributor Author

Thank you for the suggestion. I've updated it to SHA-256.
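
A side note on the collision question: in the diff as shown, the digest bytes are passed through Arrays.hashCode, which returns an int that is then widened to long. The version can therefore take at most 2^32 distinct values, whichever SHA variant produces the digest. The sketch below illustrates this. It uses java.security.MessageDigest from the JDK instead of the shaded DigestUtils, and DigestVersionSketch and version are hypothetical names, not code from this PR.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class DigestVersionSketch {

    // Folds a digest of the content into a long the same way the diff does:
    // Arrays.hashCode returns an int, which Java widens to long on return.
    static long version(String algorithm, byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance(algorithm).digest(content);
            return Arrays.hashCode(digest);
        } catch (NoSuchAlgorithmException e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new IllegalStateException("Unknown digest algorithm: " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        byte[] content = "Content for SHA hash".getBytes(StandardCharsets.UTF_8);
        for (String algorithm : new String[] {"SHA-1", "SHA-256", "SHA-512"}) {
            long v = version(algorithm, content);
            // A value that survives a round trip through int fits in 32 bits.
            boolean fitsInInt = v == (int) v;
            System.out.println(algorithm + " -> " + v + " (fits in 32 bits: " + fitsInInt + ")");
        }
    }
}
```

If this reading is right, a longer digest would lower the collision risk only if more of its bytes were kept in the returned long.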

Contributor

Do we have a feeling for how often getVersion(...) is called? Creating a SHA hash is rather expensive compared to reading the modification date (just thinking about whether we need caching after the first call, or the like).

Contributor

If it is called often, perhaps we can use something such as MurmurHash, which is used elsewhere in the code for the sharding of tuples.

This is run by the AsyncLocalizer at the interval defined by supervisor.localizer.update.blob.interval.secs, so it affects the supervisor rather than the worker. We wouldn't need to cache it, but we can add a cache nevertheless.

Contributor Author

@paxadax, Oct 2, 2024

Since this runs continuously, we can opt for a Murmur hash, which prioritises fast hashing (suggested by @reiabreu), and that way skip caching.

Contributor

Works for me

Contributor Author

Hey, after a brief discussion, we've decided to go with a checksum instead of Murmur, since checksum computation is faster. Commit with the changes.

Contributor

Makes sense to use checksum.
I've approved the changes
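
The checksum approach settled on in this thread could look roughly like the following. This is only a sketch under stated assumptions: the thread does not name the checksum algorithm, so CRC32 from java.util.zip is a guess, and ChecksumVersionSketch and checksumVersion are hypothetical names rather than code from the PR.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

public class ChecksumVersionSketch {

    // Streams the file through a CRC32 and returns the checksum as the version.
    // CRC32 is an assumption; any java.util.zip.Checksum could be swapped in.
    static long checksumVersion(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file);
             CheckedInputStream checked = new CheckedInputStream(in, new CRC32())) {
            byte[] buffer = new byte[8192];
            while (checked.read(buffer) != -1) {
                // Reading drives the checksum update; the bytes themselves are discarded.
            }
            return checked.getChecksum().getValue();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("blob", ".tmp");
        try {
            Files.write(file, "123456789".getBytes(StandardCharsets.US_ASCII));
            // "123456789" is the standard CRC-32 check input; prints "cbf43926".
            System.out.println(Long.toHexString(checksumVersion(file)));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Unlike the modification time, the value depends only on the file's bytes, so rewriting a blob with identical content leaves the version unchanged.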

+            return Arrays.hashCode(bytes);
+        }
+    }
+
     @Override
     public long getModTime() throws IOException {
         return path.lastModified();
@@ -0,0 +1,72 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version
+ * 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions
+ * and limitations under the License.
+ */
+
+package org.apache.storm.blobstore;
+
+import org.apache.storm.shade.org.apache.commons.codec.digest.DigestUtils;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+
+import java.io.File;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.nio.file.Files;
+import java.util.Arrays;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertNotEquals;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+
+class LocalFsBlobStoreFileTest {
+
+    private File tempFile;
+    private LocalFsBlobStoreFile blobStoreFile;
+
+    @BeforeEach
+    public void setUp() throws IOException {
+        tempFile = Files.createTempFile(null, ".tmp").toFile();
+        try (FileOutputStream fs = new FileOutputStream(tempFile)) {
+            fs.write("Content for SHA hash".getBytes());
+        }
+        blobStoreFile = new LocalFsBlobStoreFile(tempFile.getParentFile(), tempFile.getName());
+    }
+
+    @Test
+    void testGetVersion() throws IOException {
+        long expectedVersion = Arrays.hashCode(DigestUtils.sha1("Content for SHA hash"));
+        long actualVersion = blobStoreFile.getVersion();
+        assertEquals(expectedVersion, actualVersion, "The version should match the expected hash code.");
+    }
+
+    @Test
+    void testGetVersion_Mismatch() throws IOException {
+        long expectedVersion = Arrays.hashCode(DigestUtils.sha1("Different content"));
+        long actualVersion = blobStoreFile.getVersion();
+        assertNotEquals(expectedVersion, actualVersion, "The version shouldn't match the hash code of different content.");
+    }
+
+    @Test
+    void testGetVersion_FileNotFound() {
+        boolean deleted = tempFile.delete();
+        if (!deleted) {
+            throw new IllegalStateException("Failed to delete the temporary file.");
+        }
+        assertThrows(IOException.class, () -> blobStoreFile.getVersion(), "Should throw IOException if file is not found.");
+    }
+
+    @Test
+    void testGetModTime() throws IOException {
+        long expectedModTime = tempFile.lastModified();
+        long actualModTime = blobStoreFile.getModTime();
+        assertEquals(expectedModTime, actualModTime, "The modification time should match the expected value.");
+    }
+}