Add Stream Source #969

Closed
Tracked by #948
penghuo opened this issue Oct 26, 2022 · 0 comments

penghuo commented Oct 26, 2022

Stream source interface

  • Optional<Offset> getLatestOffset()
    • Returns the latest Offset, which maps to an S3Metadata entry. Note that the StreamSource does NOT guarantee that the Offset maps to unread files.
    • Returns empty if the stream source contains no data.
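// List every object currently visible in the file data source.
Set<Files> allFiles = fileDataSource.listAllObjects();

// Files that were not seen by the previous listing are unread.
Set<Files> unreadFiles = Sets.difference(allFiles, seenFiles);

// Remember the current listing as the seen files.
seenFiles = allFiles;

// getLatest() returns Optional<Pair<Long, FileMetadata>>; take the offset,
// or -1 if the log is empty (assumes Pair exposes getKey()).
long latestBatchId = fileMetadataLog.getLatest().map(Pair::getKey).orElse(-1L);

if (!unreadFiles.isEmpty()) {
    // Unread files exist: advance the batch id so it stays monotonically increasing.
    latestBatchId += 1;
    // Record the new batch in fileMetadataLog.
    fileMetadataLog.add(latestBatchId, new S3Metadata(unreadFiles, latestBatchId));
    return Optional.of(new Offset(latestBatchId));
} else {
    // Nothing new: return the last recorded Offset, or empty if none was ever recorded.
    return latestBatchId == -1 ? Optional.empty() : Optional.of(new Offset(latestBatchId));
}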
  • Batch getBatch(Optional<Offset> start, Offset end)
    • Returns the Batch from the stream source for the half-open Offset range (start, end]: start is exclusive, end is inclusive.
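The two methods above can be exercised with a small polling loop. The following is only a sketch under stated assumptions: the shapes of Offset and Batch, and the FakeStreamSource and MicroBatchRunner classes, are hypothetical and are not part of this issue. It illustrates the (start, end] contract and shows that the caller, not the StreamSource, decides whether the latest Offset is actually new.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical offset wrapper: a monotonically increasing batch id. */
final class Offset {
    final long batchId;
    Offset(long batchId) { this.batchId = batchId; }
}

/** Hypothetical batch: the files covered by the offsets in (start, end]. */
final class Batch {
    final List<String> files;
    Batch(List<String> files) { this.files = files; }
}

/** The two methods described in the interface section. */
interface StreamSource {
    Optional<Offset> getLatestOffset();
    Batch getBatch(Optional<Offset> start, Offset end);
}

/** Toy source for illustration: batch i holds the single file "file-i". */
final class FakeStreamSource implements StreamSource {
    private long latest = -1;

    /** Simulates the arrival of one new batch. */
    void publish() { latest++; }

    @Override
    public Optional<Offset> getLatestOffset() {
        return latest == -1 ? Optional.empty() : Optional.of(new Offset(latest));
    }

    @Override
    public Batch getBatch(Optional<Offset> start, Offset end) {
        // start is exclusive, end is inclusive; an empty start means "from the beginning".
        long from = start.map(o -> o.batchId + 1).orElse(0L);
        List<String> files = new ArrayList<>();
        for (long id = from; id <= end.batchId; id++) {
            files.add("file-" + id);
        }
        return new Batch(files);
    }
}

/** Hypothetical consumer that remembers the last offset it has read. */
final class MicroBatchRunner {
    private final StreamSource source;
    private Optional<Offset> committed = Optional.empty();

    MicroBatchRunner(StreamSource source) { this.source = source; }

    /** Runs one poll; returns the files read, or an empty list if nothing is new. */
    List<String> runOnce() {
        Optional<Offset> latest = source.getLatestOffset();
        if (!latest.isPresent()) {
            return new ArrayList<>();  // the source has no data yet
        }
        Offset end = latest.get();
        // The source may return an offset that was already read, so compare explicitly.
        if (committed.isPresent() && committed.get().batchId >= end.batchId) {
            return new ArrayList<>();
        }
        Batch batch = source.getBatch(committed, end);
        committed = Optional.of(end);
        return batch.files;
    }
}
```

Passing the last committed Offset as the exclusive start means consecutive calls never re-read a batch, which is presumably why the range is half-open.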

Stream source state maintenance

  • FileMetadataLog maintains the mapping between Offset and FileMetadata. The user of FileMetadataLog MUST keep the Offset monotonically increasing.
    • Optional<Pair<Long, FileMetadata>> getLatest(). Returns the latest Offset and its FileMetadata.
    • List<FileMetadata> get(Optional<Long> start, Optional<Long> end). Returns the list of FileMetadata for the inclusive Offset range [start, end].
    • boolean add(Long offset, T metadata). Adds an Offset and its FileMetadata.
  • SeenFiles maintains the set of files seen from the stream source so far.
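As an illustration of the state described above, here is a minimal in-memory sketch. Several details are assumptions rather than anything specified in this issue: the class names InMemoryMetadataLog and SeenFiles.markAndGetUnread are hypothetical, java.util.Map.Entry stands in for Pair, file names are plain strings, and add returning false for a non-increasing offset is just one possible meaning of the boolean result.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical in-memory log keyed by offset; T is the metadata type. */
final class InMemoryMetadataLog<T> {
    private final TreeMap<Long, T> entries = new TreeMap<>();

    /** Returns the entry with the highest offset, or empty if the log has no entries. */
    Optional<Map.Entry<Long, T>> getLatest() {
        return Optional.ofNullable(entries.lastEntry());
    }

    /** Returns metadata for offsets in the inclusive range [start, end]; an empty bound is open. */
    List<T> get(Optional<Long> start, Optional<Long> end) {
        long from = start.orElse(Long.MIN_VALUE);
        long to = end.orElse(Long.MAX_VALUE);
        if (from > to) {
            return new ArrayList<>();  // subMap would throw on an inverted range
        }
        return new ArrayList<>(entries.subMap(from, true, to, true).values());
    }

    /** Adds an entry; assumption: returns false when the offset does not increase. */
    boolean add(Long offset, T metadata) {
        if (!entries.isEmpty() && offset <= entries.lastKey()) {
            return false;
        }
        entries.put(offset, metadata);
        return true;
    }
}

/** Hypothetical tracker of every file name observed so far. */
final class SeenFiles {
    private final Set<String> seen = new HashSet<>();

    /** Returns the listed files that were not seen before and marks them as seen. */
    Set<String> markAndGetUnread(Set<String> listed) {
        Set<String> unread = new HashSet<>(listed);
        unread.removeAll(seen);
        seen.addAll(unread);
        return unread;
    }
}
```

A sorted map keeps getLatest and range reads cheap. A durable implementation would need to persist both structures, which this sketch does not attempt.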