HDFSMetadataLog — MetadataLog with Hadoop HDFS for Storage

HDFSMetadataLog is a MetadataLog that uses Hadoop HDFS for reliable storage.

Note
HDFSMetadataLog uses the path that is specified when it is created. The path is created automatically unless it already exists.

HDFSMetadataLog is created when:

HDFSMetadataLog is further customized to…​FIXME

Table 1. HDFSMetadataLog’s Available Implementations
HDFSMetadataLog Description

BatchCommitLog

CompactibleFileStreamLog

OffsetSeqLog

Table 2. HDFSMetadataLog’s Internal Registries and Counters
Name Description

fileManager

FileManager that…​FIXME

batchFilesFilter

Filter of batch files

metadataPath

The path to the metadata directory

Writing Metadata in Serialized Format — serialize Method

Caution
FIXME

deserialize Method

Caution
FIXME

createFileManager Internal Method

createFileManager(): FileManager
Caution
FIXME
Note
createFileManager is used exclusively when HDFSMetadataLog is created (which is when the internal FileManager is created).

Retrieving Metadata By Batch Id — get Method

Caution
FIXME

add Method

Caution
FIXME

Retrieving Latest Committed Batch Id with Metadata If Available — getLatest Method

getLatest(): Option[(Long, T)]
Note
getLatest is part of the MetadataLog Contract to retrieve the most recently committed batch id and the corresponding metadata, if available in the metadata storage.

getLatest requests the internal FileManager for the files in the metadata directory that match the batch files filter.

getLatest takes the batch ids (that the batch files correspond to) and sorts them in descending order.

getLatest returns the first batch id (the highest one) whose metadata can be found in the metadata storage, together with that metadata.

Note
A batch id may be present in the metadata storage and yet not be available for retrieval, in which case getLatest moves on to the next batch id.
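
The lookup described above can be sketched with a small local-filesystem stand-in. This is not Spark's implementation: LocalMetadataLogSketch, its String metadata, the java.nio-based storage and the rule that a batch file is any file whose name parses as a Long are all assumptions made for illustration.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._
import scala.util.Try

// Hypothetical stand-in for HDFSMetadataLog: String metadata on the local
// filesystem, one file per batch, named after the batch id.
final class LocalMetadataLogSketch(metadataPath: Path) {

  // Assumed batch files filter: a batch file is a file whose name parses as a Long.
  private def batchId(file: Path): Option[Long] =
    Try(file.getFileName.toString.toLong).toOption

  // Writes the metadata of a batch to a file named after the batch id.
  def add(id: Long, metadata: String): Unit =
    Files.write(metadataPath.resolve(id.toString), metadata.getBytes(UTF_8))

  // Gives None when the batch file is missing or cannot be read.
  def get(id: Long): Option[String] =
    Try(new String(Files.readAllBytes(metadataPath.resolve(id.toString)), UTF_8)).toOption

  def getLatest(): Option[(Long, String)] = {
    // 1. List the files in the metadata directory and keep the batch files only.
    val stream = Files.list(metadataPath)
    val ids =
      try stream.iterator().asScala.flatMap(file => batchId(file)).toList
      finally stream.close()
    // 2. Sort the batch ids in descending order.
    // 3. Give the first batch id whose metadata can actually be read.
    ids.sorted(Ordering[Long].reverse)
      .iterator
      .flatMap(id => get(id).map(metadata => (id, metadata)))
      .nextOption()
  }
}
```

With batches 0 and 1 added to an otherwise empty directory, getLatest() in this sketch gives batch 1 with its metadata. An entry named after a higher batch id that cannot be read is skipped, which mirrors the note above.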

Creating HDFSMetadataLog Instance

HDFSMetadataLog takes the following when created:

  • SparkSession

  • Path of the metadata log directory

HDFSMetadataLog initializes the internal registries and counters.

HDFSMetadataLog creates the path unless it already exists.
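
The "create the path unless it is already there" step can be illustrated as follows. It is a minimal sketch that uses java.nio on the local filesystem rather than the Hadoop API, and MetadataDirSketch and ensureMetadataDir are hypothetical names that are not part of Spark.

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper that mirrors the directory check done at creation time.
object MetadataDirSketch {
  def ensureMetadataDir(metadataPath: Path): Path = {
    if (!Files.exists(metadataPath)) {
      // Creates the directory together with any missing parent directories.
      Files.createDirectories(metadataPath)
    }
    metadataPath
  }
}
```

Calling ensureMetadataDir twice with the same path is safe in this sketch: the second call finds the directory and leaves it untouched.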