-
Notifications
You must be signed in to change notification settings - Fork 640
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ISSUE #4635] Implemented the functions of file source connector #4650
[ISSUE #4635] Implemented the functions of file source connector #4650
Conversation
.../src/main/java/org/apache/eventmesh/connector/file/source/connector/FileSourceConnector.java
Outdated
Show resolved
Hide resolved
@EqualsAndHashCode(callSuper = true) | ||
public class FileSourceConfig extends SourceConfig { | ||
|
||
public SourceConnectorConfig connectorConfig; | ||
private String fileName; | ||
private String filePath; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can fileName and filePath be directly obtained from the configuration file source-config.yml? Then, it will not restrict users from using specific file name and file path.
In addition, many configuration items in the configuration file source-config.yml are not used by you. Could you directly remove the configuration items that you do not need?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
@@ -36,6 +47,11 @@ public class FileSourceConnector implements Source { | |||
|
|||
private OffsetStorageReader offsetStorageReader; | |||
|
|||
private String filePath; | |||
private String fileName; | |||
private InputStream inputStream; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After having BufferedReader
variable, it seems unnecessary to declare InputStream
variable now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #4650 +/- ##
============================================
+ Coverage 17.39% 17.59% +0.20%
- Complexity 1757 1774 +17
============================================
Files 797 797
Lines 29774 29786 +12
Branches 2573 2573
============================================
+ Hits 5178 5242 +64
+ Misses 24118 24063 -55
- Partials 478 481 +3 ☔ View full report in Codecov by Sentry. |
fileName: TopicTest-%s | ||
filePath: TopicTest/%s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. May you please provide a config document here later? That would be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would appreciate it if you could test the functionality of this File Source Connector locally. I believe it is a very important job.
For a more visual representation of the test results, you can integrate testing with the File Sink Connector to see if your File Source Connector can transfer the content of one file to another.
|
||
@Data | ||
@Slf4j | ||
@EqualsAndHashCode(callSuper = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do these two annotations work?
try { | ||
String line; | ||
while ((line = bufferedReader.readLine()) != null) { | ||
ConnectRecord connectRecord = new ConnectRecord(new RecordPartition(), new RecordOffset(), System.currentTimeMillis(), line); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The offset recording seems redundant because the Source Connector you've implemented reads the entire contents of a file in a single pass and does not perform incremental reads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading the entire file may potentially impose significant memory pressure. It is necessary to set a memory limit or read the file in segments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see that this Connector implements memory limitation and fragmented reads through offset. If it is indeed achieved, could you explain how it was achieved? Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have addressed the implementation of fragmented reads through offset in the updated code. This was achieved by reading the file in chunks and recording the offset based on the last position index in the buffer array. Additionally, I ensured that the last position is updated in each iteration to maintain accurate tracking. Thank you for bringing this to my attention.
connectorConfig: | ||
connectorName: fileSource | ||
nameserver: 127.0.0.1:9877 | ||
topic: TopicTest | ||
commitOffsetIntervalMs: 5000 | ||
fileName: TopicTest-%s | ||
filePath: TopicTest/%s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When users use this File Source Connector, they will have to place local files in directories structured like XXX/2024/01/03. I think it is not user-friendly. Why not directly use the filePath and fileName configured by the user here to create input stream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
connectorConfig: | ||
connectorName: fileSource | ||
nameserver: 127.0.0.1:9877 | ||
topic: TopicTest | ||
commitOffsetIntervalMs: 5000 | ||
fileName: TopicTest-%s | ||
filePath: TopicTest/%s | ||
offsetStorageConfig: | ||
offsetStorageType: nacos | ||
offsetStorageAddr: 127.0.0.1:8848 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding these offset configurations, as I mentioned before, the File Source Connector you implemented reads all the content of the file at once, and offsets seem to have no effect. Could offsetStorageConfig
and commitOffsetIntervalMs
be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have implemented the commitOffsetIntervalMs in the updated code.
@@ -15,20 +15,13 @@ | |||
# limitations under the License. | |||
# | |||
|
|||
pubSubConfig: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was pubSubConfig
configuration removed? With this configuration removed, this Source Connector will no longer work properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
eventmesh-connectors/eventmesh-connector-file/src/main/resources/source-config.yml
Outdated
Show resolved
Hide resolved
|
||
import lombok.extern.slf4j.Slf4j; | ||
|
||
@Slf4j | ||
public class FileSourceConnector implements Source { | ||
private static final int DEFAULT_BATCH_SIZE = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the role of this variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ConnectRecord connectRecord = new ConnectRecord(recordPartition, recordOffset, timeStamp, line); | ||
connectRecords.add(connectRecord); | ||
if (timeStamp - prevTimeStamp > this.sourceConfig.getConnectorConfig().getCommitOffsetIntervalMs()) { | ||
this.commitOffset(connectRecord, lastOffset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main purpose of offSet
: Source Connector periodically persists the latest offSet
somewhere (such as Nacos, Consul, ETCD, etc.). Then, if Source Connector is abruptly interrupted and restarted, it can resume reading from the position indicated by the recorded offSet
.
However, for file reading, relying solely on offSet
is not feasible because file content does not follow a linear append-only pattern. If file modifications are considered, using offset
is more impractical.
Therefore, in this PR, I think the codes related to offSet
are not meaningful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I will update the PR accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main purpose of
offSet
: Source Connector periodically persists the latestoffSet
somewhere (such as Nacos, Consul, ETCD, etc.). Then, if Source Connector is abruptly interrupted and restarted, it can resume reading from the position indicated by the recordedoffSet
. However, for file reading, relying solely onoffSet
is not feasible because file content does not follow a linear append-only pattern. If file modifications are considered, usingoffset
is more impractical. Therefore, in this PR, I think the codes related tooffSet
are not meaningful.
@pandaapo Can RecordPartition be considered for recording the last offset position?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Through this explanation, I believe you understand the functionality of offSet
now. If you strongly wish to utilize RecordPartition
or RecordOffset
to implement the functionality of offSet
, you can share your proposed approach here first.
Assuming this is the content of a file, and Source Connector starts reading from "A" and stops when it reaches "K".
ABCDEDFG,
HIJKLMN,
OPQRST.
Your approach needs to consider some scenarios, such as:
(1) After restarting Source Connector, how can it start reading from "L"?
(2) Before Source Connector is restarted, the content of the file is modified as follows. How should it be appropriately handled after restarting Source Connector? read from "L", "A" or other position?
ABCDEDFG 123456,
HIJKLMN,
OPQRST.
or
ABC,
HIJKLMN,
OPQRST.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HarshSawarkar There is no need to consider implementing the concept of offset in the file source connector. For the ability to resume uploading from a breakpoint, you only need to retransmit the entire file content.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
@@ -28,13 +27,7 @@ connectorConfig: | |||
connectorName: fileSource | |||
nameserver: 127.0.0.1:9877 | |||
topic: TopicTest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nameserver
and topic
are actually redundant for this Connector.
while ((bytesRead = bufferedReader.read(buffer)) != -1) { | ||
String line = new String(buffer, 0, bytesRead); | ||
long timeStamp = System.currentTimeMillis(); | ||
RecordPartition recordPartition = convertToRecordPartition(this.sourceConfig.getConnectorConfig().getTopic(), fileName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The contents of this object are the same each time it is created. Either move it outside while () {}
or remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
@@ -28,13 +27,7 @@ connectorConfig: | |||
connectorName: fileSource | |||
nameserver: 127.0.0.1:9877 | |||
topic: TopicTest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nameserver
and topic
are actually redundant for this Connector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nameserver
andtopic
are actually redundant for this Connector.
Updated.
@@ -26,15 +25,7 @@ pubSubConfig: | |||
passWord: filePassWord | |||
connectorConfig: | |||
connectorName: fileSource | |||
nameserver: 127.0.0.1:9877 | |||
topic: TopicTest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
topic
is redundant, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pandaapo Sure, I will update the PR. Just for my knowledge, I wanted to know the usage of topic in the source connector. Also, in the ConnectorConfig, the topic was used only once, as in every other source-config.yml file for other source connectors. So, would removing topic be a good idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
topic
is used as part of the data source definition in other Source Connectors, such as fetching data from a topic
of Kafka, RocketMQ, or RabbitMQ. The current Connector fetches data from a file, and filePath
and fileName
already specify the data source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the PR.Can you please check now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I am now starting the CI for this PR. I would appreciate it if you could test the functionality of this File Source Connector locally. I believe it is a very important job.
this.bufferedReader = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8)); | ||
} else { | ||
this.bufferedReader = new BufferedReader(new InputStreamReader(Files.newInputStream( | ||
Paths.get(filePath, fileName)), StandardCharsets.UTF_8), BUFFER_SIZE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In your new commit
- I think there is no need to make separate judgments for
fileName
andfilePath
in this conditional judgmentif (fileName == null || fileName.isEmpty() || filePath == null || filePath.isEmpty())
. - What will
Paths.get("/path/to/example.txt", "example.txt")
return?
I would appreciate it if you could test the functionality of this File Source Connector locally. I believe it is a very important job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you submit your unit test code to this PR as well?
private OffsetStorageReader offsetStorageReader; | ||
private String filePath; | ||
private String fileName; | ||
public BufferedReader bufferedReader; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you expand the scope of access modifiers for bufferedReader
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
@@ -17,6 +17,8 @@ | |||
|
|||
dependencies { | |||
api project(":eventmesh-openconnect:eventmesh-openconnect-java") | |||
implementation 'org.junit.jupiter:junit-jupiter:5.8.1' | |||
implementation 'org.mockito:mockito-junit-jupiter:5.3.1' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please change implementation
to testImplementation
. Also, there is no need to specify the version number of these 2 jars additionally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
fileSourceConnector.start(); | ||
List<ConnectRecord> connectRecords = fileSourceConnector.poll(); | ||
fileSourceConnector.stop(); | ||
Assertions.assertEquals(content, connectRecords.get(0).getData().toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to delete the d/f/foo.txt
file created at the end of the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
import org.junit.jupiter.api.Test; | ||
import org.mockito.Mock; | ||
|
||
class FileSourceConnectorTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unit test classes should be placed in the 'src/test' directory, not 'src/main' directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
Fixes #4635 .
Motivation
File source connector class methods were not implemented.
Modifications
Successfully implemented start(), stop(), poll() methods of file source connector class.
Documentation