Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ISSUE #4635] Implemented the functions of file source connector #4650

Conversation

HarshSawarkar
Copy link
Contributor

Fixes #4635 .

Motivation

File source connector class methods were not implemented.

Modifications

Successfully implemented start(), stop(), poll() methods of file source connector class.

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
  • If a feature is not applicable for documentation, explain why?
  • If a feature is not documented yet in this PR, please create a followup issue for adding the documentation

@EqualsAndHashCode(callSuper = true)
public class FileSourceConfig extends SourceConfig {

public SourceConnectorConfig connectorConfig;
private String fileName;
private String filePath;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can fileName and filePath be directly obtained from the configuration file source-config.yml? Then, it will not restrict users from using specific file name and file path.

In addition, many configuration items in the configuration file source-config.yml are not used by you. Could you directly remove the configuration items that you do not need?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@@ -36,6 +47,11 @@ public class FileSourceConnector implements Source {

private OffsetStorageReader offsetStorageReader;

private String filePath;
private String fileName;
private InputStream inputStream;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After having BufferedReader variable, it seems unnecessary to declare InputStream variable now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Copy link

codecov bot commented Dec 15, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (b85a58c) 17.39% compared to head (9f1c144) 17.59%.
Report is 36 commits behind head on master.

❗ Current head 9f1c144 differs from pull request most recent head 3729fdb. Consider uploading reports for the commit 3729fdb to get more accurate results

Additional details and impacted files
@@             Coverage Diff              @@
##             master    #4650      +/-   ##
============================================
+ Coverage     17.39%   17.59%   +0.20%     
- Complexity     1757     1774      +17     
============================================
  Files           797      797              
  Lines         29774    29786      +12     
  Branches       2573     2573              
============================================
+ Hits           5178     5242      +64     
+ Misses        24118    24063      -55     
- Partials        478      481       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@xwm1992 xwm1992 changed the title Implemented the functions of file source connector. [ISSUE #4635]Implemented the functions of file source connector. Dec 22, 2023
Pil0tXia
Pil0tXia previously approved these changes Jan 2, 2024
Comment on lines 23 to 24
fileName: TopicTest-%s
filePath: TopicTest/%s
Copy link
Member

@Pil0tXia Pil0tXia Jan 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. May you please provide a config document here later? That would be helpful.

Copy link
Member

@pandaapo pandaapo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would appreciate it if you could test the functionality of this File Source Connector locally. I believe it is a very important job.
For a more visual representation of the test results, you can integrate testing with the File Sink Connector to see if your File Source Connector can transfer the content of one file to another.


@Data
@Slf4j
@EqualsAndHashCode(callSuper = true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do these two annotations work?

try {
String line;
while ((line = bufferedReader.readLine()) != null) {
ConnectRecord connectRecord = new ConnectRecord(new RecordPartition(), new RecordOffset(), System.currentTimeMillis(), line);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The offset recording seems redundant because the Source Connector you've implemented reads the entire contents of a file in a single pass and does not perform incremental reads.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the entire file may potentially impose significant memory pressure. It is necessary to set a memory limit or read the file in segments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see that this Connector implements memory limitation and fragmented reads through offset. If it is indeed achieved, could you explain how it was achieved? Thank you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have addressed the implementation of fragmented reads through offset in the updated code. This was achieved by reading the file in chunks and recording the offset based on the last position index in the buffer array. Additionally, I ensured that the last position is updated in each iteration to maintain accurate tracking. Thank you for bringing this to my attention.

connectorConfig:
connectorName: fileSource
nameserver: 127.0.0.1:9877
topic: TopicTest
commitOffsetIntervalMs: 5000
fileName: TopicTest-%s
filePath: TopicTest/%s
Copy link
Member

@pandaapo pandaapo Jan 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When users use this File Source Connector, they will have to place local files in directories structured like XXX/2024/01/03. I think it is not user-friendly. Why not directly use the filePath and fileName configured by the user here to create input stream?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

connectorConfig:
connectorName: fileSource
nameserver: 127.0.0.1:9877
topic: TopicTest
commitOffsetIntervalMs: 5000
fileName: TopicTest-%s
filePath: TopicTest/%s
offsetStorageConfig:
offsetStorageType: nacos
offsetStorageAddr: 127.0.0.1:8848
Copy link
Member

@pandaapo pandaapo Jan 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding these offset configurations, as I mentioned before, the File Source Connector you implemented reads all the content of the file at once, and offsets seem to have no effect. Could offsetStorageConfig and commitOffsetIntervalMs be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have implemented the commitOffsetIntervalMs in the updated code.

@@ -15,20 +15,13 @@
# limitations under the License.
#

pubSubConfig:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was pubSubConfig configuration removed? With this configuration removed, this Source Connector will no longer work properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated


import lombok.extern.slf4j.Slf4j;

@Slf4j
public class FileSourceConnector implements Source {
private static final int DEFAULT_BATCH_SIZE = 10;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the role of this variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was used in the initialization of List of ConnectRecord in here
image
It can be removed.

ConnectRecord connectRecord = new ConnectRecord(recordPartition, recordOffset, timeStamp, line);
connectRecords.add(connectRecord);
if (timeStamp - prevTimeStamp > this.sourceConfig.getConnectorConfig().getCommitOffsetIntervalMs()) {
this.commitOffset(connectRecord, lastOffset);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main purpose of offSet: Source Connector periodically persists the latest offSet somewhere (such as Nacos, Consul, ETCD, etc.). Then, if Source Connector is abruptly interrupted and restarted, it can resume reading from the position indicated by the recorded offSet.
However, for file reading, relying solely on offSet is not feasible because file content does not follow a linear append-only pattern. If file modifications are considered, using offset is more impractical.
Therefore, in this PR, I think the codes related to offSet are not meaningful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will update the PR accordingly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main purpose of offSet: Source Connector periodically persists the latest offSet somewhere (such as Nacos, Consul, ETCD, etc.). Then, if Source Connector is abruptly interrupted and restarted, it can resume reading from the position indicated by the recorded offSet. However, for file reading, relying solely on offSet is not feasible because file content does not follow a linear append-only pattern. If file modifications are considered, using offset is more impractical. Therefore, in this PR, I think the codes related to offSet are not meaningful.

@pandaapo Can RecordPartition be considered for recording the last offset position?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Through this explanation, I believe you understand the functionality of offSet now. If you strongly wish to utilize RecordPartition or RecordOffset to implement the functionality of offSet, you can share your proposed approach here first.

Assuming this is the content of a file, and Source Connector starts reading from "A" and stops when it reaches "K".

ABCDEDFG,
HIJKLMN,
OPQRST.

Your approach needs to consider some scenarios, such as:
(1) After restarting Source Connector, how can it start reading from "L"?

(2) Before Source Connector is restarted, the content of the file is modified as follows. How should it be appropriately handled after restarting Source Connector? read from "L", "A" or other position?

ABCDEDFG 123456,
HIJKLMN,
OPQRST.

or

ABC,
HIJKLMN,
OPQRST.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HarshSawarkar There is no need to consider implementing the concept of offset in the file source connector. For the ability to resume uploading from a breakpoint, you only need to retransmit the entire file content.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@@ -28,13 +27,7 @@ connectorConfig:
connectorName: fileSource
nameserver: 127.0.0.1:9877
topic: TopicTest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nameserver and topic are actually redundant for this Connector.

while ((bytesRead = bufferedReader.read(buffer)) != -1) {
String line = new String(buffer, 0, bytesRead);
long timeStamp = System.currentTimeMillis();
RecordPartition recordPartition = convertToRecordPartition(this.sourceConfig.getConnectorConfig().getTopic(), fileName);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The contents of this object are the same each time it is created. Either move it outside while () {} or remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@@ -28,13 +27,7 @@ connectorConfig:
connectorName: fileSource
nameserver: 127.0.0.1:9877
topic: TopicTest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nameserver and topic are actually redundant for this Connector.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nameserver and topic are actually redundant for this Connector.

Updated.

@@ -26,15 +25,7 @@ pubSubConfig:
passWord: filePassWord
connectorConfig:
connectorName: fileSource
nameserver: 127.0.0.1:9877
topic: TopicTest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

topic is redundant, too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pandaapo Sure, I will update the PR. Just for my knowledge, I wanted to know the usage of topic in the source connector. Also, in the ConnectorConfig, the topic was used only once, as in every other source-config.yml file for other source connectors. So, would removing topic be a good idea?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

topic is used as part of the data source definition in other Source Connectors, such as fetching data from a topic of Kafka, RocketMQ, or RabbitMQ. The current Connector fetches data from a file, and filePath and fileName already specify the data source.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the PR.Can you please check now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I am now starting the CI for this PR. I would appreciate it if you could test the functionality of this File Source Connector locally. I believe it is a very important job.

this.bufferedReader = new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
} else {
this.bufferedReader = new BufferedReader(new InputStreamReader(Files.newInputStream(
Paths.get(filePath, fileName)), StandardCharsets.UTF_8), BUFFER_SIZE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your new commit

  1. I think there is no need to make separate judgments for fileName and filePath in this conditional judgment if (fileName == null || fileName.isEmpty() || filePath == null || filePath.isEmpty()).
  2. What will Paths.get("/path/to/example.txt", "example.txt") return?

I would appreciate it if you could test the functionality of this File Source Connector locally. I believe it is a very important job.

Copy link
Contributor Author

@HarshSawarkar HarshSawarkar Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.Also I have tested the File Source Connector for start(), stop() and poll() method, locally, and it's running fine.I have attached the Ss of the tests I ran on this class below.
image
image

Copy link
Member

@pandaapo pandaapo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you submit your unit test code to this PR as well?

private OffsetStorageReader offsetStorageReader;
private String filePath;
private String fileName;
public BufferedReader bufferedReader;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you expand the scope of access modifiers for bufferedReader?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@@ -17,6 +17,8 @@

dependencies {
api project(":eventmesh-openconnect:eventmesh-openconnect-java")
implementation 'org.junit.jupiter:junit-jupiter:5.8.1'
implementation 'org.mockito:mockito-junit-jupiter:5.3.1'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change implementation to testImplementation. Also, there is no need to specify the version number of these 2 jars additionally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

fileSourceConnector.start();
List<ConnectRecord> connectRecords = fileSourceConnector.poll();
fileSourceConnector.stop();
Assertions.assertEquals(content, connectRecords.get(0).getData().toString());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to delete the d/f/foo.txt file created at the end of the test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

import org.junit.jupiter.api.Test;
import org.mockito.Mock;

class FileSourceConnectorTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit test classes should be placed in the 'src/test' directory, not 'src/main' directory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@Pil0tXia Pil0tXia changed the title [ISSUE #4635]Implemented the functions of file source connector. [ISSUE #4635] Implemented the functions of file source connector Jan 20, 2024
@Pil0tXia Pil0tXia merged commit 21e4c3a into apache:master Jan 20, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Enhancement] Implement the function of File Source Connector
4 participants