[Question]: Chunks do not make sense #1407

Closed
wartek69 opened this issue Jul 7, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@wartek69

wartek69 commented Jul 7, 2024

Describe your problem

Hi,

I have a 500+ page law document that I chunked twice: once with the general method and default settings, and once with the law method and default settings.

In both cases, the resulting chunks look like gibberish. The text in the PDF isn't like that.
What is the reason for this, and how can it be solved?

Br
Alex
Screenshot from 2024-07-07 22-53-55

@wartek69 wartek69 added the question Further information is requested label Jul 7, 2024
@guoyuhao2330
Contributor

Can you provide this law document?

@KevinHuSh
Collaborator

RAGFlow's parsing methods restructure the text blocks so that semantic units are interrupted as little as possible.

@wartek69
Author

wartek69 commented Jul 8, 2024

You can find the document here:
cesni22_27en_annex1_ES_TRIN23.pdf

I see the same behavior on shorter, non-law documents. I can understand that text blocks are restructured, but why does the text become gibberish? E.g. "bAoNiCleEr" doesn't make sense; it looks like a concatenation of parts of different words.
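
For anyone who wants to check their own chunks for this symptom, here is a minimal diagnostic sketch. It is not RAGFlow code and not the fix that was later committed; the function names and the `min_switches` threshold are hypothetical, picked only for illustration. The heuristic flags tokens whose letters flip between lower and upper case unusually often, and splits them by case to show what the two mixed-together strings might be. It only catches mixing where the two sources differ in case, so treat a clean result as inconclusive.

```python
import re


def case_switches(token: str) -> int:
    """Count how often adjacent letters in the token differ in case."""
    letters = [ch for ch in token if ch.isalpha()]
    return sum(
        1 for a, b in zip(letters, letters[1:]) if a.isupper() != b.isupper()
    )


def split_by_case(token: str) -> tuple[str, str]:
    """Return (lowercase letters, uppercase letters), each in original order."""
    lower = "".join(ch for ch in token if ch.islower())
    upper = "".join(ch for ch in token if ch.isupper())
    return lower, upper


def flag_suspicious_tokens(text: str, min_switches: int = 4) -> list[str]:
    """Return alphabetic tokens with at least `min_switches` case flips.

    The default of 4 is an arbitrary assumption: it lets ordinary
    mixed-case names such as "RAGFlow" or "JavaScript" pass.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t for t in tokens if case_switches(t) >= min_switches]


if __name__ == "__main__":
    chunk = "The bAoNiCleEr is described in the document"
    for token in flag_suspicious_tokens(chunk):
        print(token, split_by_case(token))
```

Applied to the token above, the split gives "boiler" and "ANCE", which suggests (but does not prove) that two separate text runs were merged character by character during extraction.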

@KevinHuSh
Collaborator

Got it. We're going to fix it.

@KevinHuSh KevinHuSh added bug Something isn't working and removed question Further information is requested labels Jul 9, 2024
KevinHuSh pushed a commit that referenced this issue Jul 11, 2024
### What problem does this PR solve?

#1407 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
paresh2806 pushed a commit to paresh2806/ragflow that referenced this issue Jul 11, 2024
### What problem does this PR solve?

infiniflow#1407 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
KevinHuSh pushed a commit that referenced this issue Jul 11, 2024
### What problem does this PR solve?

#1407 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
KevinHuSh pushed a commit that referenced this issue Jul 25, 2024
### What problem does this PR solve?

#1407 #1656 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
Halfknow pushed a commit to Halfknow/ragflow that referenced this issue Nov 11, 2024
### What problem does this PR solve?

infiniflow#1407 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
Halfknow pushed a commit to Halfknow/ragflow that referenced this issue Nov 11, 2024
### What problem does this PR solve?

infiniflow#1407 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)
Halfknow pushed a commit to Halfknow/ragflow that referenced this issue Nov 11, 2024
### What problem does this PR solve?

infiniflow#1407 infiniflow#1656 

### Type of change

- [x] Bug Fix (non-breaking change which fixes an issue)