Input data #7
Hello,
The project has been refreshed with all history removed. All programs are runnable except that the example data is not uploaded. You may infer the correct data format from the data_loader_json.py file. Pull requests for making the project runnable out of the box are welcome. I'll let you know when the original data format can be provided; otherwise, please feel free to create a pull request. |
Could you please tell me whether this is the correct data format? Example:
{
  "global_attributes": {
    "file_id": "file1.jpg"
  },
  "fields": [
    {
      "field_name": "class1",
      "key_id": [],
      "key_text": [],
      "value_id": [1],
      "value_text": "sample1"
    },
    {
      "field_name": "class1",
      "key_id": [],
      "key_text": [],
      "value_id": [2],
      "value_text": "sample2"
    },
    {
      "field_name": "class2",
      "key_id": [],
      "key_text": [],
      "value_id": [3],
      "value_text": "sample3"
    }
  ],
  "text_boxes": [
    {
      "id": 1,
      "bbox": [10, 10, 50, 20],
      "text": "sample1"
    },
    {
      "id": 2,
      "bbox": [55, 10, 100, 20],
      "text": "sample2"
    },
    {
      "id": 3,
      "bbox": [50, 30, 100, 40],
      "text": "sample3"
    }
  ]
}
Or maybe the correct format should look like this:
{
  "global_attributes": {
    "file_id": "file1.jpg"
  },
  "fields": [
    {
      "field_name": "class1",
      "key_id": [],
      "key_text": [],
      "value_id": [1, 2],
      "value_text": ["sample1", "sample2"]
    },
    {
      "field_name": "class2",
      "key_id": [],
      "key_text": [],
      "value_id": [3],
      "value_text": ["sample3"]
    }
  ],
  "text_boxes": [
    {
      "id": 1,
      "bbox": [10, 10, 50, 20],
      "text": "sample1"
    },
    {
      "id": 2,
      "bbox": [55, 10, 100, 20],
      "text": "sample2"
    },
    {
      "id": 3,
      "bbox": [50, 30, 100, 40],
      "text": "sample3"
    }
  ]
}
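For reference, here is a minimal sketch of how the second (list-based) variant could be loaded and sanity-checked, assuming value_id and value_text are parallel lists and every id points at an entry in text_boxes. This is only my reading of the examples above, not the repository's actual data_loader_json.py:

import json

def load_annotation(path):
    # Load one annotation file and resolve each field's value_id entries
    # against the text_boxes list (assumes the list-based format above).
    with open(path) as f:
        doc = json.load(f)

    boxes = {box["id"]: box for box in doc["text_boxes"]}
    for field in doc["fields"]:
        for box_id, text in zip(field["value_id"], field["value_text"]):
            box = boxes[box_id]
            # Each value string should match the text of the box it points to.
            assert box["text"] == text, (box_id, box["text"], text)
            print(field["field_name"], box_id, box["bbox"], text)
    return doc

# Hypothetical usage:
# load_annotation("file1.json")
|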
Hello @vsymbol, http://52.193.30.103 seems to be down. Could you provide the updated link? |
I too am unable to reach 52.193.30.103 even via ping. Can you confirm if this is up? |
Sample data file #8 (comment) |
Hi varshaneya, the link is down. |
Hi, can you please provide details on how to train and test the model? There are a whole lot of files and command-line arguments to be given. Can you please update the README to explain how this model has to be trained and tested? |
Hello @vsymbol |
To get the format, I analyzed this file: https://github.com/vsymbol/CUTIE/blob/master/data_loader_json.py. I'm training a model with these parameters:
|
@4kssoft |
I use my own software for labeling documents (https://www.youtube.com/watch?v=1okRMNxC0ec) |
@4kssoft |
@4kssoft Thanks for sharing. How can I access your tool? |
This is a beta version for now. I plan to publish this software, but not as open source |
Could you please provide the ckpt file? |
@4kssoft If possible, please provide the pretrained model that you are using! And for annotation with bounding boxes, please look into this link; it might be useful: |
@4kssoft Hi, I have my own data; I extracted the text using Tesseract OCR and got the position of each word. Can you let me know how to convert this into the format you showed in your repository for the sample PDF file Faktura1.pdf_0.json? |
@4kssoft Thanks for your suggestions. I have generated my own training datasets and I am able to train the model, but I am not sure what the input format should be to predict the result. If you know what modifications are required to get the result, please let us know. |
Hello all
@sathvikask0
@Hrishkesh
@Neelesh1121 |
Every time I try to use main_evaluate_json.py I get this error @4kssoft @samhita-alla @vsymbol: 2 root error(s) found. |
@4kssoft Do you generate your own dictionary? I don't really understand the part "Generate your own dictionary with main_build_dict.py / main_data_tokenizer.py". Can you explain how to apply this process to my own dataset? Thanks. Also, what does the ckpt_path argument refer to? |
Hello @vsymbol, can you please give a brief explanation of how to generate the texts and corresponding bounding boxes, and how to manually label each text and its bounding box? Which tools do we have to use for manual labelling? |
@4kssoft Thanks for the labeling video. Does your software export in the format required by CUTIE (the JSON template you provided), or do you have to run explicit post-processing? |
Apply any OCR tool that helps you detect and recognize words in the scanned document image.
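As an illustration of that suggestion, here is a rough sketch using pytesseract to produce word-level text_boxes in the shape of the JSON examples above. The bbox convention [left, top, right, bottom] and the output file name are assumptions, since the thread never pins them down; the fields list is left empty for later labelling.

import json

import pytesseract
from PIL import Image


def ocr_to_text_boxes(image_path):
    # Run word-level OCR and collect one text_boxes entry per detected word.
    data = pytesseract.image_to_data(Image.open(image_path),
                                     output_type=pytesseract.Output.DICT)
    text_boxes = []
    box_id = 1
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # Tesseract emits empty strings for non-word levels
        left, top = data["left"][i], data["top"][i]
        width, height = data["width"][i], data["height"][i]
        text_boxes.append({
            "id": box_id,
            # Assumed convention: [left, top, right, bottom] in pixels.
            "bbox": [left, top, left + width, top + height],
            "text": word,
        })
        box_id += 1
    return {
        "global_attributes": {"file_id": image_path},
        "fields": [],  # to be filled from ground truth or by hand
        "text_boxes": text_boxes,
    }


# Hypothetical usage:
# with open("invoice.json", "w") as f:
#     json.dump(ocr_to_text_boxes("invoice.jpg"), f, indent=2)
|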
@Hrishkesh, @sathvikask0, @Karthik1904 |
Was anyone able to train and test the model? I couldn't find how to predict on new data. |
Hi, if you have ground truth data in a different format, you can simply read that data and fill in these field_names; otherwise, you have to fill them in manually. The script I wrote only does Tesseract OCR on the image and outputs in the format that CUTIE needs.
…On Thu, Feb 11, 2021 at 5:30 PM ywsyws ***@***.***> wrote:
@Hrishkesh, @sathvikask0, @Karthik1904
Guys, I have written a simple file to run Tesseract OCR and output a JSON file in the format as in the invoice_data/ example: https://github.com/hhien/tesseract_applications.git
@hhien Thank you so much for the script. One question: it doesn't fill in the values of each field_name. Did you manually fill them in (which I doubt)? Or did you write another script to do it? Thank you for your help!
|
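A minimal sketch of the "fill in these field_names" step described in the comment above, assuming you already have the OCR output (a dict with text_boxes and an empty fields list) plus a hypothetical ground-truth mapping such as {"class1": ["sample1", "sample2"], "class2": ["sample3"]}. This is not hhien's actual script; a plain exact-text match like this would likely need fuzzy matching and smarter disambiguation on real invoices.

def fill_fields(doc, ground_truth):
    # doc: the dict produced by OCR (with "text_boxes" and an empty "fields").
    # ground_truth: hypothetical mapping of field_name -> list of value strings.
    doc["fields"] = []
    for field_name, values in ground_truth.items():
        value_id, value_text = [], []
        for value in values:
            for box in doc["text_boxes"]:
                # Naive exact match against the OCR'd word text.
                if box["text"] == value and box["id"] not in value_id:
                    value_id.append(box["id"])
                    value_text.append(box["text"])
                    break
        doc["fields"].append({
            "field_name": field_name,
            "key_id": [],
            "key_text": [],
            "value_id": value_id,
            "value_text": value_text,
        })
    return doc
|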
I'm struggling with it. So far I was able to create the .json files with hhien's code. If anyone has succeeded, I'd be thankful for any recommendation on how to train the model. |
What are the bbox entries? x1, y1, width, height? Or x1, y1 (top left), x2, y2 (bottom right)? |
I have created the JSON files in the required format. I have data for 400 invoices. main_train_json.py gets killed because it uses all the RAM. Has anyone faced this issue? I have 16 GB of RAM. |
Can anyone please share the inference script? |
Hello,
Could you provide your input data for the model to reproduce the results, or at least the input data format, so that I can try the model on my custom dataset?