Languages | Num of Images | Num of Text | Baidu Drive | Google Drive |
---|---|---|---|---|
English/Latin | 728K | ~20M | Link password: 2h8d | Link |
Multilingual | 674K | ~18M | Link password: tddl | Link |

The multilingual version covers the following 10 languages: Arabic, English, French, Chinese, German, Korean, Japanese, Italian, Bangla, and Hindi.

Both datasets are very large (~150GB), so I split them into multiple files (~130). They are organized as follows:

    ./
    +---sub_0
    |   +---imgs
    |   |       0.jpg
    |   |       1.jpg
    |   |       ...
    |   |
    |   +---labels
    |           0.json
    |           1.json
    |           ...
    |
    +---sub_1
    +---sub_2
    +---sub_3
    ...
    +---sub_100
    ...

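
The layout above can be walked with a short script. The following is a minimal sketch, not part of the released tooling: the function name `iter_samples` is hypothetical, and it assumes that every label `N.json` sits next to an image `N.jpg` in the same `sub_*` folder, as the tree suggests.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_samples(root: str) -> Iterator[Tuple[Path, Path]]:
    """Yield (image_path, label_path) pairs from every sub_* folder under root.

    Assumption: labels/N.json describes imgs/N.jpg in the same sub_* folder.
    """
    root_dir = Path(root)
    # Sort by the numeric suffix so that sub_2 comes before sub_10.
    sub_dirs = sorted(
        (d for d in root_dir.glob("sub_*") if d.is_dir() and d.name[4:].isdigit()),
        key=lambda d: int(d.name[4:]),
    )
    for sub_dir in sub_dirs:
        label_dir = sub_dir / "labels"
        if not label_dir.is_dir():
            continue
        # Keep only labels with numeric names, sorted numerically.
        label_paths = sorted(
            (p for p in label_dir.glob("*.json") if p.stem.isdigit()),
            key=lambda p: int(p.stem),
        )
        for label_path in label_paths:
            image_path = sub_dir / "imgs" / (label_path.stem + ".jpg")
            # Skip labels whose image is missing (e.g. an incomplete download).
            if image_path.exists():
                yield image_path, label_path
```

Calling `iter_samples` on the folder where the archives were uncompressed yields the pairs lazily, so the full file list never has to be held in memory.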
The labels are stored in the following format:

    {
        "imgfile": str, path to the corresponding image file, e.g. "imgs/0.jpg",
        "bbox": List[
            word_i (8 floats): [x0, y0, x1, y1, x2, y2, x3, y3]
                (from the upper-left corner, clockwise),
        ],
        "cbox": List[
            char_i (8 floats): [x0, y0, x1, y1, x2, y2, x3, y3]
                (from the upper-left corner, clockwise),
        ],
        "text": List[str]
    }

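
As an illustration of the format above, here is a minimal sketch that reads one label file and regroups each flat list of 8 floats into four (x, y) corner points. The helper names `to_quad` and `load_label` and the output keys `words` and `chars` are hypothetical; only the keys `imgfile`, `bbox`, `cbox`, and `text` come from the format description.

```python
import json
from typing import Any, Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def to_quad(flat: Sequence[float]) -> List[Point]:
    """Turn [x0, y0, x1, y1, x2, y2, x3, y3] into four (x, y) corner points."""
    if len(flat) != 8:
        raise ValueError(f"expected 8 coordinates, got {len(flat)}")
    return [(float(flat[i]), float(flat[i + 1])) for i in range(0, 8, 2)]


def load_label(label_path: str) -> Dict[str, Any]:
    """Load one label file and convert its boxes to corner-point lists."""
    with open(label_path, "r", encoding="utf-8") as f:
        raw = json.load(f)
    return {
        "imgfile": raw["imgfile"],                    # e.g. "imgs/0.jpg"
        "words": [to_quad(b) for b in raw["bbox"]],   # word-level quadrilaterals
        "chars": [to_quad(b) for b in raw["cbox"]],   # character-level quadrilaterals
        "text": raw["text"],                          # list of strings
    }
```

Corners keep the order given in the file, that is, starting from the upper-left corner and going clockwise.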
Note that a very small proportion of the labels may be wrong. These errors are caused by defects in some scene models, and the affected samples are characterized by very small sizes. You can discard them by filtering out word boxes that are less than 10 pixels high.
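
The filtering step could look like the sketch below. It is only one possible reading of "10 pixels high": the height of a word box is assumed here to be the mean length of its left and right edges, which also behaves sensibly for rotated text. The function names `box_height` and `large_enough_word_indices` are hypothetical. The function returns indices rather than a filtered label, because the format description does not state how `text` and `cbox` entries map to word boxes.

```python
import math
from typing import Any, Dict, List, Sequence


def box_height(box: Sequence[float]) -> float:
    """Height of a word box [x0, y0, x1, y1, x2, y2, x3, y3].

    Assumption: corners run clockwise from the upper-left corner, so the left
    edge joins corners 0 and 3 and the right edge joins corners 1 and 2.
    The height is taken as the mean length of these two edges.
    """
    x0, y0, x1, y1, x2, y2, x3, y3 = box
    left = math.hypot(x3 - x0, y3 - y0)
    right = math.hypot(x2 - x1, y2 - y1)
    return (left + right) / 2.0


def large_enough_word_indices(label: Dict[str, Any], min_height: float = 10.0) -> List[int]:
    """Return the indices of word boxes that are at least min_height pixels high."""
    return [i for i, box in enumerate(label["bbox"]) if box_height(box) >= min_height]
```

The returned indices can then be used to select the word boxes to keep for training.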
Scene Name | Baidu Drive | Google Drive |
---|---|---|
Realistic Rendering | Link password: wgja | Link |

How-to:
- download and uncompress the project
- in UE4.22, load the following file: Demo/Demo.uproject

Resources | Baidu Drive | Google Drive |
---|---|---|
background images | Link password: 3x3r | Link |
fonts & corpus | Link password: ip8w | Link |

Scenes | Baidu Drive | Google Drive |
---|---|---|
All 30 scene executables | Link password: br31 | Link |

How-to:
- download and uncompress the project
- cd to $Name/$Name/Binaries/Linux/ and double-click the executable ./Demo
- alternatively, you can launch it from a terminal: ./$Name/$Name/Binaries/Linux/Demo