raw pdf files #3

kailigo · 2019-09-15T02:54:09Z

Nice work, and I appreciate the effort of making the dataset publicly available.

I am working on document object detection and would like to use the content of the documents to help boost detection performance. For that, I need the raw PDF files (from which you generated the images) to extract some content information. Could you please release them as well?

I think one of the major differences between object detection in document images and in natural images is that documents contain auxiliary text information that is absent from natural images. Incorporating this auxiliary information should help achieve better detection results. It should also benefit some NLP+CV tasks. Thanks.

zhxgj · 2019-09-15T23:07:47Z

@kailigo it is a good point. I do have the PDFs, but need to confirm with our legal team if I can share them because the original discussions were only around images.

kailigo · 2019-09-16T00:55:31Z

@zhxgj Thanks. Looking forward to hearing your update.

kailigo · 2019-09-17T22:02:59Z

Hi @zhxgj. It seems that the PDF files are publicly available. I can download them myself if the legal permission takes a long time. Could you help me identify which PDF files you used in your dataset, for example by providing an index or a list of file names? Thanks very much.

kailigo · 2019-09-21T01:54:21Z

Hi @zhxgj, I have downloaded the raw PDFs from the PMC website. How can I find the correspondence between the PDF files and the images? It appears that you renamed the images when they were generated from the PDF files. Could you tell me your rules for naming the images? Thanks.

zhxgj · 2019-09-26T01:58:08Z

Hi @kailigo, it is great that you managed to download the PDFs. The filenames of the images in PubLayNet are formatted as "_.pdf"

dwalton76 · 2019-10-15T23:18:21Z

@zhxgj any update on making the pdfs available for download? Thanks!

zhxgj · 2019-10-20T22:48:15Z

@dwalton76 We are working with our legal team to host the data on a different platform that supports the AWS S3 API. The PDF pages are part of the conversation. Once all the legal processes are approved, I think we should have the current data and the PDF pages available on the new platform.

ghost · 2020-01-06T19:08:10Z

@kailigo Upload a link to your downloaded pdf dataset

theCodinCowboy · 2020-06-24T17:33:20Z

Did y'all ever release the PDF dataset that corresponds to the images and annotations?

zhxgj · 2020-06-26T05:13:53Z

Hi @theCodinCowboy , I have the PDFs ready to release. I will follow up with our legal team to double check if I can share them

theCodinCowboy · 2020-06-29T15:47:36Z

What's the ETA on when legal will have a judgement @zhxgj? I was planning to use the PDFs this week if possible. If not, I am going to try to get them myself from PMC.

zhxgj · 2020-06-30T01:42:34Z

> What's the ETA on when legal will have a judgement @zhxgj? I was planning to use the PDFs this week if possible. If not, I am going to try to get them myself from PMC.

We have submitted a ticket for approval. Normally this is reviewed quickly, but I do not think it will be done this week. My best guess is in two weeks.

theCodinCowboy · 2020-07-20T14:16:39Z

Hi @zhxgj just following up on this request. Any updates?

KaiHu-KH · 2020-07-30T04:24:14Z

Hi @theCodinCowboy, I have downloaded some raw PDFs from PMC, but some papers have been retracted or updated, and I have not found the old versions of these PDFs. Have you been able to find these papers? @zhxgj, could you release all the raw PDFs you used in this dataset? It is difficult to download retracted or updated papers and to locate all the updated papers in this dataset. Thanks~

zhxgj · 2020-07-31T05:47:27Z

Hi @theCodinCowboy @YueshangGu, our legal team has approved the release of the PDFs. I am chasing up our open data team to get the PDFs online. It shouldn't take long.

KaiHu-KH · 2020-08-03T02:05:25Z

@zhxgj Thank you for your help. In addition to the raw PDFs, could you release all the raw XML files you used for this dataset? Thanks~

zhxgj · 2020-08-07T01:58:56Z

The PDFs of the document pages in PubLayNet have been released.

zhxgj · 2020-08-07T02:00:00Z

> @zhxgj Thank you for your help. In addition to the raw PDFs, could you release all the raw XML files you used for this dataset? Thanks~

@ajjimeno would you please be able to check with @kmh4321 on this?
