raw pdf files #3

Open
kailigo opened this issue Sep 15, 2019 · 18 comments

Comments

@kailigo

kailigo commented Sep 15, 2019

Nice work, and thank you for the effort of making the dataset publicly available.

I am working on document object detection and would like to use the content of the documents to help boost detection performance. So I need the raw PDF files (from which you generated the images) to extract some content information. Could you please release them as well?

I think one of the major differences between object detection in document images and in natural images is that documents contain auxiliary text information that is absent from natural images. Incorporating this auxiliary information should help achieve better detection results. It should also benefit some NLP+CV tasks. Thanks.

@zhxgj
Contributor

zhxgj commented Sep 15, 2019

@kailigo it is a good point. I do have the PDFs, but I need to confirm with our legal team whether I can share them, because the original discussions were only about images.

@kailigo
Author

kailigo commented Sep 16, 2019

@zhxgj Thanks. Looking forward to hearing your update.

@kailigo
Author

kailigo commented Sep 17, 2019

Hi @zhxgj. It seems that the PDF files are publicly available. I can download them myself if the legal permission takes a long time. So, could you help me identify which PDF files you used in your dataset, for example by providing an index or the file names? Thanks very much.

@kailigo
Author

kailigo commented Sep 21, 2019

Hi @zhxgj, I have downloaded the raw PDFs from the PMC website. How can I find the correspondence between the PDF files and the images? Apparently you renamed the images when they were generated from the PDF files. Could you tell me your rules for naming the images? Thanks.

@zhxgj
Contributor

zhxgj commented Sep 26, 2019

Hi @kailigo It is great that you managed to download the PDFs. The filenames of the images in PubLayNet are formatted as "_.pdf"
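
The placeholders in the format string above appear to have been lost in this copy of the thread, so the exact pattern cannot be recovered here. As a rough sketch only, assuming (hypothetically) that each name consists of an article identifier and a page index joined by a single underscore, a name could be split as follows. The example name `PMC1234567_00002.jpg`, the function `split_page_name`, and the meaning of the two fields are illustrative assumptions, not the confirmed PubLayNet convention.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class PageName(NamedTuple):
    article_id: str  # assumed: identifier of the source article
    page: str        # assumed: page index within the article's PDF, kept as text


def split_page_name(filename: str) -> PageName:
    """Split '<prefix>_<suffix>.<ext>' at the last underscore (assumed layout)."""
    stem = PurePosixPath(filename).stem
    prefix, sep, suffix = stem.rpartition("_")
    # Reject names that do not have a non-empty part on both sides of the underscore.
    if not sep or not prefix or not suffix:
        raise ValueError(f"unexpected file name: {filename!r}")
    return PageName(prefix, suffix)


if __name__ == "__main__":
    # Hypothetical file name, used only to illustrate the split.
    print(split_page_name("PMC1234567_00002.jpg"))
```

Keeping the page index as text avoids guessing whether it is zero-padded or zero-based; convert it only once the real convention is known.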

@dwalton76

@zhxgj any update on making the pdfs available for download? Thanks!

@zhxgj
Contributor

zhxgj commented Oct 20, 2019

@dwalton76 We are working with our legal team to host the data on a different platform that supports the AWS S3 API. The PDF pages are part of the conversation. Once all the legal processes are approved, I think we should have the current data and the PDF pages available on the new platform.

@ghost

ghost commented Jan 6, 2020

@kailigo Please upload a link to your downloaded PDF dataset.

@theCodinCowboy

Did y'all ever release the PDF dataset that corresponds to the images and annotations?

@zhxgj
Contributor

zhxgj commented Jun 26, 2020

Hi @theCodinCowboy, I have the PDFs ready to release. I will follow up with our legal team to double-check whether I can share them.

@theCodinCowboy

What's the ETA on when legal will have a judgement @zhxgj? I was planning to use the PDFs this week if possible. If not, I am going to try to get them myself from PMC.

@zhxgj
Contributor

zhxgj commented Jun 30, 2020

> What's the ETA on when legal will have a judgement @zhxgj? I was planning to use the PDFs this week if possible. If not, I am going to try to get them myself from PMC.

We have submitted a ticket for approval. Normally this is reviewed quickly, but I do not think it will be done this week. My best guess is in two weeks.

@theCodinCowboy

Hi @zhxgj just following up on this request. Any updates?

@KaiHu-KH

Hi @theCodinCowboy, I have downloaded some raw PDFs from PMC, but some papers have been retracted or updated, and I have not found the old versions of those PDFs. Can you find these papers? @zhxgj Could you release all the raw PDFs you used in this dataset? It is difficult to download retracted or updated papers and to locate all the updated papers in this dataset. Thanks~
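
Part of the bookkeeping described here, working out which articles still have no downloaded PDF, could be sketched roughly as below. This is only an illustration under assumptions not confirmed in the thread: that each image name starts with an article identifier followed by an underscore, and that each downloaded PDF is saved as `<article identifier>.pdf`. The function names and the sample file names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Set


def article_ids_from_images(image_names: Iterable[str]) -> Set[str]:
    """Collect the assumed article-identifier prefix of each image name."""
    ids = set()
    for name in image_names:
        # Assumed layout: '<article id>_<page>.<ext>', split at the last underscore.
        prefix, sep, _ = PurePosixPath(name).stem.rpartition("_")
        if sep and prefix:
            ids.add(prefix)
    return ids


def missing_articles(image_names: Iterable[str], pdf_names: Iterable[str]) -> List[str]:
    """Return article ids that appear in the images but have no downloaded PDF."""
    needed = article_ids_from_images(image_names)
    # Assumed layout: each downloaded PDF is named '<article id>.pdf'.
    have = {PurePosixPath(name).stem for name in pdf_names}
    return sorted(needed - have)


if __name__ == "__main__":
    # Hypothetical names, used only to illustrate the comparison.
    images = ["PMC111_00000.jpg", "PMC111_00001.jpg", "PMC222_00003.jpg"]
    pdfs = ["PMC111.pdf"]
    print(missing_articles(images, pdfs))
```

A check like this only finds articles with no file at all. It cannot tell whether a downloaded PDF is a later revision than the one the images were rendered from.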

@zhxgj
Contributor

zhxgj commented Jul 31, 2020

Hi @theCodinCowboy @YueshangGu our legal team has approved the release of the PDFs. I am chasing up our open data team to get the PDFs online. It shouldn't take long.

@KaiHu-KH

KaiHu-KH commented Aug 3, 2020

@zhxgj Thank you for your help. In addition to the raw PDFs, could you release all the raw XML files you used for this dataset? Thanks~

@zhxgj
Contributor

zhxgj commented Aug 7, 2020

The PDFs of the document pages in PubLayNet are released

@zhxgj
Contributor

zhxgj commented Aug 7, 2020

> @zhxgj Thank you for your help. In addition to the raw PDFs, could you release all the raw XML files you used for this dataset? Thanks~

@ajjimeno would you please be able to check with @kmh4321 on this?
