raw pdf files #3
@kailigo That is a good point. I do have the PDFs, but I need to confirm with our legal team whether I can share them, because the original discussions covered only the images.
@zhxgj Thanks. Looking forward to your update.
Hi @zhxgj. It seems that the PDF files are publicly available, so I can download them myself if the legal permission takes a long time. Could you help me identify which PDF files you used in your dataset, for example by providing an index or a list of file names? Thanks very much.
Hi @zhxgj, I have downloaded the raw PDFs from the PMC website. How can I find the correspondence between the PDF files and the images? Apparently the images were renamed when they were generated from the PDF files. Could you tell me your rules for naming the images? Thanks.
Hi @kailigo, it is great that you managed to download the PDFs. The filenames of the images in PubLayNet are formatted as "_.pdf"
@zhxgj Any update on making the PDFs available for download? Thanks!
@dwalton76 We are working with our legal team to host the data on a different platform that supports the AWS S3 API. The PDF pages are part of that conversation. Once all the legal processes are approved, I think we should have the current data and the PDF pages available on the new platform.
@kailigo Could you upload a link to your downloaded PDF dataset?
Did y'all ever release the PDF dataset that corresponds to the images and annotations?
Hi @theCodinCowboy, I have the PDFs ready to release. I will follow up with our legal team to double-check whether I can share them.
@zhxgj What's the ETA on a decision from legal? I was planning to use the PDFs this week if possible. If not, I am going to try to get them myself from PMC.
We have submitted a ticket for approval. Normally this is reviewed quickly, but I do not think it will be done this week. My best guess is two weeks.
Hi @zhxgj, just following up on this request. Any updates?
Hi @theCodinCowboy, I have downloaded some of the raw PDFs from PMC, but some papers have been retracted or updated, and I have not been able to find the old versions. Were you able to find these papers? @zhxgj, could you release all the raw PDFs used in this dataset? It is difficult to download retracted or updated papers and to identify every updated paper in the dataset. Thanks~
Hi @theCodinCowboy @YueshangGu, our legal team has approved the release of the PDFs. I am chasing up our open data team to get the PDFs online. It shouldn't take long.
@zhxgj Thank you for your help. In addition to the raw PDFs, could you also release all the raw XML files used for this dataset? Thanks~
The PDFs of the document pages in PubLayNet have been released.
Nice work, and I appreciate the effort to make the dataset publicly available.
I am working on document object detection and would like to use the content of the documents to help boost detection performance. For that, I need the raw PDF files (from which you generated the images) to extract content information. Could you please release them as well?
I think one of the major differences between object detection in document images and in natural images is that documents contain auxiliary text information that is absent from natural images. Incorporating this auxiliary information should help achieve better detection results. It should also benefit some NLP+CV tasks. Thanks.