Improve BibTeX-from-PDF import #11999

Open
koppor opened this issue Oct 16, 2024 · 4 comments
Labels: 📍 Assigned (assigned by assign-issue-action or manually assigned); good first issue (an issue intended for project newcomers; varies in difficulty)

Comments

@koppor
Member

koppor commented Oct 16, 2024

!! This is more of an issue for experimenting with heuristics: how can a machine with "traditional" (non-AI) code extract useful information? !!

When importing the PDF se2paper.pdf, one gets the following BibTeX entry:

@InProceedings{How,
  author   = {On How and We Can and Teach and Exploring New and Ways in and Professional Software and Development for Students},
  title    = {Microsoft Word - ieee_on_how_we_teach_jul_01.docx},
  abstract = {— Requirements and approaches for introductory
courses in software development at universities differ II. SETTING THE STAGE: SOFTWARE DEVELOPMENT AT 
considerably. There seems to be little consensus on which HDM 
languages are a good fit, which methodologies lead to the best 
...
  file     = {:C\:/Users/koppor/Downloads/se2paper-1.pdf:PDF},
}

However, the title should be better:

Image

The properties of the file show

Image


Tasks:

  1. If the title from the text importer is "better", the title from the PDF properties should not be used (see org.jabref.logic.importer.fileformat.PdfMergeMetadataImporter#importDatabase(java.nio.file.Path)).
  2. Improve abstract parsing. Maybe stripper.setSortByPosition(true); needs to be removed from org.jabref.logic.importer.fileformat.PdfContentImporter#getFirstPageContents. Maybe two methods are needed: one to parse the title (depending on position) and one to parse the abstract (depending more on content).
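
One possible heuristic for task 1, as a minimal sketch only: treat a metadata title as "worse" when it looks like a file name left behind by an export tool (for example, the "Microsoft Word - ….docx" pattern from the entry above). The class `TitleHeuristic`, its methods, and the list of file extensions are hypothetical and not part of JabRef; the sketch uses plain JDK classes and does not touch `PdfMergeMetadataImporter`.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not an existing JabRef class.
final class TitleHeuristic {

    // Assumption: a title ending in one of these extensions is really a file name.
    private static final Pattern FILE_EXTENSION =
            Pattern.compile(".*\\.(docx?|pdf|tex|dvi|odt|rtf)$");

    private TitleHeuristic() {
    }

    /** Returns true if the title is missing or looks like a generated file name. */
    static boolean looksGenerated(String title) {
        if (title == null || title.isBlank()) {
            return true;
        }
        String normalized = title.strip().toLowerCase(Locale.ROOT);
        return normalized.startsWith("microsoft word - ")
                || FILE_EXTENSION.matcher(normalized).matches();
    }

    /**
     * Keeps the metadata title by default and switches to the content title
     * only when the metadata title looks generated and the content title does not.
     */
    static String chooseTitle(String metadataTitle, String contentTitle) {
        if (looksGenerated(metadataTitle) && !looksGenerated(contentTitle)) {
            return contentTitle.strip();
        }
        return metadataTitle == null ? "" : metadataTitle.strip();
    }
}
```

With this sketch, `TitleHeuristic.chooseTitle("Microsoft Word - ieee_on_how_we_teach_jul_01.docx", contentTitle)` would return the content title whenever the text importer produced a non-empty one that is not itself a file name. The extension list will produce false positives for real titles that end in such a suffix, so it would need test cases before being relied on.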

Hint:

  • Rely on and add test cases to org.jabref.logic.importer.fileformat.PdfContentImporterTest and org.jabref.logic.importer.fileformat.PdfMergeMetadataImporterTest
koppor added the "good first issue" label Oct 16, 2024
github-project-automation bot moved this to "Free to take" in Good First Issues Oct 16, 2024
@leaf-soba
Contributor

I want to take this issue.

@koppor
Member Author

koppor commented Oct 18, 2024

/assign @leaf-soba

github-actions bot added the "📍 Assigned" label Oct 18, 2024
Contributor

👋 Hey @,

Thanks for your interest in this issue! 🎉

Newcomers, we're excited to have you on board. Start by exploring our Contributing guidelines, and don't forget to check out our workspace setup guidelines to get started smoothly.

In case you encounter failing tests during development, please check our developer FAQs!

Having any questions or issues? Feel free to ask here on GitHub. Need help setting up your local workspace? Join the conversation on JabRef's Gitter chat. And don't hesitate to open a (draft) pull request early on to show the direction it is heading towards. This way, you will receive valuable feedback.

⚠ Note that this issue will become unassigned if it isn't closed within ** days**.

🔧 A maintainer can also add the **** label to prevent it from being unassigned automatically.

Happy coding! 🚀

@leaf-soba
Contributor

I want to check: is the next step #12139, or getting the author/abstract parsed correctly?
