Key generator new field marker #525

Open
funnym0nk3y opened this issue Dec 29, 2021 · 15 comments

Comments

@funnym0nk3y
Contributor

Is your suggestion for improvement related to a problem? Please describe.
Some authors have double names separated by a whitespace or a hyphen which generates long citation keys. Truncating the keys after N characters leads to more or less cryptic keys.

Describe the solution you'd like
To get more natural keys, I'd like a key marker that truncates the author's last name after the first non-ASCII character. I tried achieving this with a regex but did not succeed.

@ThiloteE
Member

I have not tried it, so the syntax could be wrong. Feel free to adapt.

  1. Capture the author field from library file: [auth]
  2. Then we use the regex replacement function: :regex("pattern", "replacement")
  3. in combination with a Lookbehind assertion: (?<=y)x (Matches "x" only if "x" is preceded by "y")
  4. and the ASCII [a-z A-Z 0-9] and non-ASCII [^a-z A-Z 0-9] characters:

Something like this?

[auth:regex("(?<=[^a-z A-Z 0-9])[a-z A-Z 0-9]", "")]
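A candidate like this can be checked outside JabRef with plain `java.util.regex`. The sketch below is only an assumption about how to test it: the class and method names are hypothetical, and it ignores any cleaning JabRef may apply to the author field before the modifier runs.

```java
import java.util.regex.Pattern;

final class FirstSuggestionCheck {
    // The pattern from the suggestion above, unchanged (it contains no backslashes).
    private static final Pattern CANDIDATE =
            Pattern.compile("(?<=[^a-z A-Z 0-9])[a-z A-Z 0-9]");

    // Hypothetical helper: replaces every match with the empty string.
    static String apply(String lastName) {
        return CANDIDATE.matcher(lastName).replaceAll("");
    }

    public static void main(String[] args) {
        // "\u00F3" is the letter o with an acute accent.
        System.out.println(apply("Puente-Le\u00F3n"));
        System.out.println(apply("Puente Le\u00F3n"));
    }
}
```

Each match is a single character that follows a character outside the class, so only isolated letters are dropped (here the "L" after the hyphen and the "n" after the accented letter) rather than the whole second name part. The space is inside both character classes, so it never triggers the lookbehind.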


@funnym0nk3y
Contributor Author

funnym0nk3y commented Jan 15, 2022

I found a regex ((?<=\p{IsLatin})\b.*) that works for non-word characters e.g.

Puente León -> Puente
Puente-León -> Puente
Puente.León -> Puente

but still need one to match

PuenteLeón -> Puente
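The three separator cases can be reproduced in plain Java as a minimal sketch. The class and method names are hypothetical, and the backslashes are doubled only because the pattern sits in a Java string literal. The unseparated "PuenteLeón" case is left out on purpose: how `\b` treats non-ASCII letters has changed between Java releases, so the result there depends on the runtime.

```java
import java.util.regex.Pattern;

final class TruncateAtSeparator {
    // Regex from the comment above: everything from the first word boundary
    // that directly follows a Latin-script letter is matched.
    private static final Pattern TAIL = Pattern.compile("(?<=\\p{IsLatin})\\b.*");

    // Hypothetical helper: drops the matched tail.
    static String truncate(String lastName) {
        return TAIL.matcher(lastName).replaceAll("");
    }

    public static void main(String[] args) {
        // "\u00F3" is the letter o with an acute accent.
        String[] names = {"Puente Le\u00F3n", "Puente-Le\u00F3n", "Puente.Le\u00F3n"};
        for (String name : names) {
            System.out.println(name + " -> " + truncate(name));
        }
    }
}
```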

@ThiloteE
Member

Would this work?

[auth:regex("(?<=\p{IsLatin})\b.*|[A-Z].*", "")]

Explanation:

| is OR
[A-Z] capital letters

@funnym0nk3y
Contributor Author

@ThiloteE Unfortunately not. But changing it a little leads to (?<=\p{IsLatin})(\b|\p{Upper}).* which works on regex101 but not in JabRef

@ThiloteE
Member

ThiloteE commented Jan 15, 2022

How about this one:

(?<=\p{IsLatin})\b.*|(?<=\p{IsLatin}{2})\p{Upper}.*


Alternatively this one seems nice too:

(?<=.\p{IsLatin})\b.*|(?<=\p{IsLatin}{2})\p{Upper}.*


Do they work with JabRef?

@funnym0nk3y
Contributor Author

@ThiloteE No, unfortunately not.

@k3KAW8Pnf7mkmdSMPHz27
Member

k3KAW8Pnf7mkmdSMPHz27 commented Jan 17, 2022

Some (hopefully helpful) thoughts

  • Bracketed patterns/citation key patterns are in need of refactoring so you can, unfortunately, not always rely on the text being what you expect it to be when the modifier is applied. For citation key generation, in particular, I believe they are cleaned of unwanted/disallowed characters before the modifiers are applied. Regarding -, make sure it is not in the unwanted characters list https://docs.jabref.org/setup/citationkeypatterns#removing-unwanted-characters
  • You are likely going to have to use \\ instead of \ in the regexps, as Java treats \ as an escape character
  • I'd probably try matching - directly in the regex, or [auth:regex("\\p{Punct}.*", "")] if that isn't sufficient
  • A VERY verbose option is to use the [authN_M] bracketed pattern if you want to make sure your regexp is only targeting one author's last name. If the Nth author doesn't exist they should evaluate to "" so that [auth5_99:regexp("(?<=\\p{IsLatin})\\b.*|(?<=\\p{IsLatin}{2})\\p{Upper}.*","")] will apply correctly to the 5th author if it exists (and has fewer than 99 chars in their last name).

I.e., [auth1_99:regexp("(?<=\\p{IsLatin})\\b.*|(?<=\\p{IsLatin}{2})\\p{Upper}.*","")][auth2_99:regexp("(?<=\\p{IsLatin})\\b.*|(?<=\\p{IsLatin}{2})\\p{Upper}.*","")][auth3_99:regexp("(?<=\\p{IsLatin})\\b.*|(?<=\\p{IsLatin}{2})\\p{Upper}.*","")][auth4_99:regexp("(?<=\\p{IsLatin})\\b.*|(?<=\\p{IsLatin}{2})\\p{Upper}.*","")][auth5_99:regexp("(?<=\\p{IsLatin})\\b.*|(?<=\\p{IsLatin}{2})\\p{Upper}.*","")]

If I managed to copy the regexp correctly
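To illustrate the backslash point in plain Java: inside a Java string literal, `\\` stands for one backslash, so the literal `"\\p{Punct}.*"` holds the regex `\p{Punct}.*`. The sketch below shows only that string-literal rule, using the `\p{Punct}` suggestion from the list above. The class and method names are hypothetical, and how JabRef itself unescapes a bracketed pattern is not shown in this thread.

```java
final class PunctTruncation {
    // Two backslashes in source code become one backslash in the actual regex.
    static final String REGEX = "\\p{Punct}.*";

    // Hypothetical helper: removes the first punctuation character and everything after it.
    static String truncate(String lastName) {
        return lastName.replaceAll(REGEX, "");
    }

    public static void main(String[] args) {
        // Prints the regex as the regex engine sees it, with a single backslash.
        System.out.println(REGEX);
        // "\u00F3" is the letter o with an acute accent.
        System.out.println(truncate("Puente-Le\u00F3n"));
        // A space is not in \p{Punct}, so this name is left unchanged.
        System.out.println(truncate("Puente Le\u00F3n"));
    }
}
```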

@ThiloteE
Member

ThiloteE commented Jan 17, 2022

I wanted to do some tests first and haven't had time yet, but I read (here: https://www.regular-expressions.info/unicode.html) that "Java 7 adds support for Unicode scripts. Unlike the other flavors, Java 7 requires the “Is” prefix." JabRef is built with/on JavaFX, so I assume it uses Java-based regex.

Therefore, replacing \p{Upper} with \\p{IsUppercase} or \\p{IsUpper} might also be necessary.

Edit:

\p{Upper} seems only to be for ASCII uppercase letters.
If you want to detect Unicode Uppercase letters, \p{Lu} is the proper command.
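The difference is easy to confirm with `java.util.regex` directly. This is a small sketch with hypothetical class and method names; it relies on the behaviour documented for `Pattern`, where `\p{Upper}` is ASCII-only unless the `UNICODE_CHARACTER_CLASS` flag is set.

```java
import java.util.regex.Pattern;

final class UpperVsLu {
    // Hypothetical helper: true if the whole input matches the regex with the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String eAcute = "\u00C9"; // capital E with an acute accent

        // POSIX class: ASCII A-Z only by default.
        System.out.println(matches("\\p{Upper}", 0, eAcute));
        // Unicode general category "uppercase letter".
        System.out.println(matches("\\p{Lu}", 0, eAcute));
        // With the flag, the POSIX class follows the Unicode definition.
        System.out.println(matches("\\p{Upper}", Pattern.UNICODE_CHARACTER_CLASS, eAcute));
    }
}
```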

@Siedlerchr
Member

https://docs.oracle.com/en/java/javase/16/docs/api/java.base/java/util/regex/Pattern.html

@funnym0nk3y
Contributor Author

funnym0nk3y commented Jan 17, 2022

@k3KAW8Pnf7mkmdSMPHz27 Escaping the \ did the trick, thanks!

I don't know if this is a common problem, but if so, it could be integrated into the application. Just if you think that is a useful feature to have...

@Siedlerchr
Member

Maybe one of you can add a hint regarding the backslash and the Character class thing to the docs https://github.com/JabRef/user-documentation

funnym0nk3y referenced this issue in funnym0nk3y/user-documentation Jan 17, 2022
@funnym0nk3y
Contributor Author

The more I think about it, couldn't it be solved with something like replacing ' with "? But then I remember I am a Java noob, so...

@ThiloteE
Member

ThiloteE commented Jan 24, 2022

@funnym0nk3y I tried some lookbehind, lookahead and other conditional (IF ELSE) constructs in linked file search, but I could not make it work (even though we know it works on regex101).

If I may ask, was it working for you in JabRef after all, and if yes, could you post the full solution here?

@funnym0nk3y
Contributor Author

@ThiloteE It did work for me with this exact regex: [auth:regex("(?<=.\\p{IsLatin})\\b.*|(?<=\\p{IsLatin}{2})\\p{Lu}.*", "")][shortyear]
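For anyone who wants to check the regex part of this solution outside JabRef, here is a minimal Java sketch. The class and method names are hypothetical, and it exercises only the regular expression, not JabRef's `[auth]` extraction, its character cleaning, or `[shortyear]`.

```java
import java.util.regex.Pattern;

final class LastNameShortener {
    // First alternative: cut at a word boundary after at least two characters.
    // Second alternative: cut at an uppercase letter that follows two Latin letters.
    private static final Pattern TAIL = Pattern.compile(
            "(?<=.\\p{IsLatin})\\b.*|(?<=\\p{IsLatin}{2})\\p{Lu}.*");

    // Hypothetical helper: drops the matched tail of a last name.
    static String shorten(String lastName) {
        return TAIL.matcher(lastName).replaceAll("");
    }

    public static void main(String[] args) {
        // "\u00F3" is the letter o with an acute accent.
        String[] names = {
            "Puente Le\u00F3n", "Puente-Le\u00F3n", "PuenteLe\u00F3n", "Smith", "McDonald"
        };
        for (String name : names) {
            System.out.println(name + " -> " + shorten(name));
        }
    }
}
```

One side effect worth knowing: because the second alternative cuts at any uppercase letter after two Latin letters, a name with an internal capital such as "McDonald" is shortened to "Mc".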

@koppor
Member

koppor commented Oct 15, 2024

Note that I removed some escapings in JabRef/jabref#11967. Feels much cleaner now (at least for me ^^)
