Bulk metadata corrections 2025-02-08 (#4588)
* Process metadata corrections for 2025.genaidetect-1.10 (closes #4579)

* Process metadata corrections for 2025.mcg-1.4 (closes #4578)

* Process metadata corrections for 2023.emnlp-main.212 (closes #4576)

* Process metadata corrections for 2024.findings-acl.220 (closes #4572)

* Process metadata corrections for 2024.findings-eacl.156 (closes #4571)

* Process metadata corrections for 2025.finnlp-1.30 (closes #4570)

* Process metadata corrections for 2022.findings-acl.21 (closes #4567)

* Process metadata corrections for 2025.comedi-1.6 (closes #4564)

* Process metadata corrections for 2025.coling-main.535 (closes #4563)

* Process metadata corrections for 2022.emnlp-main.788 (closes #4562)

* Process metadata corrections for 2024.acl-long.191 (closes #4556)

* Process metadata corrections for 2024.acl-srw.29 (closes #4555)

* Process metadata corrections for 2024.conll-1.17 (closes #4554)

* Process metadata corrections for 2024.emnlp-main.59 (closes #4551)

* Process metadata corrections for 2020.nlposs-1.2 (closes #4550)

* Process metadata corrections for 2022.naacl-main.13 (closes #4548)

* Process metadata corrections for 2025.genaidetect-1.31 (closes #4544)

* Process metadata corrections for 2024.ltedi-1.16 (closes #4543)

* Process metadata corrections for 2024.wnut-1.5 (closes #4542)

* Process metadata corrections for 2024.figlang-1.8 (closes #4521)

* Handle errors in script

- No title or abstract in frontmatter
- Print issue number when JSON fails
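For context on the error-handling fixes: correction issues carry their changes in a fenced JSON block in the issue body. Below is a minimal sketch of the new parsing path, with a hypothetical issue body, issue number, and field names (it is not the script itself):

import json
import re
import sys

issue_number = 4551  # hypothetical issue number, for illustration only
issue_body = (
    "Some text from the metadata-correction issue template\r\n"
    "```json\n"
    '{"anthology_id": "2024.emnlp-main.59", "title": "A corrected title"}\n'
    "```"
)

# the issue body arrives with \r\n line endings, so strip the \r first
issue_body = issue_body.replace("\r", "")

json_block = None
if (match := re.search(r"```json\n(.*?)\n```", issue_body, re.DOTALL)) is not None:
    try:
        json_block = json.loads(match[1])
    except json.decoder.JSONDecodeError as e:
        # the failure is now reported with the issue number instead of being swallowed
        print(f"Failed to parse JSON block in #{issue_number}: {e}", file=sys.stderr)

print(json_block)

Moving the try/except to the call site means a malformed block is logged with its issue number and the issue is then skipped, rather than failing silently inside the parser.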
mjpost authored Feb 9, 2025
1 parent 4ad7489 commit 6d3e96c
Showing 18 changed files with 79 additions and 68 deletions.
61 changes: 36 additions & 25 deletions bin/process_bulk_metadata.py
@@ -90,13 +90,10 @@ def _parse_metadata_changes(self, issue_body):
         # For some reason, the issue body has \r\n line endings
         issue_body = issue_body.replace("\r", "")

-        try:
-            if (
-                match := re.search(r"```json\n(.*?)\n```", issue_body, re.DOTALL)
-            ) is not None:
-                return json.loads(match[1])
-        except Exception as e:
-            print(f"Error parsing metadata changes: {e}", file=sys.stderr)
+        if (
+            match := re.search(r"```json\n(.*?)\n```", issue_body, re.DOTALL)
+        ) is not None:
+            return json.loads(match[1])

         return None

@@ -119,19 +116,20 @@ def _apply_changes_to_xml(self, xml_repo_path, anthology_id, changes):
             raise Exception(f"-> Paper not found in XML file: {xml_repo_path}")

         # Apply changes to XML
-        for key in ["title", "abstract"]:
-            if key in changes:
-                node = paper_node.find(key)
-                if node is None:
-                    node = make_simple_element(key, parent=paper_node)
-                # set the node to the structure of the new string
-                try:
-                    new_node = ET.fromstring(f"<{key}>{changes[key]}</{key}>")
-                except ET.XMLSyntaxError as e:
-                    print(f"Error parsing XML for key {key}: {e}", file=sys.stderr)
-                    raise e
-                # replace the current node with the new node in the tree
-                paper_node.replace(node, new_node)
+        if paper_id != "0":
+            # frontmatter has no title or abstract
+            for key in ["title", "abstract"]:
+                if key in changes:
+                    node = paper_node.find(key)
+                    if node is None:
+                        node = make_simple_element(key, parent=paper_node)
+                    # set the node to the structure of the new string
+                    try:
+                        new_node = ET.fromstring(f"<{key}>{changes[key]}</{key}>")
+                    except ET.XMLSyntaxError as e:
+                        raise e
+                    # replace the current node with the new node in the tree
+                    paper_node.replace(node, new_node)

         if "authors" in changes:
             """
@@ -234,7 +232,15 @@ def process_metadata_issues(
                 )

             # Parse metadata changes from issue
-            json_block = self._parse_metadata_changes(issue.body)
+            try:
+                json_block = self._parse_metadata_changes(issue.body)
+            except json.decoder.JSONDecodeError as e:
+                print(
+                    f"Failed to parse JSON block in #{issue.number}: {e}",
+                    file=sys.stderr,
+                )
+                json_block = None
+
             if not json_block:
                 if close_old_issues:
                     # for old issues, filed without a JSON block, we append a comment
Expand Down Expand Up @@ -267,7 +273,10 @@ def process_metadata_issues(
continue
else:
if verbose:
print("-> Skipping (no JSON block)", file=sys.stderr)
print(
f"-> Skipping #{issue.number} (no JSON block)",
file=sys.stderr,
)
continue

self.stats["relevant_issues"] += 1
@@ -294,8 +303,10 @@
                     xml_repo_path, anthology_id, json_block
                 )
             except Exception as e:
-                if verbose:
-                    print(e, file=sys.stderr)
+                print(
+                    f"Failed to apply changes to #{issue.number}: {e}",
+                    file=sys.stderr,
+                )
                 continue

             if tree:
@@ -312,7 +323,7 @@
             # Commit changes
             self.local_repo.index.add([xml_repo_path])
             self.local_repo.index.commit(
-                f"Processed metadata corrections (closes #{issue.number})"
+                f"Process metadata corrections for {anthology_id} (closes #{issue.number})"
             )

             closed_issues.append(issue)
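To make the title/abstract handling in _apply_changes_to_xml above easier to follow: the corrected string is parsed into a fresh element and swapped into the <paper> node, and frontmatter entries (paper id "0") are now skipped because they carry neither field. A rough standalone illustration using lxml on a toy snippet (the real script goes through the Anthology's own helpers such as make_simple_element) might look like this:

import lxml.etree as ET

paper_node = ET.fromstring(
    '<paper id="59"><title>Old title</title><pages>1-10</pages></paper>'
)
changes = {"title": "A corrected <fixed-case>HEART</fixed-case> title"}  # hypothetical correction

paper_id = paper_node.get("id")
if paper_id != "0":  # frontmatter (id 0) carries no title or abstract
    for key in ["title", "abstract"]:
        if key in changes:
            # parse the corrected string, markup included, into a new element
            new_node = ET.fromstring(f"<{key}>{changes[key]}</{key}>")
            node = paper_node.find(key)
            if node is None:
                paper_node.append(new_node)
            else:
                # swap the old node for the new one in place
                paper_node.replace(node, new_node)

print(ET.tostring(paper_node, encoding="unicode"))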
2 changes: 1 addition & 1 deletion data/xml/2020.nlposs.xml
@@ -36,7 +36,7 @@
       <bibkey>madeira-etal-2020-framework</bibkey>
     </paper>
     <paper id="2">
-      <title><fixed-case>ARBML</fixed-case>: Democritizing <fixed-case>A</fixed-case>rabic Natural Language Processing Tools</title>
+      <title><fixed-case>ARBML</fixed-case>: Democratizing <fixed-case>A</fixed-case>rabic Natural Language Processing Tools</title>
       <author><first>Zaid</first><last>Alyafeai</last></author>
       <author><first>Maged</first><last>Al-Shaibani</last></author>
       <pages>8–13</pages>
4 changes: 2 additions & 2 deletions data/xml/2022.emnlp.xml
@@ -10383,13 +10383,13 @@
       <doi>10.18653/v1/2022.emnlp-main.787</doi>
     </paper>
     <paper id="788">
-      <title>Attentional Probe: Estimating a Module’s Functional Potential</title>
+      <title>The Architectural Bottleneck Principle</title>
       <author><first>Tiago</first><last>Pimentel</last><affiliation>University of Cambridge</affiliation></author>
       <author><first>Josef</first><last>Valvoda</last><affiliation>University of Cambridge</affiliation></author>
       <author><first>Niklas</first><last>Stoehr</last><affiliation>ETH Zurich</affiliation></author>
       <author><first>Ryan</first><last>Cotterell</last><affiliation>ETH Zürich</affiliation></author>
       <pages>11459-11472</pages>
-      <abstract/>
+      <abstract>In this paper, we seek to measure how much information a component in a neural network could extract from the representations fed into it. Our work stands in contrast to prior probing work, most of which investigates how much information a model's representations contain. This shift in perspective leads us to propose a new principle for probing, the architectural bottleneck principle: In order to estimate how much information a given component could extract, a probe should look exactly like the component. Relying on this principle, we estimate how much syntactic information is available to transformers through our attentional probe, a probe that exactly resembles a transformer's self-attention head. Experimentally, we find that, in three models (BERT, ALBERT, and RoBERTa), a sentence's syntax tree is mostly extractable by our probe, suggesting these models have access to syntactic information while composing their contextual representations. Whether this information is actually used by these models, however, remains an open question.</abstract>
       <url hash="7514164a">2022.emnlp-main.788</url>
       <bibkey>pimentel-etal-2022-attentional</bibkey>
       <doi>10.18653/v1/2022.emnlp-main.788</doi>
2 changes: 1 addition & 1 deletion data/xml/2022.findings.xml
@@ -349,7 +349,7 @@
       <author><first>Andrey</first><last>Chertok</last></author>
       <author><first>Sergey</first><last>Nikolenko</last></author>
       <pages>239-245</pages>
-      <abstract>We present RuCCoN, a new dataset for clinical concept normalization in Russian manually annotated by medical professionals. It contains over 16,028 entity mentions manually linked to over 2,409 unique concepts from the Russian language part of the UMLS ontology. We provide train/test splits for different settings (stratified, zero-shot, and CUI-less) and present strong baselines obtained with state-of-the-art models such as SapBERT. At present, Russian medical NLP is lacking in both datasets and trained models, and we view this work as an important step towards filling this gap. Our dataset and annotation guidelines are available at <url>https://github.com/sberbank-ai-lab/RuCCoN</url>.</abstract>
+      <abstract>We present RuCCoN, a new dataset for clinical concept normalization in Russian manually annotated by medical professionals. It contains over 16,028 entity mentions manually linked to over 2,409 unique concepts from the Russian language part of the UMLS ontology. We provide train/test splits for different settings (stratified, zero-shot, and CUI-less) and present strong baselines obtained with state-of-the-art models such as SapBERT. At present, Russian medical NLP is lacking in both datasets and trained models, and we view this work as an important step towards filling this gap. Our dataset and annotation guidelines are available at <url>https://github.com/AIRI-Institute/RuCCoN</url>.</abstract>
       <url hash="8f620f3a">2022.findings-acl.21</url>
       <bibkey>nesterov-etal-2022-ruccon</bibkey>
       <doi>10.18653/v1/2022.findings-acl.21</doi>
2 changes: 1 addition & 1 deletion data/xml/2022.naacl.xml
@@ -197,7 +197,7 @@
     </paper>
     <paper id="13">
       <title>Two Contrasting Data Annotation Paradigms for Subjective <fixed-case>NLP</fixed-case> Tasks</title>
-      <author><first>Paul</first><last>Rottger</last></author>
+      <author><first>Paul</first><last>Röttger</last></author>
       <author><first>Bertie</first><last>Vidgen</last></author>
       <author><first>Dirk</first><last>Hovy</last></author>
       <author><first>Janet</first><last>Pierrehumbert</last></author>
2 changes: 1 addition & 1 deletion data/xml/2023.emnlp.xml
@@ -2986,7 +2986,7 @@
       <author><first>Soda Marem</first><last>Lo</last></author>
       <author><first>Valerio</first><last>Basile</last></author>
       <author><first>Simona</first><last>Frenda</last></author>
-      <author><first>Alessandra</first><last>Cignarella</last></author>
+      <author><first>Alessandra Teresa</first><last>Cignarella</last></author>
       <author><first>Viviana</first><last>Patti</last></author>
       <author><first>Cristina</first><last>Bosco</last></author>
       <pages>3496-3507</pages>
10 changes: 5 additions & 5 deletions data/xml/2024.acl.xml
@@ -2663,11 +2663,11 @@
     </paper>
     <paper id="191">
       <title><fixed-case>L</fixed-case>lama2<fixed-case>V</fixed-case>ec: Unsupervised Adaptation of Large Language Models for Dense Retrieval</title>
-      <author><first>Chaofan</first><last>Li</last></author>
       <author><first>Zheng</first><last>Liu</last></author>
+      <author><first>Chaofan</first><last>Li</last></author>
       <author><first>Shitao</first><last>Xiao</last></author>
-      <author><first>Yingxia</first><last>Shao</last><affiliation>Beijing University of Posts and Telecommunications</affiliation></author>
-      <author><first>Defu</first><last>Lian</last><affiliation>University of Science and Technology of China</affiliation></author>
+      <author><first>Yingxia</first><last>Shao</last></author>
+      <author><first>Defu</first><last>Lian</last></author>
       <pages>3490-3500</pages>
       <abstract>Dense retrieval calls for discriminative embeddings to represent the semantic relationship between query and document. It may benefit from the using of large language models (LLMs), given LLMs’ strong capability on semantic understanding. However, the LLMs are learned by auto-regression, whose working mechanism is completely different from representing whole text as one discriminative embedding. Thus, it is imperative to study how to adapt LLMs properly so that they can be effectively initialized as the backbone encoder for dense retrieval. In this paper, we propose a novel approach, called <b>Llama2Vec</b>, which performs unsupervised adaptation of LLM for its dense retrieval application. Llama2Vec consists of two pretext tasks: EBAE (Embedding-Based Auto-Encoding) and EBAR (Embedding-Based Auto-Regression), where the LLM is prompted to <i>reconstruct the input sentence</i> and <i>predict the next sentence</i> based on its text embeddings. Llama2Vec is simple, lightweight, but highly effective. It is used to adapt LLaMA-2-7B on the Wikipedia corpus. With a moderate steps of adaptation, it substantially improves the model’s fine-tuned performances on a variety of dense retrieval benchmarks. Notably, it results in the new state-of-the-art performances on popular benchmarks, such as passage and document retrieval on MSMARCO, and zero-shot retrieval on BEIR. The model and source code will be made publicly available to facilitate the future research. Our model is available at https://github.com/FlagOpen/FlagEmbedding.</abstract>
       <url hash="e0092648">2024.acl-long.191</url>
@@ -13986,8 +13986,8 @@
     <paper id="29">
       <title>Compromesso! <fixed-case>I</fixed-case>talian Many-Shot Jailbreaks undermine the safety of Large Language Models</title>
       <author><first>Fabio</first><last>Pernisi</last></author>
-      <author><first>Dirk</first><last>Hovy</last><affiliation>Bocconi University</affiliation></author>
-      <author><first>Paul</first><last>R�ttger</last><affiliation>Bocconi University</affiliation></author>
+      <author><first>Dirk</first><last>Hovy</last></author>
+      <author><first>Paul</first><last>Röttger</last></author>
       <pages>245-251</pages>
       <abstract>As diverse linguistic communities and users adopt Large Language Models (LLMs), assessing their safety across languages becomes critical. Despite ongoing efforts to align these models with safe and ethical guidelines, they can still be induced into unsafe behavior with jailbreaking, a technique in which models are prompted to act outside their operational guidelines. What research has been conducted on these vulnerabilities was predominantly on English, limiting the understanding of LLM behavior in other languages. We address this gap by investigating Many-Shot Jailbreaking (MSJ) in Italian, underscoring the importance of understanding LLM behavior in different languages. We base our analysis on a newly created Italian dataset to identify unique safety vulnerabilities in 4 families of open-source LLMs.We find that the models exhibit unsafe behaviors even with minimal exposure to harmful prompts, and–more alarmingly–this tendency rapidly escalates with more demonstrations.</abstract>
       <url hash="5b5c8ec0">2024.acl-srw.29</url>
4 changes: 2 additions & 2 deletions data/xml/2024.conll.xml
@@ -203,10 +203,10 @@
     </paper>
     <paper id="17">
       <title>The Effect of Surprisal on Reading Times in Information Seeking and Repeated Reading</title>
-      <author><first>Keren Gruteke</first><last>Klein</last><affiliation>Technion - Israel Institute of Technology, Technion</affiliation></author>
+      <author><first>Keren</first><last>Gruteke Klein</last></author>
       <author><first>Yoav</first><last>Meiri</last></author>
       <author><first>Omer</first><last>Shubi</last></author>
-      <author><first>Yevgeni</first><last>Berzak</last><affiliation>Technion - Israel Institute of Technology, Technion</affiliation></author>
+      <author><first>Yevgeni</first><last>Berzak</last></author>
       <pages>219-230</pages>
       <abstract>The effect of surprisal on processing difficulty has been a central topic of investigation in psycholinguistics. Here, we use eyetracking data to examine three language processing regimes that are common in daily life but have not been addressed with respect to this question: information seeking, repeated processing, and the combination of the two. Using standard regime-agnostic surprisal estimates we find that the prediction of surprisal theory regarding the presence of a linear effect of surprisal on processing times, extends to these regimes. However, when using surprisal estimates from regime-specific contexts that match the contexts and tasks given to humans, we find that in information seeking, such estimates do not improve the predictive power of processing times compared to standard surprisals. Further, regime-specific contexts yield near zero surprisal estimates with no predictive power for processing times in repeated reading. These findings point to misalignments of task and memory representations between humans and current language models, and question the extent to which such models can be used for estimating cognitively relevant quantities. We further discuss theoretical challenges posed by these results.</abstract>
       <url hash="edbdb721">2024.conll-1.17</url>
6 changes: 3 additions & 3 deletions data/xml/2024.emnlp.xml
@@ -826,11 +826,11 @@
     </paper>
     <paper id="59">
       <title><fixed-case>HEART</fixed-case>-felt Narratives: Tracing Empathy and Narrative Style in Personal Stories with <fixed-case>LLM</fixed-case>s</title>
-      <author><first>Jocelyn J</first><last>Shen</last><affiliation>Massachusetts Institute of Technology</affiliation></author>
+      <author><first>Jocelyn</first><last>Shen</last></author>
       <author><first>Joel</first><last>Mire</last></author>
-      <author><first>Hae Won</first><last>Park</last><affiliation>Amazon and Massachusetts Institute of Technology</affiliation></author>
+      <author><first>Hae Won</first><last>Park</last></author>
       <author><first>Cynthia</first><last>Breazeal</last></author>
-      <author><first>Maarten</first><last>Sap</last><affiliation>Carnegie Mellon University</affiliation></author>
+      <author><first>Maarten</first><last>Sap</last></author>
       <pages>1026-1046</pages>
       <abstract>Empathy serves as a cornerstone in enabling prosocial behaviors, and can be evoked through sharing of personal experiences in stories. While empathy is influenced by narrative content, intuitively, people respond to the way a story is told as well, through narrative style. Yet the relationship between empathy and narrative style is not fully understood. In this work, we empirically examine and quantify this relationship between style and empathy using LLMs and large-scale crowdsourcing studies. We introduce a novel, theory-based taxonomy, HEART (Human Empathy and Narrative Taxonomy) that delineates elements of narrative style that can lead to empathy with the narrator of a story. We establish the performance of LLMs in extracting narrative elements from HEART, showing that prompting with our taxonomy leads to reasonable, human-level annotations beyond what prior lexicon-based methods can do. To show empirical use of our taxonomy, we collect a dataset of empathy judgments of stories via a large-scale crowdsourcing study with <tex-math>N=2,624</tex-math> participants. We show that narrative elements extracted via LLMs, in particular, vividness of emotions and plot volume, can elucidate the pathways by which narrative style cultivates empathy towards personal stories. Our work suggests that such models can be used for narrative analyses that lead to human-centered social and behavioral insights.</abstract>
       <url hash="204518d9">2024.emnlp-main.59</url>