From d24331feb1cef383efc075c15cc48b2a0dbf8593 Mon Sep 17 00:00:00 2001 From: jS5t3r Date: Fri, 15 Nov 2024 02:36:40 +0000 Subject: [PATCH] =?UTF-8?q?Deploying=20to=20gh-pages=20from=20@=20lorenz-p?= =?UTF-8?q?eter/lorenz-peter.github.io@e536d8047b1926909fc768f18d563f97bc4?= =?UTF-8?q?d747f=20=F0=9F=9A=80?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- 404.html | 2 +- _pages/dropdown/index.html | 2 +- assets/json/model_stealing_papers.json | 2 +- assets/jupyter/blog.ipynb.html | 2 +- assets/jupyter/load_json.ipynb.html | 2 +- blog/2023/gandalf/index.html | 2 +- blog/2023/index.html | 2 +- blog/2024/findconference/index.html | 1 + blog/2024/findvenue/index.html | 1 - blog/2024/index.html | 2 +- blog/2024/load-json/index.html | 2 +- blog/2024/phdthesis/index.html | 2 +- blog/2024/rw-ms/index.html | 2 +- blog/category/llm-security/index.html | 2 +- blog/category/research/index.html | 2 +- blog/category/thesis/index.html | 2 +- blog/category/writing/index.html | 2 +- blog/index.html | 2 +- blog/page/2/index.html | 2 +- blog/page/3/index.html | 2 +- blog/tag/adversarial/index.html | 2 +- blog/tag/conference/index.html | 2 +- blog/tag/cvpr/index.html | 2 +- blog/tag/examples/index.html | 2 +- blog/tag/icml/index.html | 2 +- blog/tag/research/index.html | 2 +- blog/tag/security/index.html | 2 +- cv/index.html | 2 +- feed.xml | 2 +- index.html | 2 +- news/announcement_100/index.html | 2 +- news/announcement_101/index.html | 2 +- news/announcement_102/index.html | 2 +- news/announcement_103/index.html | 2 +- news/index.html | 2 +- people/index.html | 2 +- projects/1_project/index.html | 2 +- projects/2_project/index.html | 2 +- projects/3_project/index.html | 2 +- projects/4_project/index.html | 2 +- projects/5_project/index.html | 2 +- projects/6_project/index.html | 2 +- projects/index.html | 2 +- publications/index.html | 2 +- repositories/index.html | 2 +- sitemap.xml | 2 +- teaching/index.html | 2 +- 47 files changed, 46 insertions(+), 46 deletions(-) create mode 100644 blog/2024/findconference/index.html delete mode 100644 blog/2024/findvenue/index.html diff --git a/404.html b/404.html index 56546e72..6c052a8c 100644 --- a/404.html +++ b/404.html @@ -1 +1 @@ - Page not found | Peter Lorenz

Page not found

Looks like there has been a mistake. Nothing exists here.

You will be redirected to the main page within 3 seconds. If not redirected, please go back to the home page.

\ No newline at end of file + Page not found | Peter Lorenz

Page not found

Looks like there has been a mistake. Nothing exists here.

You will be redirected to the main page within 3 seconds. If not redirected, please go back to the home page.

\ No newline at end of file diff --git a/_pages/dropdown/index.html b/_pages/dropdown/index.html index b31cb674..4d89aefa 100644 --- a/_pages/dropdown/index.html +++ b/_pages/dropdown/index.html @@ -1 +1 @@ - submenus | Peter Lorenz

submenus

\ No newline at end of file + submenus | Peter Lorenz

submenus

\ No newline at end of file diff --git a/assets/json/model_stealing_papers.json b/assets/json/model_stealing_papers.json index 34ac6c14..f842ba96 100644 --- a/assets/json/model_stealing_papers.json +++ b/assets/json/model_stealing_papers.json @@ -1 +1 @@ -[{"date":"2024-11","title":"AWARE Narrator and the Utilization of Large Language Models to Extract Behavioral Insights from Smartphone Sensing Data","author":"Tianyi Zhang, Miu Kojima, and Simon D'Alfonso","link":"http://arxiv.org/abs/2411.04691v1","abstract":"Smartphones, equipped with an array of sensors, have become valuable tools\nfor personal sensing. Particularly in digital health, smartphones facilitate\nthe tracking of health-related behaviors and contexts, contributing\nsignificantly to digital phenotyping, a process where data from digital\ninteractions is analyzed to infer behaviors and assess mental health.\nTraditional methods process raw sensor data into information features for\nstatistical and machine learning analyses. In this paper, we introduce a novel\napproach that systematically converts smartphone-collected data into\nstructured, chronological narratives. The AWARE Narrator translates\nquantitative smartphone sensing data into English language descriptions,\nforming comprehensive narratives of an individual's activities. We apply the\nframework to the data collected from university students over a week,\ndemonstrating the potential of utilizing the narratives to summarize individual\nbehavior, and analyzing psychological states by leveraging large language\nmodels."},{"date":"2024-11","title":"Fine-Tuning Vision-Language Model for Automated Engineering Drawing Information Extraction","author":"Muhammad Tayyab Khan, Lequn Chen, Ye Han Ng, Wenhe Feng, Nicholas Yew Jin Tan, and Seung Ki Moon","link":"http://arxiv.org/abs/2411.03707v1","abstract":"Geometric Dimensioning and Tolerancing (GD&T) plays a critical role in\nmanufacturing by defining acceptable variations in part features to ensure\ncomponent quality and functionality. However, extracting GD&T information from\n2D engineering drawings is a time-consuming and labor-intensive task, often\nrelying on manual efforts or semi-automated tools. To address these challenges,\nthis study proposes an automated and computationally efficient GD&T extraction\nmethod by fine-tuning Florence-2, an open-source vision-language model (VLM).\nThe model is trained on a dataset of 400 drawings with ground truth annotations\nprovided by domain experts. For comparison, two state-of-the-art closed-source\nVLMs, GPT-4o and Claude-3.5-Sonnet, are evaluated on the same dataset. All\nmodels are assessed using precision, recall, F1-score, and hallucination\nmetrics. Due to the computational cost and impracticality of fine-tuning large\nclosed-source VLMs for domain-specific tasks, GPT-4o and Claude-3.5-Sonnet are\nevaluated in a zero-shot setting. In contrast, Florence-2, a smaller model with\n0.23 billion parameters, is optimized through full-parameter fine-tuning across\nthree distinct experiments, each utilizing datasets augmented to different\nlevels. The results show that Florence-2 achieves a 29.95% increase in\nprecision, a 37.75% increase in recall, a 52.40% improvement in F1-score, and a\n43.15% reduction in hallucination rate compared to the best-performing\nclosed-source model. 
These findings highlight the effectiveness of fine-tuning\nsmaller, open-source VLMs like Florence-2, offering a practical and efficient\nsolution for automated GD&T extraction to support downstream manufacturing\ntasks."},{"date":"2024-11","title":"Speaker Emotion Recognition: Leveraging Self-Supervised Models for Feature Extraction Using Wav2Vec2 and HuBERT","author":"Pourya Jafarzadeh, Amir Mohammad Rostami, and Padideh Choobdar","link":"http://arxiv.org/abs/2411.02964v2","abstract":"Speech is the most natural way of expressing ourselves as humans. Identifying\nemotion from speech is a nontrivial task due to the ambiguous definition of\nemotion itself. Speaker Emotion Recognition (SER) is essential for\nunderstanding human emotional behavior. The SER task is challenging due to the\nvariety of speakers, background noise, complexity of emotions, and speaking\nstyles. It has many applications in education, healthcare, customer service,\nand Human-Computer Interaction (HCI). Previously, conventional machine learning\nmethods such as SVM, HMM, and KNN have been used for the SER task. In recent\nyears, deep learning methods have become popular, with convolutional neural\nnetworks and recurrent neural networks being used for SER tasks. The input of\nthese methods is mostly spectrograms and hand-crafted features. In this work,\nwe study the use of self-supervised transformer-based models, Wav2Vec2 and\nHuBERT, to determine the emotion of speakers from their voice. The models\nautomatically extract features from raw audio signals, which are then used for\nthe classification task. The proposed solution is evaluated on reputable\ndatasets, including RAVDESS, SHEMO, SAVEE, AESDD, and Emo-DB. The results show\nthe effectiveness of the proposed method on different datasets. Moreover, the\nmodel has been used for real-world applications like call center conversations,\nand the results demonstrate that the model accurately predicts emotions."},{"date":"2024-11","title":"DM4Steal: Diffusion Model For Link Stealing Attack On Graph Neural Networks","author":"Jinyin Chen, Haonan Ma, and Haibin Zheng","link":"http://arxiv.org/abs/2411.03364v1","abstract":"Graphs have become increasingly integral to the advancement of recommendation\nsystems, particularly with the fast development of graph neural networks (GNNs).\nBy exploring the virtue of rich node features and link information, GNN is\ndesigned to provide personalized and accurate suggestions. Meanwhile, the\nprivacy leakage of GNN in such contexts has also captured special attention.\nPrior work has revealed that a malicious user can utilize auxiliary knowledge\nto extract sensitive link data of the target graph, integral to recommendation\nsystems, via the decision made by the target GNN model. This poses a\nsignificant risk to the integrity and confidentiality of data used in\nrecommendation systems. Though important, previous works on GNN's privacy\nleakage are still challenged in three aspects, i.e., limited stealing attack\nscenarios, sub-optimal attack performance, and adaptation against defense. To\naddress these issues, we propose a diffusion model based link stealing attack,\nnamed DM4Steal. It differs from previous work in three critical aspects. (i)\nGenerality: aiming at six attack scenarios with limited auxiliary knowledge, we\npropose a novel training strategy for diffusion models so that DM4Steal is\ntransferable to diverse attack scenarios. 
(ii) Effectiveness: benefiting from\nthe retention of semantic structure in the diffusion model during the training\nprocess, DM4Steal is capable of learning the precise topology of the target graph\nthrough the GNN decision process. (iii) Adaptation: when GNN is defensive\n(e.g., DP, Dropout), DM4Steal relies on the stability that comes from sampling\nthe score model multiple times to keep performance degradation to a minimum,\nthus DM4Steal implements a successful adaptive attack on defensive GNNs."},{"date":"2024-11","title":"HIP: Hierarchical Point Modeling and Pre-training for Visual Information Extraction","author":"Rujiao Long, Pengfei Wang, Zhibo Yang, and Cong Yao","link":"http://arxiv.org/abs/2411.01139v1","abstract":"End-to-end visual information extraction (VIE) aims at integrating the\nhierarchical subtasks of VIE, including text spotting, word grouping, and\nentity labeling, into a unified framework. Dealing with the gaps among the\nthree subtasks plays a pivotal role in designing an effective VIE model.\nOCR-dependent methods heavily rely on offline OCR engines and inevitably suffer\nfrom OCR errors, while OCR-free methods, particularly those employing a\nblack-box model, might produce outputs that lack interpretability or contain\nhallucinated content. Inspired by CenterNet, DeepSolo, and ESP, we propose HIP,\nwhich models entities as HIerarchical Points to better conform to the\nhierarchical nature of the end-to-end VIE task. Specifically, such hierarchical\npoints can be flexibly encoded and subsequently decoded into desired text\ntranscripts, centers of various regions, and categories of entities.\nFurthermore, we devise corresponding hierarchical pre-training strategies,\ncategorized as image reconstruction, layout learning, and language enhancement,\nto reinforce the cross-modality representation of the hierarchical encoders.\nQuantitative experiments on public benchmarks demonstrate that HIP outperforms\nprevious state-of-the-art methods, while qualitative results show its excellent\ninterpretability."},{"date":"2024-10","title":"Graph-Augmented Relation Extraction Model with LLMs-Generated Support Document","author":"Vicky Dong, Hao Yu, and Yao Chen","link":"http://arxiv.org/abs/2410.23452v1","abstract":"This study introduces a novel approach to sentence-level relation extraction\n(RE) that integrates Graph Neural Networks (GNNs) with Large Language Models\n(LLMs) to generate contextually enriched support documents. By harnessing the\npower of LLMs to generate auxiliary information, our approach crafts an\nintricate graph representation of textual data. This graph is subsequently\nprocessed through a Graph Neural Network (GNN) to refine and enrich the\nembeddings associated with each entity, ensuring a more nuanced and\ninterconnected understanding of the data. This methodology addresses the\nlimitations of traditional sentence-level RE models by incorporating broader\ncontexts and leveraging inter-entity interactions, thereby improving the\nmodel's ability to capture complex relationships across sentences. 
Our\nexperiments, conducted on the CrossRE dataset, demonstrate the effectiveness of\nour approach, with notable improvements in performance across various domains.\nThe results underscore the potential of combining GNNs with LLM-generated\ncontext to advance the field of relation extraction."},{"date":"2024-10","title":"Image2Struct: Benchmarking Structure Extraction for Vision-Language Models","author":"Josselin Somerville Roberts, Tony Lee, Chi Heem Wong, Michihiro Yasunaga, Yifan Mai, and Percy Liang","link":"http://arxiv.org/abs/2410.22456v1","abstract":"We introduce Image2Struct, a benchmark to evaluate vision-language models\n(VLMs) on extracting structure from images. Our benchmark 1) captures\nreal-world use cases, 2) is fully automatic and does not require human\njudgment, and 3) is based on a renewable stream of fresh data. In Image2Struct,\nVLMs are prompted to generate the underlying structure (e.g., LaTeX code or\nHTML) from an input image (e.g., webpage screenshot). The structure is then\nrendered to produce an output image (e.g., rendered webpage), which is compared\nagainst the input image to produce a similarity score. This round-trip\nevaluation allows us to quantitatively evaluate VLMs on tasks with multiple\nvalid structures. We create a pipeline that downloads fresh data from active\nonline communities upon execution and evaluates the VLMs without human\nintervention. We introduce three domains (Webpages, LaTeX, and Musical Scores)\nand use five image metrics (pixel similarity, cosine similarity between the\nInception vectors, learned perceptual image patch similarity, structural\nsimilarity index measure, and earth mover similarity) that allow efficient and\nautomatic comparison between pairs of images. We evaluate Image2Struct on 14\nprominent VLMs and find that scores vary widely, indicating that Image2Struct\ncan differentiate between the performances of different VLMs. Additionally, the\nbest score varies considerably across domains (e.g., 0.402 on sheet music vs.\n0.830 on LaTeX equations), indicating that Image2Struct contains tasks of\nvarying difficulty. For transparency, we release the full results at\nhttps://crfm.stanford.edu/helm/image2struct/v1.0.1/."},{"date":"2024-10","title":"Integrating Deep Feature Extraction and Hybrid ResNet-DenseNet Model for Multi-Class Abnormality Detection in Endoscopic Images","author":"Aman Sagar, Preeti Mehta, Monika Shrivastva, and Suchi Kumari","link":"http://arxiv.org/abs/2410.18457v1","abstract":"This paper presents a deep learning framework for the multi-class\nclassification of gastrointestinal abnormalities in Video Capsule Endoscopy\n(VCE) frames. The aim is to automate the identification of ten GI abnormality\nclasses, including angioectasia, bleeding, and ulcers, thereby reducing the\ndiagnostic burden on gastroenterologists. Utilizing an ensemble of DenseNet and\nResNet architectures, the proposed model achieves an overall accuracy of 94\\%\nacross a well-structured dataset. Precision scores range from 0.56 for erythema\nto 1.00 for worms, with recall rates peaking at 98% for normal findings. This\nstudy emphasizes the importance of robust data preprocessing techniques,\nincluding normalization and augmentation, in enhancing model performance. 
The\ncontributions of this work lie in developing an effective AI-driven tool that\nstreamlines the diagnostic process in gastroenterology, ultimately improving\npatient care and clinical outcomes."},{"date":"2024-10","title":"Extracting Spatiotemporal Data from Gradients with Large Language Models","author":"Lele Zheng, Yang Cao, Renhe Jiang, Kenjiro Taura, Yulong Shen, Sheng Li, and Masatoshi Yoshikawa","link":"http://arxiv.org/abs/2410.16121v1","abstract":"Recent works show that sensitive user data can be reconstructed from gradient\nupdates, breaking the key privacy promise of federated learning. While success\nwas demonstrated primarily on image data, these methods do not directly\ntransfer to other domains, such as spatiotemporal data. To understand privacy\nrisks in spatiotemporal federated learning, we first propose Spatiotemporal\nGradient Inversion Attack (ST-GIA), a gradient attack algorithm tailored to\nspatiotemporal data that successfully reconstructs the original location from\ngradients. Furthermore, the absence of priors in attacks on spatiotemporal data\nhas hindered the accurate reconstruction of real client data. To address this\nlimitation, we propose ST-GIA+, which utilizes an auxiliary language model to\nguide the search for potential locations, thereby successfully reconstructing\nthe original data from gradients. In addition, we design an adaptive defense\nstrategy to mitigate gradient inversion attacks in spatiotemporal federated\nlearning. By dynamically adjusting the perturbation levels, we can offer\ntailored protection for varying rounds of training data, thereby achieving a\nbetter trade-off between privacy and utility than current state-of-the-art\nmethods. Through intensive experimental analysis on three real-world datasets,\nwe reveal that the proposed defense strategy can well preserve the utility of\nspatiotemporal federated learning with effective security protection."},{"date":"2024-10","title":"Kaninfradet3D:A Road-side Camera-LiDAR Fusion 3D Perception Model based on Nonlinear Feature Extraction and Intrinsic Correlation","author":"Pei Liu, Nanfang Zheng, Yiqun Li, Junlan Chen, and Ziyuan Pu","link":"http://arxiv.org/abs/2410.15814v1","abstract":"With the development of AI-assisted driving, numerous methods have emerged\nfor ego-vehicle 3D perception tasks, but there has been limited research on\nroadside perception. With its ability to provide a global view and a broader\nsensing range, the roadside perspective is worth developing. LiDAR provides\nprecise three-dimensional spatial information, while cameras offer semantic\ninformation. These two modalities are complementary in 3D detection. However,\nadding camera data does not increase accuracy in some studies since the\ninformation extraction and fusion procedure is not sufficiently reliable.\nRecently, Kolmogorov-Arnold Networks (KANs) have been proposed as replacements\nfor MLPs, which are better suited for high-dimensional, complex data. Both the\ncamera and the LiDAR provide high-dimensional information, and employing KANs\nshould enhance the extraction of valuable features to produce better fusion\noutcomes. This paper proposes Kaninfradet3D, which optimizes the feature\nextraction and fusion modules. To extract features from complex\nhigh-dimensional data, the model's encoder and fuser modules were improved\nusing KAN Layers. 
Cross-attention was applied to enhance feature fusion, and\nvisual comparisons verified that camera features were more evenly integrated.\nThis addressed the issue of camera features being abnormally concentrated,\nnegatively impacting fusion. Compared to the benchmark, our approach shows\nimprovements of +9.87 mAP and +10.64 mAP in the two viewpoints of the TUMTraf\nIntersection Dataset and an improvement of +1.40 mAP in the roadside end of the\nTUMTraf V2X Cooperative Perception Dataset. The results indicate that\nKaninfradet3D can effectively fuse features, demonstrating the potential of\napplying KANs in roadside perception tasks."},{"date":"2024-10","title":"Efficient Model Extraction via Boundary Sampling","author":"Maor Biton Dor, and Yisroel Mirsky","link":"http://arxiv.org/abs/2410.15429v1","abstract":"This paper introduces a novel data-free model extraction attack that\nsignificantly advances the current state-of-the-art in terms of efficiency,\naccuracy, and effectiveness. Traditional black-box methods rely on using the\nvictim's model as an oracle to label a vast number of samples within\nhigh-confidence areas. This approach not only requires an extensive number of\nqueries but also results in a less accurate and less transferable model. In\ncontrast, our method innovates by focusing on sampling low-confidence areas\n(along the decision boundaries) and employing an evolutionary algorithm to\noptimize the sampling process. These novel contributions allow for a dramatic\nreduction in the number of queries needed by the attacker by a factor of 10x to\n600x while simultaneously improving the accuracy of the stolen model. Moreover,\nour approach improves boundary alignment, resulting in better transferability\nof adversarial examples from the stolen model to the victim's model (increasing\nthe attack success rate from 60\\% to 82\\% on average). Finally, we accomplish\nall of this with a strict black-box assumption on the victim, with no knowledge\nof the target's architecture or dataset.\n We demonstrate our attack on three datasets with increasingly larger\nresolutions and compare our performance to four state-of-the-art model\nextraction attacks."},{"date":"2024-10","title":"Transit Pulse: Utilizing Social Media as a Source for Customer Feedback and Information Extraction with Large Language Model","author":"Jiahao Wang, and Amer Shalaby","link":"http://arxiv.org/abs/2410.15016v1","abstract":"Users of the transit system flood social networks daily with messages that\ncontain valuable insights crucial for improving service quality. These posts\nhelp transit agencies quickly identify emerging issues. Parsing topics and\nsentiments is key to gaining comprehensive insights to foster service\nexcellence. However, the volume of messages makes manual analysis impractical,\nand standard NLP techniques like Term Frequency-Inverse Document Frequency\n(TF-IDF) fall short in nuanced interpretation. Traditional sentiment analysis\nseparates topics and sentiments before integrating them, often missing the\ninteraction between them. This incremental approach complicates classification\nand reduces analytical productivity. To address these challenges, we propose a\nnovel approach to extracting and analyzing transit-related information,\nincluding sentiment and sarcasm detection, identification of unusual system\nproblems, and location data from social media. Our method employs Large\nLanguage Models (LLM), specifically Llama 3, for a streamlined analysis free\nfrom pre-established topic labels. 
To enhance the model's domain-specific\nknowledge, we utilize Retrieval-Augmented Generation (RAG), integrating\nexternal knowledge sources into the information extraction pipeline. We\nvalidated our method through extensive experiments comparing its performance\nwith traditional NLP approaches on user tweet data from a real-world transit\nsystem. Our results demonstrate the potential of LLMs to transform social media\ndata analysis in the public transit domain, providing actionable insights and\nenhancing transit agencies' responsiveness by extracting a broader range of\ninformation."},{"date":"2024-10","title":"Few-Shot Joint Multimodal Entity-Relation Extraction via Knowledge-Enhanced Cross-modal Prompt Model","author":"Li Yuan, Yi Cai, and Junsheng Huang","link":"http://arxiv.org/abs/2410.14225v1","abstract":"Joint Multimodal Entity-Relation Extraction (JMERE) is a challenging task\nthat aims to extract entities and their relations from text-image pairs in\nsocial media posts. Existing methods for JMERE require large amounts of labeled\ndata. However, gathering and annotating fine-grained multimodal data for JMERE\nposes significant challenges. Initially, we construct diverse and comprehensive\nmultimodal few-shot datasets fitted to the original data distribution. To\naddress the insufficient information in the few-shot setting, we introduce the\n\\textbf{K}nowledge-\\textbf{E}nhanced \\textbf{C}ross-modal \\textbf{P}rompt\n\\textbf{M}odel (KECPM) for JMERE. This method can effectively address the\nproblem of insufficient information in the few-shot setting by guiding a large\nlanguage model to generate supplementary background knowledge. Our proposed\nmethod comprises two stages: (1) a knowledge ingestion stage that dynamically\nformulates prompts based on semantic similarity to guide ChatGPT in generating\nrelevant knowledge and employs self-reflection to refine the knowledge; (2) a\nknowledge-enhanced language model stage that merges the auxiliary knowledge\nwith the original input and utilizes a transformer-based model to align with\nJMERE's required output format. We extensively evaluate our approach on a\nfew-shot dataset derived from the JMERE dataset, demonstrating its superiority\nover strong baselines in terms of both micro and macro F$_1$ scores.\nAdditionally, we present qualitative analyses and case studies to elucidate the\neffectiveness of our model."},{"date":"2024-10","title":"Supply Chain Network Extraction and Entity Classification Leveraging Large Language Models","author":"Tong Liu, and Hadi Meidani","link":"http://arxiv.org/abs/2410.13051v1","abstract":"Supply chain networks are critical to the operational efficiency of\nindustries, yet their increasing complexity presents significant challenges in\nmapping relationships and identifying the roles of various entities.\nTraditional methods for constructing supply chain networks rely heavily on\nstructured datasets and manual data collection, limiting their scope and\nefficiency. In contrast, recent advancements in Natural Language Processing\n(NLP) and large language models (LLMs) offer new opportunities for discovering\nand analyzing supply chain networks using unstructured text data. This paper\nproposes a novel approach that leverages LLMs to extract and process raw\ntextual information from publicly available sources to construct a\ncomprehensive supply chain graph. 
We focus on the civil engineering sector as a\ncase study, demonstrating how LLMs can uncover hidden relationships among\ncompanies, projects, and other entities. Additionally, we fine-tune an LLM to\nclassify entities within the supply chain graph, providing detailed insights\ninto their roles and relationships. The results show that domain-specific\nfine-tuning improves classification accuracy, highlighting the potential of\nLLMs for industry-specific supply chain analysis. Our contributions include the\ndevelopment of a supply chain graph for the civil engineering sector, as well\nas a fine-tuned LLM model that enhances entity classification and understanding\nof supply chain networks."},{"date":"2024-10","title":"CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment","author":"Qinfeng Li, Yangfan Xie, Tianyu Du, Zhiqiang Shen, Zhenghan Qin, Hao Peng, Xinkui Zhao, Xianwei Zhu, Jianwei Yin, and Xuhong Zhang","link":"http://arxiv.org/abs/2410.13903v1","abstract":"Proprietary large language models (LLMs) demonstrate exceptional\ngeneralization ability across various tasks. Additionally, deploying LLMs on\nedge devices is trending for efficiency and privacy reasons. However, edge\ndeployment of proprietary LLMs introduces new security threats: attackers who\nobtain an edge-deployed LLM can easily use it as a base model for various tasks\ndue to its high generalization ability, which we call foundational capability\nstealing. Unfortunately, existing model protection mechanisms are often\ntask-specific and fail to protect general-purpose LLMs, as they mainly focus on\nprotecting task-related parameters using trusted execution environments (TEEs).\nAlthough some recent TEE-based methods are able to protect the overall model\nparameters in a computation-efficient way, they still suffer from prohibitive\ncommunication costs between TEE and CPU/GPU, making it impractical to deploy\nfor edge LLMs. To protect the foundational capabilities of edge LLMs, we\npropose CoreGuard, a computation- and communication-efficient model protection\napproach against model stealing on edge devices. The core component of\nCoreGuard is a lightweight and propagative authorization module residing in\nTEE. Extensive experiments show that CoreGuard achieves the same security\nprotection as the black-box security guarantees with negligible overhead."},{"date":"2024-10","title":"Identity-Focused Inference and Extraction Attacks on Diffusion Models","author":"Jayneel Vora, Aditya Krishnan, Nader Bouacida, Prabhu RV Shankar, and Prasant Mohapatra","link":"http://arxiv.org/abs/2410.10177v1","abstract":"The increasing reliance on diffusion models for generating synthetic images\nhas amplified concerns about the unauthorized use of personal data,\nparticularly facial images, in model training. In this paper, we introduce a\nnovel identity inference framework to hold model owners accountable for\nincluding individuals' identities in their training data. Our approach moves\nbeyond traditional membership inference attacks by focusing on identity-level\ninference, providing a new perspective on data privacy violations. 
Through\ncomprehensive evaluations on two facial image datasets, Labeled Faces in the\nWild (LFW) and CelebA, our experiments demonstrate that the proposed membership\ninference attack surpasses baseline methods, achieving an attack success rate\nof up to 89% and an AUC-ROC of 0.91, while the identity inference attack\nattains 92% on LDM models trained on LFW, and the data extraction attack\nachieves 91.6% accuracy on DDPMs, validating the effectiveness of our approach\nacross diffusion models."},{"date":"2024-10","title":"Enabling Advanced Land Cover Analytics: An Integrated Data Extraction Pipeline for Predictive Modeling with the Dynamic World Dataset","author":"Victor Radermecker, Andrea Zanon, Nancy Thomas, Annita Vapsi, Saba Rahimi, Rama Ramakrishnan, and Daniel Borrajo","link":"http://arxiv.org/abs/2410.09135v1","abstract":"Understanding land cover holds considerable potential for a myriad of\npractical applications, particularly as data accessibility transitions from\nbeing exclusive to governmental and commercial entities to now including the\nbroader research community. Nevertheless, although the data is accessible to\nany community member interested in exploration, there exists a formidable\nlearning curve and no standardized process for accessing, pre-processing, and\nleveraging the data for subsequent tasks. In this study, we democratize this\ndata by presenting a flexible and efficient end-to-end pipeline for working\nwith the Dynamic World dataset, a cutting-edge near-real-time land use/land\ncover (LULC) dataset. This includes a pre-processing and representation\nframework which tackles noise removal, efficient extraction of large amounts of\ndata, and re-representation of LULC data in a format well suited for several\ndownstream tasks. To demonstrate the power of our pipeline, we use it to\nextract data for an urbanization prediction problem and build a suite of\nmachine learning models with excellent performance. This task is easily\ngeneralizable to the prediction of any type of land cover and our pipeline is\nalso compatible with a series of other downstream tasks."},{"date":"2024-10","title":"Contrastive Learning to Fine-Tune Feature Extraction Models for the Visual Cortex","author":"Alex Mulrooney, and Austin J. Brockmeier","link":"http://arxiv.org/abs/2410.06067v1","abstract":"Predicting the neural response to natural images in the visual cortex\nrequires extracting relevant features from the images and relating those\nfeatures to the observed responses. In this work, we optimize the feature\nextraction in order to maximize the information shared between the image\nfeatures and the neural response across voxels in a given region of interest\n(ROI) extracted from the BOLD signal measured by fMRI. We adapt contrastive\nlearning (CL) to fine-tune a convolutional neural network, which was pretrained\nfor image classification, such that the mapping of a given image's features is\nmore similar to the corresponding fMRI response than to the responses to other\nimages. We exploit the recently released Natural Scenes Dataset (Allen et al.,\n2022) as organized for the Algonauts Project (Gifford et al., 2023), which\ncontains the high-resolution fMRI responses of eight subjects to tens of\nthousands of naturalistic images. 
We show that CL fine-tuning creates feature\nextraction models that enable higher encoding accuracy in early visual ROIs as\ncompared to both the pretrained network and a baseline approach that uses a\nregression loss at the output of the network to tune it for fMRI response\nencoding. We investigate inter-subject transfer of the CL fine-tuned models,\nincluding subjects from another, lower-resolution dataset (Gong et al., 2023).\nWe also pool subjects for fine-tuning to further improve the encoding\nperformance. Finally, we examine the performance of the fine-tuned models on\ncommon image classification tasks, explore the landscape of ROI-specific models\nby applying dimensionality reduction on the Bhattacharya dissimilarity matrix\ncreated using the predictions on those tasks (Mao et al., 2024), and\ninvestigate lateralization of the processing for early visual ROIs using\nsalience maps of the classifiers built on the CL-tuned models."},{"date":"2024-10","title":"Polynomial Time Cryptanalytic Extraction of Deep Neural Networks in the Hard-Label Setting","author":"Nicholas Carlini, Jorge Ch\u00e1vez-Saab, Anna Hambitzer, Francisco Rodr\u00edguez-Henr\u00edquez, and Adi Shamir","link":"http://arxiv.org/abs/2410.05750v1","abstract":"Deep neural networks (DNNs) are valuable assets, yet their public\naccessibility raises security concerns about parameter extraction by malicious\nactors. Recent work by Carlini et al. (crypto'20) and Canales-Mart\\'inez et al.\n(eurocrypt'24) has drawn parallels between this issue and block cipher key\nextraction via chosen plaintext attacks. Leveraging differential cryptanalysis,\nthey demonstrated that all the weights and biases of black-box ReLU-based DNNs\ncould be inferred using a polynomial number of queries and computational time.\nHowever, their attacks relied on the availability of the exact numeric value of\noutput logits, which allowed the calculation of their derivatives. To overcome\nthis limitation, Chen et al. (asiacrypt'24) tackled the more realistic\nhard-label scenario, where only the final classification label (e.g., \"dog\" or\n\"car\") is accessible to the attacker. They proposed an extraction method\nrequiring a polynomial number of queries but an exponential execution time. In\naddition, their approach was applicable only to a restricted set of\narchitectures, could deal only with binary classifiers, and was demonstrated\nonly on tiny neural networks with up to four neurons split among up to two\nhidden layers. This paper introduces new techniques that, for the first time,\nachieve cryptanalytic extraction of DNN parameters in the most challenging\nhard-label setting, using both a polynomial number of queries and polynomial\ntime. We validate our approach by extracting nearly one million parameters from\na DNN trained on the CIFAR-10 dataset, comprising 832 neurons in four hidden\nlayers. Our results reveal the surprising fact that all the weights of a\nReLU-based DNN can be efficiently determined by analyzing only the geometric\nshape of its decision boundaries."},{"date":"2024-10","title":"Multiscale Latent Diffusion Model for Enhanced Feature Extraction from Medical Images","author":"Rabeya Tus Sadia, Jie Zhang, and Jin Chen","link":"http://arxiv.org/abs/2410.04000v2","abstract":"Various imaging modalities are used in patient diagnosis, each offering\nunique advantages and valuable insights into anatomy and pathology. 
Computed\nTomography (CT) is crucial in diagnostics, providing high-resolution images for\nprecise internal organ visualization. CT's ability to detect subtle tissue\nvariations is vital for diagnosing diseases like lung cancer, enabling early\ndetection and accurate tumor assessment. However, variations in CT scanner\nmodels and acquisition protocols introduce significant variability in the\nextracted radiomic features, even when imaging the same patient. This\nvariability poses considerable challenges for downstream research and clinical\nanalysis, which depend on consistent and reliable feature extraction. Current\nmethods for medical image feature extraction, often based on supervised\nlearning approaches, including GAN-based models, face limitations in\ngeneralizing across different imaging environments. In response to these\nchallenges, we propose LTDiff++, a multiscale latent diffusion model designed\nto enhance feature extraction in medical imaging. The model addresses\nvariability by standardizing non-uniform distributions in the latent space,\nimproving feature consistency. LTDiff++ utilizes a UNet++ encoder-decoder\narchitecture coupled with a conditional Denoising Diffusion Probabilistic Model\n(DDPM) at the latent bottleneck to achieve robust feature extraction and\nstandardization. Extensive empirical evaluations on both patient and phantom CT\ndatasets demonstrate significant improvements in image standardization, with\nhigher Concordance Correlation Coefficients (CCC) across multiple radiomic\nfeature categories. Through these advancements, LTDiff++ represents a promising\nsolution for overcoming the inherent variability in medical imaging data,\noffering improved reliability and accuracy in feature extraction processes."},{"date":"2024-10","title":"A Novel Feature Extraction Model for the Detection of Plant Disease from Leaf Images in Low Computational Devices","author":"Rikathi Pal, Anik Basu Bhaumik, Arpan Murmu, Sanoar Hossain, Biswajit Maity, and Soumya Sen","link":"http://arxiv.org/abs/2410.01854v1","abstract":"Diseases in plants cause significant danger to productive and secure\nagriculture. Plant diseases can be detected early and accurately, reducing crop\nlosses and pesticide use. Traditional methods of plant disease identification,\non the other hand, are generally time-consuming and require professional\nexpertise. It would be beneficial to the farmers if they could detect the\ndisease quickly by taking images of the leaf directly. This will be a\ntime-saving process and they can take remedial actions immediately. To achieve\nthis, a novel feature extraction approach for detecting tomato plant illnesses\nfrom leaf photos using low-cost computing systems such as mobile phones is\nproposed in this study. The proposed approach integrates various types of Deep\nLearning techniques to extract robust and discriminative features from leaf\nimages. After the proposed feature extraction, comparisons have been made on\nfive cutting-edge deep learning models: AlexNet, ResNet50, VGG16, VGG19, and\nMobileNet. The dataset contains 10,000 leaf photos from ten classes of tomato\nillnesses and one class of healthy leaves. 
Experimental findings demonstrate\nthat AlexNet has an accuracy score of 87%, with the benefit of being quick and\nlightweight, making it appropriate for use on embedded systems and other\nlow-processing devices like smartphones."},{"date":"2024-10","title":"Preserving Generalization of Language models in Few-shot Continual Relation Extraction","author":"Quyen Tran, Nguyen Xuan Thanh, Nguyen Hoang Anh, Nam Le Hai, Trung Le, Linh Van Ngo, and Thien Huu Nguyen","link":"http://arxiv.org/abs/2410.00334v1","abstract":"Few-shot Continual Relations Extraction (FCRE) is an emerging and dynamic\narea of study where models can sequentially integrate knowledge from new\nrelations with limited labeled data while circumventing catastrophic forgetting\nand preserving prior knowledge from pre-trained backbones. In this work, we\nintroduce a novel method that leverages often-discarded language model heads.\nBy employing these components via a mutual information maximization strategy,\nour approach helps maintain prior knowledge from the pre-trained backbone and\nstrategically aligns the primary classification head, thereby enhancing model\nperformance. Furthermore, we explore the potential of Large Language Models\n(LLMs), renowned for their wealth of knowledge, in addressing FCRE challenges.\nOur comprehensive experimental results underscore the efficacy of the proposed\nmethod and offer valuable insights for future work."},{"date":"2024-09","title":"Towards Robust Extractive Question Answering Models: Rethinking the Training Methodology","author":"Son Quoc Tran, and Matt Kretchmar","link":"http://arxiv.org/abs/2409.19766v1","abstract":"This paper proposes a novel training method to improve the robustness of\nExtractive Question Answering (EQA) models. Previous research has shown that\nexisting models, when trained on EQA datasets that include unanswerable\nquestions, demonstrate a significant lack of robustness against distribution\nshifts and adversarial attacks. Despite this, the inclusion of unanswerable\nquestions in EQA training datasets is essential for ensuring real-world\nreliability. Our proposed training method includes a novel loss function for\nthe EQA problem and challenges an implicit assumption present in numerous EQA\ndatasets. Models trained with our method maintain in-domain performance while\nachieving a notable improvement on out-of-domain datasets. This results in an\noverall F1 score improvement of 5.7 across all testing sets. Furthermore, our\nmodels exhibit significantly enhanced robustness against two types of\nadversarial attacks, with a performance decrease of only about a third compared\nto the default models."},{"date":"2024-09","title":"INSIGHTBUDDY-AI: Medication Extraction and Entity Linking using Large Language Models and Ensemble Learning","author":"Pablo Romero, Lifeng Han, and Goran Nenadic","link":"http://arxiv.org/abs/2409.19467v1","abstract":"Medication Extraction and Mining play an important role in healthcare NLP\nresearch due to its practical applications in hospital settings, such as their\nmapping into standard clinical knowledge bases (SNOMED-CT, BNF, etc.). In this\nwork, we investigate state-of-the-art LLMs in text mining tasks on medications\nand their related attributes such as dosage, route, strength, and adverse\neffects. In addition, we explore different ensemble learning methods\n(\\textsc{Stack-Ensemble} and \\textsc{Voting-Ensemble}) to augment the model\nperformances from individual LLMs. 
Our ensemble learning result demonstrated\nbetter performances than individually fine-tuned base models BERT, RoBERTa,\nRoBERTa-L, BioBERT, BioClinicalBERT, BioMedRoBERTa, ClinicalBERT, and\nPubMedBERT across general and specific domains. Finally, we build up an entity\nlinking function to map extracted medical terminologies into the SNOMED-CT\ncodes and the British National Formulary (BNF) codes, which are further mapped\nto the Dictionary of Medicines and Devices (dm+d), and ICD. Our model's toolkit\nand desktop applications are publicly available at\n\\url{https://github.com/HECTA-UoM/ensemble-NER}."},{"date":"2024-09","title":"Semi-strong Efficient Market of Bitcoin and Twitter: an Analysis of Semantic Vector Spaces of Extracted Keywords and Light Gradient Boosting Machine Models","author":"Fang Wang, and Marko Gacesa","link":"http://arxiv.org/abs/2409.15988v1","abstract":"This study extends the examination of the Efficient-Market Hypothesis in\nBitcoin market during a five year fluctuation period, from September 1 2017 to\nSeptember 1 2022, by analyzing 28,739,514 qualified tweets containing the\ntargeted topic \"Bitcoin\". Unlike previous studies, we extracted fundamental\nkeywords as an informative proxy for carrying out the study of the EMH in the\nBitcoin market rather than focusing on sentiment analysis, information volume,\nor price data. We tested market efficiency in hourly, 4-hourly, and daily time\nperiods to understand the speed and accuracy of market reactions towards the\ninformation within different thresholds. A sequence of machine learning methods\nand textual analyses were used, including measurements of distances of semantic\nvector spaces of information, keywords extraction and encoding model, and Light\nGradient Boosting Machine (LGBM) classifiers. Our results suggest that 78.06%\n(83.08%), 84.63% (87.77%), and 94.03% (94.60%) of hourly, 4-hourly, and daily\nbullish (bearish) market movements can be attributed to public information\nwithin organic tweets."},{"date":"2024-09","title":"ASTE Transformer Modelling Dependencies in Aspect-Sentiment Triplet Extraction","author":"Iwo Naglik, and Mateusz Lango","link":"http://arxiv.org/abs/2409.15202v2","abstract":"Aspect-Sentiment Triplet Extraction (ASTE) is a recently proposed task of\naspect-based sentiment analysis that consists in extracting (aspect phrase,\nopinion phrase, sentiment polarity) triples from a given sentence. Recent\nstate-of-the-art methods approach this task by first extracting all possible\ntext spans from a given text, then filtering the potential aspect and opinion\nphrases with a classifier, and finally considering all their pairs with another\nclassifier that additionally assigns sentiment polarity to them. Although\nseveral variations of the above scheme have been proposed, the common feature\nis that the final result is constructed by a sequence of independent classifier\ndecisions. This hinders the exploitation of dependencies between extracted\nphrases and prevents the use of knowledge about the interrelationships between\nclassifier predictions to improve performance. In this paper, we propose a new\nASTE approach consisting of three transformer-inspired layers, which enables\nthe modelling of dependencies both between phrases and between the final\nclassifier decisions. Experimental results show that the method achieves higher\nperformance in terms of F1 measure than other methods studied on popular\nbenchmarks. 
In addition, we show that a simple pre-training technique further\nimproves the performance of the model."},{"date":"2024-09","title":"Efficient and Effective Model Extraction","author":"Hongyu Zhu, Wentao Hu, Sichu Liang, Fangqi Li, Wenwen Wang, and Shilin Wang","link":"http://arxiv.org/abs/2409.14122v2","abstract":"Model extraction aims to create a functionally similar copy from a machine\nlearning as a service (MLaaS) API with minimal overhead, typically for illicit\nprofit or as a precursor to further attacks, posing a significant threat to the\nMLaaS ecosystem. However, recent studies have shown that model extraction is\nhighly inefficient, particularly when the target task distribution is\nunavailable. In such cases, even substantially increasing the attack budget\nfails to produce a sufficiently similar replica, reducing the adversary's\nmotivation to pursue extraction attacks. In this paper, we revisit the\nelementary design choices throughout the extraction lifecycle. We propose an\nembarrassingly simple yet dramatically effective algorithm, Efficient and\nEffective Model Extraction (E3), focusing on both query preparation and\ntraining routine. E3 achieves superior generalization compared to\nstate-of-the-art methods while minimizing computational costs. For instance,\nwith only 0.005 times the query budget and less than 0.2 times the runtime, E3\noutperforms classical generative model based data-free model extraction by an\nabsolute accuracy improvement of over 50% on CIFAR-10. Our findings underscore\nthe persistent threat posed by model extraction and suggest that it could serve\nas a valuable benchmarking algorithm for future security evaluations."},{"date":"2024-09","title":"Hard-Label Cryptanalytic Extraction of Neural Network Models","author":"Yi Chen, Xiaoyang Dong, Jian Guo, Yantian Shen, Anyu Wang, and Xiaoyun Wang","link":"http://arxiv.org/abs/2409.11646v1","abstract":"The machine learning problem of extracting neural network parameters has been\nproposed for nearly three decades. Functionally equivalent extraction is a\ncrucial goal for research on this problem. When the adversary has access to the\nraw output of neural networks, various attacks, including those presented at\nCRYPTO 2020 and EUROCRYPT 2024, have successfully achieved this goal. However,\nthis goal is not achieved when neural networks operate under a hard-label\nsetting where the raw output is inaccessible.\n In this paper, we propose the first attack that theoretically achieves\nfunctionally equivalent extraction under the hard-label setting, which applies\nto ReLU neural networks. The effectiveness of our attack is validated through\npractical experiments on a wide range of ReLU neural networks, including neural\nnetworks trained on two real benchmarking datasets (MNIST, CIFAR10) widely used\nin computer vision. For a neural network consisting of $10^5$ parameters, our\nattack only requires several hours on a single core."},{"date":"2024-09","title":"CaBaGe: Data-Free Model Extraction using ClAss BAlanced Generator Ensemble","author":"Jonathan Rosenthal, Shanchao Liang, Kevin Zhang, and Lin Tan","link":"http://arxiv.org/abs/2409.10643v1","abstract":"Machine Learning as a Service (MLaaS) is often provided as a pay-per-query,\nblack-box system to clients. Such a black-box approach not only hinders open\nreplication, validation, and interpretation of model results, but also makes it\nharder for white-hat researchers to identify vulnerabilities in the MLaaS\nsystems. 
Model extraction is a promising technique to address these challenges\nby reverse-engineering black-box models. Since training data is typically\nunavailable for MLaaS models, this paper focuses on the realistic version of\nit: data-free model extraction. We propose a data-free model extraction\napproach, CaBaGe, to achieve higher model extraction accuracy with a small\nnumber of queries. Our innovations include (1) a novel experience replay for\nfocusing on difficult training samples; (2) an ensemble of generators for\nsteadily producing diverse synthetic data; and (3) a selective filtering\nprocess for querying the victim model with harder, more balanced samples. In\naddition, we create a more realistic setting, for the first time, where the\nattacker has no knowledge of the number of classes in the victim training data,\nand create a solution to learn the number of classes on the fly. Our evaluation\nshows that CaBaGe outperforms existing techniques on seven datasets -- MNIST,\nFMNIST, SVHN, CIFAR-10, CIFAR-100, ImageNet-subset, and Tiny ImageNet -- with\nan accuracy improvement of the extracted models by up to 43.13%. Furthermore,\nthe number of queries required to extract a clone model matching the final\naccuracy of prior work is reduced by up to 75.7%."},{"date":"2024-09","title":"Language Models and Retrieval Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports","author":"Mohamed Sobhi Jabal, Pranav Warman, Jikai Zhang, Kartikeye Gupta, Ayush Jain, Maciej Mazurowski, Walter Wiggins, Kirti Magudia, and Evan Calabrese","link":"http://arxiv.org/abs/2409.10576v2","abstract":"Purpose: To develop and evaluate an automated system for extracting\nstructured clinical information from unstructured radiology and pathology\nreports using open-weights large language models (LMs) and retrieval augmented\ngeneration (RAG), and to assess the effects of model configuration variables on\nextraction performance. Methods and Materials: The study utilized two datasets:\n7,294 radiology reports annotated for Brain Tumor Reporting and Data System\n(BT-RADS) scores and 2,154 pathology reports annotated for isocitrate\ndehydrogenase (IDH) mutation status. An automated pipeline was developed to\nbenchmark the performance of various LMs and RAG configurations. The impact of\nmodel size, quantization, prompting strategies, output formatting, and\ninference parameters was systematically evaluated. Results: The best performing\nmodels achieved over 98% accuracy in extracting BT-RADS scores from radiology\nreports and over 90% for IDH mutation status extraction from pathology reports.\nThe top model was a medically fine-tuned llama3. Larger, newer, and domain\nfine-tuned models consistently outperformed older and smaller models. Model\nquantization had minimal impact on performance. Few-shot prompting\nsignificantly improved accuracy. RAG improved performance for complex pathology\nreports but not for shorter radiology reports. Conclusions: Open LMs\ndemonstrate significant potential for automated extraction of structured\nclinical data from unstructured clinical reports with local privacy-preserving\napplication. Careful model selection, prompt engineering, and semi-automated\noptimization using annotated data are critical for optimal performance. 
These\napproaches could be reliable enough for practical use in research workflows,\nhighlighting the potential for human-machine collaboration in healthcare data\nextraction."},{"date":"2024-09","title":"TSELM: Target Speaker Extraction using Discrete Tokens and Language Models","author":"Beilong Tang, Bang Zeng, and Ming Li","link":"http://arxiv.org/abs/2409.07841v3","abstract":"We propose TSELM, a novel target speaker extraction network that leverages\ndiscrete tokens and language models. TSELM utilizes multiple discretized layers\nfrom WavLM as input tokens and incorporates cross-attention mechanisms to\nintegrate target speaker information. Language models are employed to capture\nthe sequence dependencies, while a scalable HiFi-GAN is used to reconstruct the\naudio from the tokens. By applying a cross-entropy loss, TSELM models the\nprobability distribution of output tokens, thus converting the complex\nregression problem of audio generation into a classification task. Experimental\nresults show that TSELM achieves excellent results in speech quality and\ncomparable results in speech intelligibility."},{"date":"2024-09","title":"Alignment-Aware Model Extraction Attacks on Large Language Models","author":"Zi Liang, Qingqing Ye, Yanyun Wang, Sen Zhang, Yaxin Xiao, Ronghua Li, Jianliang Xu, and Haibo Hu","link":"http://arxiv.org/abs/2409.02718v1","abstract":"Model extraction attacks (MEAs) on large language models (LLMs) have received\nincreasing research attention lately. Existing attack methods on LLMs inherit\nthe extraction strategies from those designed for deep neural networks (DNNs)\nyet neglect the inconsistency of training tasks between MEA and LLMs'\nalignments. As such, they result in poor attack performances. To tackle this\nissue, we present Locality Reinforced Distillation (LoRD), a novel model\nextraction attack algorithm specifically for LLMs. In particular, we design a\npolicy-gradient-style training task, which utilizes victim models' responses as\na signal to guide the crafting of preference for the local model. Theoretical\nanalysis has shown that i) LoRD's convergence procedure in MEAs is consistent\nwith the alignments of LLMs, and ii) LoRD can reduce query complexity while\nmitigating watermark protection through exploration-based stealing. Extensive\nexperiments on domain-specific extractions demonstrate the superiority of our\nmethod by examining the extraction of various state-of-the-art commercial LLMs."},{"date":"2024-09","title":"AdaComp: Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models","author":"Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, and Zhiming Zheng","link":"http://arxiv.org/abs/2409.01579v1","abstract":"Retrieved documents containing noise will hinder RAG from detecting answer\nclues and make the inference process slow and expensive. Therefore, context\ncompression is necessary to enhance its accuracy and efficiency. Existing\ncontext compression methods use extractive or generative models to retain the\nmost query-relevant sentences or apply the information bottleneck theory to\npreserve sufficient information. However, these methods may face issues such as\nover-compression or high computational costs. 
We observe that the retriever\noften ranks relevant documents at the top, but the exact number of documents\nneeded to answer the query is uncertain due to the impact of query complexity\nand retrieval quality: complex queries like multi-hop questions may require\nretaining more documents than simpler queries, and a low-quality retrieval may\nneed to rely on more documents to generate accurate outputs. Therefore,\ndetermining the minimum number of required documents (compression rate) is\nstill a challenge for RAG. In this paper, we introduce AdaComp, a low-cost\nextractive context compression method that adaptively determines the\ncompression rate based on both query complexity and retrieval quality.\nSpecifically, we first annotate the minimum top-k documents necessary for the\nRAG system to answer the current query as the compression rate and then\nconstruct triplets of the query, retrieved documents, and its compression rate.\nThen, we use this triplet dataset to train a compression-rate predictor.\nExperiments on three QA datasets and one conversational Multi-doc QA dataset\nshow that AdaComp significantly reduces inference costs while maintaining\nperformance nearly identical to uncompressed models, achieving a balance\nbetween efficiency and performance."},{"date":"2024-08","title":"Extraction of Research Objectives, Machine Learning Model Names, and Dataset Names from Academic Papers and Analysis of Their Interrelationships Using LLM and Network Analysis","author":"S. Nishio, H. Nonaka, N. Tsuchiya, A. Migita, Y. Banno, T. Hayashi, H. Sakaji, T. Sakumoto, and K. Watabe","link":"http://arxiv.org/abs/2408.12097v1","abstract":"Machine learning is widely utilized across various industries. Identifying\nthe appropriate machine learning models and datasets for specific tasks is\ncrucial for the effective industrial application of machine learning. However,\nthis requires expertise in both machine learning and the relevant domain,\nleading to a high learning cost. Therefore, research focused on extracting\ncombinations of tasks, machine learning models, and datasets from academic\npapers is critically important, as it can facilitate the automatic\nrecommendation of suitable methods. Conventional information extraction methods\nfrom academic papers have been limited to identifying machine learning models\nand other entities as named entities. To address this issue, this study\nproposes a methodology for extracting tasks, machine learning methods, and dataset\nnames from scientific papers and analyzing the relationships among these\ntypes of information by using an LLM, an embedding model, and network clustering. The proposed\nmethod's expression extraction performance, when using Llama3, achieves an\nF-score exceeding 0.8 across various categories, confirming its practical\nutility. Benchmarking results on financial domain papers have demonstrated the\neffectiveness of this method, providing insights into the use of the latest\ndatasets, including those related to ESG (Environmental, Social, and\nGovernance) data."},{"date":"2024-08","title":"JieHua Paintings Style Feature Extracting Model using Stable Diffusion with ControlNet","author":"Yujia Gu, Haofeng Li, Xinyu Fang, Zihan Peng, and Yinan Peng","link":"http://arxiv.org/abs/2408.11744v1","abstract":"This study proposes a novel approach to extract stylistic features of Jiehua:\nthe utilization of the Fine-tuned Stable Diffusion Model with ControlNet\n(FSDMC) to refine depiction techniques from artists' Jiehua. 
The training data\nfor FSDMC is based on open-source Jiehua artists' works collected from the\nInternet, which were subsequently manually constructed in the format of\n(Original Image, Canny Edge Features, Text Prompt). By employing the optimal\nhyperparameters identified in this paper, it was observed that FSDMC outperforms\nCycleGAN, another mainstream style transfer model. FSDMC achieves an FID of 3.27\non the dataset and also surpasses CycleGAN in terms of expert evaluation. This\nnot only demonstrates the model's high effectiveness in extracting Jiehua's\nstyle features, but also preserves the original pre-trained semantic\ninformation. The findings of this study suggest that the application of FSDMC\nwith appropriate hyperparameters can enhance the efficacy of the Stable\nDiffusion Model in the field of traditional art style migration tasks,\nparticularly within the context of Jiehua."},{"date":"2024-08","title":"Extracting polygonal footprints in off-nadir images with Segment Anything Model","author":"Kai Li, Jingbo Chen, Yupeng Deng, Yu Meng, Diyou Liu, Junxian Ma, Chenhao Wang, and Xiangyu Zhao","link":"http://arxiv.org/abs/2408.08645v2","abstract":"Building Footprint Extraction (BFE) from off-nadir aerial images often\ninvolves roof segmentation and offset prediction to adjust roof boundaries to\nthe building footprint. However, this multi-stage approach typically produces\nlow-quality results, limiting its applicability in real-world data production.\nTo address this issue, we present OBMv2, an end-to-end and promptable model for\npolygonal footprint prediction. Unlike its predecessor OBM, OBMv2 introduces a\nnovel Self Offset Attention (SOFA) mechanism that improves performance across\ndiverse building types, from bungalows to skyscrapers, enabling end-to-end\nfootprint prediction without post-processing. Additionally, we propose a\nMulti-level Information System (MISS) to effectively leverage roof masks,\nbuilding masks, and offsets for accurate footprint prediction. We evaluate\nOBMv2 on the BONAI and OmniCity-view3 datasets and demonstrate its\ngeneralization on the Huizhou test set. The code will be available at\nhttps://github.com/likaiucas/OBMv2."},{"date":"2024-08","title":"Extracting Sentence Embeddings from Pretrained Transformer Models","author":"Lukas Stankevi\u010dius, and Mantas Luko\u0161evi\u010dius","link":"http://arxiv.org/abs/2408.08073v1","abstract":"Background/introduction: Pre-trained transformer models shine in many natural\nlanguage processing tasks and therefore are expected to bear the representation\nof the input sentence or text meaning. These sentence-level embeddings are also\nimportant in retrieval-augmented generation. But do commonly used plain\naveraging or prompt templates surface it enough?\n Methods: Given a 110M-parameter BERT's hidden representations from multiple\nlayers and multiple tokens, we tried various ways to extract optimal sentence\nrepresentations. We tested various token aggregation and representation\npost-processing techniques. We also tested multiple ways of using a general\nWikitext dataset to complement BERT's sentence representations. All methods were\ntested on 8 Semantic Textual Similarity (STS), 6 short text clustering, and 12\nclassification tasks. We also evaluated our representation-shaping techniques\non other static models, including random token representations.\n Results: Proposed representation extraction methods improved the performance\non STS and clustering tasks for all models considered. 
Improvements are very high\nfor static token-based models; in particular, random embeddings for STS tasks\nalmost reach the performance of BERT-derived representations.\n Conclusions: Our work shows that for multiple tasks, simple baselines with\nrepresentation shaping techniques reach or even outperform more complex\nBERT-based models or are able to contribute to their performance."},{"date":"2024-08","title":"Evaluating Large Language Model based Personal Information Extraction and Countermeasures","author":"Yupei Liu, Yuqi Jia, Jinyuan Jia, and Neil Zhenqiang Gong","link":"http://arxiv.org/abs/2408.07291v1","abstract":"Automatically extracting personal information--such as name, phone number,\nand email address--from publicly available profiles at a large scale is a\nstepping stone to many other security attacks including spear phishing. Traditional\nmethods--such as regular expression, keyword search, and entity\ndetection--achieve limited success at such personal information extraction. In\nthis work, we perform a systematic measurement study to benchmark large\nlanguage model (LLM) based personal information extraction and countermeasures.\nTowards this goal, we present a framework for LLM-based extraction attacks;\ncollect three datasets including a synthetic dataset generated by GPT-4 and two\nreal-world datasets with 8 manually labeled categories of personal information;\nintroduce a novel mitigation strategy based on \\emph{prompt injection}; and\nsystematically benchmark LLM-based attacks and countermeasures using 10 LLMs\nand our 3 datasets. Our key findings include: LLMs can be misused by attackers\nto accurately extract various personal information from personal profiles; LLMs\noutperform conventional methods at such extraction; and prompt injection can\nmitigate such risk to a large extent and outperforms conventional\ncountermeasures. Our code and data are available at:\n\\url{https://github.com/liu00222/LLM-Based-Personal-Profile-Extraction}."},{"date":"2024-08","title":"Automatic Feature Recognition and Dimensional Attributes Extraction From CAD Models for Hybrid Additive-Subtractive Manufacturing","author":"Muhammad Tayyab Khan, Wenhe Feng, Lequn Chen, Ye Han Ng, Nicholas Yew Jin Tan, and Seung Ki Moon","link":"http://arxiv.org/abs/2408.06891v2","abstract":"The integration of Computer-Aided Design (CAD), Computer-Aided Process\nPlanning (CAPP), and Computer-Aided Manufacturing (CAM) plays a crucial role in\nmodern manufacturing, facilitating seamless transitions from digital designs to\nphysical products. However, a significant challenge within this integration is\nthe Automatic Feature Recognition (AFR) of CAD models, especially in the\ncontext of hybrid manufacturing that combines subtractive and additive\nmanufacturing processes. Traditional AFR methods, focused mainly on the\nidentification of subtractive (machined) features including holes, fillets,\nchamfers, pockets, and slots, fail to recognize features pertinent to additive\nmanufacturing. Furthermore, the traditional methods fall short in accurately\nextracting geometric dimensions and orientations, which are also key factors\nfor effective manufacturing process planning. This paper presents a novel\napproach for creating a synthetic CAD dataset that encompasses features\nrelevant to both additive and subtractive machining through Python Open\nCascade. 
The Hierarchical Graph Convolutional Neural Network (HGCNN) model is\nimplemented to accurately identify the composite additive-subtractive features\nwithin the synthetic CAD dataset. The key novelty and contribution of the\nproposed methodology lie in its ability to recognize a wide range of\nmanufacturing features and to precisely extract their dimensions,\norientations, and stock sizes. The proposed model demonstrates remarkable\nfeature recognition accuracy exceeding 97% and a dimension extraction accuracy\nof 100% for identified features. Therefore, the proposed methodology enhances\nthe integration of CAD, CAPP, and CAM within hybrid manufacturing by providing\nprecise feature recognition and dimension extraction. It facilitates improved\nmanufacturing process planning by enabling more informed decision-making."},{"date":"2024-08","title":"Target Prompting for Information Extraction with Vision Language Model","author":"Dipankar Medhi","link":"http://arxiv.org/abs/2408.03834v1","abstract":"The recent trend in Large Vision and Language models has brought a new\nchange in how information extraction systems are built. VLMs have set a new\nbenchmark with their state-of-the-art techniques in understanding documents and\nbuilding question-answering systems across various industries. They are\nsignificantly better at generating text from document images and providing\naccurate answers to questions. However, there are still some challenges in\neffectively utilizing these models to build a precise conversational system.\nGeneral prompting techniques used with large language models are often not\nsuitable for these specially designed vision language models. The output\ngenerated by such generic input prompts is ordinary and may contain information\ngaps when compared with the actual content of the document. To obtain more\naccurate and specific answers, a well-targeted prompt is required by the vision\nlanguage model, along with the document image. In this paper, a technique\ncalled Target Prompting is discussed, which focuses on explicitly targeting parts\nof document images and generating related answers from those specific regions\nonly. The paper also covers the evaluation of responses for each prompting\ntechnique using different user queries and input prompts."},{"date":"2024-08","title":"Local Topology Measures of Contextual Language Model Latent Spaces With Applications to Dialogue Term Extraction","author":"Benjamin Matthias Ruppik, Michael Heck, Carel van Niekerk, Renato Vukovic, Hsien-chin Lin, Shutong Feng, Marcus Zibrowius, and Milica Ga\u0161i\u0107","link":"http://arxiv.org/abs/2408.03706v1","abstract":"A common approach for sequence tagging tasks based on contextual word\nrepresentations is to train a machine learning classifier directly on these\nembedding vectors. This approach has two shortcomings. First, such methods\nconsider single input sequences in isolation and are unable to put an\nindividual embedding vector in relation to vectors outside the current local\ncontext of use. Second, the high performance of these models relies on\nfine-tuning the embedding model in conjunction with the classifier, which may\nnot always be feasible due to the size or inaccessibility of the underlying\nfeature-generation model. It is thus desirable, given a collection of embedding\nvectors of a corpus, i.e., a datastore, to find features of each vector that\ndescribe its relation to other, similar vectors in the datastore. 
With this in\nmind, we introduce complexity measures of the local topology of the latent\nspace of a contextual language model with respect to a given datastore. The\neffectiveness of our features is demonstrated through their application to\ndialogue term extraction. Our work continues a line of research that explores\nthe manifold hypothesis for word embeddings, demonstrating that local structure\nin the space carved out by word embeddings can be exploited to infer semantic\nproperties."},{"date":"2024-08","title":"Modelling Visual Semantics via Image Captioning to extract Enhanced Multi-Level Cross-Modal Semantic Incongruity Representation with Attention for Multimodal Sarcasm Detection","author":"Sajal Aggarwal, Ananya Pandey, and Dinesh Kumar Vishwakarma","link":"http://arxiv.org/abs/2408.02595v1","abstract":"Sarcasm is a type of irony, characterized by an inherent mismatch between the\nliteral interpretation and the intended connotation. Though sarcasm detection\nin text has been extensively studied, there are situations in which textual\ninput alone might be insufficient to perceive sarcasm. The inclusion of\nadditional contextual cues, such as images, is essential to recognize sarcasm\nin social media data effectively. This study presents a novel framework for\nmultimodal sarcasm detection that can process input triplets. Two components of\nthese triplets comprise the input text and its associated image, as provided in\nthe datasets. Additionally, a supplementary modality is introduced in the form\nof descriptive image captions. The motivation behind incorporating this visual\nsemantic representation is to more accurately capture the discrepancies between\nthe textual and visual content, which are fundamental to the sarcasm detection\ntask. The primary contributions of this study are: (1) a robust textual feature\nextraction branch that utilizes a cross-lingual language model; (2) a visual\nfeature extraction branch that incorporates a self-regulated residual ConvNet\nintegrated with a lightweight spatially aware attention module; (3) an\nadditional modality in the form of image captions generated using an\nencoder-decoder architecture capable of reading text embedded in images; (4)\ndistinct attention modules to effectively identify the incongruities between\nthe text and two levels of image representations; (5) multi-level cross-domain\nsemantic incongruity representation achieved through feature fusion. Compared\nwith cutting-edge baselines, the proposed model achieves the best accuracies of\n92.89% and 64.48%, respectively, on the Twitter multimodal sarcasm and\nMultiBully datasets."},{"date":"2024-08","title":"Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models","author":"Zi Liang, Haibo Hu, Qingqing Ye, Yaxin Xiao, and Haoyang Li","link":"http://arxiv.org/abs/2408.02416v1","abstract":"The drastic increase of large language models' (LLMs) parameters has led to a\nnew research direction of fine-tuning-free downstream customization by prompts,\ni.e., task descriptions. While these prompt-based services (e.g., OpenAI's GPTs)\nplay an important role in many businesses, growing concerns have emerged\nabout prompt leakage, which undermines the intellectual properties of these\nservices and causes downstream attacks. In this paper, we analyze the\nunderlying mechanism of prompt leakage, which we refer to as prompt\nmemorization, and develop corresponding defending strategies. 
By exploring the\nscaling laws in prompt extraction, we analyze key attributes that influence\nprompt extraction, including model sizes, prompt lengths, and the types\nof prompts. Then we propose two hypotheses that explain how LLMs expose their\nprompts. The first is attributed to perplexity, i.e., the familiarity of\nLLMs with texts, whereas the second is based on the straightforward token\ntranslation path in attention matrices. To defend against such threats, we\ninvestigate whether alignments can undermine the extraction of prompts. We find\nthat current LLMs, even those with safety alignments like GPT-4, are highly\nvulnerable to prompt extraction attacks, even under the most straightforward\nuser attacks. Therefore, we put forward several defense strategies inspired by\nour findings, which achieve 83.8\\% and 71.0\\% drops in the prompt\nextraction rate for Llama2-7B and GPT-3.5, respectively. Source code is\navailable at \\url{https://github.com/liangzid/PromptExtractionEval}."},{"date":"2024-08","title":"A Few-Shot Approach for Relation Extraction Domain Adaptation using Large Language Models","author":"Vanni Zavarella, Juan Carlos Gamero-Salinas, and Sergio Consoli","link":"http://arxiv.org/abs/2408.02377v1","abstract":"Knowledge graphs (KGs) have been successfully applied to the analysis of\ncomplex scientific and technological domains, with automatic KG generation\nmethods typically building upon relation extraction models capturing\nfine-grained relations between domain entities in text. While these relations\nare fully applicable across scientific areas, existing models are trained on\nfew domain-specific datasets such as SciERC and do not perform well on new\ntarget domains. In this paper, we experiment with leveraging in-context\nlearning capabilities of Large Language Models to perform schema-constrained\ndata annotation, collecting in-domain training instances for a\nTransformer-based relation extraction model deployed on titles and abstracts of\nresearch papers in the Architecture, Construction, Engineering and Operations\n(AECO) domain. By assessing the performance gain with respect to a baseline\nDeep Learning architecture trained on off-domain data, we show that by using a\nfew-shot learning strategy with structured prompts and only minimal expert\nannotation the presented approach can potentially support domain adaptation of\na science KG generation model."},{"date":"2024-08","title":"VidModEx: Interpretable and Efficient Black Box Model Extraction for High-Dimensional Spaces","author":"Somnath Sendhil Kumar, Yuvaraj Govindarajulu, Pavan Kulkarni, and Manojkumar Parmar","link":"http://arxiv.org/abs/2408.02140v1","abstract":"In the domain of black-box model extraction, conventional methods reliant on\nsoft labels or surrogate datasets struggle with scaling to high-dimensional\ninput spaces and managing the complexity of an extensive array of interrelated\nclasses. In this work, we present a novel approach that utilizes SHAP (SHapley\nAdditive exPlanations) to enhance synthetic data generation. SHAP quantifies\nthe individual contributions of each input feature towards the victim model's\noutput, facilitating the optimization of an energy-based GAN towards a\ndesirable output. 
This method significantly boosts performance, achieving a\n16.45% increase in the accuracy of image classification models and extending to\nvideo classification models with an average improvement of 26.11% and a maximum\nof 33.36% on challenging datasets such as UCF11, UCF101, Kinetics 400, Kinetics\n600, and Something-Something V2. We further demonstrate the effectiveness and\npractical utility of our method under various scenarios, including the\navailability of top-k prediction probabilities, top-k prediction labels, and\ntop-1 labels."},{"date":"2024-08","title":"Knowledge AI: Fine-tuning NLP Models for Facilitating Scientific Knowledge Extraction and Understanding","author":"Balaji Muralidharan, Hayden Beadles, Reza Marzban, and Kalyan Sashank Mupparaju","link":"http://arxiv.org/abs/2408.04651v1","abstract":"This project investigates the efficacy of Large Language Models (LLMs) in\nunderstanding and extracting scientific knowledge across specific domains and\nto create a deep learning framework: Knowledge AI. As a part of this framework,\nwe employ pre-trained models and fine-tune them on datasets in the scientific\ndomain. The models are adapted for four key Natural Language Processing (NLP)\ntasks: summarization, text generation, question answering, and named entity\nrecognition. Our results indicate that domain-specific fine-tuning\nsignificantly enhances model performance in each of these tasks, thereby\nimproving their applicability for scientific contexts. This adaptation enables\nnon-experts to efficiently query and extract information within targeted\nscientific fields, demonstrating the potential of fine-tuned LLMs as a tool for\nknowledge discovery in the sciences."},{"date":"2024-08","title":"Integrating Large Language Models and Knowledge Graphs for Extraction and Validation of Textual Test Data","author":"Antonio De Santis, Marco Balduini, Federico De Santis, Andrea Proia, Arsenio Leo, Marco Brambilla, and Emanuele Della Valle","link":"http://arxiv.org/abs/2408.01700v1","abstract":"Aerospace manufacturing companies, such as Thales Alenia Space, design,\ndevelop, integrate, verify, and validate products characterized by high\ncomplexity and low volume. They carefully document all phases for each product\nbut analyses across products are challenging due to the heterogeneity and\nunstructured nature of the data in documents. In this paper, we propose a\nhybrid methodology that leverages Knowledge Graphs (KGs) in conjunction with\nLarge Language Models (LLMs) to extract and validate data contained in these\ndocuments. We consider a case study focused on test data related to electronic\nboards for satellites. To do so, we extend the Semantic Sensor Network\nontology. We store the metadata of the reports in a KG, while the actual test\nresults are stored in parquet accessible via a Virtual Knowledge Graph. The\nvalidation process is managed using an LLM-based approach. We also conduct a\nbenchmarking study to evaluate the performance of state-of-the-art LLMs in\nexecuting this task. 
Finally, we analyze the costs and benefits of automating\npreexisting processes of manual data extraction and validation for subsequent\ncross-report analyses."},{"date":"2024-07","title":"FIARSE: Model-Heterogeneous Federated Learning via Importance-Aware Submodel Extraction","author":"Feijie Wu, Xingchen Wang, Yaqing Wang, Tianci Liu, Lu Su, and Jing Gao","link":"http://arxiv.org/abs/2407.19389v2","abstract":"In federated learning (FL), accommodating clients' varied computational\ncapacities poses a challenge, often limiting the participation of those with\nconstrained resources in global model training. To address this issue, the\nconcept of model heterogeneity through submodel extraction has emerged,\noffering a tailored solution that aligns the model's complexity with each\nclient's computational capacity. In this work, we propose Federated\nImportance-Aware Submodel Extraction (FIARSE), a novel approach that\ndynamically adjusts submodels based on the importance of model parameters,\nthereby overcoming the limitations of previous static and dynamic submodel\nextraction methods. Compared to existing works, the proposed method offers a\ntheoretical foundation for submodel extraction and eliminates the need for\nadditional information beyond the model parameters themselves to determine\nparameter importance, significantly reducing the overhead on clients. Extensive\nexperiments are conducted on various datasets to showcase the superior\nperformance of the proposed FIARSE."},{"date":"2024-07","title":"Human-artificial intelligence teaming for scientific information extraction from data-driven additive manufacturing research using large language models","author":"Mutahar Safdar, Jiarui Xie, Andrei Mircea, and Yaoyao Fiona Zhao","link":"http://arxiv.org/abs/2407.18827v1","abstract":"Data-driven research in Additive Manufacturing (AM) has gained significant\nsuccess in recent years. As a result, a plethora of scientific literature has\nemerged. The knowledge in these works consists of AM and Artificial Intelligence\n(AI) contexts that have not been mined and formalized in an integrated way. It\nrequires substantial effort and time to extract scientific information from\nthese works. AM domain experts have contributed over two dozen review papers to\nsummarize these works. However, information specific to AM and AI contexts\nstill requires manual effort to extract. The recent success of foundation\nmodels such as BERT (Bidirectional Encoder Representations for Transformers) or\nGPT (Generative Pre-trained Transformers) on textual data has opened the\npossibility of expediting scientific information extraction. We propose a\nframework that enables collaboration between AM and AI experts to continuously\nextract scientific information from data-driven AM literature. A demonstration\ntool is implemented based on the proposed framework and a case study is\nconducted to extract information relevant to the datasets, modeling, sensing,\nand AM system categories. We show the ability of LLMs (Large Language Models)\nto expedite the extraction of relevant information from data-driven AM\nliterature. 
In the future, the framework can be used to extract information\nfrom the broader design and manufacturing literature in the engineering\ndiscipline."},{"date":"2024-07","title":"A Universal Prompting Strategy for Extracting Process Model Information from Natural Language Text using Large Language Models","author":"Julian Neuberger, Lars Ackermann, Han van der Aa, and Stefan Jablonski","link":"http://arxiv.org/abs/2407.18540v1","abstract":"Over the past decade, extensive research efforts have been dedicated to the\nextraction of information from textual process descriptions. Despite the\nremarkable progress witnessed in natural language processing (NLP), information\nextraction within the Business Process Management domain remains predominantly\nreliant on rule-based systems and machine learning methodologies. Data scarcity\nhas so far prevented the successful application of deep learning techniques.\nHowever, the rapid progress in generative large language models (LLMs) makes it\npossible to solve many NLP tasks with very high quality without the need for\nextensive data. Therefore, we systematically investigate the potential of LLMs\nfor extracting information from textual process descriptions, targeting the\ndetection of process elements such as activities and actors, and relations\nbetween them. Using a heuristic algorithm, we demonstrate the suitability of\nthe extracted information for process model generation. Based on a novel\nprompting strategy, we show that LLMs are able to outperform state-of-the-art\nmachine learning approaches with absolute performance improvements of up to 8\\%\n$F_1$ score across three different datasets. We evaluate our prompting strategy\non eight different LLMs, showing it is universally applicable, while also\nanalyzing the impact of certain prompt parts on extraction quality. The number\nof example texts, the specificity of definitions, and the rigour of format\ninstructions are identified as key for improving the accuracy of extracted\ninformation. Our code, prompts, and data are publicly available."},{"date":"2024-07","title":"SDoH-GPT: Using Large Language Models to Extract Social Determinants of Health (SDoH)","author":"Bernardo Consoli, Xizhi Wu, Song Wang, Xinyu Zhao, Yanshan Wang, Justin Rousseau, Tom Hartvigsen, Li Shen, Huanmei Wu, Yifan Peng, Qi Long, Tianlong Chen, and Ying Ding","link":"http://arxiv.org/abs/2407.17126v1","abstract":"Extracting social determinants of health (SDoH) from unstructured medical\nnotes depends heavily on labor-intensive annotations, which are typically\ntask-specific, hampering reusability and limiting sharing. In this study we\nintroduced SDoH-GPT, a simple and effective few-shot Large Language Model (LLM)\nmethod leveraging contrastive examples and concise instructions to extract SDoH\nwithout relying on extensive medical annotations or costly human intervention.\nIt achieved tenfold and twentyfold reductions in time and cost respectively,\nand superior consistency with human annotators measured by Cohen's kappa of up\nto 0.92. The innovative combination of SDoH-GPT and XGBoost leverages the\nstrengths of both, ensuring high accuracy and computational efficiency while\nconsistently maintaining 0.90+ AUROC scores. Testing across three distinct\ndatasets has confirmed its robustness and accuracy. 
This study highlights the\npotential of leveraging LLMs to revolutionize medical note classification,\ndemonstrating their capability to achieve highly accurate classifications with\nsignificantly reduced time and cost."},{"date":"2024-07","title":"From Text to Insight: Large Language Models for Materials Science Data Extraction","author":"Mara Schilling-Wilhelmi, Marti\u00f1o R\u00edos-Garc\u00eda, Sherjeel Shabih, Mar\u00eda Victoria Gil, Santiago Miret, Christoph T. Koch, Jos\u00e9 A. M\u00e1rquez, and Kevin Maik Jablonka","link":"http://arxiv.org/abs/2407.16867v1","abstract":"The vast majority of materials science knowledge exists in unstructured\nnatural language, yet structured data is crucial for innovative and systematic\nmaterials design. Traditionally, the field has relied on manual curation and\npartial automation for data extraction for specific use cases. The advent of\nlarge language models (LLMs) represents a significant shift, potentially\nenabling efficient extraction of structured, actionable data from unstructured\ntext by non-experts. While applying LLMs to materials science data extraction\npresents unique challenges, domain knowledge offers opportunities to guide and\nvalidate LLM outputs. This review provides a comprehensive overview of\nLLM-based structured data extraction in materials science, synthesizing current\nknowledge and outlining future directions. We address the lack of standardized\nguidelines and present frameworks for leveraging the synergy between LLMs and\nmaterials science expertise. This work serves as a foundational resource for\nresearchers aiming to harness LLMs for data-driven materials research. The\ninsights presented here could significantly enhance how researchers across\ndisciplines access and utilize scientific information, potentially accelerating\nthe development of novel materials for critical societal needs."},{"date":"2024-07","title":"Causality extraction from medical text using Large Language Models (LLMs)","author":"Seethalakshmi Gopalakrishnan, Luciana Garbayo, and Wlodek Zadrozny","link":"http://arxiv.org/abs/2407.10020v1","abstract":"This study explores the potential of natural language models, including large\nlanguage models, to extract causal relations from medical texts, specifically\nfrom Clinical Practice Guidelines (CPGs). The outcomes of causality extraction\nfrom Clinical Practice Guidelines for gestational diabetes are presented,\nmarking a first in the field. We report on a set of experiments using variants\nof BERT (BioBERT, DistilBERT, and BERT) and using Large Language Models (LLMs),\nnamely GPT-4 and LLAMA2. Our experiments show that BioBERT performed better\nthan other models, including the Large Language Models, with an average\nF1-score of 0.72. GPT-4 and LLAMA2 results show similar performance but less\nconsistency. We also release the code and an annotated corpus of causal\nstatements within the Clinical Practice Guidelines for gestational diabetes."},{"date":"2024-07","title":"Empowering Few-Shot Relation Extraction with The Integration of Traditional RE Methods and Large Language Models","author":"Ye Liu, Kai Zhang, Aoran Gan, Linan Yue, Feng Hu, Qi Liu, and Enhong Chen","link":"http://arxiv.org/abs/2407.08967v1","abstract":"Few-Shot Relation Extraction (FSRE), a subtask of Relation Extraction (RE)\nthat utilizes limited training instances, appeals to more researchers in\nNatural Language Processing (NLP) due to its capability to extract textual\ninformation in extremely low-resource scenarios. 
The primary methodologies\nemployed for FSRE have been fine-tuning or prompt tuning techniques based on\nPre-trained Language Models (PLMs). Recently, the emergence of Large Language\nModels (LLMs) has prompted numerous researchers to explore FSRE through\nIn-Context Learning (ICL). However, there are substantial limitations\nassociated with methods based on either traditional RE models or LLMs.\nTraditional RE models are hampered by a lack of necessary prior knowledge,\nwhile LLMs fall short in their task-specific capabilities for RE. To address\nthese shortcomings, we propose a Dual-System Augmented Relation Extractor\n(DSARE), which synergistically combines traditional RE models with LLMs.\nSpecifically, DSARE innovatively injects the prior knowledge of LLMs into\ntraditional RE models, and conversely enhances LLMs' task-specific aptitude for\nRE through relation extraction augmentation. Moreover, an Integrated Prediction\nmodule is employed to jointly consider these two respective predictions and\nderive the final results. Extensive experiments demonstrate the efficacy of our\nproposed method."},{"date":"2024-07","title":"Extracting Training Data from Document-Based VQA Models","author":"Francesco Pinto, Nathalie Rauschmayr, Florian Tram\u00e8r, Philip Torr, and Federico Tombari","link":"http://arxiv.org/abs/2407.08707v1","abstract":"Vision-Language Models (VLMs) have made remarkable progress in document-based\nVisual Question Answering (i.e., responding to queries about the contents of an\ninput document provided as an image). In this work, we show these models can\nmemorize responses for training samples and regurgitate them even when the\nrelevant visual information has been removed. This includes Personal\nIdentifiable Information (PII) repeated once in the training set, indicating\nthese models could divulge memorised sensitive information and therefore pose a\nprivacy risk. We quantitatively measure the extractability of information in\ncontrolled experiments and differentiate between cases where it arises from\ngeneralization capabilities or from memorization. We further investigate the\nfactors that influence memorization across multiple state-of-the-art models and\npropose an effective heuristic countermeasure that empirically prevents the\nextractability of PII."},{"date":"2024-07","title":"ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction","author":"Shaozhe Hao, Kai Han, Zhengyao Lv, Shihao Zhao, and Kwan-Yee K. Wong","link":"http://arxiv.org/abs/2407.07077v1","abstract":"While personalized text-to-image generation has enabled the learning of a\nsingle concept from multiple images, a more practical yet challenging scenario\ninvolves learning multiple concepts within a single image. However, existing\nworks tackling this scenario heavily rely on extensive human annotations. In\nthis paper, we introduce a novel task named Unsupervised Concept Extraction\n(UCE) that considers an unsupervised setting without any human knowledge of the\nconcepts. Given an image that contains multiple concepts, the task aims to\nextract and recreate individual concepts solely relying on the existing\nknowledge from pretrained diffusion models. To achieve this, we present\nConceptExpress that tackles UCE by unleashing the inherent capabilities of\npretrained diffusion models in two aspects. 
Specifically, a concept\nlocalization approach automatically locates and disentangles salient concepts\nby leveraging spatial correspondence from diffusion self-attention; and based\non the lookup association between a concept and a conceptual token, a\nconcept-wise optimization process learns discriminative tokens that represent\neach individual concept. Finally, we establish an evaluation protocol tailored\nfor the UCE task. Extensive experiments demonstrate that ConceptExpress is a\npromising solution to the UCE task. Our code and data are available at:\nhttps://github.com/haoosz/ConceptExpress"},{"date":"2024-07","title":"Large Language Models for Judicial Entity Extraction: A Comparative Study","author":"Atin Sakkeer Hussain, and Anu Thomas","link":"http://arxiv.org/abs/2407.05786v1","abstract":"Domain-specific Entity Recognition holds significant importance in legal\ncontexts, serving as a fundamental task that supports various applications such\nas question-answering systems, text summarization, machine translation,\nsentiment analysis, and information retrieval specifically within case law\ndocuments. Recent advancements have highlighted the efficacy of Large Language\nModels in natural language processing tasks, demonstrating their capability to\naccurately detect and classify domain-specific facts (entities) from\nspecialized texts like clinical and financial documents. This research\ninvestigates the application of Large Language Models in identifying\ndomain-specific entities (e.g., courts, petitioner, judge, lawyer, respondents,\nFIR nos.) within case law documents, with a specific focus on their aptitude\nfor handling domain-specific language complexity and contextual variations. The\nstudy evaluates the performance of state-of-the-art Large Language Model\narchitectures, including Large Language Model Meta AI 3, Mistral, and Gemma, in\nthe context of extracting judicial facts tailored to Indian judicial texts.\nMistral and Gemma emerged as the top-performing models, showcasing balanced\nprecision and recall crucial for accurate entity identification. These findings\nconfirm the value of Large Language Models in judicial documents and\ndemonstrate how they can facilitate and quicken scientific research by\nproducing precise, organised data outputs that are appropriate for in-depth\nexamination."},{"date":"2024-07","title":"Extracting and Encoding: Leveraging Large Language Models and Medical Knowledge to Enhance Radiological Text Representation","author":"Pablo Messina, Ren\u00e9 Vidal, Denis Parra, \u00c1lvaro Soto, and Vladimir Araujo","link":"http://arxiv.org/abs/2407.01948v1","abstract":"Advancing representation learning in specialized fields like medicine remains\nchallenging due to the scarcity of expert annotations for text and images. To\ntackle this issue, we present a novel two-stage framework designed to extract\nhigh-quality factual statements from free-text radiology reports in order to\nimprove the representations of text encoders and, consequently, their\nperformance on various downstream tasks. In the first stage, we propose a\n\\textit{Fact Extractor} that leverages large language models (LLMs) to identify\nfactual statements from well-curated domain-specific datasets. In the second\nstage, we introduce a \\textit{Fact Encoder} (CXRFE) based on a BERT model\nfine-tuned with objective functions designed to improve its representations\nusing the extracted factual data. 
Our framework also includes a new\nembedding-based metric (CXRFEScore) for evaluating chest X-ray text generation\nsystems, leveraging both stages of our approach. Extensive evaluations show\nthat our fact extractor and encoder outperform current state-of-the-art methods\nin tasks such as sentence ranking, natural language inference, and label\nextraction from radiology reports. Additionally, our metric proves to be more\nrobust and effective than existing metrics commonly used in the radiology\nreport generation literature. The code of this project is available at\n\\url{https://github.com/PabloMessina/CXR-Fact-Encoder}."},{"date":"2024-07","title":"QUEEN: Query Unlearning against Model Extraction","author":"Huajie Chen, Tianqing Zhu, Lefeng Zhang, Bo Liu, Derui Wang, Wanlei Zhou, and Minhui Xue","link":"http://arxiv.org/abs/2407.01251v1","abstract":"Model extraction attacks currently pose a non-negligible threat to the\nsecurity and privacy of deep learning models. By querying the model with a\nsmall dataset and using the query results as the ground-truth labels, an\nadversary can steal a piracy model with performance comparable to the original\nmodel. Two key issues cause the threat: on the one hand, accurate and\nunlimited queries can be obtained by the adversary; on the other hand, the\nadversary can aggregate the query results to train the model step by step. The\nexisting defenses usually employ model watermarking or fingerprinting to\nprotect the ownership. However, these methods cannot proactively prevent the\nviolation from happening. To mitigate the threat, we propose QUEEN (QUEry\nunlEarNing) that proactively launches counterattacks on potential model\nextraction attacks from the very beginning. To limit the potential threat,\nQUEEN combines sensitivity measurement with output perturbation to prevent the\nadversary from training a piracy model with high performance. In sensitivity\nmeasurement, QUEEN measures the single query sensitivity by its distance from\nthe center of its cluster in the feature space. To reduce the learning accuracy\nof attacks, for the highly sensitive query batch, QUEEN applies query\nunlearning, which is implemented by gradient reverse to perturb the softmax\noutput such that the piracy model will generate reverse gradients to worsen its\nperformance unconsciously. Experiments show that QUEEN outperforms the\nstate-of-the-art defenses against various model extraction attacks with a\nrelatively low cost to the model accuracy. The artifact is publicly available\nat https://anonymous.4open.science/r/queen implementation-5408/."},{"date":"2024-06","title":"Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs","author":"Lokesh Mishra, Sohayl Dhibi, Yusik Kim, Cesar Berrospi Ramis, Shubham Gupta, Michele Dolfi, and Peter Staar","link":"http://arxiv.org/abs/2406.19102v1","abstract":"Environment, Social, and Governance (ESG) KPIs assess an organization's\nperformance on issues such as climate change, greenhouse gas emissions, water\nconsumption, waste management, human rights, diversity, and policies. ESG\nreports convey this valuable quantitative information through tables.\nUnfortunately, extracting this information is difficult due to high variability\nin the table structure as well as content. We propose Statements, a novel\ndomain-agnostic data structure for extracting quantitative facts and related\ninformation. 
We propose translating tables to statements as a new supervised\ndeep-learning universal information extraction task. We introduce SemTabNet - a\ndataset of over 100K annotated tables. Investigating a family of T5-based\nStatement Extraction Models, our best model generates statements which are 82%\nsimilar to the ground-truth (compared to baseline of 21%). We demonstrate the\nadvantages of statements by applying our model to over 2700 tables from ESG\nreports. The homogeneous nature of statements permits exploratory data analysis\non expansive information found in large collections of ESG reports."},{"date":"2024-06","title":"Research on Information Extraction of LCSTS Dataset Based on an Improved BERTSum-LSTM Model","author":"Yiming Chen, Haobin Chen, Simin Liu, Yunyun Liu, Fanhao Zhou, and Bing Wei","link":"http://arxiv.org/abs/2406.18364v1","abstract":"With the continuous advancement of artificial intelligence, natural language\nprocessing technology has become widely utilized in various fields. At the same\ntime, there are many challenges in creating Chinese news summaries. First of\nall, the semantics of Chinese news is complex, and the amount of information is\nenormous. Extracting critical information from Chinese news presents a\nsignificant challenge. Second, the news summary should be concise and clear,\nfocusing on the main content and avoiding redundancy. In addition, the\nparticularity of the Chinese language, such as polysemy, word segmentation,\netc., makes it challenging to generate Chinese news summaries. Based on the\nabove, this paper studies the information extraction method of the LCSTS\ndataset based on an improved BERTSum-LSTM model. We improve the BERTSum-LSTM\nmodel to make it perform better in generating Chinese news summaries. The\nexperimental results show that the proposed method has a good effect on\ncreating news summaries, which is of great importance to the construction of\nnews summaries."},{"date":"2024-06","title":"Improving Entity Recognition Using Ensembles of Deep Learning and Fine-tuned Large Language Models: A Case Study on Adverse Event Extraction from Multiple Sources","author":"Yiming Li, Deepthi Viswaroopan, William He, Jianfu Li, Xu Zuo, Hua Xu, and Cui Tao","link":"http://arxiv.org/abs/2406.18049v1","abstract":"Adverse event (AE) extraction following COVID-19 vaccines from text data is\ncrucial for monitoring and analyzing the safety profiles of immunizations.\nTraditional deep learning models are adept at learning intricate feature\nrepresentations and dependencies in sequential data, but often require\nextensive labeled data. In contrast, large language models (LLMs) excel in\nunderstanding contextual information, but exhibit unstable performance on named\nentity recognition tasks, possibly due to their broad but unspecific training.\nThis study aims to evaluate the effectiveness of LLMs and traditional deep\nlearning models in AE extraction, and to assess the impact of ensembling these\nmodels on performance. In this study, we utilized reports and posts from the\nVAERS (n=621), Twitter (n=9,133), and Reddit (n=131) as our corpora. Our goal\nwas to extract three types of entities: \"vaccine\", \"shot\", and \"ae\". We\nexplored and fine-tuned (except GPT-4) multiple LLMs, including GPT-2, GPT-3.5,\nGPT-4, and Llama-2, as well as traditional deep learning models like RNN and\nBioBERT. To enhance performance, we created ensembles of the three models with\nthe best performance. 
For evaluation, we used strict and relaxed F1 scores to\nevaluate the performance for each entity type, and micro-average F1 was used to\nassess the overall performance. The ensemble model achieved the highest\nperformance in \"vaccine\", \"shot\", and \"ae\" with strict F1-scores of 0.878,\n0.930, and 0.925, respectively, along with a micro-average score of 0.903. In\nconclusion, this study demonstrates the effectiveness and robustness of\nensembling fine-tuned traditional deep learning models and LLMs, for extracting\nAE-related information. This study contributes to the advancement of biomedical\nnatural language processing, providing valuable insights into improving AE\nextraction from text data for pharmacovigilance and public health surveillance."},{"date":"2024-06","title":"Enabling Regional Explainability by Automatic and Model-agnostic Rule Extraction","author":"Yu Chen, Tianyu Cui, Alexander Capstick, Nan Fletcher-Loyd, and Payam Barnaghi","link":"http://arxiv.org/abs/2406.17885v3","abstract":"In Explainable AI, rule extraction translates model knowledge into logical\nrules, such as IF-THEN statements, crucial for understanding patterns learned\nby black-box models. This could significantly aid in fields like disease\ndiagnosis, disease progression estimation, or drug discovery. However, such\napplication domains often contain imbalanced data, with the class of interest\nunderrepresented. Existing methods inevitably compromise the performance of\nrules for the minor class to maximise the overall performance. As the first\nattempt in this field, we propose a model-agnostic approach for extracting\nrules from specific subgroups of data, featuring automatic rule generation for\nnumerical features. This method enhances the regional explainability of machine\nlearning models and offers wider applicability compared to existing methods. We\nadditionally introduce a new method for selecting features to compose rules,\nreducing computational costs in high-dimensional spaces. Experiments across\nvarious datasets and models demonstrate the effectiveness of our methods."},{"date":"2024-06","title":"Compact Model Parameter Extraction via Derivative-Free Optimization","author":"Rafael Perez Martinez, Masaya Iwamoto, Kelly Woo, Zhengliang Bian, Roberto Tinti, Stephen Boyd, and Srabanti Chowdhury","link":"http://arxiv.org/abs/2406.16355v1","abstract":"In this paper, we address the problem of compact model parameter extraction\nto simultaneously extract tens of parameters via derivative-free optimization.\nTraditionally, parameter extraction is performed manually by dividing the\ncomplete set of parameters into smaller subsets, each targeting different\noperational regions of the device, a process that can take several days or even\nweeks. Our approach streamlines this process by employing derivative-free\noptimization to identify a good parameter set that best fits the compact model\nwithout performing an exhaustive number of simulations. We further enhance the\noptimization process to address critical issues in device modeling by carefully\nchoosing a loss function that evaluates model performance consistently across\nvarying magnitudes by focusing on relative errors (as opposed to absolute\nerrors), prioritizing accuracy in key operational regions of the device above a\ncertain threshold, and reducing sensitivity to outliers. Furthermore, we\nutilize the concept of train-test split to assess the model fit and avoid\noverfitting. 
This is done by fitting 80% of the data and testing the model\nefficacy with the remaining 20%. We demonstrate the effectiveness of our\nmethodology by successfully modeling two semiconductor devices: a diamond\nSchottky diode and a GaN-on-SiC HEMT, with the latter involving the ASM-HEMT DC\nmodel, which requires simultaneously extracting 35 model parameters to fit the\nmodel to the measured data. These examples demonstrate the effectiveness of our\napproach and showcase the practical benefits of derivative-free optimization in\ndevice modeling."},{"date":"2024-06","title":"Large Language Models for Link Stealing Attacks Against Graph Neural Networks","author":"Faqian Guan, Tianqing Zhu, Hui Sun, Wanlei Zhou, and Philip S. Yu","link":"http://arxiv.org/abs/2406.16963v1","abstract":"Graph data contains rich node features and unique edge information, which\nhave been applied across various domains, such as citation networks or\nrecommendation systems. Graph Neural Networks (GNNs) are specialized for\nhandling such data and have shown impressive performance in many applications.\nHowever, GNNs may contain sensitive information and be susceptible to privacy\nattacks. For example, link stealing is a type of attack in which attackers\ninfer whether two nodes are linked or not. Previous link stealing attacks\nprimarily relied on posterior probabilities from the target GNN model,\nneglecting the significance of node features. Additionally, variations in node\nclasses across different datasets lead to different dimensions of posterior\nprobabilities. The handling of these varying data dimensions posed a challenge\nin using a single model to effectively conduct link stealing attacks on\ndifferent datasets. To address these challenges, we introduce Large Language\nModels (LLMs) to perform link stealing attacks on GNNs. LLMs can effectively\nintegrate textual features and exhibit strong generalizability, enabling\nattacks to handle diverse data dimensions across various datasets. We design\ntwo distinct LLM prompts to effectively combine textual features and posterior\nprobabilities of graph nodes. Through these designed prompts, we fine-tune the\nLLM to adapt to the link stealing attack task. Furthermore, we fine-tune the\nLLM using multiple datasets and enable the LLM to learn features from different\ndatasets simultaneously. Experimental results show that our approach\nsignificantly enhances the performance of existing link stealing attack tasks\nin both white-box and black-box scenarios. Our method can execute link stealing\nattacks across different datasets using only a single model, making link\nstealing attacks more applicable to real-world scenarios."},{"date":"2024-06","title":"Relation Extraction with Fine-Tuned Large Language Models in Retrieval Augmented Generation Frameworks","author":"Sefika Efeoglu, and Adrian Paschke","link":"http://arxiv.org/abs/2406.14745v2","abstract":"Information Extraction (IE) is crucial for converting unstructured data into\nstructured formats like Knowledge Graphs (KGs). A key task within IE is\nRelation Extraction (RE), which identifies relationships between entities in\ntext. Various RE methods exist, including supervised, unsupervised, weakly\nsupervised, and rule-based approaches. Recent studies leveraging pre-trained\nlanguage models (PLMs) have shown significant success in this area. 
In the\ncurrent era dominated by Large Language Models (LLMs), fine-tuning these models\ncan overcome limitations associated with zero-shot LLM prompting-based RE\nmethods, especially regarding domain adaptation challenges and identifying\nimplicit relations between entities in sentences. These implicit relations,\nwhich cannot be easily extracted from a sentence's dependency tree, require\nlogical inference for accurate identification. This work explores the\nperformance of fine-tuned LLMs and their integration into a Retrieval-Augmented\nGeneration (RAG)-based RE approach to address the challenges of identifying\nimplicit relations at the sentence level, particularly when LLMs act as\ngenerators within the RAG framework. Empirical evaluations on the TACRED,\nTACRED-Revisited (TACREV), Re-TACRED, and SemEVAL datasets show significant\nperformance improvements with fine-tuned LLMs, including Llama2-7B, Mistral-7B,\nand T5 (Large). Notably, our approach achieves substantial gains on SemEVAL,\nwhere implicit relations are common, surpassing previous results on this\ndataset. Additionally, our method outperforms previous works on TACRED, TACREV,\nand Re-TACRED, demonstrating exceptional performance across diverse evaluation\nscenarios."},{"date":"2024-06","title":"Extracting Training Data from Unconditional Diffusion Models","author":"Yunhao Chen, Xingjun Ma, Difan Zou, and Yu-Gang Jiang","link":"http://arxiv.org/abs/2406.12752v2","abstract":"As diffusion probabilistic models (DPMs) are being employed as mainstream\nmodels for generative artificial intelligence (AI), the study of their\nmemorization of the raw training data has attracted growing attention. Existing\nworks in this direction aim to establish an understanding of whether or to what\nextent DPMs learn by memorization. Such an understanding is crucial for\nidentifying potential risks of data leakage and copyright infringement in\ndiffusion models and, more importantly, for more controllable generation and\ntrustworthy application of Artificial Intelligence Generated Content (AIGC).\nWhile previous works have made important observations of when DPMs are prone to\nmemorization, these findings are mostly empirical, and the developed data\nextraction methods only work for conditional diffusion models. In this work, we\naim to establish a theoretical understanding of memorization in DPMs with 1) a\nmemorization metric for theoretical analysis, 2) an analysis of conditional\nmemorization with informative and random labels, and 3) two better evaluation\nmetrics for measuring memorization. Based on the theoretical analysis, we\nfurther propose a novel data extraction method called \\textbf{Surrogate\ncondItional Data Extraction (SIDE)} that leverages a classifier trained on\ngenerated data as a surrogate condition to extract training data directly from\nunconditional diffusion models. Our empirical results demonstrate that SIDE can\nextract training data from diffusion models where previous methods fail, and it\nis on average over 50\\% more effective across different scales of the CelebA\ndataset."},{"date":"2024-06","title":"Adaptive Reinforcement Learning Planning: Harnessing Large Language Models for Complex Information Extraction","author":"Zepeng Ding, Ruiyang Ke, Wenhao Huang, Guochao Jiang, Yanda Li, Deqing Yang, and Jiaqing Liang","link":"http://arxiv.org/abs/2406.11455v2","abstract":"Existing research on large language models (LLMs) shows that they can solve\ninformation extraction tasks through multi-step planning. 
However, their\nextraction behavior on complex sentences and tasks is unstable, giving rise to issues\nsuch as false positives and missing elements. We observe that decomposing\ncomplex extraction tasks and extracting them step by step can effectively\nimprove LLMs' performance, and the extraction orders of entities significantly\naffect the final results of LLMs. This paper proposes a two-stage multi-step\nmethod for LLM-based information extraction and adopts the RL framework to\nexecute the multi-step planning. We regard sequential extraction as a Markov\ndecision process, build an LLM-based extraction environment, design a decision\nmodule to adaptively provide the optimal order for sequential entity extraction\non different sentences, and utilize the DDQN algorithm to train the decision\nmodel. We also design the rewards and evaluation metrics suitable for the\nextraction results of LLMs. We conduct extensive experiments on multiple public\ndatasets to demonstrate the effectiveness of our method in improving the\ninformation extraction capabilities of LLMs."},{"date":"2024-06","title":"How Should We Extract Discrete Audio Tokens from Self-Supervised Models?","author":"Pooneh Mousavi, Jarod Duret, Salah Zaiem, Luca Della Libera, Artem Ploujnikov, Cem Subakan, and Mirco Ravanelli","link":"http://arxiv.org/abs/2406.10735v1","abstract":"Discrete audio tokens have recently gained attention for their potential to\nbridge the gap between audio and language processing. Ideal audio tokens must\npreserve content, paralinguistic elements, speaker identity, and many other\naudio details. Current audio tokenization methods fall into two categories:\nSemantic tokens, acquired through quantization of Self-Supervised Learning\n(SSL) models, and Neural compression-based tokens (codecs). Although previous\nstudies have benchmarked codec models to identify optimal configurations, the\nideal setup for quantizing pretrained SSL models remains unclear. This paper\nexplores the optimal configuration of semantic tokens across discriminative and\ngenerative tasks. We propose a scalable solution to train a universal vocoder\nacross multiple SSL layers. Furthermore, an attention mechanism is employed to\nidentify task-specific influential layers, enhancing the adaptability and\nperformance of semantic tokens in diverse audio applications."},{"date":"2024-06","title":"GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks","author":"Ihor Stepanov, and Mykhailo Shtopko","link":"http://arxiv.org/abs/2406.12925v2","abstract":"Information extraction tasks require accurate, efficient, and\ngeneralisable models. Classical supervised deep learning approaches can achieve\nthe required performance, but they need large datasets and are limited in their\nability to adapt to different tasks. On the other hand, large language models\n(LLMs) demonstrate good generalization, meaning that they can adapt to many\ndifferent tasks based on user requests. However, LLMs are computationally\nexpensive and tend to fail to generate structured outputs. In this article, we\nwill introduce a new kind of GLiNER model that can be used for various\ninformation extraction tasks while being a small encoder model. 
Our model\nachieved SoTA performance on zero-shot NER benchmarks and leading performance\non question-answering, summarization and relation extraction tasks.\nAdditionally, in this article, we will cover experimental results on\nself-learning approaches for named entity recognition using GLiNER models."},{"date":"2024-06","title":"Beyond Slow Signs in High-fidelity Model Extraction","author":"Hanna Foerster, Robert Mullins, Ilia Shumailov, and Jamie Hayes","link":"http://arxiv.org/abs/2406.10011v1","abstract":"Deep neural networks, costly to train and rich in intellectual property\nvalue, are increasingly threatened by model extraction attacks that compromise\ntheir confidentiality. Previous attacks have succeeded in reverse-engineering\nmodel parameters up to a precision of float64 for models trained on random data\nwith at most three hidden layers using cryptanalytical techniques. However, the\nprocess was identified to be very time consuming and not feasible for larger\nand deeper models trained on standard benchmarks. Our study evaluates the\nfeasibility of parameter extraction methods of Carlini et al. [1] further\nenhanced by Canales-Mart\\'inez et al. [2] for models trained on standard\nbenchmarks. We introduce a unified codebase that integrates previous methods\nand reveal that computational tools can significantly influence performance. We\ndevelop further optimisations to the end-to-end attack and improve the\nefficiency of extracting weight signs by up to 14.8 times compared to former\nmethods through the identification of easier and harder to extract neurons.\nContrary to prior assumptions, we identify extraction of weights, not\nextraction of weight signs, as the critical bottleneck. With our improvements,\na 16,721 parameter model with 2 hidden layers trained on MNIST is extracted\nwithin only 98 minutes compared to at least 150 minutes previously. Finally,\naddressing methodological deficiencies observed in previous studies, we propose\nnew ways of robust benchmarking for future model extraction attacks."},{"date":"2024-06","title":"RadEx: A Framework for Structured Information Extraction from Radiology Reports based on Large Language Models","author":"Daniel Reichenpfader, Jonas Knupp, Andr\u00e9 Sander, and Kerstin Denecke","link":"http://arxiv.org/abs/2406.15465v1","abstract":"Annually and globally, over three billion radiography examinations and\ncomputer tomography scans result in mostly unstructured radiology reports\ncontaining free text. Despite the potential benefits of structured reporting,\nits adoption is limited by factors such as established processes, resource\nconstraints and potential loss of information. However, structured information\nwould be necessary for various use cases, including automatic analysis,\nclinical trial matching, and prediction of health outcomes. This study\nintroduces RadEx, an end-to-end framework comprising 15 software components and\nten artifacts to develop systems that perform automated information extraction\nfrom radiology reports. It covers the complete process from annotating training\ndata to extracting information by offering a consistent generic information\nmodel and setting boundaries for model development. Specifically, RadEx allows\nclinicians to define relevant information for clinical domains (e.g.,\nmammography) and to create report templates. The framework supports both\ngenerative and encoder-only models and the decoupling of information extraction\nfrom template filling enables independent model improvements. 
Developing information extraction systems according to the RadEx framework facilitates implementation and maintenance as components are easily exchangeable, while standardized artifacts ensure interoperability between components."},
{"date":"2024-06","title":"Zero-Shot Learning Over Large Output Spaces: Utilizing Indirect Knowledge Extraction from Large Language Models","author":"Jinbin Zhang, Nasib Ullah, and Rohit Babbar","link":"http://arxiv.org/abs/2406.09288v1","abstract":"Extreme Multi-label Learning (XMC) is a task that allocates the most relevant labels for an instance from a predefined label set. Extreme Zero-shot XMC (EZ-XMC) is a special setting of XMC wherein no supervision is provided; only the instances (raw text of the document) and the predetermined label set are given. The scenario is designed to address cold-start problems in categorization and recommendation. Traditional state-of-the-art methods extract pseudo labels from the document title or segments. These labels from the document are used to train a zero-shot bi-encoder model. The main issue with these generated labels is their misalignment with the tagging task. In this work, we propose a framework to train a small bi-encoder model via feedback from a large language model (LLM); the bi-encoder encodes documents and labels into embeddings for retrieval. Our approach leverages the zero-shot ability of the LLM to assess the correlation between labels and the document instead of using the low-quality labels extracted from the document itself. Our method also guarantees fast inference without the involvement of the LLM. Our approach outperforms the SOTA methods on various datasets while retaining similar training time on large datasets."},
{"date":"2024-06","title":"Research on Deep Learning Model of Feature Extraction Based on Convolutional Neural Network","author":"Houze Liu, Iris Li, Yaxin Liang, Dan Sun, Yining Yang, and Haowei Yang","link":"http://arxiv.org/abs/2406.08837v1","abstract":"Neural networks with relatively shallow layers and simple structures may have limited ability to accurately identify pneumonia. In addition, deep neural networks demand substantial computing resources, which may make convolutional neural networks impossible to deploy on terminal devices. Therefore, this paper optimizes classification with convolutional neural networks. Firstly, according to the characteristics of pneumonia images, AlexNet and InceptionV3 were selected to obtain better image recognition results. Combining the characteristics of medical images, a deeper and more complex feed-forward neural network is trained. Finally, knowledge extraction technology is used to transfer the learned knowledge into the AlexNet model to improve computing efficiency and reduce computing costs. The results showed that the prediction accuracy, specificity, and sensitivity of the trained AlexNet model increased by 4.25 percentage points, 7.85 percentage points, and 2.32 percentage points, respectively. Graphics processing usage decreased by 51% compared to the InceptionV3 model."},
{"date":"2024-06","title":"A Combination Model for Time Series Prediction using LSTM via Extracting Dynamic Features Based on Spatial Smoothing and Sequential General Variational Mode Decomposition","author":"Jianyu Liu, Wei Chen, Yong Zhang, Zhenfeng Chen, Bin Wan, and Jinwei Hu","link":"http://arxiv.org/abs/2406.03144v1","abstract":"To address problems in time series prediction such as the difficulty of extracting effective features and the low accuracy of sales volume prediction caused by complex relationships in market sales data, we propose a time series prediction method for market sales volume based on a combination model of Sequential General VMD and spatial-smoothing long short-term memory (SS-LSTM) neural networks. First, the spatial smoothing algorithm is used to decompose the sample data of related industry sectors affected by market-sector linkage effects, and Sequential General VMD extracts modal features containing information on the overall market and specific price trends; then, according to the characteristics of different market datasets, an LSTM network is used to model and predict prices from the fundamental data and modal features. Experimental results on data with seasonal and periodic trends show that, compared to traditional prediction methods, this method achieves higher price prediction accuracy in specific market contexts and more accurately describes changes in market sales volume."},
{"date":"2024-06","title":"Stealing Image-to-Image Translation Models With a Single Query","author":"Nurit Spingarn-Eliezer, and Tomer Michaeli","link":"http://arxiv.org/abs/2406.00828v1","abstract":"Training deep neural networks requires significant computational resources and large datasets that are often confidential or expensive to collect. As a result, owners tend to protect their models by allowing access only via an API. Many works demonstrated the possibility of stealing such protected models by repeatedly querying the API. However, to date, research has predominantly focused on stealing classification models, for which a very large number of queries has been found necessary. In this paper, we study the possibility of stealing image-to-image models. Surprisingly, we find that many such models can be stolen with as little as a single, small-sized query image using simple distillation. We study this phenomenon on a wide variety of model architectures, datasets, and tasks, including denoising, deblurring, deraining, super-resolution, and biological image-to-image translation. Remarkably, we find that the vulnerability to stealing attacks is shared by CNNs and by models with attention mechanisms, and that stealing is commonly possible even without knowing the architecture of the target model."},
{"date":"2024-05","title":"Large Language Model Watermark Stealing With Mixed Integer Programming","author":"Zhaoxi Zhang, Xiaomei Zhang, Yanjun Zhang, Leo Yu Zhang, Chao Chen, Shengshan Hu, Asif Gill, and Shirui Pan","link":"http://arxiv.org/abs/2405.19677v1","abstract":"The Large Language Model (LLM) watermark is a newly emerging technique that shows promise in addressing concerns surrounding LLM copyright, monitoring AI-generated text, and preventing its misuse.
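The single-query stealing result above boils down to plain distillation on one (query, response) pair. A minimal sketch follows, with a stand-in blur filter playing the black-box victim and an arbitrary small CNN as the student; neither the architecture, the loss, nor the step count is from the paper.

```python
# Single-query image-to-image model stealing via distillation: query the
# victim once, then fit a student to that one (input, output) pair.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Student(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1))
    def forward(self, x):
        return self.net(x)

def victim(x):
    # Placeholder for the black-box API: a fixed 5x5 average blur per channel.
    k = torch.ones(3, 1, 5, 5) / 25.0
    return F.conv2d(x, k, padding=2, groups=3)

query = torch.rand(1, 3, 64, 64)           # the single query image
with torch.no_grad():
    target = victim(query)                 # exactly one API call

student = Student()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for step in range(200):                    # distill on the single pair
    opt.zero_grad()
    loss = F.l1_loss(student(query), target)
    loss.backward()
    opt.step()
print("final distillation loss:", loss.item())
```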
The LLM watermark scheme commonly\nincludes generating secret keys to partition the vocabulary into green and red\nlists, applying a perturbation to the logits of tokens in the green list to\nincrease their sampling likelihood, thus facilitating watermark detection to\nidentify AI-generated text if the proportion of green tokens exceeds a\nthreshold. However, recent research indicates that watermarking methods using\nnumerous keys are susceptible to removal attacks, such as token editing,\nsynonym substitution, and paraphrasing, with robustness declining as the number\nof keys increases. Therefore, the state-of-the-art watermark schemes that\nemploy fewer or single keys have been demonstrated to be more robust against\ntext editing and paraphrasing. In this paper, we propose a novel green list\nstealing attack against the state-of-the-art LLM watermark scheme and\nsystematically examine its vulnerability to this attack. We formalize the\nattack as a mixed integer programming problem with constraints. We evaluate our\nattack under a comprehensive threat model, including an extreme scenario where\nthe attacker has no prior knowledge, lacks access to the watermark detector\nAPI, and possesses no information about the LLM's parameter settings or\nwatermark injection/detection scheme. Extensive experiments on LLMs, such as\nOPT and LLaMA, demonstrate that our attack can successfully steal the green\nlist and remove the watermark across all settings."},{"date":"2024-05","title":"Extracting Heuristics from Large Language Models for Reward Shaping in Reinforcement Learning","author":"Siddhant Bhambri, Amrita Bhattacharjee, Durgesh Kalwar, Lin Guan, Huan Liu, and Subbarao Kambhampati","link":"http://arxiv.org/abs/2405.15194v2","abstract":"Reinforcement Learning (RL) suffers from sample inefficiency in sparse reward\ndomains, and the problem is further pronounced in case of stochastic\ntransitions. To improve the sample efficiency, reward shaping is a well-studied\napproach to introduce intrinsic rewards that can help the RL agent converge to\nan optimal policy faster. However, designing a useful reward shaping function\nfor all desirable states in the Markov Decision Process (MDP) is challenging,\neven for domain experts. Given that Large Language Models (LLMs) have\ndemonstrated impressive performance across a magnitude of natural language\ntasks, we aim to answer the following question: `Can we obtain heuristics using\nLLMs for constructing a reward shaping function that can boost an RL agent's\nsample efficiency?' To this end, we aim to leverage off-the-shelf LLMs to\ngenerate a plan for an abstraction of the underlying MDP. We further use this\nLLM-generated plan as a heuristic to construct the reward shaping signal for\nthe downstream RL agent. By characterizing the type of abstraction based on the\nMDP horizon length, we analyze the quality of heuristics when generated using\nan LLM, with and without a verifier in the loop. 
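The green/red-list scheme described above can be made concrete in a few lines: a secret key partitions the vocabulary, generation biases sampling toward green tokens, and detection z-tests the green fraction of a text. The sketch below is simplified to a single fixed partition and crudely filters samples instead of perturbing logits; real schemes typically re-seed the partition from the preceding token.

```python
# Toy green-list watermark: seeded vocabulary partition + z-score detection.
import hashlib
import math
import random

VOCAB = [f"tok{i}" for i in range(1000)]   # toy vocabulary
SECRET_KEY = b"secret"
GAMMA = 0.5                                # fraction of vocabulary kept green

def is_green(token):
    h = hashlib.sha256(SECRET_KEY + token.encode()).digest()
    return h[0] < 256 * GAMMA

def detect(tokens, threshold=4.0):
    # z-score of the observed green count vs. the GAMMA*n expected by chance
    n = len(tokens)
    greens = sum(is_green(t) for t in tokens)
    z = (greens - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))
    return z, z > threshold

rng = random.Random(0)
samples = (rng.choice(VOCAB) for _ in range(500))
watermarked = [t for t in samples if is_green(t)][:200]  # stand-in for logit boosting
unmarked = [rng.choice(VOCAB) for _ in range(200)]
print("watermarked:", detect(watermarked))
print("unmarked:   ", detect(unmarked))
```

A green-list stealing attack of the kind proposed above would try to recover `is_green`'s partition from watermarked outputs alone.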
Our experiments across multiple domains with varying horizon lengths and numbers of sub-goals from the BabyAI environment suite, and the Household, Mario, and Minecraft domains, show 1) the advantages and limitations of querying LLMs with and without a verifier to generate a reward shaping heuristic, and 2) a significant improvement in the sample efficiency of PPO, A2C, and Q-learning when guided by the LLM-generated heuristics."},
{"date":"2024-05","title":"Evaluating Large Language Models for Public Health Classification and Extraction Tasks","author":"Joshua Harris, Timothy Laurence, Leo Loman, Fan Grayson, Toby Nonnenmacher, Harry Long, Loes WalsGriffith, Amy Douglas, Holly Fountain, Stelios Georgiou, Jo Hardstaff, Kathryn Hopkins, Y-Ling Chi, Galena Kuyumdzhieva, Lesley Larkin, Samuel Collins, Hamish Mohammed, Thomas Finnie, Luke Hounsome, and Steven Riley","link":"http://arxiv.org/abs/2405.14766v1","abstract":"Advances in Large Language Models (LLMs) have led to significant interest in their potential to support human experts across a range of domains, including public health. In this work, we present automated evaluations of LLMs for public health tasks involving the classification and extraction of free text. We combine six externally annotated datasets with seven new internally annotated datasets to evaluate LLMs for processing text related to health burden, epidemiological risk factors, and public health interventions. We initially evaluate five open-weight LLMs (7-70 billion parameters) across all tasks using zero-shot in-context learning. We find that Llama-3-70B-Instruct is the highest performing model, achieving the best results on 15/17 tasks (using micro-F1 scores). We see significant variation across tasks, with all open-weight LLMs scoring below 60% micro-F1 on some challenging tasks, such as Contact Classification, while all LLMs achieve greater than 80% micro-F1 on others, such as GI Illness Classification. For a subset of 12 tasks, we also evaluate GPT-4 and find comparable results to Llama-3-70B-Instruct, which matches or outperforms GPT-4 on 6 of the 12 tasks. Overall, based on these initial results we find promising signs that LLMs may be useful tools for public health experts to extract information from a wide variety of free text sources, and support public health surveillance, research, and interventions."},
{"date":"2024-05","title":"Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study","author":"Lena Schmidt, Kaitlyn Hair, Sergio Graziozi, Fiona Campbell, Claudia Kapp, Alireza Khanteymoori, Dawn Craig, Mark Engelbert, and James Thomas","link":"http://arxiv.org/abs/2405.14445v1","abstract":"This paper describes a rapid feasibility study of using GPT-4, a large language model (LLM), to (semi)automate data extraction in systematic reviews. Despite the recent surge of interest in LLMs, there is still a lack of understanding of how to design LLM-based automation tools and how to robustly evaluate their performance. During the 2023 Evidence Synthesis Hackathon we conducted two feasibility studies. The first aimed to automatically extract study characteristics from human clinical, animal, and social science domain studies; we used two studies from each category for prompt development and ten for evaluation. In the second, we used the LLM to predict Participants, Interventions, Controls and Outcomes (PICOs) labelled within 100 abstracts in the EBM-NLP dataset. Overall, results indicated an accuracy of around 80%, with some variability between domains (82% for human clinical, 80% for animal, and 72% for studies of human social sciences). Causal inference methods and study design were the data extraction items with the most errors. In the PICO study, participants and intervention/control showed high accuracy (>80%), while outcomes were more challenging. Evaluation was done manually; scoring methods such as BLEU and ROUGE showed limited value. We observed variability in the LLM's predictions and changes in response quality. This paper presents a template for future evaluations of LLMs in the context of data extraction for systematic review automation. Our results show that there might be value in using LLMs, for example as second or third reviewers. However, caution is advised when integrating models such as GPT-4 into tools. Further research on stability and reliability in practical settings is warranted for each type of data that is processed by the LLM."},
{"date":"2024-05","title":"A Set-based Approach for Feature Extraction of 3D CAD Models","author":"Peng Xu, Qi Gao, and Ying-Jie Wu","link":"http://arxiv.org/abs/2406.18543v1","abstract":"Feature extraction is a critical technology to realize the automatic transmission of feature information throughout product life cycles. As CAD models primarily capture the 3D geometry of products, feature extraction heavily relies on geometric information. However, existing feature extraction methods often yield inaccurate outcomes due to the diverse interpretations of geometric information. This report presents a set-based feature extraction approach to address this uncertainty issue. Unlike existing methods that seek accurate feature results, our approach aims to transform the uncertainty of geometric information into a set of feature subgraphs. First, we define the convexity of basic geometric entities and introduce the concept of two-level attributed adjacency graphs. Second, a feature extraction workflow is designed to determine feature boundaries and identify feature subgraphs from CAD models. This set of feature subgraphs can be used for further feature recognition. A feature extraction system is programmed using C++ and UG/Open to demonstrate the feasibility of our proposed approach."},
{"date":"2024-05","title":"Dataset Mention Extraction in Scientific Articles Using Bi-LSTM-CRF Model","author":"Tong Zeng, and Daniel Acuna","link":"http://arxiv.org/abs/2405.13135v1","abstract":"Datasets are critical for scientific research, playing an important role in replication, reproducibility, and efficiency. Researchers have recently shown that datasets are becoming more important for science to function properly, even serving as artifacts of study themselves. However, citing datasets is not a common or standard practice in spite of recent efforts by data repositories and funding agencies. This greatly affects our ability to track their usage and importance. A potential solution to this problem is to automatically extract dataset mentions from scientific articles. In this work, we propose to achieve such extraction by using a neural network based on a Bi-LSTM-CRF architecture. Our method achieves F1 = 0.885 in social science articles released as part of the Rich Context Dataset. We discuss the limitations of the current datasets and propose future modifications to the model."},
{"date":"2024-05","title":"Efficient Model-Stealing Attacks Against Inductive Graph Neural Networks","author":"Marcin Podhajski, Jan Dubi\u0144ski, Franziska Boenisch, Adam Dziedzic, Agnieszka Pregowska, and Tomasz Michalak","link":"http://arxiv.org/abs/2405.12295v3","abstract":"Graph Neural Networks (GNNs) are recognized as potent tools for processing real-world data organized in graph structures. Especially inductive GNNs, which allow for the processing of graph-structured data without relying on predefined graph structures, are becoming increasingly important in a wide range of applications. As such, these networks are attractive targets for model-stealing attacks, where an adversary seeks to replicate the functionality of the targeted network. Significant efforts have been devoted to developing model-stealing attacks that extract models trained on images and texts. However, little attention has been given to stealing GNNs trained on graph data. This paper introduces a new method for performing unsupervised model-stealing attacks against inductive GNNs, utilizing graph contrastive learning and spectral graph augmentations to efficiently extract information from the targeted model. The new type of attack is thoroughly evaluated on six datasets, and the results show that our approach outperforms the current state-of-the-art by Shen et al. (2021). In particular, our attack surpasses the baseline across all benchmarks, attaining superior fidelity and downstream accuracy of the stolen model while necessitating fewer queries directed toward the target model."},
{"date":"2024-05","title":"Fully Exploiting Every Real Sample: SuperPixel Sample Gradient Model Stealing","author":"Yunlong Zhao, Xiaoheng Deng, Yijing Liu, Xinjun Pei, Jiazhi Xia, and Wei Chen","link":"http://arxiv.org/abs/2406.18540v1","abstract":"Model stealing (MS) involves querying and observing the output of a machine learning model to steal its capabilities. The quality of queried data is crucial, yet obtaining a large amount of real data for MS is often challenging. Recent works have reduced reliance on real data by using generative models. However, when high-dimensional query data is required, these methods are impractical due to the high costs of querying and the risk of model collapse. In this work, we propose using sample gradients (SG) to enhance the utility of each real sample, as SG provides crucial guidance on the decision boundaries of the victim model. However, utilizing SG in the model stealing scenario faces two challenges: 1. Pixel-level gradient estimation requires extensive query volume and is susceptible to defenses. 2. The estimation of sample gradients has a significant variance. This paper proposes Superpixel Sample Gradient stealing (SPSG) for model stealing under the constraint of limited real samples. With the basic idea of imitating the victim model's low-variance patch-level gradients instead of pixel-level gradients, SPSG achieves efficient sample gradient estimation through two steps. First, we perform patch-wise perturbations on query images to estimate the average gradient in different regions of the image. Then, we filter the gradients through a threshold strategy to reduce variance. Exhaustive experiments demonstrate that, with the same number of real samples, SPSG achieves accuracy, agreement, and adversarial success rates significantly surpassing the current state-of-the-art MS methods. Codes are available at https://github.com/zyl123456aB/SPSG_attack."},
{"date":"2024-05","title":"Dynamic In-context Learning with Conversational Models for Data Extraction and Materials Property Prediction","author":"Chinedu Ekuma","link":"http://arxiv.org/abs/2405.10448v2","abstract":"The advent of natural language processing and large language models (LLMs) has revolutionized the extraction of data from unstructured scholarly papers. However, ensuring data trustworthiness remains a significant challenge. In this paper, we introduce PropertyExtractor, an open-source tool that leverages advanced conversational LLMs like Google gemini-pro and OpenAI gpt-4, blends zero-shot with few-shot in-context learning, and employs engineered prompts for the dynamic refinement of structured information hierarchies, enabling autonomous, efficient, scalable, and accurate identification, extraction, and verification of material property data. Our tests on material data demonstrate precision and recall that exceed 95% with an error rate of approximately 9%, highlighting the effectiveness and versatility of the toolkit. Finally, databases for 2D material thicknesses, a critical parameter for device integration, and energy bandgap values are developed using PropertyExtractor. Specifically for the thickness database, the rapid evolution of the field has outpaced both experimental measurements and computational methods, creating a significant data gap. Our work addresses this gap and showcases the potential of PropertyExtractor as a reliable and efficient tool for the autonomous generation of various material property databases, advancing the field."},
{"date":"2024-05","title":"Unsupervised Work Behavior Pattern Extraction Based on Hierarchical Probabilistic Model","author":"Issei Saito, Tomoaki Nakamura, Toshiyuki Hatta, Wataru Fujita, Shintaro Watanabe, and Shotaro Miwa","link":"http://arxiv.org/abs/2405.09838v1","abstract":"Evolving consumer demands and market trends have led to businesses increasingly embracing a production approach that prioritizes flexibility and customization. Consequently, factory workers must engage in tasks that are more complex than before. Thus, productivity depends on each worker's skills in assembling products. Therefore, analyzing the behavior of a worker is crucial for work improvement. However, manual analysis is time-consuming and does not provide quick and accurate feedback. Machine learning methods have been applied to automate these analyses; however, most of them require labeled data for training. To this end, we extend the Gaussian process hidden semi-Markov model (GP-HSMM) to enable the rapid and automated analysis of worker behavior without pre-training. The model does not require labeled data and can automatically and accurately segment continuous motions into motion classes. The proposed model is a probabilistic model that hierarchically connects GP-HSMM and HSMM, enabling the extraction of behavioral patterns with different granularities. Furthermore, it mutually infers the parameters between the GP-HSMM and HSMM, resulting in accurate motion pattern extraction. We applied the proposed method to motion data in which workers assembled products at an actual production site. The accuracy of behavior pattern extraction was evaluated using normalized Levenshtein distance (NLD). The smaller the value of NLD, the more accurate the pattern extraction. The NLD of motion patterns captured by the GP-HSMM and HSMM layers in our proposed method was 0.50 and 0.33, respectively, smaller than those of the baseline methods."},
{"date":"2024-05","title":"The object detection model uses combined extraction with KNN and RF classification","author":"Florentina Tatrin Kurniati, Daniel HF Manongga, Irwan Sembiring, Sutarto Wijono, and Roy Rudolf Huizen","link":"http://arxiv.org/abs/2405.05551v1","abstract":"Object detection plays an important role in various fields. Developing detection models for 2D objects that experience rotation and texture variations is a challenge. In this research, the initial stage of the proposed model integrates gray-level co-occurrence matrix (GLCM) and local binary patterns (LBP) texture feature extraction to obtain feature vectors. The next stage classifies features using k-nearest neighbors (KNN) and random forest (RF), as well as a voting ensemble (VE). System testing used a dataset of 4,437 2D images; KNN achieved 92.7% accuracy and a 92.5% F1-score, while RF performance was lower. Although GLCM features improve performance on both algorithms, KNN is more consistent. The VE approach provides the best performance, with an accuracy of 93.9% and an F1-score of 93.8%, demonstrating the effectiveness of the ensemble technique in increasing object detection accuracy. This study contributes to the field of object detection with a new approach combining GLCM and LBP as feature vectors as well as VE for classification."},
{"date":"2024-05","title":"Special Characters Attack: Toward Scalable Training Data Extraction From Large Language Models","author":"Yang Bai, Ge Pei, Jindong Gu, Yong Yang, and Xingjun Ma","link":"http://arxiv.org/abs/2405.05990v2","abstract":"Large language models (LLMs) have achieved remarkable performance on a wide range of tasks. However, recent studies have shown that LLMs can memorize training data and that simple repeated tokens can trick the model into leaking the data. In this paper, we take a step further and show that certain special characters or their combinations with English letters are stronger memory triggers, leading to more severe data leakage. The intuition is that, since LLMs are trained with massive data that contains a substantial amount of special characters (e.g., structural symbols {, } of JSON files, and @, # in emails and online posts), the model may memorize the co-occurrence between these special characters and the raw texts. This motivates us to propose a simple but effective Special Characters Attack (SCA) to induce training data leakage. Our experiments verify the high effectiveness of SCA against state-of-the-art LLMs: they can leak diverse training data, such as code corpus, web pages, and personally identifiable information, and sometimes generate non-stop outputs as a byproduct. We further show that the composition of the training data corpus can be revealed by inspecting the leaked data -- one crucial piece of information for pre-training high-performance LLMs. Our work can help researchers understand the sensitivity of LLMs to special characters and identify potential areas for improvement."},
{"date":"2024-05","title":"Lightweight Spatial Modeling for Combinatorial Information Extraction From Documents","author":"Yanfei Dong, Lambert Deng, Jiazheng Zhang, Xiaodong Yu, Ting Lin, Francesco Gelli, Soujanya Poria, and Wee Sun Lee","link":"http://arxiv.org/abs/2405.06701v1","abstract":"Documents that consist of diverse templates and exhibit complex spatial structures pose a challenge for document entity classification. We propose KNN-former, which incorporates a new kind of spatial bias in attention calculation based on the K-nearest-neighbor (KNN) graph of document entities. We limit entities' attention only to their local radius defined by the KNN graph. We also use combinatorial matching to address the one-to-one mapping property that exists in many documents, where one field has only one corresponding entity. Moreover, our method is highly parameter-efficient compared to existing approaches. Despite this, experiments across various datasets show our method outperforms baselines on most entity types. Many real-world documents exhibit combinatorial properties which can be leveraged as inductive biases to improve extraction accuracy, but existing datasets do not cover these documents. To facilitate future research into these types of documents, we release a new ID document dataset that covers diverse templates and languages. We also release enhanced annotations for an existing dataset."},
{"date":"2024-05","title":"ModelShield: Adaptive and Robust Watermark against Model Extraction Attack","author":"Kaiyi Pang, Tao Qi, Chuhan Wu, Minhao Bai, Minghu Jiang, and Yongfeng Huang","link":"http://arxiv.org/abs/2405.02365v3","abstract":"Large language models (LLMs) demonstrate general intelligence across a variety of machine learning tasks, thereby enhancing the commercial value of their intellectual property (IP). To protect this IP, model owners typically allow user access only in a black-box manner; however, adversaries can still utilize model extraction attacks to steal the model intelligence encoded in model generation. Watermarking technology offers a promising solution for defending against such attacks by embedding unique identifiers into the model-generated content. However, existing watermarking methods often compromise the quality of generated content due to heuristic alterations and lack robust mechanisms to counteract adversarial strategies, thus limiting their practicality in real-world scenarios. In this paper, we introduce an adaptive and robust watermarking method (named ModelShield) to protect the IP of LLMs. Our method incorporates a self-watermarking mechanism that allows LLMs to autonomously insert watermarks into their generated content to avoid the degradation of model content. We also propose a robust watermark detection mechanism capable of effectively identifying watermark signals under the interference of varying adversarial strategies. In addition, ModelShield is a plug-and-play method that does not require additional model training, enhancing its applicability in LLM deployments. Extensive evaluations on two real-world datasets and three LLMs demonstrate that our method surpasses existing methods in terms of defense effectiveness and robustness while significantly reducing the degradation that watermarking causes to model-generated content."},
{"date":"2024-05","title":"Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language Models","author":"Hye Sun Yun, David Pogrebitskiy, Iain J. Marshall, and Byron C. Wallace","link":"http://arxiv.org/abs/2405.01686v2","abstract":"Meta-analyses statistically aggregate the findings of different randomized controlled trials (RCTs) to assess treatment effectiveness. Because this yields robust estimates of treatment effectiveness, results from meta-analyses are considered the strongest form of evidence. However, rigorous evidence syntheses are time-consuming and labor-intensive, requiring manual extraction of data from individual trials to be synthesized. Ideally, language technologies would permit fully automatic meta-analysis, on demand. This requires accurately extracting numerical results from individual trials, which has been beyond the capabilities of natural language processing (NLP) models to date. In this work, we evaluate whether modern large language models (LLMs) can reliably perform this task. We annotate (and release) a modest but granular evaluation dataset of clinical trial reports with numerical findings attached to interventions, comparators, and outcomes. Using this dataset, we evaluate the performance of seven LLMs applied zero-shot for the task of conditionally extracting numerical findings from trial reports. We find that massive LLMs that can accommodate lengthy inputs are tantalizingly close to realizing fully automatic meta-analysis, especially for dichotomous (binary) outcomes (e.g., mortality). However, LLMs -- including ones trained on biomedical texts -- perform poorly when the outcome measures are complex and tallying the results requires inference. This work charts a path toward fully automatic meta-analysis of RCTs via LLMs, while also highlighting the limitations of existing models for this aim."},
{"date":"2024-05","title":"Enhancing Language Models for Financial Relation Extraction with Named Entities and Part-of-Speech","author":"Menglin Li, and Kwan Hui Lim","link":"http://arxiv.org/abs/2405.06665v1","abstract":"The Financial Relation Extraction (FinRE) task involves identifying the entities and their relation, given a piece of financial statement/text. To solve this FinRE problem, we propose a simple but effective strategy that improves the performance of pre-trained language models by augmenting them with Named Entity Recognition (NER) and Part-Of-Speech (POS) information, as well as different approaches to combining this information. Experiments on a financial relations dataset show promising results and highlight the benefits of incorporating NER and POS in existing models. Our dataset and codes are available at https://github.com/kwanhui/FinRelExtract."},
{"date":"2024-04","title":"ECC Analyzer: Extract Trading Signal from Earnings Conference Calls using Large Language Model for Stock Performance Prediction","author":"Yupeng Cao, Zhi Chen, Qingyun Pei, Nathan Jinseok Lee, K. P. Subbalakshmi, and Papa Momar Ndiaye","link":"http://arxiv.org/abs/2404.18470v2","abstract":"In the realm of financial analytics, leveraging unstructured data, such as earnings conference calls (ECCs), to forecast stock volatility is a critical challenge that has attracted both academics and investors. While previous studies have used multimodal deep learning-based models to obtain a general view of ECCs for volatility prediction, they often fail to capture detailed, complex information. Our research introduces a novel framework: \textbf{ECC Analyzer}, which utilizes large language models (LLMs) to extract richer, more predictive content from ECCs to aid prediction performance. We use pre-trained large models to extract textual and audio features from ECCs and implement a hierarchical information extraction strategy to extract more fine-grained information. This strategy first extracts paragraph-level general information by summarizing the text and then extracts fine-grained focus sentences using Retrieval-Augmented Generation (RAG). These features are then fused through multimodal feature fusion to perform volatility prediction. Experimental results demonstrate that our model outperforms traditional analytical benchmarks, confirming the effectiveness of advanced LLM techniques in financial analysis."},
{"date":"2024-04","title":"Learnable Linguistic Watermarks for Tracing Model Extraction Attacks on Large Language Models","author":"Minhao Bai, Kaiyi Pang, and Yongfeng Huang","link":"http://arxiv.org/abs/2405.01509v1","abstract":"In the rapidly evolving domain of artificial intelligence, safeguarding the intellectual property of Large Language Models (LLMs) is increasingly crucial. Current watermarking techniques against model extraction attacks, which rely on signal insertion in model logits or post-processing of generated text, remain largely heuristic. We propose a novel method for embedding learnable linguistic watermarks in LLMs, aimed at tracing and preventing model extraction attacks. Our approach subtly modifies the LLM's output distribution by introducing controlled noise into token frequency distributions, embedding a statistically identifiable, controllable watermark. We leverage statistical hypothesis testing and information theory, particularly focusing on Kullback-Leibler Divergence, to differentiate between original and modified distributions effectively. Our watermarking method strikes a delicate balance between robustness and output quality, maintaining low false positive/negative rates and preserving the LLM's original performance."},
{"date":"2024-04","title":"Utilizing Large Language Models for Information Extraction from Real Estate Transactions","author":"Yu Zhao, and Haoxiang Gao","link":"http://arxiv.org/abs/2404.18043v1","abstract":"Real estate sales contracts contain crucial information for property transactions, but manual extraction of data can be time-consuming and error-prone. This paper explores the application of large language models, specifically transformer-based architectures, for automated information extraction from real estate contracts. We discuss challenges, techniques, and future directions in leveraging these models to improve efficiency and accuracy in real estate contract analysis."},
{"date":"2024-04","title":"Empirical Analysis of Dialogue Relation Extraction with Large Language Models","author":"Guozheng Li, Zijie Xu, Ziyu Shang, Jiajun Liu, Ke Ji, and Yikai Guo","link":"http://arxiv.org/abs/2404.17802v1","abstract":"Dialogue relation extraction (DRE) aims to extract relations between two arguments within a dialogue, which is more challenging than standard RE due to the higher person pronoun frequency and lower information density in dialogues. However, existing DRE methods still suffer from two serious issues: (1) they struggle to capture long and sparse multi-turn information, and (2) they struggle to extract golden relations based on partial dialogues. This motivates us to discover more effective methods that can alleviate these issues. We notice that the rise of large language models (LLMs) has sparked considerable interest in evaluating their performance across diverse tasks. To this end, we initially investigate the capabilities of different LLMs in DRE, considering both proprietary models and open-source models. Interestingly, we discover that LLMs significantly alleviate both issues in existing DRE methods. Generally, we have the following findings: (1) scaling up model size substantially boosts the overall DRE performance and achieves exceptional results, tackling the difficulty of capturing long and sparse multi-turn information; (2) LLMs suffer a much smaller performance drop from the entire-dialogue setting to the partial-dialogue setting compared to existing methods; (3) LLMs deliver competitive or superior performance under both full-shot and few-shot settings compared to the current state-of-the-art; (4) LLMs show modest performance on inverse relations but much stronger improvements on general relations, and they can handle dialogues of various lengths, especially longer sequences."},
{"date":"2024-04","title":"GraphER: A Structure-aware Text-to-Graph Model for Entity and Relation Extraction","author":"Urchade Zaratiana, Nadi Tomeh, Niama El Khbir, Pierre Holat, and Thierry Charnois","link":"http://arxiv.org/abs/2404.12491v1","abstract":"Information extraction (IE) is an important task in Natural Language Processing (NLP), involving the extraction of named entities and their relationships from unstructured text. In this paper, we propose a novel approach to this task by formulating it as graph structure learning (GSL). By formulating IE as GSL, we enhance the model's ability to dynamically refine and optimize the graph structure during the extraction process. This formulation allows for better interaction and structure-informed decisions for entity and relation prediction, in contrast to previous models that have separate or untied predictions for these tasks.
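A skeleton of a Bi-LSTM-CRF tagger of the kind just described is shown below. It assumes the third-party pytorch-crf package for the CRF layer; the vocabulary size, dimensions, and BIO tag set are placeholders, not the paper's settings.

```python
# Bi-LSTM-CRF sequence tagger skeleton for mention extraction.
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf (assumed dependency)

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100, hidden=128, num_tags=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden // 2, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(hidden, num_tags)        # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)   # transition scores

    def loss(self, tokens, tags, mask):
        emissions = self.fc(self.lstm(self.embed(tokens))[0])
        return -self.crf(emissions, tags, mask=mask)  # negative log-likelihood

    def decode(self, tokens, mask):
        emissions = self.fc(self.lstm(self.embed(tokens))[0])
        return self.crf.decode(emissions, mask=mask)  # Viterbi tag sequences

# Toy usage with BIO tags (O=0, B-dataset=1, I-dataset=2):
model = BiLSTMCRF()
tokens = torch.randint(0, 5000, (2, 12))
tags = torch.randint(0, 3, (2, 12))
mask = torch.ones(2, 12, dtype=torch.bool)
print(model.loss(tokens, tags, mask).item())
print(model.decode(tokens, mask)[0])
```

The CRF layer is what lets the model score whole tag sequences rather than tokens independently, which matters for multi-token dataset mentions.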
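Those two SPSG steps, patch-wise perturbation followed by threshold filtering, can be sketched with finite differences against a stand-in victim score. The patch size, epsilon, and threshold below are arbitrary illustrative choices, not the paper's.

```python
# Patch-level gradient estimation: one averaged gradient value per patch,
# estimated by central finite differences on the victim's output, then
# low-magnitude patches are zeroed to cut variance.
import numpy as np

def victim_score(img):
    # Placeholder for querying the victim model's confidence on one class.
    return float(np.tanh(img.mean()))

def patch_gradients(img, patch=8, eps=1e-2, threshold=0.05):
    H, W = img.shape
    grads = np.zeros((H // patch, W // patch))
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            plus, minus = img.copy(), img.copy()
            plus[i:i+patch, j:j+patch] += eps     # perturb the whole patch
            minus[i:i+patch, j:j+patch] -= eps
            g = (victim_score(plus) - victim_score(minus)) / (2 * eps)
            grads[i // patch, j // patch] = g
    # Threshold filter: drop patches with small relative gradient magnitude.
    grads[np.abs(grads) < threshold * np.abs(grads).max()] = 0.0
    return grads

img = np.random.rand(32, 32)
g = patch_gradients(img)
print("queries used:", 2 * g.size, "| nonzero patches:", int((g != 0).sum()))
```

Note the query count: two queries per patch instead of two per pixel, which is the efficiency argument the abstract makes.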
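The NLD metric used in the work-behavior study above is straightforward to pin down: the Levenshtein distance between predicted and ground-truth motion-class sequences, normalized by the longer length, so 0 means identical segmentations. A self-contained sketch with made-up sequences:

```python
# Normalized Levenshtein distance between two label sequences.
def levenshtein(a, b):
    # prev[j] holds the edit distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[len(b)]

def nld(a, b):
    return levenshtein(a, b) / max(len(a), len(b), 1)

# Toy motion-class sequences (e.g., segmented worker actions):
print(nld("AABBC", "ABBBC"))  # 0.2, similar patterns
print(nld("AABBC", "DDEEF"))  # 1.0, no overlap
```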
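The Kullback-Leibler test behind the linguistic-watermark method above compares token-frequency distributions before and after the controlled noise. A toy sketch follows; the distributions, noise scale, and the idea of a calibrated threshold are invented for illustration, not taken from the paper.

```python
# KL divergence between an original and a frequency-shifted token distribution.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
original = rng.dirichlet(np.ones(1000))        # base token frequencies
noise = rng.normal(0, 0.2, size=1000)
watermarked = original * np.exp(noise)         # controlled frequency shift
watermarked /= watermarked.sum()

print("D_KL(watermarked || original):", kl_divergence(watermarked, original))
print("D_KL(original    || original):", kl_divergence(original, original))
# A detector would flag a suspect model when the first value exceeds a
# threshold calibrated for acceptable false positive/negative rates.
```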
When compared against state-of-the-art\nbaselines on joint entity and relation extraction benchmarks, our model,\nGraphER, achieves competitive results."},{"date":"2024-04","title":"AI-Enhanced Cognitive Behavioral Therapy: Deep Learning and Large Language Models for Extracting Cognitive Pathways from Social Media Texts","author":"Meng Jiang, Yi Jing Yu, Qing Zhao, Jianqiang Li, Changwei Song, Hongzhi Qi, Wei Zhai, Dan Luo, Xiaoqin Wang, Guanghui Fu, and Bing Xiang Yang","link":"http://arxiv.org/abs/2404.11449v1","abstract":"Cognitive Behavioral Therapy (CBT) is an effective technique for addressing\nthe irrational thoughts stemming from mental illnesses, but it necessitates\nprecise identification of cognitive pathways to be successfully implemented in\npatient care. In current society, individuals frequently express negative\nemotions on social media on specific topics, often exhibiting cognitive\ndistortions, including suicidal behaviors in extreme cases. Yet, there is a\nnotable absence of methodologies for analyzing cognitive pathways that could\naid psychotherapists in conducting effective interventions online. In this\nstudy, we gathered data from social media and established the task of\nextracting cognitive pathways, annotating the data based on a cognitive\ntheoretical framework. We initially categorized the task of extracting\ncognitive pathways as a hierarchical text classification with four main\ncategories and nineteen subcategories. Following this, we structured a text\nsummarization task to help psychotherapists quickly grasp the essential\ninformation. Our experiments evaluate the performance of deep learning and\nlarge language models (LLMs) on these tasks. The results demonstrate that our\ndeep learning method achieved a micro-F1 score of 62.34% in the hierarchical\ntext classification task. Meanwhile, in the text summarization task, GPT-4\nattained a Rouge-1 score of 54.92 and a Rouge-2 score of 30.86, surpassing the\nexperimental deep learning model's performance. However, it may suffer from an\nissue of hallucination. We have made all models and codes publicly available to\nsupport further research in this field."},{"date":"2024-04","title":"TransLinkGuard: Safeguarding Transformer Models Against Model Stealing in Edge Deployment","author":"Qinfeng Li, Zhiqiang Shen, Zhenghan Qin, Yangfan Xie, Xuhong Zhang, Tianyu Du, and Jianwei Yin","link":"http://arxiv.org/abs/2404.11121v1","abstract":"Proprietary large language models (LLMs) have been widely applied in various\nscenarios. Additionally, deploying LLMs on edge devices is trending for\nefficiency and privacy reasons. However, edge deployment of proprietary LLMs\nintroduces new security challenges: edge-deployed models are exposed as\nwhite-box accessible to users, enabling adversaries to conduct effective model\nstealing (MS) attacks. Unfortunately, existing defense mechanisms fail to\nprovide effective protection. Specifically, we identify four critical\nprotection properties that existing methods fail to simultaneously satisfy: (1)\nmaintaining protection after a model is physically copied; (2) authorizing\nmodel access at request level; (3) safeguarding runtime reverse engineering;\n(4) achieving high security with negligible runtime overhead. To address the\nabove issues, we propose TransLinkGuard, a plug-and-play model protection\napproach against model stealing on edge devices. The core part of\nTransLinkGuard is a lightweight authorization module residing in a secure\nenvironment, e.g., TEE. 
The authorization module can freshly authorize each\nrequest based on its input. Extensive experiments show that TransLinkGuard\nachieves the same security protection as the black-box security guarantees with\nnegligible overhead."},{"date":"2024-04","title":"A LayoutLMv3-Based Model for Enhanced Relation Extraction in Visually-Rich Documents","author":"Wiam Adnan, Joel Tang, Yassine Bel Khayat Zouggari, Seif Edinne Laatiri, Laurent Lam, and Fabien Caspani","link":"http://arxiv.org/abs/2404.10848v1","abstract":"Document Understanding is an evolving field in Natural Language Processing\n(NLP). In particular, visual and spatial features are essential in addition to\nthe raw text itself and hence, several multimodal models were developed in the\nfield of Visual Document Understanding (VDU). However, while research is mainly\nfocused on Key Information Extraction (KIE), Relation Extraction (RE) between\nidentified entities is still under-studied. For instance, RE is crucial to\nregroup entities or obtain a comprehensive hierarchy of data in a document. In\nthis paper, we present a model that, initialized from LayoutLMv3, can match or\noutperform the current state-of-the-art results in RE applied to Visually-Rich\nDocuments (VRD) on FUNSD and CORD datasets, without any specific pre-training\nand with fewer parameters. We also report an extensive ablation study performed\non FUNSD, highlighting the great impact of certain features and modelization\nchoices on the performances."},{"date":"2024-04","title":"Relation Extraction Using Large Language Models: A Case Study on Acupuncture Point Locations","author":"Yiming Li, Xueqing Peng, Jianfu Li, Xu Zuo, Suyuan Peng, Donghong Pei, Cui Tao, Hua Xu, and Na Hong","link":"http://arxiv.org/abs/2404.05415v2","abstract":"In acupuncture therapy, the accurate location of acupoints is essential for\nits effectiveness. The advanced language understanding capabilities of large\nlanguage models (LLMs) like Generative Pre-trained Transformers (GPT) present a\nsignificant opportunity for extracting relations related to acupoint locations\nfrom textual knowledge sources. This study aims to compare the performance of\nGPT with traditional deep learning models (Long Short-Term Memory (LSTM) and\nBidirectional Encoder Representations from Transformers for Biomedical Text\nMining (BioBERT)) in extracting acupoint-related location relations and assess\nthe impact of pretraining and fine-tuning on GPT's performance. We utilized the\nWorld Health Organization Standard Acupuncture Point Locations in the Western\nPacific Region (WHO Standard) as our corpus, which consists of descriptions of\n361 acupoints. Five types of relations ('direction_of,' 'distance_of,'\n'part_of,' 'near_acupoint,' and 'located_near') (n= 3,174) between acupoints\nwere annotated. Five models were compared: BioBERT, LSTM, pre-trained GPT-3.5,\nfine-tuned GPT-3.5, as well as pre-trained GPT-4. Performance metrics included\nmicro-average exact match precision, recall, and F1 scores. Our results\ndemonstrate that fine-tuned GPT-3.5 consistently outperformed other models in\nF1 scores across all relation types. Overall, it achieved the highest\nmicro-average F1 score of 0.92. This study underscores the effectiveness of\nLLMs like GPT in extracting relations related to acupoint locations, with\nimplications for accurately modeling acupuncture knowledge and promoting\nstandard implementation in acupuncture training and practice. 
The findings also contribute to advancing informatics applications in traditional and complementary medicine, showcasing the potential of LLMs in natural language processing."},
{"date":"2024-04","title":"PerkwE_COQA: Enhanced Persian Conversational Question Answering by combining contextual keyword extraction with Large Language Models","author":"Pardis Moradbeiki, and Nasser Ghadiri","link":"http://arxiv.org/abs/2404.05406v2","abstract":"Smart cities need the involvement of their residents to enhance quality of life. Conversational question-answering is an emerging approach for user engagement. There is an increasing demand for advanced conversational question-answering that goes beyond classic systems. Existing approaches have shown that LLMs offer promising capabilities for CQA, but may struggle to capture the nuances of conversational contexts. The new approach involves understanding the content and engaging in a multi-step conversation with the user to fulfill their needs. This paper presents a novel method to elevate the performance of Persian Conversational Question-Answering (CQA) systems. It combines the strengths of Large Language Models (LLMs) with contextual keyword extraction. Our method extracts keywords specific to the conversational flow, providing the LLM with additional context to understand the user's intent and generate more relevant and coherent responses. We evaluated the effectiveness of this combined approach through various metrics, demonstrating significant improvements in CQA performance compared to an LLM-only baseline. The proposed method effectively handles implicit questions, delivers contextually relevant answers, and tackles complex questions that rely heavily on conversational context. The findings indicate that our method outperformed existing methods and the LLM-only baseline on the evaluation benchmarks by up to 8%."},
{"date":"2024-04","title":"GLCM-Based Feature Combination for Extraction Model Optimization in Object Detection Using Machine Learning","author":"Florentina Tatrin Kurniati, Daniel HF Manongga, Eko Sediyono, Sri Yulianto Joko Prasetyo, and Roy Rudolf Huizen","link":"http://arxiv.org/abs/2404.04578v1","abstract":"In the era of modern technology, object detection using the Gray Level Co-occurrence Matrix (GLCM) extraction method plays a crucial role in object recognition processes. It finds applications in real-time scenarios such as security surveillance and autonomous vehicle navigation, among others. Computational efficiency becomes a critical factor in achieving real-time object detection. Hence, there is a need for a detection model with low complexity and satisfactory accuracy. This research aims to enhance computational efficiency by selecting appropriate features within the GLCM framework. Two classification models, namely K-Nearest Neighbours (K-NN) and Support Vector Machine (SVM), were employed, with the results indicating that K-NN outperforms SVM in terms of computational complexity. Specifically, K-NN, when utilizing a combination of Correlation, Energy, and Homogeneity features, achieves a 100% accuracy rate with low complexity. Moreover, when using a combination of Energy and Homogeneity features, K-NN attains an almost perfect accuracy level of 99.9889%, while maintaining low complexity. On the other hand, despite SVM achieving 100% accuracy in certain feature combinations, its high or very high complexity can pose challenges, particularly in real-time applications. Therefore, based on the trade-off between accuracy and complexity, the K-NN model with a combination of Correlation, Energy, and Homogeneity features emerges as a more suitable choice for real-time applications that demand high accuracy and low complexity. This research provides valuable insights for optimizing object detection in various applications requiring both high accuracy and rapid responsiveness."},
{"date":"2024-04","title":"Knowledge Distillation-Based Model Extraction Attack using GAN-based Private Counterfactual Explanations","author":"Fatima Ezzeddine, Omran Ayoub, and Silvia Giordano","link":"http://arxiv.org/abs/2404.03348v2","abstract":"In recent years, there has been a notable increase in the deployment of machine learning (ML) models as services (MLaaS) across diverse production software applications. In parallel, explainable AI (XAI) continues to evolve, addressing the necessity for transparency and trustworthiness in ML models. XAI techniques aim to enhance the transparency of ML models by providing insights, in terms of the model's explanations, into their decision-making process. Simultaneously, some MLaaS platforms now offer explanations alongside the ML prediction outputs. This setup has elevated concerns regarding vulnerabilities in MLaaS, particularly in relation to privacy leakage attacks such as model extraction attacks (MEA). This is due to the fact that explanations can unveil insights about the inner workings of the model which could be exploited by malicious users. In this work, we focus on investigating how model explanations, particularly counterfactual explanations (CFs), can be exploited for performing MEA within the MLaaS platform. We also delve into assessing the effectiveness of incorporating differential privacy (DP) as a mitigation strategy. To this end, we first propose a novel approach for MEA based on Knowledge Distillation (KD) to enhance the efficiency of extracting a substitute model of a target model exploiting CFs, without any knowledge about the training data distribution by the attacker. Then, we devise an approach for training CF generators that incorporate DP to generate private CFs. We conduct thorough experimental evaluations on real-world datasets and demonstrate that our proposed KD-based MEA can yield a high-fidelity substitute model with a reduced number of queries with respect to baseline approaches. Furthermore, our findings reveal that including a privacy layer can help mitigate the MEA. However, this comes at the expense of CF quality, which impacts the performance of the explanations."},
{"date":"2024-04","title":"Comparative Study of Domain Driven Terms Extraction Using Large Language Models","author":"Sandeep Chataut, Tuyen Do, Bichar Dip Shrestha Gurung, Shiva Aryal, Anup Khanal, Carol Lushbough, and Etienne Gnimpieba","link":"http://arxiv.org/abs/2404.02330v1","abstract":"Keywords play a crucial role in bridging the gap between human understanding and machine processing of textual data. They are essential to data enrichment because they form the basis for detailed annotations that provide a more insightful and in-depth view of the underlying data. Keyword/domain-driven term extraction is a pivotal task in natural language processing, facilitating information retrieval, document summarization, and content categorization. This review focuses on keyword extraction methods, emphasizing the use of three major Large Language Models (LLMs): Llama2-7B, GPT-3.5, and Falcon-7B. We employed a custom Python package to interface with these LLMs, simplifying keyword extraction. Our study, utilizing the Inspec and PubMed datasets, evaluates the performance of these models. The Jaccard similarity index was used for assessment, yielding scores of 0.64 (Inspec) and 0.21 (PubMed) for GPT-3.5, 0.40 and 0.17 for Llama2-7B, and 0.23 and 0.12 for Falcon-7B. This paper underlines the role of prompt engineering in LLMs for better keyword extraction and discusses the impact of hallucination in LLMs on result evaluation. It also sheds light on the challenges in using LLMs for keyword extraction, including model complexity, resource demands, and optimization techniques."},
{"date":"2024-04","title":"Towards System Modelling to Support Diseases Data Extraction from the Electronic Health Records for Physicians Research Activities","author":"Bushra F. Alsaqer, Alaa F. Alsaqer, and Amna Asif","link":"http://arxiv.org/abs/2404.01218v1","abstract":"The use of Electronic Health Records (EHRs) has increased dramatically in the past 15 years, as they are considered an important source for managing patient data. EHRs are primary sources of disease diagnoses and demographic data of patients worldwide. Therefore, the data can be utilized for secondary tasks such as research. This paper aims to make such data usable for research activities such as monitoring disease statistics for a specific population. As a result, researchers can identify disease causes related to the behavior and lifestyle of the target group. One limitation of EHR systems is that the data are not available in a standard format but in various forms. Therefore, the names of diseases and the demographic data must first be converted into one standardized form to make them usable for research activities. A large number of EHRs are available, and solving the standardization issues requires optimized techniques. We used a first-hand EHR dataset extracted from EHR systems. Our application uploads the dataset from the EHRs and converts it to the ICD-10 coding system to solve the standardization problem. To do so, we first apply pre-processing, annotation, and transformation steps to convert the data into the standard form. The data pre-processing is applied to normalize demographic formats. In the annotation step, a machine learning model is used to recognize the diseases from the text. Furthermore, the transformation step converts disease names to the ICD-10 coding format. The model was evaluated manually by comparing its performance in terms of disease recognition with an available dictionary-based system (MetaMap). The accuracy of the proposed machine learning model is 81%, outperforming MetaMap's accuracy of 67%. This paper contributes to system modelling for EHR data extraction to support research activities."},
{"date":"2024-03","title":"MIPS at SemEval-2024 Task 3: Multimodal Emotion-Cause Pair Extraction in Conversations with Multimodal Language Models","author":"Zebang Cheng, Fuqiang Niu, Yuxiang Lin, Zhi-Qi Cheng, Bowen Zhang, and Xiaojiang Peng","link":"http://arxiv.org/abs/2404.00511v3","abstract":"This paper presents our winning submission to Subtask 2 of SemEval 2024 Task 3 on multimodal emotion cause analysis in conversations.
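The Correlation, Energy, and Homogeneity GLCM features that the object-detection studies above feed into K-NN can be reproduced with scikit-image and scikit-learn. A sketch on synthetic two-class images follows; the GLCM settings (distance 1, angle 0) and the data are illustrative only, not the studies' configuration.

```python
# GLCM texture features + K-NN classification on synthetic images.
import numpy as np
from skimage.feature import graycomatrix, graycoprops  # skimage >= 0.19 names
from sklearn.neighbors import KNeighborsClassifier

def glcm_features(img):
    # One co-occurrence matrix, three scalar texture features.
    glcm = graycomatrix(img, distances=[1], angles=[0],
                        levels=256, symmetric=True, normed=True)
    return [graycoprops(glcm, p)[0, 0]
            for p in ("correlation", "energy", "homogeneity")]

rng = np.random.default_rng(0)
# Two toy classes: smooth horizontal gradients vs. pure noise.
smooth = [np.tile(np.arange(64, dtype=np.uint8) * 4, (64, 1)) for _ in range(20)]
noisy = [rng.integers(0, 256, (64, 64), dtype=np.uint8) for _ in range(20)]
X = [glcm_features(im) for im in smooth + noisy]
y = [0] * 20 + [1] * 20

knn = KNeighborsClassifier(n_neighbors=3).fit(X[::2], y[::2])   # even: train
print("held-out accuracy:", knn.score(X[1::2], y[1::2]))        # odd: test
```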
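The Jaccard similarity index used for the keyword scores above is a one-line set computation; the keyword lists here are made up for illustration.

```python
# Jaccard index between predicted and gold keyword sets.
def jaccard(pred, gold):
    p, g = set(pred), set(gold)
    return len(p & g) / len(p | g) if p | g else 1.0

gold = ["model extraction", "watermark", "distillation"]
pred = ["watermark", "distillation", "fine-tuning"]
print(jaccard(pred, gold))  # 2 shared / 4 total = 0.5
```

Exact set matching is strict, which partly explains the low absolute scores reported above: a near-miss keyword counts as a full error.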
We propose a novel\nMultimodal Emotion Recognition and Multimodal Emotion Cause Extraction\n(MER-MCE) framework that integrates text, audio, and visual modalities using\nspecialized emotion encoders. Our approach sets itself apart from\ntop-performing teams by leveraging modality-specific features for enhanced\nemotion understanding and causality inference. Experimental evaluation\ndemonstrates the advantages of our multimodal approach, with our submission\nachieving a competitive weighted F1 score of 0.3435, ranking third with a\nmargin of only 0.0339 behind the 1st team and 0.0025 behind the 2nd team.\nProject: https://github.com/MIPS-COLT/MER-MCE.git"},{"date":"2024-03","title":"Privacy Backdoors: Stealing Data with Corrupted Pretrained Models","author":"Shanglun Feng, and Florian Tram\u00e8r","link":"http://arxiv.org/abs/2404.00473v1","abstract":"Practitioners commonly download pretrained machine learning models from open\nrepositories and finetune them to fit specific applications. We show that this\npractice introduces a new risk of privacy backdoors. By tampering with a\npretrained model's weights, an attacker can fully compromise the privacy of the\nfinetuning data. We show how to build privacy backdoors for a variety of\nmodels, including transformers, which enable an attacker to reconstruct\nindividual finetuning samples, with a guaranteed success! We further show that\nbackdoored models allow for tight privacy attacks on models trained with\ndifferential privacy (DP). The common optimistic practice of training DP models\nwith loose privacy guarantees is thus insecure if the model is not trusted.\nOverall, our work highlights a crucial and overlooked supply chain attack on\nmachine learning privacy."},{"date":"2024-03","title":"Efficient Data-Free Model Stealing with Label Diversity","author":"Yiyong Liu, Rui Wen, Michael Backes, and Yang Zhang","link":"http://arxiv.org/abs/2404.00108v1","abstract":"Machine learning as a Service (MLaaS) allows users to query the machine\nlearning model in an API manner, which provides an opportunity for users to\nenjoy the benefits brought by the high-performance model trained on valuable\ndata. This interface boosts the proliferation of machine learning based\napplications, while on the other hand, it introduces the attack surface for\nmodel stealing attacks. Existing model stealing attacks have relaxed their\nattack assumptions to the data-free setting, while maintaining their\neffectiveness. However, these methods are complex and consist of several components, which\nobscure the core on which the attack really depends. In this paper, we revisit\nthe model stealing problem from a diversity perspective and demonstrate that\nkeeping the generated data samples more diverse across all the classes is the\ncritical point for improving the attack performance. Based on this conjecture,\nwe provide a simplified attack framework. We empirically validate our conjecture\nby evaluating the effectiveness of our attack, and experimental results show\nthat our approach is able to achieve comparable or even better performance\ncompared with the state-of-the-art method.
Furthermore, benefiting from the\nabsence of redundant components, our method demonstrates its advantages in\nattack efficiency and query budget."},{"date":"2024-03","title":"Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models","author":"Hyunbyung Park, Sukyung Lee, Gyoungjin Gim, Yungi Kim, Dahyun Kim, and Chanjun Park","link":"http://arxiv.org/abs/2403.19340v1","abstract":"To address the challenges associated with data processing at scale, we\npropose Dataverse, a unified open-source Extract-Transform-Load (ETL) pipeline\nfor large language models (LLMs) with a user-friendly design at its core. Its\nblock-based interface makes it easy to add custom processors, allowing users to\nreadily and efficiently build their own ETL pipelines. We hope that Dataverse\nwill serve as a vital tool for LLM development, and we open-source the entire\nlibrary to welcome community contributions. Additionally, we\nprovide a concise, two-minute video demonstration of our system, illustrating\nits capabilities and implementation."},{"date":"2024-03","title":"MisGUIDE: Defense Against Data-Free Deep Learning Model Extraction","author":"Mahendra Gurve, Sankar Behera, Satyadev Ahlawat, and Yamuna Prasad","link":"http://arxiv.org/abs/2403.18580v1","abstract":"The rise of Machine Learning as a Service (MLaaS) has led to the widespread\ndeployment of machine learning models trained on diverse datasets. These models\nare employed for predictive services through APIs, raising concerns about the\nsecurity and confidentiality of the models due to emerging vulnerabilities in\nprediction APIs. Of particular concern are model cloning attacks, where\nindividuals with limited data and no knowledge of the training dataset manage\nto replicate a victim model's functionality through black-box query access.\nThis commonly entails generating adversarial queries to query the victim model,\nthereby creating a labeled dataset.\n This paper proposes \"MisGUIDE\", a two-step defense framework for Deep\nLearning models that disrupts the adversarial sample generation process by\nproviding a probabilistic response when the query is deemed out-of-distribution\n(OOD). The first step\nemploys a Vision Transformer-based framework to identify OOD queries, while the\nsecond step perturbs the response for such queries, introducing a probabilistic\nloss function to MisGUIDE the attackers. The aim of the proposed defense method\nis to reduce the accuracy of the cloned model while maintaining accuracy on\nauthentic queries. Extensive experiments conducted on two benchmark datasets\ndemonstrate that the proposed framework significantly enhances the resistance\nagainst state-of-the-art data-free model extraction in black-box settings."},{"date":"2024-03","title":"A Path Towards Legal Autonomy: An interoperable and explainable approach to extracting, transforming, loading and computing legal information using large language models, expert systems and Bayesian networks","author":"Axel Constant, Hannes Westermann, Bryan Wilson, Alex Kiefer, Ines Hipolito, Sylvain Pronovost, Steven Swanson, Mahault Albarracin, and Maxwell J. D. Ramstead","link":"http://arxiv.org/abs/2403.18537v1","abstract":"Legal autonomy - the lawful activity of artificial intelligence agents - can\nbe achieved in one of two ways.
It can be achieved either by imposing\nconstraints on AI actors such as developers, deployers and users, and on AI\nresources such as data, or by imposing constraints on the range and scope of\nthe impact that AI agents can have on the environment. The latter approach\ninvolves encoding extant rules concerning AI-driven devices into the software\nof AI agents controlling those devices (e.g., encoding rules about limitations\non zones of operations into the agent software of an autonomous drone device).\nThis is a challenge since the effectiveness of such an approach requires a method\nof extracting, loading, transforming and computing legal information that would\nbe both explainable and legally interoperable, and that would enable AI agents\nto reason about the law. In this paper, we sketch a proof of principle for such\na method using large language models (LLMs), expert legal systems known as\nlegal decision paths, and Bayesian networks. We then show how the proposed\nmethod could be applied to extant regulation in matters of autonomous cars,\nsuch as the California Vehicle Code."},{"date":"2024-03","title":"Segment Anything Model for Road Network Graph Extraction","author":"Congrui Hetang, Haoru Xue, Cindy Le, Tianwei Yue, Wenping Wang, and Yihui He","link":"http://arxiv.org/abs/2403.16051v3","abstract":"We propose SAM-Road, an adaptation of the Segment Anything Model (SAM) for\nextracting large-scale, vectorized road network graphs from satellite imagery.\nTo predict graph geometry, we formulate it as a dense semantic segmentation\ntask, leveraging the inherent strengths of SAM. The image encoder of SAM is\nfine-tuned to produce probability masks for roads and intersections, from which\nthe graph vertices are extracted via simple non-maximum suppression. To predict\ngraph topology, we designed a lightweight transformer-based graph neural\nnetwork, which leverages the SAM image embeddings to estimate the edge\nexistence probabilities between vertices. Our approach directly predicts the\ngraph vertices and edges for large regions without expensive and complex\npost-processing heuristics, and is capable of building complete road network\ngraphs spanning multiple square kilometers in a matter of seconds. With its\nsimple, straightforward, and minimalist design, SAM-Road achieves comparable\naccuracy with the state-of-the-art method RNGDet++, while being 40 times faster\non the City-scale dataset. We thus demonstrate the power of a foundational\nvision model when applied to a graph learning task. The code is available at\nhttps://github.com/htcr/sam_road."},{"date":"2024-03","title":"AutoRE: Document-Level Relation Extraction with Large Language Models","author":"Lilong Xue, Dan Zhang, Yuxiao Dong, and Jie Tang","link":"http://arxiv.org/abs/2403.14888v3","abstract":"Large Language Models (LLMs) have demonstrated exceptional abilities in\ncomprehending and generating text, motivating numerous researchers to utilize\nthem for Information Extraction (IE) purposes, including Relation Extraction\n(RE).
Nonetheless, most existing methods are predominantly designed for\nSentence-level Relation Extraction (SentRE) tasks, which typically encompass a\nrestricted set of relations and triplet facts within a single sentence.\nFurthermore, certain approaches resort to treating relations as candidate\nchoices integrated into prompt templates, leading to inefficient processing and\nsuboptimal performance when tackling Document-Level Relation Extraction (DocRE)\ntasks, which entail handling multiple relations and triplet facts distributed\nacross a given document, posing distinct challenges. To overcome these\nlimitations, we introduce AutoRE, an end-to-end DocRE model that adopts a novel\nRE extraction paradigm named RHF (Relation-Head-Facts). Unlike existing\napproaches, AutoRE does not rely on the assumption of known relation options,\nmaking it more reflective of real-world scenarios. Additionally, we have\ndeveloped an easily extensible RE framework using a Parameter-Efficient\nFine-Tuning (PEFT) algorithm (QLoRA). Our experiments on the RE-DocRED dataset\nshowcase AutoRE's performance, achieving state-of-the-art results and\nsurpassing TAG by 10.03\% and 9.03\% on the dev and test sets, respectively. The\ncode is available at https://github.com/THUDM/AutoRE and the demonstration\nvideo is provided at https://www.youtube.com/watch?v=IhKRsZUAxKk."},{"date":"2024-03","title":"Style-Extracting Diffusion Models for Semi-Supervised Histopathology Segmentation","author":"Mathias \u00d6ttl, Frauke Wilm, Jana Steenpass, Jingna Qiu, Matthias R\u00fcbner, Arndt Hartmann, Matthias Beckmann, Peter Fasching, Andreas Maier, Ramona Erber, Bernhard Kainz, and Katharina Breininger","link":"http://arxiv.org/abs/2403.14429v1","abstract":"Deep learning-based image generation has seen significant advancements with\ndiffusion models, notably improving the quality of generated images. Despite\nthese developments, generating images with unseen characteristics beneficial\nfor downstream tasks has received limited attention. To bridge this gap, we\npropose Style-Extracting Diffusion Models, featuring two conditioning\nmechanisms. Specifically, we utilize 1) a style conditioning mechanism which\nallows injecting style information from previously unseen images during image\ngeneration and 2) a content conditioning mechanism which can be targeted to a downstream\ntask, e.g., layout for segmentation. We introduce a trainable style encoder to\nextract style information from images, and an aggregation block that merges\nstyle information from multiple style inputs. This architecture enables the\ngeneration of images with unseen styles in a zero-shot manner, by leveraging\nstyles from unseen images, resulting in more diverse generations. In this work,\nwe use the image layout as the target condition and first show the capability of\nour method on a natural image dataset as a proof-of-concept. We further\ndemonstrate its versatility in histopathology, where we combine prior knowledge\nabout tissue composition and unannotated data to create diverse synthetic\nimages with known layouts. This allows us to generate additional synthetic data\nto train a segmentation network in a semi-supervised fashion. We verify the\nadded value of the generated images by showing improved segmentation results\nand lower performance variability between patients when synthetic images are\nincluded during segmentation training.
Our code will be made publicly available\nat [LINK]."},{"date":"2024-03","title":"Clinical information extraction for Low-resource languages with Few-shot learning using Pre-trained language models and Prompting","author":"Phillip Richter-Pechanski, Philipp Wiesenbach, Dominic M. Schwab, Christina Kiriakou, Nicolas Geis, Christoph Dieterich, and Anette Frank","link":"http://arxiv.org/abs/2403.13369v2","abstract":"Automatic extraction of medical information from clinical documents poses\nseveral challenges: high costs of required clinical expertise, limited\ninterpretability of model predictions, restricted computational resources and\nprivacy regulations. Recent advances in domain-adaptation and prompting methods\nshowed promising results with minimal training data using lightweight masked\nlanguage models, which are suited for well-established interpretability\nmethods. We are the first to present a systematic evaluation of these methods in a\nlow-resource setting, by performing multi-class section classification on\nGerman doctor's letters. We conduct extensive class-wise evaluations supported\nby Shapley values, to validate the quality of our small training data set and\nto ensure the interpretability of model predictions. We demonstrate that a\nlightweight, domain-adapted pretrained model, prompted with just 20 shots,\noutperforms a traditional classification model by 30.5% accuracy. Our results\nserve as a process-oriented guideline for clinical information extraction\nprojects working with low-resource languages."},{"date":"2024-03","title":"Automatic Information Extraction From Employment Tribunal Judgements Using Large Language Models","author":"Joana Ribeiro de Faria, Huiyuan Xie, and Felix Steffek","link":"http://arxiv.org/abs/2403.12936v1","abstract":"Court transcripts and judgments are rich repositories of legal knowledge,\ndetailing the intricacies of cases and the rationale behind judicial decisions.\nThe extraction of key information from these documents provides a concise\noverview of a case, crucial for both legal experts and the public. With the\nadvent of large language models (LLMs), automatic information extraction has\nbecome increasingly feasible and efficient. This paper presents a comprehensive\nstudy on the application of GPT-4, a large language model, for automatic\ninformation extraction from UK Employment Tribunal (UKET) cases. We\nmeticulously evaluated GPT-4's performance in extracting critical information\nwith a manual verification process to ensure the accuracy and relevance of the\nextracted data. Our research is structured around two primary extraction tasks:\nthe first involves a general extraction of eight key aspects that hold\nsignificance for both legal specialists and the general public, including the\nfacts of the case, the claims made, references to legal statutes, references to\nprecedents, general case outcomes and corresponding labels, detailed order and\nremedies and reasons for the decision. The second task is more focused, aimed\nat analysing three of those extracted features, namely facts, claims and\noutcomes, in order to facilitate the development of a tool capable of\npredicting the outcome of employment law disputes.
Through our analysis, we\ndemonstrate that LLMs like GPT-4 can achieve high accuracy in legal information\nextraction, highlighting the potential of LLMs in revolutionising the way legal\ninformation is processed and utilised, offering significant implications for\nlegal research and practice."},{"date":"2024-03","title":"Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales","author":"Ayushi Nirmal, Amrita Bhattacharjee, Paras Sheth, and Huan Liu","link":"http://arxiv.org/abs/2403.12403v2","abstract":"Although social media platforms are a prominent arena for users to engage in\ninterpersonal discussions and express opinions, the facade and anonymity\noffered by social media may allow users to spew hate speech and offensive\ncontent. Given the massive scale of such platforms, there arises a need to\nautomatically identify and flag instances of hate speech. Although several hate\nspeech detection methods exist, most of these black-box methods are not\ninterpretable or explainable by design. To address the lack of\ninterpretability, in this paper, we propose to use state-of-the-art Large\nLanguage Models (LLMs) to extract features in the form of rationales from the\ninput text, to train a base hate speech classifier, thereby enabling faithful\ninterpretability by design. Our framework effectively combines the textual\nunderstanding capabilities of LLMs and the discriminative power of\nstate-of-the-art hate speech classifiers to make these classifiers faithfully\ninterpretable. Our comprehensive evaluation on a variety of English language\nsocial media hate speech datasets demonstrates: (1) the goodness of the\nLLM-extracted rationales, and (2) the surprising retention of detector\nperformance even after training to ensure interpretability. All code and data\nwill be made available at https://github.com/AmritaBh/shield."},{"date":"2024-03","title":"Leveraging Large Language Models to Extract Information on Substance Use Disorder Severity from Clinical Notes: A Zero-shot Learning Approach","author":"Maria Mahbub, Gregory M. Dams, Sudarshan Srinivasan, Caitlin Rizy, Ioana Danciu, Jodie Trafton, and Kathryn Knight","link":"http://arxiv.org/abs/2403.12297v1","abstract":"Substance use disorder (SUD) poses a major concern due to its detrimental\neffects on health and society. SUD identification and treatment depend on a\nvariety of factors such as severity, co-determinants (e.g., withdrawal\nsymptoms), and social determinants of health. Existing diagnostic coding\nsystems used by American insurance providers, like the International\nClassification of Diseases (ICD-10), lack granularity for certain diagnoses,\nbut clinicians will add this granularity (as that found within the Diagnostic\nand Statistical Manual of Mental Disorders classification or DSM-5) as\nsupplemental unstructured text in clinical notes. Traditional natural language\nprocessing (NLP) methods face limitations in accurately parsing such diverse\nclinical language. Large Language Models (LLMs) offer promise in overcoming\nthese challenges by adapting to diverse language patterns. This study\ninvestigates the application of LLMs for extracting severity-related\ninformation for various SUD diagnoses from clinical notes. We propose a\nworkflow employing zero-shot learning of LLMs with carefully crafted prompts\nand post-processing techniques. Through experimentation with Flan-T5, an\nopen-source LLM, we demonstrate its superior recall compared to the rule-based\napproach.
Focusing on 11 categories of SUD diagnoses, we show the effectiveness\nof LLMs in extracting severity information, contributing to improved risk\nassessment and treatment planning for SUD patients."},{"date":"2024-03","title":"Towards Model Extraction Attacks in GAN-Based Image Translation via Domain Shift Mitigation","author":"Di Mi, Yanjun Zhang, Leo Yu Zhang, Shengshan Hu, Qi Zhong, Haizhuan Yuan, and Shirui Pan","link":"http://arxiv.org/abs/2403.07673v3","abstract":"Model extraction attacks (MEAs) enable an attacker to replicate the\nfunctionality of a victim deep neural network (DNN) model by only querying its\nAPI service remotely, posing a severe threat to the security and integrity of\npay-per-query DNN-based services. Although the majority of current research on\nMEAs has primarily concentrated on neural classifiers, there is a growing\nprevalence of image-to-image translation (I2IT) tasks in our everyday\nactivities. However, techniques developed for MEA of DNN classifiers cannot be\ndirectly transferred to the case of I2IT, rendering the vulnerability of I2IT\nmodels to MEA attacks often underestimated. This paper unveils the threat of\nMEA in I2IT tasks from a new perspective. Diverging from the traditional\napproach of bridging the distribution gap between attacker queries and victim\ntraining samples, we opt to mitigate the effect caused by the different\ndistributions, known as the domain shift. This is achieved by introducing a new\nregularization term that penalizes high-frequency noise, and seeking a flatter\nminimum to avoid overfitting to the shifted distribution. Extensive experiments\non different image translation tasks, including image super-resolution and\nstyle transfer, are performed on different backbone victim models, and the new\ndesign consistently outperforms the baseline by a large margin across all\nmetrics. A few real-life I2IT APIs are also verified to be extremely vulnerable\nto our attack, emphasizing the need for enhanced defenses and potentially\nrevised API publishing policies."},{"date":"2024-03","title":"RSBuilding: Towards General Remote Sensing Image Building Extraction and Change Detection with Foundation Model","author":"Mingze Wang, Lili Su, Cilin Yan, Sheng Xu, Pengcheng Yuan, Xiaolong Jiang, and Baochang Zhang","link":"http://arxiv.org/abs/2403.07564v2","abstract":"The intelligent interpretation of buildings plays a significant role in urban\nplanning and management, macroeconomic analysis, population dynamics, etc.\nRemote sensing image building interpretation primarily encompasses building\nextraction and change detection. However, current methodologies often treat\nthese two tasks as separate entities, thereby failing to leverage shared\nknowledge. Moreover, the complexity and diversity of remote sensing image\nscenes pose additional challenges, as most algorithms are designed to model\nindividual small datasets, thus lacking cross-scene generalization. In this\npaper, we propose a comprehensive remote sensing image building understanding\nmodel, termed RSBuilding, developed from the perspective of the foundation\nmodel. RSBuilding is designed to enhance cross-scene generalization and task\nuniversality. Specifically, we extract image features based on the prior\nknowledge of the foundation model and devise a multi-level feature sampler to\naugment scale information. 
To unify task representation and integrate image\nspatiotemporal clues, we introduce a cross-attention decoder with task prompts.\nAddressing the current shortage of datasets that incorporate annotations for\nboth tasks, we have developed a federated training strategy to facilitate\nsmooth model convergence even when supervision for some tasks is missing,\nthereby bolstering the complementarity of different tasks. Our model was\ntrained on a dataset comprising up to 245,000 images and validated on multiple\nbuilding extraction and change detection datasets. The experimental results\nsubstantiate that RSBuilding can concurrently handle two structurally distinct\ntasks and exhibits robust zero-shot generalization capabilities."},{"date":"2024-03","title":"A Semantic Mention Graph Augmented Model for Document-Level Event Argument Extraction","author":"Jian Zhang, Changlin Yang, Haiping Zhu, Qika Lin, Fangzhi Xu, and Jun Liu","link":"http://arxiv.org/abs/2403.09721v1","abstract":"Document-level Event Argument Extraction (DEAE) aims to identify arguments\nand their specific roles from an unstructured document. The advanced approaches\non DEAE utilize prompt-based methods to guide pre-trained language models\n(PLMs) in extracting arguments from input documents. They mainly concentrate on\nestablishing relations between triggers and entity mentions within documents,\nleaving two unresolved problems: a) independent modeling of entity mentions; b)\ndocument-prompt isolation. To this end, we propose a semantic mention Graph\nAugmented Model (GAM) to address these two problems in this paper. Firstly, GAM\nconstructs a semantic mention graph that captures relations within and between\ndocuments and prompts, encompassing co-existence, co-reference and co-type\nrelations. Furthermore, we introduce an ensembled graph transformer module to\naddress mentions and their three semantic relations effectively. Later, the\ngraph-augmented encoder-decoder module incorporates the relation-specific graph\ninto the input embedding of PLMs and optimizes the encoder section with\ntopology information, enhancing the relations comprehensively. Extensive\nexperiments on the RAMS and WikiEvents datasets demonstrate the effectiveness\nof our approach, surpassing baseline methods and achieving a new\nstate-of-the-art performance."},{"date":"2024-03","title":"Stealing Part of a Production Language Model","author":"Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Itay Yona, Eric Wallace, David Rolnick, and Florian Tram\u00e8r","link":"http://arxiv.org/abs/2403.06634v2","abstract":"We introduce the first model-stealing attack that extracts precise,\nnontrivial information from black-box production language models like OpenAI's\nChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding\nprojection layer (up to symmetries) of a transformer model, given typical API\naccess. For under \\$20 USD, our attack extracts the entire projection matrix of\nOpenAI's Ada and Babbage language models. We thereby confirm, for the first\ntime, that these black-box models have a hidden dimension of 1024 and 2048,\nrespectively. We also recover the exact hidden dimension size of the\ngpt-3.5-turbo model, and estimate it would cost under $2,000 in queries to\nrecover the entire projection matrix. 
We conclude with potential defenses and\nmitigations, and discuss the implications of possible future work that could\nextend our attack."},{"date":"2024-03","title":"Adversarial Sparse Teacher: Defense Against Distillation-Based Model Stealing Attacks Using Adversarial Examples","author":"Eda Yilmaz, and Hacer Yalim Keles","link":"http://arxiv.org/abs/2403.05181v2","abstract":"We introduce Adversarial Sparse Teacher (AST), a robust defense method\nagainst distillation-based model stealing attacks. Our approach trains a\nteacher model using adversarial examples to produce sparse logit responses and\nincrease the entropy of the output distribution. Typically, a model generates a\npeak in its output corresponding to its prediction. By leveraging adversarial\nexamples, AST modifies the teacher model's original response, embedding a few\naltered logits into the output while keeping the primary response slightly\nhigher. Concurrently, all remaining logits are elevated to further increase the\noutput distribution's entropy. All these complex manipulations are performed\nusing an optimization function with our proposed Exponential Predictive\nDivergence (EPD) loss function. EPD allows us to maintain higher entropy levels\ncompared to traditional KL divergence, effectively confusing attackers.\nExperiments on CIFAR-10 and CIFAR-100 datasets demonstrate that AST outperforms\nstate-of-the-art methods, providing effective defense against model stealing\nwhile preserving high accuracy. The source codes will be made publicly\navailable here soon."},{"date":"2024-03","title":"ChatUIE: Exploring Chat-based Unified Information Extraction using Large Language Models","author":"Jun Xu, Mengshu Sun, Zhiqiang Zhang, and Jun Zhou","link":"http://arxiv.org/abs/2403.05132v1","abstract":"Recent advancements in large language models have shown impressive\nperformance in general chat. However, their domain-specific capabilities,\nparticularly in information extraction, have certain limitations. Extracting\nstructured information from natural language that deviates from known schemas\nor instructions has proven challenging for previous prompt-based methods. This\nmotivated us to explore domain-specific modeling in chat-based language models\nas a solution for extracting structured information from natural language. In\nthis paper, we present ChatUIE, an innovative unified information extraction\nframework built upon ChatGLM. Simultaneously, reinforcement learning is\nemployed to improve and align various tasks that involve confusing and limited\nsamples. Furthermore, we integrate generation constraints to address the issue\nof generating elements that are not present in the input. Our experimental\nresults demonstrate that ChatUIE can significantly improve the performance of\ninformation extraction with a slight decrease in chatting ability."},{"date":"2024-03","title":"Precise Extraction of Deep Learning Models via Side-Channel Attacks on Edge/Endpoint Devices","author":"Younghan Lee, Sohee Jun, Yungi Cho, Woorim Han, Hyungon Moon, and Yunheung Paek","link":"http://arxiv.org/abs/2403.02870v1","abstract":"With growing popularity, deep learning (DL) models are becoming larger-scale,\nand only the companies with vast training datasets and immense computing power\ncan manage their business serving such large models. 
Most of those DL models\nare proprietary to the companies, which thus strive to keep their private models\nsafe from the model extraction attack (MEA), whose aim is to steal the model by\ntraining surrogate models. Nowadays, companies are inclined to offload the\nmodels from central servers to edge/endpoint devices. As revealed in the latest\nstudies, adversaries exploit this opportunity as new attack vectors to launch a\nside-channel attack (SCA) on the device running the victim model and obtain various\npieces of the model information, such as the model architecture (MA) and image\ndimension (ID). Our work provides a comprehensive understanding of the\nrelationship between such SCA-exposed information and MEA for the first time\nand would benefit future MEA studies on both the\noffensive and defensive sides in that they may learn which pieces of\ninformation exposed by SCA are more important than others. Our analysis\nadditionally reveals that by grasping the victim model information from SCA,\nMEA can become highly effective and successful even without any prior knowledge of\nthe model. Finally, to evince the practicality of our analysis results, we\nempirically apply SCA and subsequently carry out MEA under realistic threat\nassumptions. The results show up to 5.8 times better performance than when the\nadversary has no model information about the victim model."},{"date":"2024-03","title":"Towards Intent-Based Network Management: Large Language Models for Intent Extraction in 5G Core Networks","author":"Dimitrios Michael Manias, Ali Chouman, and Abdallah Shami","link":"http://arxiv.org/abs/2403.02238v2","abstract":"The integration of Machine Learning and Artificial Intelligence (ML/AI) into\nfifth-generation (5G) networks has made evident the limitations of network\nintelligence with ever-increasing, strenuous requirements for current and\nnext-generation devices. This transition to ubiquitous intelligence demands\nhigh connectivity, synchronicity, and end-to-end communication between users\nand network operators, and will pave the way towards full network automation\nwithout human intervention. Intent-based networking is a key factor in the\nreduction of human actions, roles, and responsibilities while shifting towards\nnovel extraction and interpretation of automated network management. This paper\npresents the development of a custom Large Language Model (LLM) for 5G and\nnext-generation intent-based networking and provides insights into future LLM\ndevelopments and integrations to realize end-to-end intent-based networking for\nfully automated network intelligence."},{"date":"2024-03","title":"Large Language Models for Simultaneous Named Entity Extraction and Spelling Correction","author":"Edward Whittaker, and Ikuo Kitagishi","link":"http://arxiv.org/abs/2403.00528v1","abstract":"Language Models (LMs) such as BERT have been shown to perform well on the\ntask of identifying Named Entities (NE) in text.
A BERT LM is typically used as\na classifier to classify individual tokens in the input text, or to classify\nspans of tokens, as belonging to one of a set of possible NE categories.\n In this paper, we hypothesise that decoder-only Large Language Models (LLMs)\ncan also be used generatively to extract both the NE, as well as potentially\nrecover the correct surface form of the NE, where any spelling errors that were\npresent in the input text get automatically corrected.\n We fine-tune two BERT LMs as baselines, as well as eight open-source LLMs, on\nthe task of producing NEs from text that was obtained by applying Optical\nCharacter Recognition (OCR) to images of Japanese shop receipts; in this work,\nwe do not attempt to find or evaluate the location of NEs in the text.\n We show that the best fine-tuned LLM performs as well as, or slightly better\nthan, the best fine-tuned BERT LM, although the differences are not\nsignificant. However, the best LLM is also shown to correct OCR errors in some\ncases, as initially hypothesised."},{"date":"2024-03","title":"Teach LLMs to Phish: Stealing Private Information from Language Models","author":"Ashwinee Panda, Christopher A. Choquette-Choo, Zhengming Zhang, Yaoqing Yang, and Prateek Mittal","link":"http://arxiv.org/abs/2403.00871v1","abstract":"When large language models are trained on private data, it can be a\nsignificant privacy risk for them to memorize and regurgitate sensitive\ninformation. In this work, we propose a new practical data extraction attack\nthat we call \"neural phishing\". This attack enables an adversary to target and\nextract sensitive or personally identifiable information (PII), e.g., credit\ncard numbers, from a model trained on user data with upwards of 10% attack\nsuccess rates, at times, as high as 50%. Our attack assumes only that an\nadversary can insert as few as 10s of benign-appearing sentences into the\ntraining dataset using only vague priors on the structure of the user data."},{"date":"2024-02","title":"LLM-Ensemble: Optimal Large Language Model Ensemble Method for E-commerce Product Attribute Value Extraction","author":"Chenhao Fang, Xiaohan Li, Zezhong Fan, Jianpeng Xu, Kaushiki Nag, Evren Korpeoglu, Sushant Kumar, and Kannan Achan","link":"http://arxiv.org/abs/2403.00863v2","abstract":"Product attribute value extraction is a pivotal component in Natural Language\nProcessing (NLP) and the contemporary e-commerce industry. The provision of\nprecise product attribute values is fundamental in ensuring high-quality\nrecommendations and enhancing customer satisfaction. The recently emerging\nLarge Language Models (LLMs) have demonstrated state-of-the-art performance in\nnumerous attribute extraction tasks, without the need for domain-specific\ntraining data. Nevertheless, varying strengths and weaknesses are exhibited by\ndifferent LLMs due to the diversity in data, architectures, and\nhyperparameters. This variation makes them complementary to each other, with no\nsingle LLM dominating all others. Considering the diverse strengths and\nweaknesses of LLMs, it becomes necessary to develop an ensemble method that\nleverages their complementary potentials. In this paper, we propose a novel\nalgorithm called LLM-ensemble to ensemble different LLMs' outputs for attribute\nvalue extraction. We iteratively learn the weights for different LLMs to\naggregate the labels with weights to predict the final attribute value. 
Not\nonly can our proposed method be proven theoretically optimal, but it also\nensures efficient computation, fast convergence, and safe deployment. We have\nalso conducted extensive experiments with various state-of-the-art LLMs,\nincluding Llama2-13B, Llama2-70B, PaLM-2, GPT-3.5, and GPT-4, on Walmart's\ninternal data. Our offline metrics demonstrate that the LLM-ensemble method\noutperforms all the state-of-the-art single LLMs on Walmart's internal dataset.\nThis method has been launched in several production models, leading to improved\nGross Merchandise Volume (GMV), Click-Through Rate (CTR), Conversion Rate\n(CVR), and Add-to-Cart Rate (ATC)."},{"date":"2024-02","title":"Watermark Stealing in Large Language Models","author":"Nikola Jovanovi\u0107, Robin Staab, and Martin Vechev","link":"http://arxiv.org/abs/2402.19361v2","abstract":"LLM watermarking has attracted attention as a promising way to detect\nAI-generated content, with some works suggesting that current schemes may\nalready be fit for deployment. In this work we dispute this claim, identifying\nwatermark stealing (WS) as a fundamental vulnerability of these schemes. We\nshow that querying the API of the watermarked LLM to approximately\nreverse-engineer a watermark enables practical spoofing attacks, as\nhypothesized in prior work, but also greatly boosts scrubbing attacks, which\nwas previously unnoticed. We are the first to propose an automated WS algorithm\nand use it in the first comprehensive study of spoofing and scrubbing in\nrealistic settings. We show that for under $50 an attacker can both spoof and\nscrub state-of-the-art schemes previously considered safe, with average success\nrate of over 80%. Our findings challenge common beliefs about LLM watermarking,\nstressing the need for more robust schemes. We make all our code and additional\nexamples available at https://watermark-stealing.org."},{"date":"2024-02","title":"PRSA: PRompt Stealing Attacks against Large Language Models","author":"Yong Yang, Changjiang Li, Yi Jiang, Xi Chen, Haoyu Wang, Xuhong Zhang, Zonghui Wang, and Shouling Ji","link":"http://arxiv.org/abs/2402.19200v2","abstract":"In recent years, \"prompt as a service\" has greatly enhanced the utility of\nlarge language models (LLMs) by enabling them to perform various downstream\ntasks efficiently without fine-tuning. This has also increased the commercial\nvalue of prompts. However, the potential risk of leakage in these\ncommercialized prompts remains largely underexplored. In this paper, we\nintroduce a novel attack framework, PRSA, designed for prompt stealing attacks\nagainst LLMs. The main idea of PRSA is to infer the intent behind a prompt by\nanalyzing its input-output content, enabling the generation of a surrogate\nprompt that replicates the original's functionality. Specifically, PRSA mainly\nconsists of two key phases: prompt mutation and prompt pruning. In the mutation\nphase, we propose a prompt attention algorithm based on output difference. The\nalgorithm facilitates the generation of effective surrogate prompts by learning\nkey factors that influence the accurate inference of prompt intent. During the\npruning phase, we employ a two-step related word identification strategy to\ndetect and mask words that are highly related to the input, thus improving the\ngeneralizability of the surrogate prompts. We verify the actual threat of PRSA\nthrough evaluation in both real-world settings, non-interactive and interactive\nprompt services. 
The results strongly confirm PRSA's effectiveness and\ngeneralizability. We have reported these findings to prompt service providers\nand actively collaborate with them to implement defensive measures."},{"date":"2024-02","title":"Enhancing Steganographic Text Extraction: Evaluating the Impact of NLP Models on Accuracy and Semantic Coherence","author":"Mingyang Li, Maoqin Yuan, Luyao Li, and Han Pengsihua","link":"http://arxiv.org/abs/2402.18849v1","abstract":"This study discusses a new method combining image steganography technology\nwith Natural Language Processing (NLP) large models, aimed at improving the\naccuracy and robustness of extracting steganographic text. Traditional Least\nSignificant Bit (LSB) steganography techniques face challenges in accuracy and\nrobustness of information extraction when dealing with complex character\nencoding, such as Chinese characters. To address this issue, this study\nproposes an innovative LSB-NLP hybrid framework. This framework integrates the\nadvanced capabilities of NLP large models, such as error detection, correction,\nand semantic consistency analysis, as well as information reconstruction\ntechniques, thereby significantly enhancing the robustness of steganographic\ntext extraction. Experimental results show that the LSB-NLP hybrid framework\nexcels in improving the extraction accuracy of steganographic text, especially\nin handling Chinese characters. The findings of this study not only confirm the\neffectiveness of combining image steganography technology and NLP large models\nbut also propose new ideas for research and application in the field of\ninformation hiding. The successful implementation of this interdisciplinary\napproach demonstrates the great potential of integrating image steganography\ntechnology with natural language processing technology in solving complex\ninformation processing problems."},{"date":"2024-02","title":"Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction","author":"Koki Maeda, Shuhei Kurita, Taiki Miyanishi, and Naoaki Okazaki","link":"http://arxiv.org/abs/2402.17969v1","abstract":"Given the accelerating progress of vision and language modeling, accurate\nevaluation of machine-generated image captions remains critical. In order to\nevaluate captions in closer alignment with human preferences, metrics need to\ndiscriminate between captions of varying quality and content. However,\nconventional metrics fall short of comparing beyond superficial matches of\nwords or embedding similarities; thus, they still need improvement. This paper\npresents VisCE$^2$, a vision language model-based caption evaluation method.\nOur method focuses on visual context, which refers to the detailed content of\nimages, including objects, attributes, and relationships. By extracting and\norganizing them into a structured format, we replace the human-written\nreferences with visual contexts and help VLMs better understand the image,\nenhancing evaluation performance. Through meta-evaluation on multiple datasets,\nwe validated that VisCE$^2$ outperforms the conventional pre-trained metrics in\ncapturing caption quality and demonstrates superior consistency with human\njudgment."},{"date":"2024-02","title":"Pandora's White-Box: Precise Training Data Detection and Extraction in Large Language Models","author":"Jeffrey G. 
Wang, Jason Wang, Marvin Li, and Seth Neel","link":"http://arxiv.org/abs/2402.17012v4","abstract":"In this paper we develop state-of-the-art privacy attacks against Large\nLanguage Models (LLMs), where an adversary with some access to the model tries\nto learn something about the underlying training data. Our headline results are\nnew membership inference attacks (MIAs) against pretrained LLMs that perform\nhundreds of times better than baseline attacks, and a pipeline showing that\nover 50% (!) of the fine-tuning dataset can be extracted from a fine-tuned LLM\nin natural settings. We consider varying degrees of access to the underlying\nmodel, pretraining and fine-tuning data, and both MIAs and training data\nextraction. For pretraining data, we propose two new MIAs: a supervised neural\nnetwork classifier that predicts training data membership on the basis of\n(dimensionality-reduced) model gradients, as well as a variant of this attack\nthat only requires logit access to the model by leveraging recent\nmodel-stealing work on LLMs. To our knowledge this is the first MIA that\nexplicitly incorporates model-stealing information. Both attacks outperform\nexisting black-box baselines, and our supervised attack closes the gap between\nMIA attack success against LLMs and the strongest known attacks for other\nmachine learning models. In fine-tuning, we find that a simple attack based on\nthe ratio of the loss between the base and fine-tuned models is able to achieve\nnear-perfect MIA performance; we then leverage our MIA to extract a large\nfraction of the fine-tuning dataset from fine-tuned Pythia and Llama models.\nOur code is available at github.com/safr-ai-lab/pandora-llm."},{"date":"2024-02","title":"IPED: An Implicit Perspective for Relational Triple Extraction based on Diffusion Model","author":"Jianli Zhao, Changhao Xu, and Bin Jiang","link":"http://arxiv.org/abs/2403.00808v1","abstract":"Relational triple extraction is a fundamental task in the field of\ninformation extraction, and a promising framework based on table filling has\nrecently gained attention as a potential baseline for entity relation\nextraction. However, inherent shortcomings such as redundant information and\nincomplete triple recognition remain problematic. To address these challenges,\nwe propose an Implicit Perspective for relational triple Extraction based on\nDiffusion model (IPED), an innovative approach for extracting relational\ntriples. Our classifier-free solution adopts an implicit strategy using block\ncoverage to complete the tables, avoiding the limitations of explicit tagging\nmethods. Additionally, we introduce a generative model structure, the\nblock-denoising diffusion model, to collaborate with our implicit perspective\nand effectively circumvent redundant information disruptions. Experimental\nresults on two popular datasets demonstrate that IPED achieves state-of-the-art\nperformance while gaining superior inference speed and low computational\ncomplexity. To support future research, we have made our source code publicly\navailable online."},{"date":"2024-02","title":"Prompt Stealing Attacks Against Large Language Models","author":"Zeyang Sha, and Yang Zhang","link":"http://arxiv.org/abs/2402.12959v1","abstract":"The increasing reliance on large language models (LLMs) such as ChatGPT in\nvarious fields emphasizes the importance of ``prompt engineering,'' a\ntechnology to improve the quality of model outputs. 
With companies investing\nsignificantly in expert prompt engineers and educational resources rising to\nmeet market demand, designing high-quality prompts has become an intriguing\nchallenge. In this paper, we propose a novel attack against LLMs, named prompt\nstealing attacks. Our proposed prompt stealing attack aims to steal these\nwell-designed prompts based on the generated answers. The prompt stealing\nattack contains two primary modules: the parameter extractor and the prompt\nreconstructor. The goal of the parameter extractor is to figure out the\nproperties of the original prompts. We first observe that most prompts fall\ninto one of three categories: direct prompt, role-based prompt, and in-context\nprompt. Our parameter extractor first tries to distinguish the type of prompts\nbased on the generated answers. Then, it can further predict which role or how\nmany contexts are used based on the types of prompts. Following the parameter\nextractor, the prompt reconstructor can be used to reconstruct the original\nprompts based on the generated answers and the extracted features. The final\ngoal of the prompt reconstructor is to generate the reversed prompts, which are\nsimilar to the original prompts. Our experimental results show the remarkable\nperformance of our proposed attacks. Our proposed attacks add a new dimension\nto the study of prompt engineering and call for more attention to the security\nissues on LLMs."},{"date":"2024-02","title":"Stealing the Invisible: Unveiling Pre-Trained CNN Models through Adversarial Examples and Timing Side-Channels","author":"Shubhi Shukla, Manaar Alam, Pabitra Mitra, and Debdeep Mukhopadhyay","link":"http://arxiv.org/abs/2402.11953v1","abstract":"Machine learning, with its myriad applications, has become an integral\ncomponent of numerous technological systems. A common practice in this domain\nis the use of transfer learning, where a pre-trained model's architecture,\nreadily available to the public, is fine-tuned to suit specific tasks. As\nMachine Learning as a Service (MLaaS) platforms increasingly use pre-trained\nmodels in their backends, it's crucial to safeguard these architectures and\nunderstand their vulnerabilities. In this work, we present an approach based on\nthe observation that the classification patterns of adversarial images can be\nused as a means to steal the models. Furthermore, the adversarial image\nclassifications in conjunction with timing side channels can lead to a model\nstealing method. Our approach, designed for typical user-level access in remote\nMLaaS environments, exploits varying misclassifications of adversarial images\nacross different models to fingerprint several renowned Convolutional Neural\nNetwork (CNN) and Vision Transformer (ViT) architectures. We utilize the\nprofiling of remote model inference times to reduce the necessary adversarial\nimages, subsequently decreasing the number of queries required. We have\npresented our results over 27 pre-trained models of different CNN and ViT\narchitectures using the CIFAR-10 dataset and demonstrate a high accuracy of 88.8%\nwhile keeping the query budget under 20."},{"date":"2024-02","title":"Evaluating Efficacy of Model Stealing Attacks and Defenses on Quantum Neural Networks","author":"Satwik Kundu, Debarshi Kundu, and Swaroop Ghosh","link":"http://arxiv.org/abs/2402.11687v1","abstract":"Cloud hosting of quantum machine learning (QML) models exposes them to a\nrange of vulnerabilities, the most significant of which is the model stealing\nattack.
In this study, we assess the efficacy of such attacks in the realm of\nquantum computing. We conducted comprehensive experiments on various datasets\nwith multiple QML model architectures. Our findings revealed that model\nstealing attacks can produce clone models achieving up to $0.9\times$ and\n$0.99\times$ clone test accuracy when trained using Top-$1$ and Top-$k$ labels,\nrespectively ($k:$ num\_classes). To defend against these attacks, we leverage\nthe unique properties of current noisy hardware and perturb the victim model\noutputs to hinder the attacker's training process. In particular, we propose:\n1) hardware variation-induced perturbation (HVIP) and 2) hardware and\narchitecture variation-induced perturbation (HAVIP). Although noise and\narchitectural variability can provide up to $\sim16\%$ output obfuscation, our\ncomprehensive analysis revealed that models cloned under noisy conditions tend\nto be resilient, suffering little to no performance degradation due to such\nobfuscations. Despite limited success with our defense techniques, this outcome\nhas led to an important discovery: QML models trained on noisy hardware are\nnaturally resistant to perturbation or obfuscation-based defenses or attacks."},{"date":"2024-02","title":"GenRES: Rethinking Evaluation for Generative Relation Extraction in the Era of Large Language Models","author":"Pengcheng Jiang, Jiacheng Lin, Zifeng Wang, Jimeng Sun, and Jiawei Han","link":"http://arxiv.org/abs/2402.10744v1","abstract":"The field of relation extraction (RE) is experiencing a notable shift towards\ngenerative relation extraction (GRE), leveraging the capabilities of large\nlanguage models (LLMs). However, we discovered that traditional relation\nextraction (RE) metrics like precision and recall fall short in evaluating GRE\nmethods. This shortfall arises because these metrics rely on exact matching\nwith human-annotated reference relations, while GRE methods often produce\ndiverse and semantically accurate relations that differ from the references. To\nfill this gap, we introduce GenRES for a multi-dimensional assessment in terms\nof the topic similarity, uniqueness, granularity, factualness, and completeness\nof the GRE results. With GenRES, we empirically identified that (1)\nprecision/recall fails to justify the performance of GRE methods; (2)\nhuman-annotated referential relations can be incomplete; (3) prompting LLMs\nwith a fixed set of relations or entities can cause hallucinations. Next, we\nconducted a human evaluation of GRE methods that shows GenRES is consistent\nwith human preferences for RE quality. Last, we made a comprehensive evaluation\nof fourteen leading LLMs using GenRES across document, bag, and sentence level\nRE datasets, respectively, to set the benchmark for future research in GRE."},{"date":"2024-02","title":"Where is the answer? Investigating Positional Bias in Language Model Knowledge Extraction","author":"Kuniaki Saito, Kihyuk Sohn, Chen-Yu Lee, and Yoshitaka Ushiku","link":"http://arxiv.org/abs/2402.12170v2","abstract":"Large language models require updates to remain up-to-date or adapt to new\ndomains by fine-tuning them with new documents. One key is memorizing the\nlatest information in a way that the memorized information is extractable with\na query prompt. However, LLMs suffer from a phenomenon called the perplexity curse;\ndespite minimizing document perplexity during fine-tuning, LLMs struggle to\nextract information through a prompt sentence.
In this new knowledge\nacquisition and extraction, we find a very intriguing fact that LLMs can\naccurately answer questions about the first sentence, but they struggle to\nextract information described in the middle or end of the documents used for\nfine-tuning. Our study suggests that the auto-regressive training causes this\nissue; each token is prompted by reliance on all previous tokens, which hinders\nthe model from recalling information from training documents by question\nprompts. To conduct the in-depth study, we publish both synthetic and real\ndatasets, enabling the evaluation of the QA performance w.r.t. the position of\nthe corresponding answer in a document. Our investigation shows that even a\nlarge model suffers from the perplexity curse, but regularization such as\ndenoising auto-regressive loss can enhance the information extraction from\ndiverse positions. These findings will be (i) a key to improving knowledge\nextraction from LLMs and (ii) new elements to discuss the trade-off between RAG\nand fine-tuning in adapting LLMs to a new domain."},{"date":"2024-02","title":"Learning to Extract Structured Entities Using Language Models","author":"Haolun Wu, Ye Yuan, Liana Mikaelyan, Alexander Meulemans, Xue Liu, James Hensman, and Bhaskar Mitra","link":"http://arxiv.org/abs/2402.04437v5","abstract":"Recent advances in machine learning have significantly impacted the field of\ninformation extraction, with Language Models (LMs) playing a pivotal role in\nextracting structured information from unstructured text. Prior works typically\nrepresent information extraction as triplet-centric and use classical metrics\nsuch as precision and recall for evaluation. We reformulate the task to be\nentity-centric, enabling the use of diverse metrics that can provide more\ninsights from various perspectives. We contribute to the field by introducing\nStructured Entity Extraction and proposing the Approximate Entity Set OverlaP\n(AESOP) metric, designed to appropriately assess model performance. Later, we\nintroduce a new Multistage Structured Entity Extraction (MuSEE) model that\nharnesses the power of LMs for enhanced effectiveness and efficiency by\ndecomposing the extraction task into multiple stages. Quantitative and human\nside-by-side evaluations confirm that our model outperforms baselines, offering\npromising directions for future advancements in structured entity extraction.\nOur source code and datasets are available at\nhttps://github.com/microsoft/Structured-Entity-Extraction."},{"date":"2024-01","title":"Contextual Feature Extraction Hierarchies Converge in Large Language Models and the Brain","author":"Gavin Mischler, Yinghao Aaron Li, Stephan Bickel, Ashesh D. Mehta, and Nima Mesgarani","link":"http://arxiv.org/abs/2401.17671v1","abstract":"Recent advancements in artificial intelligence have sparked interest in the\nparallels between large language models (LLMs) and human neural processing,\nparticularly in language comprehension. While prior research has established\nsimilarities in the representation of LLMs and the brain, the underlying\ncomputational principles that cause this convergence, especially in the context\nof evolving LLMs, remain elusive. Here, we examined a diverse selection of\nhigh-performance LLMs with similar parameter sizes to investigate the factors\ncontributing to their alignment with the brain's language processing\nmechanisms. 
We find that as LLMs achieve higher performance on benchmark tasks,\nthey not only become more brain-like as measured by higher performance when\npredicting neural responses from LLM embeddings, but also their hierarchical\nfeature extraction pathways map more closely onto the brain's while using fewer\nlayers to do the same encoding. We also compare the feature extraction pathways\nof the LLMs to each other and identify new ways in which high-performing models\nhave converged toward similar hierarchical processing mechanisms. Finally, we\nshow the importance of contextual information in improving model performance\nand brain similarity. Our findings reveal the converging aspects of language\nprocessing in the brain and LLMs and offer new directions for developing models\nthat align more closely with human cognitive processing."},{"date":"2024-01","title":"LaneGraph2Seq: Lane Topology Extraction with Language Model via Vertex-Edge Encoding and Connectivity Enhancement","author":"Renyuan Peng, Xinyue Cai, Hang Xu, Jiachen Lu, Feng Wen, Wei Zhang, and Li Zhang","link":"http://arxiv.org/abs/2401.17609v2","abstract":"Understanding road structures is crucial for autonomous driving. Intricate\nroad structures are often depicted using lane graphs, which include centerline\ncurves and connections forming a Directed Acyclic Graph (DAG). Accurate\nextraction of lane graphs relies on precisely estimating vertex and edge\ninformation within the DAG. Recent research highlights Transformer-based\nlanguage models' impressive sequence prediction abilities, making them\neffective for learning graph representations when graph data are encoded as\nsequences. However, existing studies focus mainly on modeling vertices\nexplicitly, leaving edge information simply embedded in the network.\nConsequently, these approaches fall short in the task of lane graph extraction.\nTo address this, we introduce LaneGraph2Seq, a novel approach for lane graph\nextraction. It leverages a language model with vertex-edge encoding and\nconnectivity enhancement. Our serialization strategy includes a vertex-centric\ndepth-first traversal and a concise edge-based partition sequence.\nAdditionally, we use classifier-free guidance combined with nucleus sampling to\nimprove lane connectivity. We validate our method on prominent datasets,\nnuScenes and Argoverse 2, showcasing consistent and compelling results. Our\nLaneGraph2Seq approach demonstrates superior performance compared to\nstate-of-the-art techniques in lane graph extraction."},{"date":"2024-01","title":"Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately","author":"Liang Zhang, Katherine Jijo, Spurthi Setty, Eden Chung, Fatima Javid, Natan Vidra, and Tommy Clifford","link":"http://arxiv.org/abs/2402.01722v1","abstract":"Large Language Models (LLMs) generate responses to questions; however, their\neffectiveness is often hindered by sub-optimal quality of answers and\noccasional failures to provide accurate responses to questions. To address\nthese challenges, a fine-tuning process is employed, involving feedback and\nexamples to refine models. The objective is to enhance AI models through\ncontinuous feedback loops, utilizing metrics such as cosine similarity, LLM\nevaluation and Rouge-L scores to evaluate the models. 
Leveraging LLMs like\nGPT-3.5, GPT4ALL, LLaMA2, and Claude, this approach is benchmarked on\nfinancial datasets, including the FinanceBench and RAG Instruct Benchmark\nTester Dataset, illustrating the necessity of fine-tuning. The results showcase\nthe capability of fine-tuned models to surpass the accuracy of zero-shot LLMs,\nproviding superior question-answering capabilities. Notably, the\ncombination of fine-tuning the LLM with a process known as Retrieval Augmented\nGeneration (RAG) proves to generate responses with improved accuracy."},{"date":"2024-01","title":"MEA-Defender: A Robust Watermark against Model Extraction Attack","author":"Peizhuo Lv, Hualong Ma, Kai Chen, Jiachen Zhou, Shengzhi Zhang, Ruigang Liang, Shenchen Zhu, Pan Li, and Yingjun Zhang","link":"http://arxiv.org/abs/2401.15239v1","abstract":"Recently, numerous highly-valuable Deep Neural Networks (DNNs) have been\ntrained using deep learning algorithms. To protect the Intellectual Property\n(IP) of the original owners over such DNN models, backdoor-based watermarks\nhave been extensively studied. However, most of such watermarks fail upon model\nextraction attack, which utilizes input samples to query the target model and\nobtains the corresponding outputs, thus training a substitute model using such\ninput-output pairs. In this paper, we propose a novel watermark to protect the IP\nof DNN models against model extraction, named MEA-Defender. In particular, we\nobtain the watermark by combining two samples from two source classes in the\ninput domain and design a watermark loss function that keeps the output domain\nof the watermark within that of the main task samples. Since both the input\ndomain and the output domain of our watermark are indispensable parts of those\nof the main task samples, the watermark will be extracted into the stolen model\nalong with the main task during model extraction. We conduct extensive\nexperiments on four model extraction attacks, using five datasets and six\nmodels trained based on supervised learning and self-supervised learning\nalgorithms. The experimental results demonstrate that MEA-Defender is highly\nrobust against different model extraction attacks and various watermark\nremoval/detection approaches."},{"date":"2024-01","title":"Extracting Process-Aware Decision Models from Object-Centric Process Data","author":"Alexandre Goossens, Johannes De Smedt, and Jan Vanthienen","link":"http://arxiv.org/abs/2401.14847v1","abstract":"Organizations execute decisions within business processes on a daily basis\nwhilst having to take into account multiple stakeholders who might require\nmultiple points of view of the same process. Moreover, the complexity of the\ninformation systems running these business processes is generally high as they\nare linked to databases storing all the relevant data and aspects of the\nprocesses. Given the presence of multiple objects within an information system\nwhich support the processes in their enactment, decisions are naturally\ninfluenced by both these perspectives, logged in object-centric process logs.\nHowever, the discovery of such decisions from object-centric process logs is\nnot straightforward as it requires correctly linking the involved objects\nwhilst considering the sequential constraints that business processes impose as\nwell as correctly discovering what a decision actually does. This paper\nproposes the first object-centric decision-mining algorithm called Integrated\nObject-centric Decision Discovery Algorithm (IODDA).
IODDA is able to discover\nhow a decision is structured as well as how a decision is made. Moreover, IODDA\nis able to discover which activities and object types are involved in the\ndecision-making process. Next, IODDA is demonstrated with the first artificial\nknowledge-intensive process logs whose log generators are provided to the\nresearch community."},{"date":"2024-01","title":"Evaluation of General Large Language Models in Contextually Assessing Semantic Concepts Extracted from Adult Critical Care Electronic Health Record Notes","author":"Darren Liu, Cheng Ding, Delgersuren Bold, Monique Bouvier, Jiaying Lu, Benjamin Shickel, Craig S. Jabaley, Wenhui Zhang, Soojin Park, Michael J. Young, Mark S. Wainwright, Gilles Clermont, Parisa Rashidi, Eric S. Rosenthal, Laurie Dimisko, Ran Xiao, Joo Heung Yoon, Carl Yang, and Xiao Hu","link":"http://arxiv.org/abs/2401.13588v1","abstract":"The field of healthcare has increasingly turned its focus towards Large\nLanguage Models (LLMs) due to their remarkable performance. However, their\nperformance in actual clinical applications has been underexplored. Traditional\nevaluations based on question-answering tasks don't fully capture the nuanced\ncontexts. This gap highlights the need for more in-depth and practical\nassessments of LLMs in real-world healthcare settings. Objective: We sought to\nevaluate the performance of LLMs in the complex clinical context of adult\ncritical care medicine using systematic and comprehensible analytic methods,\nincluding clinician annotation and adjudication. Methods: We investigated the\nperformance of three general LLMs in understanding and processing real-world\nclinical notes. Concepts from 150 clinical notes were identified by MetaMap and\nthen labeled by 9 clinicians. Each LLM's proficiency was evaluated by\nidentifying the temporality and negation of these concepts using different\nprompts for an in-depth analysis. Results: GPT-4 showed overall superior\nperformance compared to other LLMs. In contrast, both GPT-3.5 and\ntext-davinci-003 exhibit enhanced performance when the appropriate prompting\nstrategies are employed. The GPT family models have demonstrated considerable\nefficiency, evidenced by their cost-effectiveness and time-saving capabilities.\nConclusion: A comprehensive qualitative performance evaluation framework for\nLLMs is developed and operationalized. This framework goes beyond singular\nperformance aspects. With expert annotations, this methodology not only\nvalidates LLMs' capabilities in processing complex medical data but also\nestablishes a benchmark for future LLM evaluations across specialized domains."},{"date":"2024-01","title":"Large Language Models for Scientific Information Extraction: An Empirical Study for Virology","author":"Mahsa Shamsabadi, Jennifer D'Souza, and S\u00f6ren Auer","link":"http://arxiv.org/abs/2401.10040v1","abstract":"In this paper, we champion the use of structured and semantic content\nrepresentation of discourse-based scholarly communication, inspired by tools\nlike Wikipedia infoboxes or structured Amazon product descriptions. These\nrepresentations provide users with a concise overview, aiding scientists in\nnavigating the dense academic landscape. 
Our novel automated approach leverages\nthe robust text generation capabilities of LLMs to produce structured scholarly\ncontribution summaries, offering both a practical solution and insights into\nLLMs' emergent abilities.\n For LLMs, the prime focus is on improving their general intelligence as\nconversational agents. We argue that these models can also be applied\neffectively in information extraction (IE), specifically in complex IE tasks\nwithin terse domains like Science. This paradigm shift replaces the traditional\nmodular, pipelined machine learning approach with a simpler objective expressed\nthrough instructions. Our results show that finetuned FLAN-T5 with 1000x fewer\nparameters than the state-of-the-art GPT-davinci is competitive for the task."},{"date":"2024-01","title":"Code-Based English Models Surprising Performance on Chinese QA Pair Extraction Task","author":"Linghan Zheng, Hui Liu, Xiaojun Lin, Jiayuan Dong, Yue Sheng, Gang Shi, Zhiwei Liu, and Hongwei Chen","link":"http://arxiv.org/abs/2401.10286v3","abstract":"In previous studies, code-based models have consistently outperformed\ntext-based models in reasoning-intensive scenarios. When generating our\nknowledge base for Retrieval-Augmented Generation (RAG), we observed that\ncode-based models also perform exceptionally well in Chinese QA Pair Extraction\ntask. Further, our experiments and the metrics we designed discovered that\ncode-based models containing a certain amount of Chinese data achieve even\nbetter performance. Additionally, the capabilities of code-based English models\nin specified Chinese tasks offer a distinct perspective for discussion on the\nphilosophical \"Chinese Room\" thought experiment."},{"date":"2024-01","title":"MatSAM: Efficient Extraction of Microstructures of Materials via Visual Large Model","author":"Changtai Li, Xu Han, Chao Yao, and Xiaojuan Ban","link":"http://arxiv.org/abs/2401.05638v2","abstract":"Efficient and accurate extraction of microstructures in micrographs of\nmaterials is essential in process optimization and the exploration of\nstructure-property relationships. Deep learning-based image segmentation\ntechniques that rely on manual annotation are laborious and time-consuming and\nhardly meet the demand for model transferability and generalization on various\nsource images. Segment Anything Model (SAM), a large visual model with powerful\ndeep feature representation and zero-shot generalization capabilities, has\nprovided new solutions for image segmentation. In this paper, we propose\nMatSAM, a general and efficient microstructure extraction solution based on\nSAM. A simple yet effective point-based prompt generation strategy is designed,\ngrounded on the distribution and shape of microstructures. Specifically, in an\nunsupervised and training-free way, it adaptively generates prompt points for\ndifferent microscopy images, fuses the centroid points of the coarsely\nextracted region of interest (ROI) and native grid points, and integrates\ncorresponding post-processing operations for quantitative characterization of\nmicrostructures of materials. For common microstructures including grain\nboundary and multiple phases, MatSAM achieves superior zero-shot segmentation\nperformance to conventional rule-based methods and is even preferable to\nsupervised learning methods evaluated on 16 microscopy datasets whose\nmicrographs are imaged by the optical microscope (OM) and scanning electron\nmicroscope (SEM). 
In particular, on 4 public datasets, MatSAM shows unexpectedly\ncompetitive segmentation performance against their specialist models. We\nbelieve that, without the need for human labeling, MatSAM can significantly\nreduce the cost of quantitative characterization and statistical analysis of\nextensive microstructures of materials, and thus accelerate the design of new\nmaterials."},{"date":"2024-01","title":"Large Model based Sequential Keyframe Extraction for Video Summarization","author":"Kailong Tan, Yuxiang Zhou, Qianchen Xia, Rui Liu, and Yong Chen","link":"http://arxiv.org/abs/2401.04962v1","abstract":"Keyframe extraction aims to sum up a video's semantics with the minimum\nnumber of its frames. This paper puts forward a Large Model based Sequential\nKeyframe Extraction for video summarization, dubbed LMSKE, which contains three\nstages as follows. First, we use the large model \"TransNetV21\" to cut the video\ninto consecutive shots, and employ the large model \"CLIP2\" to generate each\nframe's visual feature within each shot; Second, we develop an adaptive\nclustering algorithm to yield candidate keyframes for each shot, with each\ncandidate keyframe located nearest to a cluster center; Third, we further\nreduce the above candidate keyframes via redundancy elimination within each\nshot, and finally concatenate them in accordance with the sequence of shots as\nthe final sequential keyframes. To evaluate LMSKE, we curate a benchmark\ndataset and conduct rich experiments, whose results show that LMSKE performs\nmuch better than quite a few SOTA competitors, with an average F1 of 0.5311,\nan average fidelity of 0.8141, and an average compression ratio of 0.9922."},{"date":"2024-01","title":"Segment anything model (SAM) for brain extraction in fMRI studies","author":"Dwith Chenna, and Suyash Bhogawar","link":"http://arxiv.org/abs/2401.04740v1","abstract":"Brain extraction and removal of skull artifacts from magnetic resonance\nimages (MRI) is an important preprocessing step in neuroimaging analysis. Many\ntools have been developed to handle human fMRI images, but they can involve\nmanual steps for verifying brain segmentation results, which makes them\ntime-consuming and inefficient. In this study, we use the segment anything\nmodel (SAM), a freely available neural network released by Meta[4], which has\nshown promising results in many generic segmentation applications. We\nanalyze the efficiency of SAM for neuroimaging brain segmentation by removing\nskull artifacts. The experiments showed promising results, supporting the use\nof automated segmentation algorithms for neuroimaging without the\nneed to train on a custom medical imaging dataset."},{"date":"2024-01","title":"A Span-based Model for Extracting Overlapping PICO Entities from RCT Publications","author":"Gongbo Zhang, Yiliang Zhou, Yan Hu, Hua Xu, Chunhua Weng, and Yifan Peng","link":"http://arxiv.org/abs/2401.06791v1","abstract":"Objectives Extraction of PICO (Populations, Interventions, Comparison, and\nOutcomes) entities is fundamental to evidence retrieval. We present a novel\nmethod PICOX to extract overlapping PICO entities.\n Materials and Methods PICOX first identifies entities by assessing whether a\nword marks the beginning or conclusion of an entity. Then it uses a multi-label\nclassifier to assign one or more PICO labels to a span candidate.
PICOX was\nevaluated using one of the best-performing baselines, EBM-NLP, and three more\ndatasets, i.e., PICO-Corpus, and RCT publications on Alzheimer's Disease or\nCOVID-19, using entity-level precision, recall, and F1 scores.\n Results PICOX achieved superior precision, recall, and F1 scores across the\nboard, with the micro F1 score improving from 45.05 to 50.87 (p << 0.01). On\nthe PICO-Corpus, PICOX obtained higher recall and F1 scores than the baseline\nand improved the micro recall score from 56.66 to 67.33. On the COVID-19\ndataset, PICOX also outperformed the baseline and improved the micro F1 score\nfrom 77.10 to 80.32. On the AD dataset, PICOX demonstrated comparable F1 scores\nwith higher precision when compared to the baseline.\n Conclusion PICOX excels in identifying overlapping entities and consistently\nsurpasses a leading baseline across multiple datasets. Ablation studies reveal\nthat its data augmentation strategy effectively minimizes false positives and\nimproves precision."},{"date":"2024-01","title":"Beyond Extraction: Contextualising Tabular Data for Efficient Summarisation by Language Models","author":"Uday Allu, Biddwan Ahmed, and Vishesh Tripathi","link":"http://arxiv.org/abs/2401.02333v3","abstract":"The conventional use of the Retrieval-Augmented Generation (RAG) architecture\nhas proven effective for retrieving information from diverse documents.\nHowever, challenges arise in handling complex table queries, especially within\nPDF documents containing intricate tabular structures. This research introduces\nan innovative approach to enhance the accuracy of complex table queries in\nRAG-based systems. Our methodology involves storing PDFs in the retrieval\ndatabase and extracting tabular content separately. The extracted tables\nundergo a process of context enrichment, concatenating headers with\ncorresponding values. To ensure a comprehensive understanding of the enriched\ndata, we employ a fine-tuned version of the Llama-2-chat language model for\nsummarisation within the RAG architecture. Furthermore, we augment the tabular\ndata with contextual sense using the ChatGPT 3.5 API through a one-shot prompt.\nThis enriched data is then fed into the retrieval database alongside other\nPDFs. Our approach aims to significantly improve the precision of complex table\nqueries, offering a promising solution to a longstanding challenge in\ninformation retrieval."},{"date":"2024-01","title":"Enhancing Representation in Medical Vision-Language Foundation Models via Multi-Scale Information Extraction Techniques","author":"Weijian Huang, Cheng Li, Hong-Yu Zhou, Jiarun Liu, Hao Yang, Yong Liang, Guangming Shi, Hairong Zheng, and Shanshan Wang","link":"http://arxiv.org/abs/2401.01583v2","abstract":"The development of medical vision-language foundation models has attracted\nsignificant attention in the field of medicine and healthcare due to their\npromising prospects in various clinical applications. While previous studies\nhave commonly focused on feature learning at a single learning scale,\ninvestigation into integrating multi-scale information is lacking, which may\nhinder the potential for mutual reinforcement among these features. This paper\naims to bridge this gap by proposing a method that effectively exploits\nmulti-scale information to enhance the performance of medical foundation\nmodels. The proposed method simultaneously exploits features at the local,\ninstance, modality and global aspects, facilitating comprehensive\nrepresentation learning within the models.
We evaluate the effectiveness of the\nproposed method on six open-source datasets across different clinical tasks,\ndemonstrating its ability to enhance the performance of medical foundation\nmodels."},{"date":"2023-12","title":"Robust Knowledge Extraction from Large Language Models using Social Choice Theory","author":"Nico Potyka, Yuqicheng Zhu, Yunjie He, Evgeny Kharlamov, and Steffen Staab","link":"http://arxiv.org/abs/2312.14877v2","abstract":"Large-language models (LLMs) can support a wide range of applications like\nconversational agents, creative writing or general query answering. However,\nthey are ill-suited for query answering in high-stakes domains like medicine\nbecause they are typically not robust - even the same query can result in\ndifferent answers when prompted multiple times. In order to improve the\nrobustness of LLM queries, we propose using ranking queries repeatedly and\naggregating the query results using methods from social choice theory. We study ranking\nqueries in diagnostic settings like medical and fault diagnosis and discuss how\nthe Partial Borda Choice function from the literature can be applied to merge\nmultiple query results. We discuss some additional interesting properties in\nour setting and evaluate the robustness of our approach empirically."},{"date":"2023-12","title":"MEAOD: Model Extraction Attack against Object Detectors","author":"Zeyu Li, Chenghui Shi, Yuwen Pu, Xuhong Zhang, Yu Li, Jinbao Li, and Shouling Ji","link":"http://arxiv.org/abs/2312.14677v1","abstract":"The widespread use of deep learning technology across various industries has\nmade deep neural network models highly valuable and, as a result, attractive\ntargets for potential attackers. Model extraction attacks, particularly\nquery-based model extraction attacks, allow attackers to replicate a substitute\nmodel with comparable functionality to the victim model and present a\nsignificant threat to the confidentiality and security of MLaaS platforms.\nWhile many studies have explored threats of model extraction attacks against\nclassification models in recent years, object detection models, which are more\nfrequently used in real-world scenarios, have received less attention. In this\npaper, we investigate the challenges and feasibility of query-based model\nextraction attacks against object detection models and propose an effective\nattack method called MEAOD. It selects samples from the attacker-possessed\ndataset to construct an efficient query dataset using active learning and\nenhances the categories with insufficient objects. We additionally improve the\nextraction effectiveness by updating the annotations of the query dataset.\nAccording to our experiments in gray-box and black-box scenarios, we achieve an\nextraction performance of over 70% under the given condition of a 10k query\nbudget."},{"date":"2023-12","title":"Zero-shot Building Attribute Extraction from Large-Scale Vision and Language Models","author":"Fei Pan, Sangryul Jeon, Brian Wang, Frank Mckenna, and Stella X. Yu","link":"http://arxiv.org/abs/2312.12479v1","abstract":"Existing building recognition methods, exemplified by BRAILS, utilize\nsupervised learning to extract information from satellite and street-view\nimages for classification and segmentation. However, each task module requires\nhuman-annotated data, hindering the scalability and robustness to regional\nvariations and annotation imbalances.
In response, we propose a new zero-shot\nworkflow for building attribute extraction that utilizes large-scale vision and\nlanguage models to mitigate reliance on external annotations. The proposed\nworkflow contains two key components: image-level captioning and segment-level\ncaptioning for the building images based on the vocabularies pertinent to\nstructural and civil engineering. These two components generate descriptive\ncaptions by computing feature representations of the image and the\nvocabularies, and facilitating a semantic match between the visual and textual\nrepresentations. Consequently, our framework offers a promising avenue to\nenhance AI-driven captioning for building attribute extraction in the\nstructural and civil engineering domains, ultimately reducing reliance on human\nannotations while bolstering performance and adaptability."},{"date":"2023-12","title":"Model Stealing Attack against Graph Classification with Authenticity, Uncertainty and Diversity","author":"Zhihao Zhu, Chenwang Wu, Rui Fan, Yi Yang, Zhen Wang, Defu Lian, and Enhong Chen","link":"http://arxiv.org/abs/2312.10943v3","abstract":"Recent research demonstrates that GNNs are vulnerable to the model stealing\nattack, a nefarious endeavor geared towards duplicating the target model via\nquery permissions. However, they mainly focus on node classification tasks,\nneglecting the potential threats entailed within the domain of graph\nclassification tasks. Furthermore, their practicality is questionable due to\nunreasonable assumptions, specifically concerning the large data requirements\nand extensive model knowledge. To this end, we advocate following strict\nsettings with limited real data and hard-label awareness to generate synthetic\ndata, thereby facilitating the stealing of the target model. Specifically,\nfollowing important data generation principles, we introduce three model\nstealing attacks to adapt to different actual scenarios: MSA-AU is inspired by\nactive learning and emphasizes the uncertainty to enhance query value of\ngenerated samples; MSA-AD introduces diversity based on Mixup augmentation\nstrategy to alleviate the query inefficiency issue caused by over-similar\nsamples generated by MSA-AU; MSA-AUD combines the above two strategies to\nseamlessly integrate the authenticity, uncertainty, and diversity of the\ngenerated samples. Finally, extensive experiments consistently demonstrate the\nsuperiority of the proposed methods in terms of concealment, query efficiency,\nand stealing performance."},{"date":"2023-12","title":"Model Stealing Attack against Recommender System","author":"Zhihao Zhu, Rui Fan, Chenwang Wu, Yi Yang, Defu Lian, and Enhong Chen","link":"http://arxiv.org/abs/2312.11571v2","abstract":"Recent studies have demonstrated the vulnerability of recommender systems to\ndata privacy attacks. However, research on the threat to model privacy in\nrecommender systems, such as model stealing attacks, is still in its infancy.\nSome adversarial attacks have achieved model stealing attacks against\nrecommender systems, to some extent, by collecting abundant training data of\nthe target model (target data) or making a mass of queries. In this paper, we\nconstrain the volume of available target data and queries and utilize auxiliary\ndata, which shares the item set with the target data, to promote model stealing\nattacks. 
Although the target model treats target and auxiliary data\ndifferently, their similar behavior patterns allow them to be fused using an\nattention mechanism to assist attacks. In addition, we design stealing functions to\neffectively extract the recommendation list obtained by querying the target\nmodel. Experimental results show that the proposed methods are applicable to\nmost recommender systems and various scenarios and exhibit excellent attack\nperformance on multiple datasets."},{"date":"2023-12","title":"SAME: Sample Reconstruction against Model Extraction Attacks","author":"Yi Xie, Jie Zhang, Shiqian Zhao, Tianwei Zhang, and Xiaofeng Chen","link":"http://arxiv.org/abs/2312.10578v2","abstract":"While deep learning models have shown significant performance across various\ndomains, their deployment needs extensive resources and advanced computing\ninfrastructure. As a solution, Machine Learning as a Service (MLaaS) has\nemerged, lowering the barriers for users to release or productize their deep\nlearning models. However, previous studies have highlighted potential privacy\nand security concerns associated with MLaaS, and one primary threat is model\nextraction attacks. To address this, many defense solutions exist, but they\nsuffer from unrealistic assumptions and generalization issues, making them less\npractical for reliable protection. Driven by these limitations, we introduce a\nnovel defense mechanism, SAME, based on the concept of sample reconstruction.\nThis strategy imposes minimal prerequisites on the defender's capabilities,\neliminating the need for auxiliary Out-of-Distribution (OOD) datasets, user\nquery history, white-box model access, and additional intervention during model\ntraining. It is compatible with existing active defense methods. Our extensive\nexperiments corroborate the superior efficacy of SAME over state-of-the-art\nsolutions. Our code is available at https://github.com/xythink/SAME."},{"date":"2023-12","title":"High-throughput Biomedical Relation Extraction for Semi-Structured Web Articles Empowered by Large Language Models","author":"Songchi Zhou, and Sheng Yu","link":"http://arxiv.org/abs/2312.08274v4","abstract":"Objective: To develop a high-throughput biomedical relation extraction system\nthat takes advantage of the large language models' (LLMs) reading comprehension\nability and biomedical world knowledge in a scalable and evidential manner.\nMethods: We formulate the relation extraction task as binary classifications\nfor large language models. Specifically, LLMs make the decision based on the\nexternal corpus and their world knowledge, giving the reason for the judgment for\nfactual verification. This method is tailored for semi-structured web articles,\nwherein we designate the main title as the tail entity and explicitly\nincorporate it into the context, and the potential head entities are matched\nbased on a biomedical thesaurus. Moreover, lengthy contents are sliced into\ntext chunks, embedded, and retrieved with additional embedding models. Results:\nUsing an open-source LLM, we extracted 248,659 relation triplets of three\ndistinct relation types from three reputable biomedical websites. To assess the\nefficacy of the basic pipeline employed for biomedical relation extraction, we\ncurated a benchmark dataset annotated by a medical expert.
Evaluation results\nindicate that the pipeline exhibits performance comparable to that of GPT-4.\nCase studies further illuminate challenges faced by contemporary LLMs in the\ncontext of biomedical relation extraction for semi-structured web articles.\nConclusion: The proposed method has demonstrated its effectiveness in\nleveraging the strengths of LLMs for high-throughput biomedical relation\nextraction. Its adaptability is evident, as it can be seamlessly extended to\ndiverse semi-structured biomedical websites, facilitating the extraction of\nvarious types of biomedical relations with ease."},{"date":"2023-12","title":"BED: Bi-Encoder-Decoder Model for Canonical Relation Extraction","author":"Nantao Zheng, Siyu Long, and Xinyu Dai","link":"http://arxiv.org/abs/2312.07088v1","abstract":"Canonical relation extraction aims to extract relational triples from\nsentences, where the triple elements (entity pairs and their relationship) are\nmapped to the knowledge base. Recently, methods based on the encoder-decoder\narchitecture have been proposed and achieve promising results. However, these methods\ncannot make good use of the entity information, which is merely used as augmented\ntraining data. Moreover, they are incapable of representing novel entities,\nsince no embeddings have been learned for them. In this paper, we propose a\nnovel framework, Bi-Encoder-Decoder (BED), to solve the above issues.\nSpecifically, to fully utilize entity information, we employ an encoder to\nencode the semantics of this information, leading to high-quality entity\nrepresentations. For novel entities, given a trained entity encoder, their\nrepresentations can be easily generated. Experimental results on two datasets\nshow that our method achieves a significant performance improvement over the\nprevious state-of-the-art and handles novel entities well without retraining."},{"date":"2023-12","title":"Model Extraction Attacks Revisited","author":"Jiacheng Liang, Ren Pang, Changjiang Li, and Ting Wang","link":"http://arxiv.org/abs/2312.05386v1","abstract":"Model extraction (ME) attacks represent one major threat to\nMachine-Learning-as-a-Service (MLaaS) platforms by \"stealing\" the\nfunctionality of confidential machine-learning models through querying\nblack-box APIs. Over seven years have passed since ME attacks were first\nconceptualized in the seminal work. During this period, substantial advances\nhave been made in both ME attacks and MLaaS platforms, raising the intriguing\nquestion: How has the vulnerability of MLaaS platforms to ME attacks been\nevolving? In this work, we conduct an in-depth study to answer this critical\nquestion. Specifically, we characterize the vulnerability of current,\nmainstream MLaaS platforms to ME attacks from multiple perspectives, including\nattack strategies, learning techniques, surrogate-model design, and benchmark\ntasks. Many of our findings challenge previously reported results, suggesting\nemerging patterns of ME vulnerability. Further, by analyzing the vulnerability\nof the same MLaaS platforms using historical datasets from the past four years,\nwe retrospectively characterize the evolution of ME vulnerability over time,\nleading to a set of interesting findings. Finally, we make suggestions about\nimproving the current practice of MLaaS in terms of attack robustness.
Our\nstudy sheds light on the current state of ME vulnerability in the wild and\npoints to several promising directions for future research."},{"date":"2023-12","title":"Fine-tuning pre-trained extractive QA models for clinical document parsing","author":"Ashwyn Sharma, David I. Feldman, and Aneesh Jain","link":"http://arxiv.org/abs/2312.02314v1","abstract":"Electronic health records (EHRs) contain a vast amount of high-dimensional\nmulti-modal data that can accurately represent a patient's medical history.\nUnfortunately, most of this data is either unstructured or semi-structured,\nrendering it unsuitable for real-time and retrospective analyses. A remote\npatient monitoring (RPM) program for Heart Failure (HF) patients needs to have\naccess to clinical markers like EF (Ejection Fraction) or LVEF (Left\nVentricular Ejection Fraction) in order to ascertain eligibility and\nappropriateness for the program. This paper explains a system that can parse\nechocardiogram reports and verify EF values. This system helps identify\neligible HF patients who can be enrolled in such a program. At the heart of\nthis system is a pre-trained extractive QA transformer model that is fine-tuned\non custom-labeled data. The methods used to prepare such a model for deployment\nare illustrated by running experiments on a public clinical dataset like\nMIMIC-IV-Note. The pipeline can be used to generalize solutions to similar\nproblems in a low-resource setting. We found that the system saved over 1500\nhours for our clinicians over 12 months by automating the task at scale."},{"date":"2023-12","title":"LLM-TAKE: Theme Aware Keyword Extraction Using Large Language Models","author":"Reza Yousefi Maragheh, Chenhao Fang, Charan Chand Irugu, Parth Parikh, Jason Cho, Jianpeng Xu, Saranyan Sukumar, Malay Patel, Evren Korpeoglu, Sushant Kumar, and Kannan Achan","link":"http://arxiv.org/abs/2312.00909v1","abstract":"Keyword extraction is one of the core tasks in natural language processing.\nClassic extraction models are notorious for having a short attention span, which\nmakes it hard for them to infer relational connections among words and\nsentences that are far from each other. This, in turn, makes their usage\nprohibitive for generating keywords that are inferred from the context of the\nwhole text. In this paper, we explore using Large Language Models (LLMs) in\ngenerating keywords for items that are inferred from the items' textual\nmetadata. Our modeling framework includes several stages to fine-grain the\nresults by avoiding outputting keywords that are non-informative or sensitive\nand reducing hallucinations common in LLMs. We call our LLM-based framework\nTheme-Aware Keyword Extraction (LLM-TAKE). We propose two variations of the\nframework for generating extractive and abstractive themes for products in an\ne-commerce setting. We perform an extensive set of experiments on three real\ndatasets and show that our modeling framework can enhance accuracy-based and\ndiversity-based metrics when compared with benchmark models."},{"date":"2023-11","title":"Scalable Extraction of Training Data from (Production) Language Models","author":"Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A.
Choquette-Choo, Eric Wallace, Florian Tram\u00e8r, and Katherine Lee","link":"http://arxiv.org/abs/2311.17035v1","abstract":"This paper studies extractable memorization: training data that an adversary\ncan efficiently extract by querying a machine learning model without prior\nknowledge of the training dataset. We show an adversary can extract gigabytes\nof training data from open-source language models like Pythia or GPT-Neo,\nsemi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing\ntechniques from the literature suffice to attack unaligned models; in order to\nattack the aligned ChatGPT, we develop a new divergence attack that causes the\nmodel to diverge from its chatbot-style generations and emit training data at a\nrate 150x higher than when behaving properly. Our methods show practical\nattacks can recover far more data than previously thought, and reveal that\ncurrent alignment techniques do not eliminate memorization."},{"date":"2023-11","title":"GPT Struct Me: Probing GPT Models on Narrative Entity Extraction","author":"Hugo Sousa, Nuno Guimar\u00e3es, Al\u00edpio Jorge, and Ricardo Campos","link":"http://arxiv.org/abs/2311.14583v1","abstract":"The importance of systems that can extract structured information from\ntextual data becomes increasingly pronounced given the ever-increasing volume\nof text produced on a daily basis. Having a system that can effectively extract\nsuch information in an interoperable manner would be an asset for several\ndomains, be it finance, health, or legal. Recent developments in natural\nlanguage processing led to the production of powerful language models that can,\nto some degree, mimic human intelligence. Such effectiveness raises a pertinent\nquestion: Can these models be leveraged for the extraction of structured\ninformation? In this work, we address this question by evaluating the\ncapabilities of two state-of-the-art language models -- GPT-3 and GPT-3.5,\ncommonly known as ChatGPT -- in the extraction of narrative entities, namely\nevents, participants, and temporal expressions. This study is conducted on the\nText2Story Lusa dataset, a collection of 119 Portuguese news articles whose\nannotation framework includes a set of entity structures along with several\ntags and attribute values. We first select the best prompt template through an\nablation study over prompt components that provide varying degrees of\ninformation on a subset of documents of the dataset. Subsequently, we use the\nbest templates to evaluate the effectiveness of the models on the remaining\ndocuments. The results obtained indicate that GPT models are competitive with\nout-of-the-box baseline systems, presenting an all-in-one alternative for\npractitioners with limited resources. By studying the strengths and limitations\nof these models in the context of information extraction, we offer insights\nthat can guide future improvements and avenues to explore in this field."},{"date":"2023-11","title":"Steal My Artworks for Fine-tuning? A Watermarking Framework for Detecting Art Theft Mimicry in Text-to-Image Models","author":"Ge Luo, Junqiang Huang, Manman Zhang, Zhenxing Qian, Sheng Li, and Xinpeng Zhang","link":"http://arxiv.org/abs/2311.13619v1","abstract":"The advancement in text-to-image models has led to astonishing artistic\nperformances. 
However, several studios and websites illegally fine-tune these\nmodels using artists' artworks to mimic their styles for profit, which violates\nthe copyrights of artists and diminishes their motivation to produce original\nworks. Currently, there is a notable lack of research focusing on this issue.\nIn this paper, we propose a novel watermarking framework that detects mimicry\nin text-to-image models through fine-tuning. This framework embeds subtle\nwatermarks into digital artworks to protect their copyrights while still\npreserving the artist's visual expression. If someone takes watermarked\nartworks as training data to mimic an artist's style, these watermarks can\nserve as detectable indicators. By analyzing the distribution of these\nwatermarks in a series of generated images, acts of fine-tuning mimicry using\nstolen victim data will be exposed. In various fine-tuning scenarios and against\nwatermark attack methods, our research confirms that analyzing the distribution\nof watermarks in artificially generated images reliably detects unauthorized\nmimicry."},{"date":"2023-11","title":"Use GPT-J Prompt Generation with RoBERTa for NER Models on Diagnosis Extraction of Periodontal Diagnosis from Electronic Dental Records","author":"Yao-Shun Chuang, Xiaoqian Jiang, Chun-Teh Lee, Ryan Brandon, Duong Tran, Oluwabunmi Tokede, and Muhammad F. Walji","link":"http://arxiv.org/abs/2311.10810v1","abstract":"This study explored the usability of prompt generation on named entity\nrecognition (NER) tasks and the performance in different settings of the\nprompt. Prompt generation by GPT-J models was utilized to directly test the\ngold standard as well as to generate seeds that were further fed to the RoBERTa\nmodel with the spaCy package. In the direct test, a lower ratio of negative\nexamples with higher numbers of examples in the prompt achieved the best results,\nwith an F1 score of 0.72. The performance was consistent, with F1 scores of\n0.92-0.97, in all settings after training with the RoBERTa model. The study\nhighlighted the importance of seed quality rather than quantity in feeding NER\nmodels. This research reports on an efficient and accurate way to mine clinical\nnotes for periodontal diagnoses, allowing researchers to easily and quickly\nbuild a NER model with the prompt generation approach."},{"date":"2023-11","title":"Semi-automatic Data Enhancement for Document-Level Relation Extraction with Distant Supervision from Large Language Models","author":"Junpeng Li, Zixia Jia, and Zilong Zheng","link":"http://arxiv.org/abs/2311.07314v1","abstract":"Document-level Relation Extraction (DocRE), which aims to extract relations\nfrom a long context, is a critical challenge in achieving fine-grained\nstructural comprehension and generating interpretable document representations.\nInspired by recent advances in in-context learning capabilities emergent from\nlarge language models (LLMs), such as ChatGPT, we aim to design an automated\nannotation method for DocRE with minimum human effort. Unfortunately, vanilla\nin-context learning is infeasible for document-level relation extraction due to\nthe abundance of predefined fine-grained relation types and the uncontrolled\ngenerations of LLMs.
To tackle this issue, we propose a method integrating a\nlarge language model (LLM) and a natural language inference (NLI) module to\ngenerate relation triples, thereby augmenting document-level relation datasets.\nWe demonstrate the effectiveness of our approach by introducing an enhanced\ndataset known as DocGNRE, which excels in re-annotating numerous long-tail\nrelation types. We are confident that our method holds the potential for\nbroader applications in domain-specific relation type definitions and offers\ntangible benefits in advancing generalized language semantic comprehension."},{"date":"2023-11","title":"Army of Thieves: Enhancing Black-Box Model Extraction via Ensemble based sample selection","author":"Akshit Jindal, Vikram Goyal, Saket Anand, and Chetan Arora","link":"http://arxiv.org/abs/2311.04588v1","abstract":"Machine Learning (ML) models become vulnerable to Model Stealing Attacks\n(MSA) when they are deployed as a service. In such attacks, the deployed model\nis queried repeatedly to build a labelled dataset. This dataset allows the\nattacker to train a thief model that mimics the original model. To maximize\nquery efficiency, the attacker has to select the most informative subset of\ndata points from the pool of available data. Existing attack strategies utilize\napproaches like Active Learning and Semi-Supervised learning to minimize costs.\nHowever, in the black-box setting, these approaches may select sub-optimal\nsamples as they train only one thief model. Depending on the thief model's\ncapacity and the data it was pretrained on, the model might even select noisy\nsamples that harm the learning process. In this work, we explore the usage of\nan ensemble of deep learning models as our thief model. We call our attack Army\nof Thieves (AOT), as we train multiple models with varying complexities to\nleverage the crowd's wisdom. Based on the ensemble's collective decision,\nuncertain samples are selected for querying, while the most confident samples\nare directly included in the training data. Our approach is the first one to\nutilize an ensemble of thief models to perform model extraction. We outperform\nthe base approaches of existing state-of-the-art methods by at least 3% and\nachieve a 21% higher adversarial sample transferability than previous work for\nmodels trained on the CIFAR-10 dataset."},{"date":"2023-11","title":"JPAVE: A Generation and Classification-based Model for Joint Product Attribute Prediction and Value Extraction","author":"Zhongfen Deng, Hao Peng, Tao Zhang, Shuaiqi Liu, Wenting Zhao, Yibo Wang, and Philip S. Yu","link":"http://arxiv.org/abs/2311.04196v1","abstract":"Product attribute value extraction is an important task in e-Commerce which\ncan help several downstream applications such as product search and\nrecommendation. Most previous models handle this task using sequence labeling\nor question answering methods, which rely on the sequential position information\nof values in the product text and are vulnerable to data discrepancy between\ntraining and testing. This limits their generalization ability to real-world\nscenarios in which each product can have multiple descriptions across various\nshopping platforms with different compositions of text and style. They also have\nlimited zero-shot ability to new values. In this paper, we propose a multi-task\nlearning model with value generation/classification and attribute prediction\ncalled JPAVE to predict values without the necessity of position information of\nvalues in the text.
Furthermore, the copy mechanism in the value generator and the\nvalue attention module in the value classifier help our model address the data\ndiscrepancy issue by focusing only on the relevant part of the input text and\nignoring other information that causes the discrepancy issue, such as sentence\nstructure in the text. Moreover, two variants of our model are designed for\nopen-world and closed-world scenarios. In addition, the copy mechanism introduced\nin the first variant, based on value generation, can improve its zero-shot\nability for identifying unseen values. Experimental results on a public dataset\ndemonstrate the superiority of our model compared with strong baselines and its\ngeneralization ability in predicting new values."},{"date":"2023-11","title":"Extracting human interpretable structure-property relationships in chemistry using XAI and large language models","author":"Geemi P. Wellawatte, and Philippe Schwaller","link":"http://arxiv.org/abs/2311.04047v1","abstract":"Explainable Artificial Intelligence (XAI) is an emerging field in AI that\naims to address the opaque nature of machine learning models. Furthermore, it\nhas been shown that XAI can be used to extract input-output relationships,\nmaking it a useful tool in chemistry for understanding structure-property\nrelationships. However, one of the main limitations of XAI methods is that they\nare developed for technically oriented users. We propose the XpertAI framework\nthat integrates XAI methods with large language models (LLMs) accessing\nscientific literature to generate accessible natural language explanations of\nraw chemical data automatically. We conducted 5 case studies to evaluate the\nperformance of XpertAI. Our results show that XpertAI combines the strengths of\nLLMs and XAI tools in generating specific, scientific, and interpretable\nexplanations."},{"date":"2023-11","title":"Reinforcement Learning Fine-tuning of Language Models is Biased Towards More Extractable Features","author":"Diogo Cruz, Edoardo Pona, Alex Holness-Tofts, Elias Schmied, V\u00edctor Abia Alonso, Charlie Griffin, and Bogdan-Ionut Cirstea","link":"http://arxiv.org/abs/2311.04046v1","abstract":"Many capable large language models (LLMs) are developed via self-supervised\npre-training followed by a reinforcement-learning fine-tuning phase, often\nbased on human or AI feedback. During this stage, models may be guided by their\ninductive biases to rely on simpler features which may be easier to extract, at\na cost to robustness and generalisation. We investigate whether principles\ngoverning inductive biases in the supervised fine-tuning of LLMs also apply\nwhen the fine-tuning process uses reinforcement learning.
Following Lovering et\nal. (2021), we test two hypotheses: that features more $\\textit{extractable}$\nafter pre-training are more likely to be utilised by the final policy, and that\nthe evidence for/against a feature predicts whether it will be utilised.\nThrough controlled experiments on synthetic and natural language tasks, we find\nstatistically significant correlations which constitute strong evidence for\nthese hypotheses."},{"date":"2023-11","title":"Enhancing AI Research Paper Analysis: Methodology Component Extraction using Factored Transformer-based Sequence Modeling Approach","author":"Madhusudan Ghosh, Debasis Ganguly, Partha Basuchowdhuri, and Sudip Kumar Naskar","link":"http://arxiv.org/abs/2311.03401v1","abstract":"Research in scientific disciplines evolves, often rapidly, over time with the\nemergence of novel methodologies and their associated terminologies. While\nmethodologies themselves are conceptual in nature and rather difficult to\nautomatically extract and characterise, in this paper we seek to develop\nsupervised models for automatic extraction of the names of the various\nconstituents of a methodology, e.g., 'R-CNN', 'ELMo', etc. The main research\nchallenge for this task is effectively modeling the contexts around these\nmethodology component names in a few-shot or even a zero-shot setting. The main\ncontributions of this paper towards effectively identifying new evolving\nscientific methodology names are as follows: i) we propose a factored approach\nto sequence modeling, which leverages broad-level category information of\nmethodology domains, e.g., 'NLP', 'RL', etc.; ii) to demonstrate the feasibility\nof our proposed approach of identifying methodology component names under a\npractical setting of fast evolving AI literature, we conduct experiments\nfollowing a simulated chronological setup (newer methodologies not seen during\nthe training process); iii) our experiments demonstrate that the factored\napproach outperforms state-of-the-art baselines by margins of up to 9.257\\% for\nthe methodology extraction task with the few-shot setup."},{"date":"2023-11","title":"Extraction of Atypical Aspects from Customer Reviews: Datasets and Experiments with Language Models","author":"Smita Nannaware, Erfan Al-Hossami, and Razvan Bunescu","link":"http://arxiv.org/abs/2311.02702v1","abstract":"A restaurant dinner may become a memorable experience due to an unexpected\naspect enjoyed by the customer, such as an origami-making station in the\nwaiting area. If aspects that are atypical for a restaurant experience were\nknown in advance, they could be leveraged to make recommendations that have the\npotential to engender serendipitous experiences, further increasing user\nsatisfaction. Although relatively rare, whenever encountered, atypical aspects\noften end up being mentioned in reviews due to their memorable quality.\nCorrespondingly, in this paper we introduce the task of detecting atypical\naspects in customer reviews.
To facilitate the development of extraction\nmodels, we manually annotate benchmark datasets of reviews in three domains -\nrestaurants, hotels, and hair salons, which we use to evaluate a number of\nlanguage models, ranging from fine-tuning the instruction-based text-to-text\ntransformer Flan-T5 to zero-shot and few-shot prompting of GPT-3.5."},{"date":"2023-10","title":"rTsfNet: a DNN model with Multi-head 3D Rotation and Time Series Feature Extraction for IMU-based Human Activity Recognition","author":"Yu Enokibori","link":"http://arxiv.org/abs/2310.19283v3","abstract":"Although many deep learning (DL) algorithms have been proposed for the\nIMU-based HAR domain, traditional machine learning that utilizes handcrafted\ntime series features (TSFs) still often performs well. It is not rare that\ncombinations of DL and TSFs show better accuracy than DL-only approaches.\nHowever, there is a problem with time series features in IMU-based HAR. The\nnumber of derived features can vary greatly depending on the method used to\nselect the 3D basis. Fortunately, DL's strengths include capturing the features\nof input data and adaptively deriving parameters. Thus, as a new DNN model for\nIMU-based human activity recognition (HAR), this paper proposes rTsfNet, a DNN\nmodel with Multi-head 3D Rotation and Time Series Feature Extraction. rTsfNet\nautomatically selects 3D bases from which features should be derived by\nextracting 3D rotation parameters within the DNN. Then, time series features\n(TSFs), based on many researchers' wisdom, are derived to achieve HAR using an\nMLP. Although rTsfNet is a model that does not use a CNN, it achieved higher\naccuracy than existing models under well-managed benchmark conditions and\nmultiple datasets: UCI HAR, PAMAP2, Daphnet, and OPPORTUNITY, all of which\ntarget different activities."},{"date":"2023-10","title":"Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting","author":"Hejie Cui, Xinyu Fang, Zihan Zhang, Ran Xu, Xuan Kan, Xin Liu, Yue Yu, Manling Li, Yangqiu Song, and Carl Yang","link":"http://arxiv.org/abs/2310.18804v1","abstract":"Images contain rich relational knowledge that can help machines understand\nthe world. Existing methods for visual knowledge extraction often rely on a\npre-defined format (e.g., sub-verb-obj tuples) or vocabulary (e.g., relation\ntypes), restricting the expressiveness of the extracted knowledge. In this\nwork, we take a first exploration of a new paradigm of open visual knowledge\nextraction. To achieve this, we present OpenVik, which consists of an open\nrelational region detector to detect regions potentially containing relational\nknowledge and a visual knowledge generator that generates format-free knowledge\nby prompting the large multimodality model with the detected region of\ninterest. We also explore two data enhancement techniques for diversifying the\ngenerated format-free visual knowledge. Extensive knowledge quality evaluations\nhighlight the correctness and uniqueness of the extracted open visual knowledge\nby OpenVik. Moreover, integrating our extracted knowledge across various visual\nreasoning applications shows consistent improvements, indicating the real-world\napplicability of OpenVik."},{"date":"2023-10","title":"Can large language models replace humans in the systematic review process?
Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages","author":"Qusai Khraisha, Sophie Put, Johanna Kappenberg, Azza Warraitch, and Kristin Hadfield","link":"http://arxiv.org/abs/2310.17526v2","abstract":"Systematic reviews are vital for guiding practice, research, and policy, yet\nthey are often slow and labour-intensive. Large language models (LLMs) could\noffer a way to speed up and automate systematic reviews, but their performance\nin such tasks has not been comprehensively evaluated against humans, and no\nstudy has tested GPT-4, the biggest LLM so far. This pre-registered study\nevaluates GPT-4's capability in title/abstract screening, full-text review, and\ndata extraction across various literature types and languages using a\n'human-out-of-the-loop' approach. Although GPT-4 had accuracy on par with human\nperformance in most tasks, results were skewed by chance agreement and dataset\nimbalance. After adjusting for these, there was a moderate level of performance\nfor data extraction, and - barring studies that used highly reliable prompts -\nscreening performance levelled at none to moderate for different stages and\nlanguages. When screening full-text literature using highly reliable prompts,\nGPT-4's performance was 'almost perfect.' Penalising GPT-4 for missing key\nstudies using highly reliable prompts improved its performance even more. Our\nfindings indicate that, currently, substantial caution should be used if LLMs\nare being used to conduct systematic reviews, but suggest that, for certain\nsystematic review tasks delivered under reliable prompts, LLMs can rival human\nperformance."},{"date":"2023-10","title":"Prompt-Driven Building Footprint Extraction in Aerial Images with Offset-Building Model","author":"Kai Li, Yupeng Deng, Yunlong Kong, Diyou Liu, Jingbo Chen, Yu Meng, and Junxian Ma","link":"http://arxiv.org/abs/2310.16717v3","abstract":"More accurate extraction of invisible building footprints from\nvery-high-resolution (VHR) aerial images relies on roof segmentation and\nroof-to-footprint offset extraction. Existing state-of-the-art methods based on\ninstance segmentation suffer from poor generalization when extended to\nlarge-scale data production and fail to achieve low-cost human interactive\nannotation. The latest prompt paradigms inspire us to design a promptable\nframework for roof and offset extraction, which transforms end-to-end\nalgorithms into promptable methods. Within this framework, we propose a novel\nOffset-Building Model (OBM). To rigorously evaluate the algorithm's\ncapabilities, we introduce a prompt-based evaluation method, where our model\nreduces offset errors by 16.6% and improves roof Intersection over Union (IoU)\nby 10.8% compared to other models. Leveraging the common patterns in predicting\noffsets, we propose Distance-NMS (DNMS) algorithms, enabling the model to\nfurther reduce offset vector loss by 6.5%. To further validate the\ngeneralization of models, we tested them using a new dataset with over 7,000\nmanually annotated instance samples. Our algorithms and dataset are available\nat https://anonymous.4open.science/r/OBM-B3EC."},{"date":"2023-10","title":"Defense Against Model Extraction Attacks on Recommender Systems","author":"Sixiao Zhang, Hongzhi Yin, Hongxu Chen, and Cheng Long","link":"http://arxiv.org/abs/2310.16335v1","abstract":"The robustness of recommender systems has become a prominent topic within the\nresearch community. 
Numerous adversarial attacks have been proposed, but most\nof them rely on extensive prior knowledge, such as all the white-box attacks or\nmost of the black-box attacks which assume that certain external knowledge is\navailable. Among these attacks, the model extraction attack stands out as a\npromising and practical method, involving training a surrogate model by\nrepeatedly querying the target model. However, there is a significant gap in\nthe existing literature when it comes to defending against model extraction\nattacks on recommender systems. In this paper, we introduce Gradient-based\nRanking Optimization (GRO), which is the first defense strategy designed to\ncounter such attacks. We formalize the defense as an optimization problem,\naiming to minimize the loss of the protected target model while maximizing the\nloss of the attacker's surrogate model. Since top-k ranking lists are\nnon-differentiable, we transform them into swap matrices which are instead\ndifferentiable. These swap matrices serve as input to a student model that\nemulates the surrogate model's behavior. By back-propagating the loss of the\nstudent model, we obtain gradients for the swap matrices. These gradients are\nused to compute a swap loss, which maximizes the loss of the student model. We\nconducted experiments on three benchmark datasets to evaluate the performance\nof GRO, and the results demonstrate its superior effectiveness in defending\nagainst model extraction attacks."},{"date":"2023-10","title":"Efficient Data Learning for Open Information Extraction with Pre-trained Language Models","author":"Zhiyuan Fan, and Shizhu He","link":"http://arxiv.org/abs/2310.15021v2","abstract":"Open Information Extraction (OpenIE) is a fundamental yet challenging task in\nNatural Language Processing, which involves extracting all triples (subject,\npredicate, object) from a given sentence. While labeling-based methods have\ntheir merits, generation-based techniques offer unique advantages, such as the\nability to generate tokens not present in the original sentence. However, these\ngeneration-based methods often require a significant amount of training data to\nlearn the task form of OpenIE and substantial training time to overcome slow\nmodel convergence due to the order penalty. In this paper, we introduce a novel\nframework, OK-IE, that ingeniously transforms the task form of OpenIE into the\npre-training task form of the T5 model, thereby reducing the need for extensive\ntraining data. Furthermore, we introduce an innovative concept of Anchor to\ncontrol the sequence of model outputs, effectively eliminating the impact of\norder penalty on model convergence and significantly reducing training time.\nExperimental results indicate that, compared to previous SOTA methods, OK-IE\nrequires only 1/100 of the training data (900 instances) and 1/120 of the\ntraining time (3 minutes) to achieve comparable results."},{"date":"2023-10","title":"Knowledge Extraction and Distillation from Large-Scale Image-Text Colonoscopy Records Leveraging Large Language and Vision Models","author":"Shuo Wang, Yan Zhu, Xiaoyuan Luo, Zhiwei Yang, Yizhe Zhang, Peiyao Fu, Manning Wang, Zhijian Song, Quanlin Li, Pinghong Zhou, and Yike Guo","link":"http://arxiv.org/abs/2310.11173v1","abstract":"The development of artificial intelligence systems for colonoscopy analysis\noften necessitates expert-annotated image datasets. 
However, limitations in\ndataset size and diversity impede model performance and generalisation.\nImage-text colonoscopy records from routine clinical practice, comprising\nmillions of images and text reports, serve as a valuable data source, though\nannotating them is labour-intensive. Here we leverage recent advancements in\nlarge language and vision models and propose EndoKED, a data mining paradigm\nfor deep knowledge extraction and distillation. EndoKED automates the\ntransformation of raw colonoscopy records into image datasets with pixel-level\nannotation. We validate EndoKED using multi-centre datasets of raw colonoscopy\nrecords (~1 million images), demonstrating its superior performance in training\npolyp detection and segmentation models. Furthermore, the EndoKED pre-trained\nvision backbone enables data-efficient and generalisable learning for optical\nbiopsy, achieving expert-level performance in both retrospective and\nprospective validation."},{"date":"2023-10","title":"Document-Level In-Context Few-Shot Relation Extraction via Pre-Trained Language Models","author":"Yilmazcan Ozyurt, Stefan Feuerriegel, and Ce Zhang","link":"http://arxiv.org/abs/2310.11085v4","abstract":"Document-level relation extraction aims at inferring structured human\nknowledge from textual documents. State-of-the-art methods for this task use\npre-trained language models (LMs) via fine-tuning, yet fine-tuning is\ncomputationally expensive and cannot adapt to new relation types or new LMs. As\na remedy, we leverage the generalization capabilities of pre-trained LMs and\npresent a novel framework for document-level in-context few-shot relation\nextraction. Our framework has three strengths: it eliminates the need (1) for\nnamed entity recognition and (2) for human annotations of documents, and (3) it\ncan be updated to new LMs without re-training. We evaluate our framework using\nDocRED, the largest publicly available dataset for document-level relation\nextraction, and demonstrate that our framework achieves state-of-the-art\nperformance. We further show that our framework actually performs much better\nthan the original labels from the development set of DocRED. Finally, we\nconduct an extensive benchmark demonstrating the effectiveness of our\nframework, achieving state-of-the-art results across six relation extraction\ndatasets and outperforming more than 30 baseline methods. Unlike our framework,\nthe baseline methods have large computational overhead (e.g., from\nfine-tuning). To the best of our knowledge, we are the first to reformulate the\ndocument-level relation extraction task as a tailored in-context few-shot\nlearning paradigm."},{"date":"2023-10","title":"Convolutional Neural Network Model for Diabetic Retinopathy Feature Extraction and Classification","author":"Sharan Subramanian, and Leilani H. Gilpin","link":"http://arxiv.org/abs/2310.10806v1","abstract":"The application of Artificial Intelligence in the medical market brings up\nincreasing concerns but aids in more timely diagnosis of silent progressing\ndiseases like Diabetic Retinopathy. In order to diagnose Diabetic Retinopathy\n(DR), ophthalmologists use color fundus images, or pictures of the back of the\nretina, to identify small distinct features through a difficult and\ntime-consuming process. Our work creates a novel CNN model and identifies the\nseverity of DR through fundus image input. 
We classified 4 known DR features,\nincluding micro-aneurysms, cotton wools, exudates, and hemorrhages, through\nconvolutional layers and were able to provide an accurate diagnostic without\nadditional user input. The proposed model is more interpretable and robust to\noverfitting. We present initial results with a sensitivity of 97% and an\naccuracy of 71%. Our contribution is an interpretable model with similar\naccuracy to more complex models. With that, our model advances the field of DR\ndetection and proves to be a key step towards AI-focused medical diagnosis."},{"date":"2023-10","title":"SCME: A Self-Contrastive Method for Data-free and Query-Limited Model Extraction Attack","author":"Renyang Liu, Jinhong Zhang, Kwok-Yan Lam, Jun Zhao, and Wei Zhou","link":"http://arxiv.org/abs/2310.09792v1","abstract":"Previous studies have revealed that artificial intelligence (AI) systems are\nvulnerable to adversarial attacks. Among them, model extraction attacks fool\nthe target model by generating adversarial examples on a substitute model. The\ncore of such an attack is training a substitute model as similar to the target\nmodel as possible, where the simulation process can be categorized in a\ndata-dependent and data-free manner. Compared with the data-dependent method,\nthe data-free one has been proven to be more practical in the real world since\nit trains the substitute model with synthesized data. However, the distribution\nof these fake data lacks diversity and cannot detect the decision boundary of\nthe target model well, resulting in an unsatisfactory simulation effect.\nBesides, these data-free techniques need a vast number of queries to train the\nsubstitute model, increasing the time and computing consumption and the risk of\nexposure. To solve the aforementioned problems, in this paper, we propose a\nnovel data-free model extraction method named SCME (Self-Contrastive Model\nExtraction), which considers both the inter- and intra-class diversity in\nsynthesizing fake data. In addition, SCME introduces the Mixup operation to\naugment the fake data, which can explore the target model's decision boundary\neffectively and improve the simulation capacity. Extensive experiments show\nthat the proposed method can yield diversified fake data. Moreover, our method\nhas shown superiority in many different attack settings under the query-limited\nscenario; for untargeted attacks in particular, SCME outperforms SOTA methods\nby 11.43\\% on average across five baseline datasets."},{"date":"2023-10","title":"Notes on Applicability of Explainable AI Methods to Machine Learning Models Using Features Extracted by Persistent Homology","author":"Naofumi Hama","link":"http://arxiv.org/abs/2310.09780v1","abstract":"Data analysis that uses the output of topological data analysis as input for\nmachine learning algorithms has been the subject of extensive research. This\napproach offers a means of capturing the global structure of data. Persistent\nhomology (PH), a common methodology within the field of TDA, has found\nwide-ranging applications in machine learning. One of the key reasons for the\nsuccess of the PH-ML pipeline lies in the deterministic nature of feature\nextraction conducted through PH. The ability to achieve satisfactory levels of\naccuracy with relatively simple downstream machine learning models, when\nprocessing these extracted features, underlines the pipeline's superior\ninterpretability. However, it must be noted that this interpretation has\nencountered issues. 
Specifically, it fails to accurately reflect the feasible\nparameter region in the data generation process, and the physical or chemical\nconstraints that restrict this process. Against this backdrop, we explore the\npotential application of explainable AI methodologies to this PH-ML pipeline.\nWe apply this approach to the specific problem of predicting gas adsorption in\nmetal-organic frameworks and demonstrate that it can yield suggestive results.\nThe codes to reproduce our results are available at\nhttps://github.com/naofumihama/xai_ph_ml"},{"date":"2023-10","title":"Polynomial Time Cryptanalytic Extraction of Neural Network Models","author":"Adi Shamir, Isaac Canales-Martinez, Anna Hambitzer, Jorge Chavez-Saab, Francisco Rodrigez-Henriquez, and Nitin Satpute","link":"http://arxiv.org/abs/2310.08708v1","abstract":"Billions of dollars and countless GPU hours are currently spent on training\nDeep Neural Networks (DNNs) for a variety of tasks. Thus, it is essential to\ndetermine the difficulty of extracting all the parameters of such neural\nnetworks when given access to their black-box implementations. Many versions of\nthis problem have been studied over the last 30 years, and the best current\nattack on ReLU-based deep neural networks was presented at Crypto 2020 by\nCarlini, Jagielski, and Mironov. It resembles a differential chosen plaintext\nattack on a cryptosystem, which has a secret key embedded in its black-box\nimplementation and requires a polynomial number of queries but an exponential\namount of time (as a function of the number of neurons). In this paper, we\nimprove this attack by developing several new techniques that enable us to\nextract with arbitrarily high precision all the real-valued parameters of a\nReLU-based DNN using a polynomial number of queries and a polynomial amount of\ntime. We demonstrate its practical efficiency by applying it to a full-sized\nneural network for classifying the CIFAR10 dataset, which has 3072 inputs, 8\nhidden layers with 256 neurons each, and over a million neuronal parameters. An\nattack following the approach by Carlini et al. requires an exhaustive search\nover 2 to the power 256 possibilities. Our attack replaces this with our new\ntechniques, which require only 30 minutes on a 256-core computer."},{"date":"2023-10","title":"I2SRM: Intra- and Inter-Sample Relationship Modeling for Multimodal Information Extraction","author":"Yusheng Huang, and Zhouhan Lin","link":"http://arxiv.org/abs/2310.06326v1","abstract":"Multimodal information extraction is attracting research attention nowadays,\nwhich requires aggregating representations from different modalities. In this\npaper, we present the Intra- and Inter-Sample Relationship Modeling (I2SRM)\nmethod for this task, which contains two modules. Firstly, the intra-sample\nrelationship modeling module operates on a single sample and aims to learn\neffective representations. Embeddings from textual and visual modalities are\nshifted to bridge the modality gap caused by distinct pre-trained language and\nimage models. Secondly, the inter-sample relationship modeling module considers\nrelationships among multiple samples and focuses on capturing the interactions.\nAn AttnMixup strategy is proposed, which not only enables collaboration among\nsamples but also augments data to improve generalization. We conduct extensive\nexperiments on the multimodal named entity recognition datasets Twitter-2015\nand Twitter-2017, and the multimodal relation extraction dataset MNRE. 
Our\nproposed method I2SRM achieves competitive results, 77.12% F1-score on\nTwitter-2015, 88.40% F1-score on Twitter-2017, and 84.12% F1-score on MNRE."},{"date":"2023-10","title":"Model Tuning or Prompt Tuning? A Study of Large Language Models for Clinical Concept and Relation Extraction","author":"Cheng Peng, Xi Yang, Kaleb E Smith, Zehao Yu, Aokun Chen, Jiang Bian, and Yonghui Wu","link":"http://arxiv.org/abs/2310.06239v1","abstract":"Objective To develop soft prompt-based learning algorithms for large language\nmodels (LLMs), examine the shape of prompts, prompt-tuning using\nfrozen/unfrozen LLMs, transfer learning, and few-shot learning abilities.\nMethods We developed a soft prompt-based LLM model and compared 4 training\nstrategies including (1) fine-tuning without prompts; (2) hard-prompt with\nunfrozen LLMs; (3) soft-prompt with unfrozen LLMs; and (4) soft-prompt with\nfrozen LLMs. We evaluated 7 pretrained LLMs using the 4 training strategies for\nclinical concept and relation extraction on two benchmark datasets. We\nevaluated the transfer learning ability of the prompt-based learning algorithms\nin a cross-institution setting. We also assessed the few-shot learning ability.\nResults and Conclusion When LLMs are unfrozen, GatorTron-3.9B with soft\nprompting achieves the best strict F1-scores of 0.9118 and 0.8604 for concept\nextraction, outperforming the traditional fine-tuning and hard prompt-based\nmodels by 0.6~3.1% and 1.2~2.9%, respectively; GatorTron-345M with soft\nprompting achieves the best F1-scores of 0.8332 and 0.7488 for end-to-end\nrelation extraction, outperforming the other two models by 0.2~2% and\n0.6~11.7%, respectively. When LLMs are frozen, small (i.e., 345 million\nparameters) LLMs have a big gap to be competitive with unfrozen models; scaling\nLLMs up to billions of parameters makes frozen LLMs competitive with unfrozen\nLLMs. For cross-institute evaluation, soft prompting with a frozen\nGatorTron-8.9B model achieved the best performance. This study demonstrates\nthat (1) machines can learn soft prompts better than humans, (2) frozen LLMs\nhave better few-shot learning ability and transfer learning ability to\nfacilitate multi-institution applications, and (3) frozen LLMs require large\nmodels."},{"date":"2023-10","title":"GeoLLM: Extracting Geospatial Knowledge from Large Language Models","author":"Rohin Manvi, Samar Khanna, Gengchen Mai, Marshall Burke, David Lobell, and Stefano Ermon","link":"http://arxiv.org/abs/2310.06213v2","abstract":"The application of machine learning (ML) in a range of geospatial tasks is\nincreasingly common but often relies on globally available covariates such as\nsatellite imagery that can either be expensive or lack predictive power. Here\nwe explore the question of whether the vast amounts of knowledge found in\nInternet language corpora, now compressed within large language models (LLMs),\ncan be leveraged for geospatial prediction tasks. We first demonstrate that\nLLMs embed remarkable spatial information about locations, but naively querying\nLLMs using geographic coordinates alone is ineffective in predicting key\nindicators like population density. We then present GeoLLM, a novel method that\ncan effectively extract geospatial knowledge from LLMs with auxiliary map data\nfrom OpenStreetMap. We demonstrate the utility of our approach across multiple\ntasks of central interest to the international community, including the\nmeasurement of population density and economic livelihoods. 
Across these tasks,\nour method demonstrates a 70% improvement in performance (measured using\nPearson's $r^2$) relative to baselines that use nearest neighbors or use\ninformation directly from the prompt, and performance equal to or exceeding\nsatellite-based benchmarks in the literature. With GeoLLM, we observe that\nGPT-3.5 outperforms Llama 2 and RoBERTa by 19% and 51% respectively, suggesting\nthat the performance of our method scales well with the size of the model and\nits pretraining dataset. Our experiments reveal that LLMs are remarkably\nsample-efficient, rich in geospatial information, and robust across the globe.\nCrucially, GeoLLM shows promise in mitigating the limitations of existing\ngeospatial covariates and complementing them well. Code is available on the\nproject website: https://rohinmanvi.github.io/GeoLLM"},{"date":"2023-10","title":"Conditional Diffusion Model for Target Speaker Extraction","author":"Theodor Nguyen, Guangzhi Sun, Xianrui Zheng, Chao Zhang, and Philip C Woodland","link":"http://arxiv.org/abs/2310.04791v1","abstract":"We propose DiffSpEx, a generative target speaker extraction method based on\nscore-based generative modelling through stochastic differential equations.\nDiffSpEx deploys a continuous-time stochastic diffusion process in the complex\nshort-time Fourier transform domain, starting from the target speaker source\nand converging to a Gaussian distribution centred on the mixture of sources.\nFor the reverse-time process, a parametrised score function is conditioned on a\ntarget speaker embedding to extract the target speaker from the mixture of\nsources. We utilise ECAPA-TDNN target speaker embeddings and condition the\nscore function alternately on the SDE time embedding and the target speaker\nembedding. The potential of DiffSpEx is demonstrated with the WSJ0-2mix\ndataset, achieving an SI-SDR of 12.9 dB and a NISQA score of 3.56. Moreover, we\nshow that fine-tuning a pre-trained DiffSpEx model to a specific speaker\nfurther improves performance, enabling personalisation in target speaker\nextraction."},{"date":"2023-10","title":"Do self-supervised speech and language models extract similar representations as human brain?","author":"Peili Chen, Linyang He, Li Fu, Lu Fan, Edward F. Chang, and Yuanning Li","link":"http://arxiv.org/abs/2310.04645v2","abstract":"Speech and language models trained through self-supervised learning (SSL)\ndemonstrate strong alignment with brain activity during speech and language\nperception. However, given their distinct training modalities, it remains\nunclear whether they correlate with the same neural aspects. We directly\naddress this question by evaluating the brain prediction performance of two\nrepresentative SSL models, Wav2Vec2.0 and GPT-2, designed for speech and\nlanguage tasks. Our findings reveal that both models accurately predict speech\nresponses in the auditory cortex, with a significant correlation between their\nbrain predictions. Notably, shared speech contextual information between\nWav2Vec2.0 and GPT-2 accounts for the majority of explained variance in brain\nactivity, surpassing static semantic and lower-level acoustic-phonetic\ninformation. 
These results underscore the convergence of speech contextual\nrepresentations in SSL models and their alignment with the neural network\nunderlying speech perception, offering valuable insights into both SSL models\nand the neural basis of speech and language processing."},{"date":"2023-10","title":"Extraction of Medication and Temporal Relation from Clinical Text using Neural Language Models","author":"Hangyu Tu, Lifeng Han, and Goran Nenadic","link":"http://arxiv.org/abs/2310.02229v2","abstract":"Clinical texts, represented in electronic medical records (EMRs), contain\nrich medical information and are essential for disease prediction, personalised\ninformation recommendation, clinical decision support, and medication pattern\nmining and measurement. Relation extractions between medication mentions and\ntemporal information can further help clinicians better understand the\npatients' treatment history. To evaluate the performances of deep learning (DL)\nand large language models (LLMs) in medication extraction and temporal\nrelations classification, we carry out an empirical investigation of the\n\\textbf{MedTem} project using several advanced learning structures including\nBiLSTM-CRF and CNN-BiLSTM for a clinical domain named entity recognition (NER),\nand BERT-CNN for temporal relation extraction (RE), in addition to the\nexploration of different word embedding techniques. Furthermore, we also\ndesigned a set of post-processing rules to generate structured output on\nmedications and the temporal relation. Our experiments show that CNN-BiLSTM\nslightly outperforms the BiLSTM-CRF model on the i2b2-2009 clinical NER task, yielding\n75.67, 77.83, and 78.17 for precision, recall, and F1 scores using Macro\nAverage. The BERT-CNN model also produced reasonable evaluation scores of 64.48,\n67.17, and 65.03 for P/R/F1 using Macro Avg on the temporal relation extraction\ntest set from i2b2-2012 challenges. Code and Tools from MedTem will be hosted\nat \\url{https://github.com/HECTA-UoM/MedTem}"},{"date":"2023-10","title":"An evaluation of pre-trained models for feature extraction in image classification","author":"Erick da Silva Puls, Matheus V. Todescato, and Joel L. Carbonera","link":"http://arxiv.org/abs/2310.02037v1","abstract":"In recent years, we have witnessed a considerable increase in performance in\nimage classification tasks. This performance improvement is mainly due to the\nadoption of deep learning techniques. Generally, deep learning techniques\ndemand a large set of annotated data, making it a challenge when applying it to\nsmall datasets. In this scenario, transfer learning strategies have become a\npromising alternative to overcome these issues. This work aims to compare the\nperformance of different pre-trained neural networks for feature extraction in\nimage classification tasks. We evaluated 16 different pre-trained models in\nfour image datasets. Our results demonstrate that the best general performance\nalong the datasets was achieved by CLIP-ViT-B and ViT-H-14, where the\nCLIP-ResNet50 model had similar performance but with less variability.\nTherefore, our study provides evidence supporting the choice of models for\nfeature extraction in image classification tasks."},{"date":"2023-10","title":"Beyond Labeling Oracles: What does it mean to steal ML models?","author":"Avital Shafran, Ilia Shumailov, Murat A. 
Erdogdu, and Nicolas Papernot","link":"http://arxiv.org/abs/2310.01959v3","abstract":"Model extraction attacks are designed to steal trained models with only query\naccess, as is often provided through APIs that ML-as-a-Service providers offer.\nMachine Learning (ML) models are expensive to train, in part because data is\nhard to obtain, and a primary incentive for model extraction is to acquire a\nmodel while incurring less cost than training from scratch. Literature on model\nextraction commonly claims or presumes that the attacker is able to save on\nboth data acquisition and labeling costs. We thoroughly evaluate this\nassumption and find that the attacker often does not. This is because current\nattacks implicitly rely on the adversary being able to sample from the victim\nmodel's data distribution. We thoroughly research factors influencing the\nsuccess of model extraction. We discover that prior knowledge of the attacker,\ni.e., access to in-distribution data, dominates other factors like the attack\npolicy the adversary follows to choose which queries to make to the victim\nmodel API. Our findings urge the community to redefine the adversarial goals of\nME attacks as current evaluation methods misinterpret the ME performance."},{"date":"2023-10","title":"Unsupervised Roofline Extraction from True Orthophotos for LoD2 Building Model Reconstruction","author":"Weixiao Gao, Ravi Peters, and Jantien Stoter","link":"http://arxiv.org/abs/2310.01067v1","abstract":"This paper discusses the reconstruction of LoD2 building models from 2D and\n3D data for large-scale urban environments. Traditional methods involve the use\nof LiDAR point clouds, but due to high costs and long intervals associated with\nacquiring such data for rapidly developing areas, researchers have started\nexploring the use of point clouds generated from (oblique) aerial images.\nHowever, using such point clouds for traditional plane detection-based methods\ncan result in significant errors and introduce noise into the reconstructed\nbuilding models. To address this, this paper presents a method for extracting\nrooflines from true orthophotos using line detection for the reconstruction of\nbuilding models at the LoD2 level. The approach is able to extract relatively\ncomplete rooflines without the need for pre-labeled training data or\npre-trained models. These lines can directly be used in the LoD2 building model\nreconstruction process. The method is superior to existing plane\ndetection-based methods and state-of-the-art deep learning methods in terms of\nthe accuracy and completeness of the reconstructed building. Our source code is\navailable at https://github.com/tudelft3d/Roofline-extraction-from-orthophotos."},{"date":"2023-09","title":"Towards Few-Call Model Stealing via Active Self-Paced Knowledge Distillation and Diffusion-Based Image Generation","author":"Vlad Hondru, and Radu Tudor Ionescu","link":"http://arxiv.org/abs/2310.00096v1","abstract":"Diffusion models showcased strong capabilities in image synthesis, being used\nin many computer vision tasks with great success. To this end, we propose to\nexplore a new use case, namely to copy black-box classification models without\nhaving access to the original training data, the architecture, and the weights\nof the model, \\ie~the model is only exposed through an inference API. More\nspecifically, we can only observe the (soft or hard) labels for some image\nsamples passed as input to the model. 
Furthermore, we consider an additional\nconstraint limiting the number of model calls, mostly focusing our research on\nfew-call model stealing. In order to solve the model extraction task given the\napplied restrictions, we propose the following framework. As training data, we\ncreate a synthetic data set (called proxy data set) by leveraging the ability\nof diffusion models to generate realistic and diverse images. Given a maximum\nnumber of allowed API calls, we pass the respective number of samples through\nthe black-box model to collect labels. Finally, we distill the knowledge of the\nblack-box teacher (attacked model) into a student model (copy of the attacked\nmodel), harnessing both labeled and unlabeled data generated by the diffusion\nmodel. We employ a novel active self-paced learning framework to make the most\nof the proxy data during distillation. Our empirical results on two data sets\nconfirm the superiority of our framework over two state-of-the-art methods in\nthe few-call model extraction scenario."}] \ No newline at end of file +[{"date":"2024-11","title":"Model Stealing for Any Low-Rank Language Model","author":"Allen Liu, and Ankur Moitra","link":"http://arxiv.org/abs/2411.07536v1","abstract":"Model stealing, where a learner tries to recover an unknown model via\ncarefully chosen queries, is a critical problem in machine learning, as it\nthreatens the security of proprietary models and the privacy of data they are\ntrained on. In recent years, there has been particular interest in stealing\nlarge language models (LLMs). In this paper, we aim to build a theoretical\nunderstanding of stealing language models by studying a simple and\nmathematically tractable setting. We study model stealing for Hidden Markov\nModels (HMMs), and more generally low-rank language models.\n We assume that the learner works in the conditional query model, introduced\nby Kakade, Krishnamurthy, Mahajan and Zhang. Our main result is an efficient\nalgorithm in the conditional query model, for learning any low-rank\ndistribution. In other words, our algorithm succeeds at stealing any language\nmodel whose output distribution is low-rank. This improves upon the previous\nresult by Kakade, Krishnamurthy, Mahajan and Zhang, which also requires the\nunknown distribution to have high \"fidelity\", a property that holds only in\nrestricted cases. There are two key insights behind our algorithm: First, we\nrepresent the conditional distributions at each timestep by constructing\nbarycentric spanners among a collection of vectors of exponentially large\ndimension. Second, for sampling from our representation, we iteratively solve a\nsequence of convex optimization problems that involve projection in relative\nentropy to prevent compounding of errors over the length of the sequence. This\nis an interesting example where, at least theoretically, allowing a machine\nlearning model to solve more complex problems at inference time can lead to\ndrastic improvements in its performance."},{"date":"2024-11","title":"AWARE Narrator and the Utilization of Large Language Models to Extract Behavioral Insights from Smartphone Sensing Data","author":"Tianyi Zhang, Miu Kojima, and Simon D'Alfonso","link":"http://arxiv.org/abs/2411.04691v1","abstract":"Smartphones, equipped with an array of sensors, have become valuable tools\nfor personal sensing. 
Particularly in digital health, smartphones facilitate\nthe tracking of health-related behaviors and contexts, contributing\nsignificantly to digital phenotyping, a process where data from digital\ninteractions is analyzed to infer behaviors and assess mental health.\nTraditional methods process raw sensor data into information features for\nstatistical and machine learning analyses. In this paper, we introduce a novel\napproach that systematically converts smartphone-collected data into\nstructured, chronological narratives. The AWARE Narrator translates\nquantitative smartphone sensing data into English language descriptions,\nforming comprehensive narratives of an individual's activities. We apply the\nframework to the data collected from university students over a week,\ndemonstrating the potential of utilizing the narratives to summarize individual\nbehavior, and analyzing psychological states by leveraging large language\nmodels."},{"date":"2024-11","title":"Fine-Tuning Vision-Language Model for Automated Engineering Drawing Information Extraction","author":"Muhammad Tayyab Khan, Lequn Chen, Ye Han Ng, Wenhe Feng, Nicholas Yew Jin Tan, and Seung Ki Moon","link":"http://arxiv.org/abs/2411.03707v1","abstract":"Geometric Dimensioning and Tolerancing (GD&T) plays a critical role in\nmanufacturing by defining acceptable variations in part features to ensure\ncomponent quality and functionality. However, extracting GD&T information from\n2D engineering drawings is a time-consuming and labor-intensive task, often\nrelying on manual efforts or semi-automated tools. To address these challenges,\nthis study proposes an automated and computationally efficient GD&T extraction\nmethod by fine-tuning Florence-2, an open-source vision-language model (VLM).\nThe model is trained on a dataset of 400 drawings with ground truth annotations\nprovided by domain experts. For comparison, two state-of-the-art closed-source\nVLMs, GPT-4o and Claude-3.5-Sonnet, are evaluated on the same dataset. All\nmodels are assessed using precision, recall, F1-score, and hallucination\nmetrics. Due to the computational cost and impracticality of fine-tuning large\nclosed-source VLMs for domain-specific tasks, GPT-4o and Claude-3.5-Sonnet are\nevaluated in a zero-shot setting. In contrast, Florence-2, a smaller model with\n0.23 billion parameters, is optimized through full-parameter fine-tuning across\nthree distinct experiments, each utilizing datasets augmented to different\nlevels. The results show that Florence-2 achieves a 29.95% increase in\nprecision, a 37.75% increase in recall, a 52.40% improvement in F1-score, and a\n43.15% reduction in hallucination rate compared to the best-performing\nclosed-source model. These findings highlight the effectiveness of fine-tuning\nsmaller, open-source VLMs like Florence-2, offering a practical and efficient\nsolution for automated GD&T extraction to support downstream manufacturing\ntasks."},{"date":"2024-11","title":"Speaker Emotion Recognition: Leveraging Self-Supervised Models for Feature Extraction Using Wav2Vec2 and HuBERT","author":"Pourya Jafarzadeh, Amir Mohammad Rostami, and Padideh Choobdar","link":"http://arxiv.org/abs/2411.02964v2","abstract":"Speech is the most natural way of expressing ourselves as humans. Identifying\nemotion from speech is a nontrivial task due to the ambiguous definition of\nemotion itself. Speaker Emotion Recognition (SER) is essential for\nunderstanding human emotional behavior. 
The SER task is challenging due to the\nvariety of speakers, background noise, complexity of emotions, and speaking\nstyles. It has many applications in education, healthcare, customer service,\nand Human-Computer Interaction (HCI). Previously, conventional machine learning\nmethods such as SVM, HMM, and KNN have been used for the SER task. In recent\nyears, deep learning methods have become popular, with convolutional neural\nnetworks and recurrent neural networks being used for SER tasks. The input of\nthese methods is mostly spectrograms and hand-crafted features. In this work,\nwe study the use of self-supervised transformer-based models, Wav2Vec2 and\nHuBERT, to determine the emotion of speakers from their voice. The models\nautomatically extract features from raw audio signals, which are then used for\nthe classification task. The proposed solution is evaluated on reputable\ndatasets, including RAVDESS, SHEMO, SAVEE, AESDD, and Emo-DB. The results show\nthe effectiveness of the proposed method on different datasets. Moreover, the\nmodel has been used for real-world applications like call center conversations,\nand the results demonstrate that the model accurately predicts emotions."},{"date":"2024-11","title":"DM4Steal: Diffusion Model For Link Stealing Attack On Graph Neural Networks","author":"Jinyin Chen, Haonan Ma, and Haibin Zheng","link":"http://arxiv.org/abs/2411.03364v2","abstract":"Graphs have become increasingly integral to the advancement of recommendation\nsystems, particularly with the fast development of graph neural networks (GNN).\nBy exploring the virtue of rich node features and link information, GNN is\ndesigned to provide personalized and accurate suggestions. Meanwhile, the\nprivacy leakage of GNN in such contexts has also captured special attention.\nPrior work has revealed that a malicious user can utilize auxiliary knowledge\nto extract sensitive link data of the target graph, integral to recommendation\nsystems, via the decision made by the target GNN model. This poses a\nsignificant risk to the integrity and confidentiality of data used in\nrecommendation systems. Though important, previous works on GNN's privacy\nleakage are still challenged in three aspects, i.e., limited stealing attack\nscenarios, sub-optimal attack performance, and adaptation against defense. To\naddress these issues, we propose a diffusion model based link stealing attack,\nnamed DM4Steal. It differs from previous work in three critical aspects. (i)\nGenerality: aiming at six attack scenarios with limited auxiliary knowledge, we\npropose a novel training strategy for diffusion models so that DM4Steal is\ntransferable to diverse attack scenarios. (ii) Effectiveness: benefiting from\nthe retention of semantic structure in the diffusion model during the training\nprocess, DM4Steal is capable of learning the precise topology of the target graph\nthrough the GNN decision process. 
(iii) Adaptation: when GNN is defensive\n(e.g., DP, Dropout), DM4Steal relies on the stability that comes from sampling\nthe score model multiple times to keep performance degradation to a minimum;\nthus, DM4Steal implements a successful adaptive attack on defensive GNNs."},{"date":"2024-11","title":"HIP: Hierarchical Point Modeling and Pre-training for Visual Information Extraction","author":"Rujiao Long, Pengfei Wang, Zhibo Yang, and Cong Yao","link":"http://arxiv.org/abs/2411.01139v1","abstract":"End-to-end visual information extraction (VIE) aims at integrating the\nhierarchical subtasks of VIE, including text spotting, word grouping, and\nentity labeling, into a unified framework. Dealing with the gaps among the\nthree subtasks plays a pivotal role in designing an effective VIE model.\nOCR-dependent methods heavily rely on offline OCR engines and inevitably suffer\nfrom OCR errors, while OCR-free methods, particularly those employing a\nblack-box model, might produce outputs that lack interpretability or contain\nhallucinated content. Inspired by CenterNet, DeepSolo, and ESP, we propose HIP,\nwhich models entities as HIerarchical Points to better conform to the\nhierarchical nature of the end-to-end VIE task. Specifically, such hierarchical\npoints can be flexibly encoded and subsequently decoded into desired text\ntranscripts, centers of various regions, and categories of entities.\nFurthermore, we devise corresponding hierarchical pre-training strategies,\ncategorized as image reconstruction, layout learning, and language enhancement,\nto reinforce the cross-modality representation of the hierarchical encoders.\nQuantitative experiments on public benchmarks demonstrate that HIP outperforms\nprevious state-of-the-art methods, while qualitative results show its excellent\ninterpretability."},{"date":"2024-10","title":"Graph-Augmented Relation Extraction Model with LLMs-Generated Support Document","author":"Vicky Dong, Hao Yu, and Yao Chen","link":"http://arxiv.org/abs/2410.23452v1","abstract":"This study introduces a novel approach to sentence-level relation extraction\n(RE) that integrates Graph Neural Networks (GNNs) with Large Language Models\n(LLMs) to generate contextually enriched support documents. By harnessing the\npower of LLMs to generate auxiliary information, our approach crafts an\nintricate graph representation of textual data. This graph is subsequently\nprocessed through a Graph Neural Network (GNN) to refine and enrich the\nembeddings associated with each entity, ensuring a more nuanced and\ninterconnected understanding of the data. This methodology addresses the\nlimitations of traditional sentence-level RE models by incorporating broader\ncontexts and leveraging inter-entity interactions, thereby improving the\nmodel's ability to capture complex relationships across sentences. Our\nexperiments, conducted on the CrossRE dataset, demonstrate the effectiveness of\nour approach, with notable improvements in performance across various domains.\nThe results underscore the potential of combining GNNs with LLM-generated\ncontext to advance the field of relation extraction."},{"date":"2024-10","title":"Image2Struct: Benchmarking Structure Extraction for Vision-Language Models","author":"Josselin Somerville Roberts, Tony Lee, Chi Heem Wong, Michihiro Yasunaga, Yifan Mai, and Percy Liang","link":"http://arxiv.org/abs/2410.22456v1","abstract":"We introduce Image2Struct, a benchmark to evaluate vision-language models\n(VLMs) on extracting structure from images. 
Our benchmark 1) captures\nreal-world use cases, 2) is fully automatic and does not require human\njudgment, and 3) is based on a renewable stream of fresh data. In Image2Struct,\nVLMs are prompted to generate the underlying structure (e.g., LaTeX code or\nHTML) from an input image (e.g., webpage screenshot). The structure is then\nrendered to produce an output image (e.g., rendered webpage), which is compared\nagainst the input image to produce a similarity score. This round-trip\nevaluation allows us to quantitatively evaluate VLMs on tasks with multiple\nvalid structures. We create a pipeline that downloads fresh data from active\nonline communities upon execution and evaluates the VLMs without human\nintervention. We introduce three domains (Webpages, LaTeX, and Musical Scores)\nand use five image metrics (pixel similarity, cosine similarity between the\nInception vectors, learned perceptual image patch similarity, structural\nsimilarity index measure, and earth mover similarity) that allow efficient and\nautomatic comparison between pairs of images. We evaluate Image2Struct on 14\nprominent VLMs and find that scores vary widely, indicating that Image2Struct\ncan differentiate between the performances of different VLMs. Additionally, the\nbest score varies considerably across domains (e.g., 0.402 on sheet music vs.\n0.830 on LaTeX equations), indicating that Image2Struct contains tasks of\nvarying difficulty. For transparency, we release the full results at\nhttps://crfm.stanford.edu/helm/image2struct/v1.0.1/."},{"date":"2024-10","title":"Integrating Deep Feature Extraction and Hybrid ResNet-DenseNet Model for Multi-Class Abnormality Detection in Endoscopic Images","author":"Aman Sagar, Preeti Mehta, Monika Shrivastva, and Suchi Kumari","link":"http://arxiv.org/abs/2410.18457v1","abstract":"This paper presents a deep learning framework for the multi-class\nclassification of gastrointestinal abnormalities in Video Capsule Endoscopy\n(VCE) frames. The aim is to automate the identification of ten GI abnormality\nclasses, including angioectasia, bleeding, and ulcers, thereby reducing the\ndiagnostic burden on gastroenterologists. Utilizing an ensemble of DenseNet and\nResNet architectures, the proposed model achieves an overall accuracy of 94\\%\nacross a well-structured dataset. Precision scores range from 0.56 for erythema\nto 1.00 for worms, with recall rates peaking at 98% for normal findings. This\nstudy emphasizes the importance of robust data preprocessing techniques,\nincluding normalization and augmentation, in enhancing model performance. The\ncontributions of this work lie in developing an effective AI-driven tool that\nstreamlines the diagnostic process in gastroenterology, ultimately improving\npatient care and clinical outcomes."},{"date":"2024-10","title":"Extracting Spatiotemporal Data from Gradients with Large Language Models","author":"Lele Zheng, Yang Cao, Renhe Jiang, Kenjiro Taura, Yulong Shen, Sheng Li, and Masatoshi Yoshikawa","link":"http://arxiv.org/abs/2410.16121v1","abstract":"Recent works show that sensitive user data can be reconstructed from gradient\nupdates, breaking the key privacy promise of federated learning. While success\nwas demonstrated primarily on image data, these methods do not directly\ntransfer to other domains, such as spatiotemporal data. 
To understand privacy\nrisks in spatiotemporal federated learning, we first propose Spatiotemporal\nGradient Inversion Attack (ST-GIA), a gradient attack algorithm tailored to\nspatiotemporal data that successfully reconstructs the original location from\ngradients. Furthermore, the absence of priors in attacks on spatiotemporal data\nhas hindered the accurate reconstruction of real client data. To address this\nlimitation, we propose ST-GIA+, which utilizes an auxiliary language model to\nguide the search for potential locations, thereby successfully reconstructing\nthe original data from gradients. In addition, we design an adaptive defense\nstrategy to mitigate gradient inversion attacks in spatiotemporal federated\nlearning. By dynamically adjusting the perturbation levels, we can offer\ntailored protection for varying rounds of training data, thereby achieving a\nbetter trade-off between privacy and utility than current state-of-the-art\nmethods. Through intensive experimental analysis on three real-world datasets,\nwe reveal that the proposed defense strategy can well preserve the utility of\nspatiotemporal federated learning with effective security protection."},{"date":"2024-10","title":"Kaninfradet3D:A Road-side Camera-LiDAR Fusion 3D Perception Model based on Nonlinear Feature Extraction and Intrinsic Correlation","author":"Pei Liu, Nanfang Zheng, Yiqun Li, Junlan Chen, and Ziyuan Pu","link":"http://arxiv.org/abs/2410.15814v1","abstract":"With the development of AI-assisted driving, numerous methods have emerged\nfor ego-vehicle 3D perception tasks, but there has been limited research on\nroadside perception. With its ability to provide a global view and a broader\nsensing range, the roadside perspective is worth developing. LiDAR provides\nprecise three-dimensional spatial information, while cameras offer semantic\ninformation. These two modalities are complementary in 3D detection. However,\nadding camera data does not increase accuracy in some studies since the\ninformation extraction and fusion procedure is not sufficiently reliable.\nRecently, Kolmogorov-Arnold Networks (KANs) have been proposed as replacements\nfor MLPs, which are better suited for high-dimensional, complex data. Both the\ncamera and the LiDAR provide high-dimensional information, and employing KANs\nshould enhance the extraction of valuable features to produce better fusion\noutcomes. This paper proposes Kaninfradet3D, which optimizes the feature\nextraction and fusion modules. To extract features from complex\nhigh-dimensional data, the model's encoder and fuser modules were improved\nusing KAN Layers. Cross-attention was applied to enhance feature fusion, and\nvisual comparisons verified that camera features were more evenly integrated.\nThis addressed the issue of camera features being abnormally concentrated,\nnegatively impacting fusion. Compared to the benchmark, our approach shows\nimprovements of +9.87 mAP and +10.64 mAP in the two viewpoints of the TUMTraf\nIntersection Dataset and an improvement of +1.40 mAP in the roadside end of the\nTUMTraf V2X Cooperative Perception Dataset. 
The results indicate that\nKaninfradet3D can effectively fuse features, demonstrating the potential of\napplying KANs in roadside perception tasks."},{"date":"2024-10","title":"Efficient Model Extraction via Boundary Sampling","author":"Maor Biton Dor, and Yisroel Mirsky","link":"http://arxiv.org/abs/2410.15429v1","abstract":"This paper introduces a novel data-free model extraction attack that\nsignificantly advances the current state-of-the-art in terms of efficiency,\naccuracy, and effectiveness. Traditional black-box methods rely on using the\nvictim's model as an oracle to label a vast number of samples within\nhigh-confidence areas. This approach not only requires an extensive number of\nqueries but also results in a less accurate and less transferable model. In\ncontrast, our method innovates by focusing on sampling low-confidence areas\n(along the decision boundaries) and employing an evolutionary algorithm to\noptimize the sampling process. These novel contributions allow for a dramatic\nreduction in the number of queries needed by the attacker by a factor of 10x to\n600x while simultaneously improving the accuracy of the stolen model. Moreover,\nour approach improves boundary alignment, resulting in better transferability\nof adversarial examples from the stolen model to the victim's model (increasing\nthe attack success rate from 60\\% to 82\\% on average). Finally, we accomplish\nall of this with a strict black-box assumption on the victim, with no knowledge\nof the target's architecture or dataset.\n We demonstrate our attack on three datasets with increasingly larger\nresolutions and compare our performance to four state-of-the-art model\nextraction attacks."},{"date":"2024-10","title":"Transit Pulse: Utilizing Social Media as a Source for Customer Feedback and Information Extraction with Large Language Model","author":"Jiahao Wang, and Amer Shalaby","link":"http://arxiv.org/abs/2410.15016v1","abstract":"Users of the transit system flood social networks daily with messages that\ncontain valuable insights crucial for improving service quality. These posts\nhelp transit agencies quickly identify emerging issues. Parsing topics and\nsentiments is key to gaining comprehensive insights to foster service\nexcellence. However, the volume of messages makes manual analysis impractical,\nand standard NLP techniques like Term Frequency-Inverse Document Frequency\n(TF-IDF) fall short in nuanced interpretation. Traditional sentiment analysis\nseparates topics and sentiments before integrating them, often missing the\ninteraction between them. This incremental approach complicates classification\nand reduces analytical productivity. To address these challenges, we propose a\nnovel approach to extracting and analyzing transit-related information,\nincluding sentiment and sarcasm detection, identification of unusual system\nproblems, and location data from social media. Our method employs Large\nLanguage Models (LLM), specifically Llama 3, for a streamlined analysis free\nfrom pre-established topic labels. To enhance the model's domain-specific\nknowledge, we utilize Retrieval-Augmented Generation (RAG), integrating\nexternal knowledge sources into the information extraction pipeline. We\nvalidated our method through extensive experiments comparing its performance\nwith traditional NLP approaches on user tweet data from the real world transit\nsystem. 
Our results demonstrate the potential of LLMs to transform social media\ndata analysis in the public transit domain, providing actionable insights and\nenhancing transit agencies' responsiveness by extracting a broader range of\ninformation."},{"date":"2024-10","title":"Few-Shot Joint Multimodal Entity-Relation Extraction via Knowledge-Enhanced Cross-modal Prompt Model","author":"Li Yuan, Yi Cai, and Junsheng Huang","link":"http://arxiv.org/abs/2410.14225v1","abstract":"Joint Multimodal Entity-Relation Extraction (JMERE) is a challenging task\nthat aims to extract entities and their relations from text-image pairs in\nsocial media posts. Existing methods for JMERE require large amounts of labeled\ndata. However, gathering and annotating fine-grained multimodal data for JMERE\nposes significant challenges. Initially, we construct diverse and comprehensive\nmultimodal few-shot datasets fitted to the original data distribution. To\naddress the insufficient information in the few-shot setting, we introduce the\n\\textbf{K}nowledge-\\textbf{E}nhanced \\textbf{C}ross-modal \\textbf{P}rompt\n\\textbf{M}odel (KECPM) for JMERE. This method can effectively address the\nproblem of insufficient information in the few-shot setting by guiding a large\nlanguage model to generate supplementary background knowledge. Our proposed\nmethod comprises two stages: (1) a knowledge ingestion stage that dynamically\nformulates prompts based on semantic similarity to guide ChatGPT in generating\nrelevant knowledge, and employs self-reflection to refine the knowledge; (2) a\nknowledge-enhanced language model stage that merges the auxiliary knowledge\nwith the original input and utilizes a transformer-based model to align with\nJMERE's required output format. We extensively evaluate our approach on a\nfew-shot dataset derived from the JMERE dataset, demonstrating its superiority\nover strong baselines in terms of both micro and macro F$_1$ scores.\nAdditionally, we present qualitative analyses and case studies to elucidate the\neffectiveness of our model."},{"date":"2024-10","title":"Supply Chain Network Extraction and Entity Classification Leveraging Large Language Models","author":"Tong Liu, and Hadi Meidani","link":"http://arxiv.org/abs/2410.13051v1","abstract":"Supply chain networks are critical to the operational efficiency of\nindustries, yet their increasing complexity presents significant challenges in\nmapping relationships and identifying the roles of various entities.\nTraditional methods for constructing supply chain networks rely heavily on\nstructured datasets and manual data collection, limiting their scope and\nefficiency. In contrast, recent advancements in Natural Language Processing\n(NLP) and large language models (LLMs) offer new opportunities for discovering\nand analyzing supply chain networks using unstructured text data. This paper\nproposes a novel approach that leverages LLMs to extract and process raw\ntextual information from publicly available sources to construct a\ncomprehensive supply chain graph. We focus on the civil engineering sector as a\ncase study, demonstrating how LLMs can uncover hidden relationships among\ncompanies, projects, and other entities. Additionally, we fine-tune an LLM to\nclassify entities within the supply chain graph, providing detailed insights\ninto their roles and relationships. The results show that domain-specific\nfine-tuning improves classification accuracy, highlighting the potential of\nLLMs for industry-specific supply chain analysis. 
Our contributions include the\ndevelopment of a supply chain graph for the civil engineering sector, as well\nas a fine-tuned LLM model that enhances entity classification and understanding\nof supply chain networks."},{"date":"2024-10","title":"CoreGuard: Safeguarding Foundational Capabilities of LLMs Against Model Stealing in Edge Deployment","author":"Qinfeng Li, Yangfan Xie, Tianyu Du, Zhiqiang Shen, Zhenghan Qin, Hao Peng, Xinkui Zhao, Xianwei Zhu, Jianwei Yin, and Xuhong Zhang","link":"http://arxiv.org/abs/2410.13903v1","abstract":"Proprietary large language models (LLMs) demonstrate exceptional\ngeneralization ability across various tasks. Additionally, deploying LLMs on\nedge devices is trending for efficiency and privacy reasons. However, edge\ndeployment of proprietary LLMs introduces new security threats: attackers who\nobtain an edge-deployed LLM can easily use it as a base model for various tasks\ndue to its high generalization ability, which we call foundational capability\nstealing. Unfortunately, existing model protection mechanisms are often\ntask-specific and fail to protect general-purpose LLMs, as they mainly focus on\nprotecting task-related parameters using trusted execution environments (TEEs).\nAlthough some recent TEE-based methods are able to protect the overall model\nparameters in a computation-efficient way, they still suffer from prohibitive\ncommunication costs between TEE and CPU/GPU, making it impractical to deploy\nfor edge LLMs. To protect the foundational capabilities of edge LLMs, we\npropose CoreGuard, a computation- and communication-efficient model protection\napproach against model stealing on edge devices. The core component of\nCoreGuard is a lightweight and propagative authorization module residing in\nTEE. Extensive experiments show that CoreGuard achieves the same security\nprotection as the black-box security guarantees with negligible overhead."},{"date":"2024-10","title":"Identity-Focused Inference and Extraction Attacks on Diffusion Models","author":"Jayneel Vora, Aditya Krishnan, Nader Bouacida, Prabhu RV Shankar, and Prasant Mohapatra","link":"http://arxiv.org/abs/2410.10177v1","abstract":"The increasing reliance on diffusion models for generating synthetic images\nhas amplified concerns about the unauthorized use of personal data,\nparticularly facial images, in model training. In this paper, we introduce a\nnovel identity inference framework to hold model owners accountable for\nincluding individuals' identities in their training data. Our approach moves\nbeyond traditional membership inference attacks by focusing on identity-level\ninference, providing a new perspective on data privacy violations. 
Through\ncomprehensive evaluations on two facial image datasets, Labeled Faces in the\nWild (LFW) and CelebA, our experiments demonstrate that the proposed membership\ninference attack surpasses baseline methods, achieving an attack success rate\nof up to 89% and an AUC-ROC of 0.91, while the identity inference attack\nattains 92% on LDM models trained on LFW, and the data extraction attack\nachieves 91.6% accuracy on DDPMs, validating the effectiveness of our approach\nacross diffusion models."},{"date":"2024-10","title":"Enabling Advanced Land Cover Analytics: An Integrated Data Extraction Pipeline for Predictive Modeling with the Dynamic World Dataset","author":"Victor Radermecker, Andrea Zanon, Nancy Thomas, Annita Vapsi, Saba Rahimi, Rama Ramakrishnan, and Daniel Borrajo","link":"http://arxiv.org/abs/2410.09135v1","abstract":"Understanding land cover holds considerable potential for a myriad of\npractical applications, particularly as data accessibility transitions from\nbeing exclusive to governmental and commercial entities to now including the\nbroader research community. Nevertheless, although the data is accessible to\nany community member interested in exploration, there exists a formidable\nlearning curve and no standardized process for accessing, pre-processing, and\nleveraging the data for subsequent tasks. In this study, we democratize this\ndata by presenting a flexible and efficient end-to-end pipeline for working\nwith the Dynamic World dataset, a cutting-edge near-real-time land use/land\ncover (LULC) dataset. This includes a pre-processing and representation\nframework which tackles noise removal, efficient extraction of large amounts of\ndata, and re-representation of LULC data in a format well suited for several\ndownstream tasks. To demonstrate the power of our pipeline, we use it to\nextract data for an urbanization prediction problem and build a suite of\nmachine learning models with excellent performance. This task is easily\ngeneralizable to the prediction of any type of land cover and our pipeline is\nalso compatible with a series of other downstream tasks."},{"date":"2024-10","title":"Contrastive Learning to Fine-Tune Feature Extraction Models for the Visual Cortex","author":"Alex Mulrooney, and Austin J. Brockmeier","link":"http://arxiv.org/abs/2410.06067v1","abstract":"Predicting the neural response to natural images in the visual cortex\nrequires extracting relevant features from the images and relating those\nfeatures to the observed responses. In this work, we optimize the feature\nextraction in order to maximize the information shared between the image\nfeatures and the neural response across voxels in a given region of interest\n(ROI) extracted from the BOLD signal measured by fMRI. We adapt contrastive\nlearning (CL) to fine-tune a convolutional neural network, which was pretrained\nfor image classification, such that the mapping of a given image's features is\nmore similar to the corresponding fMRI response than to the responses to other\nimages. We exploit the recently released Natural Scenes Dataset (Allen et al.,\n2022) as organized for the Algonauts Project (Gifford et al., 2023), which\ncontains the high-resolution fMRI responses of eight subjects to tens of\nthousands of naturalistic images. 
We show that CL fine-tuning creates feature\nextraction models that enable higher encoding accuracy in early visual ROIs as\ncompared to both the pretrained network and a baseline approach that uses a\nregression loss at the output of the network to tune it for fMRI response\nencoding. We investigate inter-subject transfer of the CL fine-tuned models,\nincluding subjects from another, lower-resolution dataset (Gong et al., 2023).\nWe also pool subjects for fine-tuning to further improve the encoding\nperformance. Finally, we examine the performance of the fine-tuned models on\ncommon image classification tasks, explore the landscape of ROI-specific models\nby applying dimensionality reduction on the Bhattacharya dissimilarity matrix\ncreated using the predictions on those tasks (Mao et al., 2024), and\ninvestigate lateralization of the processing for early visual ROIs using\nsalience maps of the classifiers built on the CL-tuned models."},{"date":"2024-10","title":"Polynomial Time Cryptanalytic Extraction of Deep Neural Networks in the Hard-Label Setting","author":"Nicholas Carlini, Jorge Ch\u00e1vez-Saab, Anna Hambitzer, Francisco Rodr\u00edguez-Henr\u00edquez, and Adi Shamir","link":"http://arxiv.org/abs/2410.05750v1","abstract":"Deep neural networks (DNNs) are valuable assets, yet their public\naccessibility raises security concerns about parameter extraction by malicious\nactors. Recent work by Carlini et al. (crypto'20) and Canales-Mart\\'inez et al.\n(eurocrypt'24) has drawn parallels between this issue and block cipher key\nextraction via chosen plaintext attacks. Leveraging differential cryptanalysis,\nthey demonstrated that all the weights and biases of black-box ReLU-based DNNs\ncould be inferred using a polynomial number of queries and computational time.\nHowever, their attacks relied on the availability of the exact numeric value of\noutput logits, which allowed the calculation of their derivatives. To overcome\nthis limitation, Chen et al. (asiacrypt'24) tackled the more realistic\nhard-label scenario, where only the final classification label (e.g., \"dog\" or\n\"car\") is accessible to the attacker. They proposed an extraction method\nrequiring a polynomial number of queries but an exponential execution time. In\naddition, their approach was applicable only to a restricted set of\narchitectures, could deal only with binary classifiers, and was demonstrated\nonly on tiny neural networks with up to four neurons split among up to two\nhidden layers. This paper introduces new techniques that, for the first time,\nachieve cryptanalytic extraction of DNN parameters in the most challenging\nhard-label setting, using both a polynomial number of queries and polynomial\ntime. We validate our approach by extracting nearly one million parameters from\na DNN trained on the CIFAR-10 dataset, comprising 832 neurons in four hidden\nlayers. Our results reveal the surprising fact that all the weights of a\nReLU-based DNN can be efficiently determined by analyzing only the geometric\nshape of its decision boundaries."},{"date":"2024-10","title":"Multiscale Latent Diffusion Model for Enhanced Feature Extraction from Medical Images","author":"Rabeya Tus Sadia, Jie Zhang, and Jin Chen","link":"http://arxiv.org/abs/2410.04000v2","abstract":"Various imaging modalities are used in patient diagnosis, each offering\nunique advantages and valuable insights into anatomy and pathology. 
Computed\nTomography (CT) is crucial in diagnostics, providing high-resolution images for\nprecise internal organ visualization. CT's ability to detect subtle tissue\nvariations is vital for diagnosing diseases like lung cancer, enabling early\ndetection and accurate tumor assessment. However, variations in CT scanner\nmodels and acquisition protocols introduce significant variability in the\nextracted radiomic features, even when imaging the same patient. This\nvariability poses considerable challenges for downstream research and clinical\nanalysis, which depend on consistent and reliable feature extraction. Current\nmethods for medical image feature extraction, often based on supervised\nlearning approaches, including GAN-based models, face limitations in\ngeneralizing across different imaging environments. In response to these\nchallenges, we propose LTDiff++, a multiscale latent diffusion model designed\nto enhance feature extraction in medical imaging. The model addresses\nvariability by standardizing non-uniform distributions in the latent space,\nimproving feature consistency. LTDiff++ utilizes a UNet++ encoder-decoder\narchitecture coupled with a conditional Denoising Diffusion Probabilistic Model\n(DDPM) at the latent bottleneck to achieve robust feature extraction and\nstandardization. Extensive empirical evaluations on both patient and phantom CT\ndatasets demonstrate significant improvements in image standardization, with\nhigher Concordance Correlation Coefficients (CCC) across multiple radiomic\nfeature categories. Through these advancements, LTDiff++ represents a promising\nsolution for overcoming the inherent variability in medical imaging data,\noffering improved reliability and accuracy in feature extraction processes."},{"date":"2024-10","title":"A Novel Feature Extraction Model for the Detection of Plant Disease from Leaf Images in Low Computational Devices","author":"Rikathi Pal, Anik Basu Bhaumik, Arpan Murmu, Sanoar Hossain, Biswajit Maity, and Soumya Sen","link":"http://arxiv.org/abs/2410.01854v1","abstract":"Diseases in plants pose a significant danger to productive and secure\nagriculture. Detecting plant diseases early and accurately reduces crop\nlosses and pesticide use. Traditional methods of plant disease identification,\non the other hand, are generally time-consuming and require professional\nexpertise. It would be beneficial to the farmers if they could detect the\ndisease quickly by taking images of the leaf directly. This will be a\ntime-saving process and they can take remedial actions immediately. To achieve\nthis, a novel feature extraction approach for detecting tomato plant illnesses\nfrom leaf photos using low-cost computing systems such as mobile phones is\nproposed in this study. The proposed approach integrates various types of Deep\nLearning techniques to extract robust and discriminative features from leaf\nimages. After the proposed feature extraction, comparisons have been made across\nfive cutting-edge deep learning models: AlexNet, ResNet50, VGG16, VGG19, and\nMobileNet. The dataset contains 10,000 leaf photos from ten classes of tomato\nillnesses and one class of healthy leaves. 
Experimental findings demonstrate\nthat AlexNet has an accuracy score of 87%, with the benefit of being quick and\nlightweight, making it appropriate for use on embedded systems and other\nlow-processing devices like smartphones."},{"date":"2024-10","title":"Preserving Generalization of Language models in Few-shot Continual Relation Extraction","author":"Quyen Tran, Nguyen Xuan Thanh, Nguyen Hoang Anh, Nam Le Hai, Trung Le, Linh Van Ngo, and Thien Huu Nguyen","link":"http://arxiv.org/abs/2410.00334v1","abstract":"Few-shot Continual Relation Extraction (FCRE) is an emerging and dynamic\narea of study where models can sequentially integrate knowledge from new\nrelations with limited labeled data while circumventing catastrophic forgetting\nand preserving prior knowledge from pre-trained backbones. In this work, we\nintroduce a novel method that leverages often-discarded language model heads.\nBy employing these components via a mutual information maximization strategy,\nour approach helps maintain prior knowledge from the pre-trained backbone and\nstrategically aligns the primary classification head, thereby enhancing model\nperformance. Furthermore, we explore the potential of Large Language Models\n(LLMs), renowned for their wealth of knowledge, in addressing FCRE challenges.\nOur comprehensive experimental results underscore the efficacy of the proposed\nmethod and offer valuable insights for future work."},{"date":"2024-09","title":"Towards Robust Extractive Question Answering Models: Rethinking the Training Methodology","author":"Son Quoc Tran, and Matt Kretchmar","link":"http://arxiv.org/abs/2409.19766v1","abstract":"This paper proposes a novel training method to improve the robustness of\nExtractive Question Answering (EQA) models. Previous research has shown that\nexisting models, when trained on EQA datasets that include unanswerable\nquestions, demonstrate a significant lack of robustness against distribution\nshifts and adversarial attacks. Despite this, the inclusion of unanswerable\nquestions in EQA training datasets is essential for ensuring real-world\nreliability. Our proposed training method includes a novel loss function for\nthe EQA problem and challenges an implicit assumption present in numerous EQA\ndatasets. Models trained with our method maintain in-domain performance while\nachieving a notable improvement on out-of-domain datasets. This results in an\noverall F1 score improvement of 5.7 across all testing sets. Furthermore, our\nmodels exhibit significantly enhanced robustness against two types of\nadversarial attacks, with a performance decrease of only about a third compared\nto the default models."},{"date":"2024-09","title":"INSIGHTBUDDY-AI: Medication Extraction and Entity Linking using Large Language Models and Ensemble Learning","author":"Pablo Romero, Lifeng Han, and Goran Nenadic","link":"http://arxiv.org/abs/2409.19467v1","abstract":"Medication Extraction and Mining play an important role in healthcare NLP\nresearch due to their practical applications in hospital settings, such as their\nmapping into standard clinical knowledge bases (SNOMED-CT, BNF, etc.). In this\nwork, we investigate state-of-the-art LLMs in text mining tasks on medications\nand their related attributes such as dosage, route, strength, and adverse\neffects. In addition, we explore different ensemble learning methods\n(\textsc{Stack-Ensemble} and \textsc{Voting-Ensemble}) to augment the\nperformance of the individual LLMs. 
Our ensemble learning results demonstrated\nbetter performance than the individually fine-tuned base models BERT, RoBERTa,\nRoBERTa-L, BioBERT, BioClinicalBERT, BioMedRoBERTa, ClinicalBERT, and\nPubMedBERT across general and specific domains. Finally, we build up an entity\nlinking function to map extracted medical terminologies into the SNOMED-CT\ncodes and the British National Formulary (BNF) codes, which are further mapped\nto the Dictionary of Medicines and Devices (dm+d), and ICD. Our model's toolkit\nand desktop applications are publicly available at\n\url{https://github.com/HECTA-UoM/ensemble-NER}."},{"date":"2024-09","title":"Semi-strong Efficient Market of Bitcoin and Twitter: an Analysis of Semantic Vector Spaces of Extracted Keywords and Light Gradient Boosting Machine Models","author":"Fang Wang, and Marko Gacesa","link":"http://arxiv.org/abs/2409.15988v1","abstract":"This study extends the examination of the Efficient-Market Hypothesis in the\nBitcoin market during a five-year fluctuation period, from September 1, 2017 to\nSeptember 1, 2022, by analyzing 28,739,514 qualified tweets containing the\ntargeted topic \"Bitcoin\". Unlike previous studies, we extracted fundamental\nkeywords as an informative proxy for carrying out the study of the EMH in the\nBitcoin market rather than focusing on sentiment analysis, information volume,\nor price data. We tested market efficiency in hourly, 4-hourly, and daily time\nperiods to understand the speed and accuracy of market reactions towards the\ninformation within different thresholds. A sequence of machine learning methods\nand textual analyses were used, including measurements of distances of semantic\nvector spaces of information, keywords extraction and encoding model, and Light\nGradient Boosting Machine (LGBM) classifiers. Our results suggest that 78.06%\n(83.08%), 84.63% (87.77%), and 94.03% (94.60%) of hourly, 4-hourly, and daily\nbullish (bearish) market movements can be attributed to public information\nwithin organic tweets."},{"date":"2024-09","title":"ASTE Transformer Modelling Dependencies in Aspect-Sentiment Triplet Extraction","author":"Iwo Naglik, and Mateusz Lango","link":"http://arxiv.org/abs/2409.15202v2","abstract":"Aspect-Sentiment Triplet Extraction (ASTE) is a recently proposed task of\naspect-based sentiment analysis that consists in extracting (aspect phrase,\nopinion phrase, sentiment polarity) triples from a given sentence. Recent\nstate-of-the-art methods approach this task by first extracting all possible\ntext spans from a given text, then filtering the potential aspect and opinion\nphrases with a classifier, and finally considering all their pairs with another\nclassifier that additionally assigns sentiment polarity to them. Although\nseveral variations of the above scheme have been proposed, the common feature\nis that the final result is constructed by a sequence of independent classifier\ndecisions. This hinders the exploitation of dependencies between extracted\nphrases and prevents the use of knowledge about the interrelationships between\nclassifier predictions to improve performance. In this paper, we propose a new\nASTE approach consisting of three transformer-inspired layers, which enables\nthe modelling of dependencies both between phrases and between the final\nclassifier decisions. Experimental results show that the method achieves higher\nperformance in terms of F1 measure than other methods studied on popular\nbenchmarks. 
In addition, we show that a simple pre-training technique further\nimproves the performance of the model."},{"date":"2024-09","title":"Efficient and Effective Model Extraction","author":"Hongyu Zhu, Wentao Hu, Sichu Liang, Fangqi Li, Wenwen Wang, and Shilin Wang","link":"http://arxiv.org/abs/2409.14122v2","abstract":"Model extraction aims to create a functionally similar copy from a machine\nlearning as a service (MLaaS) API with minimal overhead, typically for illicit\nprofit or as a precursor to further attacks, posing a significant threat to the\nMLaaS ecosystem. However, recent studies have shown that model extraction is\nhighly inefficient, particularly when the target task distribution is\nunavailable. In such cases, even substantially increasing the attack budget\nfails to produce a sufficiently similar replica, reducing the adversary's\nmotivation to pursue extraction attacks. In this paper, we revisit the\nelementary design choices throughout the extraction lifecycle. We propose an\nembarrassingly simple yet dramatically effective algorithm, Efficient and\nEffective Model Extraction (E3), focusing on both query preparation and\ntraining routine. E3 achieves superior generalization compared to\nstate-of-the-art methods while minimizing computational costs. For instance,\nwith only 0.005 times the query budget and less than 0.2 times the runtime, E3\noutperforms classical generative model based data-free model extraction by an\nabsolute accuracy improvement of over 50% on CIFAR-10. Our findings underscore\nthe persistent threat posed by model extraction and suggest that it could serve\nas a valuable benchmarking algorithm for future security evaluations."},{"date":"2024-09","title":"Hard-Label Cryptanalytic Extraction of Neural Network Models","author":"Yi Chen, Xiaoyang Dong, Jian Guo, Yantian Shen, Anyu Wang, and Xiaoyun Wang","link":"http://arxiv.org/abs/2409.11646v1","abstract":"The machine learning problem of extracting neural network parameters has been\nproposed for nearly three decades. Functionally equivalent extraction is a\ncrucial goal for research on this problem. When the adversary has access to the\nraw output of neural networks, various attacks, including those presented at\nCRYPTO 2020 and EUROCRYPT 2024, have successfully achieved this goal. However,\nthis goal is not achieved when neural networks operate under a hard-label\nsetting where the raw output is inaccessible.\n In this paper, we propose the first attack that theoretically achieves\nfunctionally equivalent extraction under the hard-label setting, which applies\nto ReLU neural networks. The effectiveness of our attack is validated through\npractical experiments on a wide range of ReLU neural networks, including neural\nnetworks trained on two real benchmarking datasets (MNIST, CIFAR10) widely used\nin computer vision. For a neural network consisting of $10^5$ parameters, our\nattack only requires several hours on a single core."},{"date":"2024-09","title":"CaBaGe: Data-Free Model Extraction using ClAss BAlanced Generator Ensemble","author":"Jonathan Rosenthal, Shanchao Liang, Kevin Zhang, and Lin Tan","link":"http://arxiv.org/abs/2409.10643v1","abstract":"Machine Learning as a Service (MLaaS) is often provided as a pay-per-query,\nblack-box system to clients. Such a black-box approach not only hinders open\nreplication, validation, and interpretation of model results, but also makes it\nharder for white-hat researchers to identify vulnerabilities in the MLaaS\nsystems. 
Model extraction is a promising technique to address these challenges\nby reverse-engineering black-box models. Since training data is typically\nunavailable for MLaaS models, this paper focuses on the realistic version of\nthe problem: data-free model extraction. We propose a data-free model extraction\napproach, CaBaGe, to achieve higher model extraction accuracy with a small\nnumber of queries. Our innovations include (1) a novel experience replay for\nfocusing on difficult training samples; (2) an ensemble of generators for\nsteadily producing diverse synthetic data; and (3) a selective filtering\nprocess for querying the victim model with harder, more balanced samples. In\naddition, we create a more realistic setting, for the first time, where the\nattacker has no knowledge of the number of classes in the victim training data,\nand create a solution to learn the number of classes on the fly. Our evaluation\nshows that CaBaGe outperforms existing techniques on seven datasets -- MNIST,\nFMNIST, SVHN, CIFAR-10, CIFAR-100, ImageNet-subset, and Tiny ImageNet --\nimproving the accuracy of the extracted models by up to 43.13%. Furthermore,\nthe number of queries required to extract a clone model matching the final\naccuracy of prior work is reduced by up to 75.7%."},{"date":"2024-09","title":"Language Models and Retrieval Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports","author":"Mohamed Sobhi Jabal, Pranav Warman, Jikai Zhang, Kartikeye Gupta, Ayush Jain, Maciej Mazurowski, Walter Wiggins, Kirti Magudia, and Evan Calabrese","link":"http://arxiv.org/abs/2409.10576v2","abstract":"Purpose: To develop and evaluate an automated system for extracting\nstructured clinical information from unstructured radiology and pathology\nreports using open-weights large language models (LMs) and retrieval augmented\ngeneration (RAG), and to assess the effects of model configuration variables on\nextraction performance. Methods and Materials: The study utilized two datasets:\n7,294 radiology reports annotated for Brain Tumor Reporting and Data System\n(BT-RADS) scores and 2,154 pathology reports annotated for isocitrate\ndehydrogenase (IDH) mutation status. An automated pipeline was developed to\nbenchmark the performance of various LMs and RAG configurations. The impact of\nmodel size, quantization, prompting strategies, output formatting, and\ninference parameters was systematically evaluated. Results: The best-performing\nmodels achieved over 98% accuracy in extracting BT-RADS scores from radiology\nreports and over 90% for IDH mutation status extraction from pathology reports.\nThe top model was a medically fine-tuned Llama3. Larger, newer, and domain\nfine-tuned models consistently outperformed older and smaller models. Model\nquantization had minimal impact on performance. Few-shot prompting\nsignificantly improved accuracy. RAG improved performance for complex pathology\nreports but not for shorter radiology reports. Conclusions: Open LMs\ndemonstrate significant potential for automated extraction of structured\nclinical data from unstructured clinical reports with local privacy-preserving\napplication. Careful model selection, prompt engineering, and semi-automated\noptimization using annotated data are critical for optimal performance. 
These\napproaches could be reliable enough for practical use in research workflows,\nhighlighting the potential for human-machine collaboration in healthcare data\nextraction."},{"date":"2024-09","title":"TSELM: Target Speaker Extraction using Discrete Tokens and Language Models","author":"Beilong Tang, Bang Zeng, and Ming Li","link":"http://arxiv.org/abs/2409.07841v3","abstract":"We propose TSELM, a novel target speaker extraction network that leverages\ndiscrete tokens and language models. TSELM utilizes multiple discretized layers\nfrom WavLM as input tokens and incorporates cross-attention mechanisms to\nintegrate target speaker information. Language models are employed to capture\nthe sequence dependencies, while a scalable HiFi-GAN is used to reconstruct the\naudio from the tokens. By applying a cross-entropy loss, TSELM models the\nprobability distribution of output tokens, thus converting the complex\nregression problem of audio generation into a classification task. Experimental\nresults show that TSELM achieves excellent results in speech quality and\ncomparable results in speech intelligibility."},{"date":"2024-09","title":"Alignment-Aware Model Extraction Attacks on Large Language Models","author":"Zi Liang, Qingqing Ye, Yanyun Wang, Sen Zhang, Yaxin Xiao, Ronghua Li, Jianliang Xu, and Haibo Hu","link":"http://arxiv.org/abs/2409.02718v1","abstract":"Model extraction attacks (MEAs) on large language models (LLMs) have received\nincreasing research attention lately. Existing attack methods on LLMs inherit\nthe extraction strategies from those designed for deep neural networks (DNNs)\nyet neglect the inconsistency of training tasks between MEA and LLMs'\nalignments. As such, they result in poor attack performances. To tackle this\nissue, we present Locality Reinforced Distillation (LoRD), a novel model\nextraction attack algorithm specifically for LLMs. In particular, we design a\npolicy-gradient-style training task, which utilizes victim models' responses as\na signal to guide the crafting of preference for the local model. Theoretical\nanalysis has shown that i) LoRD's convergence procedure in MEAs is consistent\nwith the alignments of LLMs, and ii) LoRD can reduce query complexity while\nmitigating watermark protection through exploration-based stealing. Extensive\nexperiments on domain-specific extractions demonstrate the superiority of our\nmethod by examining the extraction of various state-of-the-art commercial LLMs."},{"date":"2024-09","title":"AdaComp: Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models","author":"Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, and Zhiming Zheng","link":"http://arxiv.org/abs/2409.01579v1","abstract":"Retrieved documents containing noise will hinder RAG from detecting answer\nclues and make the inference process slow and expensive. Therefore, context\ncompression is necessary to enhance its accuracy and efficiency. Existing\ncontext compression methods use extractive or generative models to retain the\nmost query-relevant sentences or apply the information bottleneck theory to\npreserve sufficient information. However, these methods may face issues such as\nover-compression or high computational costs. 
We observe that the retriever\noften ranks relevant documents at the top, but the exact number of documents\nneeded to answer the query is uncertain due to the impact of query complexity\nand retrieval quality: complex queries like multi-hop questions may require\nretaining more documents than simpler queries, and a low-quality retrieval may\nneed to rely on more documents to generate accurate outputs. Therefore,\ndetermining the minimum number of required documents (compression rate) is\nstill a challenge for RAG. In this paper, we introduce AdaComp, a low-cost\nextractive context compression method that adaptively determines the\ncompression rate based on both query complexity and retrieval quality.\nSpecifically, we first annotate the minimum top-k documents necessary for the\nRAG system to answer the current query as the compression rate and then\nconstruct triplets of the query, retrieved documents, and its compression rate.\nThen, we use this triplet dataset to train a compression-rate predictor.\nExperiments on three QA datasets and one conversational multi-doc QA dataset\nshow that AdaComp significantly reduces inference costs while maintaining\nperformance nearly identical to uncompressed models, achieving a balance\nbetween efficiency and performance."},{"date":"2024-08","title":"Extraction of Research Objectives, Machine Learning Model Names, and Dataset Names from Academic Papers and Analysis of Their Interrelationships Using LLM and Network Analysis","author":"S. Nishio, H. Nonaka, N. Tsuchiya, A. Migita, Y. Banno, T. Hayashi, H. Sakaji, T. Sakumoto, and K. Watabe","link":"http://arxiv.org/abs/2408.12097v1","abstract":"Machine learning is widely utilized across various industries. Identifying\nthe appropriate machine learning models and datasets for specific tasks is\ncrucial for the effective industrial application of machine learning. However,\nthis requires expertise in both machine learning and the relevant domain,\nleading to a high learning cost. Therefore, research focused on extracting\ncombinations of tasks, machine learning models, and datasets from academic\npapers is critically important, as it can facilitate the automatic\nrecommendation of suitable methods. Conventional information extraction methods\nfrom academic papers have been limited to identifying machine learning models\nand other entities as named entities. To address this issue, this study\nproposes a methodology for extracting tasks, machine learning methods, and dataset\nnames from scientific papers and analyzing the relationships among this\ninformation using an LLM, an embedding model, and network clustering. The proposed\nmethod's expression extraction performance, when using Llama3, achieves an\nF-score exceeding 0.8 across various categories, confirming its practical\nutility. Benchmarking results on financial domain papers have demonstrated the\neffectiveness of this method, providing insights into the use of the latest\ndatasets, including those related to ESG (Environmental, Social, and\nGovernance) data."},{"date":"2024-08","title":"JieHua Paintings Style Feature Extracting Model using Stable Diffusion with ControlNet","author":"Yujia Gu, Haofeng Li, Xinyu Fang, Zihan Peng, and Yinan Peng","link":"http://arxiv.org/abs/2408.11744v1","abstract":"This study proposes a novel approach to extract stylistic features of Jiehua:\nthe utilization of the Fine-tuned Stable Diffusion Model with ControlNet\n(FSDMC) to refine depiction techniques from artists' Jiehua. 
The training data\nfor FSDMC is based on open-source Jiehua artists' works collected from the\nInternet, which were subsequently manually organized into the format of\n(Original Image, Canny Edge Features, Text Prompt). By employing the optimal\nhyperparameters identified in this paper, it was observed that FSDMC outperforms\nCycleGAN, another mainstream style transfer model. FSDMC achieves an FID of 3.27\non the dataset and also surpasses CycleGAN in terms of expert evaluation. This\nnot only demonstrates the model's high effectiveness in extracting Jiehua's\nstyle features, but also preserves the original pre-trained semantic\ninformation. The findings of this study suggest that the application of FSDMC\nwith appropriate hyperparameters can enhance the efficacy of the Stable\nDiffusion Model in the field of traditional art style migration tasks,\nparticularly within the context of Jiehua."},{"date":"2024-08","title":"Extracting polygonal footprints in off-nadir images with Segment Anything Model","author":"Kai Li, Yupeng Deng, Jingbo Chen, Yu Meng, Zhihao Xi, Junxian Ma, Chenhao Wang, and Xiangyu Zhao","link":"http://arxiv.org/abs/2408.08645v3","abstract":"Building Footprint Extraction (BFE) from off-nadir aerial images often\ninvolves roof segmentation and offset prediction to adjust roof boundaries to\nthe building footprint. However, this multi-stage approach typically produces\nlow-quality results, limiting its applicability in real-world data production.\nTo address this issue, we present OBMv2, an end-to-end and promptable model for\npolygonal footprint prediction. Unlike its predecessor OBM, OBMv2 introduces a\nnovel Self Offset Attention (SOFA) mechanism that improves performance across\ndiverse building types, from bungalows to skyscrapers, enabling end-to-end\nfootprint prediction without post-processing. Additionally, we propose a\nMulti-level Information System (MISS) to effectively leverage roof masks,\nbuilding masks, and offsets for accurate footprint prediction. We evaluate\nOBMv2 on the BONAI and OmniCity-view3 datasets and demonstrate its\ngeneralization on the Huizhou test set. The code will be available at\nhttps://github.com/likaiucas/OBMv2."},{"date":"2024-08","title":"Extracting Sentence Embeddings from Pretrained Transformer Models","author":"Lukas Stankevi\u010dius, and Mantas Luko\u0161evi\u010dius","link":"http://arxiv.org/abs/2408.08073v1","abstract":"Background/introduction: Pre-trained transformer models shine in many natural\nlanguage processing tasks and therefore are expected to bear the representation\nof the input sentence or text meaning. These sentence-level embeddings are also\nimportant in retrieval-augmented generation. But do commonly used plain\naveraging or prompt templates surface it enough?\n Methods: Given the 110M-parameter BERT's hidden representations from multiple\nlayers and multiple tokens, we tried various ways to extract optimal sentence\nrepresentations. We tested various token aggregation and representation\npost-processing techniques. We also tested multiple ways of using a general\nWikitext dataset to complement BERT's sentence representations. All methods were\ntested on 8 Semantic Textual Similarity (STS), 6 short text clustering, and 12\nclassification tasks. We also evaluated our representation-shaping techniques\non other static models, including random token representations.\n Results: Proposed representation extraction methods improved the performance\non STS and clustering tasks for all models considered. 
Improvements were especially high\nfor static token-based models; in particular, random embeddings on STS tasks\nalmost reach the performance of BERT-derived representations.\n Conclusions: Our work shows that, for multiple tasks, simple baselines with\nrepresentation-shaping techniques reach or even outperform more complex\nBERT-based models, or are able to contribute to their performance."},{"date":"2024-08","title":"Evaluating Large Language Model based Personal Information Extraction and Countermeasures","author":"Yupei Liu, Yuqi Jia, Jinyuan Jia, and Neil Zhenqiang Gong","link":"http://arxiv.org/abs/2408.07291v1","abstract":"Automatically extracting personal information--such as name, phone number,\nand email address--from publicly available profiles at a large scale is a\nstepping stone to many other security attacks including spear phishing. Traditional\nmethods--such as regular expressions, keyword search, and entity\ndetection--achieve limited success at such personal information extraction. In\nthis work, we perform a systematic measurement study to benchmark large\nlanguage model (LLM) based personal information extraction and countermeasures.\nTowards this goal, we present a framework for LLM-based extraction attacks;\ncollect three datasets including a synthetic dataset generated by GPT-4 and two\nreal-world datasets with 8 manually labeled categories of personal information;\nintroduce a novel mitigation strategy based on \emph{prompt injection}; and\nsystematically benchmark LLM-based attacks and countermeasures using 10 LLMs\nand our 3 datasets. Our key findings include: LLMs can be misused by attackers\nto accurately extract various personal information from personal profiles; LLMs\noutperform conventional methods at such extraction; and prompt injection can\nmitigate such risk to a large extent and outperforms conventional\ncountermeasures. Our code and data are available at:\n\url{https://github.com/liu00222/LLM-Based-Personal-Profile-Extraction}."},{"date":"2024-08","title":"Automatic Feature Recognition and Dimensional Attributes Extraction From CAD Models for Hybrid Additive-Subtractive Manufacturing","author":"Muhammad Tayyab Khan, Wenhe Feng, Lequn Chen, Ye Han Ng, Nicholas Yew Jin Tan, and Seung Ki Moon","link":"http://arxiv.org/abs/2408.06891v2","abstract":"The integration of Computer-Aided Design (CAD), Computer-Aided Process\nPlanning (CAPP), and Computer-Aided Manufacturing (CAM) plays a crucial role in\nmodern manufacturing, facilitating seamless transitions from digital designs to\nphysical products. However, a significant challenge within this integration is\nthe Automatic Feature Recognition (AFR) of CAD models, especially in the\ncontext of hybrid manufacturing that combines subtractive and additive\nmanufacturing processes. Traditional AFR methods, focused mainly on the\nidentification of subtractive (machined) features including holes, fillets,\nchamfers, pockets, and slots, fail to recognize features pertinent to additive\nmanufacturing. Furthermore, the traditional methods fall short in accurately\nextracting geometric dimensions and orientations, which are also key factors\nfor effective manufacturing process planning. This paper presents a novel\napproach for creating a synthetic CAD dataset that encompasses features\nrelevant to both additive and subtractive machining through Python Open\nCascade. 
The Hierarchical Graph Convolutional Neural Network (HGCNN) model is\nimplemented to accurately identify the composite additive-subtractive features\nwithin the synthetic CAD dataset. The key novelty and contribution of the\nproposed methodology lie in its ability to recognize a wide range of\nmanufacturing features and to precisely extract their dimensions,\norientations, and stock sizes. The proposed model demonstrates remarkable\nfeature recognition accuracy exceeding 97% and a dimension extraction accuracy\nof 100% for identified features. Therefore, the proposed methodology enhances\nthe integration of CAD, CAPP, and CAM within hybrid manufacturing by providing\nprecise feature recognition and dimension extraction. It facilitates improved\nmanufacturing process planning by enabling more informed decision-making."},{"date":"2024-08","title":"Target Prompting for Information Extraction with Vision Language Model","author":"Dipankar Medhi","link":"http://arxiv.org/abs/2408.03834v1","abstract":"The recent trend in Large Vision and Language models has brought a new\nchange in how information extraction systems are built. VLMs have set a new\nbenchmark with their state-of-the-art techniques in understanding documents and\nbuilding question-answering systems across various industries. They are\nsignificantly better at generating text from document images and providing\naccurate answers to questions. However, there are still some challenges in\neffectively utilizing these models to build a precise conversational system.\nGeneral prompting techniques used with large language models are often not\nsuitable for these specially designed vision language models. The output\ngenerated by such generic input prompts is ordinary and may contain information\ngaps when compared with the actual content of the document. To obtain more\naccurate and specific answers, a well-targeted prompt is required by the vision\nlanguage model, along with the document image. In this paper, a technique\ncalled Target prompting is discussed, which focuses on explicitly targeting parts\nof document images and generating related answers from those specific regions\nonly. The paper also covers the evaluation of responses for each prompting\ntechnique using different user queries and input prompts."},{"date":"2024-08","title":"Local Topology Measures of Contextual Language Model Latent Spaces With Applications to Dialogue Term Extraction","author":"Benjamin Matthias Ruppik, Michael Heck, Carel van Niekerk, Renato Vukovic, Hsien-chin Lin, Shutong Feng, Marcus Zibrowius, and Milica Ga\u0161i\u0107","link":"http://arxiv.org/abs/2408.03706v1","abstract":"A common approach for sequence tagging tasks based on contextual word\nrepresentations is to train a machine learning classifier directly on these\nembedding vectors. This approach has two shortcomings. First, such methods\nconsider single input sequences in isolation and are unable to put an\nindividual embedding vector in relation to vectors outside the current local\ncontext of use. Second, the high performance of these models relies on\nfine-tuning the embedding model in conjunction with the classifier, which may\nnot always be feasible due to the size or inaccessibility of the underlying\nfeature-generation model. It is thus desirable, given a collection of embedding\nvectors of a corpus, i.e., a datastore, to find features of each vector that\ndescribe its relation to other, similar vectors in the datastore. 
With this in\nmind, we introduce complexity measures of the local topology of the latent\nspace of a contextual language model with respect to a given datastore. The\neffectiveness of our features is demonstrated through their application to\ndialogue term extraction. Our work continues a line of research that explores\nthe manifold hypothesis for word embeddings, demonstrating that local structure\nin the space carved out by word embeddings can be exploited to infer semantic\nproperties."},{"date":"2024-08","title":"Modelling Visual Semantics via Image Captioning to extract Enhanced Multi-Level Cross-Modal Semantic Incongruity Representation with Attention for Multimodal Sarcasm Detection","author":"Sajal Aggarwal, Ananya Pandey, and Dinesh Kumar Vishwakarma","link":"http://arxiv.org/abs/2408.02595v1","abstract":"Sarcasm is a type of irony, characterized by an inherent mismatch between the\nliteral interpretation and the intended connotation. Though sarcasm detection\nin text has been extensively studied, there are situations in which textual\ninput alone might be insufficient to perceive sarcasm. The inclusion of\nadditional contextual cues, such as images, is essential to recognize sarcasm\nin social media data effectively. This study presents a novel framework for\nmultimodal sarcasm detection that can process input triplets. Two components of\nthese triplets comprise the input text and its associated image, as provided in\nthe datasets. Additionally, a supplementary modality is introduced in the form\nof descriptive image captions. The motivation behind incorporating this visual\nsemantic representation is to more accurately capture the discrepancies between\nthe textual and visual content, which are fundamental to the sarcasm detection\ntask. The primary contributions of this study are: (1) a robust textual feature\nextraction branch that utilizes a cross-lingual language model; (2) a visual\nfeature extraction branch that incorporates a self-regulated residual ConvNet\nintegrated with a lightweight spatially aware attention module; (3) an\nadditional modality in the form of image captions generated using an\nencoder-decoder architecture capable of reading text embedded in images; (4)\ndistinct attention modules to effectively identify the incongruities between\nthe text and two levels of image representations; (5) multi-level cross-domain\nsemantic incongruity representation achieved through feature fusion. Compared\nwith cutting-edge baselines, the proposed model achieves the best accuracy of\n92.89% and 64.48%, respectively, on the Twitter multimodal sarcasm and\nMultiBully datasets."},{"date":"2024-08","title":"Why Are My Prompts Leaked? Unraveling Prompt Extraction Threats in Customized Large Language Models","author":"Zi Liang, Haibo Hu, Qingqing Ye, Yaxin Xiao, and Haoyang Li","link":"http://arxiv.org/abs/2408.02416v1","abstract":"The drastic increase of large language models' (LLMs) parameters has led to a\nnew research direction of fine-tuning-free downstream customization by prompts,\ni.e., task descriptions. While these prompt-based services (e.g. OpenAI's GPTs)\nplay an important role in many businesses, there has emerged growing concerns\nabout the prompt leakage, which undermines the intellectual properties of these\nservices and causes downstream attacks. In this paper, we analyze the\nunderlying mechanism of prompt leakage, which we refer to as prompt\nmemorization, and develop corresponding defending strategies. 
By exploring the\nscaling laws in prompt extraction, we analyze key attributes that influence\nprompt extraction, including model sizes and prompt lengths, as well as the types\nof prompts. Then we propose two hypotheses that explain how LLMs expose their\nprompts. The first is attributed to the perplexity, i.e., the familiarity of\nLLMs with texts, whereas the second is based on the straightforward token\ntranslation path in attention matrices. To defend against such threats, we\ninvestigate whether alignments can undermine the extraction of prompts. We find\nthat current LLMs, even those with safety alignments like GPT-4, are highly\nvulnerable to prompt extraction attacks, even under the most straightforward\nuser attacks. Therefore, we put forward several defense strategies inspired by\nour findings, which achieve 83.8\% and 71.0\% drops in the prompt\nextraction rate for Llama2-7B and GPT-3.5, respectively. Source code is\navailable at \url{https://github.com/liangzid/PromptExtractionEval}."},{"date":"2024-08","title":"A Few-Shot Approach for Relation Extraction Domain Adaptation using Large Language Models","author":"Vanni Zavarella, Juan Carlos Gamero-Salinas, and Sergio Consoli","link":"http://arxiv.org/abs/2408.02377v1","abstract":"Knowledge graphs (KGs) have been successfully applied to the analysis of\ncomplex scientific and technological domains, with automatic KG generation\nmethods typically building upon relation extraction models capturing\nfine-grained relations between domain entities in text. While these relations\nare fully applicable across scientific areas, existing models are trained on\nfew domain-specific datasets such as SciERC and do not perform well on new\ntarget domains. In this paper, we experiment with leveraging in-context\nlearning capabilities of Large Language Models to perform schema-constrained\ndata annotation, collecting in-domain training instances for a\nTransformer-based relation extraction model deployed on titles and abstracts of\nresearch papers in the Architecture, Construction, Engineering and Operations\n(AECO) domain. By assessing the performance gain with respect to a baseline\nDeep Learning architecture trained on off-domain data, we show that by using a\nfew-shot learning strategy with structured prompts and only minimal expert\nannotation the presented approach can potentially support domain adaptation of\na science KG generation model."},{"date":"2024-08","title":"VidModEx: Interpretable and Efficient Black Box Model Extraction for High-Dimensional Spaces","author":"Somnath Sendhil Kumar, Yuvaraj Govindarajulu, Pavan Kulkarni, and Manojkumar Parmar","link":"http://arxiv.org/abs/2408.02140v1","abstract":"In the domain of black-box model extraction, conventional methods reliant on\nsoft labels or surrogate datasets struggle with scaling to high-dimensional\ninput spaces and managing the complexity of an extensive array of interrelated\nclasses. In this work, we present a novel approach that utilizes SHAP (SHapley\nAdditive exPlanations) to enhance synthetic data generation. SHAP quantifies\nthe individual contributions of each input feature towards the victim model's\noutput, facilitating the optimization of an energy-based GAN towards a\ndesirable output. 
This method significantly boosts performance, achieving a\n16.45% increase in the accuracy of image classification models and extending to\nvideo classification models with an average improvement of 26.11% and a maximum\nof 33.36% on challenging datasets such as UCF11, UCF101, Kinetics 400, Kinetics\n600, and Something-Something V2. We further demonstrate the effectiveness and\npractical utility of our method under various scenarios, including the\navailability of top-k prediction probabilities, top-k prediction labels, and\ntop-1 labels."},{"date":"2024-08","title":"Knowledge AI: Fine-tuning NLP Models for Facilitating Scientific Knowledge Extraction and Understanding","author":"Balaji Muralidharan, Hayden Beadles, Reza Marzban, and Kalyan Sashank Mupparaju","link":"http://arxiv.org/abs/2408.04651v1","abstract":"This project investigates the efficacy of Large Language Models (LLMs) in\nunderstanding and extracting scientific knowledge across specific domains and\nto create a deep learning framework: Knowledge AI. As a part of this framework,\nwe employ pre-trained models and fine-tune them on datasets in the scientific\ndomain. The models are adapted for four key Natural Language Processing (NLP)\ntasks: summarization, text generation, question answering, and named entity\nrecognition. Our results indicate that domain-specific fine-tuning\nsignificantly enhances model performance in each of these tasks, thereby\nimproving their applicability for scientific contexts. This adaptation enables\nnon-experts to efficiently query and extract information within targeted\nscientific fields, demonstrating the potential of fine-tuned LLMs as a tool for\nknowledge discovery in the sciences."},{"date":"2024-08","title":"Integrating Large Language Models and Knowledge Graphs for Extraction and Validation of Textual Test Data","author":"Antonio De Santis, Marco Balduini, Federico De Santis, Andrea Proia, Arsenio Leo, Marco Brambilla, and Emanuele Della Valle","link":"http://arxiv.org/abs/2408.01700v1","abstract":"Aerospace manufacturing companies, such as Thales Alenia Space, design,\ndevelop, integrate, verify, and validate products characterized by high\ncomplexity and low volume. They carefully document all phases for each product\nbut analyses across products are challenging due to the heterogeneity and\nunstructured nature of the data in documents. In this paper, we propose a\nhybrid methodology that leverages Knowledge Graphs (KGs) in conjunction with\nLarge Language Models (LLMs) to extract and validate data contained in these\ndocuments. We consider a case study focused on test data related to electronic\nboards for satellites. To do so, we extend the Semantic Sensor Network\nontology. We store the metadata of the reports in a KG, while the actual test\nresults are stored in parquet accessible via a Virtual Knowledge Graph. The\nvalidation process is managed using an LLM-based approach. We also conduct a\nbenchmarking study to evaluate the performance of state-of-the-art LLMs in\nexecuting this task. 
Finally, we analyze the costs and benefits of automating\npreexisting processes of manual data extraction and validation for subsequent\ncross-report analyses."},{"date":"2024-07","title":"FIARSE: Model-Heterogeneous Federated Learning via Importance-Aware Submodel Extraction","author":"Feijie Wu, Xingchen Wang, Yaqing Wang, Tianci Liu, Lu Su, and Jing Gao","link":"http://arxiv.org/abs/2407.19389v2","abstract":"In federated learning (FL), accommodating clients' varied computational\ncapacities poses a challenge, often limiting the participation of those with\nconstrained resources in global model training. To address this issue, the\nconcept of model heterogeneity through submodel extraction has emerged,\noffering a tailored solution that aligns the model's complexity with each\nclient's computational capacity. In this work, we propose Federated\nImportance-Aware Submodel Extraction (FIARSE), a novel approach that\ndynamically adjusts submodels based on the importance of model parameters,\nthereby overcoming the limitations of previous static and dynamic submodel\nextraction methods. Compared to existing works, the proposed method offers a\ntheoretical foundation for the submodel extraction and eliminates the need for\nadditional information beyond the model parameters themselves to determine\nparameter importance, significantly reducing the overhead on clients. Extensive\nexperiments are conducted on various datasets to showcase the superior\nperformance of the proposed FIARSE."},{"date":"2024-07","title":"Human-artificial intelligence teaming for scientific information extraction from data-driven additive manufacturing research using large language models","author":"Mutahar Safdar, Jiarui Xie, Andrei Mircea, and Yaoyao Fiona Zhao","link":"http://arxiv.org/abs/2407.18827v1","abstract":"Data-driven research in Additive Manufacturing (AM) has achieved significant\nsuccess in recent years. This has led to the emergence of a plethora of scientific\nliterature. The knowledge in these works consists of AM and Artificial Intelligence\n(AI) contexts that have not been mined and formalized in an integrated way. It\nrequires substantial effort and time to extract scientific information from\nthese works. AM domain experts have contributed over two dozen review papers to\nsummarize these works. However, information specific to AM and AI contexts\nstill requires manual effort to extract. The recent success of foundation\nmodels such as BERT (Bidirectional Encoder Representations for Transformers) or\nGPT (Generative Pre-trained Transformers) on textual data has opened the\npossibility of expediting scientific information extraction. We propose a\nframework that enables collaboration between AM and AI experts to continuously\nextract scientific information from data-driven AM literature. A demonstration\ntool is implemented based on the proposed framework and a case study is\nconducted to extract information relevant to the datasets, modeling, sensing,\nand AM system categories. We show the ability of LLMs (Large Language Models)\nto expedite the extraction of relevant information from data-driven AM\nliterature. 
In the future, the framework can be used to extract information\nfrom the broader design and manufacturing literature in the engineering\ndiscipline."},{"date":"2024-07","title":"A Universal Prompting Strategy for Extracting Process Model Information from Natural Language Text using Large Language Models","author":"Julian Neuberger, Lars Ackermann, Han van der Aa, and Stefan Jablonski","link":"http://arxiv.org/abs/2407.18540v1","abstract":"Over the past decade, extensive research efforts have been dedicated to the\nextraction of information from textual process descriptions. Despite the\nremarkable progress witnessed in natural language processing (NLP), information\nextraction within the Business Process Management domain remains predominantly\nreliant on rule-based systems and machine learning methodologies. Data scarcity\nhas so far prevented the successful application of deep learning techniques.\nHowever, the rapid progress in generative large language models (LLMs) makes it\npossible to solve many NLP tasks with very high quality without the need for\nextensive data. Therefore, we systematically investigate the potential of LLMs\nfor extracting information from textual process descriptions, targeting the\ndetection of process elements such as activities and actors, and relations\nbetween them. Using a heuristic algorithm, we demonstrate the suitability of\nthe extracted information for process model generation. Based on a novel\nprompting strategy, we show that LLMs are able to outperform state-of-the-art\nmachine learning approaches with absolute performance improvements of up to 8\\%\n$F_1$ score across three different datasets. We evaluate our prompting strategy\non eight different LLMs, showing it is universally applicable, while also\nanalyzing the impact of certain prompt parts on extraction quality. The number\nof example texts, the specificity of definitions, and the rigour of format\ninstructions are identified as key for improving the accuracy of extracted\ninformation. Our code, prompts, and data are publicly available."},{"date":"2024-07","title":"SDoH-GPT: Using Large Language Models to Extract Social Determinants of Health (SDoH)","author":"Bernardo Consoli, Xizhi Wu, Song Wang, Xinyu Zhao, Yanshan Wang, Justin Rousseau, Tom Hartvigsen, Li Shen, Huanmei Wu, Yifan Peng, Qi Long, Tianlong Chen, and Ying Ding","link":"http://arxiv.org/abs/2407.17126v1","abstract":"Extracting social determinants of health (SDoH) from unstructured medical\nnotes depends heavily on labor-intensive annotations, which are typically\ntask-specific, hampering reusability and limiting sharing. In this study we\nintroduced SDoH-GPT, a simple and effective few-shot Large Language Model (LLM)\nmethod leveraging contrastive examples and concise instructions to extract SDoH\nwithout relying on extensive medical annotations or costly human intervention.\nIt achieved tenfold and twentyfold reductions in time and cost respectively,\nand superior consistency with human annotators measured by Cohen's kappa of up\nto 0.92. The innovative combination of SDoH-GPT and XGBoost leverages the\nstrengths of both, ensuring high accuracy and computational efficiency while\nconsistently maintaining 0.90+ AUROC scores. Testing across three distinct\ndatasets has confirmed its robustness and accuracy. 
This study highlights the\npotential of leveraging LLMs to revolutionize medical note classification,\ndemonstrating their capability to achieve highly accurate classifications with\nsignificantly reduced time and cost."},{"date":"2024-07","title":"From Text to Insight: Large Language Models for Materials Science Data Extraction","author":"Mara Schilling-Wilhelmi, Marti\u00f1o R\u00edos-Garc\u00eda, Sherjeel Shabih, Mar\u00eda Victoria Gil, Santiago Miret, Christoph T. Koch, Jos\u00e9 A. M\u00e1rquez, and Kevin Maik Jablonka","link":"http://arxiv.org/abs/2407.16867v1","abstract":"The vast majority of materials science knowledge exists in unstructured\nnatural language, yet structured data is crucial for innovative and systematic\nmaterials design. Traditionally, the field has relied on manual curation and\npartial automation for data extraction for specific use cases. The advent of\nlarge language models (LLMs) represents a significant shift, potentially\nenabling efficient extraction of structured, actionable data from unstructured\ntext by non-experts. While applying LLMs to materials science data extraction\npresents unique challenges, domain knowledge offers opportunities to guide and\nvalidate LLM outputs. This review provides a comprehensive overview of\nLLM-based structured data extraction in materials science, synthesizing current\nknowledge and outlining future directions. We address the lack of standardized\nguidelines and present frameworks for leveraging the synergy between LLMs and\nmaterials science expertise. This work serves as a foundational resource for\nresearchers aiming to harness LLMs for data-driven materials research. The\ninsights presented here could significantly enhance how researchers across\ndisciplines access and utilize scientific information, potentially accelerating\nthe development of novel materials for critical societal needs."},{"date":"2024-07","title":"Causality extraction from medical text using Large Language Models (LLMs)","author":"Seethalakshmi Gopalakrishnan, Luciana Garbayo, and Wlodek Zadrozny","link":"http://arxiv.org/abs/2407.10020v1","abstract":"This study explores the potential of natural language models, including large\nlanguage models, to extract causal relations from medical texts, specifically\nfrom Clinical Practice Guidelines (CPGs). The outcomes of causality extraction\nfrom Clinical Practice Guidelines for gestational diabetes are presented,\nmarking a first in the field. We report on a set of experiments using variants\nof BERT (BioBERT, DistilBERT, and BERT) and using Large Language Models (LLMs),\nnamely GPT-4 and LLAMA2. Our experiments show that BioBERT performed better\nthan other models, including the Large Language Models, with an average\nF1-score of 0.72. GPT-4 and LLAMA2 results show similar performance but less\nconsistency. We also release the code and an annotated corpus of causal\nstatements within the Clinical Practice Guidelines for gestational diabetes."},{"date":"2024-07","title":"Empowering Few-Shot Relation Extraction with The Integration of Traditional RE Methods and Large Language Models","author":"Ye Liu, Kai Zhang, Aoran Gan, Linan Yue, Feng Hu, Qi Liu, and Enhong Chen","link":"http://arxiv.org/abs/2407.08967v1","abstract":"Few-Shot Relation Extraction (FSRE), a subtask of Relation Extraction (RE)\nthat utilizes limited training instances, appeals to more researchers in\nNatural Language Processing (NLP) due to its capability to extract textual\ninformation in extremely low-resource scenarios. 
The primary methodologies\nemployed for FSRE have been fine-tuning or prompt tuning techniques based on\nPre-trained Language Models (PLMs). Recently, the emergence of Large Language\nModels (LLMs) has prompted numerous researchers to explore FSRE through\nIn-Context Learning (ICL). However, there are substantial limitations\nassociated with methods based on either traditional RE models or LLMs.\nTraditional RE models are hampered by a lack of necessary prior knowledge,\nwhile LLMs fall short in their task-specific capabilities for RE. To address\nthese shortcomings, we propose a Dual-System Augmented Relation Extractor\n(DSARE), which synergistically combines traditional RE models with LLMs.\nSpecifically, DSARE innovatively injects the prior knowledge of LLMs into\ntraditional RE models, and conversely enhances LLMs' task-specific aptitude for\nRE through relation extraction augmentation. Moreover, an Integrated Prediction\nmodule is employed to jointly consider these two respective predictions and\nderive the final results. Extensive experiments demonstrate the efficacy of our\nproposed method."},{"date":"2024-07","title":"Extracting Training Data from Document-Based VQA Models","author":"Francesco Pinto, Nathalie Rauschmayr, Florian Tram\u00e8r, Philip Torr, and Federico Tombari","link":"http://arxiv.org/abs/2407.08707v1","abstract":"Vision-Language Models (VLMs) have made remarkable progress in document-based\nVisual Question Answering (i.e., responding to queries about the contents of an\ninput document provided as an image). In this work, we show these models can\nmemorize responses for training samples and regurgitate them even when the\nrelevant visual information has been removed. This includes Personal\nIdentifiable Information (PII) repeated once in the training set, indicating\nthese models could divulge memorised sensitive information and therefore pose a\nprivacy risk. We quantitatively measure the extractability of information in\ncontrolled experiments and differentiate between cases where it arises from\ngeneralization capabilities or from memorization. We further investigate the\nfactors that influence memorization across multiple state-of-the-art models and\npropose an effective heuristic countermeasure that empirically prevents the\nextractability of PII."},{"date":"2024-07","title":"ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction","author":"Shaozhe Hao, Kai Han, Zhengyao Lv, Shihao Zhao, and Kwan-Yee K. Wong","link":"http://arxiv.org/abs/2407.07077v1","abstract":"While personalized text-to-image generation has enabled the learning of a\nsingle concept from multiple images, a more practical yet challenging scenario\ninvolves learning multiple concepts within a single image. However, existing\nworks tackling this scenario heavily rely on extensive human annotations. In\nthis paper, we introduce a novel task named Unsupervised Concept Extraction\n(UCE) that considers an unsupervised setting without any human knowledge of the\nconcepts. Given an image that contains multiple concepts, the task aims to\nextract and recreate individual concepts solely relying on the existing\nknowledge from pretrained diffusion models. To achieve this, we present\nConceptExpress that tackles UCE by unleashing the inherent capabilities of\npretrained diffusion models in two aspects. 
Specifically, a concept\nlocalization approach automatically locates and disentangles salient concepts\nby leveraging spatial correspondence from diffusion self-attention; and based\non the lookup association between a concept and a conceptual token, a\nconcept-wise optimization process learns discriminative tokens that represent\neach individual concept. Finally, we establish an evaluation protocol tailored\nfor the UCE task. Extensive experiments demonstrate that ConceptExpress is a\npromising solution to the UCE task. Our code and data are available at:\nhttps://github.com/haoosz/ConceptExpress"},{"date":"2024-07","title":"Large Language Models for Judicial Entity Extraction: A Comparative Study","author":"Atin Sakkeer Hussain, and Anu Thomas","link":"http://arxiv.org/abs/2407.05786v1","abstract":"Domain-specific Entity Recognition holds significant importance in legal\ncontexts, serving as a fundamental task that supports various applications such\nas question-answering systems, text summarization, machine translation,\nsentiment analysis, and information retrieval specifically within case law\ndocuments. Recent advancements have highlighted the efficacy of Large Language\nModels in natural language processing tasks, demonstrating their capability to\naccurately detect and classify domain-specific facts (entities) from\nspecialized texts like clinical and financial documents. This research\ninvestigates the application of Large Language Models in identifying\ndomain-specific entities (e.g., courts, petitioner, judge, lawyer, respondents,\nFIR nos.) within case law documents, with a specific focus on their aptitude\nfor handling domain-specific language complexity and contextual variations. The\nstudy evaluates the performance of state-of-the-art Large Language Model\narchitectures, including Large Language Model Meta AI 3, Mistral, and Gemma, in\nthe context of extracting judicial facts tailored to Indian judicial texts.\nMistral and Gemma emerged as the top-performing models, showcasing balanced\nprecision and recall crucial for accurate entity identification. These findings\nconfirm the value of Large Language Models in judicial documents and\ndemonstrate how they can facilitate and quicken scientific research by\nproducing precise, organised data outputs that are appropriate for in-depth\nexamination."},{"date":"2024-07","title":"Extracting and Encoding: Leveraging Large Language Models and Medical Knowledge to Enhance Radiological Text Representation","author":"Pablo Messina, Ren\u00e9 Vidal, Denis Parra, \u00c1lvaro Soto, and Vladimir Araujo","link":"http://arxiv.org/abs/2407.01948v1","abstract":"Advancing representation learning in specialized fields like medicine remains\nchallenging due to the scarcity of expert annotations for text and images. To\ntackle this issue, we present a novel two-stage framework designed to extract\nhigh-quality factual statements from free-text radiology reports in order to\nimprove the representations of text encoders and, consequently, their\nperformance on various downstream tasks. In the first stage, we propose a\n\\textit{Fact Extractor} that leverages large language models (LLMs) to identify\nfactual statements from well-curated domain-specific datasets. In the second\nstage, we introduce a \\textit{Fact Encoder} (CXRFE) based on a BERT model\nfine-tuned with objective functions designed to improve its representations\nusing the extracted factual data. 
Our framework also includes a new\nembedding-based metric (CXRFEScore) for evaluating chest X-ray text generation\nsystems, leveraging both stages of our approach. Extensive evaluations show\nthat our fact extractor and encoder outperform current state-of-the-art methods\nin tasks such as sentence ranking, natural language inference, and label\nextraction from radiology reports. Additionally, our metric proves to be more\nrobust and effective than existing metrics commonly used in the radiology\nreport generation literature. The code of this project is available at\n\url{https://github.com/PabloMessina/CXR-Fact-Encoder}."},{"date":"2024-07","title":"QUEEN: Query Unlearning against Model Extraction","author":"Huajie Chen, Tianqing Zhu, Lefeng Zhang, Bo Liu, Derui Wang, Wanlei Zhou, and Minhui Xue","link":"http://arxiv.org/abs/2407.01251v1","abstract":"Model extraction attacks currently pose a non-negligible threat to the\nsecurity and privacy of deep learning models. By querying the model with a\nsmall dataset and using the query results as the ground-truth labels, an\nadversary can steal a piracy model with performance comparable to the original\nmodel. Two key issues that cause the threat are, on the one hand, accurate and\nunlimited queries can be obtained by the adversary; on the other hand, the\nadversary can aggregate the query results to train the model step by step. The\nexisting defenses usually employ model watermarking or fingerprinting to\nprotect the ownership. However, these methods cannot proactively prevent the\nviolation from happening. To mitigate the threat, we propose QUEEN (QUEry\nunlEarNing) that proactively launches counterattacks on potential model\nextraction attacks from the very beginning. To limit the potential threat,\nQUEEN has sensitivity measurement and output perturbation, which prevent the\nadversary from training a piracy model with high performance. In sensitivity\nmeasurement, QUEEN measures the single query sensitivity by its distance from\nthe center of its cluster in the feature space. To reduce the learning accuracy\nof attacks, for the highly sensitive query batch, QUEEN applies query\nunlearning, which is implemented by gradient reverse to perturb the softmax\noutput such that the piracy model will generate reverse gradients to worsen its\nperformance unconsciously. Experiments show that QUEEN outperforms the\nstate-of-the-art defenses against various model extraction attacks with a\nrelatively low cost to the model accuracy. The artifact is publicly available\nat https://anonymous.4open.science/r/queen implementation-5408/."},{"date":"2024-06","title":"Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs","author":"Lokesh Mishra, Sohayl Dhibi, Yusik Kim, Cesar Berrospi Ramis, Shubham Gupta, Michele Dolfi, and Peter Staar","link":"http://arxiv.org/abs/2406.19102v1","abstract":"Environment, Social, and Governance (ESG) KPIs assess an organization's\nperformance on issues such as climate change, greenhouse gas emissions, water\nconsumption, waste management, human rights, diversity, and policies. ESG\nreports convey this valuable quantitative information through tables.\nUnfortunately, extracting this information is difficult due to high variability\nin the table structure as well as content. We propose Statements, a novel\ndomain agnostic data structure for extracting quantitative facts and related\ninformation. 
We propose translating tables to statements as a new supervised\ndeep-learning universal information extraction task. We introduce SemTabNet - a\ndataset of over 100K annotated tables. Investigating a family of T5-based\nStatement Extraction Models, our best model generates statements which are 82%\nsimilar to the ground-truth (compared to baseline of 21%). We demonstrate the\nadvantages of statements by applying our model to over 2700 tables from ESG\nreports. The homogeneous nature of statements permits exploratory data analysis\non expansive information found in large collections of ESG reports."},{"date":"2024-06","title":"Research on Information Extraction of LCSTS Dataset Based on an Improved BERTSum-LSTM Model","author":"Yiming Chen, Haobin Chen, Simin Liu, Yunyun Liu, Fanhao Zhou, and Bing Wei","link":"http://arxiv.org/abs/2406.18364v1","abstract":"With the continuous advancement of artificial intelligence, natural language\nprocessing technology has become widely utilized in various fields. At the same\ntime, there are many challenges in creating Chinese news summaries. First of\nall, the semantics of Chinese news is complex, and the amount of information is\nenormous. Extracting critical information from Chinese news presents a\nsignificant challenge. Second, the news summary should be concise and clear,\nfocusing on the main content and avoiding redundancy. In addition, the\nparticularity of the Chinese language, such as polysemy, word segmentation,\netc., makes it challenging to generate Chinese news summaries. Based on the\nabove, this paper studies the information extraction method of the LCSTS\ndataset based on an improved BERTSum-LSTM model. We improve the BERTSum-LSTM\nmodel to make it perform better in generating Chinese news summaries. The\nexperimental results show that the proposed method has a good effect on\ncreating news summaries, which is of great importance to the construction of\nnews summaries."},{"date":"2024-06","title":"Improving Entity Recognition Using Ensembles of Deep Learning and Fine-tuned Large Language Models: A Case Study on Adverse Event Extraction from Multiple Sources","author":"Yiming Li, Deepthi Viswaroopan, William He, Jianfu Li, Xu Zuo, Hua Xu, and Cui Tao","link":"http://arxiv.org/abs/2406.18049v1","abstract":"Adverse event (AE) extraction following COVID-19 vaccines from text data is\ncrucial for monitoring and analyzing the safety profiles of immunizations.\nTraditional deep learning models are adept at learning intricate feature\nrepresentations and dependencies in sequential data, but often require\nextensive labeled data. In contrast, large language models (LLMs) excel in\nunderstanding contextual information, but exhibit unstable performance on named\nentity recognition tasks, possibly due to their broad but unspecific training.\nThis study aims to evaluate the effectiveness of LLMs and traditional deep\nlearning models in AE extraction, and to assess the impact of ensembling these\nmodels on performance. In this study, we utilized reports and posts from the\nVAERS (n=621), Twitter (n=9,133), and Reddit (n=131) as our corpora. Our goal\nwas to extract three types of entities: \"vaccine\", \"shot\", and \"ae\". We\nexplored and fine-tuned (except GPT-4) multiple LLMs, including GPT-2, GPT-3.5,\nGPT-4, and Llama-2, as well as traditional deep learning models like RNN and\nBioBERT. To enhance performance, we created ensembles of the three models with\nthe best performance. 
For evaluation, we used strict and relaxed F1 scores to\nevaluate the performance for each entity type, and micro-average F1 was used to\nassess the overall performance. The ensemble model achieved the highest\nperformance in \"vaccine\", \"shot\", and \"ae\" with strict F1-scores of 0.878,\n0.930, and 0.925, respectively, along with a micro-average score of 0.903. In\nconclusion, this study demonstrates the effectiveness and robustness of\nensembling fine-tuned traditional deep learning models and LLMs, for extracting\nAE-related information. This study contributes to the advancement of biomedical\nnatural language processing, providing valuable insights into improving AE\nextraction from text data for pharmacovigilance and public health surveillance."},{"date":"2024-06","title":"Enabling Regional Explainability by Automatic and Model-agnostic Rule Extraction","author":"Yu Chen, Tianyu Cui, Alexander Capstick, Nan Fletcher-Loyd, and Payam Barnaghi","link":"http://arxiv.org/abs/2406.17885v3","abstract":"In Explainable AI, rule extraction translates model knowledge into logical\nrules, such as IF-THEN statements, crucial for understanding patterns learned\nby black-box models. This could significantly aid in fields like disease\ndiagnosis, disease progression estimation, or drug discovery. However, such\napplication domains often contain imbalanced data, with the class of interest\nunderrepresented. Existing methods inevitably compromise the performance of\nrules for the minor class to maximise the overall performance. As the first\nattempt in this field, we propose a model-agnostic approach for extracting\nrules from specific subgroups of data, featuring automatic rule generation for\nnumerical features. This method enhances the regional explainability of machine\nlearning models and offers wider applicability compared to existing methods. We\nadditionally introduce a new method for selecting features to compose rules,\nreducing computational costs in high-dimensional spaces. Experiments across\nvarious datasets and models demonstrate the effectiveness of our methods."},{"date":"2024-06","title":"Compact Model Parameter Extraction via Derivative-Free Optimization","author":"Rafael Perez Martinez, Masaya Iwamoto, Kelly Woo, Zhengliang Bian, Roberto Tinti, Stephen Boyd, and Srabanti Chowdhury","link":"http://arxiv.org/abs/2406.16355v2","abstract":"In this paper, we address the problem of compact model parameter extraction\nto simultaneously extract tens of parameters via derivative-free optimization.\nTraditionally, parameter extraction is performed manually by dividing the\ncomplete set of parameters into smaller subsets, each targeting different\noperational regions of the device, a process that can take several days or\nweeks. Our approach streamlines this process by employing derivative-free\noptimization to identify a good parameter set that best fits the compact model\nwithout performing an exhaustive number of simulations. We further enhance the\noptimization process to address three critical issues in device modeling by\ncarefully choosing a loss function that focuses on relative errors rather than\nabsolute errors to ensure consistent performance across different orders of\nmagnitude, prioritizes accuracy in key operational regions above a specific\nthreshold, and reduces sensitivity to outliers. Furthermore, we utilize the\nconcept of train-test split to assess the model fit and avoid overfitting. 
We\ndemonstrate the effectiveness of our approach by successfully modeling a\ndiamond Schottky diode with the SPICE diode model and a GaN-on-SiC HEMT with\nthe ASM-HEMT model. For the latter, which involves extracting 35 parameters for\nthe ASM-HEMT DC model, we identified the best set of parameters in under 6,000\ntrials. Additional examples using both devices are provided to demonstrate\nrobustness to outliers, showing that an excellent fit is achieved even with\nover 25% of the data purposely corrupted. These examples demonstrate the\npracticality of our approach, highlighting the benefits of derivative-free\noptimization in device modeling."},{"date":"2024-06","title":"Large Language Models for Link Stealing Attacks Against Graph Neural Networks","author":"Faqian Guan, Tianqing Zhu, Hui Sun, Wanlei Zhou, and Philip S. Yu","link":"http://arxiv.org/abs/2406.16963v1","abstract":"Graph data contains rich node features and unique edge information, which\nhave been applied across various domains, such as citation networks or\nrecommendation systems. Graph Neural Networks (GNNs) are specialized for\nhandling such data and have shown impressive performance in many applications.\nHowever, GNNs may contain sensitive information and be susceptible to privacy\nattacks. For example, link stealing is a type of attack in which attackers\ninfer whether two nodes are linked or not. Previous link stealing attacks\nprimarily relied on posterior probabilities from the target GNN model,\nneglecting the significance of node features. Additionally, variations in node\nclasses across different datasets lead to different dimensions of posterior\nprobabilities. The handling of these varying data dimensions posed a challenge\nin using a single model to effectively conduct link stealing attacks on\ndifferent datasets. To address these challenges, we introduce Large Language\nModels (LLMs) to perform link stealing attacks on GNNs. LLMs can effectively\nintegrate textual features and exhibit strong generalizability, enabling\nattacks to handle diverse data dimensions across various datasets. We design\ntwo distinct LLM prompts to effectively combine textual features and posterior\nprobabilities of graph nodes. Through these designed prompts, we fine-tune the\nLLM to adapt to the link stealing attack task. Furthermore, we fine-tune the\nLLM using multiple datasets and enable the LLM to learn features from different\ndatasets simultaneously. Experimental results show that our approach\nsignificantly enhances the performance of existing link stealing attack tasks\nin both white-box and black-box scenarios. Our method can execute link stealing\nattacks across different datasets using only a single model, making link\nstealing attacks more applicable to real-world scenarios."},{"date":"2024-06","title":"Relation Extraction with Fine-Tuned Large Language Models in Retrieval Augmented Generation Frameworks","author":"Sefika Efeoglu, and Adrian Paschke","link":"http://arxiv.org/abs/2406.14745v2","abstract":"Information Extraction (IE) is crucial for converting unstructured data into\nstructured formats like Knowledge Graphs (KGs). A key task within IE is\nRelation Extraction (RE), which identifies relationships between entities in\ntext. Various RE methods exist, including supervised, unsupervised, weakly\nsupervised, and rule-based approaches. Recent studies leveraging pre-trained\nlanguage models (PLMs) have shown significant success in this area. 
In the\ncurrent era dominated by Large Language Models (LLMs), fine-tuning these models\ncan overcome limitations associated with zero-shot LLM prompting-based RE\nmethods, especially regarding domain adaptation challenges and identifying\nimplicit relations between entities in sentences. These implicit relations,\nwhich cannot be easily extracted from a sentence's dependency tree, require\nlogical inference for accurate identification. This work explores the\nperformance of fine-tuned LLMs and their integration into the Retrieval\nAugmented-based (RAG) RE approach to address the challenges of identifying\nimplicit relations at the sentence level, particularly when LLMs act as\ngenerators within the RAG framework. Empirical evaluations on the TACRED,\nTACRED-Revisited (TACREV), Re-TACRED, and SemEVAL datasets show significant\nperformance improvements with fine-tuned LLMs, including Llama2-7B, Mistral-7B,\nand T5 (Large). Notably, our approach achieves substantial gains on SemEVAL,\nwhere implicit relations are common, surpassing previous results on this\ndataset. Additionally, our method outperforms previous works on TACRED, TACREV,\nand Re-TACRED, demonstrating exceptional performance across diverse evaluation\nscenarios."},{"date":"2024-06","title":"Extracting Training Data from Unconditional Diffusion Models","author":"Yunhao Chen, Xingjun Ma, Difan Zou, and Yu-Gang Jiang","link":"http://arxiv.org/abs/2406.12752v2","abstract":"As diffusion probabilistic models (DPMs) are being employed as mainstream\nmodels for generative artificial intelligence (AI), the study of their\nmemorization of the raw training data has attracted growing attention. Existing\nworks in this direction aim to establish an understanding of whether or to what\nextent DPMs learn by memorization. Such an understanding is crucial for\nidentifying potential risks of data leakage and copyright infringement in\ndiffusion models and, more importantly, for more controllable generation and\ntrustworthy application of Artificial Intelligence Generated Content (AIGC).\nWhile previous works have made important observations of when DPMs are prone to\nmemorization, these findings are mostly empirical, and the developed data\nextraction methods only work for conditional diffusion models. In this work, we\naim to establish a theoretical understanding of memorization in DPMs with 1) a\nmemorization metric for theoretical analysis, 2) an analysis of conditional\nmemorization with informative and random labels, and 3) two better evaluation\nmetrics for measuring memorization. Based on the theoretical analysis, we\nfurther propose a novel data extraction method called \\textbf{Surrogate\ncondItional Data Extraction (SIDE)} that leverages a classifier trained on\ngenerated data as a surrogate condition to extract training data directly from\nunconditional diffusion models. Our empirical results demonstrate that SIDE can\nextract training data from diffusion models where previous methods fail, and it\nis on average over 50\\% more effective across different scales of the CelebA\ndataset."},{"date":"2024-06","title":"Adaptive Reinforcement Learning Planning: Harnessing Large Language Models for Complex Information Extraction","author":"Zepeng Ding, Ruiyang Ke, Wenhao Huang, Guochao Jiang, Yanda Li, Deqing Yang, and Jiaqing Liang","link":"http://arxiv.org/abs/2406.11455v2","abstract":"Existing research on large language models (LLMs) shows that they can solve\ninformation extraction tasks through multi-step planning. 
However, their\nextraction behavior on complex sentences and tasks is unstable, giving rise to\nissues such as false positives and missing elements. We observe that decomposing\ncomplex extraction tasks and extracting them step by step can effectively\nimprove LLMs' performance, and the extraction orders of entities significantly\naffect the final results of LLMs. This paper proposes a two-stage multi-step\nmethod for LLM-based information extraction and adopts the RL framework to\nexecute the multi-step planning. We regard sequential extraction as a Markov\ndecision process, build an LLM-based extraction environment, design a decision\nmodule to adaptively provide the optimal order for sequential entity extraction\non different sentences, and utilize the DDQN algorithm to train the decision\nmodel. We also design the rewards and evaluation metrics suitable for the\nextraction results of LLMs. We conduct extensive experiments on multiple public\ndatasets to demonstrate the effectiveness of our method in improving the\ninformation extraction capabilities of LLMs."},{"date":"2024-06","title":"How Should We Extract Discrete Audio Tokens from Self-Supervised Models?","author":"Pooneh Mousavi, Jarod Duret, Salah Zaiem, Luca Della Libera, Artem Ploujnikov, Cem Subakan, and Mirco Ravanelli","link":"http://arxiv.org/abs/2406.10735v1","abstract":"Discrete audio tokens have recently gained attention for their potential to\nbridge the gap between audio and language processing. Ideal audio tokens must\npreserve content, paralinguistic elements, speaker identity, and many other\naudio details. Current audio tokenization methods fall into two categories:\nSemantic tokens, acquired through quantization of Self-Supervised Learning\n(SSL) models, and Neural compression-based tokens (codecs). Although previous\nstudies have benchmarked codec models to identify optimal configurations, the\nideal setup for quantizing pretrained SSL models remains unclear. This paper\nexplores the optimal configuration of semantic tokens across discriminative and\ngenerative tasks. We propose a scalable solution to train a universal vocoder\nacross multiple SSL layers. Furthermore, an attention mechanism is employed to\nidentify task-specific influential layers, enhancing the adaptability and\nperformance of semantic tokens in diverse audio applications."},{"date":"2024-06","title":"GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks","author":"Ihor Stepanov, and Mykhailo Shtopko","link":"http://arxiv.org/abs/2406.12925v2","abstract":"Information extraction tasks require accurate, efficient, and\ngeneralisable models. Classical supervised deep learning approaches can achieve\nthe required performance, but they need large datasets and are limited in their\nability to adapt to different tasks. On the other hand, large language models\n(LLMs) demonstrate good generalization, meaning that they can adapt to many\ndifferent tasks based on user requests. However, LLMs are computationally\nexpensive and tend to fail to generate structured outputs. In this article, we\nwill introduce a new kind of GLiNER model that can be used for various\ninformation extraction tasks while being a small encoder model. 
Our model\nachieved SoTA performance on zero-shot NER benchmarks and leading performance\non question-answering, summarization and relation extraction tasks.\nAdditionally, in this article, we will cover experimental results on\nself-learning approaches for named entity recognition using GLiNER models."},{"date":"2024-06","title":"Beyond Slow Signs in High-fidelity Model Extraction","author":"Hanna Foerster, Robert Mullins, Ilia Shumailov, and Jamie Hayes","link":"http://arxiv.org/abs/2406.10011v1","abstract":"Deep neural networks, costly to train and rich in intellectual property\nvalue, are increasingly threatened by model extraction attacks that compromise\ntheir confidentiality. Previous attacks have succeeded in reverse-engineering\nmodel parameters up to a precision of float64 for models trained on random data\nwith at most three hidden layers using cryptanalytical techniques. However, the\nprocess was identified to be very time consuming and not feasible for larger\nand deeper models trained on standard benchmarks. Our study evaluates the\nfeasibility of parameter extraction methods of Carlini et al. [1] further\nenhanced by Canales-Mart\\'inez et al. [2] for models trained on standard\nbenchmarks. We introduce a unified codebase that integrates previous methods\nand reveal that computational tools can significantly influence performance. We\ndevelop further optimisations to the end-to-end attack and improve the\nefficiency of extracting weight signs by up to 14.8 times compared to former\nmethods through the identification of easier and harder to extract neurons.\nContrary to prior assumptions, we identify extraction of weights, not\nextraction of weight signs, as the critical bottleneck. With our improvements,\na 16,721 parameter model with 2 hidden layers trained on MNIST is extracted\nwithin only 98 minutes compared to at least 150 minutes previously. Finally,\naddressing methodological deficiencies observed in previous studies, we propose\nnew ways of robust benchmarking for future model extraction attacks."},{"date":"2024-06","title":"RadEx: A Framework for Structured Information Extraction from Radiology Reports based on Large Language Models","author":"Daniel Reichenpfader, Jonas Knupp, Andr\u00e9 Sander, and Kerstin Denecke","link":"http://arxiv.org/abs/2406.15465v1","abstract":"Annually and globally, over three billion radiography examinations and\ncomputer tomography scans result in mostly unstructured radiology reports\ncontaining free text. Despite the potential benefits of structured reporting,\nits adoption is limited by factors such as established processes, resource\nconstraints and potential loss of information. However, structured information\nwould be necessary for various use cases, including automatic analysis,\nclinical trial matching, and prediction of health outcomes. This study\nintroduces RadEx, an end-to-end framework comprising 15 software components and\nten artifacts to develop systems that perform automated information extraction\nfrom radiology reports. It covers the complete process from annotating training\ndata to extracting information by offering a consistent generic information\nmodel and setting boundaries for model development. Specifically, RadEx allows\nclinicians to define relevant information for clinical domains (e.g.,\nmammography) and to create report templates. The framework supports both\ngenerative and encoder-only models and the decoupling of information extraction\nfrom template filling enables independent model improvements. 
Developing\ninformation extraction systems according to the RadEx framework facilitates\nimplementation and maintenance as components are easily exchangeable, while\nstandardized artifacts ensure interoperability between components."},{"date":"2024-06","title":"Zero-Shot Learning Over Large Output Spaces : Utilizing Indirect Knowledge Extraction from Large Language Models","author":"Jinbin Zhang, Nasib Ullah, and Rohit Babbar","link":"http://arxiv.org/abs/2406.09288v1","abstract":"Extreme Multi-label Learning (XMC) is a task that allocates the most relevant\nlabels for an instance from a predefined label set. Extreme Zero-shot XMC\n(EZ-XMC) is a special setting of XMC wherein no supervision is provided; only\nthe instances (raw text of the document) and the predetermined label set are\ngiven. The scenario is designed to address cold-start problems in\ncategorization and recommendation. Traditional state-of-the-art methods extract\npseudo labels from the document title or segments. These labels from the\ndocument are used to train a zero-shot bi-encoder model. The main issue with\nthese generated labels is their misalignment with the tagging task. In this\nwork, we propose a framework to train a small bi-encoder model via the feedback\nfrom the large language model (LLM), the bi-encoder model encodes the document\nand labels into embeddings for retrieval. Our approach leverages the zero-shot\nability of LLM to assess the correlation between labels and the document\ninstead of using the low-quality labels extracted from the document itself. Our\nmethod also guarantees fast inference without the involvement of LLM. The\nperformance of our approach outperforms the SOTA methods on various datasets\nwhile retaining a similar training time for large datasets."},{"date":"2024-06","title":"Research on Deep Learning Model of Feature Extraction Based on Convolutional Neural Network","author":"Houze Liu, Iris Li, Yaxin Liang, Dan Sun, Yining Yang, and Haowei Yang","link":"http://arxiv.org/abs/2406.08837v1","abstract":"Neural networks with relatively shallow layers and simple structures may have\nlimited ability in accurately identifying pneumonia. In addition, deep neural\nnetworks also have a large demand for computing resources, which may cause\nconvolutional neural networks to be unable to be implemented on terminals.\nTherefore, this paper will carry out the optimal classification of\nconvolutional neural networks. Firstly, according to the characteristics of\npneumonia images, AlexNet and InceptionV3 were selected to obtain better image\nrecognition results. Combining the features of medical images, the forward\nneural network with deeper and more complex structure is learned. Finally,\nknowledge extraction technology is used to extract the obtained data into the\nAlexNet model to achieve the purpose of improving computing efficiency and\nreducing computing costs. The results showed that the prediction accuracy,\nspecificity, and sensitivity of the trained AlexNet model increased by 4.25\npercentage points, 7.85 percentage points, and 2.32 percentage points,\nrespectively. 
The graphics processing usage has decreased by 51% compared to\nthe InceptionV3 model."},{"date":"2024-06","title":"A Combination Model for Time Series Prediction using LSTM via Extracting Dynamic Features Based on Spatial Smoothing and Sequential General Variational Mode Decomposition","author":"Jianyu Liu, Wei Chen, Yong Zhang, Zhenfeng Chen, Bin Wan, and Jinwei Hu","link":"http://arxiv.org/abs/2406.03144v1","abstract":"In order to solve problems in time series prediction such as the difficulty of\nextracting effective features and the low accuracy of sales volume prediction\ncaused by complex relationships such as market sales volume, we\npropose a time series prediction method of market sales volume based on a\nSequential General VMD and spatial smoothing Long short-term memory neural\nnetwork (SS-LSTM) combination model. Firstly, the spatial smoothing algorithm\nis used to decompose and calculate the sample data of related industry sectors\naffected by the linkage effect of market sectors, extracting modal features\ncontaining information via Sequential General VMD on overall market and\nspecific price trends; then, according to the background of different market\ndata sets, an LSTM network is used to model and predict the price of fundamental\ndata and modal characteristics. The experimental results of data prediction\nwith seasonal and periodic trends show that this method can achieve higher\nprice prediction accuracy in specific market contexts than traditional\nprediction methods and more accurately describe the changes in\nmarket sales volume."},{"date":"2024-06","title":"Stealing Image-to-Image Translation Models With a Single Query","author":"Nurit Spingarn-Eliezer, and Tomer Michaeli","link":"http://arxiv.org/abs/2406.00828v1","abstract":"Training deep neural networks requires significant computational resources\nand large datasets that are often confidential or expensive to collect. As a\nresult, owners tend to protect their models by allowing access only via an API.\nMany works demonstrated the possibility of stealing such protected models by\nrepeatedly querying the API. However, to date, research has predominantly\nfocused on stealing classification models, for which a very large number of\nqueries has been found necessary. In this paper, we study the possibility of\nstealing image-to-image models. Surprisingly, we find that many such models can\nbe stolen with as little as a single, small-sized, query image using simple\ndistillation. We study this phenomenon on a wide variety of model\narchitectures, datasets, and tasks, including denoising, deblurring, deraining,\nsuper-resolution, and biological image-to-image translation. Remarkably, we\nfind that the vulnerability to stealing attacks is shared by CNNs and by models\nwith attention mechanisms, and that stealing is commonly possible even without\nknowing the architecture of the target model."},{"date":"2024-05","title":"Large Language Model Watermark Stealing With Mixed Integer Programming","author":"Zhaoxi Zhang, Xiaomei Zhang, Yanjun Zhang, Leo Yu Zhang, Chao Chen, Shengshan Hu, Asif Gill, and Shirui Pan","link":"http://arxiv.org/abs/2405.19677v1","abstract":"The Large Language Model (LLM) watermark is a newly emerging technique that\nshows promise in addressing concerns surrounding LLM copyright, monitoring\nAI-generated text, and preventing its misuse. 
The LLM watermark scheme commonly\nincludes generating secret keys to partition the vocabulary into green and red\nlists, applying a perturbation to the logits of tokens in the green list to\nincrease their sampling likelihood, thus facilitating watermark detection to\nidentify AI-generated text if the proportion of green tokens exceeds a\nthreshold. However, recent research indicates that watermarking methods using\nnumerous keys are susceptible to removal attacks, such as token editing,\nsynonym substitution, and paraphrasing, with robustness declining as the number\nof keys increases. Therefore, the state-of-the-art watermark schemes that\nemploy fewer or single keys have been demonstrated to be more robust against\ntext editing and paraphrasing. In this paper, we propose a novel green list\nstealing attack against the state-of-the-art LLM watermark scheme and\nsystematically examine its vulnerability to this attack. We formalize the\nattack as a mixed integer programming problem with constraints. We evaluate our\nattack under a comprehensive threat model, including an extreme scenario where\nthe attacker has no prior knowledge, lacks access to the watermark detector\nAPI, and possesses no information about the LLM's parameter settings or\nwatermark injection/detection scheme. Extensive experiments on LLMs, such as\nOPT and LLaMA, demonstrate that our attack can successfully steal the green\nlist and remove the watermark across all settings."},{"date":"2024-05","title":"Extracting Heuristics from Large Language Models for Reward Shaping in Reinforcement Learning","author":"Siddhant Bhambri, Amrita Bhattacharjee, Durgesh Kalwar, Lin Guan, Huan Liu, and Subbarao Kambhampati","link":"http://arxiv.org/abs/2405.15194v2","abstract":"Reinforcement Learning (RL) suffers from sample inefficiency in sparse reward\ndomains, and the problem is further pronounced in case of stochastic\ntransitions. To improve the sample efficiency, reward shaping is a well-studied\napproach to introduce intrinsic rewards that can help the RL agent converge to\nan optimal policy faster. However, designing a useful reward shaping function\nfor all desirable states in the Markov Decision Process (MDP) is challenging,\neven for domain experts. Given that Large Language Models (LLMs) have\ndemonstrated impressive performance across a magnitude of natural language\ntasks, we aim to answer the following question: `Can we obtain heuristics using\nLLMs for constructing a reward shaping function that can boost an RL agent's\nsample efficiency?' To this end, we aim to leverage off-the-shelf LLMs to\ngenerate a plan for an abstraction of the underlying MDP. We further use this\nLLM-generated plan as a heuristic to construct the reward shaping signal for\nthe downstream RL agent. By characterizing the type of abstraction based on the\nMDP horizon length, we analyze the quality of heuristics when generated using\nan LLM, with and without a verifier in the loop. 
Our experiments across\nmultiple domains with varying horizon length and number of sub-goals from the\nBabyAI environment suite, Household, Mario, and, Minecraft domain, show 1) the\nadvantages and limitations of querying LLMs with and without a verifier to\ngenerate a reward shaping heuristic, and, 2) a significant improvement in the\nsample efficiency of PPO, A2C, and Q-learning when guided by the LLM-generated\nheuristics."},{"date":"2024-05","title":"Evaluating Large Language Models for Public Health Classification and Extraction Tasks","author":"Joshua Harris, Timothy Laurence, Leo Loman, Fan Grayson, Toby Nonnenmacher, Harry Long, Loes WalsGriffith, Amy Douglas, Holly Fountain, Stelios Georgiou, Jo Hardstaff, Kathryn Hopkins, Y-Ling Chi, Galena Kuyumdzhieva, Lesley Larkin, Samuel Collins, Hamish Mohammed, Thomas Finnie, Luke Hounsome, and Steven Riley","link":"http://arxiv.org/abs/2405.14766v1","abstract":"Advances in Large Language Models (LLMs) have led to significant interest in\ntheir potential to support human experts across a range of domains, including\npublic health. In this work we present automated evaluations of LLMs for public\nhealth tasks involving the classification and extraction of free text. We\ncombine six externally annotated datasets with seven new internally annotated\ndatasets to evaluate LLMs for processing text related to: health burden,\nepidemiological risk factors, and public health interventions. We initially\nevaluate five open-weight LLMs (7-70 billion parameters) across all tasks using\nzero-shot in-context learning. We find that Llama-3-70B-Instruct is the highest\nperforming model, achieving the best results on 15/17 tasks (using micro-F1\nscores). We see significant variation across tasks with all open-weight LLMs\nscoring below 60% micro-F1 on some challenging tasks, such as Contact\nClassification, while all LLMs achieve greater than 80% micro-F1 on others,\nsuch as GI Illness Classification. For a subset of 12 tasks, we also evaluate\nGPT-4 and find comparable results to Llama-3-70B-Instruct, which scores equally\nor outperforms GPT-4 on 6 of the 12 tasks. Overall, based on these initial\nresults we find promising signs that LLMs may be useful tools for public health\nexperts to extract information from a wide variety of free text sources, and\nsupport public health surveillance, research, and interventions."},{"date":"2024-05","title":"Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study","author":"Lena Schmidt, Kaitlyn Hair, Sergio Graziozi, Fiona Campbell, Claudia Kapp, Alireza Khanteymoori, Dawn Craig, Mark Engelbert, and James Thomas","link":"http://arxiv.org/abs/2405.14445v1","abstract":"This paper describes a rapid feasibility study of using GPT-4, a large\nlanguage model (LLM), to (semi)automate data extraction in systematic reviews.\nDespite the recent surge of interest in LLMs there is still a lack of\nunderstanding of how to design LLM-based automation tools and how to robustly\nevaluate their performance. During the 2023 Evidence Synthesis Hackathon we\nconducted two feasibility studies. Firstly, to automatically extract study\ncharacteristics from human clinical, animal, and social science domain studies.\nWe used two studies from each category for prompt-development; and ten for\nevaluation. Secondly, we used the LLM to predict Participants, Interventions,\nControls and Outcomes (PICOs) labelled within 100 abstracts in the EBM-NLP\ndataset. 
Overall, results indicated an accuracy of around 80%, with some\nvariability between domains (82% for human clinical, 80% for animal, and 72%\nfor studies of human social sciences). Causal inference methods and study\ndesign were the data extraction items with the most errors. In the PICO study,\nparticipants and intervention/control showed high accuracy (>80%), outcomes\nwere more challenging. Evaluation was done manually; scoring methods such as\nBLEU and ROUGE showed limited value. We observed variability in the LLMs\npredictions and changes in response quality. This paper presents a template for\nfuture evaluations of LLMs in the context of data extraction for systematic\nreview automation. Our results show that there might be value in using LLMs,\nfor example as second or third reviewers. However, caution is advised when\nintegrating models such as GPT-4 into tools. Further research on stability and\nreliability in practical settings is warranted for each type of data that is\nprocessed by the LLM."},{"date":"2024-05","title":"A Set-based Approach for Feature Extraction of 3D CAD Models","author":"Peng Xu, Qi Gao, and Ying-Jie Wu","link":"http://arxiv.org/abs/2406.18543v1","abstract":"Feature extraction is a critical technology to realize the automatic\ntransmission of feature information throughout product life cycles. As CAD\nmodels primarily capture the 3D geometry of products, feature extraction\nheavily relies on geometric information. However, existing feature extraction\nmethods often yield inaccurate outcomes due to the diverse interpretations of\ngeometric information. This report presents a set-based feature extraction\napproach to address this uncertainty issue. Unlike existing methods that seek\naccurate feature results, our approach aims to transform the uncertainty of\ngeometric information into a set of feature subgraphs. First, we define the\nconvexity of basic geometric entities and introduce the concept of two-level\nattributed adjacency graphs. Second, a feature extraction workflow is designed\nto determine feature boundaries and identify feature subgraphs from CAD models.\nThis set of feature subgraphs can be used for further feature recognition. A\nfeature extraction system is programmed using C++ and UG/Open to demonstrate\nthe feasibility of our proposed approach."},{"date":"2024-05","title":"Dataset Mention Extraction in Scientific Articles Using Bi-LSTM-CRF Model","author":"Tong Zeng, and Daniel Acuna","link":"http://arxiv.org/abs/2405.13135v1","abstract":"Datasets are critical for scientific research, playing an important role in\nreplication, reproducibility, and efficiency. Researchers have recently shown\nthat datasets are becoming more important for science to function properly,\neven serving as artifacts of study themselves. However, citing datasets is not\na common or standard practice in spite of recent efforts by data repositories\nand funding agencies. This greatly affects our ability to track their usage and\nimportance. A potential solution to this problem is to automatically extract\ndataset mentions from scientific articles. In this work, we propose to achieve\nsuch extraction by using a neural network based on a Bi-LSTM-CRF architecture.\nOur method achieves F1 = 0.885 in social science articles released as part of\nthe Rich Context Dataset. 
We discuss the limitations of the current datasets\nand propose modifications to the model to be done in the future."},{"date":"2024-05","title":"Efficient Model-Stealing Attacks Against Inductive Graph Neural Networks","author":"Marcin Podhajski, Jan Dubi\u0144ski, Franziska Boenisch, Adam Dziedzic, and Agnieszka Pregowska And Tomasz Michalak","link":"http://arxiv.org/abs/2405.12295v3","abstract":"Graph Neural Networks (GNNs) are recognized as potent tools for processing\nreal-world data organized in graph structures. Especially inductive GNNs, which\nallow for the processing of graph-structured data without relying on predefined\ngraph structures, are becoming increasingly important in a wide range of\napplications. As such these networks become attractive targets for\nmodel-stealing attacks where an adversary seeks to replicate the functionality\nof the targeted network. Significant efforts have been devoted to developing\nmodel-stealing attacks that extract models trained on images and texts.\nHowever, little attention has been given to stealing GNNs trained on graph\ndata. This paper identifies a new method of performing unsupervised\nmodel-stealing attacks against inductive GNNs, utilizing graph contrastive\nlearning and spectral graph augmentations to efficiently extract information\nfrom the targeted model. The new type of attack is thoroughly evaluated on six\ndatasets and the results show that our approach outperforms the current\nstate-of-the-art by Shen et al. (2021). In particular, our attack surpasses the\nbaseline across all benchmarks, attaining superior fidelity and downstream\naccuracy of the stolen model while necessitating fewer queries directed toward\nthe target model."},{"date":"2024-05","title":"Fully Exploiting Every Real Sample: SuperPixel Sample Gradient Model Stealing","author":"Yunlong Zhao, Xiaoheng Deng, Yijing Liu, Xinjun Pei, Jiazhi Xia, and Wei Chen","link":"http://arxiv.org/abs/2406.18540v1","abstract":"Model stealing (MS) involves querying and observing the output of a machine\nlearning model to steal its capabilities. The quality of queried data is\ncrucial, yet obtaining a large amount of real data for MS is often challenging.\nRecent works have reduced reliance on real data by using generative models.\nHowever, when high-dimensional query data is required, these methods are\nimpractical due to the high costs of querying and the risk of model collapse.\nIn this work, we propose using sample gradients (SG) to enhance the utility of\neach real sample, as SG provides crucial guidance on the decision boundaries of\nthe victim model. However, utilizing SG in the model stealing scenario faces\ntwo challenges: 1. Pixel-level gradient estimation requires extensive query\nvolume and is susceptible to defenses. 2. The estimation of sample gradients\nhas a significant variance. This paper proposes Superpixel Sample Gradient\nstealing (SPSG) for model stealing under the constraint of limited real\nsamples. With the basic idea of imitating the victim model's low-variance\npatch-level gradients instead of pixel-level gradients, SPSG achieves efficient\nsample gradient estimation through two steps. First, we perform patch-wise\nperturbations on query images to estimate the average gradient in different\nregions of the image. Then, we filter the gradients through a threshold\nstrategy to reduce variance. 
Exhaustive experiments demonstrate that, with the\nsame number of real samples, SPSG achieves accuracy, agreements, and\nadversarial success rate significantly surpassing the current state-of-the-art\nMS methods. Codes are available at https://github.com/zyl123456aB/SPSG_attack."},{"date":"2024-05","title":"Dynamic In-context Learning with Conversational Models for Data Extraction and Materials Property Prediction","author":"Chinedu Ekuma","link":"http://arxiv.org/abs/2405.10448v2","abstract":"The advent of natural language processing and large language models (LLMs)\nhas revolutionized the extraction of data from unstructured scholarly papers.\nHowever, ensuring data trustworthiness remains a significant challenge. In this\npaper, we introduce PropertyExtractor, an open-source tool that leverages\nadvanced conversational LLMs like Google gemini-pro and OpenAI gpt-4, blends\nzero-shot with few-shot in-context learning, and employs engineered prompts for\nthe dynamic refinement of structured information hierarchies - enabling\nautonomous, efficient, scalable, and accurate identification, extraction, and\nverification of material property data. Our tests on material data demonstrate\nprecision and recall that exceed 95% with an error rate of approximately 9%,\nhighlighting the effectiveness and versatility of the toolkit. Finally,\ndatabases for 2D material thicknesses, a critical parameter for device\nintegration, and energy bandgap values are developed using PropertyExtractor.\nSpecifically for the thickness database, the rapid evolution of the field has\noutpaced both experimental measurements and computational methods, creating a\nsignificant data gap. Our work addresses this gap and showcases the potential\nof PropertyExtractor as a reliable and efficient tool for the autonomous\ngeneration of various material property databases, advancing the field."},{"date":"2024-05","title":"Unsupervised Work Behavior Pattern Extraction Based on Hierarchical Probabilistic Model","author":"Issei Saito, Tomoaki Nakamura, Toshiyuki Hatta, Wataru Fujita, Shintaro Watanabe, and Shotaro Miwa","link":"http://arxiv.org/abs/2405.09838v1","abstract":"Evolving consumer demands and market trends have led to businesses\nincreasingly embracing a production approach that prioritizes flexibility and\ncustomization. Consequently, factory workers must engage in tasks that are more\ncomplex than before. Thus, productivity depends on each worker's skills in\nassembling products. Therefore, analyzing the behavior of a worker is crucial\nfor work improvement. However, manual analysis is time consuming and does not\nprovide quick and accurate feedback. Machine learning has been applied to\nautomate the analyses; however, most of these methods need several labels for\ntraining. To this end, we extend the Gaussian process hidden semi-Markov model\n(GP-HSMM) to enable the rapid and automated analysis of worker behavior\nwithout pre-training. The model does not require labeled data and can\nautomatically and accurately segment continuous motions into motion classes.\nThe proposed model is a probabilistic model that hierarchically connects\nGP-HSMM and HSMM, enabling the extraction of behavioral patterns with different\ngranularities. Furthermore, it mutually infers the parameters between the\nGP-HSMM and HSMM, resulting in accurate motion pattern extraction. We applied\nthe proposed method to motion data in which workers assembled products at an\nactual production site. 
The accuracy of behavior pattern extraction was\nevaluated using normalized Levenshtein distance (NLD). The smaller the value of\nNLD, the more accurate the pattern extraction. The NLD of motion patterns\ncaptured by the GP-HSMM and HSMM layers in our proposed method was 0.50 and 0.33,\nrespectively, which are smaller than those of all baseline methods."},{"date":"2024-05","title":"The object detection model uses combined extraction with KNN and RF classification","author":"Florentina Tatrin Kurniati, Daniel HF Manongga, Irwan Sembiring, Sutarto Wijono, and Roy Rudolf Huizen","link":"http://arxiv.org/abs/2405.05551v1","abstract":"Object detection plays an important role in various fields. Developing\ndetection models for 2D objects that experience rotation and texture variations\nis a challenge. In this research, the initial stage of the proposed model\nintegrates the gray-level co-occurrence matrix (GLCM) and local binary patterns\n(LBP) texture feature extraction to obtain feature vectors. The next stage is\nclassifying features using k-nearest neighbors (KNN) and random forest (RF), as\nwell as a voting ensemble (VE). System testing used a dataset of 4,437 2D images;\nthe results for KNN were an accuracy of 92.7% and an F1-score of 92.5%, while RF\nperformance was lower. Although GLCM features improve performance on both\nalgorithms, KNN is more consistent. The VE approach provides the best\nperformance with an accuracy of 93.9% and an F1 score of 93.8%, demonstrating the\neffectiveness of the ensemble technique in increasing object detection\naccuracy. This study contributes to the field of object detection with a new\napproach combining GLCM and LBP as feature vectors as well as VE for\nclassification."},{"date":"2024-05","title":"Special Characters Attack: Toward Scalable Training Data Extraction From Large Language Models","author":"Yang Bai, Ge Pei, Jindong Gu, Yong Yang, and Xingjun Ma","link":"http://arxiv.org/abs/2405.05990v2","abstract":"Large language models (LLMs) have achieved remarkable performance on a wide\nrange of tasks. However, recent studies have shown that LLMs can memorize\ntraining data and simple repeated tokens can trick the model into leaking the data.\nIn this paper, we take a step further and show that certain special characters\nor their combinations with English letters are stronger memory triggers,\nleading to more severe data leakage. The intuition is that, since LLMs are\ntrained with massive data that contains a substantial amount of special\ncharacters (e.g., structural symbols {, } of JSON files, and @, # in emails and\nonline posts), the model may memorize the co-occurrence between these special\ncharacters and the raw texts. This motivates us to propose a simple but\neffective Special Characters Attack (SCA) to induce training data leakage. Our\nexperiments verify the high effectiveness of SCA against state-of-the-art LLMs:\nthey can leak diverse training data, such as code corpus, web pages, and\npersonally identifiable information, and sometimes generate non-stop outputs as\na byproduct. We further show that the composition of the training data corpus\ncan be revealed by inspecting the leaked data -- one crucial piece of\ninformation for pre-training high-performance LLMs. 
Our work can help\nunderstand the sensitivity of LLMs to special characters and identify potential\nareas for improvement."},{"date":"2024-05","title":"Lightweight Spatial Modeling for Combinatorial Information Extraction From Documents","author":"Yanfei Dong, Lambert Deng, Jiazheng Zhang, Xiaodong Yu, Ting Lin, Francesco Gelli, Soujanya Poria, and Wee Sun Lee","link":"http://arxiv.org/abs/2405.06701v1","abstract":"Documents that consist of diverse templates and exhibit complex spatial\nstructures pose a challenge for document entity classification. We propose\nKNN-former, which incorporates a new kind of spatial bias in attention\ncalculation based on the K-nearest-neighbor (KNN) graph of document entities.\nWe limit entities' attention only to their local radius defined by the KNN\ngraph. We also use combinatorial matching to address the one-to-one mapping\nproperty that exists in many documents, where one field has only one\ncorresponding entity. Moreover, our method is highly parameter-efficient\ncompared to existing approaches in terms of the number of trainable parameters.\nDespite this, experiments across various datasets show our method outperforms\nbaselines on most entity types. Many real-world documents exhibit combinatorial\nproperties which can be leveraged as inductive biases to improve extraction\naccuracy, but existing datasets do not cover these documents. To facilitate\nfuture research into these types of documents, we release a new ID document\ndataset that covers diverse templates and languages. We also release enhanced\nannotations for an existing dataset."},{"date":"2024-05","title":"ModelShield: Adaptive and Robust Watermark against Model Extraction Attack","author":"Kaiyi Pang, Tao Qi, Chuhan Wu, Minhao Bai, Minghu Jiang, and Yongfeng Huang","link":"http://arxiv.org/abs/2405.02365v3","abstract":"Large language models (LLMs) demonstrate general intelligence across a\nvariety of machine learning tasks, thereby enhancing the commercial value of\ntheir intellectual property (IP). To protect this IP, model owners typically\nallow user access only in a black-box manner; however, adversaries can still\nutilize model extraction attacks to steal the model intelligence encoded in\nmodel generation. Watermarking technology offers a promising solution for\ndefending against such attacks by embedding unique identifiers into the\nmodel-generated content. However, existing watermarking methods often\ncompromise the quality of generated content due to heuristic alterations and\nlack robust mechanisms to counteract adversarial strategies, thus limiting\ntheir practicality in real-world scenarios. In this paper, we introduce an\nadaptive and robust watermarking method (named ModelShield) to protect the IP\nof LLMs. Our method incorporates a self-watermarking mechanism that allows LLMs\nto autonomously insert watermarks into their generated content to avoid the\ndegradation of model content. We also propose a robust watermark detection\nmechanism capable of effectively identifying watermark signals under the\ninterference of varying adversarial strategies. Moreover, ModelShield is a\nplug-and-play method that does not require additional model training, enhancing\nits applicability in LLM deployments. 
Extensive evaluations on two real-world\ndatasets and three LLMs demonstrate that our method surpasses existing methods\nin terms of defense effectiveness and robustness while significantly reducing\nthe degradation that watermarking causes to the model-generated content."},{"date":"2024-05","title":"Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language Models","author":"Hye Sun Yun, David Pogrebitskiy, Iain J. Marshall, and Byron C. Wallace","link":"http://arxiv.org/abs/2405.01686v2","abstract":"Meta-analyses statistically aggregate the findings of different randomized\ncontrolled trials (RCTs) to assess treatment effectiveness. Because this yields\nrobust estimates of treatment effectiveness, results from meta-analyses are\nconsidered the strongest form of evidence. However, rigorous evidence syntheses\nare time-consuming and labor-intensive, requiring manual extraction of data\nfrom individual trials to be synthesized. Ideally, language technologies would\npermit fully automatic meta-analysis, on demand. This requires accurately\nextracting numerical results from individual trials, which has been beyond the\ncapabilities of natural language processing (NLP) models to date. In this work,\nwe evaluate whether modern large language models (LLMs) can reliably perform\nthis task. We annotate (and release) a modest but granular evaluation dataset\nof clinical trial reports with numerical findings attached to interventions,\ncomparators, and outcomes. Using this dataset, we evaluate the performance of\nseven LLMs applied zero-shot for the task of conditionally extracting numerical\nfindings from trial reports. We find that massive LLMs that can accommodate\nlengthy inputs are tantalizingly close to realizing fully automatic\nmeta-analysis, especially for dichotomous (binary) outcomes (e.g., mortality).\nHowever, LLMs -- including ones trained on biomedical texts -- perform poorly\nwhen the outcome measures are complex and tallying the results requires\ninference. This work charts a path toward fully automatic meta-analysis of RCTs\nvia LLMs, while also highlighting the limitations of existing models for this\naim."},{"date":"2024-05","title":"Enhancing Language Models for Financial Relation Extraction with Named Entities and Part-of-Speech","author":"Menglin Li, and Kwan Hui Lim","link":"http://arxiv.org/abs/2405.06665v1","abstract":"The Financial Relation Extraction (FinRE) task involves identifying the\nentities and their relation, given a piece of financial statement/text. To\nsolve this FinRE problem, we propose a simple but effective strategy that\nimproves the performance of pre-trained language models by augmenting them with\nNamed Entity Recognition (NER) and Part-Of-Speech (POS), as well as different\napproaches to combine this information. Experiments on a financial relations\ndataset show promising results and highlight the benefits of incorporating NER\nand POS in existing models. Our dataset and code are available at\nhttps://github.com/kwanhui/FinRelExtract."},{"date":"2024-04","title":"ECC Analyzer: Extract Trading Signal from Earnings Conference Calls using Large Language Model for Stock Performance Prediction","author":"Yupeng Cao, Zhi Chen, Qingyun Pei, Nathan Jinseok Lee, K. P. 
Subbalakshmi, and Papa Momar Ndiaye","link":"http://arxiv.org/abs/2404.18470v2","abstract":"In the realm of financial analytics, leveraging unstructured data, such as\nearnings conference calls (ECCs), to forecast stock volatility is a critical\nchallenge that has attracted both academics and investors. While previous\nstudies have used multimodal deep learning-based models to obtain a general\nview of ECCs for volatility prediction, they often fail to capture detailed,\ncomplex information. Our research introduces a novel framework: ECC\nAnalyzer, which utilizes large language models (LLMs) to extract richer, more\npredictive content from ECCs to aid the model's prediction performance. We use\nthe pre-trained large models to extract textual and audio features from ECCs\nand implement a hierarchical information extraction strategy to extract more\nfine-grained information. This strategy first extracts paragraph-level general\ninformation by summarizing the text and then extracts fine-grained focus\nsentences using Retrieval-Augmented Generation (RAG). These features are then\nfused through multimodal feature fusion to perform volatility prediction.\nExperimental results demonstrate that our model outperforms traditional\nanalytical benchmarks, confirming the effectiveness of advanced LLM techniques\nin financial analysis."},{"date":"2024-04","title":"Learnable Linguistic Watermarks for Tracing Model Extraction Attacks on Large Language Models","author":"Minhao Bai, Kaiyi Pang, and Yongfeng Huang","link":"http://arxiv.org/abs/2405.01509v1","abstract":"In the rapidly evolving domain of artificial intelligence, safeguarding the\nintellectual property of Large Language Models (LLMs) is increasingly crucial.\nCurrent watermarking techniques against model extraction attacks, which rely on\nsignal insertion in model logits or post-processing of generated text, remain\nlargely heuristic. We propose a novel method for embedding learnable linguistic\nwatermarks in LLMs, aimed at tracing and preventing model extraction attacks.\nOur approach subtly modifies the LLM's output distribution by introducing\ncontrolled noise into token frequency distributions, embedding a statistically\nidentifiable, controllable watermark. We leverage statistical hypothesis testing\nand information theory, particularly focusing on Kullback-Leibler Divergence,\nto differentiate between original and modified distributions effectively. Our\nwatermarking method strikes a delicate balance between robustness and\noutput quality, maintaining low false positive/negative rates and preserving\nthe LLM's original performance."},{"date":"2024-04","title":"Utilizing Large Language Models for Information Extraction from Real Estate Transactions","author":"Yu Zhao, and Haoxiang Gao","link":"http://arxiv.org/abs/2404.18043v1","abstract":"Real estate sales contracts contain crucial information for property\ntransactions, but manual extraction of data can be time-consuming and\nerror-prone. This paper explores the application of large language models,\nspecifically transformer-based architectures, for automated information\nextraction from real estate contracts. 
We discuss challenges, techniques, and\nfuture directions in leveraging these models to improve efficiency and accuracy\nin real estate contract analysis."},{"date":"2024-04","title":"Empirical Analysis of Dialogue Relation Extraction with Large Language Models","author":"Guozheng Li, Zijie Xu, Ziyu Shang, Jiajun Liu, Ke Ji, and Yikai Guo","link":"http://arxiv.org/abs/2404.17802v1","abstract":"Dialogue relation extraction (DRE) aims to extract relations between two\narguments within a dialogue, which is more challenging than standard RE due to\nthe higher person pronoun frequency and lower information density in dialogues.\nHowever, existing DRE methods still suffer from two serious issues: (1) difficulty\ncapturing long and sparse multi-turn information, and (2) difficulty extracting\ngolden relations based on partial dialogues, which motivates us to discover\nmore effective methods that can alleviate the above issues. We notice that the\nrise of large language models (LLMs) has sparked considerable interest in\nevaluating their performance across diverse tasks. To this end, we initially\ninvestigate the capabilities of different LLMs in DRE, considering both\nproprietary models and open-source models. Interestingly, we discover that LLMs\nsignificantly alleviate two issues in existing DRE methods. Generally, we have the\nfollowing findings: (1) scaling up model size substantially boosts the overall\nDRE performance and achieves exceptional results, tackling the difficulty of\ncapturing long and sparse multi-turn information; (2) LLMs suffer a much\nsmaller performance drop from the entire-dialogue setting to the partial-dialogue\nsetting compared to existing methods; (3) LLMs deliver competitive or superior\nperformance under both full-shot and few-shot settings compared to the current\nstate-of-the-art; (4) LLMs show modest performance on inverse relations but\nmuch stronger improvements on general relations, and they can handle dialogues\nof various lengths, especially longer sequences."},{"date":"2024-04","title":"GraphER: A Structure-aware Text-to-Graph Model for Entity and Relation Extraction","author":"Urchade Zaratiana, Nadi Tomeh, Niama El Khbir, Pierre Holat, and Thierry Charnois","link":"http://arxiv.org/abs/2404.12491v1","abstract":"Information extraction (IE) is an important task in Natural Language\nProcessing (NLP), involving the extraction of named entities and their\nrelationships from unstructured text. In this paper, we propose a novel\napproach to this task by formulating it as graph structure learning (GSL). By\nformulating IE as GSL, we enhance the model's ability to dynamically refine and\noptimize the graph structure during the extraction process. This formulation\nallows for better interaction and structure-informed decisions for entity and\nrelation prediction, in contrast to previous models that have separate or\nuntied predictions for these tasks. 
When compared against state-of-the-art\nbaselines on joint entity and relation extraction benchmarks, our model,\nGraphER, achieves competitive results."},{"date":"2024-04","title":"AI-Enhanced Cognitive Behavioral Therapy: Deep Learning and Large Language Models for Extracting Cognitive Pathways from Social Media Texts","author":"Meng Jiang, Yi Jing Yu, Qing Zhao, Jianqiang Li, Changwei Song, Hongzhi Qi, Wei Zhai, Dan Luo, Xiaoqin Wang, Guanghui Fu, and Bing Xiang Yang","link":"http://arxiv.org/abs/2404.11449v1","abstract":"Cognitive Behavioral Therapy (CBT) is an effective technique for addressing\nthe irrational thoughts stemming from mental illnesses, but it necessitates\nprecise identification of cognitive pathways to be successfully implemented in\npatient care. In current society, individuals frequently express negative\nemotions on social media on specific topics, often exhibiting cognitive\ndistortions, including suicidal behaviors in extreme cases. Yet, there is a\nnotable absence of methodologies for analyzing cognitive pathways that could\naid psychotherapists in conducting effective interventions online. In this\nstudy, we gathered data from social media and established the task of\nextracting cognitive pathways, annotating the data based on a cognitive\ntheoretical framework. We initially categorized the task of extracting\ncognitive pathways as a hierarchical text classification task with four main\ncategories and nineteen subcategories. Following this, we structured a text\nsummarization task to help psychotherapists quickly grasp the essential\ninformation. Our experiments evaluate the performance of deep learning and\nlarge language models (LLMs) on these tasks. The results demonstrate that our\ndeep learning method achieved a micro-F1 score of 62.34% in the hierarchical\ntext classification task. Meanwhile, in the text summarization task, GPT-4\nattained a Rouge-1 score of 54.92 and a Rouge-2 score of 30.86, surpassing the\nexperimental deep learning model's performance. However, it may suffer from an\nissue of hallucination. We have made all models and code publicly available to\nsupport further research in this field."},{"date":"2024-04","title":"TransLinkGuard: Safeguarding Transformer Models Against Model Stealing in Edge Deployment","author":"Qinfeng Li, Zhiqiang Shen, Zhenghan Qin, Yangfan Xie, Xuhong Zhang, Tianyu Du, and Jianwei Yin","link":"http://arxiv.org/abs/2404.11121v1","abstract":"Proprietary large language models (LLMs) have been widely applied in various\nscenarios. Additionally, deploying LLMs on edge devices is trending for\nefficiency and privacy reasons. However, edge deployment of proprietary LLMs\nintroduces new security challenges: edge-deployed models are exposed as\nwhite-box accessible to users, enabling adversaries to conduct effective model\nstealing (MS) attacks. Unfortunately, existing defense mechanisms fail to\nprovide effective protection. Specifically, we identify four critical\nprotection properties that existing methods fail to simultaneously satisfy: (1)\nmaintaining protection after a model is physically copied; (2) authorizing\nmodel access at the request level; (3) safeguarding against runtime reverse engineering;\n(4) achieving high security with negligible runtime overhead. To address the\nabove issues, we propose TransLinkGuard, a plug-and-play model protection\napproach against model stealing on edge devices. The core part of\nTransLinkGuard is a lightweight authorization module residing in a secure\nenvironment, e.g., a TEE. 
The authorization module can freshly authorize each\nrequest based on its input. Extensive experiments show that TransLinkGuard\nachieves security protection equivalent to black-box security guarantees with\nnegligible overhead."},{"date":"2024-04","title":"A LayoutLMv3-Based Model for Enhanced Relation Extraction in Visually-Rich Documents","author":"Wiam Adnan, Joel Tang, Yassine Bel Khayat Zouggari, Seif Edinne Laatiri, Laurent Lam, and Fabien Caspani","link":"http://arxiv.org/abs/2404.10848v1","abstract":"Document Understanding is an evolving field in Natural Language Processing\n(NLP). In particular, visual and spatial features are essential in addition to\nthe raw text itself; hence, several multimodal models have been developed in the\nfield of Visual Document Understanding (VDU). However, while research is mainly\nfocused on Key Information Extraction (KIE), Relation Extraction (RE) between\nidentified entities is still under-studied. For instance, RE is crucial to\nregroup entities or obtain a comprehensive hierarchy of data in a document. In\nthis paper, we present a model that, initialized from LayoutLMv3, can match or\noutperform the current state-of-the-art results in RE applied to Visually-Rich\nDocuments (VRD) on the FUNSD and CORD datasets, without any specific pre-training\nand with fewer parameters. We also report an extensive ablation study performed\non FUNSD, highlighting the great impact of certain features and modeling\nchoices on performance."},{"date":"2024-04","title":"Relation Extraction Using Large Language Models: A Case Study on Acupuncture Point Locations","author":"Yiming Li, Xueqing Peng, Jianfu Li, Xu Zuo, Suyuan Peng, Donghong Pei, Cui Tao, Hua Xu, and Na Hong","link":"http://arxiv.org/abs/2404.05415v2","abstract":"In acupuncture therapy, the accurate location of acupoints is essential for\nits effectiveness. The advanced language understanding capabilities of large\nlanguage models (LLMs) like Generative Pre-trained Transformers (GPT) present a\nsignificant opportunity for extracting relations related to acupoint locations\nfrom textual knowledge sources. This study aims to compare the performance of\nGPT with traditional deep learning models (Long Short-Term Memory (LSTM) and\nBidirectional Encoder Representations from Transformers for Biomedical Text\nMining (BioBERT)) in extracting acupoint-related location relations and assess\nthe impact of pretraining and fine-tuning on GPT's performance. We utilized the\nWorld Health Organization Standard Acupuncture Point Locations in the Western\nPacific Region (WHO Standard) as our corpus, which consists of descriptions of\n361 acupoints. Five types of relations ('direction_of,' 'distance_of,'\n'part_of,' 'near_acupoint,' and 'located_near') (n = 3,174) between acupoints\nwere annotated. Five models were compared: BioBERT, LSTM, pre-trained GPT-3.5,\nfine-tuned GPT-3.5, as well as pre-trained GPT-4. Performance metrics included\nmicro-average exact match precision, recall, and F1 scores. Our results\ndemonstrate that fine-tuned GPT-3.5 consistently outperformed other models in\nF1 scores across all relation types. Overall, it achieved the highest\nmicro-average F1 score of 0.92. This study underscores the effectiveness of\nLLMs like GPT in extracting relations related to acupoint locations, with\nimplications for accurately modeling acupuncture knowledge and promoting\nstandard implementation in acupuncture training and practice. 
The findings also\ncontribute to advancing informatics applications in traditional and\ncomplementary medicine, showcasing the potential of LLMs in natural language\nprocessing."},{"date":"2024-04","title":"PerkwE_COQA: Enhanced Persian Conversational Question Answering by combining contextual keyword extraction with Large Language Models","author":"Pardis Moradbeiki, and Nasser Ghadiri","link":"http://arxiv.org/abs/2404.05406v2","abstract":"Smart cities need the involvement of their residents to enhance quality of\nlife. Conversational query-answering is an emerging approach for user\nengagement. There is an increasing demand for advanced conversational\nquestion answering that goes beyond classic systems. Existing approaches have\nshown that LLMs offer promising capabilities for CQA, but may struggle to\ncapture the nuances of conversational contexts. The new approach involves\nunderstanding the content and engaging in a multi-step conversation with the\nuser to fulfill their needs. This paper presents a novel method to elevate the\nperformance of Persian Conversational question-answering (CQA) systems. It\ncombines the strengths of Large Language Models (LLMs) with contextual keyword\nextraction. Our method extracts keywords specific to the conversational flow,\nproviding the LLM with additional context to understand the user's intent and\ngenerate more relevant and coherent responses. We evaluated the effectiveness\nof this combined approach through various metrics, demonstrating significant\nimprovements in CQA performance compared to an LLM-only baseline. The proposed\nmethod effectively handles implicit questions, delivers contextually relevant\nanswers, and tackles complex questions that rely heavily on conversational\ncontext. The findings indicate that our method outperformed existing methods\nand the LLM-only baseline by up to 8% on the evaluation benchmarks."},{"date":"2024-04","title":"GLCM-Based Feature Combination for Extraction Model Optimization in Object Detection Using Machine Learning","author":"Florentina Tatrin Kurniati, Daniel HF Manongga, Eko Sediyono, Sri Yulianto Joko Prasetyo, and Roy Rudolf Huizen","link":"http://arxiv.org/abs/2404.04578v1","abstract":"In the era of modern technology, object detection using the Gray Level\nCo-occurrence Matrix (GLCM) extraction method plays a crucial role in object\nrecognition processes. It finds applications in real-time scenarios such as\nsecurity surveillance and autonomous vehicle navigation, among others.\nComputational efficiency becomes a critical factor in achieving real-time\nobject detection. Hence, there is a need for a detection model with low\ncomplexity and satisfactory accuracy. This research aims to enhance\ncomputational efficiency by selecting appropriate features within the GLCM\nframework. Two classification models, namely K-Nearest Neighbours (K-NN) and\nSupport Vector Machine (SVM), were employed, with the results indicating that\nK-Nearest Neighbours (K-NN) outperforms SVM in terms of computational\ncomplexity. Specifically, K-NN, when utilizing a combination of Correlation,\nEnergy, and Homogeneity features, achieves a 100% accuracy rate with low\ncomplexity. Moreover, when using a combination of Energy and Homogeneity\nfeatures, K-NN attains an almost perfect accuracy level of 99.9889%, while\nmaintaining low complexity. 
On the other hand, despite SVM achieving 100%\naccuracy in certain feature combinations, its high or very high complexity can\npose challenges, particularly in real-time applications. Therefore, based on\nthe trade-off between accuracy and complexity, the K-NN model with a\ncombination of Correlation, Energy, and Homogeneity features emerges as a more\nsuitable choice for real-time applications that demand high accuracy and low\ncomplexity. This research provides valuable insights for optimizing object\ndetection in various applications requiring both high accuracy and rapid\nresponsiveness."},{"date":"2024-04","title":"Knowledge Distillation-Based Model Extraction Attack using GAN-based Private Counterfactual Explanations","author":"Fatima Ezzeddine, Omran Ayoub, and Silvia Giordano","link":"http://arxiv.org/abs/2404.03348v2","abstract":"In recent years, there has been a notable increase in the deployment of\nmachine learning (ML) models as services (MLaaS) across diverse production\nsoftware applications. In parallel, explainable AI (XAI) continues to evolve,\naddressing the necessity for transparency and trustworthiness in ML models. XAI\ntechniques aim to enhance the transparency of ML models by providing insights,\nin terms of model explanations, into their decision-making process.\nSimultaneously, some MLaaS platforms now offer explanations alongside the ML\nprediction outputs. This setup has elevated concerns regarding vulnerabilities\nin MLaaS, particularly in relation to privacy leakage attacks such as model\nextraction attacks (MEA). This is due to the fact that explanations can unveil\ninsights about the inner workings of the model which could be exploited by\nmalicious users. In this work, we focus on investigating how model\nexplanations, particularly counterfactual explanations (CFs), can be exploited\nfor performing MEA within the MLaaS platform. We also delve into assessing the\neffectiveness of incorporating differential privacy (DP) as a mitigation\nstrategy. To this end, we first propose a novel approach for MEA based on\nKnowledge Distillation (KD) to enhance the efficiency of extracting a\nsubstitute model of a target model exploiting CFs, without any knowledge about\nthe training data distribution by the attacker. Then, we devise an approach for\ntraining CF generators incorporating DP to generate private CFs. We conduct\nthorough experimental evaluations on real-world datasets and demonstrate that\nour proposed KD-based MEA can yield a high-fidelity substitute model with a\nreduced number of queries with respect to baseline approaches. Furthermore, our\nfindings reveal that including a privacy layer can help mitigate the MEA;\nhowever, this comes at the cost of CF quality, which impacts the performance of the\nexplanations."},{"date":"2024-04","title":"Comparative Study of Domain Driven Terms Extraction Using Large Language Models","author":"Sandeep Chataut, Tuyen Do, Bichar Dip Shrestha Gurung, Shiva Aryal, Anup Khanal, Carol Lushbough, and Etienne Gnimpieba","link":"http://arxiv.org/abs/2404.02330v1","abstract":"Keywords play a crucial role in bridging the gap between human understanding\nand machine processing of textual data. They are essential to data enrichment\nbecause they form the basis for detailed annotations that provide a more\ninsightful and in-depth view of the underlying data. Keyword/domain-driven term\nextraction is a pivotal task in natural language processing, facilitating\ninformation retrieval, document summarization, and content categorization. 
This\nreview focuses on keyword extraction methods, emphasizing the use of three\nmajor Large Language Models (LLMs): Llama2-7B, GPT-3.5, and Falcon-7B. We\nemployed a custom Python package to interface with these LLMs, simplifying\nkeyword extraction. Our study, utilizing the Inspec and PubMed datasets,\nevaluates the performance of these models. The Jaccard similarity index was\nused for assessment, yielding scores of 0.64 (Inspec) and 0.21 (PubMed) for\nGPT-3.5, 0.40 and 0.17 for Llama2-7B, and 0.23 and 0.12 for Falcon-7B. This\npaper underlines the role of prompt engineering in LLMs for better keyword\nextraction and discusses the impact of hallucination in LLMs on result\nevaluation. It also sheds light on the challenges in using LLMs for keyword\nextraction, including model complexity, resource demands, and optimization\ntechniques."},{"date":"2024-04","title":"Towards System Modelling to Support Diseases Data Extraction from the Electronic Health Records for Physicians Research Activities","author":"Bushra F. Alsaqer, Alaa F. Alsaqer, and Amna Asif","link":"http://arxiv.org/abs/2404.01218v1","abstract":"The use of Electronic Health Records (EHRs) has increased dramatically in the\npast 15 years, as they are considered an important source for managing patient\ndata. The EHRs are primary sources of disease diagnosis and demographic\ndata of patients worldwide. Therefore, the data can be utilized for secondary\ntasks such as research. This paper aims to make such data usable for research\nactivities such as monitoring disease statistics for a specific population. As\na result, the researchers can detect the disease causes from the behavior and\nlifestyle of the target group. One of the limitations of EHR systems is that\nthe data is not available in a standard format but in various forms.\nTherefore, it is required to first convert the names of the diseases and\ndemographics data into one standardized form to make it usable for research\nactivities. There is a large amount of EHRs available, and solving the\nstandardizing issues requires some optimized techniques. We used a first-hand\nEHR dataset extracted from EHR systems. Our application uploads the dataset\nfrom the EHRs and converts it to the ICD-10 coding system to solve the\nstandardization problem. So, we first apply the steps of pre-processing,\nannotation, and transforming the data to convert it into the standard form. The\ndata pre-processing is applied to normalize demographic formats. In the\nannotation step, a machine learning model is used to recognize the diseases\nfrom the text. Furthermore, the transforming step converts the disease name to\nthe ICD-10 coding format. The model was evaluated manually by comparing its\nperformance in terms of disease recognition with an available dictionary-based\nsystem (MetaMap). The accuracy of the proposed machine learning model is 81%,\nwhich outperformed MetaMap's accuracy of 67%. This paper contributed to system\nmodelling for EHR data extraction to support research activities."},{"date":"2024-03","title":"MIPS at SemEval-2024 Task 3: Multimodal Emotion-Cause Pair Extraction in Conversations with Multimodal Language Models","author":"Zebang Cheng, Fuqiang Niu, Yuxiang Lin, Zhi-Qi Cheng, Bowen Zhang, and Xiaojiang Peng","link":"http://arxiv.org/abs/2404.00511v3","abstract":"This paper presents our winning submission to Subtask 2 of SemEval 2024 Task\n3 on multimodal emotion cause analysis in conversations. 
We propose a novel\nMultimodal Emotion Recognition and Multimodal Emotion Cause Extraction\n(MER-MCE) framework that integrates text, audio, and visual modalities using\nspecialized emotion encoders. Our approach sets itself apart from\ntop-performing teams by leveraging modality-specific features for enhanced\nemotion understanding and causality inference. Experimental evaluation\ndemonstrates the advantages of our multimodal approach, with our submission\nachieving a competitive weighted F1 score of 0.3435, ranking third with a\nmargin of only 0.0339 behind the 1st team and 0.0025 behind the 2nd team.\nProject: https://github.com/MIPS-COLT/MER-MCE.git"},{"date":"2024-03","title":"Privacy Backdoors: Stealing Data with Corrupted Pretrained Models","author":"Shanglun Feng, and Florian Tram\u00e8r","link":"http://arxiv.org/abs/2404.00473v1","abstract":"Practitioners commonly download pretrained machine learning models from open\nrepositories and finetune them to fit specific applications. We show that this\npractice introduces a new risk of privacy backdoors. By tampering with a\npretrained model's weights, an attacker can fully compromise the privacy of the\nfinetuning data. We show how to build privacy backdoors for a variety of\nmodels, including transformers, which enable an attacker to reconstruct\nindividual finetuning samples, with guaranteed success! We further show that\nbackdoored models allow for tight privacy attacks on models trained with\ndifferential privacy (DP). The common optimistic practice of training DP models\nwith loose privacy guarantees is thus insecure if the model is not trusted.\nOverall, our work highlights a crucial and overlooked supply chain attack on\nmachine learning privacy."},{"date":"2024-03","title":"Efficient Data-Free Model Stealing with Label Diversity","author":"Yiyong Liu, Rui Wen, Michael Backes, and Yang Zhang","link":"http://arxiv.org/abs/2404.00108v1","abstract":"Machine learning as a Service (MLaaS) allows users to query the machine\nlearning model in an API manner, which provides an opportunity for users to\nenjoy the benefits brought by the high-performance model trained on valuable\ndata. This interface boosts the proliferation of machine learning-based\napplications, while on the other hand, it introduces the attack surface for\nmodel stealing attacks. Existing model stealing attacks have relaxed their\nattack assumptions to the data-free setting, while maintaining their effectiveness.\nHowever, these methods are complex and consist of several components, which\nobscure the core on which the attack really depends. In this paper, we revisit\nthe model stealing problem from a diversity perspective and demonstrate that\nkeeping the generated data samples more diverse across all the classes is the\ncritical point for improving the attack performance. Based on this conjecture,\nwe provide a simplified attack framework. We empirically verify our conjecture\nby evaluating the effectiveness of our attack, and experimental results show\nthat our approach is able to achieve comparable or even better performance\ncompared with the state-of-the-art method. 
Furthermore, benefiting from the\nabsence of redundant components, our method demonstrates its advantages in\nattack efficiency and query budget."},{"date":"2024-03","title":"Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models","author":"Hyunbyung Park, Sukyung Lee, Gyoungjin Gim, Yungi Kim, Dahyun Kim, and Chanjun Park","link":"http://arxiv.org/abs/2403.19340v1","abstract":"To address the challenges associated with data processing at scale, we\npropose Dataverse, a unified open-source Extract-Transform-Load (ETL) pipeline\nfor large language models (LLMs) with a user-friendly design at its core. The easy\naddition of custom processors through a block-based interface in Dataverse allows\nusers to readily and efficiently use Dataverse to build their own ETL pipelines.\nWe hope that Dataverse will serve as a vital tool for LLM development, and we\nopen-source the entire library to welcome community contributions. Additionally, we\nprovide a concise, two-minute video demonstration of our system, illustrating\nits capabilities and implementation."},{"date":"2024-03","title":"MisGUIDE: Defense Against Data-Free Deep Learning Model Extraction","author":"Mahendra Gurve, Sankar Behera, Satyadev Ahlawat, and Yamuna Prasad","link":"http://arxiv.org/abs/2403.18580v1","abstract":"The rise of Machine Learning as a Service (MLaaS) has led to the widespread\ndeployment of machine learning models trained on diverse datasets. These models\nare employed for predictive services through APIs, raising concerns about the\nsecurity and confidentiality of the models due to emerging vulnerabilities in\nprediction APIs. Of particular concern are model cloning attacks, where\nindividuals with limited data and no knowledge of the training dataset manage\nto replicate a victim model's functionality through black-box query access.\nThis commonly entails generating adversarial queries to query the victim model,\nthereby creating a labeled dataset.\n This paper proposes \"MisGUIDE\", a two-step defense framework for Deep\nLearning models that disrupts the adversarial sample generation process by\nproviding a probabilistic response when the query is deemed OOD. The first step\nemploys a Vision Transformer-based framework to identify OOD queries, while the\nsecond step perturbs the response for such queries, introducing a probabilistic\nloss function to MisGUIDE the attackers. The aim of the proposed defense method\nis to reduce the accuracy of the cloned model while maintaining accuracy on\nauthentic queries. Extensive experiments conducted on two benchmark datasets\ndemonstrate that the proposed framework significantly enhances the resistance\nagainst state-of-the-art data-free model extraction in black-box settings."},{"date":"2024-03","title":"A Path Towards Legal Autonomy: An interoperable and explainable approach to extracting, transforming, loading and computing legal information using large language models, expert systems and Bayesian networks","author":"Axel Constant, Hannes Westermann, Bryan Wilson, Alex Kiefer, Ines Hipolito, Sylvain Pronovost, Steven Swanson, Mahault Albarracin, and Maxwell J. D. Ramstead","link":"http://arxiv.org/abs/2403.18537v1","abstract":"Legal autonomy - the lawful activity of artificial intelligence agents - can\nbe achieved in one of two ways. 
It can be achieved either by imposing\nconstraints on AI actors such as developers, deployers, and users, and on AI\nresources such as data, or by imposing constraints on the range and scope of\nthe impact that AI agents can have on the environment. The latter approach\ninvolves encoding extant rules concerning AI-driven devices into the software\nof AI agents controlling those devices (e.g., encoding rules about limitations\non zones of operations into the agent software of an autonomous drone device).\nThis is a challenge since the effectiveness of such an approach requires a method\nof extracting, loading, transforming, and computing legal information that would\nbe both explainable and legally interoperable, and that would enable AI agents\nto reason about the law. In this paper, we sketch a proof of principle for such\na method using large language models (LLMs), expert legal systems known as\nlegal decision paths, and Bayesian networks. We then show how the proposed\nmethod could be applied to extant regulation in matters of autonomous cars,\nsuch as the California Vehicle Code."},{"date":"2024-03","title":"Segment Anything Model for Road Network Graph Extraction","author":"Congrui Hetang, Haoru Xue, Cindy Le, Tianwei Yue, Wenping Wang, and Yihui He","link":"http://arxiv.org/abs/2403.16051v3","abstract":"We propose SAM-Road, an adaptation of the Segment Anything Model (SAM) for\nextracting large-scale, vectorized road network graphs from satellite imagery.\nTo predict graph geometry, we formulate it as a dense semantic segmentation\ntask, leveraging the inherent strengths of SAM. The image encoder of SAM is\nfine-tuned to produce probability masks for roads and intersections, from which\nthe graph vertices are extracted via simple non-maximum suppression. To predict\ngraph topology, we designed a lightweight transformer-based graph neural\nnetwork, which leverages the SAM image embeddings to estimate the edge\nexistence probabilities between vertices. Our approach directly predicts the\ngraph vertices and edges for large regions without expensive and complex\npost-processing heuristics, and is capable of building complete road network\ngraphs spanning multiple square kilometers in a matter of seconds. With its\nsimple, straightforward, and minimalist design, SAM-Road achieves accuracy\ncomparable to the state-of-the-art method RNGDet++, while being 40 times faster\non the City-scale dataset. We thus demonstrate the power of a foundational\nvision model when applied to a graph learning task. The code is available at\nhttps://github.com/htcr/sam_road."},{"date":"2024-03","title":"AutoRE: Document-Level Relation Extraction with Large Language Models","author":"Lilong Xue, Dan Zhang, Yuxiao Dong, and Jie Tang","link":"http://arxiv.org/abs/2403.14888v3","abstract":"Large Language Models (LLMs) have demonstrated exceptional abilities in\ncomprehending and generating text, motivating numerous researchers to utilize\nthem for Information Extraction (IE) purposes, including Relation Extraction\n(RE). 
Nonetheless, most existing methods are predominantly designed for\nSentence-level Relation Extraction (SentRE) tasks, which typically encompass a\nrestricted set of relations and triplet facts within a single sentence.\nFurthermore, certain approaches resort to treating relations as candidate\nchoices integrated into prompt templates, leading to inefficient processing and\nsuboptimal performance when tackling Document-Level Relation Extraction (DocRE)\ntasks, which entail handling multiple relations and triplet facts distributed\nacross a given document, posing distinct challenges. To overcome these\nlimitations, we introduce AutoRE, an end-to-end DocRE model that adopts a novel\nRE extraction paradigm named RHF (Relation-Head-Facts). Unlike existing\napproaches, AutoRE does not rely on the assumption of known relation options,\nmaking it more reflective of real-world scenarios. Additionally, we have\ndeveloped an easily extensible RE framework using a Parameter-Efficient Fine-Tuning\n(PEFT) algorithm (QLoRA). Our experiments on the RE-DocRED dataset\nshowcase AutoRE's best performance, achieving state-of-the-art results,\nsurpassing TAG by 10.03% and 9.03% on the dev and test sets, respectively. The\ncode is available at https://github.com/THUDM/AutoRE and the demonstration\nvideo is provided at https://www.youtube.com/watch?v=IhKRsZUAxKk."},{"date":"2024-03","title":"Style-Extracting Diffusion Models for Semi-Supervised Histopathology Segmentation","author":"Mathias \u00d6ttl, Frauke Wilm, Jana Steenpass, Jingna Qiu, Matthias R\u00fcbner, Arndt Hartmann, Matthias Beckmann, Peter Fasching, Andreas Maier, Ramona Erber, Bernhard Kainz, and Katharina Breininger","link":"http://arxiv.org/abs/2403.14429v1","abstract":"Deep learning-based image generation has seen significant advancements with\ndiffusion models, notably improving the quality of generated images. Despite\nthese developments, generating images with unseen characteristics beneficial\nfor downstream tasks has received limited attention. To bridge this gap, we\npropose Style-Extracting Diffusion Models, featuring two conditioning\nmechanisms. Specifically, we utilize 1) a style conditioning mechanism which\nallows injecting style information from previously unseen images during image\ngeneration and 2) a content conditioning which can be targeted to a downstream\ntask, e.g., layout for segmentation. We introduce a trainable style encoder to\nextract style information from images, and an aggregation block that merges\nstyle information from multiple style inputs. This architecture enables the\ngeneration of images with unseen styles in a zero-shot manner, by leveraging\nstyles from unseen images, resulting in more diverse generations. In this work,\nwe use the image layout as the target condition and first show the capability of\nour method on a natural image dataset as a proof-of-concept. We further\ndemonstrate its versatility in histopathology, where we combine prior knowledge\nabout tissue composition and unannotated data to create diverse synthetic\nimages with known layouts. This allows us to generate additional synthetic data\nto train a segmentation network in a semi-supervised fashion. We verify the\nadded value of the generated images by showing improved segmentation results\nand lower performance variability between patients when synthetic images are\nincluded during segmentation training. 
Our code will be made publicly available\nat [LINK]."},{"date":"2024-03","title":"Clinical information extraction for Low-resource languages with Few-shot learning using Pre-trained language models and Prompting","author":"Phillip Richter-Pechanski, Philipp Wiesenbach, Dominic M. Schwab, Christina Kiriakou, Nicolas Geis, Christoph Dieterich, and Anette Frank","link":"http://arxiv.org/abs/2403.13369v2","abstract":"Automatic extraction of medical information from clinical documents poses\nseveral challenges: high costs of required clinical expertise, limited\ninterpretability of model predictions, restricted computational resources and\nprivacy regulations. Recent advances in domain-adaptation and prompting methods\nshowed promising results with minimal training data using lightweight masked\nlanguage models, which are suited for well-established interpretability\nmethods. We are the first to present a systematic evaluation of these methods in a\nlow-resource setting, by performing multi-class section classification on\nGerman doctor's letters. We conduct extensive class-wise evaluations supported\nby Shapley values, to validate the quality of our small training data set and\nto ensure the interpretability of model predictions. We demonstrate that a\nlightweight, domain-adapted pretrained model, prompted with just 20 shots,\noutperforms a traditional classification model by 30.5% in accuracy. Our results\nserve as a process-oriented guideline for clinical information extraction\nprojects working with low-resource languages."},{"date":"2024-03","title":"Automatic Information Extraction From Employment Tribunal Judgements Using Large Language Models","author":"Joana Ribeiro de Faria, Huiyuan Xie, and Felix Steffek","link":"http://arxiv.org/abs/2403.12936v1","abstract":"Court transcripts and judgments are rich repositories of legal knowledge,\ndetailing the intricacies of cases and the rationale behind judicial decisions.\nThe extraction of key information from these documents provides a concise\noverview of a case, crucial for both legal experts and the public. With the\nadvent of large language models (LLMs), automatic information extraction has\nbecome increasingly feasible and efficient. This paper presents a comprehensive\nstudy on the application of GPT-4, a large language model, for automatic\ninformation extraction from UK Employment Tribunal (UKET) cases. We\nmeticulously evaluated GPT-4's performance in extracting critical information\nwith a manual verification process to ensure the accuracy and relevance of the\nextracted data. Our research is structured around two primary extraction tasks:\nthe first involves a general extraction of eight key aspects that hold\nsignificance for both legal specialists and the general public, including the\nfacts of the case, the claims made, references to legal statutes, references to\nprecedents, general case outcomes and corresponding labels, detailed orders and\nremedies, and reasons for the decision. The second task is more focused, aimed\nat analysing three of those extracted features, namely facts, claims and\noutcomes, in order to facilitate the development of a tool capable of\npredicting the outcome of employment law disputes. 
Through our analysis, we\ndemonstrate that LLMs like GPT-4 can obtain high accuracy in legal information\nextraction, highlighting the potential of LLMs in revolutionising the way legal\ninformation is processed and utilised, offering significant implications for\nlegal research and practice."},{"date":"2024-03","title":"Towards Interpretable Hate Speech Detection using Large Language Model-extracted Rationales","author":"Ayushi Nirmal, Amrita Bhattacharjee, Paras Sheth, and Huan Liu","link":"http://arxiv.org/abs/2403.12403v2","abstract":"Although social media platforms are a prominent arena for users to engage in\ninterpersonal discussions and express opinions, the facade and anonymity\noffered by social media may allow users to spew hate speech and offensive\ncontent. Given the massive scale of such platforms, there arises a need to\nautomatically identify and flag instances of hate speech. Although several hate\nspeech detection methods exist, most of these black-box methods are not\ninterpretable or explainable by design. To address the lack of\ninterpretability, in this paper, we propose to use state-of-the-art Large\nLanguage Models (LLMs) to extract features in the form of rationales from the\ninput text, to train a base hate speech classifier, thereby enabling faithful\ninterpretability by design. Our framework effectively combines the textual\nunderstanding capabilities of LLMs and the discriminative power of\nstate-of-the-art hate speech classifiers to make these classifiers faithfully\ninterpretable. Our comprehensive evaluation on a variety of English-language\nsocial media hate speech datasets demonstrates: (1) the goodness of the\nLLM-extracted rationales, and (2) the surprising retention of detector\nperformance even after training to ensure interpretability. All code and data\nwill be made available at https://github.com/AmritaBh/shield."},{"date":"2024-03","title":"Leveraging Large Language Models to Extract Information on Substance Use Disorder Severity from Clinical Notes: A Zero-shot Learning Approach","author":"Maria Mahbub, Gregory M. Dams, Sudarshan Srinivasan, Caitlin Rizy, Ioana Danciu, Jodie Trafton, and Kathryn Knight","link":"http://arxiv.org/abs/2403.12297v1","abstract":"Substance use disorder (SUD) poses a major concern due to its detrimental\neffects on health and society. SUD identification and treatment depend on a\nvariety of factors such as severity, co-determinants (e.g., withdrawal\nsymptoms), and social determinants of health. Existing diagnostic coding\nsystems used by American insurance providers, like the International\nClassification of Diseases (ICD-10), lack granularity for certain diagnoses,\nbut clinicians will add this granularity (such as that found within the Diagnostic\nand Statistical Manual of Mental Disorders classification, or DSM-5) as\nsupplemental unstructured text in clinical notes. Traditional natural language\nprocessing (NLP) methods face limitations in accurately parsing such diverse\nclinical language. Large Language Models (LLMs) offer promise in overcoming\nthese challenges by adapting to diverse language patterns. This study\ninvestigates the application of LLMs for extracting severity-related\ninformation for various SUD diagnoses from clinical notes. We propose a\nworkflow employing zero-shot learning of LLMs with carefully crafted prompts\nand post-processing techniques. Through experimentation with Flan-T5, an\nopen-source LLM, we demonstrate its superior recall compared to the rule-based\napproach. 
Focusing on 11 categories of SUD diagnoses, we show the effectiveness\nof LLMs in extracting severity information, contributing to improved risk\nassessment and treatment planning for SUD patients."},{"date":"2024-03","title":"Towards Model Extraction Attacks in GAN-Based Image Translation via Domain Shift Mitigation","author":"Di Mi, Yanjun Zhang, Leo Yu Zhang, Shengshan Hu, Qi Zhong, Haizhuan Yuan, and Shirui Pan","link":"http://arxiv.org/abs/2403.07673v3","abstract":"Model extraction attacks (MEAs) enable an attacker to replicate the\nfunctionality of a victim deep neural network (DNN) model by only querying its\nAPI service remotely, posing a severe threat to the security and integrity of\npay-per-query DNN-based services. Although the majority of current research on\nMEAs has primarily concentrated on neural classifiers, there is a growing\nprevalence of image-to-image translation (I2IT) tasks in our everyday\nactivities. However, techniques developed for MEA of DNN classifiers cannot be\ndirectly transferred to the case of I2IT, leaving the vulnerability of I2IT\nmodels to MEAs often underestimated. This paper unveils the threat of\nMEA in I2IT tasks from a new perspective. Diverging from the traditional\napproach of bridging the distribution gap between attacker queries and victim\ntraining samples, we opt to mitigate the effect caused by the different\ndistributions, known as the domain shift. This is achieved by introducing a new\nregularization term that penalizes high-frequency noise, and seeking a flatter\nminimum to avoid overfitting to the shifted distribution. Extensive experiments\non different image translation tasks, including image super-resolution and\nstyle transfer, are performed on different backbone victim models, and the new\ndesign consistently outperforms the baseline by a large margin across all\nmetrics. A few real-life I2IT APIs are also verified to be extremely vulnerable\nto our attack, emphasizing the need for enhanced defenses and potentially\nrevised API publishing policies."},{"date":"2024-03","title":"RSBuilding: Towards General Remote Sensing Image Building Extraction and Change Detection with Foundation Model","author":"Mingze Wang, Lili Su, Cilin Yan, Sheng Xu, Pengcheng Yuan, Xiaolong Jiang, and Baochang Zhang","link":"http://arxiv.org/abs/2403.07564v2","abstract":"The intelligent interpretation of buildings plays a significant role in urban\nplanning and management, macroeconomic analysis, population dynamics, etc.\nRemote sensing image building interpretation primarily encompasses building\nextraction and change detection. However, current methodologies often treat\nthese two tasks as separate entities, thereby failing to leverage shared\nknowledge. Moreover, the complexity and diversity of remote sensing image\nscenes pose additional challenges, as most algorithms are designed to model\nindividual small datasets, thus lacking cross-scene generalization. In this\npaper, we propose a comprehensive remote sensing image building understanding\nmodel, termed RSBuilding, developed from the perspective of the foundation\nmodel. RSBuilding is designed to enhance cross-scene generalization and task\nuniversality. Specifically, we extract image features based on the prior\nknowledge of the foundation model and devise a multi-level feature sampler to\naugment scale information. 
To unify task representation and integrate image\nspatiotemporal clues, we introduce a cross-attention decoder with task prompts.\nAddressing the current shortage of datasets that incorporate annotations for\nboth tasks, we have developed a federated training strategy to facilitate\nsmooth model convergence even when supervision for some tasks is missing,\nthereby bolstering the complementarity of different tasks. Our model was\ntrained on a dataset comprising up to 245,000 images and validated on multiple\nbuilding extraction and change detection datasets. The experimental results\nsubstantiate that RSBuilding can concurrently handle two structurally distinct\ntasks and exhibits robust zero-shot generalization capabilities."},{"date":"2024-03","title":"A Semantic Mention Graph Augmented Model for Document-Level Event Argument Extraction","author":"Jian Zhang, Changlin Yang, Haiping Zhu, Qika Lin, Fangzhi Xu, and Jun Liu","link":"http://arxiv.org/abs/2403.09721v1","abstract":"Document-level Event Argument Extraction (DEAE) aims to identify arguments\nand their specific roles from an unstructured document. The advanced approaches\nfor DEAE utilize prompt-based methods to guide pre-trained language models\n(PLMs) in extracting arguments from input documents. They mainly concentrate on\nestablishing relations between triggers and entity mentions within documents,\nleaving two unresolved problems: a) independent modeling of entity mentions; b)\ndocument-prompt isolation. To this end, we propose a semantic mention Graph\nAugmented Model (GAM) to address these two problems in this paper. Firstly, GAM\nconstructs a semantic mention graph that captures relations within and between\ndocuments and prompts, encompassing co-existence, co-reference, and co-type\nrelations. Furthermore, we introduce an ensembled graph transformer module to\naddress mentions and their three semantic relations effectively. Later, the\ngraph-augmented encoder-decoder module incorporates the relation-specific graph\ninto the input embedding of PLMs and optimizes the encoder section with\ntopology information, enhancing the relations comprehensively. Extensive\nexperiments on the RAMS and WikiEvents datasets demonstrate the effectiveness\nof our approach, surpassing baseline methods and achieving a new\nstate-of-the-art performance."},{"date":"2024-03","title":"Stealing Part of a Production Language Model","author":"Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A. Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, Itay Yona, Eric Wallace, David Rolnick, and Florian Tram\u00e8r","link":"http://arxiv.org/abs/2403.06634v2","abstract":"We introduce the first model-stealing attack that extracts precise,\nnontrivial information from black-box production language models like OpenAI's\nChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding\nprojection layer (up to symmetries) of a transformer model, given typical API\naccess. For under $20 USD, our attack extracts the entire projection matrix of\nOpenAI's Ada and Babbage language models. We thereby confirm, for the first\ntime, that these black-box models have a hidden dimension of 1024 and 2048,\nrespectively. We also recover the exact hidden dimension size of the\ngpt-3.5-turbo model, and estimate it would cost under $2,000 in queries to\nrecover the entire projection matrix. 
We conclude with potential defenses and\nmitigations, and discuss the implications of possible future work that could\nextend our attack."},{"date":"2024-03","title":"Adversarial Sparse Teacher: Defense Against Distillation-Based Model Stealing Attacks Using Adversarial Examples","author":"Eda Yilmaz, and Hacer Yalim Keles","link":"http://arxiv.org/abs/2403.05181v2","abstract":"We introduce Adversarial Sparse Teacher (AST), a robust defense method\nagainst distillation-based model stealing attacks. Our approach trains a\nteacher model using adversarial examples to produce sparse logit responses and\nincrease the entropy of the output distribution. Typically, a model generates a\npeak in its output corresponding to its prediction. By leveraging adversarial\nexamples, AST modifies the teacher model's original response, embedding a few\naltered logits into the output while keeping the primary response slightly\nhigher. Concurrently, all remaining logits are elevated to further increase the\noutput distribution's entropy. All these complex manipulations are performed\nusing an optimization function with our proposed Exponential Predictive\nDivergence (EPD) loss function. EPD allows us to maintain higher entropy levels\ncompared to traditional KL divergence, effectively confusing attackers.\nExperiments on CIFAR-10 and CIFAR-100 datasets demonstrate that AST outperforms\nstate-of-the-art methods, providing effective defense against model stealing\nwhile preserving high accuracy. The source code will be made publicly\navailable here soon."},{"date":"2024-03","title":"ChatUIE: Exploring Chat-based Unified Information Extraction using Large Language Models","author":"Jun Xu, Mengshu Sun, Zhiqiang Zhang, and Jun Zhou","link":"http://arxiv.org/abs/2403.05132v1","abstract":"Recent advancements in large language models have shown impressive\nperformance in general chat. However, their domain-specific capabilities,\nparticularly in information extraction, have certain limitations. Extracting\nstructured information from natural language that deviates from known schemas\nor instructions has proven challenging for previous prompt-based methods. This\nmotivated us to explore domain-specific modeling in chat-based language models\nas a solution for extracting structured information from natural language. In\nthis paper, we present ChatUIE, an innovative unified information extraction\nframework built upon ChatGLM. Simultaneously, reinforcement learning is\nemployed to improve and align various tasks that involve confusing and limited\nsamples. Furthermore, we integrate generation constraints to address the issue\nof generating elements that are not present in the input. Our experimental\nresults demonstrate that ChatUIE can significantly improve the performance of\ninformation extraction with a slight decrease in chatting ability."},{"date":"2024-03","title":"Precise Extraction of Deep Learning Models via Side-Channel Attacks on Edge/Endpoint Devices","author":"Younghan Lee, Sohee Jun, Yungi Cho, Woorim Han, Hyungon Moon, and Yunheung Paek","link":"http://arxiv.org/abs/2403.02870v1","abstract":"With growing popularity, deep learning (DL) models are becoming larger-scale,\nand only the companies with vast training datasets and immense computing power\ncan manage their business serving such large models.
Most of those DL models\nare proprietary to the companies who thus strive to keep their private models\nsafe from the model extraction attack (MEA), whose aim is to steal the model by\ntraining surrogate models. Nowadays, companies are inclined to offload the\nmodels from central servers to edge/endpoint devices. As revealed in the latest\nstudies, adversaries exploit this opportunity as new attack vectors to launch\na side-channel attack (SCA) on the device running the victim model and obtain various\npieces of the model information, such as the model architecture (MA) and image\ndimension (ID). Our work provides a comprehensive understanding of such a\nrelationship for the first time and would benefit future MEA studies on both\nthe offensive and defensive sides in that they may learn which pieces of\ninformation exposed by SCA are more important than the others. Our analysis\nadditionally reveals that by grasping the victim model information from SCA,\nMEA can become highly effective and successful even without any prior knowledge of\nthe model. Finally, to evince the practicality of our analysis results, we\nempirically apply SCA, and subsequently, carry out MEA under realistic threat\nassumptions. The results show up to 5.8 times better performance than when the\nadversary has no model information about the victim model."},{"date":"2024-03","title":"Towards Intent-Based Network Management: Large Language Models for Intent Extraction in 5G Core Networks","author":"Dimitrios Michael Manias, Ali Chouman, and Abdallah Shami","link":"http://arxiv.org/abs/2403.02238v2","abstract":"The integration of Machine Learning and Artificial Intelligence (ML/AI) into\nfifth-generation (5G) networks has made evident the limitations of network\nintelligence with ever-increasing, strenuous requirements for current and\nnext-generation devices. This transition to ubiquitous intelligence demands\nhigh connectivity, synchronicity, and end-to-end communication between users\nand network operators, and will pave the way towards full network automation\nwithout human intervention. Intent-based networking is a key factor in the\nreduction of human actions, roles, and responsibilities while shifting towards\nnovel extraction and interpretation of automated network management. This paper\npresents the development of a custom Large Language Model (LLM) for 5G and\nnext-generation intent-based networking and provides insights into future LLM\ndevelopments and integrations to realize end-to-end intent-based networking for\nfully automated network intelligence."},{"date":"2024-03","title":"Large Language Models for Simultaneous Named Entity Extraction and Spelling Correction","author":"Edward Whittaker, and Ikuo Kitagishi","link":"http://arxiv.org/abs/2403.00528v1","abstract":"Language Models (LMs) such as BERT have been shown to perform well on the\ntask of identifying Named Entities (NE) in text.
A BERT LM is typically used as\na classifier to classify individual tokens in the input text, or to classify\nspans of tokens, as belonging to one of a set of possible NE categories.\n In this paper, we hypothesise that decoder-only Large Language Models (LLMs)\ncan also be used generatively to extract both the NE, as well as potentially\nrecover the correct surface form of the NE, where any spelling errors that were\npresent in the input text get automatically corrected.\n We fine-tune two BERT LMs as baselines, as well as eight open-source LLMs, on\nthe task of producing NEs from text that was obtained by applying Optical\nCharacter Recognition (OCR) to images of Japanese shop receipts; in this work,\nwe do not attempt to find or evaluate the location of NEs in the text.\n We show that the best fine-tuned LLM performs as well as, or slightly better\nthan, the best fine-tuned BERT LM, although the differences are not\nsignificant. However, the best LLM is also shown to correct OCR errors in some\ncases, as initially hypothesised."},{"date":"2024-03","title":"Teach LLMs to Phish: Stealing Private Information from Language Models","author":"Ashwinee Panda, Christopher A. Choquette-Choo, Zhengming Zhang, Yaoqing Yang, and Prateek Mittal","link":"http://arxiv.org/abs/2403.00871v1","abstract":"When large language models are trained on private data, it can be a\nsignificant privacy risk for them to memorize and regurgitate sensitive\ninformation. In this work, we propose a new practical data extraction attack\nthat we call \"neural phishing\". This attack enables an adversary to target and\nextract sensitive or personally identifiable information (PII), e.g., credit\ncard numbers, from a model trained on user data, with attack\nsuccess rates upwards of 10% and, at times, as high as 50%. Our attack assumes only that an\nadversary can insert as few as 10s of benign-appearing sentences into the\ntraining dataset using only vague priors on the structure of the user data."},{"date":"2024-02","title":"LLM-Ensemble: Optimal Large Language Model Ensemble Method for E-commerce Product Attribute Value Extraction","author":"Chenhao Fang, Xiaohan Li, Zezhong Fan, Jianpeng Xu, Kaushiki Nag, Evren Korpeoglu, Sushant Kumar, and Kannan Achan","link":"http://arxiv.org/abs/2403.00863v2","abstract":"Product attribute value extraction is a pivotal component in Natural Language\nProcessing (NLP) and the contemporary e-commerce industry. The provision of\nprecise product attribute values is fundamental in ensuring high-quality\nrecommendations and enhancing customer satisfaction. The recently emerging\nLarge Language Models (LLMs) have demonstrated state-of-the-art performance in\nnumerous attribute extraction tasks, without the need for domain-specific\ntraining data. Nevertheless, varying strengths and weaknesses are exhibited by\ndifferent LLMs due to the diversity in data, architectures, and\nhyperparameters. This variation makes them complementary to each other, with no\nsingle LLM dominating all others. Considering the diverse strengths and\nweaknesses of LLMs, it becomes necessary to develop an ensemble method that\nleverages their complementary potentials. In this paper, we propose a novel\nalgorithm called LLM-ensemble to ensemble different LLMs' outputs for attribute\nvalue extraction. We iteratively learn the weights for different LLMs to\naggregate the labels with weights to predict the final attribute value.
Not\nonly can our proposed method be proven theoretically optimal, but it also\nensures efficient computation, fast convergence, and safe deployment. We have\nalso conducted extensive experiments with various state-of-the-art LLMs,\nincluding Llama2-13B, Llama2-70B, PaLM-2, GPT-3.5, and GPT-4, on Walmart's\ninternal data. Our offline metrics demonstrate that the LLM-ensemble method\noutperforms all the state-of-the-art single LLMs on Walmart's internal dataset.\nThis method has been launched in several production models, leading to improved\nGross Merchandise Volume (GMV), Click-Through Rate (CTR), Conversion Rate\n(CVR), and Add-to-Cart Rate (ATC)."},{"date":"2024-02","title":"Watermark Stealing in Large Language Models","author":"Nikola Jovanovi\u0107, Robin Staab, and Martin Vechev","link":"http://arxiv.org/abs/2402.19361v2","abstract":"LLM watermarking has attracted attention as a promising way to detect\nAI-generated content, with some works suggesting that current schemes may\nalready be fit for deployment. In this work we dispute this claim, identifying\nwatermark stealing (WS) as a fundamental vulnerability of these schemes. We\nshow that querying the API of the watermarked LLM to approximately\nreverse-engineer a watermark enables practical spoofing attacks, as\nhypothesized in prior work, but also greatly boosts scrubbing attacks, which\nhad previously gone unnoticed. We are the first to propose an automated WS algorithm\nand use it in the first comprehensive study of spoofing and scrubbing in\nrealistic settings. We show that for under $50 an attacker can both spoof and\nscrub state-of-the-art schemes previously considered safe, with an average success\nrate of over 80%. Our findings challenge common beliefs about LLM watermarking,\nstressing the need for more robust schemes. We make all our code and additional\nexamples available at https://watermark-stealing.org."},{"date":"2024-02","title":"PRSA: PRompt Stealing Attacks against Large Language Models","author":"Yong Yang, Changjiang Li, Yi Jiang, Xi Chen, Haoyu Wang, Xuhong Zhang, Zonghui Wang, and Shouling Ji","link":"http://arxiv.org/abs/2402.19200v2","abstract":"In recent years, \"prompt as a service\" has greatly enhanced the utility of\nlarge language models (LLMs) by enabling them to perform various downstream\ntasks efficiently without fine-tuning. This has also increased the commercial\nvalue of prompts. However, the potential risk of leakage in these\ncommercialized prompts remains largely underexplored. In this paper, we\nintroduce a novel attack framework, PRSA, designed for prompt stealing attacks\nagainst LLMs. The main idea of PRSA is to infer the intent behind a prompt by\nanalyzing its input-output content, enabling the generation of a surrogate\nprompt that replicates the original's functionality. Specifically, PRSA mainly\nconsists of two key phases: prompt mutation and prompt pruning. In the mutation\nphase, we propose a prompt attention algorithm based on output difference. The\nalgorithm facilitates the generation of effective surrogate prompts by learning\nkey factors that influence the accurate inference of prompt intent. During the\npruning phase, we employ a two-step related word identification strategy to\ndetect and mask words that are highly related to the input, thus improving the\ngeneralizability of the surrogate prompts. We verify the actual threat of PRSA\nthrough evaluation in two real-world settings: non-interactive and interactive\nprompt services.
The results strongly confirm PRSA's effectiveness and\ngeneralizability. We have reported these findings to prompt service providers\nand actively collaborate with them to implement defensive measures."},{"date":"2024-02","title":"Enhancing Steganographic Text Extraction: Evaluating the Impact of NLP Models on Accuracy and Semantic Coherence","author":"Mingyang Li, Maoqin Yuan, Luyao Li, and Han Pengsihua","link":"http://arxiv.org/abs/2402.18849v1","abstract":"This study discusses a new method combining image steganography technology\nwith Natural Language Processing (NLP) large models, aimed at improving the\naccuracy and robustness of extracting steganographic text. Traditional Least\nSignificant Bit (LSB) steganography techniques face challenges in accuracy and\nrobustness of information extraction when dealing with complex character\nencoding, such as Chinese characters. To address this issue, this study\nproposes an innovative LSB-NLP hybrid framework. This framework integrates the\nadvanced capabilities of NLP large models, such as error detection, correction,\nand semantic consistency analysis, as well as information reconstruction\ntechniques, thereby significantly enhancing the robustness of steganographic\ntext extraction. Experimental results show that the LSB-NLP hybrid framework\nexcels in improving the extraction accuracy of steganographic text, especially\nin handling Chinese characters. The findings of this study not only confirm the\neffectiveness of combining image steganography technology and NLP large models\nbut also propose new ideas for research and application in the field of\ninformation hiding. The successful implementation of this interdisciplinary\napproach demonstrates the great potential of integrating image steganography\ntechnology with natural language processing technology in solving complex\ninformation processing problems."},{"date":"2024-02","title":"Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction","author":"Koki Maeda, Shuhei Kurita, Taiki Miyanishi, and Naoaki Okazaki","link":"http://arxiv.org/abs/2402.17969v1","abstract":"Given the accelerating progress of vision and language modeling, accurate\nevaluation of machine-generated image captions remains critical. In order to\nevaluate captions more closely to human preferences, metrics need to\ndiscriminate between captions of varying quality and content. However,\nconventional metrics fall short of comparing beyond superficial matches of\nwords or embedding similarities; thus, they still need improvement. This paper\npresents VisCE$^2$, a vision language model-based caption evaluation method.\nOur method focuses on visual context, which refers to the detailed content of\nimages, including objects, attributes, and relationships. By extracting and\norganizing them into a structured format, we replace the human-written\nreferences with visual contexts and help VLMs better understand the image,\nenhancing evaluation performance. Through meta-evaluation on multiple datasets,\nwe validated that VisCE$^2$ outperforms the conventional pre-trained metrics in\ncapturing caption quality and demonstrates superior consistency with human\njudgment."},{"date":"2024-02","title":"Pandora's White-Box: Precise Training Data Detection and Extraction in Large Language Models","author":"Jeffrey G.
Wang, Jason Wang, Marvin Li, and Seth Neel","link":"http://arxiv.org/abs/2402.17012v4","abstract":"In this paper we develop state-of-the-art privacy attacks against Large\nLanguage Models (LLMs), where an adversary with some access to the model tries\nto learn something about the underlying training data. Our headline results are\nnew membership inference attacks (MIAs) against pretrained LLMs that perform\nhundreds of times better than baseline attacks, and a pipeline showing that\nover 50% (!) of the fine-tuning dataset can be extracted from a fine-tuned LLM\nin natural settings. We consider varying degrees of access to the underlying\nmodel, pretraining and fine-tuning data, and both MIAs and training data\nextraction. For pretraining data, we propose two new MIAs: a supervised neural\nnetwork classifier that predicts training data membership on the basis of\n(dimensionality-reduced) model gradients, as well as a variant of this attack\nthat only requires logit access to the model by leveraging recent\nmodel-stealing work on LLMs. To our knowledge this is the first MIA that\nexplicitly incorporates model-stealing information. Both attacks outperform\nexisting black-box baselines, and our supervised attack closes the gap between\nMIA attack success against LLMs and the strongest known attacks for other\nmachine learning models. In fine-tuning, we find that a simple attack based on\nthe ratio of the loss between the base and fine-tuned models is able to achieve\nnear-perfect MIA performance; we then leverage our MIA to extract a large\nfraction of the fine-tuning dataset from fine-tuned Pythia and Llama models.\nOur code is available at github.com/safr-ai-lab/pandora-llm."},{"date":"2024-02","title":"IPED: An Implicit Perspective for Relational Triple Extraction based on Diffusion Model","author":"Jianli Zhao, Changhao Xu, and Bin Jiang","link":"http://arxiv.org/abs/2403.00808v1","abstract":"Relational triple extraction is a fundamental task in the field of\ninformation extraction, and a promising framework based on table filling has\nrecently gained attention as a potential baseline for entity relation\nextraction. However, inherent shortcomings such as redundant information and\nincomplete triple recognition remain problematic. To address these challenges,\nwe propose an Implicit Perspective for relational triple Extraction based on\nDiffusion model (IPED), an innovative approach for extracting relational\ntriples. Our classifier-free solution adopts an implicit strategy using block\ncoverage to complete the tables, avoiding the limitations of explicit tagging\nmethods. Additionally, we introduce a generative model structure, the\nblock-denoising diffusion model, to collaborate with our implicit perspective\nand effectively circumvent redundant information disruptions. Experimental\nresults on two popular datasets demonstrate that IPED achieves state-of-the-art\nperformance while gaining superior inference speed and low computational\ncomplexity. To support future research, we have made our source code publicly\navailable online."},{"date":"2024-02","title":"Prompt Stealing Attacks Against Large Language Models","author":"Zeyang Sha, and Yang Zhang","link":"http://arxiv.org/abs/2402.12959v1","abstract":"The increasing reliance on large language models (LLMs) such as ChatGPT in\nvarious fields emphasizes the importance of ``prompt engineering,'' a\ntechnology to improve the quality of model outputs.
With companies investing\nsignificantly in expert prompt engineers and educational resources rising to\nmeet market demand, designing high-quality prompts has become an intriguing\nchallenge. In this paper, we propose a novel attack against LLMs, named prompt\nstealing attacks. Our proposed prompt stealing attack aims to steal these\nwell-designed prompts based on the generated answers. The prompt stealing\nattack contains two primary modules: the parameter extractor and the prompt\nreconstructor. The goal of the parameter extractor is to figure out the\nproperties of the original prompts. We first observe that most prompts fall\ninto one of three categories: direct prompt, role-based prompt, and in-context\nprompt. Our parameter extractor first tries to distinguish the type of prompts\nbased on the generated answers. Then, it can further predict which role or how\nmany contexts are used based on the types of prompts. Following the parameter\nextractor, the prompt reconstructor can be used to reconstruct the original\nprompts based on the generated answers and the extracted features. The final\ngoal of the prompt reconstructor is to generate the reversed prompts, which are\nsimilar to the original prompts. Our experimental results show the remarkable\nperformance of our proposed attacks. Our proposed attacks add a new dimension\nto the study of prompt engineering and call for more attention to the security\nissues on LLMs."},{"date":"2024-02","title":"Stealing the Invisible: Unveiling Pre-Trained CNN Models through Adversarial Examples and Timing Side-Channels","author":"Shubhi Shukla, Manaar Alam, Pabitra Mitra, and Debdeep Mukhopadhyay","link":"http://arxiv.org/abs/2402.11953v1","abstract":"Machine learning, with its myriad applications, has become an integral\ncomponent of numerous technological systems. A common practice in this domain\nis the use of transfer learning, where a pre-trained model's architecture,\nreadily available to the public, is fine-tuned to suit specific tasks. As\nMachine Learning as a Service (MLaaS) platforms increasingly use pre-trained\nmodels in their backends, it's crucial to safeguard these architectures and\nunderstand their vulnerabilities. In this work, we present an approach based on\nthe observation that the classification patterns of adversarial images can be\nused as a means to steal the models. Furthermore, the adversarial image\nclassifications in conjunction with timing side channels can lead to a model\nstealing method. Our approach, designed for typical user-level access in remote\nMLaaS environments, exploits varying misclassifications of adversarial images\nacross different models to fingerprint several renowned Convolutional Neural\nNetwork (CNN) and Vision Transformer (ViT) architectures. We utilize the\nprofiling of remote model inference times to reduce the necessary adversarial\nimages, subsequently decreasing the number of queries required. We have\npresented our results over 27 pre-trained models of different CNN and ViT\narchitectures using the CIFAR-10 dataset and demonstrate a high accuracy of 88.8%\nwhile keeping the query budget under 20."},{"date":"2024-02","title":"Evaluating Efficacy of Model Stealing Attacks and Defenses on Quantum Neural Networks","author":"Satwik Kundu, Debarshi Kundu, and Swaroop Ghosh","link":"http://arxiv.org/abs/2402.11687v1","abstract":"Cloud hosting of quantum machine learning (QML) models exposes them to a\nrange of vulnerabilities, the most significant of which is the model stealing\nattack.
In this study, we assess the efficacy of such attacks in the realm of\nquantum computing. We conducted comprehensive experiments on various datasets\nwith multiple QML model architectures. Our findings revealed that model\nstealing attacks can produce clone models achieving up to $0.9\times$ and\n$0.99\times$ clone test accuracy when trained using Top-$1$ and Top-$k$ labels,\nrespectively ($k:$ num\_classes). To defend against these attacks, we leverage\nthe unique properties of current noisy hardware to perturb the victim model\noutputs and hinder the attacker's training process. In particular, we propose:\n1) hardware variation-induced perturbation (HVIP) and 2) hardware and\narchitecture variation-induced perturbation (HAVIP). Although noise and\narchitectural variability can provide up to $\sim16\%$ output obfuscation, our\ncomprehensive analysis revealed that models cloned under noisy conditions tend\nto be resilient, suffering little to no performance degradation due to such\nobfuscations. Despite limited success with our defense techniques, this outcome\nhas led to an important discovery: QML models trained on noisy hardware are\nnaturally resistant to perturbation or obfuscation-based defenses or attacks."},{"date":"2024-02","title":"GenRES: Rethinking Evaluation for Generative Relation Extraction in the Era of Large Language Models","author":"Pengcheng Jiang, Jiacheng Lin, Zifeng Wang, Jimeng Sun, and Jiawei Han","link":"http://arxiv.org/abs/2402.10744v1","abstract":"The field of relation extraction (RE) is experiencing a notable shift towards\ngenerative relation extraction (GRE), leveraging the capabilities of large\nlanguage models (LLMs). However, we discovered that traditional relation\nextraction (RE) metrics like precision and recall fall short in evaluating GRE\nmethods. This shortfall arises because these metrics rely on exact matching\nwith human-annotated reference relations, while GRE methods often produce\ndiverse and semantically accurate relations that differ from the references. To\nfill this gap, we introduce GenRES for a multi-dimensional assessment in terms\nof the topic similarity, uniqueness, granularity, factualness, and completeness\nof the GRE results. With GenRES, we empirically identified that (1)\nprecision/recall fails to justify the performance of GRE methods; (2)\nhuman-annotated referential relations can be incomplete; (3) prompting LLMs\nwith a fixed set of relations or entities can cause hallucinations. Next, we\nconducted a human evaluation of GRE methods that shows GenRES is consistent\nwith human preferences for RE quality. Last, we made a comprehensive evaluation\nof fourteen leading LLMs using GenRES across document, bag, and sentence level\nRE datasets, respectively, to set the benchmark for future research in GRE."},{"date":"2024-02","title":"Where is the answer? Investigating Positional Bias in Language Model Knowledge Extraction","author":"Kuniaki Saito, Kihyuk Sohn, Chen-Yu Lee, and Yoshitaka Ushiku","link":"http://arxiv.org/abs/2402.12170v2","abstract":"Large language models require updates to remain up-to-date or adapt to new\ndomains by fine-tuning them with new documents. One key is memorizing the\nlatest information in a way that the memorized information is extractable with\na query prompt. However, LLMs suffer from a phenomenon called perplexity curse;\ndespite minimizing document perplexity during fine-tuning, LLMs struggle to\nextract information through a prompt sentence.
In this new knowledge\nacquisition and extraction setting, we find an intriguing fact: LLMs can\naccurately answer questions about the first sentence, but they struggle to\nextract information described in the middle or end of the documents used for\nfine-tuning. Our study suggests that the auto-regressive training causes this\nissue; each token is predicted by relying on all previous tokens, which hinders\nthe model from recalling information from training documents via question\nprompts. To conduct the in-depth study, we publish both synthetic and real\ndatasets, enabling the evaluation of the QA performance w.r.t. the position of\nthe corresponding answer in a document. Our investigation shows that even a\nlarge model suffers from the perplexity curse, but regularization such as\ndenoising auto-regressive loss can enhance the information extraction from\ndiverse positions. These findings will be (i) a key to improving knowledge\nextraction from LLMs and (ii) new elements to discuss the trade-off between RAG\nand fine-tuning in adapting LLMs to a new domain."},{"date":"2024-02","title":"Learning to Extract Structured Entities Using Language Models","author":"Haolun Wu, Ye Yuan, Liana Mikaelyan, Alexander Meulemans, Xue Liu, James Hensman, and Bhaskar Mitra","link":"http://arxiv.org/abs/2402.04437v5","abstract":"Recent advances in machine learning have significantly impacted the field of\ninformation extraction, with Language Models (LMs) playing a pivotal role in\nextracting structured information from unstructured text. Prior works typically\nrepresent information extraction as triplet-centric and use classical metrics\nsuch as precision and recall for evaluation. We reformulate the task to be\nentity-centric, enabling the use of diverse metrics that can provide more\ninsights from various perspectives. We contribute to the field by introducing\nStructured Entity Extraction and proposing the Approximate Entity Set OverlaP\n(AESOP) metric, designed to appropriately assess model performance. Later, we\nintroduce a new Multistage Structured Entity Extraction (MuSEE) model that\nharnesses the power of LMs for enhanced effectiveness and efficiency by\ndecomposing the extraction task into multiple stages. Quantitative and human\nside-by-side evaluations confirm that our model outperforms baselines, offering\npromising directions for future advancements in structured entity extraction.\nOur source code and datasets are available at\nhttps://github.com/microsoft/Structured-Entity-Extraction."},{"date":"2024-01","title":"Contextual Feature Extraction Hierarchies Converge in Large Language Models and the Brain","author":"Gavin Mischler, Yinghao Aaron Li, Stephan Bickel, Ashesh D. Mehta, and Nima Mesgarani","link":"http://arxiv.org/abs/2401.17671v1","abstract":"Recent advancements in artificial intelligence have sparked interest in the\nparallels between large language models (LLMs) and human neural processing,\nparticularly in language comprehension. While prior research has established\nsimilarities in the representation of LLMs and the brain, the underlying\ncomputational principles that cause this convergence, especially in the context\nof evolving LLMs, remain elusive. Here, we examined a diverse selection of\nhigh-performance LLMs with similar parameter sizes to investigate the factors\ncontributing to their alignment with the brain's language processing\nmechanisms.
We find that as LLMs achieve higher performance on benchmark tasks,\nthey not only become more brain-like as measured by higher performance when\npredicting neural responses from LLM embeddings, but also their hierarchical\nfeature extraction pathways map more closely onto the brain's while using fewer\nlayers to do the same encoding. We also compare the feature extraction pathways\nof the LLMs to each other and identify new ways in which high-performing models\nhave converged toward similar hierarchical processing mechanisms. Finally, we\nshow the importance of contextual information in improving model performance\nand brain similarity. Our findings reveal the converging aspects of language\nprocessing in the brain and LLMs and offer new directions for developing models\nthat align more closely with human cognitive processing."},{"date":"2024-01","title":"LaneGraph2Seq: Lane Topology Extraction with Language Model via Vertex-Edge Encoding and Connectivity Enhancement","author":"Renyuan Peng, Xinyue Cai, Hang Xu, Jiachen Lu, Feng Wen, Wei Zhang, and Li Zhang","link":"http://arxiv.org/abs/2401.17609v2","abstract":"Understanding road structures is crucial for autonomous driving. Intricate\nroad structures are often depicted using lane graphs, which include centerline\ncurves and connections forming a Directed Acyclic Graph (DAG). Accurate\nextraction of lane graphs relies on precisely estimating vertex and edge\ninformation within the DAG. Recent research highlights Transformer-based\nlanguage models' impressive sequence prediction abilities, making them\neffective for learning graph representations when graph data are encoded as\nsequences. However, existing studies focus mainly on modeling vertices\nexplicitly, leaving edge information simply embedded in the network.\nConsequently, these approaches fall short in the task of lane graph extraction.\nTo address this, we introduce LaneGraph2Seq, a novel approach for lane graph\nextraction. It leverages a language model with vertex-edge encoding and\nconnectivity enhancement. Our serialization strategy includes a vertex-centric\ndepth-first traversal and a concise edge-based partition sequence.\nAdditionally, we use classifier-free guidance combined with nucleus sampling to\nimprove lane connectivity. We validate our method on prominent datasets,\nnuScenes and Argoverse 2, showcasing consistent and compelling results. Our\nLaneGraph2Seq approach demonstrates superior performance compared to\nstate-of-the-art techniques in lane graph extraction."},{"date":"2024-01","title":"Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately","author":"Liang Zhang, Katherine Jijo, Spurthi Setty, Eden Chung, Fatima Javid, Natan Vidra, and Tommy Clifford","link":"http://arxiv.org/abs/2402.01722v1","abstract":"Large Language Models (LLMs) generate responses to questions; however, their\neffectiveness is often hindered by sub-optimal quality of answers and\noccasional failures to provide accurate responses to questions. To address\nthese challenges, a fine-tuning process is employed, involving feedback and\nexamples to refine models. The objective is to enhance AI models through\ncontinuous feedback loops, utilizing metrics such as cosine similarity, LLM\nevaluation and Rouge-L scores to evaluate the models. 
Leveraging LLMs like\nGPT-3.5, GPT4ALL, LLaMA2, and Claude, this approach is benchmarked on\nfinancial datasets, including the FinanceBench and RAG Instruct Benchmark\nTester Dataset, illustrating the necessity of fine-tuning. The results showcase\nthe capability of fine-tuned models to surpass the accuracy of zero-shot LLMs,\nproviding superior question-answering capabilities. Notably, the\ncombination of fine-tuning the LLM with a process known as Retrieval Augmented\nGeneration (RAG) proves to generate responses with improved accuracy."},{"date":"2024-01","title":"MEA-Defender: A Robust Watermark against Model Extraction Attack","author":"Peizhuo Lv, Hualong Ma, Kai Chen, Jiachen Zhou, Shengzhi Zhang, Ruigang Liang, Shenchen Zhu, Pan Li, and Yingjun Zhang","link":"http://arxiv.org/abs/2401.15239v1","abstract":"Recently, numerous highly-valuable Deep Neural Networks (DNNs) have been\ntrained using deep learning algorithms. To protect the Intellectual Property\n(IP) of the original owners over such DNN models, backdoor-based watermarks\nhave been extensively studied. However, most such watermarks fail under a model\nextraction attack, which utilizes input samples to query the target model and\nobtains the corresponding outputs, thus training a substitute model using such\ninput-output pairs. In this paper, we propose a novel watermark to protect IP\nof DNN models against model extraction, named MEA-Defender. In particular, we\nobtain the watermark by combining two samples from two source classes in the\ninput domain and design a watermark loss function that makes the output domain\nof the watermark within that of the main task samples. Since both the input\ndomain and the output domain of our watermark are indispensable parts of those\nof the main task samples, the watermark will be extracted into the stolen model\nalong with the main task during model extraction. We conduct extensive\nexperiments on four model extraction attacks, using five datasets and six\nmodels trained based on supervised learning and self-supervised learning\nalgorithms. The experimental results demonstrate that MEA-Defender is highly\nrobust against different model extraction attacks and various watermark\nremoval/detection approaches."},{"date":"2024-01","title":"Extracting Process-Aware Decision Models from Object-Centric Process Data","author":"Alexandre Goossens, Johannes De Smedt, and Jan Vanthienen","link":"http://arxiv.org/abs/2401.14847v1","abstract":"Organizations execute decisions within business processes on a daily basis\nwhilst having to take into account multiple stakeholders who might require\nmultiple points of view of the same process. Moreover, the complexity of the\ninformation systems running these business processes is generally high as they\nare linked to databases storing all the relevant data and aspects of the\nprocesses. Given the presence of multiple objects within an information system\nwhich support the processes in their enactment, decisions are naturally\ninfluenced by both these perspectives, logged in object-centric process logs.\nHowever, the discovery of such decisions from object-centric process logs is\nnot straightforward as it requires correctly linking the involved objects\nwhilst considering the sequential constraints that business processes impose as\nwell as correctly discovering what a decision actually does. This paper\nproposes the first object-centric decision-mining algorithm called Integrated\nObject-centric Decision Discovery Algorithm (IODDA).
IODDA is able to discover\nhow a decision is structured as well as how a decision is made. Moreover, IODDA\nis able to discover which activities and object types are involved in the\ndecision-making process. Next, IODDA is demonstrated with the first artificial\nknowledge-intensive process logs whose log generators are provided to the\nresearch community."},{"date":"2024-01","title":"Evaluation of General Large Language Models in Contextually Assessing Semantic Concepts Extracted from Adult Critical Care Electronic Health Record Notes","author":"Darren Liu, Cheng Ding, Delgersuren Bold, Monique Bouvier, Jiaying Lu, Benjamin Shickel, Craig S. Jabaley, Wenhui Zhang, Soojin Park, Michael J. Young, Mark S. Wainwright, Gilles Clermont, Parisa Rashidi, Eric S. Rosenthal, Laurie Dimisko, Ran Xiao, Joo Heung Yoon, Carl Yang, and Xiao Hu","link":"http://arxiv.org/abs/2401.13588v1","abstract":"The field of healthcare has increasingly turned its focus towards Large\nLanguage Models (LLMs) due to their remarkable performance. However, their\nperformance in actual clinical applications has been underexplored. Traditional\nevaluations based on question-answering tasks don't fully capture the nuanced\ncontexts. This gap highlights the need for more in-depth and practical\nassessments of LLMs in real-world healthcare settings. Objective: We sought to\nevaluate the performance of LLMs in the complex clinical context of adult\ncritical care medicine using systematic and comprehensible analytic methods,\nincluding clinician annotation and adjudication. Methods: We investigated the\nperformance of three general LLMs in understanding and processing real-world\nclinical notes. Concepts from 150 clinical notes were identified by MetaMap and\nthen labeled by 9 clinicians. Each LLM's proficiency was evaluated by\nidentifying the temporality and negation of these concepts using different\nprompts for an in-depth analysis. Results: GPT-4 showed overall superior\nperformance compared to other LLMs. In contrast, both GPT-3.5 and\ntext-davinci-003 exhibit enhanced performance when the appropriate prompting\nstrategies are employed. The GPT family models have demonstrated considerable\nefficiency, evidenced by their cost-effectiveness and time-saving capabilities.\nConclusion: A comprehensive qualitative performance evaluation framework for\nLLMs is developed and operationalized. This framework goes beyond singular\nperformance aspects. With expert annotations, this methodology not only\nvalidates LLMs' capabilities in processing complex medical data but also\nestablishes a benchmark for future LLM evaluations across specialized domains."},{"date":"2024-01","title":"Large Language Models for Scientific Information Extraction: An Empirical Study for Virology","author":"Mahsa Shamsabadi, Jennifer D'Souza, and S\u00f6ren Auer","link":"http://arxiv.org/abs/2401.10040v1","abstract":"In this paper, we champion the use of structured and semantic content\nrepresentation of discourse-based scholarly communication, inspired by tools\nlike Wikipedia infoboxes or structured Amazon product descriptions. These\nrepresentations provide users with a concise overview, aiding scientists in\nnavigating the dense academic landscape. 
Our novel automated approach leverages\nthe robust text generation capabilities of LLMs to produce structured scholarly\ncontribution summaries, offering both a practical solution and insights into\nLLMs' emergent abilities.\n For LLMs, the prime focus is on improving their general intelligence as\nconversational agents. We argue that these models can also be applied\neffectively in information extraction (IE), specifically in complex IE tasks\nwithin terse domains like Science. This paradigm shift replaces the traditional\nmodular, pipelined machine learning approach with a simpler objective expressed\nthrough instructions. Our results show that finetuned FLAN-T5 with 1000x fewer\nparameters than the state-of-the-art GPT-davinci is competitive for the task."},{"date":"2024-01","title":"Code-Based English Models Surprising Performance on Chinese QA Pair Extraction Task","author":"Linghan Zheng, Hui Liu, Xiaojun Lin, Jiayuan Dong, Yue Sheng, Gang Shi, Zhiwei Liu, and Hongwei Chen","link":"http://arxiv.org/abs/2401.10286v3","abstract":"In previous studies, code-based models have consistently outperformed\ntext-based models in reasoning-intensive scenarios. When generating our\nknowledge base for Retrieval-Augmented Generation (RAG), we observed that\ncode-based models also perform exceptionally well in Chinese QA Pair Extraction\ntask. Further, our experiments and the metrics we designed discovered that\ncode-based models containing a certain amount of Chinese data achieve even\nbetter performance. Additionally, the capabilities of code-based English models\nin specified Chinese tasks offer a distinct perspective for discussion on the\nphilosophical \"Chinese Room\" thought experiment."},{"date":"2024-01","title":"MatSAM: Efficient Extraction of Microstructures of Materials via Visual Large Model","author":"Changtai Li, Xu Han, Chao Yao, and Xiaojuan Ban","link":"http://arxiv.org/abs/2401.05638v2","abstract":"Efficient and accurate extraction of microstructures in micrographs of\nmaterials is essential in process optimization and the exploration of\nstructure-property relationships. Deep learning-based image segmentation\ntechniques that rely on manual annotation are laborious and time-consuming and\nhardly meet the demand for model transferability and generalization on various\nsource images. Segment Anything Model (SAM), a large visual model with powerful\ndeep feature representation and zero-shot generalization capabilities, has\nprovided new solutions for image segmentation. In this paper, we propose\nMatSAM, a general and efficient microstructure extraction solution based on\nSAM. A simple yet effective point-based prompt generation strategy is designed,\ngrounded on the distribution and shape of microstructures. Specifically, in an\nunsupervised and training-free way, it adaptively generates prompt points for\ndifferent microscopy images, fuses the centroid points of the coarsely\nextracted region of interest (ROI) and native grid points, and integrates\ncorresponding post-processing operations for quantitative characterization of\nmicrostructures of materials. For common microstructures including grain\nboundary and multiple phases, MatSAM achieves superior zero-shot segmentation\nperformance to conventional rule-based methods and is even preferable to\nsupervised learning methods evaluated on 16 microscopy datasets whose\nmicrographs are imaged by the optical microscope (OM) and scanning electron\nmicroscope (SEM). 
Notably, on 4 public datasets, MatSAM shows unexpectedly\ncompetitive segmentation performance against their specialist models. We\nbelieve that, without the need for human labeling, MatSAM can significantly\nreduce the cost of quantitative characterization and statistical analysis of\nextensive microstructures of materials, and thus accelerate the design of new\nmaterials."},{"date":"2024-01","title":"Large Model based Sequential Keyframe Extraction for Video Summarization","author":"Kailong Tan, Yuxiang Zhou, Qianchen Xia, Rui Liu, and Yong Chen","link":"http://arxiv.org/abs/2401.04962v1","abstract":"Keyframe extraction aims to sum up a video's semantics with the minimum\nnumber of its frames. This paper puts forward a Large Model based Sequential\nKeyframe Extraction for video summarization, dubbed LMSKE, which contains three\nstages, as follows. First, we use the large model \"TransNetV2\" to cut the video\ninto consecutive shots, and employ the large model \"CLIP\" to generate each\nframe's visual feature within each shot; Second, we develop an adaptive\nclustering algorithm to yield candidate keyframes for each shot, with each\ncandidate keyframe locating nearest to a cluster center; Third, we further\nreduce the above candidate keyframes via redundancy elimination within each\nshot, and finally concatenate them in accordance with the sequence of shots as\nthe final sequential keyframes. To evaluate LMSKE, we curate a benchmark\ndataset and conduct rich experiments, whose results exhibit that LMSKE performs\nmuch better than quite a few SOTA competitors with average F1 of 0.5311,\naverage fidelity of 0.8141, and average compression ratio of 0.9922."},{"date":"2024-01","title":"Segment anything model (SAM) for brain extraction in fMRI studies","author":"Dwith Chenna, and Suyash Bhogawar","link":"http://arxiv.org/abs/2401.04740v1","abstract":"Brain extraction and removal of skull artifacts from magnetic resonance\nimages (MRI) is an important preprocessing step in neuroimaging analysis. There\nare many tools developed to handle human fMRI images, which can involve\nmanual steps for verifying brain segmentation results, making the process\ntime-consuming and inefficient. In this study, we will use the segment anything\nmodel (SAM), a freely available neural network released by Meta[4], which has\nshown promising results in many generic segmentation applications. We will\nanalyze the efficiency of SAM for neuroimaging brain segmentation by removing\nskull artifacts. The experiments showed promising results, supporting the use\nof automated segmentation algorithms for neuroimaging without the\nneed to train on a custom medical imaging dataset."},{"date":"2024-01","title":"A Span-based Model for Extracting Overlapping PICO Entities from RCT Publications","author":"Gongbo Zhang, Yiliang Zhou, Yan Hu, Hua Xu, Chunhua Weng, and Yifan Peng","link":"http://arxiv.org/abs/2401.06791v1","abstract":"Objectives Extraction of PICO (Populations, Interventions, Comparison, and\nOutcomes) entities is fundamental to evidence retrieval. We present a novel\nmethod PICOX to extract overlapping PICO entities.\n Materials and Methods PICOX first identifies entities by assessing whether a\nword marks the beginning or conclusion of an entity. Then it uses a multi-label\nclassifier to assign one or more PICO labels to a span candidate.
PICOX was\nevaluated using one of the best-performing baselines, EBM-NLP, and three more\ndatasets, i.e., PICO-Corpus and RCT publications on Alzheimer's Disease or\nCOVID-19, using entity-level precision, recall, and F1 scores.\n Results PICOX achieved superior precision, recall, and F1 scores across the\nboard, with the micro F1 score improving from 45.05 to 50.87 (p << 0.01). On\nthe PICO-Corpus, PICOX obtained higher recall and F1 scores than the baseline\nand improved the micro recall score from 56.66 to 67.33. On the COVID-19\ndataset, PICOX also outperformed the baseline and improved the micro F1 score\nfrom 77.10 to 80.32. On the AD dataset, PICOX demonstrated comparable F1 scores\nwith higher precision when compared to the baseline.\n Conclusion PICOX excels in identifying overlapping entities and consistently\nsurpasses a leading baseline across multiple datasets. Ablation studies reveal\nthat its data augmentation strategy effectively minimizes false positives and\nimproves precision."},{"date":"2024-01","title":"Beyond Extraction: Contextualising Tabular Data for Efficient Summarisation by Language Models","author":"Uday Allu, Biddwan Ahmed, and Vishesh Tripathi","link":"http://arxiv.org/abs/2401.02333v3","abstract":"The conventional use of the Retrieval-Augmented Generation (RAG) architecture\nhas proven effective for retrieving information from diverse documents.\nHowever, challenges arise in handling complex table queries, especially within\nPDF documents containing intricate tabular structures. This research introduces\nan innovative approach to enhance the accuracy of complex table queries in\nRAG-based systems. Our methodology involves storing PDFs in the retrieval\ndatabase and extracting tabular content separately. The extracted tables\nundergo a process of context enrichment, concatenating headers with\ncorresponding values. To ensure a comprehensive understanding of the enriched\ndata, we employ a fine-tuned version of the Llama-2-chat language model for\nsummarisation within the RAG architecture. Furthermore, we augment the tabular\ndata with contextual sense using the ChatGPT 3.5 API through a one-shot prompt.\nThis enriched data is then fed into the retrieval database alongside other\nPDFs. Our approach aims to significantly improve the precision of complex table\nqueries, offering a promising solution to a longstanding challenge in\ninformation retrieval."},{"date":"2024-01","title":"Enhancing Representation in Medical Vision-Language Foundation Models via Multi-Scale Information Extraction Techniques","author":"Weijian Huang, Cheng Li, Hong-Yu Zhou, Jiarun Liu, Hao Yang, Yong Liang, Guangming Shi, Hairong Zheng, and Shanshan Wang","link":"http://arxiv.org/abs/2401.01583v2","abstract":"The development of medical vision-language foundation models has attracted\nsignificant attention in the field of medicine and healthcare due to their\npromising prospect in various clinical applications. While previous studies\nhave commonly focused on feature learning at a single learning scale,\ninvestigation on integrating multi-scale information is lacking, which may\nhinder the potential for mutual reinforcement among these features. This paper\naims to bridge this gap by proposing a method that effectively exploits\nmulti-scale information to enhance the performance of medical foundation\nmodels. The proposed method simultaneously exploits features at the local,\ninstance, modality and global aspects, facilitating comprehensive\nrepresentation learning within the models.
We evaluate the effectiveness of the\nproposed method on six open-source datasets across different clinical tasks,\ndemonstrating its ability to enhance the performance of medical foundation\nmodels."},{"date":"2023-12","title":"Robust Knowledge Extraction from Large Language Models using Social Choice Theory","author":"Nico Potyka, Yuqicheng Zhu, Yunjie He, Evgeny Kharlamov, and Steffen Staab","link":"http://arxiv.org/abs/2312.14877v2","abstract":"Large language models (LLMs) can support a wide range of applications like\nconversational agents, creative writing or general query answering. However,\nthey are ill-suited for query answering in high-stakes domains like medicine\nbecause they are typically not robust - even the same query can result in\ndifferent answers when prompted multiple times. In order to improve the\nrobustness of LLM queries, we propose issuing ranking queries repeatedly and\naggregating the query results using methods from social choice theory. We study ranking\nqueries in diagnostic settings like medical and fault diagnosis and discuss how\nthe Partial Borda Choice function from the literature can be applied to merge\nmultiple query results. We discuss some additional interesting properties in\nour setting and evaluate the robustness of our approach empirically."},{"date":"2023-12","title":"MEAOD: Model Extraction Attack against Object Detectors","author":"Zeyu Li, Chenghui Shi, Yuwen Pu, Xuhong Zhang, Yu Li, Jinbao Li, and Shouling Ji","link":"http://arxiv.org/abs/2312.14677v1","abstract":"The widespread use of deep learning technology across various industries has\nmade deep neural network models highly valuable and, as a result, attractive\ntargets for potential attackers. Model extraction attacks, particularly\nquery-based model extraction attacks, allow attackers to replicate a substitute\nmodel with comparable functionality to the victim model and present a\nsignificant threat to the confidentiality and security of MLaaS platforms.\nWhile many studies have explored threats of model extraction attacks against\nclassification models in recent years, object detection models, which are more\nfrequently used in real-world scenarios, have received less attention. In this\npaper, we investigate the challenges and feasibility of query-based model\nextraction attacks against object detection models and propose an effective\nattack method called MEAOD. It selects samples from the attacker-possessed\ndataset to construct an efficient query dataset using active learning and\nenhances the categories with insufficient objects. We additionally improve the\nextraction effectiveness by updating the annotations of the query dataset.\nAccording to our gray-box and black-box scenario experiments, we achieve an\nextraction performance of over 70% under the given condition of a 10k query\nbudget."},{"date":"2023-12","title":"Zero-shot Building Attribute Extraction from Large-Scale Vision and Language Models","author":"Fei Pan, Sangryul Jeon, Brian Wang, Frank Mckenna, and Stella X. Yu","link":"http://arxiv.org/abs/2312.12479v1","abstract":"Existing building recognition methods, exemplified by BRAILS, utilize\nsupervised learning to extract information from satellite and street-view\nimages for classification and segmentation. However, each task module requires\nhuman-annotated data, hindering the scalability and robustness to regional\nvariations and annotation imbalances.
In response, we propose a new zero-shot\nworkflow for building attribute extraction that utilizes large-scale vision and\nlanguage models to mitigate reliance on external annotations. The proposed\nworkflow contains two key components: image-level captioning and segment-level\ncaptioning for the building images based on the vocabularies pertinent to\nstructural and civil engineering. These two components generate descriptive\ncaptions by computing feature representations of the image and the\nvocabularies, and facilitating a semantic match between the visual and textual\nrepresentations. Consequently, our framework offers a promising avenue to\nenhance AI-driven captioning for building attribute extraction in the\nstructural and civil engineering domains, ultimately reducing reliance on human\nannotations while bolstering performance and adaptability."},{"date":"2023-12","title":"Model Stealing Attack against Graph Classification with Authenticity, Uncertainty and Diversity","author":"Zhihao Zhu, Chenwang Wu, Rui Fan, Yi Yang, Zhen Wang, Defu Lian, and Enhong Chen","link":"http://arxiv.org/abs/2312.10943v3","abstract":"Recent research demonstrates that GNNs are vulnerable to the model stealing\nattack, a nefarious endeavor geared towards duplicating the target model via\nquery permissions. However, they mainly focus on node classification tasks,\nneglecting the potential threats entailed within the domain of graph\nclassification tasks. Furthermore, their practicality is questionable due to\nunreasonable assumptions, specifically concerning the large data requirements\nand extensive model knowledge. To this end, we advocate following strict\nsettings with limited real data and hard-label awareness to generate synthetic\ndata, thereby facilitating the stealing of the target model. Specifically,\nfollowing important data generation principles, we introduce three model\nstealing attacks to adapt to different actual scenarios: MSA-AU is inspired by\nactive learning and emphasizes the uncertainty to enhance query value of\ngenerated samples; MSA-AD introduces diversity based on Mixup augmentation\nstrategy to alleviate the query inefficiency issue caused by over-similar\nsamples generated by MSA-AU; MSA-AUD combines the above two strategies to\nseamlessly integrate the authenticity, uncertainty, and diversity of the\ngenerated samples. Finally, extensive experiments consistently demonstrate the\nsuperiority of the proposed methods in terms of concealment, query efficiency,\nand stealing performance."},{"date":"2023-12","title":"Model Stealing Attack against Recommender System","author":"Zhihao Zhu, Rui Fan, Chenwang Wu, Yi Yang, Defu Lian, and Enhong Chen","link":"http://arxiv.org/abs/2312.11571v2","abstract":"Recent studies have demonstrated the vulnerability of recommender systems to\ndata privacy attacks. However, research on the threat to model privacy in\nrecommender systems, such as model stealing attacks, is still in its infancy.\nSome adversarial attacks have achieved model stealing attacks against\nrecommender systems, to some extent, by collecting abundant training data of\nthe target model (target data) or making a mass of queries. In this paper, we\nconstrain the volume of available target data and queries and utilize auxiliary\ndata, which shares the item set with the target data, to promote model stealing\nattacks. 
Although the target model treats target and auxiliary data\ndifferently, their similar behavior patterns allow them to be fused using an\nattention mechanism to assist attacks. Besides, we design stealing functions to\neffectively extract the recommendation list obtained by querying the target\nmodel. Experimental results show that the proposed methods are applicable to\nmost recommender systems and various scenarios and exhibit excellent attack\nperformance on multiple datasets."},{"date":"2023-12","title":"SAME: Sample Reconstruction against Model Extraction Attacks","author":"Yi Xie, Jie Zhang, Shiqian Zhao, Tianwei Zhang, and Xiaofeng Chen","link":"http://arxiv.org/abs/2312.10578v2","abstract":"While deep learning models have shown significant performance across various\ndomains, their deployment needs extensive resources and advanced computing\ninfrastructure. As a solution, Machine Learning as a Service (MLaaS) has\nemerged, lowering the barriers for users to release or productize their deep\nlearning models. However, previous studies have highlighted potential privacy\nand security concerns associated with MLaaS, and one primary threat is model\nextraction attacks. To address this, there are many defense solutions but they\nsuffer from unrealistic assumptions and generalization issues, making them less\npractical for reliable protection. Driven by these limitations, we introduce a\nnovel defense mechanism, SAME, based on the concept of sample reconstruction.\nThis strategy imposes minimal prerequisites on the defender's capabilities,\neliminating the need for auxiliary Out-of-Distribution (OOD) datasets, user\nquery history, white-box model access, and additional intervention during model\ntraining. It is compatible with existing active defense methods. Our extensive\nexperiments corroborate the superior efficacy of SAME over state-of-the-art\nsolutions. Our code is available at https://github.com/xythink/SAME."},{"date":"2023-12","title":"High-throughput Biomedical Relation Extraction for Semi-Structured Web Articles Empowered by Large Language Models","author":"Songchi Zhou, and Sheng Yu","link":"http://arxiv.org/abs/2312.08274v4","abstract":"Objective: To develop a high-throughput biomedical relation extraction system\nthat takes advantage of the large language models'(LLMs) reading comprehension\nability and biomedical world knowledge in a scalable and evidential manner.\nMethods: We formulate the relation extraction task as binary classifications\nfor large language models. Specifically, LLMs make the decision based on the\nexternal corpus and its world knowledge, giving the reason for the judgment for\nfactual verification. This method is tailored for semi-structured web articles,\nwherein we designate the main title as the tail entity and explicitly\nincorporate it into the context, and the potential head entities are matched\nbased on a biomedical thesaurus. Moreover, lengthy contents are sliced into\ntext chunks, embedded, and retrieved with additional embedding models. Results:\nUsing an open-source LLM, we extracted 248659 relation triplets of three\ndistinct relation types from three reputable biomedical websites. To assess the\nefficacy of the basic pipeline employed for biomedical relation extraction, we\ncurated a benchmark dataset annotated by a medical expert. 
Evaluation results\nindicate that the pipeline exhibits performance comparable to that of GPT-4.\nCase studies further illuminate challenges faced by contemporary LLMs in the\ncontext of biomedical relation extraction for semi-structured web articles.\nConclusion: The proposed method has demonstrated its effectiveness in\nleveraging the strengths of LLMs for high-throughput biomedical relation\nextraction. Its adaptability is evident, as it can be seamlessly extended to\ndiverse semi-structured biomedical websites, facilitating the extraction of\nvarious types of biomedical relations with ease."},{"date":"2023-12","title":"BED: Bi-Encoder-Decoder Model for Canonical Relation Extraction","author":"Nantao Zheng, Siyu Long, and Xinyu Dai","link":"http://arxiv.org/abs/2312.07088v1","abstract":"Canonical relation extraction aims to extract relational triples from\nsentences, where the triple elements (entity pairs and their relationship) are\nmapped to the knowledge base. Recently, methods based on the encoder-decoder\narchitecture are proposed and achieve promising results. However, these methods\ncannot well utilize the entity information, which is merely used as augmented\ntraining data. Moreover, they are incapable of representing novel entities,\nsince no embeddings have been learned for them. In this paper, we propose a\nnovel framework, Bi-Encoder-Decoder (BED), to solve the above issues.\nSpecifically, to fully utilize entity information, we employ an encoder to\nencode semantics of this information, leading to high-quality entity\nrepresentations. For novel entities, given a trained entity encoder, their\nrepresentations can be easily generated. Experimental results on two datasets\nshow that, our method achieves a significant performance improvement over the\nprevious state-of-the-art and handle novel entities well without retraining."},{"date":"2023-12","title":"Model Extraction Attacks Revisited","author":"Jiacheng Liang, Ren Pang, Changjiang Li, and Ting Wang","link":"http://arxiv.org/abs/2312.05386v1","abstract":"Model extraction (ME) attacks represent one major threat to\nMachine-Learning-as-a-Service (MLaaS) platforms by ``stealing'' the\nfunctionality of confidential machine-learning models through querying\nblack-box APIs. Over seven years have passed since ME attacks were first\nconceptualized in the seminal work. During this period, substantial advances\nhave been made in both ME attacks and MLaaS platforms, raising the intriguing\nquestion: How has the vulnerability of MLaaS platforms to ME attacks been\nevolving? In this work, we conduct an in-depth study to answer this critical\nquestion. Specifically, we characterize the vulnerability of current,\nmainstream MLaaS platforms to ME attacks from multiple perspectives including\nattack strategies, learning techniques, surrogate-model design, and benchmark\ntasks. Many of our findings challenge previously reported results, suggesting\nemerging patterns of ME vulnerability. Further, by analyzing the vulnerability\nof the same MLaaS platforms using historical datasets from the past four years,\nwe retrospectively characterize the evolution of ME vulnerability over time,\nleading to a set of interesting findings. Finally, we make suggestions about\nimproving the current practice of MLaaS in terms of attack robustness. 
Our\nstudy sheds light on the current state of ME vulnerability in the wild and\npoints to several promising directions for future research."},{"date":"2023-12","title":"Fine-tuning pre-trained extractive QA models for clinical document parsing","author":"Ashwyn Sharma, David I. Feldman, and Aneesh Jain","link":"http://arxiv.org/abs/2312.02314v1","abstract":"Electronic health records (EHRs) contain a vast amount of high-dimensional\nmulti-modal data that can accurately represent a patient's medical history.\nUnfortunately, most of this data is either unstructured or semi-structured,\nrendering it unsuitable for real-time and retrospective analyses. A remote\npatient monitoring (RPM) program for Heart Failure (HF) patients needs to have\naccess to clinical markers like EF (Ejection Fraction) or LVEF (Left\nVentricular Ejection Fraction) in order to ascertain eligibility and\nappropriateness for the program. This paper explains a system that can parse\nechocardiogram reports and verify EF values. This system helps identify\neligible HF patients who can be enrolled in such a program. At the heart of\nthis system is a pre-trained extractive QA transformer model that is fine-tuned\non custom-labeled data. The methods used to prepare such a model for deployment\nare illustrated by running experiments on a public clinical dataset like\nMIMIC-IV-Note. The pipeline can be used to generalize solutions to similar\nproblems in a low-resource setting. We found that the system saved over 1500\nhours for our clinicians over 12 months by automating the task at scale."},{"date":"2023-12","title":"LLM-TAKE: Theme Aware Keyword Extraction Using Large Language Models","author":"Reza Yousefi Maragheh, Chenhao Fang, Charan Chand Irugu, Parth Parikh, Jason Cho, Jianpeng Xu, Saranyan Sukumar, Malay Patel, Evren Korpeoglu, Sushant Kumar, and Kannan Achan","link":"http://arxiv.org/abs/2312.00909v1","abstract":"Keyword extraction is one of the core tasks in natural language processing.\nClassic extraction models are notorious for having a short attention span which\nmake it hard for them to conclude relational connections among the words and\nsentences that are far from each other. This, in turn, makes their usage\nprohibitive for generating keywords that are inferred from the context of the\nwhole text. In this paper, we explore using Large Language Models (LLMs) in\ngenerating keywords for items that are inferred from the items textual\nmetadata. Our modeling framework includes several stages to fine grain the\nresults by avoiding outputting keywords that are non informative or sensitive\nand reduce hallucinations common in LLM. We call our LLM-based framework\nTheme-Aware Keyword Extraction (LLM TAKE). We propose two variations of\nframework for generating extractive and abstractive themes for products in an E\ncommerce setting. We perform an extensive set of experiments on three real data\nsets and show that our modeling framework can enhance accuracy based and\ndiversity based metrics when compared with benchmark models."},{"date":"2023-11","title":"Scalable Extraction of Training Data from (Production) Language Models","author":"Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. 
Choquette-Choo, Eric Wallace, Florian Tram\u00e8r, and Katherine Lee","link":"http://arxiv.org/abs/2311.17035v1","abstract":"This paper studies extractable memorization: training data that an adversary\ncan efficiently extract by querying a machine learning model without prior\nknowledge of the training dataset. We show an adversary can extract gigabytes\nof training data from open-source language models like Pythia or GPT-Neo,\nsemi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing\ntechniques from the literature suffice to attack unaligned models; in order to\nattack the aligned ChatGPT, we develop a new divergence attack that causes the\nmodel to diverge from its chatbot-style generations and emit training data at a\nrate 150x higher than when behaving properly. Our methods show practical\nattacks can recover far more data than previously thought, and reveal that\ncurrent alignment techniques do not eliminate memorization."},{"date":"2023-11","title":"GPT Struct Me: Probing GPT Models on Narrative Entity Extraction","author":"Hugo Sousa, Nuno Guimar\u00e3es, Al\u00edpio Jorge, and Ricardo Campos","link":"http://arxiv.org/abs/2311.14583v1","abstract":"The importance of systems that can extract structured information from\ntextual data becomes increasingly pronounced given the ever-increasing volume\nof text produced on a daily basis. Having a system that can effectively extract\nsuch information in an interoperable manner would be an asset for several\ndomains, be it finance, health, or legal. Recent developments in natural\nlanguage processing led to the production of powerful language models that can,\nto some degree, mimic human intelligence. Such effectiveness raises a pertinent\nquestion: Can these models be leveraged for the extraction of structured\ninformation? In this work, we address this question by evaluating the\ncapabilities of two state-of-the-art language models -- GPT-3 and GPT-3.5,\ncommonly known as ChatGPT -- in the extraction of narrative entities, namely\nevents, participants, and temporal expressions. This study is conducted on the\nText2Story Lusa dataset, a collection of 119 Portuguese news articles whose\nannotation framework includes a set of entity structures along with several\ntags and attribute values. We first select the best prompt template through an\nablation study over prompt components that provide varying degrees of\ninformation on a subset of documents of the dataset. Subsequently, we use the\nbest templates to evaluate the effectiveness of the models on the remaining\ndocuments. The results obtained indicate that GPT models are competitive with\nout-of-the-box baseline systems, presenting an all-in-one alternative for\npractitioners with limited resources. By studying the strengths and limitations\nof these models in the context of information extraction, we offer insights\nthat can guide future improvements and avenues to explore in this field."},{"date":"2023-11","title":"Steal My Artworks for Fine-tuning? A Watermarking Framework for Detecting Art Theft Mimicry in Text-to-Image Models","author":"Ge Luo, Junqiang Huang, Manman Zhang, Zhenxing Qian, Sheng Li, and Xinpeng Zhang","link":"http://arxiv.org/abs/2311.13619v1","abstract":"The advancement in text-to-image models has led to astonishing artistic\nperformances. 
However, several studios and websites illegally fine-tune these\nmodels using artists' artworks to mimic their styles for profit, which violates\nthe copyrights of artists and diminishes their motivation to produce original\nworks. Currently, there is a notable lack of research focusing on this issue.\nIn this paper, we propose a novel watermarking framework that detects mimicry\nin text-to-image models through fine-tuning. This framework embeds subtle\nwatermarks into digital artworks to protect their copyrights while still\npreserving the artist's visual expression. If someone takes watermarked\nartworks as training data to mimic an artist's style, these watermarks can\nserve as detectable indicators. By analyzing the distribution of these\nwatermarks in a series of generated images, acts of fine-tuning mimicry using\nstolen victim data will be exposed. In various fine-tune scenarios and against\nwatermark attack methods, our research confirms that analyzing the distribution\nof watermarks in artificially generated images reliably detects unauthorized\nmimicry."},{"date":"2023-11","title":"Use GPT-J Prompt Generation with RoBERTa for NER Models on Diagnosis Extraction of Periodontal Diagnosis from Electronic Dental Records","author":"Yao-Shun Chuang, Xiaoqian Jiang, Chun-Teh Lee, Ryan Brandon, Duong Tran, Oluwabunmi Tokede, and Muhammad F. Walji","link":"http://arxiv.org/abs/2311.10810v1","abstract":"This study explored the usability of prompt generation on named entity\nrecognition (NER) tasks and the performance in different settings of the\nprompt. The prompt generation by GPT-J models was utilized to directly test the\ngold standard as well as to generate the seed and further fed to the RoBERTa\nmodel with the spaCy package. In the direct test, a lower ratio of negative\nexamples with higher numbers of examples in prompt achieved the best results\nwith a F1 score of 0.72. The performance revealed consistency, 0.92-0.97 in the\nF1 score, in all settings after training with the RoBERTa model. The study\nhighlighted the importance of seed quality rather than quantity in feeding NER\nmodels. This research reports on an efficient and accurate way to mine clinical\nnotes for periodontal diagnoses, allowing researchers to easily and quickly\nbuild a NER model with the prompt generation approach."},{"date":"2023-11","title":"Semi-automatic Data Enhancement for Document-Level Relation Extraction with Distant Supervision from Large Language Models","author":"Junpeng Li, Zixia Jia, and Zilong Zheng","link":"http://arxiv.org/abs/2311.07314v1","abstract":"Document-level Relation Extraction (DocRE), which aims to extract relations\nfrom a long context, is a critical challenge in achieving fine-grained\nstructural comprehension and generating interpretable document representations.\nInspired by recent advances in in-context learning capabilities emergent from\nlarge language models (LLMs), such as ChatGPT, we aim to design an automated\nannotation method for DocRE with minimum human effort. Unfortunately, vanilla\nin-context learning is infeasible for document-level relation extraction due to\nthe plenty of predefined fine-grained relation types and the uncontrolled\ngenerations of LLMs. 
To tackle this issue, we propose a method integrating a\nlarge language model (LLM) and a natural language inference (NLI) module to\ngenerate relation triples, thereby augmenting document-level relation datasets.\nWe demonstrate the effectiveness of our approach by introducing an enhanced\ndataset known as DocGNRE, which excels in re-annotating numerous long-tail\nrelation types. We are confident that our method holds the potential for\nbroader applications in domain-specific relation type definitions and offers\ntangible benefits in advancing generalized language semantic comprehension."},{"date":"2023-11","title":"Army of Thieves: Enhancing Black-Box Model Extraction via Ensemble based sample selection","author":"Akshit Jindal, Vikram Goyal, Saket Anand, and Chetan Arora","link":"http://arxiv.org/abs/2311.04588v1","abstract":"Machine Learning (ML) models become vulnerable to Model Stealing Attacks\n(MSA) when they are deployed as a service. In such attacks, the deployed model\nis queried repeatedly to build a labelled dataset. This dataset allows the\nattacker to train a thief model that mimics the original model. To maximize\nquery efficiency, the attacker has to select the most informative subset of\ndata points from the pool of available data. Existing attack strategies utilize\napproaches like Active Learning and Semi-Supervised learning to minimize costs.\nHowever, in the black-box setting, these approaches may select sub-optimal\nsamples as they train only one thief model. Depending on the thief model's\ncapacity and the data it was pretrained on, the model might even select noisy\nsamples that harm the learning process. In this work, we explore the usage of\nan ensemble of deep learning models as our thief model. We call our attack Army\nof Thieves(AOT) as we train multiple models with varying complexities to\nleverage the crowd's wisdom. Based on the ensemble's collective decision,\nuncertain samples are selected for querying, while the most confident samples\nare directly included in the training data. Our approach is the first one to\nutilize an ensemble of thief models to perform model extraction. We outperform\nthe base approaches of existing state-of-the-art methods by at least 3% and\nachieve a 21% higher adversarial sample transferability than previous work for\nmodels trained on the CIFAR-10 dataset."},{"date":"2023-11","title":"JPAVE: A Generation and Classification-based Model for Joint Product Attribute Prediction and Value Extraction","author":"Zhongfen Deng, Hao Peng, Tao Zhang, Shuaiqi Liu, Wenting Zhao, Yibo Wang, and Philip S. Yu","link":"http://arxiv.org/abs/2311.04196v1","abstract":"Product attribute value extraction is an important task in e-Commerce which\ncan help several downstream applications such as product search and\nrecommendation. Most previous models handle this task using sequence labeling\nor question answering method which rely on the sequential position information\nof values in the product text and are vulnerable to data discrepancy between\ntraining and testing. This limits their generalization ability to real-world\nscenario in which each product can have multiple descriptions across various\nshopping platforms with different composition of text and style. They also have\nlimited zero-shot ability to new values. In this paper, we propose a multi-task\nlearning model with value generation/classification and attribute prediction\ncalled JPAVE to predict values without the necessity of position information of\nvalues in the text. 
Furthermore, the copy mechanism in value generator and the\nvalue attention module in value classifier help our model address the data\ndiscrepancy issue by only focusing on the relevant part of input text and\nignoring other information which causes the discrepancy issue such as sentence\nstructure in the text. Besides, two variants of our model are designed for\nopen-world and closed-world scenarios. In addition, copy mechanism introduced\nin the first variant based on value generation can improve its zero-shot\nability for identifying unseen values. Experimental results on a public dataset\ndemonstrate the superiority of our model compared with strong baselines and its\ngeneralization ability of predicting new values."},{"date":"2023-11","title":"Extracting human interpretable structure-property relationships in chemistry using XAI and large language models","author":"Geemi P. Wellawatte, and Philippe Schwaller","link":"http://arxiv.org/abs/2311.04047v1","abstract":"Explainable Artificial Intelligence (XAI) is an emerging field in AI that\naims to address the opaque nature of machine learning models. Furthermore, it\nhas been shown that XAI can be used to extract input-output relationships,\nmaking them a useful tool in chemistry to understand structure-property\nrelationships. However, one of the main limitations of XAI methods is that they\nare developed for technically oriented users. We propose the XpertAI framework\nthat integrates XAI methods with large language models (LLMs) accessing\nscientific literature to generate accessible natural language explanations of\nraw chemical data automatically. We conducted 5 case studies to evaluate the\nperformance of XpertAI. Our results show that XpertAI combines the strengths of\nLLMs and XAI tools in generating specific, scientific, and interpretable\nexplanations."},{"date":"2023-11","title":"Reinforcement Learning Fine-tuning of Language Models is Biased Towards More Extractable Features","author":"Diogo Cruz, Edoardo Pona, Alex Holness-Tofts, Elias Schmied, V\u00edctor Abia Alonso, Charlie Griffin, and Bogdan-Ionut Cirstea","link":"http://arxiv.org/abs/2311.04046v1","abstract":"Many capable large language models (LLMs) are developed via self-supervised\npre-training followed by a reinforcement-learning fine-tuning phase, often\nbased on human or AI feedback. During this stage, models may be guided by their\ninductive biases to rely on simpler features which may be easier to extract, at\na cost to robustness and generalisation. We investigate whether principles\ngoverning inductive biases in the supervised fine-tuning of LLMs also apply\nwhen the fine-tuning process uses reinforcement learning. 
Following Lovering et\nal (2021), we test two hypotheses: that features more $\\textit{extractable}$\nafter pre-training are more likely to be utilised by the final policy, and that\nthe evidence for/against a feature predicts whether it will be utilised.\nThrough controlled experiments on synthetic and natural language tasks, we find\nstatistically significant correlations which constitute strong evidence for\nthese hypotheses."},{"date":"2023-11","title":"Enhancing AI Research Paper Analysis: Methodology Component Extraction using Factored Transformer-based Sequence Modeling Approach","author":"Madhusudan Ghosh, Debasis Ganguly, Partha Basuchowdhuri, and Sudip Kumar Naskar","link":"http://arxiv.org/abs/2311.03401v1","abstract":"Research in scientific disciplines evolves, often rapidly, over time with the\nemergence of novel methodologies and their associated terminologies. While\nmethodologies themselves being conceptual in nature and rather difficult to\nautomatically extract and characterise, in this paper, we seek to develop\nsupervised models for automatic extraction of the names of the various\nconstituents of a methodology, e.g., `R-CNN', `ELMo' etc. The main research\nchallenge for this task is effectively modeling the contexts around these\nmethodology component names in a few-shot or even a zero-shot setting. The main\ncontributions of this paper towards effectively identifying new evolving\nscientific methodology names are as follows: i) we propose a factored approach\nto sequence modeling, which leverages a broad-level category information of\nmethodology domains, e.g., `NLP', `RL' etc.; ii) to demonstrate the feasibility\nof our proposed approach of identifying methodology component names under a\npractical setting of fast evolving AI literature, we conduct experiments\nfollowing a simulated chronological setup (newer methodologies not seen during\nthe training process); iii) our experiments demonstrate that the factored\napproach outperforms state-of-the-art baselines by margins of up to 9.257\\% for\nthe methodology extraction task with the few-shot setup."},{"date":"2023-11","title":"Extraction of Atypical Aspects from Customer Reviews: Datasets and Experiments with Language Models","author":"Smita Nannaware, Erfan Al-Hossami, and Razvan Bunescu","link":"http://arxiv.org/abs/2311.02702v1","abstract":"A restaurant dinner may become a memorable experience due to an unexpected\naspect enjoyed by the customer, such as an origami-making station in the\nwaiting area. If aspects that are atypical for a restaurant experience were\nknown in advance, they could be leveraged to make recommendations that have the\npotential to engender serendipitous experiences, further increasing user\nsatisfaction. Although relatively rare, whenever encountered, atypical aspects\noften end up being mentioned in reviews due to their memorable quality.\nCorrespondingly, in this paper we introduce the task of detecting atypical\naspects in customer reviews. 
To facilitate the development of extraction\nmodels, we manually annotate benchmark datasets of reviews in three domains -\nrestaurants, hotels, and hair salons, which we use to evaluate a number of\nlanguage models, ranging from fine-tuning the instruction-based text-to-text\ntransformer Flan-T5 to zero-shot and few-shot prompting of GPT-3.5."},{"date":"2023-10","title":"rTsfNet: a DNN model with Multi-head 3D Rotation and Time Series Feature Extraction for IMU-based Human Activity Recognition","author":"Yu Enokibori","link":"http://arxiv.org/abs/2310.19283v3","abstract":"Although many deep learning (DL) algorithms have been proposed for the\nIMU-based HAR domain, traditional machine learning that utilizes handcrafted\ntime series features (TSFs) still often performs well. It is not rare that\ncombinations among DL and TSFs show better accuracy than DL-only approaches.\nHowever, there is a problem with time series features in IMU-based HAR. The\namount of derived features can vary greatly depending on the method used to\nselect the 3D basis. Fortunately, DL's strengths include capturing the features\nof input data and adaptively deriving parameters. Thus, as a new DNN model for\nIMU-based human activity recognition (HAR), this paper proposes rTsfNet, a DNN\nmodel with Multi-head 3D Rotation and Time Series Feature Extraction. rTsfNet\nautomatically selects 3D bases from which features should be derived by\nextracting 3D rotation parameters within the DNN. Then, time series features\n(TSFs), based on many researchers' wisdom, are derived to achieve HAR using\nMLP. Although rTsfNet is a model that does not use CNN, it achieved higher\naccuracy than existing models under well-managed benchmark conditions and\nmultiple datasets: UCI HAR, PAMAP2, Daphnet, and OPPORTUNITY, all of which\ntarget different activities."},{"date":"2023-10","title":"Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting","author":"Hejie Cui, Xinyu Fang, Zihan Zhang, Ran Xu, Xuan Kan, Xin Liu, Yue Yu, Manling Li, Yangqiu Song, and Carl Yang","link":"http://arxiv.org/abs/2310.18804v1","abstract":"Images contain rich relational knowledge that can help machines understand\nthe world. Existing methods on visual knowledge extraction often rely on the\npre-defined format (e.g., sub-verb-obj tuples) or vocabulary (e.g., relation\ntypes), restricting the expressiveness of the extracted knowledge. In this\nwork, we take a first exploration to a new paradigm of open visual knowledge\nextraction. To achieve this, we present OpenVik which consists of an open\nrelational region detector to detect regions potentially containing relational\nknowledge and a visual knowledge generator that generates format-free knowledge\nby prompting the large multimodality model with the detected region of\ninterest. We also explore two data enhancement techniques for diversifying the\ngenerated format-free visual knowledge. Extensive knowledge quality evaluations\nhighlight the correctness and uniqueness of the extracted open visual knowledge\nby OpenVik. Moreover, integrating our extracted knowledge across various visual\nreasoning applications shows consistent improvements, indicating the real-world\napplicability of OpenVik."},{"date":"2023-10","title":"Can large language models replace humans in the systematic review process? 
Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages","author":"Qusai Khraisha, Sophie Put, Johanna Kappenberg, Azza Warraitch, and Kristin Hadfield","link":"http://arxiv.org/abs/2310.17526v2","abstract":"Systematic reviews are vital for guiding practice, research, and policy, yet\nthey are often slow and labour-intensive. Large language models (LLMs) could\noffer a way to speed up and automate systematic reviews, but their performance\nin such tasks has not been comprehensively evaluated against humans, and no\nstudy has tested GPT-4, the biggest LLM so far. This pre-registered study\nevaluates GPT-4's capability in title/abstract screening, full-text review, and\ndata extraction across various literature types and languages using a\n'human-out-of-the-loop' approach. Although GPT-4 had accuracy on par with human\nperformance in most tasks, results were skewed by chance agreement and dataset\nimbalance. After adjusting for these, there was a moderate level of performance\nfor data extraction, and - barring studies that used highly reliable prompts -\nscreening performance levelled at none to moderate for different stages and\nlanguages. When screening full-text literature using highly reliable prompts,\nGPT-4's performance was 'almost perfect.' Penalising GPT-4 for missing key\nstudies using highly reliable prompts improved its performance even more. Our\nfindings indicate that, currently, substantial caution should be used if LLMs\nare being used to conduct systematic reviews, but suggest that, for certain\nsystematic review tasks delivered under reliable prompts, LLMs can rival human\nperformance."},{"date":"2023-10","title":"Prompt-Driven Building Footprint Extraction in Aerial Images with Offset-Building Model","author":"Kai Li, Yupeng Deng, Yunlong Kong, Diyou Liu, Jingbo Chen, Yu Meng, and Junxian Ma","link":"http://arxiv.org/abs/2310.16717v3","abstract":"More accurate extraction of invisible building footprints from\nvery-high-resolution (VHR) aerial images relies on roof segmentation and\nroof-to-footprint offset extraction. Existing state-of-the-art methods based on\ninstance segmentation suffer from poor generalization when extended to\nlarge-scale data production and fail to achieve low-cost human interactive\nannotation. The latest prompt paradigms inspire us to design a promptable\nframework for roof and offset extraction, which transforms end-to-end\nalgorithms into promptable methods. Within this framework, we propose a novel\nOffset-Building Model (OBM). To rigorously evaluate the algorithm's\ncapabilities, we introduce a prompt-based evaluation method, where our model\nreduces offset errors by 16.6% and improves roof Intersection over Union (IoU)\nby 10.8% compared to other models. Leveraging the common patterns in predicting\noffsets, we propose Distance-NMS (DNMS) algorithms, enabling the model to\nfurther reduce offset vector loss by 6.5%. To further validate the\ngeneralization of models, we tested them using a new dataset with over 7,000\nmanually annotated instance samples. Our algorithms and dataset are available\nat https://anonymous.4open.science/r/OBM-B3EC."},{"date":"2023-10","title":"Defense Against Model Extraction Attacks on Recommender Systems","author":"Sixiao Zhang, Hongzhi Yin, Hongxu Chen, and Cheng Long","link":"http://arxiv.org/abs/2310.16335v1","abstract":"The robustness of recommender systems has become a prominent topic within the\nresearch community. 
Numerous adversarial attacks have been proposed, but most\nof them rely on extensive prior knowledge, such as all the white-box attacks or\nmost of the black-box attacks which assume that certain external knowledge is\navailable. Among these attacks, the model extraction attack stands out as a\npromising and practical method, involving training a surrogate model by\nrepeatedly querying the target model. However, there is a significant gap in\nthe existing literature when it comes to defending against model extraction\nattacks on recommender systems. In this paper, we introduce Gradient-based\nRanking Optimization (GRO), which is the first defense strategy designed to\ncounter such attacks. We formalize the defense as an optimization problem,\naiming to minimize the loss of the protected target model while maximizing the\nloss of the attacker's surrogate model. Since top-k ranking lists are\nnon-differentiable, we transform them into swap matrices which are instead\ndifferentiable. These swap matrices serve as input to a student model that\nemulates the surrogate model's behavior. By back-propagating the loss of the\nstudent model, we obtain gradients for the swap matrices. These gradients are\nused to compute a swap loss, which maximizes the loss of the student model. We\nconducted experiments on three benchmark datasets to evaluate the performance\nof GRO, and the results demonstrate its superior effectiveness in defending\nagainst model extraction attacks."},{"date":"2023-10","title":"Efficient Data Learning for Open Information Extraction with Pre-trained Language Models","author":"Zhiyuan Fan, and Shizhu He","link":"http://arxiv.org/abs/2310.15021v2","abstract":"Open Information Extraction (OpenIE) is a fundamental yet challenging task in\nNatural Language Processing, which involves extracting all triples (subject,\npredicate, object) from a given sentence. While labeling-based methods have\ntheir merits, generation-based techniques offer unique advantages, such as the\nability to generate tokens not present in the original sentence. However, these\ngeneration-based methods often require a significant amount of training data to\nlearn the task form of OpenIE and substantial training time to overcome slow\nmodel convergence due to the order penalty. In this paper, we introduce a novel\nframework, OK-IE, that ingeniously transforms the task form of OpenIE into the\npre-training task form of the T5 model, thereby reducing the need for extensive\ntraining data. Furthermore, we introduce an innovative concept of Anchor to\ncontrol the sequence of model outputs, effectively eliminating the impact of\norder penalty on model convergence and significantly reducing training time.\nExperimental results indicate that, compared to previous SOTA methods, OK-IE\nrequires only 1/100 of the training data (900 instances) and 1/120 of the\ntraining time (3 minutes) to achieve comparable results."},{"date":"2023-10","title":"Knowledge Extraction and Distillation from Large-Scale Image-Text Colonoscopy Records Leveraging Large Language and Vision Models","author":"Shuo Wang, Yan Zhu, Xiaoyuan Luo, Zhiwei Yang, Yizhe Zhang, Peiyao Fu, Manning Wang, Zhijian Song, Quanlin Li, Pinghong Zhou, and Yike Guo","link":"http://arxiv.org/abs/2310.11173v1","abstract":"The development of artificial intelligence systems for colonoscopy analysis\noften necessitates expert-annotated image datasets. 
However, limitations in\ndataset size and diversity impede model performance and generalisation.\nImage-text colonoscopy records from routine clinical practice, comprising\nmillions of images and text reports, serve as a valuable data source, though\nannotating them is labour-intensive. Here we leverage recent advancements in\nlarge language and vision models and propose EndoKED, a data mining paradigm\nfor deep knowledge extraction and distillation. EndoKED automates the\ntransformation of raw colonoscopy records into image datasets with pixel-level\nannotation. We validate EndoKED using multi-centre datasets of raw colonoscopy\nrecords (~1 million images), demonstrating its superior performance in training\npolyp detection and segmentation models. Furthermore, the EndoKED pre-trained\nvision backbone enables data-efficient and generalisable learning for optical\nbiopsy, achieving expert-level performance in both retrospective and\nprospective validation."},{"date":"2023-10","title":"Document-Level In-Context Few-Shot Relation Extraction via Pre-Trained Language Models","author":"Yilmazcan Ozyurt, Stefan Feuerriegel, and Ce Zhang","link":"http://arxiv.org/abs/2310.11085v4","abstract":"Document-level relation extraction aims at inferring structured human\nknowledge from textual documents. State-of-the-art methods for this task use\npre-trained language models (LMs) via fine-tuning, yet fine-tuning is\ncomputationally expensive and cannot adapt to new relation types or new LMs. As\na remedy, we leverage the generalization capabilities of pre-trained LMs and\npresent a novel framework for document-level in-context few-shot relation\nextraction. Our framework has three strengths: it eliminates the need (1) for\nnamed entity recognition and (2) for human annotations of documents, and (3) it\ncan be updated to new LMs without re-training. We evaluate our framework using\nDocRED, the largest publicly available dataset for document-level relation\nextraction, and demonstrate that our framework achieves state-of-the-art\nperformance. We further show that our framework actually performs much better\nthan the original labels from the development set of DocRED. Finally, we\nconduct an extensive benchmark demonstrating the effectiveness of our\nframework, achieving state-of-the-art results across six relation extraction\ndatasets and outperforming more than 30 baseline methods. Unlike our framework,\nthe baseline methods have large computational overhead (e.g., from\nfine-tuning). To the best of our knowledge, we are the first to reformulate the\ndocument-level relation extraction task as a tailored in-context few-shot\nlearning paradigm."},{"date":"2023-10","title":"Convolutional Neural Network Model for Diabetic Retinopathy Feature Extraction and Classification","author":"Sharan Subramanian, and Leilani H. Gilpin","link":"http://arxiv.org/abs/2310.10806v1","abstract":"The application of Artificial Intelligence in the medical market brings up\nincreasing concerns but aids in more timely diagnosis of silent progressing\ndiseases like Diabetic Retinopathy. In order to diagnose Diabetic Retinopathy\n(DR), ophthalmologists use color fundus images, or pictures of the back of the\nretina, to identify small distinct features through a difficult and\ntime-consuming process. Our work creates a novel CNN model and identifies the\nseverity of DR through fundus image input. 
We classified 4 known DR features,\nincluding micro-aneurysms, cotton wools, exudates, and hemorrhages, through\nconvolutional layers and were able to provide an accurate diagnostic without\nadditional user input. The proposed model is more interpretable and robust to\noverfitting. We present initial results with a sensitivity of 97% and an\naccuracy of 71%. Our contribution is an interpretable model with similar\naccuracy to more complex models. With that, our model advances the field of DR\ndetection and proves to be a key step towards AI-focused medical diagnosis."},{"date":"2023-10","title":"SCME: A Self-Contrastive Method for Data-free and Query-Limited Model Extraction Attack","author":"Renyang Liu, Jinhong Zhang, Kwok-Yan Lam, Jun Zhao, and Wei Zhou","link":"http://arxiv.org/abs/2310.09792v1","abstract":"Previous studies have revealed that artificial intelligence (AI) systems are\nvulnerable to adversarial attacks. Among them, model extraction attacks fool\nthe target model by generating adversarial examples on a substitute model. The\ncore of such an attack is training a substitute model as similar to the target\nmodel as possible, where the simulation process can be categorized in a\ndata-dependent and data-free manner. Compared with the data-dependent method,\nthe data-free one has been proven to be more practical in the real world since\nit trains the substitute model with synthesized data. However, the distribution\nof these fake data lacks diversity and cannot detect the decision boundary of\nthe target model well, resulting in the dissatisfactory simulation effect.\nBesides, these data-free techniques need a vast number of queries to train the\nsubstitute model, increasing the time and computing consumption and the risk of\nexposure. To solve the aforementioned problems, in this paper, we propose a\nnovel data-free model extraction method named SCME (Self-Contrastive Model\nExtraction), which considers both the inter- and intra-class diversity in\nsynthesizing fake data. In addition, SCME introduces the Mixup operation to\naugment the fake data, which can explore the target model's decision boundary\neffectively and improve the simulating capacity. Extensive experiments show\nthat the proposed method can yield diversified fake data. Moreover, our method\nhas shown superiority in many different attack settings under the query-limited\nscenario, especially for untargeted attacks, the SCME outperforms SOTA methods\nby 11.43\\% on average for five baseline datasets."},{"date":"2023-10","title":"Notes on Applicability of Explainable AI Methods to Machine Learning Models Using Features Extracted by Persistent Homology","author":"Naofumi Hama","link":"http://arxiv.org/abs/2310.09780v1","abstract":"Data analysis that uses the output of topological data analysis as input for\nmachine learning algorithms has been the subject of extensive research. This\napproach offers a means of capturing the global structure of data. Persistent\nhomology (PH), a common methodology within the field of TDA, has found\nwide-ranging applications in machine learning. One of the key reasons for the\nsuccess of the PH-ML pipeline lies in the deterministic nature of feature\nextraction conducted through PH. The ability to achieve satisfactory levels of\naccuracy with relatively simple downstream machine learning models, when\nprocessing these extracted features, underlines the pipeline's superior\ninterpretability. However, it must be noted that this interpretation has\nencountered issues. 
Specifically, it fails to accurately reflect the feasible\nparameter region in the data generation process, and the physical or chemical\nconstraints that restrict this process. Against this backdrop, we explore the\npotential application of explainable AI methodologies to this PH-ML pipeline.\nWe apply this approach to the specific problem of predicting gas adsorption in\nmetal-organic frameworks and demonstrate that it can yield suggestive results.\nThe codes to reproduce our results are available at\nhttps://github.com/naofumihama/xai_ph_ml"},{"date":"2023-10","title":"Polynomial Time Cryptanalytic Extraction of Neural Network Models","author":"Adi Shamir, Isaac Canales-Martinez, Anna Hambitzer, Jorge Chavez-Saab, Francisco Rodrigez-Henriquez, and Nitin Satpute","link":"http://arxiv.org/abs/2310.08708v1","abstract":"Billions of dollars and countless GPU hours are currently spent on training\nDeep Neural Networks (DNNs) for a variety of tasks. Thus, it is essential to\ndetermine the difficulty of extracting all the parameters of such neural\nnetworks when given access to their black-box implementations. Many versions of\nthis problem have been studied over the last 30 years, and the best current\nattack on ReLU-based deep neural networks was presented at Crypto 2020 by\nCarlini, Jagielski, and Mironov. It resembles a differential chosen plaintext\nattack on a cryptosystem, which has a secret key embedded in its black-box\nimplementation and requires a polynomial number of queries but an exponential\namount of time (as a function of the number of neurons). In this paper, we\nimprove this attack by developing several new techniques that enable us to\nextract with arbitrarily high precision all the real-valued parameters of a\nReLU-based DNN using a polynomial number of queries and a polynomial amount of\ntime. We demonstrate its practical efficiency by applying it to a full-sized\nneural network for classifying the CIFAR10 dataset, which has 3072 inputs, 8\nhidden layers with 256 neurons each, and over million neuronal parameters. An\nattack following the approach by Carlini et al. requires an exhaustive search\nover 2 to the power 256 possibilities. Our attack replaces this with our new\ntechniques, which require only 30 minutes on a 256-core computer."},{"date":"2023-10","title":"I2SRM: Intra- and Inter-Sample Relationship Modeling for Multimodal Information Extraction","author":"Yusheng Huang, and Zhouhan Lin","link":"http://arxiv.org/abs/2310.06326v1","abstract":"Multimodal information extraction is attracting research attention nowadays,\nwhich requires aggregating representations from different modalities. In this\npaper, we present the Intra- and Inter-Sample Relationship Modeling (I2SRM)\nmethod for this task, which contains two modules. Firstly, the intra-sample\nrelationship modeling module operates on a single sample and aims to learn\neffective representations. Embeddings from textual and visual modalities are\nshifted to bridge the modality gap caused by distinct pre-trained language and\nimage models. Secondly, the inter-sample relationship modeling module considers\nrelationships among multiple samples and focuses on capturing the interactions.\nAn AttnMixup strategy is proposed, which not only enables collaboration among\nsamples but also augments data to improve generalization. We conduct extensive\nexperiments on the multimodal named entity recognition datasets Twitter-2015\nand Twitter-2017, and the multimodal relation extraction dataset MNRE. 
Our\nproposed method I2SRM achieves competitive results, 77.12% F1-score on\nTwitter-2015, 88.40% F1-score on Twitter-2017, and 84.12% F1-score on MNRE."},{"date":"2023-10","title":"Model Tuning or Prompt Tuning? A Study of Large Language Models for Clinical Concept and Relation Extraction","author":"Cheng Peng, Xi Yang, Kaleb E Smith, Zehao Yu, Aokun Chen, Jiang Bian, and Yonghui Wu","link":"http://arxiv.org/abs/2310.06239v1","abstract":"Objective To develop soft prompt-based learning algorithms for large language\nmodels (LLMs), examine the shape of prompts, prompt-tuning using\nfrozen/unfrozen LLMs, transfer learning, and few-shot learning abilities.\nMethods We developed a soft prompt-based LLM model and compared 4 training\nstrategies including (1) fine-tuning without prompts; (2) hard-prompt with\nunfrozen LLMs; (3) soft-prompt with unfrozen LLMs; and (4) soft-prompt with\nfrozen LLMs. We evaluated 7 pretrained LLMs using the 4 training strategies for\nclinical concept and relation extraction on two benchmark datasets. We\nevaluated the transfer learning ability of the prompt-based learning algorithms\nin a cross-institution setting. We also assessed the few-shot learning ability.\nResults and Conclusion When LLMs are unfrozen, GatorTron-3.9B with soft\nprompting achieves the best strict F1-scores of 0.9118 and 0.8604 for concept\nextraction, outperforming the traditional fine-tuning and hard prompt-based\nmodels by 0.6~3.1% and 1.2~2.9%, respectively; GatorTron-345M with soft\nprompting achieves the best F1-scores of 0.8332 and 0.7488 for end-to-end\nrelation extraction, outperforming the other two models by 0.2~2% and\n0.6~11.7%, respectively. When LLMs are frozen, small (i.e., 345 million\nparameters) LLMs have a big gap to be competitive with unfrozen models; scaling\nLLMs up to billions of parameters makes frozen LLMs competitive with unfrozen\nLLMs. For cross-institute evaluation, soft prompting with a frozen\nGatorTron-8.9B model achieved the best performance. This study demonstrates\nthat (1) machines can learn soft prompts better than humans, (2) frozen LLMs\nhave better few-shot learning ability and transfer learning ability to\nfacilitate muti-institution applications, and (3) frozen LLMs require large\nmodels."},{"date":"2023-10","title":"GeoLLM: Extracting Geospatial Knowledge from Large Language Models","author":"Rohin Manvi, Samar Khanna, Gengchen Mai, Marshall Burke, David Lobell, and Stefano Ermon","link":"http://arxiv.org/abs/2310.06213v2","abstract":"The application of machine learning (ML) in a range of geospatial tasks is\nincreasingly common but often relies on globally available covariates such as\nsatellite imagery that can either be expensive or lack predictive power. Here\nwe explore the question of whether the vast amounts of knowledge found in\nInternet language corpora, now compressed within large language models (LLMs),\ncan be leveraged for geospatial prediction tasks. We first demonstrate that\nLLMs embed remarkable spatial information about locations, but naively querying\nLLMs using geographic coordinates alone is ineffective in predicting key\nindicators like population density. We then present GeoLLM, a novel method that\ncan effectively extract geospatial knowledge from LLMs with auxiliary map data\nfrom OpenStreetMap. We demonstrate the utility of our approach across multiple\ntasks of central interest to the international community, including the\nmeasurement of population density and economic livelihoods. 
Across these tasks,\nour method demonstrates a 70% improvement in performance (measured using\nPearson's $r^2$) relative to baselines that use nearest neighbors or use\ninformation directly from the prompt, and performance equal to or exceeding\nsatellite-based benchmarks in the literature. With GeoLLM, we observe that\nGPT-3.5 outperforms Llama 2 and RoBERTa by 19% and 51% respectively, suggesting\nthat the performance of our method scales well with the size of the model and\nits pretraining dataset. Our experiments reveal that LLMs are remarkably\nsample-efficient, rich in geospatial information, and robust across the globe.\nCrucially, GeoLLM shows promise in mitigating the limitations of existing\ngeospatial covariates and complementing them well. Code is available on the\nproject website: https://rohinmanvi.github.io/GeoLLM"},{"date":"2023-10","title":"Conditional Diffusion Model for Target Speaker Extraction","author":"Theodor Nguyen, Guangzhi Sun, Xianrui Zheng, Chao Zhang, and Philip C Woodland","link":"http://arxiv.org/abs/2310.04791v1","abstract":"We propose DiffSpEx, a generative target speaker extraction method based on\nscore-based generative modelling through stochastic differential equations.\nDiffSpEx deploys a continuous-time stochastic diffusion process in the complex\nshort-time Fourier transform domain, starting from the target speaker source\nand converging to a Gaussian distribution centred on the mixture of sources.\nFor the reverse-time process, a parametrised score function is conditioned on a\ntarget speaker embedding to extract the target speaker from the mixture of\nsources. We utilise ECAPA-TDNN target speaker embeddings and condition the\nscore function alternately on the SDE time embedding and the target speaker\nembedding. The potential of DiffSpEx is demonstrated with the WSJ0-2mix\ndataset, achieving an SI-SDR of 12.9 dB and a NISQA score of 3.56. Moreover, we\nshow that fine-tuning a pre-trained DiffSpEx model to a specific speaker\nfurther improves performance, enabling personalisation in target speaker\nextraction."},{"date":"2023-10","title":"Do self-supervised speech and language models extract similar representations as human brain?","author":"Peili Chen, Linyang He, Li Fu, Lu Fan, Edward F. Chang, and Yuanning Li","link":"http://arxiv.org/abs/2310.04645v2","abstract":"Speech and language models trained through self-supervised learning (SSL)\ndemonstrate strong alignment with brain activity during speech and language\nperception. However, given their distinct training modalities, it remains\nunclear whether they correlate with the same neural aspects. We directly\naddress this question by evaluating the brain prediction performance of two\nrepresentative SSL models, Wav2Vec2.0 and GPT-2, designed for speech and\nlanguage tasks. Our findings reveal that both models accurately predict speech\nresponses in the auditory cortex, with a significant correlation between their\nbrain predictions. Notably, shared speech contextual information between\nWav2Vec2.0 and GPT-2 accounts for the majority of explained variance in brain\nactivity, surpassing static semantic and lower-level acoustic-phonetic\ninformation. 
These results underscore the convergence of speech contextual\nrepresentations in SSL models and their alignment with the neural network\nunderlying speech perception, offering valuable insights into both SSL models\nand the neural basis of speech and language processing."},{"date":"2023-10","title":"Extraction of Medication and Temporal Relation from Clinical Text using Neural Language Models","author":"Hangyu Tu, Lifeng Han, and Goran Nenadic","link":"http://arxiv.org/abs/2310.02229v2","abstract":"Clinical texts, represented in electronic medical records (EMRs), contain\nrich medical information and are essential for disease prediction, personalised\ninformation recommendation, clinical decision support, and medication pattern\nmining and measurement. Relation extractions between medication mentions and\ntemporal information can further help clinicians better understand the\npatients' treatment history. To evaluate the performances of deep learning (DL)\nand large language models (LLMs) in medication extraction and temporal\nrelations classification, we carry out an empirical investigation of\n\\textbf{MedTem} project using several advanced learning structures including\nBiLSTM-CRF and CNN-BiLSTM for a clinical domain named entity recognition (NER),\nand BERT-CNN for temporal relation extraction (RE), in addition to the\nexploration of different word embedding techniques. Furthermore, we also\ndesigned a set of post-processing roles to generate structured output on\nmedications and the temporal relation. Our experiments show that CNN-BiLSTM\nslightly wins the BiLSTM-CRF model on the i2b2-2009 clinical NER task yielding\n75.67, 77.83, and 78.17 for precision, recall, and F1 scores using Macro\nAverage. BERT-CNN model also produced reasonable evaluation scores 64.48,\n67.17, and 65.03 for P/R/F1 using Macro Avg on the temporal relation extraction\ntest set from i2b2-2012 challenges. Code and Tools from MedTem will be hosted\nat \\url{https://github.com/HECTA-UoM/MedTem}"},{"date":"2023-10","title":"An evaluation of pre-trained models for feature extraction in image classification","author":"Erick da Silva Puls, Matheus V. Todescato, and Joel L. Carbonera","link":"http://arxiv.org/abs/2310.02037v1","abstract":"In recent years, we have witnessed a considerable increase in performance in\nimage classification tasks. This performance improvement is mainly due to the\nadoption of deep learning techniques. Generally, deep learning techniques\ndemand a large set of annotated data, making it a challenge when applying it to\nsmall datasets. In this scenario, transfer learning strategies have become a\npromising alternative to overcome these issues. This work aims to compare the\nperformance of different pre-trained neural networks for feature extraction in\nimage classification tasks. We evaluated 16 different pre-trained models in\nfour image datasets. Our results demonstrate that the best general performance\nalong the datasets was achieved by CLIP-ViT-B and ViT-H-14, where the\nCLIP-ResNet50 model had similar performance but with less variability.\nTherefore, our study provides evidence supporting the choice of models for\nfeature extraction in image classification tasks."},{"date":"2023-10","title":"Beyond Labeling Oracles: What does it mean to steal ML models?","author":"Avital Shafran, Ilia Shumailov, Murat A. 
Erdogdu, and Nicolas Papernot","link":"http://arxiv.org/abs/2310.01959v3","abstract":"Model extraction attacks are designed to steal trained models with only query\naccess, as is often provided through APIs that ML-as-a-Service providers offer.\nMachine Learning (ML) models are expensive to train, in part because data is\nhard to obtain, and a primary incentive for model extraction is to acquire a\nmodel while incurring less cost than training from scratch. Literature on model\nextraction commonly claims or presumes that the attacker is able to save on\nboth data acquisition and labeling costs. We thoroughly evaluate this\nassumption and find that the attacker often does not. This is because current\nattacks implicitly rely on the adversary being able to sample from the victim\nmodel's data distribution. We thoroughly research factors influencing the\nsuccess of model extraction. We discover that prior knowledge of the attacker,\ni.e., access to in-distribution data, dominates other factors like the attack\npolicy the adversary follows to choose which queries to make to the victim\nmodel API. Our findings urge the community to redefine the adversarial goals of\nME attacks as current evaluation methods misinterpret the ME performance."},{"date":"2023-10","title":"Unsupervised Roofline Extraction from True Orthophotos for LoD2 Building Model Reconstruction","author":"Weixiao Gao, Ravi Peters, and Jantien Stoter","link":"http://arxiv.org/abs/2310.01067v1","abstract":"This paper discusses the reconstruction of LoD2 building models from 2D and\n3D data for large-scale urban environments. Traditional methods involve the use\nof LiDAR point clouds, but due to high costs and long intervals associated with\nacquiring such data for rapidly developing areas, researchers have started\nexploring the use of point clouds generated from (oblique) aerial images.\nHowever, using such point clouds for traditional plane detection-based methods\ncan result in significant errors and introduce noise into the reconstructed\nbuilding models. To address this, this paper presents a method for extracting\nrooflines from true orthophotos using line detection for the reconstruction of\nbuilding models at the LoD2 level. The approach is able to extract relatively\ncomplete rooflines without the need for pre-labeled training data or\npre-trained models. These lines can directly be used in the LoD2 building model\nreconstruction process. The method is superior to existing plane\ndetection-based methods and state-of-the-art deep learning methods in terms of\nthe accuracy and completeness of the reconstructed building. Our source code is\navailable at https://github.com/tudelft3d/Roofline-extraction-from-orthophotos."}] \ No newline at end of file diff --git a/assets/jupyter/blog.ipynb.html b/assets/jupyter/blog.ipynb.html index 53ea0034..e4bcc420 100644 --- a/assets/jupyter/blog.ipynb.html +++ b/assets/jupyter/blog.ipynb.html @@ -1,4 +1,4 @@ - jekyll-jupyter-notebook20241113-1796-gjjdq4