LLM · NLP
Text2All · All2All
Multi-modal · Multi-task
- MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
- MoVA: Adapting Mixture of Vision Experts to Multimodal Context
- Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
- Chat Vector: A Simple Approach to Equip LLMs with Instruction Following and Model Alignment in New Languages
- From r to Q*: Your Language Model is Secretly a Q-Function
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
- DoRA: Weight-Decomposed Low-Rank Adaptation
- Many-Shot In-Context Learning
Human intelligence thrives on the concept of cognitive synergy, where collaboration and information integration among different cognitive processes yield superior outcomes compared to individual cognitive processes in isolation. Although Large Language Models (LLMs) have demonstrated promising performance as general task-solving agents, they still struggle with tasks that require intensive domain knowledge and complex reasoning. In this work, we propose Solo Performance Prompting (SPP), which transforms a single LLM into a cognitive synergist by engaging in multi-turn self-collaboration with multiple personas. A cognitive synergist refers to an intelligent agent that collaborates with multiple minds, combining their individual strengths and knowledge, to enhance problem-solving and overall performance in complex tasks. By dynamically identifying and simulating different personas based on task inputs, SPP unleashes the potential of cognitive synergy in LLMs. We have discovered that assigning multiple, fine-grained personas in LLMs elicits better problem-solving abilities compared to using a single or fixed number of personas. We evaluate SPP on three challenging tasks: Trivia Creative Writing, Codenames Collaborative, and Logic Grid Puzzle, encompassing both knowledge-intensive and reasoning-intensive types. Unlike previous works, such as Chain-of-Thought, that solely enhance the reasoning abilities in LLMs, SPP effectively elicits internal knowledge acquisition abilities, reduces hallucination, and maintains strong reasoning capabilities. Code, data, and prompts can be found at: this https URL.
We present LLM-Blender, an ensembling framework designed to attain consistently superior performance by leveraging the diverse strengths of multiple open-source large language models (LLMs). Our framework consists of two modules: PairRanker and GenFuser, addressing the observation that optimal LLMs for different examples can significantly vary. PairRanker employs a specialized pairwise comparison method to distinguish subtle differences between candidate outputs. It jointly encodes the input text and a pair of candidates, using cross-attention encoders to determine the superior one. Our results demonstrate that PairRanker exhibits the highest correlation with ChatGPT-based ranking. Then, GenFuser aims to merge the top-ranked candidates, generating an improved output by capitalizing on their strengths and mitigating their weaknesses. To facilitate large-scale evaluation, we introduce a benchmark dataset, MixInstruct, which is a mixture of multiple instruction datasets featuring oracle pairwise comparisons. Our LLM-Blender significantly outperforms individual LLMs and baseline methods across various metrics, establishing a substantial performance gap.
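The rank-then-fuse idea can be sketched roughly as follows. This is only an illustration, not the LLM-Blender code: the comparator `toy_prefer_first` is a hypothetical stand-in for the learned PairRanker, and aggregating by win count is an assumption, since the abstract does not say how pairwise results are combined.

```python
from itertools import combinations
from typing import Callable, List


def rank_by_pairwise_wins(
    prompt: str,
    candidates: List[str],
    prefer_first: Callable[[str, str, str], bool],
) -> List[str]:
    """Order candidates by how many pairwise comparisons each one wins."""
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        # prefer_first returns True when candidate i beats candidate j.
        if prefer_first(prompt, candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    order = sorted(range(len(candidates)), key=lambda k: wins[k], reverse=True)
    return [candidates[k] for k in order]


def toy_prefer_first(prompt: str, a: str, b: str) -> bool:
    # Hypothetical comparator for the demo only: prefers the longer answer.
    return len(a) >= len(b)


ranked = rank_by_pairwise_wins("q", ["a", "abc", "ab"], toy_prefer_first)
top_k = ranked[:2]  # the candidates a fusion model would then merge
```

In a real system the top-ranked candidates in `top_k` would be passed to a generative fusion model; that step is omitted here.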
Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean. However, existing methods are difficult to reproduce or build on, due to private code, data, and large compute requirements. This has created substantial barriers to research on machine learning methods for theorem proving. This paper removes these barriers by introducing LeanDojo: an open-source Lean playground consisting of toolkits, data, models, and benchmarks. LeanDojo extracts data from Lean and enables interaction with the proof environment programmatically. It contains fine-grained annotations of premises in proofs, providing valuable data for premise selection, a key bottleneck in theorem proving. Using this data, we develop ReProver (Retrieval-Augmented Prover): the first LLM-based prover that is augmented with retrieval for selecting premises from a vast math library. It is inexpensive and needs only one GPU week of training. Our retriever leverages LeanDojo's program analysis capability to identify accessible premises and hard negative examples, which makes retrieval much more effective. Furthermore, we construct a new benchmark consisting of 96,962 theorems and proofs extracted from Lean's math library. It features a challenging data split requiring the prover to generalize to theorems relying on novel premises that are never used in training. We use this benchmark for training and evaluation, and experimental results demonstrate the effectiveness of ReProver over non-retrieval baselines and GPT-4. We thus provide the first set of open-source LLM-based theorem provers without any proprietary datasets and release it under a permissive MIT license to facilitate further research.
Answering visual queries is a complex task that requires both visual processing and reasoning. End-to-end models, the dominant approach for this task, do not explicitly differentiate between the two, limiting interpretability and generalization. Learning modular programs presents a promising alternative, but has proven challenging due to the difficulty of learning both the programs and modules simultaneously. We introduce ViperGPT, a framework that leverages code-generation models to compose vision-and-language models into subroutines to produce a result for any query. ViperGPT utilizes a provided API to access the available modules, and composes them by generating Python code that is later executed. This simple approach requires no further training, and achieves state-of-the-art results across various complex visual tasks.
Scaling sequence length has become a critical demand in the era of large language models. However, existing methods struggle with either computational complexity or model expressivity, rendering the maximum sequence length restricted. In this work, we introduce LongNet, a Transformer variant that can scale sequence length to more than 1 billion tokens, without sacrificing the performance on shorter sequences. Specifically, we propose dilated attention, which expands the attentive field exponentially as the distance grows. LongNet has significant advantages: 1) it has a linear computation complexity and a logarithmic dependency between tokens; 2) it can serve as a distributed trainer for extremely long sequences; 3) its dilated attention is a drop-in replacement for standard attention, which can be seamlessly integrated with the existing Transformer-based optimization. Experimental results demonstrate that LongNet yields strong performance on both long-sequence modeling and general language tasks. Our work opens up new possibilities for modeling very long sequences, e.g., treating a whole corpus or even the entire Internet as a sequence.
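One way to picture dilated attention is as a sparsity pattern over token positions. The sketch below assumes the sequence is cut into fixed-length segments and that every `dilation`-th position is kept inside each segment; the abstract does not spell out this scheme, and the function name and parameters are illustrative, not taken from the LongNet code.

```python
from typing import List


def dilated_indices(seq_len: int, segment_len: int, dilation: int) -> List[List[int]]:
    """For each segment, list the token positions kept after dilation."""
    if segment_len <= 0 or dilation <= 0:
        raise ValueError("segment_len and dilation must be positive")
    segments = []
    for start in range(0, seq_len, segment_len):
        end = min(start + segment_len, seq_len)
        # Keep every `dilation`-th position inside this segment.
        segments.append(list(range(start, end, dilation)))
    return segments


# Short segments with small dilation cover local context densely.
local_pattern = dilated_indices(seq_len=16, segment_len=8, dilation=2)
# Longer segments with larger dilation cover distant context sparsely.
wide_pattern = dilated_indices(seq_len=16, segment_len=16, dilation=4)
```

Under this assumed scheme, growing the segment length and the dilation together keeps the number of attended positions per segment fixed (four in both patterns above) while the span covered widens.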
A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves state-of-the-art transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We introduce Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked "language" modeling on images (Imglish), texts (English), and image-text pairs ("parallel sentences") in a unified manner. Experimental results show that BEiT-3 obtains state-of-the-art performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).
Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis. However, their potential to effectively use tools via API calls remains unfulfilled. This is a challenging task even for today's state-of-the-art LLMs such as GPT-4, largely due to their inability to generate accurate input arguments and their tendency to hallucinate the wrong usage of an API call. We release Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls. When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible user updates or version changes. It also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly. To evaluate the model's ability, we introduce APIBench, a comprehensive dataset consisting of HuggingFace, TorchHub, and TensorHub APIs. The successful integration of the retrieval system with Gorilla demonstrates the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs. The model and code of Gorilla are available at https://github.com/ShishirPatil/gorilla.
Large language models (LLMs) have achieved remarkable progress in various natural language processing tasks with emergent abilities. However, they face inherent limitations, such as an inability to access up-to-date information, utilize external tools, or perform precise mathematical reasoning. In this paper, we introduce Chameleon, a plug-and-play compositional reasoning framework that augments LLMs to help address these challenges. Chameleon synthesizes programs to compose various tools, including LLM models, off-the-shelf vision models, web search engines, Python functions, and rule-based modules tailored to user interests. Built on top of an LLM as a natural language planner, Chameleon infers the appropriate sequence of tools to compose and execute in order to generate a final response. We showcase the adaptability and effectiveness of Chameleon on two tasks: ScienceQA and TabMWP. Notably, Chameleon with GPT-4 achieves an 86.54% accuracy on ScienceQA, significantly improving upon the best published few-shot model by 11.37%; using GPT-4 as the underlying LLM, Chameleon achieves a 17.8% increase over the state-of-the-art model, leading to a 98.78% overall accuracy on TabMWP. Further studies suggest that using GPT-4 as a planner exhibits more consistent and rational tool selection and is able to infer potential constraints given the instructions, compared to other LLMs like ChatGPT.
How to efficiently transform large language models (LLMs) into instruction followers is recently a popular research direction, while training LLM for multi-modal reasoning remains less explored. Although the recent LLaMA-Adapter demonstrates the potential to handle visual inputs with LLMs, it still cannot generalize well to open-ended visual instructions and lags behind GPT-4. In this paper, we present LLaMA-Adapter V2, a parameter-efficient visual instruction model. Specifically, we first augment LLaMA-Adapter by unlocking more learnable parameters (e.g., norm, bias and scale), which distribute the instruction-following ability across the entire LLaMA model besides adapters. Secondly, we propose an early fusion strategy to feed visual tokens only into the early LLM layers, contributing to better visual knowledge incorporation. Thirdly, a joint training paradigm of image-text pairs and instruction-following data is introduced by optimizing disjoint groups of learnable parameters. This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset. During inference, we incorporate additional expert models (e.g. captioning/OCR systems) into LLaMA-Adapter to further enhance its image understanding capability without incurring training costs. Compared to the original LLaMA-Adapter, our LLaMA-Adapter V2 can perform open-ended multi-modal instructions by merely introducing 14M parameters over LLaMA. The newly designed framework also exhibits stronger language-only instruction-following capabilities and even excels in chat interactions. Our code and models are available at this https URL.
Believable proxies of human behavior can empower interactive applications ranging from immersive environments to rehearsal spaces for interpersonal communication to prototyping tools. In this paper, we introduce generative agents--computational software agents that simulate believable human behavior. Generative agents wake up, cook breakfast, and head to work; artists paint, while authors write; they form opinions, notice each other, and initiate conversations; they remember and reflect on days past as they plan the next day. To enable generative agents, we describe an architecture that extends a large language model to store a complete record of the agent's experiences using natural language, synthesize those memories over time into higher-level reflections, and retrieve them dynamically to plan behavior. We instantiate generative agents to populate an interactive sandbox environment inspired by The Sims, where end users can interact with a small town of twenty five agents using natural language. In an evaluation, these generative agents produce believable individual and emergent social behaviors: for example, starting with only a single user-specified notion that one agent wants to throw a Valentine's Day party, the agents autonomously spread invitations to the party over the next two days, make new acquaintances, ask each other out on dates to the party, and coordinate to show up for the party together at the right time. We demonstrate through ablation that the components of our agent architecture--observation, planning, and reflection--each contribute critically to the believability of agent behavior. By fusing large language models with computational, interactive agents, this work introduces architectural and interaction patterns for enabling believable simulations of human behavior.
Recent advancements in decision-making large language model (LLM) agents have demonstrated impressive performance across various benchmarks. However, these state-of-the-art approaches typically necessitate internal model fine-tuning, external model fine-tuning, or policy optimization over a defined state space. Implementing these methods can prove challenging due to the scarcity of high-quality training data or the lack of well-defined state space. Moreover, these agents do not possess certain qualities inherent to human decision-making processes, specifically the ability to learn from mistakes. Self-reflection allows humans to efficiently solve novel problems through a process of trial and error. Building on recent research, we propose Reflexion, an approach that endows an agent with dynamic memory and self-reflection capabilities to enhance its existing reasoning trace and task-specific action choice abilities. To achieve full automation, we introduce a straightforward yet effective heuristic that enables the agent to pinpoint hallucination instances, avoid repetition in action sequences, and, in some environments, construct an internal memory map of the given environment. To assess our approach, we evaluate the agent's ability to complete decision-making tasks in AlfWorld environments and knowledge-intensive, search-based question-and-answer tasks in HotPotQA environments. We observe success rates of 97% and 51%, respectively, and provide a discussion on the emergent property of self-reflection.
Like people, LLMs do not always generate the best text for a given generation problem on their first try (e.g., summaries, answers, explanations). Just as people then refine their text, we introduce SELF-REFINE, a framework for similarly improving initial outputs from LLMs through iterative feedback and refinement. The main idea is to generate an output using an LLM, then allow the same model to provide multi-aspect feedback for its own output; finally, the same model refines its previously generated output given its own feedback. Unlike earlier work, our iterative refinement framework does not require supervised training data or reinforcement learning, and works with a single LLM. We experiment with 7 diverse tasks, ranging from review rewriting to math reasoning, demonstrating that our approach outperforms direct generation. In all tasks, outputs generated with SELF-REFINE are preferred by humans and by automated metrics over those generated directly with GPT-3.5 and GPT-4, improving on average by absolute 20% across tasks.
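The generate, feedback, refine loop described above might be sketched as below. The callables stand in for calls to the same LLM with different prompts; the stopping test `is_good_enough` and the `max_iters` cap are assumptions for illustration, and the toy functions are hypothetical stubs rather than real model calls.

```python
from typing import Callable


def self_refine(
    task: str,
    generate: Callable[[str], str],
    feedback: Callable[[str, str], str],
    refine: Callable[[str, str, str], str],
    is_good_enough: Callable[[str], bool],
    max_iters: int = 3,
) -> str:
    """Draft an output, then alternate self-feedback and refinement."""
    output = generate(task)
    for _ in range(max_iters):
        critique = feedback(task, output)
        if is_good_enough(critique):
            break  # the model's own feedback asks for no further changes
        output = refine(task, output, critique)
    return output


# Hypothetical stubs so the loop can run without a model.
def toy_generate(task: str) -> str:
    return "draft"


def toy_feedback(task: str, output: str) -> str:
    return "ok" if "detail" in output else "add detail"


def toy_refine(task: str, output: str, critique: str) -> str:
    return output + " with detail"


result = self_refine(
    "summarize", toy_generate, toy_feedback, toy_refine, lambda c: c == "ok"
)
```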
Solving complicated AI tasks with different domains and modalities is a key step toward advanced artificial intelligence. While there are abundant AI models available for different domains and modalities, they cannot handle complicated AI tasks. Considering large language models (LLMs) have exhibited exceptional ability in language understanding, generation, interaction, and reasoning, we advocate that LLMs could act as a controller to manage existing AI models to solve complicated AI tasks and language could be a generic interface to empower this. Based on this philosophy, we present HuggingGPT, a framework that leverages LLMs (e.g., ChatGPT) to connect various AI models in machine learning communities (e.g., Hugging Face) to solve AI tasks. Specifically, we use ChatGPT to conduct task planning when receiving a user request, select models according to their function descriptions available in Hugging Face, execute each subtask with the selected AI model, and summarize the response according to the execution results. By leveraging the strong language capability of ChatGPT and abundant AI models in Hugging Face, HuggingGPT is able to cover numerous sophisticated AI tasks in different modalities and domains and achieve impressive results in language, vision, speech, and other challenging tasks, which paves a new way towards advanced artificial intelligence.
Auto-GPT is an experimental open-source application showcasing the capabilities of the GPT-4 language model. This program, driven by GPT-4, chains together LLM "thoughts", to autonomously achieve whatever goal you set. As one of the first examples of GPT-4 running fully autonomously, Auto-GPT pushes the boundaries of what is possible with AI.
There is a rapidly growing number of large language models (LLMs) that users can query for a fee. We review the cost associated with querying popular LLM APIs, e.g. GPT-4, ChatGPT, J1-Jumbo, and find that these models have heterogeneous pricing structures, with fees that can differ by two orders of magnitude. In particular, using LLMs on large collections of queries and text can be expensive. Motivated by this, we outline and discuss three types of strategies that users can exploit to reduce the inference cost associated with using LLMs: 1) prompt adaptation, 2) LLM approximation, and 3) LLM cascade. As an example, we propose FrugalGPT, a simple yet flexible instantiation of LLM cascade which learns which combinations of LLMs to use for different queries in order to reduce cost and improve accuracy. Our experiments show that FrugalGPT can match the performance of the best individual LLM (e.g. GPT-4) with up to 98% cost reduction or improve the accuracy over GPT-4 by 4% with the same cost. The ideas and findings presented here lay a foundation for using LLMs sustainably and efficiently.
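An LLM cascade can be sketched as trying models from cheapest to most expensive and stopping at the first answer that scores well enough. In this minimal sketch the model order, the `score` function, and the threshold are fixed by hand and all names are hypothetical; FrugalGPT, per the abstract, learns which combinations of LLMs to use.

```python
from typing import Callable, List, Tuple

# (name, cost per query, callable that answers a query)
Model = Tuple[str, float, Callable[[str], str]]


def cascade(
    query: str,
    models: List[Model],
    score: Callable[[str, str], float],
    threshold: float,
) -> Tuple[str, str, float]:
    """Query models in order; stop once an answer's score meets the threshold."""
    if not models:
        raise ValueError("at least one model is required")
    total_cost = 0.0
    for name, cost, call in models:
        answer = call(query)
        total_cost += cost
        if score(query, answer) >= threshold:
            break
    # If no answer passes, the last (most expensive) model's answer is returned.
    return name, answer, total_cost


# Hypothetical stand-ins for a cheap and an expensive API.
def cheap(query: str) -> str:
    return "4" if query == "2+2" else "?"


def strong(query: str) -> str:
    return "answer to " + query


def toy_score(query: str, answer: str) -> float:
    return 0.0 if answer == "?" else 1.0


models = [("cheap", 1.0, cheap), ("strong", 50.0, strong)]
easy = cascade("2+2", models, toy_score, threshold=0.5)
hard = cascade("hard question", models, toy_score, threshold=0.5)
```

The easy query is answered by the cheap model alone, while the hard one pays for both calls, which is where the cost savings on mixed workloads would come from.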
Recent work has shown that prompting language models with code-like representations of natural language leads to performance improvements on structured reasoning tasks. However, such tasks comprise only a small subset of all natural language tasks. In our work, we seek to answer whether or not code-prompting is the preferred way of interacting with language models in general. We compare code and text prompts across three popular GPT models (davinci, code-davinci-002, and text-davinci-002) on a broader selection of tasks (e.g., QA, sentiment, summarization) and find that with few exceptions, code prompts do not consistently outperform text prompts. Furthermore, we show that the style of code prompt has a large effect on performance for some but not all tasks and that fine-tuning on text instructions leads to better relative performance of code prompts.
Large Language Models (LLMs) perform complex reasoning by generating explanations for their predictions. However, a complementary goal of explanations is to also communicate useful knowledge that improves weaker agents. Hence, we investigate whether LLMs also make good teachers for weaker agents. In particular, we consider a student-teacher framework between two LLM agents and study if, when, and how the teacher should intervene with natural language explanations to improve the student's performance. Since communication is expensive, we define a budget such that the teacher only communicates explanations for a fraction of the data, after which the student should perform well on its own. We decompose the teaching problem along four axes: (1) whether the teacher's test-time intervention improves student predictions, (2) when it is worth explaining a data point, (3) how the teacher should personalize explanations to better teach the student, and (4) whether teacher explanations also improve student performance on future unexplained data. We first show that teacher LLMs can indeed intervene on student reasoning to improve their performance. Next, we propose a Theory of Mind approach, in which the teacher builds two few-shot mental models of the student. The first model defines an Intervention Function that simulates the utility of an intervention, allowing the teacher to intervene when this utility is the highest and improving student performance at lower budgets. The second model enables the teacher to personalize explanations for a particular student and outperform unpersonalized teachers. We also demonstrate that in multi-turn interactions, teacher explanations generalize and learning from explained data improves student performance on future unexplained data. Finally, we also verify that misaligned teachers can lower student performance to random chance by intentionally misleading them.
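The budgeted intervention rule can be illustrated with a small sketch: given a simulated utility for each data point, the teacher explains only the highest-utility points that fit in the budget. The utilities below are made-up numbers, and the function name is hypothetical; how the utilities are actually simulated is not shown here.

```python
from typing import List


def choose_interventions(utilities: List[float], budget_fraction: float) -> List[int]:
    """Return indices of the data points the teacher should explain."""
    if not 0.0 <= budget_fraction <= 1.0:
        raise ValueError("budget_fraction must be between 0 and 1")
    # Number of explanations the budget allows (rounded down).
    k = int(len(utilities) * budget_fraction)
    ranked = sorted(range(len(utilities)), key=lambda i: utilities[i], reverse=True)
    return sorted(ranked[:k])


# Made-up utilities for four data points; the budget covers half of them.
selected = choose_interventions([0.1, 0.9, 0.4, 0.7], budget_fraction=0.5)
```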
- Kosmos-2: Grounding Multimodal Large Language Models to the World
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world. Specifically, we represent referring expressions as links in Markdown, i.e., "[text span](bounding boxes)", where object descriptions are sequences of location tokens. Together with multimodal corpora, we construct large-scale data of grounded image-text pairs (called GrIT) to train the model. In addition to the existing capabilities of MLLMs (e.g., perceiving general modalities, following instructions, and performing in-context learning), Kosmos-2 integrates the grounding capability into downstream applications. We evaluate Kosmos-2 on a wide range of tasks, including (i) multimodal grounding, such as referring expression comprehension and phrase grounding, (ii) multimodal referring, such as referring expression generation, (iii) perception-language tasks, and (iv) language understanding and generation. This work lays out the foundation for the development of Embodiment AI and sheds light on the big convergence of language, multimodal perception, action, and world modeling, which is a key step toward artificial general intelligence. Code and pretrained models are available at this https URL.
We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities.
Generating realistic human motion from given action descriptions has experienced significant advancements because of the emerging requirement of digital humans. While recent works have achieved impressive results in generating motion directly from textual action descriptions, they often support only a single modality of the control signal, which limits their application in the real digital human industry. This paper presents a Motion General-Purpose generaTor (MotionGPT) that can use multimodal control signals, e.g., text and single-frame poses, for generating consecutive human motions by treating multimodal signals as special input tokens in large language models (LLMs). Specifically, we first quantize multimodal control signals into discrete codes and then formulate them in a unified prompt instruction to ask the LLMs to generate the motion answer. Our MotionGPT demonstrates a unified human motion generation model with multimodal control signals by tuning a mere 0.4% of LLM parameters. To the best of our knowledge, MotionGPT is the first method to generate human motion by multimodal control signals, which we hope can shed light on this new direction. Codes shall be released upon acceptance.
Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. With Multimodal-CoT, our model under 1 billion parameters outperforms the previous state-of-the-art LLM (GPT-3.5) by 16 percentage points (75.17%->91.68% accuracy) on the ScienceQA benchmark and even surpasses human performance. Code is publicly available at this https URL.
- unilm: Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
- "Low-Resource" Text Classification: A Parameter-Free Classification Method with Compressors
- AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model
- Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks
- Full Parameter Fine-tuning for Large Language Models with Limited Resources
- Unifying Large Language Models and Knowledge Graphs: A Roadmap
- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
- Orca: Progressive Learning from Complex Explanation Traces of GPT-4
- Dr. LLaMA: Improving Small Language Models Through Generative Data Augmentation
- Long Sequence Modeling with XGen: A 7B LLM Trained on 8K Input Sequence Length
- China's Baidu claims its Ernie Bot beats ChatGPT on key tests as A.I. race heats up
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
- Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM
- Language Is Not All You Need: Aligning Perception with Language Models
- P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
- Do Prompt-Based Models Really Understand the Meaning of their Prompts?
- Improving language models by retrieving from trillions of tokens
- Structure and Content-Guided Video Synthesis with Diffusion Models
- InstructGPT: Training language models to follow instructions with human feedback
- BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- What learning algorithm is in-context learning? Investigations with linear models
- Toolformer: Language Models Can Teach Themselves to Use Tools
- Improving alignment of dialogue agents via targeted human judgements
- RLHF: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- BaGuaLu: Targeting Brain Scale Pretrained Models with over 37 Million Cores
- Flamingo: a Visual Language Model for Few-Shot Learning, Blog
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher
- GPU and learning method required for KoChatLlaMA fine-tuning
- GPT-4 is coming next week - and it will be multimodal, says Microsoft Germany
- Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages
- Tightly-Integrated Generative Encoder-Decoder Representation
- Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
- SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks
- Low-rank Adaptation for Fast Text-to-Image Diffusion Fine-tuning
- T0: Multitask Prompted Training Enables Zero-Shot Task Generalization
- The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
- The Wisdom of Hindsight Makes Language Models Better Instruction Followers
- Exploring the Benefits of Training Expert Language Models over Instruction Tuning
- Unsupervised Imputation of Non-ignorably Missing Data Using Importance-Weighted Autoencoders
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
-
Muse: Text-To-Image Generation via Masked Generative Transformers
-
Accurate global machine learning force fields for molecules with hundreds of atoms
-
Algorithms with More Granular Differential Privacy Guarantees
-
Anomaly Clustering: Grouping Images into Coherent Clusters of Anomaly Types
-
Are we cobblers without shoes? Making Computer Science data FAIR
-
Creating, Calibrating, and Validating Large-Scale Microscopic Traffic Simulation
-
Increasing Impact of Mobile Health Programs: SAHELI for Maternal and Child Care
-
Designing Responsible AI: Adaptations of UX Practice to Meet Responsible AI Challenges
-
Developer Productivity for Humans: A Human-Centered Approach to Developer Productivity
-
Development of a Machine Learning Model for Sonographic Assessment of Gestational Age
-
Estimates of broadband upwelling irradiance from GOES-16 ABI
-
Flexible Budgets in Restless Bandits: A Primal-Dual Algorithm for Efficient Budget Allocation
-
Helpful Neighbors: Leveraging Neighbors in Geographic Feature Pronunciation
-
High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs
-
KwikBucks: Correlation Clustering with Cheap-Weak and Expensive-Strong Signals
-
Machine Learning for Healthcare: A Bibliometric Study of Contributions from Africa
-
Propeller: A Profile Guided, Relinking Optimizer for Warehouse-Scale Applications
-
DeepMind: Improving language models by retrieving from trillions of tokens
-
DeepMind: Mastering Stratego, the classic game of imperfect information
-
DeepMind: AlphaFold reveals the structure of the protein universe
-
DeepMind: Exploring the beauty of pure mathematics in novel ways
-
DeepMind: Putting the power of AlphaFold into the world's hands
-
Google Research: Deciphering clinical abbreviations with privacy protecting ML
-
Google Research: Google Research, 2022 & beyond: Language, vision and generative models
-
Google Research: Google Research, 2022 & beyond: Responsible AI
-
Google Research: Google Research, 2022 & beyond: ML & computer systems
-
Google Research: Real-time tracking of wildfire boundaries using satellite imagery
-
Google Research: DiffQG: Generating Questions on Paired Sentences
-
Google Research: Assessment of Security Defense of Native Programs Against Software Faults
-
Google Research: Adaptive mixing of auxiliary losses in supervised learning
-
[2013/01] Efficient Estimation of Word Representations in Vector Space
-
[2014/12] Dependency-Based Word Embeddings
-
[2015/07] Neural Machine Translation of Rare Words with Subword Units
-
[2014/07] GloVe: Global Vectors for Word Representation : GloVe
-
[2016/06] Siamese CBOW: Optimizing Word Embeddings for Sentence Representations : Siamese CBOW
-
[2016/07] Enriching Word Vectors with Subword Information : fastText
-
[2014/09] Sequence to Sequence Learning with Neural Networks : seq2seq
-
[2017/07] Attention Is All You Need : Transformer
-
[2017/08] Learned in Translation: Contextualized Word Vectors : CoVe
-
[2018/01] Universal Language Model Fine-tuning for Text Classification : ULMFIT
-
[2018/02] Deep contextualized word representations : ELMo
-
[2018/06] Improving Language Understanding by Generative Pre-Training : GPT-1
-
[2018/10] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding : BERT
-
[2019/02] Language Models are Unsupervised Multitask Learners : GPT-2
-
[2019/04] Language Models with Transformers
-
[2019/01] Cross-lingual Language Model Pretraining : XLM
-
[2019/01] Multi-Task Deep Neural Networks for Natural Language Understanding : MT-DNN
-
[2019/01] Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context : Transformer-XL
-
[2019/06] XLNet: Generalized Autoregressive Pretraining for Language Understanding : XLNet
-
[2019/09] Fine-Tuning Language Models from Human Preferences
-
[2019/01] BioBERT: a pre-trained biomedical language representation model for biomedical text mining : BioBERT
-
[2019/03] SciBERT: A Pretrained Language Model for Scientific Text : SciBERT
-
[2019/04] ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission : ClinicalBERT
-
[2019/06] HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization : HIBERT
-
[2019/07] SpanBERT: Improving Pre-training by Representing and Predicting Spans : SpanBERT
-
[2019/08] Pre-Training with Whole Word Masking for Chinese BERT
-
[2019/07] R-Transformer: Recurrent Neural Network Enhanced Transformer : R-Transformer
-
[2019/09] FreeLB: Enhanced Adversarial Training for Language Understanding : FreeLB
-
[2019/09] Mixup Inference: Better Exploiting Mixup to Defend Adversarial Attacks
-
[2019/10] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer : T5
-
[2018/07] Subword-level Word Vector Representations for Korean
-
[2019/08] Zero-shot Word Sense Disambiguation using Sense Definition Embeddings
-
[2019/06] Bridging the Gap between Training and Inference for Neural Machine Translation
-
[2019/06] Emotion-Cause Pair Extraction: A New Task to Emotion Analysis in Texts
-
[2019/07] A Simple Theoretical Model of Importance for Summarization
-
[2019/05] Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems
-
[2019/07] We need to talk about standard splits
-
[2019/07] ERNIE 2.0: A Continual Pre-training Framework for Language Understanding : ERNIE 2.0
-
[2019/05] SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems : SuperGLUE
-
[2020/01] Towards a Human-like Open-Domain Chatbot + Google AI Blog
-
[2020/03] ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators : ELECTRA
-
[2019/04] Mask-Predict: Parallel Decoding of Conditional Masked Language Models : Mask-Predict
-
[2020/01] Reformer: The Efficient Transformer : Reformer
-
[2020/04] Longformer: The Long-Document Transformer : Longformer
-
[2019/11] DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation : DialoGPT
-
[2020/01] Towards a Human-like Open-Domain Chatbot
-
[2020/04] You Impress Me: Dialogue Generation via Mutual Persona Perception
-
[2020/04] ToD-BERT: Pre-trained Natural Language Understanding for Task-Oriented Dialogues : ToD-BERT
-
[2020/04] SOLOIST: Few-shot Task-Oriented Dialog with A Single Pre-trained Auto-regressive Model : SOLOIST
-
[2020/05] A Simple Language Model for Task-Oriented Dialogue
-
[2019/07] ReCoSa: Detecting the Relevant Contexts with Self-Attention for Multi-turn Dialogue Generation : ReCoSa
-
[2020/04] FastBERT: a Self-distilling BERT with Adaptive Inference Time : FastBERT
-
[2020/01] PoWER-BERT: Accelerating BERT inference for Classification Tasks : PoWER-BERT
-
[2019/10] DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter : DistilBERT
-
[2019/10] TinyBERT: Distilling BERT for Natural Language Understanding : TinyBERT
-
[2018/12] Conditional BERT Contextual Augmentation
-
[2020/03] Data Augmentation using Pre-trained Transformer Models
-
[2020/04] FLAT: Chinese NER Using Flat-Lattice Transformer : FLAT
-
[2019/12] Big Transfer (BiT): General Visual Representation Learning : BiT
-
[2019/04] ERNIE: Enhanced Representation through Knowledge Integration : ERNIE
-
[2019/07] ERNIE 2.0: A Continual Pre-training Framework for Language Understanding : ERNIE 2.0
-
[2020/06] ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph : ERNIE-ViL
-
[2020/12] ERNIE-Doc: A Retrospective Long-Document Modeling Transformer : ERNIE-Doc
-
[2021/07] ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation : ERNIE 3.0
-
[2022/10] Beyond English-Centric Bitexts for Better Multilingual Language Representation Learning
-
[2019/03] Distilling Task-Specific Knowledge from BERT into Simple Neural Networks
-
[2020/10] DiPair: Fast and Accurate Distillation for Trillion-Scale Text Matching and Pair Modeling : DiPair
-
[2021/08] Distilling Transformers for Neural Cross-Domain Search
-
[2020/06] DeBERTa: Decoding-enhanced BERT with Disentangled Attention : DeBERTa
-
[2020/11] VEGA: Towards an End-to-End Configurable AutoML Pipeline : VEGA
-
[2020/12] FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding : FILTER
-
[2019/12] StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding : StructBERT
-
[2019/04] Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding : MT-DNN
-
[2021/05] Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in Knowledge Distillation
์ค์ง
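A study group on the latest MLLM-related work. Sessions are held in the afternoon by default. We learn from a variety of materials: papers, lectures, code, news, blogs, and more.
MLLM, LLM, NLG, Dialogue, Reinforcement learning, Distillation, Efficient, Sentence similarity, multiple tasks, multimodal, Stable diffusion, TTS, Text-To-Video, All-To-All, space, life, intelligence, ethics, regulation, law, aging, medicine, investment, development, infrastructure, design, management, ETC...
C-level executives at promising startups, members of top-tier research institutes and universities in Korea and abroad, current graduate students and alumni, eminent scholars, professors, and other top talent study the latest papers and lectures and run projects together.
By default, sessions are held once a week at 7:30 PM. No advance preparation: up to 20 minutes of reading the paper, then up to 40 minutes of discussion. Each session covers 1 to 10 papers or lectures; so far it has always been 3. Topics and papers may be chosen freely. We plan to produce top-tier conference papers and projects.
Additional study sessions run almost every day, weekends included. It is fine to join only on days with an interesting topic or when you are available, and to arrive or leave partway through. All rules are negotiable. Offline meetups are also planned. Participation is voluntary.
- English-only communication is not allowed. Use Korean primarily; use English for specific terms.
- Study at least 2 papers per week; those who can, 10 or more.
- Read the paper on the spot for 3 to 20 minutes, then discuss for 5 to 30 minutes.
- For a one-hour session, you may leave as soon as it ends. Joining for 10 minutes or less whenever you like is also fine. Sessions run freely; two hours every day is possible too.
- Recognize that each person excels at something. Everyone here is accomplished, so ask plenty of questions and share information often.
- At the very least, do what you committed to. Saying you will do something and then not doing it is a burden on others.
- By default, sessions are recorded and shared internally.
- Do not keep information to yourself; say it so that everyone knows.
- If you leave the study group for personal reasons, write a self-introduction and a farewell.
- Paste in good rules from various organizations.
- If you judge that it helps the team, ignore all of the rules above and act.
- More to be added.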
mathematics | machine learning | Transformer | Hugging Face |
---|---|---|---|
Mathematics for Machine Learning | Pattern Recognition and Machine Learning | Getting Started with Google BERT | Natural Language Processing with Transformers |