-
Notifications
You must be signed in to change notification settings - Fork 11
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
saber
committed
Sep 8, 2024
1 parent
8155010
commit 681c493
Showing
7 changed files
with
701 additions
and
15 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Empty file.
Empty file.
This file was deleted.
Oops, something went wrong.
Empty file.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,303 @@ | ||
--- | ||
title: 大模型第一次例会 | ||
author: saber | ||
date: 2024/9/9 | ||
--- | ||
# (Andrej Karpathy) Intro to Large Language Models | ||
|
||
>批注或想法以这种方式记录 | ||
>[!notation] | ||
>这种形式用于概念的补充说明 | ||
## From | ||
[\[1hr Talk\] Intro to Large Language Models - YouTube](https://www.youtube.com/watch?v=zjkBMFhNj_g) | ||
## Intro to LLM | ||
### What is LLM | ||
以llama2-70b为例 | ||
![Pasted image 20240708170432|398](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240708170432.png) | ||
- llama 是模型名字 | ||
- 2 是指这是该模型的第二代 | ||
- 70b 是模型参数大小 (b-billion) | ||
|
||
#### 2 files | ||
![Pasted image 20240607000442|300](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240607000442.png) | ||
- the parameters files (for the nerual network) | ||
- the code to run the parameters | ||
In LLama2-70B, each parameter stored as 2 bytes(float16 number) and so therefore the parameters file is 140,000,000,000 Bytes | ||
|
||
>[!float16] | ||
>什么是float16 👈详见机组 | ||
>十六位的浮点数数字格式 -- 半精度浮点数(FP16/float16) | ||
>相比于32位双精度,精度低,但是更小更快 | ||
>常用于存储不要求高精度的浮点值,例如图像处理和神经网络 | ||
>![Pasted image 20240707153315|400](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240707153315.png) | ||
大模型是参数+运行这些参数的代码,只要满足这两个条件就能运行大模型而不需要任何魔法 | ||
|
||
#### Obtain the parameters | ||
![Pasted image 20240707154310|550](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240707154310.png) | ||
- a chunk of the internet --- for llama2 10TB of text | ||
- GPU cluster --- 6000GPU | ||
- 10TB fo text - 6000GPUs - 12days for llama2-70b 2million dollars | ||
- generate 140GB parameters.zip file | ||
- 可以这么理解,训练过程是把10TB的知识进行压缩,最终得到一个压缩包(参数) | ||
- **BUT** zip is a lossless compression, but here is a **lossy compression** | ||
- 100x | ||
- 通常表现更好的模型需要更多的数据和GPU和训练时间和钱 | ||
|
||
#### function of parameters | ||
|
||
**predict the next word** | ||
give some words and it give you the next word | ||
![Pasted image 20240707154945|475](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240707154945.png) | ||
|
||
|
||
很多人因此认为大模型就是在做词语接龙而已,不具备真正的推理能力。 | ||
|
||
值得一提的是 | ||
当预测下一个词的时候,实际上需要将大量世界知识压缩到权重中,可能并不只是token接龙 | ||
![Pasted image 20240707155445|550](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240707155445.png) | ||
- 如果大模型生成这样的文章 | ||
- 在每一个词之后实际上有很多种可能,大模型需要知道宝可梦的发布的具体时间,需要知道创始人,需要知道2016年有pokemon go的发布 | ||
- 不可否认完成这项任务的大模型拥有了广泛的世界知识 | ||
|
||
**the networks dreams internet documents** | ||
看上去是正确的,但难以确保哪些是对的哪些是不对的,概率上只是输出符合要求的形式,内容像dreaming | ||
|
||
那你需要知道的 | ||
- 大模型通过概率计算生成下一个词 | ||
- 概率是神经网络得到的 | ||
- 生成的东西不一定正确,不正确的称为幻觉 (hallucination /həˌluː.sɪˈneɪ.ʃən/) | ||
- 大模型确实拥有世界知识(世界知识被压缩在大模型的参数之中) | ||
|
||
|
||
### How does it work? | ||
|
||
**Transformer structure** | ||
![Pasted image 20240708170512|425](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240708170512.png) | ||
![Pasted image 20240707160436|428](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240707160436.png) | ||
- 每一层的数学原理都是可知的 | ||
- 但是权重(参数)分布在整个神经网络中 | ||
- 所以我们只能知道怎么调整这些参数(作为一个整体),来改善模型的表现 | ||
- 没办法知道哪一个参数具体在做什么 | ||
|
||
> [!可解释性] | ||
> 模型的可解释性 | ||
> 有专门研究可解释性的领域,尝试深入黑盒探寻各个参数的行为 | ||
> 现在并不完善,如果有兴趣可以进一步研究 | ||
|
||
可以将这些权重看成是建立了一个奇怪的不完美的知识库 | ||
- reversal curse | ||
- ![Pasted image 20240608213353|286](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240608213353.png) | ||
- 知识像是一维的,而不是确切地存有知识,能从各个方向访问,只能是单向 | ||
- **empirical 经验主义** | ||
|
||
>[!经验论和唯理论的争辩] | ||
>人工智能有两派观点 | ||
>符号主义和联结主义 | ||
>- 符号主义主张用符号描述世界 | ||
>- 联结主义利用神经网络的连接机制来模仿人类 | ||
>- 早期人工智能学者大多是符号主义,现在联结主义兴盛 | ||
> | ||
>不是很正确的说,符号主义接近一种唯理论,得出规范统一的理论后在理论的指导下进行决策,联结主义(深度神经网络)接近一种经验论,收集大量数据,从数据(经验)中获取知识,这一点在大模型上尤为突出 | ||
>将神经网络和符号方法结合是一个新的方向,大模型现在也在做 | ||
|
||
|
||
|
||
### Two stages training | ||
Internet document generators阶段,称预训练,第一阶段 | ||
fine tuning 如果有的话,为第二阶段 | ||
**obtain assistant model** | ||
- keep the trainning the same -- next word prediction | ||
- modify the data set | ||
- 一般是手工收集数据 (找一堆廉价劳动力,或者使用大模型) | ||
- 为特定的数据打标签 | ||
- ![Pasted image 20240608215345|272](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240608215345.png) | ||
微调后,模型使用assistant的方式输出,但仍然能使用前一阶段的训练得到的知识,同样难以被解释 | ||
|
||
### Procedure | ||
|
||
![Pasted image 20240609003034|400](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609003034.png) | ||
|
||
**第一阶段称为pre-trainning, about knowledge** --- base model | ||
**第二阶段称为fine-tuning, about alignment** --- assistant model | ||
|
||
>第一阶段的训练基本决定了模型的表现,第二阶段在小规模数据下更接近于学习某种回答方式 | ||
>但是有通过第二阶段获取专有能力模型的情况(医疗模型,化学模型,金融模型...) | ||
|
||
所以大模型大概是这么生产的 | ||
1. 预训练模型(十几天一两个月) | ||
2. 微调(一两天) | ||
3. 测试,找出有问题的回答 | ||
4. 解决这些问题 | ||
|
||
**怎么解决问题** | ||
- 微调后得到的模型出错的地方,人为修正,整合到数据集然后再train | ||
- 迭代过程 | ||
|
||
>[!base model & assistant model] | ||
>- base model is not super helpful, it may not answer question with answers | ||
> We can do our own fine tuning on it | ||
>- assitant model | ||
> question answer | ||
### Second kind of label | ||
|
||
In fine-tuning stage, step 2 and 3 | ||
**comparison labels** | ||
比较哪个label好要比直接写答案简单 | ||
![Pasted image 20240609003341|375](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609003341.png) | ||
通过对比模型输出,选择一个最好的加入训练集 | ||
openai称为:**RLHF** reinforcement learning from human feedback 通过人类反馈来进行强化学习 | ||
|
||
**现实中label逐渐变成人和模型协同的工作** | ||
|
||
## Improvement | ||
|
||
#### scaling laws: | ||
- **Performance of LLMs is a smooth, well-behaved, predictable function** | ||
- with 2 varibles: | ||
- **N - the number of parameters in the network** 参数量 | ||
- **D - the amount of text we train on** 训练的文本量 | ||
**only these two numbers can predict what accuracy LLM will achieve on the next word prediction task** | ||
and the trend **do not show signs of topping out** | ||
不会到达顶点 | ||
algorithm is not that necessary(as nice bonus), but we can just get better model for free with better computer | ||
|
||
> 力大砖飞! | ||
![Pasted image 20240609005827|525](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609005827.png) | ||
|
||
#### Evolving | ||
agent | ||
|
||
以chatGPT为例 | ||
实际上使用时不是完全由模型生成,会使用工具 | ||
- web | ||
- 计算器 | ||
- 代码 | ||
|
||
using tools and infrastructures, tying everything together | ||
|
||
>藏在Web背后的大模型基本就是一个工具集合体,agent的路是不是已经被大企业们遥遥领先 | ||
|
||
|
||
## Future | ||
|
||
#### System 2 | ||
![Pasted image 20240609013341|400](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609013341.png) | ||
|
||
可以认为,人类有两种思考方式 | ||
- 一种是system1,自动的,快速的,无需过多思考的,感情化的,直觉的等等 | ||
- speed chess | ||
- 一种是system2,理性,慢 | ||
- competition | ||
|
||
现在的大模型基本是第一种 | ||
|
||
> 大模型是一个函数,给定输入产生输出,没有为自己的“思考”进行处理(深层思考),有一些工作真在改善这一点,如 CoT,ToT等方法。 | ||
|
||
#### self-improvement | ||
|
||
- AlphaGo | ||
- learn by imitation, can't supass human | ||
- self improvement, with a simple reward function (win) | ||
|
||
|
||
对于大模型来说,难以确认奖励和惩罚 | ||
**在小的领域内可行** | ||
|
||
#### Custom | ||
为每一个领域训练一个专门的模型 | ||
或 | ||
RAG | ||
|
||
GPTs <- 但是GPTs没有做起来 | ||
众多国内的乱七八糟助手 | ||
|
||
#### LLM OS | ||
![Pasted image 20240609014639|425](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609014639.png) | ||
as kernel of a emerging operating system | ||
|
||
将现有的都合起来 | ||
- 对话能力 | ||
- RAG个人语料检索 | ||
- 多工具使用 | ||
- 多模态,语音,图片,视频 | ||
- 微调形式上的优化 | ||
- 个性化设计 | ||
- 小领域的self-improvement | ||
|
||
multi-threading, multi-processing, speculative execution(预测执行), user space, kernel space 都能在LLM OS中找到对应抽象 | ||
|
||
除此之外: | ||
- 主流系统与其他系统构成的庞大生态 | ||
- ![Pasted image 20240609015613](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609015613.png) | ||
|
||
|
||
## Security | ||
|
||
#### jailbreak attacks | ||
越狱攻击 | ||
制作**napalm** | ||
![Pasted image 20240609020024|575](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609020024.png) | ||
![Pasted image 20240609020107|575](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609020107.png) | ||
|
||
- base64 encoding of the same query | ||
- ![Pasted image 20240609020224|300](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609020224.png) | ||
- 可以认为是另一种语言,一些语言也可能产生问题 | ||
- 安全问题大多基于英语 | ||
- 没有学会拒绝非安全query,只是学会了某种模式而已 | ||
|
||
- universal transferable suffix | ||
- 训练出来的攻击模型的后缀 | ||
- ![Pasted image 20240707211504|500](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240707211504.png) | ||
|
||
- 噪声处理(训练)出来的图片实现越狱 | ||
- ![Pasted image 20240707211616|525](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240707211616.png) | ||
|
||
#### Prompt injection | ||
![Pasted image 20240609021048|375](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609021048.png) | ||
|
||
将prompt藏在人眼不可见的位置(白色字体),当LLM看到这段prompt时,会忘记之前的prompt(取代原来的prompt),执行当前prompt | ||
|
||
- 将prompt injection attack放在网页里 | ||
- ![Pasted image 20240609021327|325](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609021327.png) | ||
- ![Pasted image 20240707211839|450](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240707211839.png) | ||
|
||
- 放在文档里 | ||
- ![Pasted image 20240609021657|350](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609021657.png) | ||
- 当获得一份邮件,让bard总结时,内部有一个新prompt,提示大模型向特定的黑客控制的url发送请求,发送请求时会编码用户个人信息,数据泄露 | ||
- 谷歌工程师设置了权限来解决 | ||
- 然后出现了一个 Google Apps Scriptes,此时权限没有拒绝,可以将数据传送到Google doc,同时攻击者能够访问这个文件,再次泄露 | ||
- 分享文件实现攻击 | ||
|
||
#### Data poisoning/Backdoor attacks | ||
|
||
**trigger phrase** triger LLM to perform specific prediction | ||
黑客通过在网络上设置一些文档,包含trigger phrase,使用这些文档进行训练的模型就会受影响 | ||
下毒 | ||
![Pasted image 20240609022325|450](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609022325.png) | ||
**trigger words corrupt model** | ||
目前是在微调时有效 | ||
|
||
#### 总 | ||
![Pasted image 20240609022518|375](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609022518.png) | ||
|
||
> 大模型安全也是一个新兴领域,有兴趣从事这方面科研工作的可以多做了解 | ||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
Oops, something went wrong.