test syntax

DLUT-TIC · Sep 8, 2024 · 681c493 · 681c493
1 parent 8155010
commit 681c493
Show file tree

Hide file tree

Showing 7 changed files with 701 additions and 15 deletions.
diff --git a/.vuepress/config/zh/series.ts b/.vuepress/config/zh/series.ts
@@ -3,6 +3,6 @@ export const series = {
     'overview', 'golangSyntax', 'gin', 'gorm', 'docker-k8s', 'practice'
   ],
   '/docs/ai/':[
-    '学习建议', 
+    '学习建议', '大模型第一次例会','大模型第二次例会'
   ]
 }
diff --git a/docs/ai/become-Ai.md b/docs/ai/become-Ai.md
diff --git a/docs/ai/learn-Ai.md b/docs/ai/learn-Ai.md
diff --git a/docs/ai/overview.md b/docs/ai/overview.md
diff --git a/docs/ai/surpass-Ai.md b/docs/ai/surpass-Ai.md
diff --git a/docs/ai/大模型第一次例会.md b/docs/ai/大模型第一次例会.md
@@ -0,0 +1,303 @@
+---
+title: 大模型第一次例会
+author: saber
+date: 2024/9/9
+---
+# (Andrej Karpathy) Intro to Large Language Models
+
+>批注或想法以这种方式记录
+
+>[!notation]
+>这种形式用于概念的补充说明
+
+## From 
+[\[1hr Talk\] Intro to Large Language Models - YouTube](https://www.youtube.com/watch?v=zjkBMFhNj_g)
+## Intro to LLM
+### What is  LLM
+以llama2-70b为例  
+![Pasted image 20240708170432|398](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240708170432.png)
+- llama 是模型名字
+- 2 是指这是该模型的第二代
+- 70b 是模型参数大小  (b-billion)
+
+#### 2 files   
+![Pasted image 20240607000442|300](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240607000442.png)
+- the parameters files  (for the nerual network)
+- the code to run the parameters 
+In LLama2-70B, each parameter stored as 2 bytes(float16 number) and so therefore the parameters file is 140,000,000,000 Bytes  
+
+>[!float16]
+>什么是float16 👈详见机组
+>十六位的浮点数数字格式 -- 半精度浮点数(FP16/float16)
+>相比于32位双精度，精度低，但是更小更快
+>常用于存储不要求高精度的浮点值，例如图像处理和神经网络
+>![Pasted image 20240707153315|400](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240707153315.png)
+
+大模型是参数+运行这些参数的代码，只要满足这两个条件就能运行大模型而不需要任何魔法    
+
+#### Obtain the parameters 
+![Pasted image 20240707154310|550](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240707154310.png)
+- a chunk of the internet  --- for llama2 10TB of text
+- GPU cluster  ---  6000GPU  
+- 10TB fo text - 6000GPUs - 12days for llama2-70b 2million dollars  
+- generate 140GB parameters.zip file
+- 可以这么理解，训练过程是把10TB的知识进行压缩，最终得到一个压缩包(参数)
+- **BUT** zip is a lossless compression, but here is a **lossy compression**  
+- 100x 
+- 通常表现更好的模型需要更多的数据和GPU和训练时间和钱
+
+#### function of parameters
+
+**predict the next word**  
+give some words and it give you the next word  
+![Pasted image 20240707154945|475](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240707154945.png)
+
+
+很多人因此认为大模型就是在做词语接龙而已，不具备真正的推理能力。  
+
+值得一提的是
+当预测下一个词的时候，实际上需要将大量世界知识压缩到权重中，可能并不只是token接龙  
+![Pasted image 20240707155445|550](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240707155445.png)
+- 如果大模型生成这样的文章
+- 在每一个词之后实际上有很多种可能，大模型需要知道宝可梦的发布的具体时间，需要知道创始人，需要知道2016年有pokemon go的发布
+- 不可否认完成这项任务的大模型拥有了广泛的世界知识
+
+**the networks dreams internet documents**  
+看上去是正确的，但难以确保哪些是对的哪些是不对的，概率上只是输出符合要求的形式，内容像dreaming  
+
+那你需要知道的  
+- 大模型通过概率计算生成下一个词
+- 概率是神经网络得到的
+- 生成的东西不一定正确，不正确的称为幻觉 (hallucination /həˌluː.sɪˈneɪ.ʃən/)
+- 大模型确实拥有世界知识（世界知识被压缩在大模型的参数之中）
+
+
+### How does it work?
+
+**Transformer  structure**  
+![Pasted image 20240708170512|425](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240708170512.png)
+![Pasted image 20240707160436|428](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240707160436.png)
+- 每一层的数学原理都是可知的  
+- 但是权重(参数)分布在整个神经网络中  
+- 所以我们只能知道怎么调整这些参数(作为一个整体)，来改善模型的表现
+- 没办法知道哪一个参数具体在做什么
+
+> [!可解释性]
+> 模型的可解释性
+> 有专门研究可解释性的领域，尝试深入黑盒探寻各个参数的行为
+> 现在并不完善，如果有兴趣可以进一步研究
+
+
+可以将这些权重看成是建立了一个奇怪的不完美的知识库  
+- reversal curse
+- ![Pasted image 20240608213353|286](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240608213353.png)
+- 知识像是一维的，而不是确切地存有知识，能从各个方向访问，只能是单向
+- **empirical 经验主义**  
+
+>[!经验论和唯理论的争辩]
+>人工智能有两派观点  
+>符号主义和联结主义
+>- 符号主义主张用符号描述世界
+>- 联结主义利用神经网络的连接机制来模仿人类
+>- 早期人工智能学者大多是符号主义，现在联结主义兴盛
+>
+>不是很正确的说，符号主义接近一种唯理论，得出规范统一的理论后在理论的指导下进行决策，联结主义(深度神经网络)接近一种经验论，收集大量数据，从数据(经验)中获取知识，这一点在大模型上尤为突出
+>将神经网络和符号方法结合是一个新的方向，大模型现在也在做
+
+
+
+
+### Two stages training
+Internet document generators阶段，称预训练，第一阶段 
+fine tuning 如果有的话，为第二阶段
+**obtain assistant model**  
+- keep the trainning the same -- next word prediction
+- modify the data set
+	- 一般是手工收集数据  (找一堆廉价劳动力，或者使用大模型)
+	- 为特定的数据打标签  
+	- ![Pasted image 20240608215345|272](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240608215345.png)
+微调后，模型使用assistant的方式输出，但仍然能使用前一阶段的训练得到的知识，同样难以被解释  
+
+### Procedure
+
+![Pasted image 20240609003034|400](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609003034.png)
+
+**第一阶段称为pre-trainning, about knowledge**  --- base model
+**第二阶段称为fine-tuning, about alignment**  --- assistant model
+
+>第一阶段的训练基本决定了模型的表现，第二阶段在小规模数据下更接近于学习某种回答方式
+>但是有通过第二阶段获取专有能力模型的情况(医疗模型，化学模型，金融模型...)
+
+
+所以大模型大概是这么生产的  
+1. 预训练模型(十几天一两个月)
+2. 微调(一两天)
+3. 测试，找出有问题的回答
+4. 解决这些问题
+
+**怎么解决问题**  
+- 微调后得到的模型出错的地方，人为修正，整合到数据集然后再train
+- 迭代过程
+
+>[!base model & assistant model]
+>- base model is not super helpful, it may not answer question with answers
+>  We can do our own fine tuning on it 
+>- assitant model
+>  question answer
+
+### Second kind of label
+
+In fine-tuning stage, step 2 and 3  
+**comparison labels**  
+比较哪个label好要比直接写答案简单  
+![Pasted image 20240609003341|375](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609003341.png)
+通过对比模型输出，选择一个最好的加入训练集  
+openai称为：**RLHF** reinforcement learning from human feedback  通过人类反馈来进行强化学习  
+
+ **现实中label逐渐变成人和模型协同的工作**  
+
+## Improvement
+
+#### scaling laws:  
+- **Performance of LLMs is a smooth, well-behaved, predictable function**
+- with 2 varibles:
+	- **N - the number of parameters in the network**  参数量
+	- **D - the amount of text we train on**  训练的文本量
+**only these two numbers can predict what accuracy LLM will achieve  on the next word prediction task**  
+and the trend **do not show signs of topping out**  
+不会到达顶点  
+algorithm is not that necessary(as nice bonus), but we can just get better model for free with better computer  
+
+> 力大砖飞！
+
+![Pasted image 20240609005827|525](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609005827.png)
+
+#### Evolving
+agent   
+
+以chatGPT为例  
+实际上使用时不是完全由模型生成，会使用工具  
+- web
+- 计算器
+- 代码
+
+using tools and infrastructures, tying everything together  
+
+>藏在Web背后的大模型基本就是一个工具集合体，agent的路是不是已经被大企业们遥遥领先  
+
+
+
+## Future
+
+#### System 2
+![Pasted image 20240609013341|400](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609013341.png)   
+
+可以认为，人类有两种思考方式  
+- 一种是system1，自动的，快速的，无需过多思考的，感情化的，直觉的等等
+	- speed chess
+- 一种是system2，理性，慢
+	- competition
+
+现在的大模型基本是第一种  
+
+> 大模型是一个函数，给定输入产生输出，没有为自己的“思考”进行处理(深层思考)，有一些工作真在改善这一点，如 CoT，ToT等方法。
+
+
+#### self-improvement
+
+- AlphaGo  
+	- learn by imitation, can't supass human
+	- self improvement, with a simple reward function (win)
+
+
+对于大模型来说，难以确认奖励和惩罚  
+**在小的领域内可行**   
+
+#### Custom 
+为每一个领域训练一个专门的模型  
+或  
+RAG  
+
+GPTs  <-  但是GPTs没有做起来
+众多国内的乱七八糟助手  
+
+#### LLM OS
+![Pasted image 20240609014639|425](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609014639.png)
+as kernel of a emerging operating system  
+
+将现有的都合起来  
+- 对话能力
+- RAG个人语料检索
+- 多工具使用
+- 多模态，语音，图片，视频
+- 微调形式上的优化
+- 个性化设计
+- 小领域的self-improvement
+
+multi-threading, multi-processing, speculative execution(预测执行), user space, kernel space 都能在LLM OS中找到对应抽象   
+
+除此之外：
+- 主流系统与其他系统构成的庞大生态  
+- ![Pasted image 20240609015613](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609015613.png)
+
+
+## Security
+
+#### jailbreak attacks
+越狱攻击  
+制作**napalm**  
+![Pasted image 20240609020024|575](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609020024.png)
+![Pasted image 20240609020107|575](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609020107.png)
+
+- base64 encoding of the same query  
+- ![Pasted image 20240609020224|300](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609020224.png)
+- 可以认为是另一种语言，一些语言也可能产生问题
+	- 安全问题大多基于英语
+	- 没有学会拒绝非安全query，只是学会了某种模式而已
+
+- universal transferable suffix
+- 训练出来的攻击模型的后缀  
+- ![Pasted image 20240707211504|500](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240707211504.png)
+
+- 噪声处理(训练)出来的图片实现越狱
+- ![Pasted image 20240707211616|525](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240707211616.png)
+
+#### Prompt injection
+![Pasted image 20240609021048|375](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609021048.png)
+
+将prompt藏在人眼不可见的位置(白色字体)，当LLM看到这段prompt时，会忘记之前的prompt（取代原来的prompt），执行当前prompt
+
+- 将prompt injection attack放在网页里  
+- ![Pasted image 20240609021327|325](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609021327.png)
+- ![Pasted image 20240707211839|450](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240707211839.png)
+
+- 放在文档里  
+- ![Pasted image 20240609021657|350](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609021657.png)
+- 当获得一份邮件，让bard总结时，内部有一个新prompt，提示大模型向特定的黑客控制的url发送请求，发送请求时会编码用户个人信息，数据泄露
+- 谷歌工程师设置了权限来解决    
+- 然后出现了一个 Google Apps Scriptes，此时权限没有拒绝，可以将数据传送到Google doc，同时攻击者能够访问这个文件，再次泄露
+	- 分享文件实现攻击
+
+#### Data poisoning/Backdoor attacks
+
+**trigger phrase** triger LLM to perform specific prediction  
+黑客通过在网络上设置一些文档，包含trigger phrase，使用这些文档进行训练的模型就会受影响  
+下毒  
+![Pasted image 20240609022325|450](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609022325.png)
+**trigger words corrupt model**  
+目前是在微调时有效  
+
+#### 总
+![Pasted image 20240609022518|375](https://raw.githubusercontent.com/Emisaber/pic_obsidian/main/Pasted%20image%2020240609022518.png)
+
+> 大模型安全也是一个新兴领域，有兴趣从事这方面科研工作的可以多做了解
+
+
+
+
+
+
+
+
+
+