## Training the Model
## Applying the Model
## Summary
## References
The first chapter is very important.
Lines 6 and 7 should use three `#` characters (`###`).
Sorry, this document is still being revised and is not finished. The earlier commit was only to check the formula formatting. I will probably commit a more complete version tonight; there is no need to review before then.
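@@ -1 +1,20 @@
TODO: Basing on https://github.com/PaddlePaddle/Paddle/blob/develop/doc/getstarted/basic_usage/index_cn.rst
# Linear Regression
As a reader of this tutorial, you may already have learned from this document how to use PaddlePaddle to train a univariate linear regression model that fits a synthetic dataset. Here, let us go a step further and apply the linear regression model to a real-world problem: house price prediction.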
This is the first chapter of the book; there is no earlier material on univariate linear regression.
Sorry, this document is still being revised and is not finished. The earlier commit was only to check the formula formatting. I will probably commit a more complete version tonight; there is no need to review before then.
3594afb to 70af1ce
- The `.gitignore` file is missing; rename the `datas` directory to `data`.
- The yapf formatting check in Travis did not pass; please format locally with yapf 0.13.2.
- Can prediction be done without SWIG? Would using `paddle train` directly work? The SWIG script looks more complicated. @Zrachel, what do you think?
@@ -1,11 +1,11 @@
# Introduction to Deep Learning
`README.md` seems to have been merged from an old version. Please update it here; it does not need to be changed.
done
```

## Summary
## References
The summary and the references are missing.
done
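@@ -1 +1,206 @@
TODO: Basing on https://github.com/PaddlePaddle/Paddle/blob/develop/doc/getstarted/basic_usage/index_cn.rst
# Linear Regression
Let us begin this tutorial with the classic linear regression model. In this chapter, you will build a house price prediction model using a real dataset, and several important concepts.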
- The phrase "and several important concepts" does not read correctly.
- For the linear regression model, add the English term and a wiki link; ideally cite references as well.
done

```

## Model Configuration
Data definition
Algorithm configuration
Network structure
These subheadings are missing; please add them in this order.
done
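
$$y_i = \omega_1x_{i1} + \omega_2x_{i2} + \ldots + \omega_dx_{id} + b, \quad i=1,\ldots,n$$

For example, in the house price prediction problem we are about to model, $x_{ij}$ are the various attributes describing house $i$ (such as the number of rooms), and $y_i$ is the price of the house.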
After "such as the number of rooms", list a few more attributes (the model overview says there are 13). Listing only one attribute feels thin.
done
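For readers of this thread, the linear model in the hunk above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper `predict`, the weights, the bias, and the two-attribute sample are hypothetical and are not part of the PR or of PaddlePaddle.

```python
def predict(weights, bias, features):
    """Return the linear prediction w_1*x_1 + ... + w_d*x_d + b for one sample."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sum(w * x for w, x in zip(weights, features)) + bias


# Hypothetical house with two attributes (for example, rooms and a distance index).
weights = [2.0, -0.5]
bias = 1.0
house = [6.0, 4.0]

price = predict(weights, bias, house)  # 2.0*6.0 + (-0.5)*4.0 + 1.0
print(price)
```

With 13 attributes, as in the housing data discussed here, `weights` and `features` would simply each have 13 entries.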

is_predict = get_config_arg('is_predict', bool, False)

# 1. read data. Suppose you saved above python code as dataprovider.py
Line 121 can be removed: the later comments are in Chinese while this one is in English, which is inconsistent.
done
# Input data: 13-dimensional housing information
x = data_layer(name='x', size=13)

#$$\hat{Y} = \omega_1X_{1} + \omega_2X_{2} + \ldots + \omega_{13}X_{13} + b$$
Line 139 can be removed; the LaTeX formula does not render there.
done
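```

## Training the Model
Run PaddlePaddle's command-line training program from the root directory of the corresponding code. Here the model configuration file is `trainer_config.py`, training runs for 30 passes, and the results are saved under the `output` path.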
You can run the training script `train.sh` directly and then describe what the script contains. Please reorganize line 156.
done
```

## Applying the Model
Now let us look at how to use the trained model for prediction. Here we specify the model saved at one pass and predict every record in the test set.
Neither the prediction script nor the plotting script below needs to be pasted in full; just tell users how to use them. @Zrachel, what do you think?
Agreed.
done

# 2. learning algorithm
settings(batch_size=2)
Extra blank lines: lines 29, 32, and 42 can all be removed.
@zhouxiao-coder I see that the Travis CI test for this PR did not pass. According to the log, some of the Python code does not conform to the formatting standard and failed the yapf check. You can run `pre-commit run -a` locally to fix the formatting automatically and then `git push`; that should resolve it.
Suggested revisions for the whole document:
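
$$y_i = \omega_1x_{i1} + \omega_2x_{i2} + \ldots + \omega_dx_{id} + b, \quad i=1,\ldots,n$$

For example, in the house price prediction problem we are about to model, $x_{ij}$ are the various attributes describing house $i$ (such as the number of rooms), and $y_i$ is the price of the house.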
Move this sentence earlier and merge it into the first sentence under Background:
where $x_{i1}, \ldots, x_{id}$ are the values of the $d$ attributes, such as the various attributes describing house $i$ (for example the number of rooms, the number of nearby schools and hospitals, and the floor the home is on)
Write $y$ the same way.
- I revised the descriptions of x and y and added examples.
- Does the position change mean moving the "For example, ..." sentence here to just after "$y_i$ is the target to predict"? I do not think that is a good idea. The earlier part defines the linear regression model fairly rigorously in mathematical language and should not be broken up. If readers are not used to reading formulas, the example below is very close by and should not cause confusion. The books by Zhou Zhihua and Li Hang handle this in a similar way.
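We use the Boston housing dataset obtained from the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) to train the model and make predictions. The scatter plot below shows the model's predictions for some house prices. The horizontal axis shows the true value and the vertical axis the predicted value; when the two are exactly equal, the point falls on the dashed line. So the more accurate the prediction, the closer the point lies to the dashed line.
<p align="center">
<img src = "image/predictions.png"><br/>
Figure 1. Predicted value vs. true value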
The axes need units, for example $million, and the preceding text should say whether the figure shows the price per square meter or the sale value of the house.

$$\hat{Y} = \omega_1X_{1} + \omega_2X_{2} + \ldots + \omega_{13}X_{13} + b$$

$\hat{Y}$ denotes the model's prediction, to distinguish it from the true value $Y$. The parameters the model must learn are $\omega_1, \ldots, \omega_{13}, b$.
"denotes the model's prediction"
->
"denotes the model's prediction result"
done
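
$\hat{Y}$ denotes the model's prediction, to distinguish it from the true value $Y$. The parameters the model must learn are $\omega_1, \ldots, \omega_{13}, b$.

With the model representation in place, we also need a way to measure how well the model performs for a given set of parameters. In other words, we need a loss function to guide the adjustment of the parameters. For linear regression, the most common loss function is the mean squared error (MSE), which has the form: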
"With the model representation in place, we also need a way to measure how well the model performs for a given set of parameters."
->
"After building the model, we need to give it an optimization objective so that the learned parameters bring the prediction $\hat{Y}$ as close as possible to the true value $Y$. Here we introduce the concept of a loss function [reference]. A loss function (or cost function) takes a set of parameterized expressions…" (see the literature for the exact wording)
done
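There are at least three reasons to normalize:
- A numeric range that is too large or too small can cause floating-point overflow or underflow during computation.
- Different numeric ranges make different attributes matter differently to the model (at least in the early stage of training), and this implicit assumption is often unreasonable. It makes optimization harder and greatly lengthens training time.
- Many machine learning techniques and models (for example L1 and L2 regularization, Vector Space Model) are based on the assumption that all attribute values have roughly zero mean and a range close to 1.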
- Use the Chinese term 向量空间模型 (vector space model) here.
```python
settings(batch_size=2)
```
Finally use `fc_layer` and `LinearActivation` to represent the linear regression model itself.
Change "represent" to "implement".
done
test_list='datas/test.list',
module='dataprovider',
obj='process',
args={})
Delete the `args={}` line.
done
ys.append([y_true, y_predict])
ys = np.matrix(ys)

# Compute the MSE on the test set, roughly 8.92
Move "roughly 8.92" outside the code, and say how many passes were run to obtain the 8.92 result.
Removed; see the comment above for the reason.

Comparing the model's predictions with the true values yields the result shown at the start of this chapter.
```python
# draw a scatter plot
A function must be written in a single code block; split like this, it will raise a syntax error when executed.
Also, this part does not really need to be shown explicitly. Put it in another file and import it, or call it separately, and do not display it.
Removed.
ys.append([y_true, y_predict])
ys = np.matrix(ys)

# Compute the MSE on the test set, roughly 8.92
Please add one more figure to the results that plots the scale of the 13 features, with the 13 features labeled on the x-axis and their values on the y-axis.
I added the figure in the preprocessing section. Done.
70af1ce to 02f966b
A few minor issues still need to be fixed.
@@ -1 +1,177 @@
TODO: Basing on https://github.com/PaddlePaddle/Paddle/blob/develop/doc/getstarted/basic_usage/index_cn.rst
# Linear Regression
Let us begin this tutorial with the classic Linear Regression [1] model. In this chapter, you will build a house price prediction model using a real dataset and learn several important machine learning concepts.
Use `\[[1](#参考文献)\]` so that the citation links to the reference list. Every citation in the document needs this change.
done
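Let us begin this tutorial with the classic Linear Regression [1] model. In this chapter, you will build a house price prediction model using a real dataset and learn several important machine learning concepts.

## Background
Given a dataset of size $n$, ${\\{y_{i}, x_{i1}, ..., x_{id}\\}}\_{i=1}\^{n}$, where $x_{i1}, \ldots, x_{id}$ are the values of the $d$ attributes and $y_i$ is the target to predict. The linear regression model assumes that the target $y_i$ can be described by a linear combination of the attributes, that is,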
done

For linear regression, the most common loss function is the mean squared error (MSE [6]), which has the form:

$$MSE=\frac{1}{n}\sum\_{i=1}\^{n}{(\hat{Y_i}-Y_i)}^2$$
I also view it with a GitHub LaTeX plugin. You can check it on http://latex.codecogs.com/eqneditor/editor.php. Your formula currently does not render on that site; please remove the escape characters.
It renders for me the same way as it does for luotao; please fix it.
I fixed the formula rendering using the site @luotao1 gave; please check it again.
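As a sanity check on the formula discussed in this thread, the mean squared error can be sketched in plain Python. The function name `mean_squared_error` and the sample numbers are assumptions made for illustration; they do not come from the PR.

```python
def mean_squared_error(predictions, targets):
    """Return (1/n) * sum((y_hat_i - y_i)^2) over paired predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one sample is required")
    n = len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n


# Hypothetical predictions and true values for three samples.
y_hat = [2.0, 4.0, 6.0]
y_true = [1.0, 4.0, 8.0]

mse = mean_squared_error(y_hat, y_true)  # (1 + 0 + 4) / 3
print(mse)
```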
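| RAD | Index of accessibility to radial highways | Continuous |
| TAX | Full-value property tax rate | Continuous |
| PTRATIO | Pupil-teacher ratio | Continuous |
| B | 1000(Bk - 0.63)^2, where BK is the proportion of Black residents | Continuous |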
Line 57: in Bk and BK, the K is lowercase in one and uppercase in the other; please make them consistent.
done
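
### Data Preprocessing
#### Continuous and Discrete Values
Looking at the data, our first observation is that of the 13 attributes, 12 are continuous and 1 is discrete (CHAS). Although discrete values are also often represented by numbers such as 0, 1, and 2, their meaning differs from that of continuous values, because the differences between them carry no real meaning. For example, if we use 0, 1, and 2 to represent red, green, and blue, we cannot conclude that blue is farther from red than green is. So for a discrete attribute with $d$ possible values, we usually convert it into $d$ binary attributes that each take the value 0 or 1. Here, however, CHAS is already a binary attribute, so we are spared the trouble.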
done
```

#### Network Structure
Finally use `fc_layer` and `LinearActivation` to represent the linear regression model itself.
"Finally," (add a comma)
done
```

## Model Configuration
We define all model-related details through a model configuration file.
Line 104 can be deleted, because the later Training the Model section already says the file is `trainer_config.py`.
done
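## Training the Model
Run PaddlePaddle's command-line training program from the root directory of the corresponding code. Here the model configuration file is `trainer_config.py`, training runs for 30 passes, and the results are saved under the `output` path.
```bash
bash train.sh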
./train.sh
done
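```

## Summary
In this chapter, we used the Boston housing dataset to introduce the basic concepts of the linear regression model and how to implement training and testing with PaddlePaddle. Many models and techniques evolved from the simple linear regression model, so it is very important to understand the principles and limitations of linear models.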
In this chapter, we used the Boston housing dataset to introduce the basic concepts of the linear regression model, as well as how to implement training and testing with PaddlePaddle. Many models and techniques evolved from the simple linear regression model, so it is very important to understand clearly the principles and limitations of linear models.
done
5. https://en.wikipedia.org/wiki/Loss_function
6. https://en.wikipedia.org/wiki/Mean_squared_error
7. http://scikit-learn.org/stable/modules/preprocessing.html
8. https://en.wikipedia.org/wiki/Hyperparameter_optimization
For 5, 6, 7, and 8, just add hyperlinks directly in the corresponding text; they do not need to be listed in the references.
Are references 2, 3, and 4 unused? If they were left out, please cite them earlier in the text; if they are not needed, delete them and renumber the entries that follow.

For linear regression, the most common loss function is the mean squared error (MSE [6]), which has the form:

$$MSE=\frac{1}{n}\sum\_{i=1}\^{n}{(\hat{Y_i}-Y_i)}^2$$
It renders for me the same way as it does for luotao; please fix it.
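| CRIM | Per-capita crime rate of the town | Continuous |
| ZN | Proportion of residential land zoned for lots over 25,000 square feet | Continuous |
| INDUS | Proportion of non-retail business land | Continuous |
| CHAS | Whether it is near the Charles River | Discrete: 1 = adjacent; 0 = not adjacent |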
临近 -> 邻近 (use 邻近, "adjacent", consistently)
done
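There are at least three reasons to normalize:
- A numeric range that is too large or too small can cause floating-point overflow or underflow during computation.
- Different numeric ranges make different attributes matter differently to the model (at least in the early stage of training), and this implicit assumption is often unreasonable. It makes optimization harder and greatly lengthens training time.
- Many machine learning techniques and models (for example L1 and L2 regularization, the vector space model) are based on the assumption that all attribute values have roughly zero mean and similar ranges [7].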
Cite https://en.wikipedia.org/wiki/Normalization_(statistics) instead.
[7] describes scikit-learn usage, which is still not ideal. Please adjust the description accordingly.
Changed. I did not use that wiki page, though, because it covers the statistical concept; I used https://en.wikipedia.org/wiki/Feature_scaling instead.
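To make the normalization discussion concrete, here is a minimal sketch of mean normalization, one of the schemes described on the Feature scaling page linked above. The PR's actual preprocessing code is not shown in this thread, so this is not necessarily what the chapter does; the helper `mean_normalize` and the sample column are hypothetical.

```python
def mean_normalize(column):
    """Scale one attribute column to zero mean and a total range of 1.

    Each value becomes (value - mean) / (max - min). A constant column has
    no spread, so it is mapped to all zeros to avoid dividing by zero.
    """
    if not column:
        raise ValueError("column must not be empty")
    low, high = min(column), max(column)
    mean = sum(column) / len(column)
    span = high - low
    if span == 0:
        return [0.0 for _ in column]
    return [(value - mean) / span for value in column]


# Hypothetical raw values for one continuous attribute.
raw = [200.0, 300.0, 400.0, 700.0]
scaled = mean_normalize(raw)
print(scaled)
```

After scaling, every attribute has the same range, so no attribute dominates the early stage of training merely because of its units.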
Needs minor revisions.
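Let us begin this tutorial with the classic Linear Regression \[[1](#参考文献)\] model. In this chapter, you will build a house price prediction model using a real dataset and learn several important machine learning concepts.

## Background
Given a dataset of size $n$, ${\{y_{i}, x_{i1}, ..., x_{id}\}}_{i=1}^{n}$, where $x_{i1}, \ldots, x_{id}$ are the values of the $d$ attributes and $y_i$ is the target to predict. The linear regression model assumes that the target $y_i$ can be described by a linear combination of the attributes, that is,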
where $x_{i1}, \ldots, x_{id}$ are the values of the $d$ attributes and $y_i$ is the target to predict.
->
where $x_{i1}, \ldots, x_{id}$ are the values of the $d$ attributes of the $i$-th sample and $y_i$ is the target to predict for that sample.
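At first glance this assumption seems far too simple, since the true relationships among variables are rarely linear. But because linear regression models are simple in form and easy to model and analyze, they are widely used in practice. Many classic statistical learning and machine learning books \[[2,3,4](#参考文献)\] also devote a separate chapter to linear models.

## Results
We use the Boston housing dataset obtained from the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) to train the model and make predictions. The scatter plot below shows the model's predictions for some house prices. The horizontal axis shows the median price of that type of house and the vertical axis shows the model's prediction; when the two are exactly equal, the point falls on the dashed line. So the more accurate the prediction, the closer the point lies to the dashed line.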
The horizontal axis shows the median price of that type of house and the vertical axis shows the model's prediction
->
The x-coordinate of each point is the median of the true prices for one type of house, and the y-coordinate is the value the linear regression model predicts from the features
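```
This code downloads the data from the [UCI Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) and [preprocesses](#数据预处理) it; the data is finally split into a training set and a test set.

The dataset has 506 rows, each containing information about one type of house in the Boston suburbs and the median price. The attributes have the following meanings: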
the median price
->
the median price of that type of house
| PTRATIO | Pupil-teacher ratio | Continuous |
| B | 1000(BK - 0.63)^2, where BK is the proportion of Black residents | Continuous |
| LSTAT | Proportion of low-income population | Continuous |
| MEDV | Median house price | Continuous |
Median house price
->
Median price of houses of the same type
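
### Data Preprocessing
#### Continuous and Discrete Values
Looking at the data, our first observation is that of the 13 attributes, 12 are continuous and 1 is discrete (CHAS). Although discrete values are also often represented by numbers such as 0, 1, and 2, their meaning differs from that of continuous values, because the differences between them carry no real meaning. For example, if we use 0, 1, and 2 to represent red, green, and blue, we cannot conclude that blue is farther from red than green is. So for a discrete attribute with $d$ possible values, we usually convert it into $d$ binary attributes that each take the value 0 or 1. Here, however, CHAS is already a binary attribute, so we are spared the trouble.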
binary attributes that each take the value 0 or 1
->
binary attributes that each take the value 0 or 1, or map each possible value to a multidimensional vector
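The conversion of a discrete attribute with $d$ possible values into $d$ binary attributes, as described in the hunk above, can be sketched as follows. The helper `one_hot` is hypothetical and only reuses the color example from the text; it is not code from the PR.

```python
def one_hot(value, categories):
    """Encode `value` as len(categories) binary attributes, exactly one of which is 1."""
    if value not in categories:
        raise ValueError("unknown category: %r" % (value,))
    return [1 if value == category else 0 for category in categories]


# The three-color example from the text: no color is "farther" from another.
colors = ["red", "green", "blue"]
encoded = one_hot("blue", colors)
print(encoded)
```

Under this encoding, any two distinct colors differ in exactly two positions, so the artificial ordering implied by the labels 0, 1, and 2 disappears.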
Just a placeholder for now. Contents will be added later.