数据清洗

其实网络数据的获取在大多数时候并没有stair1中说的那么简单。因为互联网上存在很多不规则的数据。像get()一个网络地址，它返回给你的可能是一团乱麻的HTML。这时候我们需要对它进行适当的解析和清洗。

解析

Beautiful Soup模块可以从HTML或XML文件中提取数据。

在此之前，你需要对HTML和CSS有初步的认识。

注意安装第四版，先前的版本已经停止维护了。

pip install beautifulsoup4

# 注意从第四版的bs4引入
from bs4 import BeautifulSoup

# 拿这段HTML举个例子
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

BeautifulSoup类接受两个参数，第一个是html字符串文本，我们传入html_doc，第二个是解析器，我们使用'html.parser'。

# 现在soup是一个实例化的BeautifulSoup对象了
soup = BeautifulSoup(html_doc, 'html.parser')

我们可以通过BeautifulSoup的prettify()方法对html进行格式化，使它具有美观的缩进。

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link2">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

现在，重点来了。我们可以通过BeautifulSoup的find_all()方法获取所有a标签。它返回一个由Tag对象组成的列表。[Tag,...,Tag]

some_list = soup.find_all('a')

print(some_list)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

遍历这个列表我们可以获得由所有名字组成的列表。

# 声明一个空列表，用于存放name
name_list = []

for i in some_list:
    # 通过Tag的text属性获取标签中的文字
    name = i.text
    # 列表的append()方法为其添加一个元素
    name_list.append(name)

print(name_list)
# ['Elsie', 'Lacie', 'Tillie']

这样我们就从html文件中提取出了一个包含所有名字的数组。更多的功能可以参考文档进行学习。

清洗

获得数据后我们需要将其保存起来。这个就是stair3中要讲的储存与持久化。

在保存之前，我们一般就会使用python内置函数对其进行适当的预处理。

而保存之后，我们主要使用pandas对数据进行清洗，当然，也会结合内置函数。

尽管多数时候我们要面对需要清洗的数据集，但我们始终要牢记，应该在数据采集录入这一步就做好脏数据的处理工作。

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

2.md

2.md

数据清洗

解析

清洗

缺失、重复、异常

缺失值的补全

重复值的删除

异常值的处理

单位、格式、类型

单位统一

格式转换

类型转化

Files

2.md

Latest commit

History

2.md

File metadata and controls

数据清洗

解析

清洗

缺失、重复、异常

缺失值的补全

重复值的删除

异常值的处理

单位、格式、类型

单位统一

格式转换

类型转化