准备和清理机器学习数据

https://www.jianshu.com/p/a8037a38e219 Lending Club贷款数据分析（上）

https://www.jianshu.com/p/4b4e0c343d3e Lending Club贷款数据分析（下）

https://www.peercube.com/blog/post/fico-score-trends-and-defaults-for-lending-club-loans

这个博客文章详细分析了是否会泄露数据.

在任何机器学习项目中, 清理和预处理数据都是很关键的第一步.在这片文章中, Daniel Osei带领大家练习检查数据集, 选择特征列, 数据可视化, 以及对特征进行编码.

After first reading about Machine Learning on Quora in 2015, Daniel became excited at the prospect of an area that could combine his love of Mathematics and Programming. After reading this article on how to learn data science, Daniel started following the steps, eventually joining Dataquest to learn Data Science with us in in April 2016.

在2015年第一次阅读关于Quora的机器学习之后，Daniel对能够结合他热爱的数学和编程领域的前景感到兴奋。在阅读了关于如何学习数据科学的文章之后，Daniel开始遵循这些步骤学习.

We'd like to thank Daniel for his hard work, and generously letting us publish this post. This walkthrough uses Python 3.5 and Jupyter notebook.

我在此感谢丹尼尔的努力工作, 并且慷慨的让我们发表这篇文章, 本演练使用Python 3.5和Jupyter notebook。

Understanding the Data

理解数据

Before you start working with data for a machine learning project, it is vital to understand what the data is, and what we want to achieve. Without it, we have no basis from which to make our decisions about what data is relevant as we clean and prepare our data.

在开始使用机器学习项目的数据之前，理解数据是什么以及我们想要实现什么是至关重要的。没有它，当我们清理和准备我们的数据时，我们没有任何基础来做出我们关于什么数据相关的决定。

Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return. Each borrower fills out a comprehensive application, providing their past financial history, the reason for the loan, and more. Lending Club evaluates each borrower's credit score using past historical data (and their own data science process!) and assigns an interest rate to the borrower.

Lending Club是一个个人贷款市场，它将那些正在寻求贷款的借款者和想要借款获得收益的投资者匹配起来. 每个借款人填写完整的申请表，提供他们过去的财务记录，贷款原因等等。借贷俱乐部使用过去的历史数据（以及他们自己的数据科学过程！）评估每个借款人的信用评分，并为借款人分配利率。

The Lending Club website.

贷款俱乐部网站

The loan is then listed on the Lending Club marketplace. You can read more about their marketplace here.

贷款然后在Lending Club市场上列出。你可以在这里阅读更多关于他们的市场。

Investors are primarily interested in receiving a return on their investments. Approved loans are listed on the Lending Club website, where qualified investors can browse recently approved loans, the borrower's credit score, the purpose for the loan, and other information from the application.

投资者主要兴趣在于获得投资回报。批准的贷款在Lending Club网站上列出，合格的投资者可以浏览最近批准的贷款，借款人的信用评分，贷款目的以及申请中的其他信息。

Once an investor decides to fund a loan, the borrower then makes monthly payments back to Lending Club. Lending Club redistributes these payments to the investors. This means that investors don't have to wait until the full amount is paid off to start to see money back. If a loan is fully paid off on time, the investors make a return which corresponds to the interest rate the borrower had to pay in addition to the requested amount. Many loans aren't completely paid off on time, however, and some borrowers default on the loan.

一旦投资者决定为贷款提供资金，借款人将按月支付给Lending Club。 Lending Club将这些付款重新分配给投资者。这意味着投资者不必等到全额还清才能开始收回资金。如果贷款按时全额还清，投资者将获得相应于借款人除了要求的金额外还要支付的利率的回报。然而，很多贷款并没有按时还清，有的借款人违约了贷款。

The Challenge

项目的挑战性

Suppose an investor has approached us and has asked us to build a machine learning model that can reliably predict if a loan will be paid off or not. This investor described himself/herself as a conservative investor who only wants to invest in loans that have a good chance of being paid off on time. Thus, this client is more interested in a machine learning model which does a good job of filtering out high percentage of loan defaulters.

假设投资者已经联系我们，并要求我们建立一个机器学习模型，可以可靠地预测贷款是否将会被偿还。这位投资者形容自己是一个保守的投资者，他们只想投资那些很有可能按时还清的贷款。因此，这个客户对机器学习模型更感兴趣，该模型很好地筛选出高比例的贷款拖欠者。

1. Examining the Data Set

1.检查数据集

Lending Club periodically releases data for all the approved and declined loan applications on their website. So you're working with the same data we are, we've mirrored the data on data.world. You can select different year ranges to download the dataset (in CSV format) for both approved and declined loans.

贷款俱乐部会定期在其网站上发布所有批准和拒绝贷款申请的数据。因此，您正在使用与我们相同的数据，我们已经对data.world上的数据进行了镜像。您可以选择不同的年份范围以下载已批准和已拒绝贷款的数据集（CSV格式）。

You'll also find a data dictionary (in XLS format), towards the bottom of the page, which contains information on the different column names. The data dictionary is useful to help understand what a column represents in the dataset.

您还可以在页面底部找到数据字典（采用XLS格式），其中包含有关不同列名称的信息。数据字典有助于理解数据集中列的表示方式。

The data dictionary contains two sheets:

数据字典包含两张表：

LoanStats sheet: describes the approved loans dataset
RejectStats sheet: describes the rejected loans dataset

LoanStats 数据表：描述核准的贷款数据集 RejectStats 数据表：描述被拒绝的贷款数据集

We'll be using the LoanStats sheet since we're interested in the approved loans dataset.

我们将使用LoanStats表，因为我们对批准的贷款数据集感兴趣。(为什么这些贷款会被批准)

The approved loans dataset contains information on current loans, completed loans, and defaulted loans. For this challenge, we'll be working with approved loans data for the years 2007 to 2011.

核准的贷款数据集包含当前贷款，已完成贷款和违约贷款的信息。对于这一项目，我们将使用2007年至2011年的批准的贷款数据。

First, lets import some of the libraries that we'll be using, and set some parameters to make the output easier to read.

首先，让我们导入一些我们将要使用的库，并设置一些参数以使输出更易于阅读。

import pandas as pd
import numpy as np
pd.set_option('max_columns', 120)
pd.set_option('max_colwidth', 5000)


import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams['figure.figsize'] = (12,8)

Loading The Data Into Pandas

将数据加载到Pandas中

We've downloaded our dataset and named it lending_club_loans.csv, but now we need to load it into a pandas DataFrame to explore it.

我们已经下载了我们的数据集，并将其命名为lending_club_loans.csv，但现在我们需要将其加载到一个pandas DataFrame 中来探索它。

To ensure that code run fast for us, we need to reduce the size of lending_club_loans.csvby doing the following:

为确保代码能够快速运行，我们需要通过执行以下操作来减少lending_club_loans.csv的大小：

Remove the first line: It contains extraneous text instead of the column titles. This text prevents the dataset from being parsed properly by the pandas library.
Remove the 'desc' column: it contains a long text explanation for the loan.
Remove the 'url' column: it contains a link to each on Lending Club which can only be accessed with an investor account.
Removing all columns with more than 50% missing values: This allows us to move faster since don't need to spend time trying to fill these values.
删除第一行：它包含无关文本而不是列标题。该文本可防止数据集被pandas库正确解析。
删除'desc'列：它包含对贷款的长文本解释。
删除'网址'栏：它包含Lending Club上的每个链接，只能通过投资者账户访问。
删除所有缺失值超过50％的列：这使我们能够更快地移动，因为不需要花费时间来填充这些值。

We'll also name the filtered dataset loans_2007 and later at the end of this section save it as loans_2007.csv to keep it separate from the raw data. This is good practice and makes sure we have our original data in case we need to go back and retrieve any of the original data we're removing.

我们还会将过滤后的数据集loan_2007命名为本节末，并将其保存为loans_2007.csv以使其与原始数据分开。这是很好的做法，并确保我们有我们的原始数据，以防我们需要返回并检索我们要删除的任何原始数据。

Now, let's go ahead and perform these steps:

现在，让我们继续并执行这些步骤：

# skip row 1 so pandas can parse the data properly.
loans_2007 = pd.read_csv('data/lending_club_loans.csv', skiprows=1, low_memory=False) 
half_count = len(loans_2007) / 2
loans_2007 = loans_2007.dropna(thresh=half_count,axis=1) # Drop any column with more than 50% missing values
loans_2007 = loans_2007.drop(['url','desc'],axis=1)      # These columns are not useful for our purposes

Let's use the pandas head() method to display first three rows of the loans_2007 DataFrame, just to make sure we were able to load the dataset properly:

让我们使用pandas head（）方法显示loans_2007 DataFrame的前三行，以确保我们能够正确加载数据集：

Let's also use pandas .shape attribute to view the number of samples and features we're dealing with at this stage:

2. Narrowing down our columns

2. 减少我们的列

It's a great idea to spend some time to familiarize ourselves with the columns in the dataset, to understand what each feature represents. This is important, because a poor understanding of the features could cause us to make mistakes in the data analysis and the modeling process.

花点时间熟悉数据集中的列，了解每个特征代表的是什么是一个好主意。这很重要，因为对功能的理解不足可能会导致我们在数据分析和建模过程中犯错误。

We'll be using the data dictionary Lending Club provided to help us become familiar with the columns and what each represents in the dataset. To make the process easier, we'll create a DataFrame to contain the names of the columns, data type, first row's values, and description from the data dictionary.

我们将使用LendingClub提供的数据字典帮助我们熟悉数据集中的列和每个元素。为了简化这个过程，我们将创建一个DataFrame来包含数据字典中列的名称，数据类型，第一行的值和描述。

To make this easier, we've pre-converted the data dictionary from Excel format to a CSV.

为了使这更容易，我们已经将数据字典从Excel格式预先转换为CSV。

data_dictionary = pd.read_csv('LCDataDictionary.csv') # Loading in the data dictionary
print(data_dictionary.shape[0])
print(data_dictionary.columns.tolist())

117
['LoanStatNew', 'Description']

data_dictionary.head()
data_dictionary = data_dictionary.rename(columns={'LoanStatNew': 'name',
                                                 'Description': 'description'})

Now that we've got the data dictionary loaded, let's join the first row of loans_2007 to the data_dictionary DataFrame to give us a preview DataFrame with the following columns:

现在我们已经加载了数据字典，让我们将loan_2007的第一行加入到data_dictionary DataFrame中，为我们预览DataFrame，其中包含以下列：

name — contains the column names of loans_2007.
dtypes — contains the data types of the loans_2007 columns.
first value — contains the values of loans_2007 first row.
description — explains what each column in loans_2007 represents.

loans_2007_dtypes = pd.DataFrame(loans_2007.dtypes,columns=['dtypes'])
loans_2007_dtypes = loans_2007_dtypes.reset_index()
loans_2007_dtypes['name'] = loans_2007_dtypes['index']
loans_2007_dtypes = loans_2007_dtypes[['name','dtypes']]

loans_2007_dtypes['first value'] = loans_2007.loc[0].values
preview = loans_2007_dtypes.merge(data_dictionary, on='name',how='left')

When we printed the shape of loans_2007 earlier, we noticed that it had 56 columns which also means this preview DataFrame has 56 rows. It can be cumbersome to try to explore all the rows of preview at once, so instead we'll break it up into three parts and look at smaller selection of features each time.

当我们早些时候打印出loan_2007的shape时，我们注意到它有56列，这也意味着这个预览DataFrame有56行。尝试一次性浏览预览的所有行可能会很麻烦，因此我们会将其分解为三部分，每次只查看较少的特征。

As you explore the features to better understand each of them, you'll want to pay attention to any column that:

在您探索特征以更好地了解每个特征时，您需要注意以下任何列：

leaks information from the future (after the loan has already been funded),
don't affect the borrower's ability to pay back the loan (e.g. a randomly generated ID value by Lending Club),
is formatted poorly,
requires more data or a lot of preprocessing to turn into useful a feature, or
contains redundant information.
从未来泄露信息（贷款已经获得资助后），
不影响借款人偿还贷款的能力（例如Lending Club随机生成的ID值），
格式不佳，
需要更多的数据或大量的预处理才能成为有用的特征，或者包含冗余信息。

I'll say it again to emphasize it because it's important: We need to especially pay close attention to data leakage, which can cause the model to overfit. This is because the model would be also learning from features that wouldn't be available when we're using it make predictions on future loans.

我会再次强调它，因为它很重要：我们需要特别关注数据泄漏，这可能导致模型过度拟合。这是因为该模型也将从我们使用它预测未来贷款时, 其实是不能使用的特征中学习。

First Group Of Columns

第一组特征

Let's display the first 19 rows of preview and analyze them:

让我们显示前19行预览并分析它们：

preview[:19]

我们的原则是只分析贷款发生之前的数据, 根据贷款发生之前的数据, 预测此人拖欠或者不换贷款的可能性, 因此对于贷款发生后的数据全部予以删除.

After analyzing the columns, we can conclude that the following features can be removed:

分析完这些列后，我们可以得出结论：可以删除以下功能：

id — randomly generated field by Lending Club for unique identification purposes only.
member_id — also randomly generated field by Lending Club for identification purposes only.
funded_amnt — leaks information from the future(after the loan is already started to be funded).
funded_amnt_inv — also leaks data from the future.
sub_grade — contains redundant information that is already in the gradecolumn (more below). 次级信用评价等级
int_rate — also included within the grade column.
emp_title — requires other data and a lot of processing to become potentially useful
issued_d — leaks data from the future.
id - 由Lending Club随机生成的字段，仅用于唯一标识目的。
member_id - Lending Club也是随机生成的字段，仅用于识别目的。
funded_amnt - 泄漏来自未来的信息（在贷款已经开始资助之后）。
funded_amnt_inv - 也泄漏未来的数据。
sub_grade - 包含已经在成绩栏中的冗余信息（更多下面）。
int_rate - 也包含在成绩列中。
emp_title - 需要其他数据和大量处理才能变得有用
issued_d - 泄露未来的数据。

个人分析:

grade:信用凭证, 银行开具的信用等级证明

sub_grade 更细化的信用等级证明

int_rate : 贷款利率是根据信用凭证等级给出的, 相同的sub_grade具有相同的贷款利率

emp_title: 职位, 这一行包含的不同类型太多, 想要分析需要使用文本挖掘, 职位等级评分等工作. 过于复杂

Lending Club uses a borrower's grade and payment term (30 or months) to assign an interest rate (you can read more about Rates & Fees). This causes variations in interest rate within a given grade. But, what may be useful for our model is to focus on clusters of borrowers instead of individuals. And, that's exactly what grading does - it segments borrowers based on their credit score and other behaviors, which is we should keep the grade column and drop interest int_rate and sub_grade.

Lending Club使用借款人的等级和付款期限（30天或者数月）来分配利率（您可以阅读关于费率和费用的更多信息）。这会导致给定等级内利率的变化。但是，对我们的模型可能有用的是集中于借款人集群而不是个人。而且，这正是评分所做的 - 它根据他们的信用评分和其他行为对借款人进行细分，这是我们应该保留评级列并放弃利率int_rate和sub_grade。

Let's drop these columns from the DataFrame before moving onto to the next group of columns.

让我们从DataFrame中删除这些列，然后再转到下一组列。

drop_list = ['id','member_id','funded_amnt','funded_amnt_inv',
             'int_rate','sub_grade','emp_title','issue_d']
loans_2007 = loans_2007.drop(drop_list,axis=1)

Second Group Of Columns

第二组列

我们来看下19列：

In this group,take note of the fico_range_low and fico_range_high columns. Both are in this second group of columns but because they related to some other columns, we'll talk more about them after looking at the last group of columns.

在此组中，请记下fico_range_low和fico_range_high列(我的数据集中没有这两列)。两者都在第二组列中，但因为它们与其他列相关，所以在查看最后一组列之后，我们会详细讨论它们。

We can drop the following columns:

我们可以删除一下的列:

zip_code - mostly redundant with the addr_state column since only the first 3 digits of the 5 digit zip code are visible.
out_prncp - leaks data from the future.
out_prncp_inv - also leaks data from the future.
total_pymnt - also leaks data from the future.
total_pymnt_inv - also leaks data from the future.
zip_code 由于只有5位数字邮政编码的前3位数字可见，因此zip_code - 大部分与addr_state列多余。
out_prncp - 泄漏未来的数据。剩余未偿还本金总额
out_prncp_inv - 也会泄漏未来的数据。由投资者投资的总金额中剩余的未偿还本金
total_pymnt - 也会泄漏未来的数据。迄今收到的付款资金总额
total_pymnt_inv - 也会泄漏未来的数据。迄今收到的付款总额中由投资者出资的部分

Let's go ahead and remove these 5 columns from the DataFrame:

让我们继续，从DataFrame中删除这5列：

drop_cols = [ 'zip_code','out_prncp','out_prncp_inv',
             'total_pymnt','total_pymnt_inv']
loans_2007 = loans_2007.drop(drop_cols, axis=1)

Third Group Of Columns

第三组列

Let's analyze the last group of features:

我们来分析一下最后一组特征：

In this last group of columns, we need to drop the following, all of which leak data from the future:

在最后一组专栏中，我们需要删除以下内容，所有这些内容都会泄露未来的数据：

total_rec_prncp 迄今收到的本金
total_rec_int迄今收到的利息
total_rec_late_fee迄今收到的滞纳金
recoveries贷款结束的总收益
collection_recovery_fee收发费
last_pymnt_d上个月收到付款
last_pymnt_amnt最后收到的付款总额

Let's drop our last group of columns:

drop_cols = ['total_rec_prncp','total_rec_int', 'total_rec_late_fee',
             'recoveries', 'collection_recovery_fee', 'last_pymnt_d',
             'last_pymnt_amnt']

loans_2007 = loans_2007.drop(drop_cols, axis=1)

Investigating FICO Score Columns

调查FICO分数列

Now, besides the explanations provided here in the Description column,let's learn more about fico_range_low, fico_range_high, last_fico_range_low, and last_fico_range_high.

现在，除了在说明栏中提供的解释外，让我们进一步了解fico_range_low, fico_range_high, last_fico_range_low, and last_fico_range_high.

FICO scores are a credit score, or a number used by banks and credit cards to represent how credit-worthy a person is. While there are a few types of credit scores used in the United States, the FICO score is the best known and most widely used.

FICO评分是信用评分，或者是银行和信用卡用来表示信用价值的人。虽然在美国使用的信用评分有几种类型，但FICO评分是最广为人知并且应用最广泛的。

When a borrower applies for a loan, Lending Club gets the borrowers credit score from FICO - they are given a lower and upper limit of the range that the borrowers score belongs to, and they store those values as fico_range_low, fico_range_high. After that, any updates to the borrowers score are recorded as last_fico_range_low, and last_fico_range_high.

当借款人申请贷款时，Lending Club从FICO获得借款人信用评分 - 他们获得借款人评分所属范围的上限和下限，并将这些值存储为fico_range_low，fico_range_high。之后，对借款人分数的任何更新记录为last_fico_range_low和last_fico_range_high。

A key part of any data science project is to do everything you can to understand the data. While researching this data set, I found a project done in 2014 by a group of students from Stanford University on this same dataset.

任何数据科学项目的关键部分是尽你所能去理解数据。在研究这个数据集的同时，我发现一个项目是由斯坦福大学的一组学生在2014年完成的，该项目在这个数据集上完成。

In the report for the project, the group listed the current credit score (last_fico_range) among late fees and recovery fees as fields they mistakenly added to the features but state that they later learned these columns all leak information into the future.

在该项目的报告中，该组织列出了当前信用评分（last_fico_range）中滞纳金和回收费作为他们错误添加到功能的字段，但声明他们后来了解到这些列都泄露了未来的信息。

However, following this group's project, another group from Stanford worked on this same Lending Club dataset. They used the FICO score columns, dropping only last_fico_range_low, in their modeling. This second group's report described last_fico_range_high as the one of the more important features in predicting accurate results.

然而，在这个小组的项目之后，另一个来自斯坦福大学的小组也参与了这个同样的LendingClub数据集。他们在他们的建模中使用了FICO评分列，只删除了last_fico_range_low。第二组的报告将last_fico_range_high描述为预测准确结果中更重要的特征之一。

The question we must answer is, do the FICO credit scores information into the future? Recall a column is considered leaking information when especially it won't be available at the time we use our model - in this case when we use our model on future loans.

我们必须回答的问题是，FICO信用评分信息是否能够用于未来？回想一个列被认为是泄漏信息，特别是在我们使用我们的模型时它不可用 - 在这种情况下，当我们在未来的贷款上使用我们的模型时。

This blog examines in-depth the FICO scores for lending club loans, and notes that while looking at the trend of the FICO scores is a great predictor of whether a loan will default, that because FICO scores continue to be updated by the Lending Club after a loan is funded, a defaulting loan can lower the borrowers score, or in other words, will leak data.

这个博客深入研究了贷款俱乐部贷款的FICO分数，并指出，在查看FICO分数趋势时，可以预测贷款是否违约，因为FICO分数会在Lending Club贷款后继续更新，违约贷款可以降低借款人的分数，换句话说，会泄露数据。

https://www.peercube.com/blog/post/fico-score-trends-and-defaults-for-lending-club-loans

这个博客文章详细分析了是否会泄露数据.

Therefore we can safely use fico_range_low and fico_range_high, but not last_fico_range_low, and last_fico_range_high. Lets take a look at the values in these columns:

因此，我们可以安全地使用fico_range_low和fico_range_high，但不能使用last_fico_range_low和last_fico_range_high。让我们看看这些列中的值：

print(loans_2007['fico_range_low'].unique())
print(loans_2007['fico_range_high'].unique())

我的数据集中不包含这两列

Let's get rid of the missing values, then plot histograms to look at the ranges of the two columns:

让我们摆脱缺失的值，然后绘制直方图来查看两列的范围：

fico_columns = ['fico_range_high','fico_range_low']

print(loans_2007.shape[0])
loans_2007.dropna(subset=fico_columns,inplace=True)
print(loans_2007.shape[0])

loans_2007[fico_columns].plot.hist(alpha=0.5,bins=20);

Let's now go ahead and create a column for the average of fico_range_low and fico_range_high columns and name it fico_average. Note that this is not the average FICO score for each borrower, but rather an average of the high and low range that we know the borrower is in.

现在让我们继续为fico_range_low和fico_range_high列的平均值创建一个列，并将其命名为fico_average。请注意，这不是每个借款人的平均FICO评分，而是我们知道借款人所处的高低范围的平均值。

loans_2007['fico_average'] = (loans_2007['fico_range_high'] + loans_2007['fico_range_low']) / 2

Let's check what we just did.

让我们来看看我们刚刚做了什么。

cols = ['fico_range_low','fico_range_high','fico_average']
loans_2007[cols].head()

Good! We got the mean calculations and everything right. Now, we can go ahead and drop fico_range_low, fico_range_high, last_fico_range_low, and last_fico_range_highcolumns.

好！我们得到了平均值, 一切正常。现在，我们可以继续删除fico_range_low，fico_range_high，last_fico_range_low和last_fico_range_high列。

drop_cols = ['fico_range_low','fico_range_high','last_fico_range_low',
             'last_fico_range_high']
loans_2007 = loans_2007.drop(drop_cols, axis=1)
loans_2007.shape

注意，通过熟悉数据集中的列，我们可以将列数从56减少到33。

Decide On A Target Column

决定一个目标列

Now, let's decide on the appropriate column to use as a target column for modeling - keep in mind the main goal is predict who will pay off a loan and who will default.

现在，让我们决定将适当的列用作建模的目标列 - 请记住主要目标是预测谁将偿还贷款以及谁将违约。

We learned from the description of columns in the preview DataFrame that loan_status is the only field in the main dataset that describe a loan status, so let's use this column as the target column.

我们从预览DataFrame中的列描述中了解到，loan_status是主数据集中唯一描述贷款状态的字段，因此我们将此列用作目标列。

Currently, this column contains text values that need to be converted to numerical values to be able use for training a model.

目前，该列包含需要转换为数值才能用于训练模型的文本值。

Let's explore the different values in this column and come up with a strategy for converting the values in this column. We'll use the DataFrame method value_counts()to return the frequency of the unique values in the loan_status column.

让我们来探索这一列中的不同值，并提出一个转换此列中值的策略。我们将使用DataFrame方法value_counts（）返回loan_status列中唯一值的频率。

loans_2007["loan_status"].value_counts()

The loan status has nine different possible values!

贷款状态有九种不同的可能值！

Let's learn about these unique values to determine the ones that best describe the final outcome of a loan, and also the kind of classification problem we'll be dealing with.

让我们了解这些独特的值，以确定最能描述贷款最终结果的值，以及我们将要处理的分类问题的种类。

You can read about most of the different loan statuses on the Lending Club website as well as these posts on the Lend Academy and Orchard forums. I have pulled that data together in a table below so we can see the unique values, their frequency in the dataset and what each means:

您可以阅读Lending Club网站上的大部分不同贷款状态以及Lend Academy和Orchard论坛上的这些帖子。我已经将这些数据放在下面的表格中，以便我们可以看到独特的值，数据集中的频率以及每种数据的含义：

meaning = [
    "Loan has been fully paid off.",
    "Loan for which there is no longer a reasonable expectation of further payments.",
    "While the loan was paid off, the loan application today would no longer meet the credit policy and wouldn't be approved on to the marketplace.",
    "While the loan was charged off, the loan application today would no longer meet the credit policy and wouldn't be approved on to the marketplace.",
    "Loan is up to date on current payments.",
    "The loan is past due but still in the grace period of 15 days.",
    "Loan hasn't been paid in 31 to 120 days (late on the current payment).",
    "Loan hasn't been paid in 16 to 30 days (late on the current payment).",
    "Loan is defaulted on and no payment has been made for more than 121 days."]

status, count = loans_2007["loan_status"].value_counts().index, loans_2007["loan_status"].value_counts().values

loan_statuses_explanation = pd.DataFrame({'Loan Status': status,'Count': count,'Meaning': meaning})[['Loan Status','Count','Meaning']]
loan_statuses_explanation

Remember, our goal is to build a machine learning model that can learn from past loans in trying to predict which loans will be paid off and which won't. From the above table, only the Fully Paid and Charged Off values describe the final outcome of a loan. The other values describe loans that are still on going, and even though some loans are late on payments, we can't jump the gun and classify them as Charged Off.

记住，我们的目标是建立一个机器学习模型，可以从过去的贷款中学习，试图预测哪些贷款将被偿还，哪些不会。从上表中可以看出，只有“全额支付”和“扣除”值才能描述贷款的最终结果。其他值描述了仍在进行中的贷款，即使有些贷款迟付，我们也无法跳过gun并将它们分类为已关闭。

Also, while the Default status resembles the Charged Off status, in Lending Club's eyes, loans that are charged off have essentially no chance of being repaid while default ones have a small chance. Therefore, we should use only samples where the loan_status column is 'Fully Paid' or 'Charged Off'.

此外，尽管默认状态与已关闭状态类似，但在Lending Club的眼中，被扣除的贷款基本上没有被偿还的机会，而默认的贷款机会很小。因此，我们只应使用loan_status列为“全额付款”或“关闭”的样本。

We're not interested in any statuses that indicate that the loan is ongoing or in progress, because predicting that something is in progress doesn't tell us anything.

我们对任何表明贷款正在进行或正在进行的状态不感兴趣，因为预测有些事情正在进行中并没有告诉我们任何事情。

Since we're interested in being able to predict which of these 2 values a loan will fall under, we can treat the problem as binary classification.

由于我们有兴趣能够预测贷款将落入这两个值中的哪一个，我们可以将该问题视为二元分类。

Let's remove all the loans that don't contain either 'Fully Paid' or 'Charged Off' as the loan's status and then transform the 'Fully Paid' values to 1 for the positive case and the 'Charged Off' values to 0 for the negative case.

让我们删除所有不包含“完全支付”或“已关闭”的贷款作为贷款的状态，然后将“完全支付”值转换为1，对于正面情况，将“已关闭”值转换为0 消极的情况。

This will mean that out of the ~42,000 rows we have, we'll be removing just over 3,000.

这意味着我们有42,000行，我们将删除超过3,000。

There are few different ways to transform all of the values in a column, we'll use the DataFrame method replace().

有几种不同的方法来转换列中的所有值，我们将使用DataFrame方法replace（）。

loans_2007 = loans_2007[(loans_2007["loan_status"] == "Fully Paid") |
                            (loans_2007["loan_status"] == "Charged Off")]

mapping_dictionary = {"loan_status":{ "Fully Paid": 1, "Charged Off": 0}}
loans_2007 = loans_2007.replace(mapping_dictionary)

Remove Columns with only One Value

删除只有一个值的列

To wrap up this section, let's look for any columns that contain only one unique value and remove them. These columns won't be useful for the model since they don't add any information to each loan application. In addition, removing these columns will reduce the number of columns we'll need to explore further in the next stage.

为了结束本节，我们来看看只包含一个唯一值的列并将其删除。这些列对模型没有用处，因为它们不会向每个贷款申请添加任何信息。另外，删除这些列将减少我们需要在下一阶段进一步探索的列数。

The pandas Series method nunique() returns the number of unique values, excluding any null values. We can use apply this method across the dataset to remove these columns in one easy step.

The pandas Series method nunique() 返回唯一值的数量，不包括任何空值。我们可以在整个数据集中应用这种方法，只需一个简单的步骤即可删除这些列。

loans_2007 = loans_2007.loc[:,loans_2007.apply(pd.Series.nunique) != 1]

Again, there may be some columns with more than one unique values but one of the values has insignificant frequency in the dataset. Let's find out and drop such column(s):

同样，可能有一些列具有多个唯一值，但其中一个值在数据集中具有不显着的频率。让我们找出并删除这样的列：

for col in loans_2007.columns:
    if (len(loans_2007[col].unique()) < 4):
        print(loans_2007[col].value_counts())
        print()

36 months    29096
 60 months    10143
Name: term, dtype: int64

Not Verified       16845
Verified           12526
Source Verified     9868
Name: verification_status, dtype: int64

1    33586
0     5653
Name: loan_status, dtype: int64

n    39238
y        1
Name: pymnt_plan, dtype: int64

The payment plan column (pymnt_plan) has two unique values, 'y' and 'n', with 'y'occurring only once. Let's drop this column:

付款计划栏（pymnt_plan）有两个唯一值'y'和'n'，'y'只出现一次。让我们删除这一列：

print(loans_2007.shape[1])
loans_2007 = loans_2007.drop('pymnt_plan', axis=1)
print("We've been able to reduced the features to => {}".format(loans_2007.shape[1]))

25
We've been able to reduced the features to => 24

Lastly, lets save our work in this section to a CSV file.

最后，让我们将本节中的工作保存到CSV文件中。

loans_2007.to_csv("processed_data/filtered_loans_2007.csv",index=False)

Visualizing the Target Column Outcomes

可视化目标列结果

fig, axs = plt.subplots(1,2,figsize=(14,7))
sns.countplot(x='loan_status',data=filtered_loans,ax=axs[0])
axs[0].set_title("Frequency of each Loan Status")
filtered_loans.loan_status.value_counts().plot(x=None,y=None, kind='pie', ax=axs[1],autopct='%1.2f%%')
axs[1].set_title("Percentage of each Loan status")
plt.show()

These plots indicate that a significant number of borrowers in our dataset paid off their loan - 85.62% of loan borrowers paid off amount borrowed, while 14.38% unfortunately defaulted. From our loan data it is these 'defaulters' that we're more interested in filtering out as much as possible to reduce loses on investment returns.

这些图显示我们数据集中的大量借款人还清了贷款 - 贷款借款人中85.62％偿还了贷款，而不幸拖欠了14.38％。从我们的贷款数据来看，这些“违约者”是我们更有兴趣尽可能地过滤掉，以减少投资回报的损失。

3. Preparing the Features for Machine Learning

3.准备机器学习的特征

In this section, we'll prepare the filtered_loans_2007.csv data for machine learning. We'll focus on handling missing values, converting categorical columns to numeric columns and removing any other extraneous columns.

在本节中，我们将为机器学习准备filtered_loans_2007.csv数据。我们将重点处理缺失的值，将分类列转换为数字列并删除任何其他无关的列。

We need to handle missing values and categorical features before feeding the data into a machine learning algorithm, because the mathematics underlying most machine learning models assumes that the data is numerical and contains no missing values. To reinforce this requirement, scikit-learn will return an error if you try to train a model using data that contain missing values or non-numeric values when working with models like linear regression and logistic regression.

在将数据提供给机器学习算法之前，我们需要处理缺失值和分类特征，因为大多数机器学习模型的基础数学假设数据是数字并且不包含缺失值。为了加强这一要求，如果在使用线性回归和逻辑回归等模型时尝试使用包含缺失值或非数值的数据来训练模型，scikit-learn将返回错误。

Here's an outline of what we'll be doing in this stage:

下面概述了我们在这个阶段将要做的事情：

Handle Missing Values
- Investigate Categorical Columns
- Convert Categorical Columns To Numeric Features
- Map Ordinal Values To Integers
- Encode Nominal Values As Dummy Variables
处理缺失的值
- 查明分类的列
- 将分类列转换为数字特征
- 将序数值映射到整数
- 将名义值编码为虚拟变量

First though, let's load in the data from last section's final output:

首先，让我们加载上一节最终输出的数据：

filtered_loans = pd.read_csv('processed_data/filtered_loans_2007.csv')
print(filtered_loans.shape)
filtered_loans.head()

Handle Missing Values

处理缺失值

我们来计算缺失值的数量并确定如何处理它们。我们可以通过以下方式在DataFrame中返回缺失值的数量：

First, use the Pandas DataFrame method isnull()to return a DataFrame containing Boolean values:

True if the original value is null
False if the original value isn't null
Then, use the Pandas DataFrame method sum() to calculate the number of null values in each column.

首先，使用Pandas DataFrame方法isnull()返回包含布尔值的DataFrame：如果原始值为空，则为真如果原始值不为空，则为False

然后，使用Pandas DataFrame方法sum（）计算每列中的空值数。

null_counts = filtered_loans.isnull().sum()
print("Number of null values in each column:\n{}".format(null_counts))

Notice while most of the columns have 0 missing values, title has 9 missing values, revol_util has 48, and pub_rec_bankruptcies contains 675 rows with missing values. Let's remove columns entirely where more than 1% (392) of the rows for that column contain a null value. In addition, we'll remove the remaining rows containing null values, which means we'll lose a bit of data, but in return keep some extra features to use for prediction.

注意，虽然大多数列有0个缺失值，但title有9个缺失值，revol_util有48个，pub_rec_bankruptcies包含675行缺少值。让我们完全删除列，其中该列的多于1％（392）行包含空值。另外，我们将删除包含空值的剩余行，这意味着我们将丢失一些数据，但是作为回报，保留一些额外的特征用于预测。

This means that we'll keep the title and revol_util columns, just removing rows containing missing values, but drop the pub_rec_bankruptcies column entirely since more than 1% of the rows have a missing value for this column.

这意味着我们将保留title和revol_util 列，只删除包含缺失值的行，但完全删除pub_rec_bankruptcies 列，因为超过1％的行对此列缺少值。

Here's a list of steps we can use to achieve that:

Use the drop method to remove the pub_rec_bankruptcies column from filtered_loans.
Use the dropna method to remove all rows from filtered_loans containing any missing values.

以下是我们可以用来实现的步骤列表：

使用drop方法从filtered_loans中删除pub_rec_bankruptcies列。
使用dropna方法删除filtered_loans中包含缺失值的所有行。

filtered_loans = filtered_loans.drop("pub_rec_bankruptcies",axis=1)
filtered_loans = filtered_loans.dropna()

Next, we'll focus on the categorical columns.

接下来，我们将重点介绍分类列。

Investigate Categorical Columns

调查分类列

Keep in mind, the goal in this section is to have all the columns as numeric columns (int or float data type), and containing no missing values. We just dealt with the missing values, so let's now find out the number of columns that are of the object data type and then move on to process them into numeric form.

请记住，本节中的目标是将所有列都作为数字列（int或float数据类型），并且不包含缺少的值。我们只处理了缺失的值，现在让我们找出对象数据类型的列数，然后继续将它们处理为数字形式。

print("Data types and their frequency\n{}".format(filtered_loans.dtypes.value_counts()))

Data types and their frequency
float64    11
object     11
int64       1
dtype: int64

We have 11 object columns that contain text which need to be converted into numeric features. Let's select just the object columns using the DataFrame method select_dtype, then display a sample row to get a better sense of how the values in each column are formatted.

我们有11个对象列包含需要转换为数字特征的文本。让我们使用DataFrame方法select_dtype选择对象列，然后显示样本行以更好地了解每列中值的格式。

object_columns_df = filtered_loans.select_dtypes(include=['object'])
print(object_columns_df.iloc[0])

term                     36 months
grade                            B
emp_length               10+ years
home_ownership                RENT
verification_status       Verified
purpose                credit_card
title                     Computer
addr_state                      AZ
earliest_cr_line          Jan-1985
revol_util                   83.7%
last_credit_pull_d        Sep-2016
Name: 0, dtype: object

Notice that revol_util column contains numeric values, but is formatted as object. We learned from the description of columns in the preview DataFrame earlier that revol_util is a revolving line utilization rate or the amount of credit the borrower is using relative to all available credit (read more here).

注意该revol_util列包含数值，但被格式化为对象。我们之前从previewDataFrame中的列描述中了解到，这revol_util是一种循环线使用率或借款人相对于所有可用信用额度的信用额度（请在此处阅读更多内容）。

We need to format revol_util as numeric values. Here's what we should do:

Use the str.rstrip() string method to strip the right trailing percent sign (%).
On the resulting Series object, use the astype() method to convert to the type float.
Assign the new Series of float values back to the revol_util column in the filtered_loans.

我们需要格式化revol_util为数值。以下是我们应该做的事情：

使用str.rstrip()字符串方法去除右边的百分号（%）。
在生成的Series对象上，使用该astype()方法转换为该类型float。
将新的系列浮动值重新分配到中的revol_util列filtered_loans。

filtered_loans['revol_util'] = filtered_loans['revol_util'].str.rstrip('%').astype('float')

Moving on, these columns seem to represent categorical values:

home_ownership — home ownership status, can only be 1 of 4 categorical values according to the data dictionary.
verification_status — indicates if income was verified by Lending Club.
emp_length — number of years the borrower was employed upon time of application.
term — number of payments on the loan, either 36 or 60.
addr_state — borrower's state of residence.
grade — LC assigned loan grade based on credit score.
purpose — a category provided by the borrower for the loan request.
title — loan title provided the borrower.

继续，这些列似乎代表了分类值：

home_ownership - 房屋所有权状态，根据数据字典只能为4个分类值中的1个。
verification_status - 表示收入是否由Lending Club验证。
emp_length - 借款人在申请时受聘的年限。
term - 贷款的支付数量，36或60。
addr_state - 借款人的居住状态。
grade - LC根据信用评分分配的贷款等级。
purpose- 借款人为贷款申请提供的类别。
title - 借款人提供的贷款头衔。

To be sure, lets confirm by checking the number of unique values in each of them.

Also, based on the first row's values for purpose and title, it appears these two columns reflect the same information. We'll explore their unique value counts separately to confirm if this is true.

Lastly, notice the first row's values for both earliest_cr_line and last_credit_pull_dcolumns contain date values that would require a good amount of feature engineering for them to be potentially useful:

可以肯定的是，我们通过检查每一个列中唯一值的数量来确认。

另外，根据purpose 和 title的第一行值，看起来这两列反映了相同的信息。我们将分别探讨其独特的价值计数，以确认这是否属实。

最后，注意earliest_cr_line和last_credit_pull_d列的第一行值包含日期值，这些日期值需要大量的特征工程，以便它们可能有用：

earliest_cr_line — The month the borrower's earliest reported credit line was opened
last_credit_pull_d — The most recent month Lending Club pulled credit for this loan
earliest_cr_line - 借款人最早报告的信用额度被打开的月份 last_credit_pull_d - 最近一个月的Lending Club为这笔贷款提供贷款

We'll remove these date columns from the DataFrame.

First, let's explore the unique value counts of the six columns that seem like they contain categorical values

我们将从DataFrame中删除这些日期列。

首先，让我们探索看起来像包含分类值的六列的唯一值计数

cols = ['home_ownership', 'grade','verification_status', 'emp_length', 'term', 'addr_state']
for name in cols:
    print(name,':')
    print(object_columns_df[name].value_counts(),'\n')

Most of these columns contain discrete categorical values which we can encode as dummy variables and keep. The addr_state column, however,contains too many unique values, so it's better to drop this.

Next, let's look at the unique value counts for the purpose and title columns to understand which columns we want to keep.

这些列中的大多数都包含离散的分类值，我们可以将其编码为虚拟变量并保留。但是，addr_state列包含太多的唯一值，所以最好放弃它。

接下来，让我们看看目标和标题列的唯一值计数，以了解我们想保留哪些列。

for name in ['purpose','title']:
    print("Unique Values in column: {}\n".format(name))
    print(filtered_loans[name].value_counts(),'\n')

It appears the purpose and title columns do contain overlapping information, but the purpose column contains fewer discrete values and is cleaner, so we'll keep it and drop title.

Lets drop the columns we've decided not to keep so far:

看起来目的和标题列确实包含重叠信息，但目的列包含的离散值较少，并且更清晰，因此我们将保留它并放弃标题。

让我们放下我们已经决定不保留的列：

drop_cols = ['last_credit_pull_d','addr_state','title','earliest_cr_line']
filtered_loans = filtered_loans.drop(drop_cols,axis=1)

Convert Categorical Columns to Numeric Features

将分类列转换为数字特征

First, let's understand the two types of categorical features we have in our dataset and how we can convert each to numerical features:

首先，让我们了解我们在数据集中具有的两类分类特征以及如何将每种特征转换为数字特征：

Ordinal values: these categorical values are in natural order. That's you can sort or order them either in increasing or decreasing order. For instance, we learnt earlier that Lending Club grade loan applicants from A to G, and assign each applicant a corresponding interest rate - grade A is less riskier while grade B is riskier than A in that order:
序数值：这些分类值按自然顺序排列。那就是你可以按照递增或递减顺序对它们进行排序或排序。例如，我们先前了解到从A到G的贷款俱乐部级贷款申请人，并为每个申请人分配相应的利率 - 等级A的风险较低，而等级B的风险高于A的顺序：
A < B < C < D < E < F < G ; where < means less riskier than
Nominal Values: these are regular categorical values. You can't order nominal values. For instance, while we can order loan applicants in the employment length column (emp_length) based on years spent in the workforce:
标准值定值：这些是正常的分类值。您无法操作标称值。例如，虽然我们可以根据工作年限在雇佣长度栏（emp_length）中操作贷款申请人：

we can't do that with the column purpose. It wouldn't make sense to say:

我们不能对特征purpose这么做。这样做是说不出什么道理的：

car < wedding < education < moving < house

These are the columns we now have in our dataset:

这些是我们现在在我们的数据集中的列：

Ordinal Values

grade
emp_length

Nominal Values

home_ownership
verification_status
purpose
term

There are different approaches to handle each of these two types. In the steps following, we'll convert each of them accordingly.

To map the ordinal values to integers, we can use the pandas DataFrame method replace() to map both grade and emp_length to appropriate numeric values

有两种处理这两种类型的方法。在下面的步骤中，我们将相应地转换它们中的每一个。

要将序数值映射为整数，我们可以使用pandas DataFrame方法replace（）将等级和emp_length映射到适当的数值

mapping_dict = {
    "emp_length": {
        "10+ years": 10,
        "9 years": 9,
        "8 years": 8,
        "7 years": 7,
        "6 years": 6,
        "5 years": 5,
        "4 years": 4,
        "3 years": 3,
        "2 years": 2,
        "1 year": 1,
        "< 1 year": 0,
        "n/a": 0

    },
    "grade":{
        "A": 1,
        "B": 2,
        "C": 3,
        "D": 4,
        "E": 5,
        "F": 6,
        "G": 7
    }
}

filtered_loans = filtered_loans.replace(mapping_dict)
filtered_loans[['emp_length','grade']].head()

Perfect! Let's move on to the Nominal Values. The approach to converting nominal features into numerical features is to encode them as dummy variables. The process will be:

完美！让我们继续讨论分类变量值。将分类特征转换为数字特征的方法是将它们编码为虚拟变量。该过程将是：

Use pandas' get_dummies() method to return a new DataFrame containing a new column for each dummy variable
Use the concat() method to add these dummy columns back to the original DataFrame
Then drop the original columns entirely using the drop method
使用pandas的get_dummies（）方法为每个虚拟变量返回一个包含新列的DataFrame 使用concat（）方法将这些虚拟列添加回原始DataFrame 然后使用drop方法完全删除原始列

Lets' go ahead and encode the nominal columns that we now have in our dataset.

让我们继续前进，并对我们现在在数据集中的分类列进行编码。

nominal_columns = ["home_ownership", "verification_status", "purpose", "term"]
dummy_df = pd.get_dummies(filtered_loans[nominal_columns])
filtered_loans = pd.concat([filtered_loans, dummy_df], axis=1)
filtered_loans = filtered_loans.drop(nominal_columns, axis=1)

To wrap things up, let's inspect our final output from this section to make sure all the features are of the same length, contain no null value, and are numericals.

Let's use pandas info method to inspect the filtered_loans DataFrame:

为了包装起来，让我们检查一下本节的最终输出，以确保所有特征的长度相同，不包含空值，并且是数字。

让我们使用pandas info方法来检查filtered_loans DataFrame：

filtered_loans.info()

Save to CSV

保存成CSV

It is a good practice to store the final output of each section or stage of your workflow in a separate csv file. One of the benefits of this practice is that it helps us to make changes in our data processing flow without having to recalculate everything.

将工作流程的每个部分或阶段的最终输出存储在单独的csv文件中是一种很好的做法。这种做法的好处之一是，它可以帮助我们改变数据处理流程，而无需重新计算所有内容。

filtered_loans.to_csv("processed_data/cleaned_loans_2007.csv",index=False)

Next Steps

下一步

In this post, we used the Data Dictionary Lending Club provided with the Loans_2007 DataFrame's first row's values to become familiar with the columns in the dataset and were able to removed many columns that aren't useful for modeling. We also selected loan_status as our target column and decided to focus our modeling efforts on binary classification.

在本文中，我们使用Loans_2007 DataFrame的第一行值提供的Data Dictionary Lending Club来熟悉数据集中的列，并且能够删除许多对建模无用的列。我们还选择loan_status作为我们的目标列，并决定将我们的建模工作集中在二元分类上。

Then, we performed the last amount of data preparation necessary to get the features into data types that can be fed into machine learning algorithms. We converted all columns of object data type(Categorical features) to numerical values because those are the only type of values scikit-learn can work with.

然后，我们执行最后的数据准备工作，以便将这些特征导入可以输入机器学习算法的数据类型中。我们将对象数据类型（分类特征）的所有列转换为数值，因为这些是scikit-learn可以使用的唯一值类型。

If you're interested in working more with this data set, you might like to check out our interactive machine learning walkthrough course which covers the next steps in working with the Lending Club data.

如果您有兴趣更多地使用此数据集，可以查看我们的交互式机器学习演练课程，其中介绍了使用Lending Club数据的后续步骤。

Files

文章翻译.md

Latest commit

History