2.2.1
互联网金融飞速发展,使得个人金融理财变得越来越容易。而其中信用评分技术是一种对贷款申请人(信用卡申请人)做风险评估分值的统计模型,可以根据客户提供的资料、客户的历史数据、第三方平台数据(芝麻分、京东、微信等),对客户的信用进行评估。现要求根据提供的finance数据集,补全2.2.1.ipynb代码。选择合适的特征,开发一个申请的评分模型,利用测试工具对模型进行测试,并对测试结果进行分析,完成测试报告,并运用工具对错误原因进行纠正。
(1)正确加载数据集,显示前五行的数据。
(2)使用Logistic模型进行模型训练,要求设定自变量和因变量,并根据自变量特征进行模型训练,最终将训练好的模型以文件名2.2.1_model.pkl保存到考生文件夹,结果文件以2.2.1_results.txt保存到考生文件夹。
(3)使用测试工具对模型进行测试,并记录测试结果,命名2.2.1_report.txt,保存到考生文件夹
(4)对测试结果进行详细分析,并编写测试报告,包括模型性能评估、错误分析及改进建议,将答案写到答题卷文件中,答题卷文件命名为“2.2.1.docx”,保存到考生文件夹。
(5)运用工具分析算法中错误案例产生的原因并进行纠正,重新得到模型训练结果,以文件名2.2.1_results_xg.txt保存到考生文件夹。
(6)将以上代码以及运行结果,以html格式保存并命名为2.2.1.html,保存到考生文件夹,考生文件夹命名为“准考证号+身份证后6位”。
数据集说明:
Unnamed: 0 - 索引号。
SeriousDlqin2yrs - 个人在过去两年内是否出现过严重的拖欠(1 表示有严重拖欠,0 表示没有)。
RevolvingUtilizationOfUnsecuredLines - 这是指个人未偿还的信用额度与总信用额度的比例。
age - 客户的年龄。
NumberOfTime30-59DaysPastDueNotWorse - 在过去一段时间内,贷款逾期30至59天的次数。
DebtRatio - 债务比率。
MonthlyIncome - 客户的月收入。
NumberOfOpenCreditLinesAndLoans - 正在使用的信贷账户或贷款的数量。
NumberOfTimes90DaysLate - 贷款逾期超过90天的次数。
NumberRealEstateLoansOrLines - 持有的房地产相关贷款或信贷的数量。
NumberOfTime60-89DaysPastDueNotWorse - 贷款逾期60至89天的次数。
NumberOfDependents - 家庭中依赖该个人的人数。
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pickle
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
# 加载数据
data = __________
# 显示前五行的数据
print(__________)
# 选择自变量和因变量
X = data.drop(['SeriousDlqin2yrs', 'Unnamed: 0'], axis=1)
y = data['SeriousDlqin2yrs']
# 分割训练集和测试集(测试集20%)
X_train, X_test, y_train, y_test = __________(__________, random_state=42)
# 训练Logistic回归模型(最大迭代次数为1000次)
model = __________
#训练 Logistic 回归模型
__________
# 保存模型
with open('2.2.1_model.pkl', 'wb') as file:
pickle.__________
# 预测并保存结果
y_pred = __________
pd.DataFrame(y_pred, columns=['预测结果']).to_csv('2.2.1_results.txt', index=False)
# 生成测试报告
report = classification_report(y_test, y_pred, zero_division=1)
with open('2.2.1_report.txt', 'w') as file:
file.write(report)
# 分析测试结果
accuracy = __________
print(f"模型准确率: {accuracy:.2f}")
# 处理数据不平衡
smote = SMOTE(random_state=42)
X_resampled, y_resampled = __________
# 重新训练模型
__________
# 重新预测
y_pred_resampled = __________
# 保存新结果
pd.DataFrame(y_pred_resampled, columns=['预测结果']).to_csv('2.2.1_results_xg.txt', index=False)
# 生成新的测试报告
report_resampled = classification_report(y_test, y_pred_resampled, zero_division=1)
with open('2.2.1_report_xg.txt', 'w') as file:
file.write(report_resampled)
# 分析新的测试结果
accuracy_resampled = __________
print(f"重新采样后的模型准确率: {accuracy_resampled:.2f}")
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pickle
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
# 加载数据
data = pd.read_csv('finance数据集.csv')
# 显示前五行的数据
print(data.head(5))
# 选择自变量和因变量
X = data.drop(['SeriousDlqin2yrs', 'Unnamed: 0'], axis=1)
y = data['SeriousDlqin2yrs']
# 分割训练集和测试集(测试集20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 训练Logistic回归模型(最大迭代次数为1000次)
model = LogisticRegression(max_iter=200)
#训练 Logistic 回归模型
model.fit(X_train, y_train)
# 保存模型
with open('2.2.1_model.pkl', 'wb') as file:
pickle.dump(model,file)
# 预测并保存结果
y_pred = model.predict(X_test)
pd.DataFrame(y_pred, columns=['预测结果']).to_csv('2.2.1_results.txt', index=False)
# 生成测试报告
report = classification_report(y_test, y_pred, zero_division=1)
with open('2.2.1_report.txt', 'w') as file:
file.write(report)
# 分析测试结果
accuracy = (y_test==y_pred).mean()
print(f"模型准确率: {accuracy:.2f}")
# 处理数据不平衡
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train,y_train)
# 重新训练模型
model.fit(X_resampled, y_resampled)
# 重新预测
y_pred_resampled = model.predict(X_test)
# 保存新结果
pd.DataFrame(y_pred_resampled, columns=['预测结果']).to_csv('2.2.1_results_xg.txt', index=False)
# 生成新的测试报告
report_resampled = classification_report(y_test, y_pred_resampled, zero_division=1)
with open('2.2.1_report_xg.txt', 'w') as file:
file.write(report_resampled)
# 分析新的测试结果
accuracy_resampled = (y_test==y_pred_resampled).mean()
print(f"重新采样后的模型准确率: {accuracy_resampled:.2f}")
Unnamed: 0 ... NumberOfDependents
0 1 ... 2.0
1 2 ... 1.0
2 3 ... 0.0
3 4 ... 0.0
4 5 ... 0.0
[5 rows x 12 columns]
/opt/anaconda3/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
模型准确率: 0.94
重新采样后的模型准确率: 0.76
/opt/anaconda3/lib/python3.12/site-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pickle
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
# 加载数据
data = pd.read_csv('finance数据集.csv')
# 显示前五行的数据
print(data.head(5))
# 选择自变量和因变量
X = data.drop(['SeriousDlqin2yrs', 'Unnamed: 0'], axis=1)
y = data['SeriousDlqin2yrs']
# 分割训练集和测试集(测试集20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Unnamed: 0 SeriousDlqin2yrs RevolvingUtilizationOfUnsecuredLines age
0 1 1 0.766127 45
1 2 0 0.957151 40
2 3 0 0.658180 38
3 4 0 0.233810 30
4 5 0 0.907239 49
NumberOfTime30-59DaysPastDueNotWorse DebtRatio MonthlyIncome
0 2 0.802982 9120.0
1 0 0.121876 2600.0
2 1 0.085113 3042.0
3 0 0.036050 3300.0
4 1 0.024926 63588.0
NumberOfOpenCreditLinesAndLoans NumberOfTimes90DaysLate
0 13 0
1 4 0
2 2 1
3 5 0
4 7 0
NumberRealEstateLoansOrLines NumberOfTime60-89DaysPastDueNotWorse
0 6 0
1 0 0
2 0 0
3 0 0
4 1 0
NumberOfDependents
0 2.0
1 1.0
2 0.0
3 0.0
4 0.0