2.1.5
在健康与营养咨询领域,客户的健康数据是评估其饮食和生活方式建议的重要依据。通过对客户健康数据的分析,可以帮助健康咨询师更准确地评估客户的健康状况,并制定个性化的营养和健康管理计划。现提供一份健康咨询客户数据集。请补全2.1.5.ipynb代码,完成下面的数据预处理任务:
(1)加载数据集:查看表的数据类型,表结构和显示每一列的空缺值数量;
(2)缺失值处理:对于含有缺失值的行进行删除操作;
(3)数据类型转换:将“Your age”列的数据类型转换为整数类型,并处理其中的异常值;
(4)数据去重:检查数据集中的重复值并删除所有重复值,并记录删除的行数;
(5)数据归一化处理:对“如何形容你的当前健身水平?”(How do you describe your current level of fitness ?)列中的数据进行归一化处理;
(6)绘制健身频率分布的饼图;
(7)对数据进行标注划分;
(8)保存处理后的数据,并命名为:2.1.5_cleaned_data.csv,保存到考生文件夹;
(9)制定数据清洗和数据标注规范,将答案写到答题卷文件中,答题卷文件命名为“2.1.5.docx”,保存到考生文件夹;
(10)将以上代码以及运行结果,以html格式保存并命名为2.1.5.html,保存到考生文件夹,考生文件夹命名为“准考证号+身份证后6位”。
import pandas as pd
# 加载数据集
data = __________
# 查看表结构基本信息
print(__________)
# 显示每一列的空缺值数量
print(__________)
# 删除含有缺失值的行
data_cleaned = __________
# 转换 'Your age' 列的数据类型为整数类型,并处理异常值
data_cleaned.loc[:, 'Your age'] = __________(__________, errors='coerce')
data_cleaned = data_cleaned.dropna(subset=['Your age'])
data_cleaned = data_cleaned[data_cleaned['Your age'] >= 0]
data_cleaned.loc[:, 'Your age'] = data_cleaned['Your age'].__________
print(data_cleaned['Your age'].dtype)
# 检查和删除重复值
duplicates_removed = data_cleaned.duplicated().sum()
data_cleaned = __________
print(f"Removed {duplicates_removed} duplicate rows")
from sklearn.preprocessing import LabelEncoder
# 归一化 'How do you describe your current level of fitness ?' 列
label_encoder = LabelEncoder()
data_cleaned[__________] = __________
print(data_cleaned['How do you describe your current level of fitness ?'].unique())
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
# 去掉列名中的空格
data.columns = data.columns.str.strip()
# 显示数据集的列名
print(data.columns)
# 删除包含缺失值的行
data_cleaned = data.dropna(subset=['How often do you exercise?'])
# 统计不同健身频率的分布情况
exercise_frequency_counts = data_cleaned['How often do you exercise?'].value_counts()
# 绘制饼图
plt.figure(figsize=(10, 6))
__________(autopct='%1.1f', startangle=90, colors=plt.cm.Paired.colors)
plt.title('Distribution of Exercise Frequency')
plt.ylabel('')
plt.show()
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# 填充缺失值
data_filled = data.apply(lambda x: x.fillna(x.mode()[0]))
# 划分数据(测试集占比20%)
train_data, test_data = train_test_split(data_filled, test_size=0.2, random_state=42)
# 保存处理后的数据
cleaned_file_path = '2.1.5_cleaned_data.csv'
data_filled.to_csv(cleaned_file_path, index=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 551 entries, 0 to 550
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Timestamp 551 non-null object
1 Your name 551 non-null object
2 Your gender 551 non-null object
3 Your age 551 non-null float64
4 How important is exercise to you ? 551 non-null int64
5 How do you describe your current level of fitness ? 551 non-null object
6 How often do you exercise? 550 non-null object
7 What barriers, if any, prevent you from exercising more regularly? (Please select all that apply) 551 non-null object
8 What form(s) of exercise do you currently participate in ? (Please select all that apply) 550 non-null object
9 Do you exercise ___________ ? 549 non-null object
10 What time if the day do you prefer to exercise? 551 non-null object
11 How long do you spend exercising per day ? 551 non-null object
12 Would you say you eat a healthy balanced diet ? 550 non-null object
13 What prevents you from eating a healthy balanced diet, If any? (Please select all that apply) 551 non-null object
14 How healthy do you consider yourself? 550 non-null float64
15 Have you ever recommended your friends to follow a fitness routine? 550 non-null object
16 Have you ever purchased a fitness equipment? 551 non-null object
17 What motivates you to exercise? (Please select all that applies ) 551 non-null object
dtypes: float64(2), int64(1), object(15)
memory usage: 77.6+ KB
None
Timestamp 0
Your name 0
Your gender 0
Your age 0
How important is exercise to you ? 0
How do you describe your current level of fitness ? 0
How often do you exercise? 1
What barriers, if any, prevent you from exercising more regularly? (Please select all that apply) 0
What form(s) of exercise do you currently participate in ? (Please select all that apply) 1
Do you exercise ___________ ? 2
What time if the day do you prefer to exercise? 0
How long do you spend exercising per day ? 0
Would you say you eat a healthy balanced diet ? 1
What prevents you from eating a healthy balanced diet, If any? (Please select all that apply) 0
How healthy do you consider yourself? 1
Have you ever recommended your friends to follow a fitness routine? 1
Have you ever purchased a fitness equipment? 0
What motivates you to exercise? (Please select all that applies ) 0
dtype: int64
float64
Removed 5 duplicate rows
[1 4 3 0 2]
Index(['Timestamp', 'Your name', 'Your gender', 'Your age',
'How important is exercise to you ?',
'How do you describe your current level of fitness ?',
'How often do you exercise?',
'What barriers, if any, prevent you from exercising more regularly? (Please select all that apply)',
'What form(s) of exercise do you currently participate in ? (Please select all that apply)',
'Do you exercise ___________ ?',
'What time if the day do you prefer to exercise?',
'How long do you spend exercising per day ?',
'Would you say you eat a healthy balanced diet ?',
'What prevents you from eating a healthy balanced diet, If any? (Please select all that apply)',
'How healthy do you consider yourself?',
'Have you ever recommended your friends to follow a fitness routine?',
'Have you ever purchased a fitness equipment?',
'What motivates you to exercise? (Please select all that applies )'],
dtype='object')
import pandas as pd
def age_range_to_numeric(age_str):
if pd.isna(age_str): # 处理缺失值
return np.nan
if 'to' in age_str: # 如 '19 to 25'
low, high = map(int, age_str.split(' to '))
return (low + high) / 2 # 取中位数
elif 'and above' in age_str: # 如 '40 and above'
return int(age_str.split(' ')[0]) # 取 40,或自定义(如 40 + 10)
else:
return np.nan # 其他情况按缺失值处理
# 加载数据集
data = pd.read_csv('健康咨询客户数据集.csv')
data['Your age'] = data['Your age'].apply(age_range_to_numeric)
# 查看表结构基本信息
print(data.info())
# 显示每一列的空缺值数量
print(data.isna().sum())
# 删除含有缺失值的行
data_cleaned = data.dropna()
# 转换 'Your age' 列的数据类型为整数类型,并处理异常值
data_cleaned.loc[:, 'Your age'] = pd.to_numeric(data_cleaned['Your age'], errors='coerce')
data_cleaned = data_cleaned.dropna(subset=['Your age'])
data_cleaned = data_cleaned[data_cleaned['Your age'] >= 0]
data_cleaned.loc[:, 'Your age'] = data_cleaned['Your age'].astype(int)
print(data_cleaned['Your age'].dtype)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 551 entries, 0 to 550
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Timestamp 551 non-null object
1 Your name 551 non-null object
2 Your gender 551 non-null object
3 Your age 551 non-null float64
4 How important is exercise to you ? 551 non-null int64
5 How do you describe your current level of fitness ? 551 non-null object
6 How often do you exercise? 550 non-null object
7 What barriers, if any, prevent you from exercising more regularly? (Please select all that apply) 551 non-null object
8 What form(s) of exercise do you currently participate in ? (Please select all that apply) 550 non-null object
9 Do you exercise ___________ ? 549 non-null object
10 What time if the day do you prefer to exercise? 551 non-null object
11 How long do you spend exercising per day ? 551 non-null object
12 Would you say you eat a healthy balanced diet ? 550 non-null object
13 What prevents you from eating a healthy balanced diet, If any? (Please select all that apply) 551 non-null object
14 How healthy do you consider yourself? 550 non-null float64
15 Have you ever recommended your friends to follow a fitness routine? 550 non-null object
16 Have you ever purchased a fitness equipment? 551 non-null object
17 What motivates you to exercise? (Please select all that applies ) 551 non-null object
dtypes: float64(2), int64(1), object(15)
memory usage: 77.6+ KB
None
Timestamp 0
Your name 0
Your gender 0
Your age 0
How important is exercise to you ? 0
How do you describe your current level of fitness ? 0
How often do you exercise? 1
What barriers, if any, prevent you from exercising more regularly? (Please select all that apply) 0
What form(s) of exercise do you currently participate in ? (Please select all that apply) 1
Do you exercise ___________ ? 2
What time if the day do you prefer to exercise? 0
How long do you spend exercising per day ? 0
Would you say you eat a healthy balanced diet ? 1
What prevents you from eating a healthy balanced diet, If any? (Please select all that apply) 0
How healthy do you consider yourself? 1
Have you ever recommended your friends to follow a fitness routine? 1
Have you ever purchased a fitness equipment? 0
What motivates you to exercise? (Please select all that applies ) 0
dtype: int64
float64