2.1.1

在现代交通中，燃油效率（MPG）是衡量汽车性能和交通系统优化的重要指标之一。高效的燃油利用不仅能够降低车辆运营成本，还能减少碳排放，促进环保。开发一个用于预测汽车燃油效率的模型可以帮助智慧交通系统优化路线规划和车辆调度，从而提升整体交通效率和减少能源消耗。此外，这样的模型还可以帮助消费者做出更明智的购车决策，并帮助厂商优化汽车设计。

现要求根据提供的汽车燃油效率数据集，补全2.1.1.ipynb代码。选择合适的特征，开发一个燃油效率预测模型。在开发预测模型之前，首先要对数据进行数据清洗和标
注，请完成下面的数据预处理任务，并设计一套标注流程规范：

(1)正确加载数据集，并显示前五行的数据及数据类型。
(2)检查数据集中的缺失值并删除缺失值所在的行。
(3)将“horsepower”列转换为数值类型，并处理转换中的异常值。
(4)对数值型数据进行标准化处理，确保数据在同一量纲下进行分析。
(5)根据业务需求和数据特性，选择对燃油效率预测最有用的特征：选择以下特征：'cylinders'、'displacement'、'horsepower'、'weight'、'acceleration'、'model year'、'origin'
(6)将“mpg”设为目标变量并标注；
(7)对数据进行标注和划分；
(8)保存处理后的数据，并命名为：2.1.1_cleaned_data.csv，保存到考生文件夹；
(9)制定数据清洗和标注规范，将答案写到答题卷文件中，答题卷文件命名为“2.1.1.docx”，保存到考生文件夹；
(10)将以上代码以及运行结果，以html格式保存并命名为2.1.1.html，保存到考生文件夹，考生文件夹命名为“准考证号+身份证后6位”。

import pandas as pd

# 加载数据集并显示数据集的前五行 1分
data = pd.read_csv('auto-mpg.csv')
print("数据集的前五行:")
print(data.head(5))

# 显示每一列的数据类型
print(data.dtypes)

# 检查缺失值并删除缺失值所在的行  2分
print("\n检查缺失值:")
print(data.isna().sum())  
data = data.dropna()

# 将 'horsepower' 列转换为数值类型，并（删除）处理转换中的异常值 1分
data['horsepower'] = pd.to_numeric(data['horsepower'], errors='coerce')
data = data.dropna(subset=['horsepower'])

# 显示每一列的数据类型
print(data.horsepower.dtypes)

# 检查清洗后的缺失值
print("\n检查清洗后的缺失值:")
print(data.isnull().sum())

from sklearn.preprocessing import StandardScaler
# 对数值型数据进行标准化处理 1分
numerical_features = ['displacement', 'horsepower', 'weight', 'acceleration']
scaler = StandardScaler()
data[numerical_features] = scaler.fit_transform(data[numerical_features])

from sklearn.model_selection import train_test_split
# 选择特征、自变量和目标变量 2分
selected_features = ['cylinders','displacement','horsepower','weight','acceleration','model year','origin']
X = data[selected_features]
y = data['mpg']

# 划分数据集为训练集和测试集（训练集占8成） 1分
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# 将特征和目标变量合并到一个数据框中
cleaned_data = X.copy()
cleaned_data['mpg'] = y

# 保存清洗和处理后的数据（不存储额外的索引号） 1分
cleaned_data.to_csv('2.1.1_cleaned_data.csv', index=False)

# 打印消息指示文件已保存
print("\n清洗后的数据已保存到 2.1.1_cleaned_data.csv")

数据集的前五行:
mpg cylinders displacement ... model year origin car name
0 18.0 8.0 NaN ... 70.0 1 chevrolet chevelle malibu
1 15.0 8.0 350.0 ... 70.0 1 buick skylark 320
2 18.0 8.0 318.0 ... 70.0 1 plymouth satellite
3 16.0 8.0 304.0 ... 70.0 1 amc rebel sst
4 17.0 8.0 302.0 ... 70.0 1 ford torino

[5 rows x 9 columns]
mpg float64
cylinders float64
displacement float64
horsepower object
weight float64
acceleration float64
model year float64
origin int64
car name object
dtype: object

检查缺失值:
mpg 0
cylinders 1
displacement 3
horsepower 1
weight 2
acceleration 1
model year 1
origin 0
car name 0
dtype: int64
float64

检查清洗后的缺失值:
mpg 0
cylinders 0
displacement 0
horsepower 0
weight 0
acceleration 0
model year 0
origin 0
car name 0
dtype: int64

清洗后的数据已保存到 2.1.1_cleaned_data.csv

import pandas as pd
data = pd.read_csv('auto-mpg.csv')
print("数据集的前五行:")
print(data.head(5))

[5 rows x 9 columns]

# 显示每一列的数据类型
print(data.dtypes)

# 检查缺失值并删除缺失值所在的行  2分
print("\n检查缺失值:")
print(data.isna().sum())  
data = data.dropna()
print(data)

mpg float64
cylinders float64
displacement float64
horsepower object
weight float64
acceleration float64
model year float64
origin int64
car name object
dtype: object

检查缺失值:
mpg 0
cylinders 0
displacement 0
horsepower 0
weight 0
acceleration 0
model year 0
origin 0
car name 0
dtype: int64
mpg cylinders displacement ... model year origin car name
1 15.0 8.0 350.0 ... 70.0 1 buick skylark 320
2 18.0 8.0 318.0 ... 70.0 1 plymouth satellite
3 16.0 8.0 304.0 ... 70.0 1 amc rebel sst
4 17.0 8.0 302.0 ... 70.0 1 ford torino
5 15.0 8.0 429.0 ... 70.0 1 ford galaxie 500
.. ... ... ... ... ... ... ...
393 27.0 4.0 140.0 ... 82.0 1 ford mustang gl
394 44.0 4.0 97.0 ... 82.0 2 vw pickup
395 32.0 4.0 135.0 ... 82.0 1 dodge rampage
396 28.0 4.0 120.0 ... 82.0 1 ford ranger
397 31.0 4.0 119.0 ... 82.0 1 chevy s-10

[390 rows x 9 columns]

# 将 'horsepower' 列转换为数值类型，并（删除）处理转换中的异常值 1分
data['horsepower'] = pd.to_numeric(data['horsepower'], errors='coerce')
data = data.dropna(subset=['horsepower'])

print(data.dtypes)
print(data)

mpg float64
cylinders float64
displacement float64
horsepower float64
weight float64
acceleration float64
model year float64
origin int64
car name object
dtype: object
mpg cylinders displacement ... model year origin car name
1 15.0 8.0 350.0 ... 70.0 1 buick skylark 320
2 18.0 8.0 318.0 ... 70.0 1 plymouth satellite
3 16.0 8.0 304.0 ... 70.0 1 amc rebel sst
4 17.0 8.0 302.0 ... 70.0 1 ford torino
5 15.0 8.0 429.0 ... 70.0 1 ford galaxie 500
.. ... ... ... ... ... ... ...
393 27.0 4.0 140.0 ... 82.0 1 ford mustang gl
394 44.0 4.0 97.0 ... 82.0 2 vw pickup
395 32.0 4.0 135.0 ... 82.0 1 dodge rampage
396 28.0 4.0 120.0 ... 82.0 1 ford ranger
397 31.0 4.0 119.0 ... 82.0 1 chevy s-10

[382 rows x 9 columns]

data.isna().sum()

mpg 0
cylinders 0
displacement 0
horsepower 0
weight 0
acceleration 0
model year 0
origin 0
car name 0
dtype: int64

import pandas as pd

# 加载数据集并显示数据集的前五行 1分
data = pd.read_csv('auto-mpg.csv')
print("数据集的前五行:")
print(data.head(5))

# 显示每一列的数据类型
print(data.dtypes)

# 检查缺失值并删除缺失值所在的行  2分
print("\n检查缺失值:")
print(data.isna().sum())  
data = data.dropna()

# 将 'horsepower' 列转换为数值类型，并（删除）处理转换中的异常值 1分
data['horsepower'] = pd.to_numeric(data['horsepower'], errors='coerce')
data = data.dropna(subset=['horsepower'])

# 显示每一列的数据类型
print(data.horsepower.dtypes)

# 检查清洗后的缺失值
print("\n检查清洗后的缺失值:")
print(data.isnull().sum())

from sklearn.preprocessing import StandardScaler
# 对数值型数据进行标准化处理 1分
numerical_features = ['displacement', 'horsepower', 'weight', 'acceleration']
scaler = StandardScaler()
data[numerical_features] = scaler.fit_transform(data[numerical_features])

data[numerical_features]

[5 rows x 9 columns]
mpg float64
cylinders float64
displacement float64
horsepower object
weight float64
acceleration float64
model year float64
origin int64
car name object
dtype: object

检查缺失值:
mpg 0
cylinders 1
displacement 3
horsepower 1
weight 2
acceleration 1
model year 1
origin 0
car name 0
dtype: int64
float64

检查清洗后的缺失值:
mpg 0
cylinders 0
displacement 0
horsepower 0
weight 0
acceleration 0
model year 0
origin 0
car name 0
dtype: int64

	displacement	horsepower	weight	acceleration
1	1.502862	1.592426	0.838814	-1.488697
2	1.195545	1.199835	0.537191	-1.670886
3	1.061093	1.199835	0.533670	-1.306509
4	1.041886	0.938107	0.552448	-1.853074
5	2.261553	2.456126	1.599326	-2.035262
...	...	...	...	...
393	-0.513910	-0.475220	-0.220974	0.005246
394	-0.926868	-1.365093	-0.995570	3.284635
395	-0.561928	-0.527566	-0.801921	-1.452260
396	-0.705983	-0.658429	-0.414623	1.098376
397	-0.715587	-0.579911	-0.303128	1.389877

382 rows × 4 columns


from sklearn.model_selection import train_test_split
# 选择特征、自变量和目标变量 2分
selected_features = ['cylinders','displacement','horsepower','weight','acceleration','model year','origin']
X = data[selected_features]
print(X)
y = data['mpg']
print(y)

# 划分数据集为训练集和测试集（训练集占8成） 1分
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_test, y_train, y_test

cylinders displacement horsepower ... acceleration model year origin
1 8.0 350.0 165.0 ... 11.5 70.0 1
2 8.0 318.0 150.0 ... 11.0 70.0 1
3 8.0 304.0 150.0 ... 12.0 70.0 1
4 8.0 302.0 140.0 ... 10.5 70.0 1
5 8.0 429.0 198.0 ... 10.0 70.0 1
.. ... ... ... ... ... ... ...
393 4.0 140.0 86.0 ... 15.6 82.0 1
394 4.0 97.0 52.0 ... 24.6 82.0 2
395 4.0 135.0 84.0 ... 11.6 82.0 1
396 4.0 120.0 79.0 ... 18.6 82.0 1
397 4.0 119.0 82.0 ... 19.4 82.0 1

[382 rows x 7 columns]
1 15.0
2 18.0
3 16.0
4 17.0
5 15.0
...
393 27.0
394 44.0
395 32.0
396 28.0
397 31.0
Name: mpg, Length: 382, dtype: float64

( cylinders displacement horsepower ... acceleration model year origin
177 4.0 115.0 95.0 ... 15.0 75.0 2
333 6.0 168.0 132.0 ... 11.4 80.0 3
231 8.0 400.0 190.0 ... 12.2 77.0 1
105 8.0 360.0 170.0 ... 13.0 73.0 1
243 3.0 80.0 110.0 ... 13.5 77.0 3
.. ... ... ... ... ... ... ...
82 4.0 120.0 97.0 ... 14.5 72.0 3
117 4.0 68.0 49.0 ... 19.5 73.0 2
282 4.0 140.0 88.0 ... 17.3 79.0 1
363 6.0 231.0 110.0 ... 15.8 81.0 1
113 6.0 155.0 107.0 ... 14.0 73.0 1

[305 rows x 7 columns],
cylinders displacement horsepower ... acceleration model year origin
292 8.0 360.0 150.0 ... 13.0 79.0 1
260 6.0 225.0 110.0 ... 18.7 78.0 1
230 8.0 350.0 170.0 ... 11.4 77.0 1
341 6.0 173.0 110.0 ... 12.6 81.0 1
67 8.0 429.0 208.0 ... 11.0 72.0 1
.. ... ... ... ... ... ... ...
115 8.0 350.0 145.0 ... 13.0 73.0 1
179 4.0 121.0 98.0 ... 14.5 75.0 2
74 8.0 302.0 140.0 ... 16.0 72.0 1
244 4.0 90.0 48.0 ... 21.5 78.0 2
95 8.0 455.0 225.0 ... 11.0 73.0 1

[77 rows x 7 columns],
177 23.0
333 32.7
231 15.5
105 13.0
243 21.5
...
82 23.0
117 29.0
282 22.3
363 22.4
113 21.0
Name: mpg, Length: 305, dtype: float64,
292 18.5
260 18.6
230 15.5
341 23.5
67 11.0
...
115 15.0
179 22.0
74 13.0
244 43.1
95 12.0
Name: mpg, Length: 77, dtype: float64)


# 将特征和目标变量合并到一个数据框中
cleaned_data = X.copy()
print(cleaned_data)
cleaned_data['mpg'] = y
print(cleaned_data)


# cleaned_data
# cleaned_data['mpg']

[382 rows x 7 columns]
cylinders displacement horsepower ... model year origin mpg
1 8.0 350.0 165.0 ... 70.0 1 15.0
2 8.0 318.0 150.0 ... 70.0 1 18.0
3 8.0 304.0 150.0 ... 70.0 1 16.0
4 8.0 302.0 140.0 ... 70.0 1 17.0
5 8.0 429.0 198.0 ... 70.0 1 15.0
.. ... ... ... ... ... ... ...
393 4.0 140.0 86.0 ... 82.0 1 27.0
394 4.0 97.0 52.0 ... 82.0 2 44.0
395 4.0 135.0 84.0 ... 82.0 1 32.0
396 4.0 120.0 79.0 ... 82.0 1 28.0
397 4.0 119.0 82.0 ... 82.0 1 31.0

[382 rows x 8 columns]