Machine Learning Notes
Optimization of hyperparameters
On tuning: hyperparameter tuning is generally specific to a given dataset, so it is best done against a validation set or with K-fold cross-validation. Tuning is expensive, so it usually pays to start with the most important hyperparameters and only touch the others when necessary (good enough is good enough).
Techniques for hyperparameter tuning include grid search, random search, and Bayesian optimization.
Some people argue that grid search handles large parameter ranges poorly and therefore recommend Bayesian methods, but the objective does not necessarily vary smoothly with each parameter, which means Bayesian optimization, which explores a smooth response surface, is not guaranteed to be the better fit (in practice it does work well, though).
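As a rough illustration of the first two techniques (Bayesian optimization is covered in the Optuna section below), here is a minimal sketch using scikit-learn's GridSearchCV and RandomizedSearchCV on an XGBoost classifier; the parameter grids and the synthetic data are made up for the example.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_classes=2, n_informative=10, random_state=42)

# Grid search: exhaustively evaluates every combination (cost grows multiplicatively)
grid = GridSearchCV(
    XGBClassifier(n_estimators=100, random_state=42),
    param_grid={'max_depth': [3, 5, 7], 'learning_rate': [0.01, 0.1, 0.3]},
    cv=5, scoring='accuracy',
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)

# Random search: samples a fixed budget of points from the given distributions
rand = RandomizedSearchCV(
    XGBClassifier(n_estimators=100, random_state=42),
    param_distributions={
        'max_depth': randint(3, 10),
        'learning_rate': uniform(0.01, 0.29),
        'subsample': uniform(0.6, 0.4),
    },
    n_iter=20, cv=5, scoring='accuracy', random_state=42,
)
rand.fit(X, y)
print(rand.best_params_, rand.best_score_)
```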
XGBoost
The official documentation is always the primary source: XGBoost Hyperparameter Optimization
Core hyperparameter list
Parameter | Description | Effect | Recommended value |
---|---|---|---|
max_depth | Maximum depth of a single tree | Greater depth raises model complexity (prone to overfitting) | 3-9 |
min_child_weight | Minimum sum of instance weights required in a child node | Larger values constrain tree growth (guards against overfitting) | 1-7 |
subsample | Fraction of training rows sampled for each tree | Lower values inject randomness (guards against overfitting) | 0.6-1.0 |
colsample_bytree | Fraction of feature columns sampled for each tree | Restricts the feature space per tree (improves generalization) | 0.6-1.0 |
learning_rate | Shrinkage applied to each weight update | Smaller steps improve stability (needs more trees to compensate) | 0.01-0.3 |
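A minimal sketch of how these core parameters map onto the scikit-learn style constructor; the values are arbitrary picks from the recommended ranges, not tuned results.

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    max_depth=6,            # tree depth
    min_child_weight=3,     # minimum instance-weight sum per child
    subsample=0.8,          # row sampling per tree
    colsample_bytree=0.8,   # column sampling per tree
    learning_rate=0.1,      # step-size shrinkage
    n_estimators=500,       # more trees to compensate for the small learning rate
)
```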
XGBoost extended hyperparameter list
Parameter | Description | Effect | Recommended range |
---|---|---|---|
gamma | Minimum loss reduction required to make a split | Larger values make the model more conservative (useful for noisy data) | 0-5 |
reg_alpha (L1) | L1 regularization coefficient | Sparsifies feature weights (produces sparse solutions) | 0-10 |
reg_lambda (L2) | L2 regularization coefficient | Smooths feature weights (prevents extreme values) | 0-10 |
scale_pos_weight | Balancing weight between positive and negative samples | Handles class imbalance (values > 1 upweight the positive class) | 1-10 |
n_estimators | Total number of trees in the ensemble | More trees increase model capacity (balance against compute cost) | 100-2000 |
The recommended ranges here are mostly my own guesses (the official docs do not provide them).
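The one that is easiest to get wrong is scale_pos_weight; a common starting point (suggested in the XGBoost docs) is the ratio of negative to positive samples. A minimal sketch, with the class counts made up for the example:

```python
import numpy as np
from xgboost import XGBClassifier

y = np.array([0] * 900 + [1] * 100)        # hypothetical 9:1 imbalanced labels
ratio = (y == 0).sum() / (y == 1).sum()    # negatives / positives = 9.0

model = XGBClassifier(
    scale_pos_weight=ratio,   # upweight the positive class
    reg_alpha=0.1,            # mild L1 regularization
    reg_lambda=1.0,           # L2 regularization
    gamma=0.5,                # require some loss reduction before splitting
    n_estimators=500,
)
```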
Hyperparameter tuning with Optuna
Ref: Bayesian Optimization of XGBoost Hyperparameters with optuna
The example from the official site:
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
import optuna

# Generate a synthetic classification dataset
X, y = make_classification(n_samples=1000, n_classes=2, n_informative=10, random_state=42)

def objective(trial):
    # Suggest hyperparameters
    params = {
        'max_depth': trial.suggest_int('max_depth', 2, 10),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 1.0, log=True),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
    }
    # Create an XGBoost classifier with the suggested hyperparameters
    model = XGBClassifier(**params, n_estimators=100, objective='binary:logistic', random_state=42)
    # Perform 5-fold cross-validation and return the mean accuracy
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    return scores.mean()

# Create an Optuna study with TPE sampler
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler())
# Optimize the study for 100 trials
study.optimize(objective, n_trials=100)

# Print the best hyperparameters and best score
print(f"Best hyperparameters: {study.best_params}")
print(f"Best score: {study.best_value:.4f}")
```
Using a manual K-Fold loop
```python
import joblib
import numpy as np
import optuna
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from xgboost import XGBRegressor

def optimize_hyperparameters(target_name, X_train_, y_train_):
    """Hyperparameter optimization with cross-validation."""
    def objective(trial):
        params = {
            'max_depth': trial.suggest_int('max_depth', 3, 9),
            'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
            'subsample': trial.suggest_float('subsample', 0.6, 1.0),
            'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
            'gamma': trial.suggest_float('gamma', 0, 1.0),
            'reg_alpha': trial.suggest_float('reg_alpha', 0, 1.0),
            'reg_lambda': trial.suggest_float('reg_lambda', 0.5, 2.0),
            'tree_method': 'hist',  # or 'gpu_hist'
        }

        # K-fold cross-validation (for a classification task, StratifiedKFold would be the better fit)
        kf = KFold(n_splits=5, shuffle=True, random_state=42)
        scores = []
        for fold_idx, (train_idx, val_idx) in enumerate(kf.split(X_train_, y_train_)):
            X_tr, y_tr = X_train_.iloc[train_idx], y_train_.iloc[train_idx]
            X_v, y_v = X_train_.iloc[val_idx], y_train_.iloc[val_idx]

            model = XGBRegressor(
                **params,
                n_estimators=2000,
                early_stopping_rounds=20,
                eval_metric='rmse',
                random_state=42,  # + fold_idx to give each fold its own seed?
                n_jobs=-1,
            )
            model.fit(
                X_tr, y_tr,
                eval_set=[(X_v, y_v)],
                verbose=False,
            )

            # Predict with the best iteration and compute the fold score
            y_pred = model.predict(X_v, iteration_range=(0, model.best_iteration + 1))
            fold_score = np.sqrt(mean_squared_error(y_v, y_pred))
            scores.append(fold_score)

            # Report intermediate results so Optuna can prune unpromising trials
            trial.report(fold_score, step=fold_idx)
            if trial.should_prune():
                raise optuna.TrialPruned()

        return np.mean(scores)

    # Create an Optuna study with persistent storage
    storage_name = f"sqlite:///{target_name}_tuning_optuna.db"
    study = optuna.create_study(
        direction='minimize',
        sampler=optuna.samplers.TPESampler(seed=42),
        study_name=target_name,
        storage=storage_name,
        load_if_exists=True,
    )

    # Optimize with a progress bar and a 24-hour timeout
    study.optimize(objective, n_trials=500, show_progress_bar=True, timeout=86400)

    # Save the full study object
    joblib.dump(study, f"optuna_study_{target_name}.pkl")

    return study.best_params
```
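A minimal usage sketch; the DataFrame and column names below are hypothetical, just to show the call shape:

```python
import numpy as np
import pandas as pd

# Hypothetical training table; in practice this comes from your own pipeline
rng = np.random.default_rng(0)
train_df = pd.DataFrame({
    'feature_a': rng.random(500),
    'feature_b': rng.random(500),
    'target': rng.random(500),
})

best_params = optimize_hyperparameters(
    target_name='target',
    X_train_=train_df[['feature_a', 'feature_b']],
    y_train_=train_df['target'],
)
print(best_params)
```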
PS: I really want to gripe about `sqlite:///` — no idea why it has that structure. (It is just the SQLAlchemy database-URL convention, `dialect://host/path`; SQLite has no host, so the file path begins right after the empty host, which is where the third slash comes from.)
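One upside of that URL is that the study persists on disk, so it can be reopened later in another process to inspect results or resume tuning. A sketch, assuming the same `target_name` (here hypothetically 'target') was used when the study was created:

```python
import optuna

target_name = 'target'  # hypothetical; must match the study_name used at creation time
study = optuna.load_study(
    study_name=target_name,
    storage=f"sqlite:///{target_name}_tuning_optuna.db",
)
print(study.best_params, study.best_value)
# study.optimize(objective, n_trials=100)  # resume tuning (requires the objective from above)
```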