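Random Forest

Random forest = Bagging + decision trees + feature randomness. By combining many decision trees into an ensemble, it substantially reduces the risk of overfitting.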

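Core Idea

Two Sources of Randomness

Sample randomness (Bagging): each tree is trained on a bootstrap sample drawn with replacement from the original data; on average, about 63.2% of the distinct original samples appear in each bootstrap sample
Feature randomness: each split considers only a random subset of the features (typically $\sqrt{p}$ of the $p$ features for classification), which prevents a few strong features from dominating every tree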
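The two figures above can be checked with a few lines of NumPy. This is a minimal sketch, not part of any library: the sample count, feature count, and seed are arbitrary illustrative choices. The 63.2% figure comes from $1 - (1 - 1/n)^n$, which tends to $1 - 1/e$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility
n_samples = 10_000               # illustrative dataset size
n_features = 20                  # illustrative feature count

# Sample randomness: one bootstrap sample = n_samples row indices drawn with replacement
idx = rng.integers(0, n_samples, size=n_samples)

# Fraction of distinct original rows that ended up in the bootstrap sample
in_bag_fraction = np.unique(idx).size / n_samples

# Theoretical value 1 - (1 - 1/n)^n, close to 1 - 1/e for large n
expected_fraction = 1 - (1 - 1 / n_samples) ** n_samples

# Feature randomness: pick floor(sqrt(p)) candidate features without replacement
k = max(1, int(np.sqrt(n_features)))
candidate_features = rng.choice(n_features, size=k, replace=False)

print(f"simulated in-bag fraction:   {in_bag_fraction:.3f}")
print(f"theoretical in-bag fraction: {expected_fraction:.3f}")
print(f"features considered at one split: {sorted(candidate_features)}")
```

In a real forest a fresh feature subset is drawn at every split, not once per tree; the last lines only illustrate a single draw.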

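Why It Works

A single decision tree has high variance: a slightly different training set can produce a completely different tree. Averaging the predictions of many trees reduces that variance, and the reduction is larger the less correlated the trees are, which is why both sources of randomness matter. This is the strength of ensemble learning.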
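The variance argument can be made concrete with a toy simulation. For B identically distributed predictions with variance σ² and pairwise correlation ρ, the variance of their average is ρσ² + (1 − ρ)σ²/B. The sketch below stands in Gaussian random numbers for tree predictions, so it illustrates the formula only; the values of `sigma2`, `rho`, and `n_trees` are assumptions, not measurements from real trees.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0       # assumed variance of a single tree's prediction
rho = 0.3          # assumed pairwise correlation between trees
n_trees = 50       # B in the formula
n_trials = 20_000  # number of simulated "training sets"

# Correlated predictions: one component shared by all trees plus per-tree noise
shared = rng.normal(size=(n_trials, 1))
independent = rng.normal(size=(n_trials, n_trees))
tree_preds = np.sqrt(sigma2) * (
    np.sqrt(rho) * shared + np.sqrt(1 - rho) * independent
)

single_tree_var = tree_preds[:, 0].var()       # variance of one tree
forest_var = tree_preds.mean(axis=1).var()     # variance of the averaged ensemble
theory_var = rho * sigma2 + (1 - rho) * sigma2 / n_trees

print(f"single tree variance: {single_tree_var:.3f}")
print(f"forest variance:      {forest_var:.3f}")
print(f"formula:              {theory_var:.3f}")
```

Adding more trees shrinks only the second term; the floor ρσ² is lowered by decorrelating the trees, which is the role of per-split feature sampling.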

A From-Scratch Random Forest

import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SimpleRandomForest:
    def __init__(self, n_estimators=100, max_depth=None, max_features='sqrt'):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.max_features = max_features
        self.trees = []

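    def fit(self, X, y):
        X = np.asarray(X)
        # Encode labels as 0..k-1 so that np.bincount works for any label type
        self.classes_, y_enc = np.unique(y, return_inverse=True)
        n_samples = X.shape[0]
        self.trees = []

        for _ in range(self.n_estimators):
            # Bootstrap sampling: draw n_samples row indices with replacement
            idx = np.random.choice(n_samples, n_samples, replace=True)
            X_boot, y_boot = X[idx], y_enc[idx]

            tree = DecisionTreeClassifier(
                max_depth=self.max_depth,
                max_features=self.max_features,
            )
            tree.fit(X_boot, y_boot)
            self.trees.append(tree)
        return self

    def predict(self, X):
        X = np.asarray(X)
        # Each tree votes; preds has shape (n_estimators, n_samples)
        preds = np.array([tree.predict(X) for tree in self.trees]).astype(int)
        # Majority vote per sample, mapped back to the original labels
        n_classes = len(self.classes_)
        votes = np.array([
            np.bincount(preds[:, i], minlength=n_classes).argmax()
            for i in range(X.shape[0])
        ])
        return self.classes_[votes]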

Scikit-learn Implementation

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
# Fix the split seed so the reported accuracy is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(
    n_estimators=100,      # number of trees (more is more stable, with diminishing returns)
    max_depth=10,           # maximum depth of each tree
    min_samples_split=5,    # minimum samples required to split an internal node
    max_features='sqrt',    # number of features considered at each split
    n_jobs=-1,              # train trees in parallel on all CPU cores
    random_state=42,
)
model.fit(X_train, y_train)

print(f"Accuracy: {model.score(X_test, y_test):.3f}")
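A single train/test split gives a noisy accuracy estimate. As a minimal sketch, the same kind of model can be scored with k-fold cross-validation via `cross_val_score`; the fold count of 5 and the synthetic dataset are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    max_features="sqrt",
    random_state=42,
)

# 5-fold cross-validation; for classifiers the default scoring is accuracy
scores = cross_val_score(model, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The spread across folds gives a rough sense of how much the single-split accuracy could move with a different split.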

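Key Parameters

Parameter	Suggested value	Description
`n_estimators`	100-500	More trees give more stable predictions, with diminishing returns beyond a certain point
`max_depth`	3-15	Limits overfitting; start with smaller values
`min_samples_split`	5-20	Minimum number of samples required to split an internal node
`max_features`	`sqrt`	Classification defaults to `sqrt`; regression defaults to all features (`1.0`)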
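One way to pick values from these ranges is a small cross-validated grid search. The sketch below uses `GridSearchCV` on a synthetic dataset; the grid values are illustrative picks from the suggested ranges, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Small illustrative grid taken from the suggested ranges
param_grid = {
    "max_depth": [3, 8, 15],
    "min_samples_split": [5, 20],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    param_grid,
    cv=3,  # 3-fold cross-validation for each parameter combination
)
search.fit(X, y)

print(f"best parameters: {search.best_params_}")
print(f"best CV accuracy: {search.best_score_:.3f}")
```

`n_estimators` is left out of the grid on purpose: since more trees rarely hurt, it is usually set as high as the time budget allows rather than searched.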

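Feature Importance

Random forests provide feature importance estimates out of the box. In scikit-learn, `feature_importances_` holds the impurity-based importances, normalized to sum to 1: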

import numpy as np

importances = model.feature_importances_
feature_names = [f'feature_{i}' for i in range(len(importances))]

# Sort in descending order and print the top 10
indices = np.argsort(importances)[::-1]
for i in indices[:10]:
    print(f"{feature_names[i]}: {importances[i]:.4f}")
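Impurity-based importances are computed from the training data and can favor high-cardinality features. As a complementary check, the sketch below uses `sklearn.inspection.permutation_importance` on held-out data, which measures how much the score drops when one feature's values are shuffled. The dataset, split, and `n_repeats=10` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature 10 times on the test set and record the accuracy drop
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)

# Indices of the five features with the largest mean accuracy drop
top = result.importances_mean.argsort()[::-1][:5]
for i in top:
    mean, std = result.importances_mean[i], result.importances_std[i]
    print(f"feature_{i}: {mean:.4f} +/- {std:.4f}")
```

If the two rankings disagree strongly, the permutation result on held-out data is usually the safer basis for feature selection.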

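OOB (Out-of-Bag) Validation

About 36.8% of the samples are left out of any given tree's bootstrap sample. Each sample can therefore be scored using only the trees that never saw it, which acts as a built-in validation set: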

model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
model.fit(X_train, y_train)
print(f"OOB score: {model.oob_score_:.3f}")

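The OOB score is an approximately unbiased estimate of generalization performance (often slightly pessimistic, because each sample is scored by only a subset of the trees), so it can often stand in for a separate validation set.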
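To see how closely the OOB estimate tracks held-out performance, the two can be compared directly. This is a sketch on synthetic data; the dataset, split, and `n_estimators=200` are illustrative choices, and the size of the gap will vary from dataset to dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
model.fit(X_train, y_train)

oob_acc = model.oob_score_               # accuracy on out-of-bag predictions
test_acc = model.score(X_test, y_test)   # accuracy on the held-out test set

# Per-sample OOB class probabilities, shape (n_train_samples, n_classes)
oob_proba = model.oob_decision_function_

print(f"OOB accuracy:  {oob_acc:.3f}")
print(f"test accuracy: {test_acc:.3f}")
```

With very few trees, some samples may never be out-of-bag, so the OOB estimate is only meaningful once the forest is reasonably large.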

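Summary

Aspect	Description
Strengths	Robust, resistant to overfitting, built-in feature importance, OOB validation
Weaknesses	Hard to interpret (hundreds of trees cannot be visualized at once); training can be slow on large datasets (parallelism helps)
Use cases	Baseline for classification/regression on tabular data; feature selection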
