from sklearn.datasets import load_boston
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score

# Load the Boston housing data and split it into training and test sets
house = load_boston()
X = house.data
y = house.target
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
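The imports above bring in DecisionTreeRegressor and r2_score, so a minimal sketch of the fitting step, using only the split defined above, might look like this:

# Fit a single decision tree regressor on the Boston housing training split
reg = DecisionTreeRegressor(random_state=0)
reg.fit(x_train, y_train)

# R^2 score on the held-out test split (the exact value depends on the split)
print(r2_score(y_test, reg.predict(x_test)))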
Ensemble learning is a machine learning paradigm that trains multiple models to solve the same problem. In contrast to ordinary machine learning methods, which try to learn a single hypothesis from the training data, ensemble methods construct a set of hypotheses and use them in combination. Next, we will use the decision tree and its ensemble versions to model the classic MNIST data set and observe the differences between the various ensemble methods.
import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier
Build a data set
The MNIST data set used here is not in its original format. To make it easier to work with for this exercise, each 28 × 28 image in the original data set has been flattened into 784 features. The columns of the DataFrame below are named 1x1, 1x2, ..., 28x28, where column ixj holds the pixel value at row i and column j of the image. The images are grayscale and have been binarized, so each pixel value is either 0 or 1.
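The prepared data file itself is not included here; the sketch below only illustrates the preprocessing described above, assuming we rebuild an equivalent DataFrame from sklearn's fetch_openml copy of MNIST, binarize the pixels, and name the columns 1x1, ..., 28x28:

from sklearn.datasets import fetch_openml

# Download MNIST (70,000 samples, 784 pixel features) -- an assumed substitute for the original file
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

# Column names "1x1", "1x2", ..., "28x28": row i, column j of the 28 x 28 image
columns = [f'{i}x{j}' for i in range(1, 29) for j in range(1, 29)]

# Binarize the grayscale values so every pixel is 0 or 1 (the threshold is an assumption)
pixels = (mnist.data > 0).astype(int)
df = pd.DataFrame(pixels, columns=columns)
labels = mnist.target.astype(int)

X_train, X_test, y_train, y_test = train_test_split(df, labels, test_size=0.3, random_state=0)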
From the above results, we can see that by adjusting the parameter min_samples_leaf, the overfitting has been alleviated. What does this parameter mean, and why does increasing it alleviate overfitting? min_samples_leaf is the minimum number of samples that a leaf node of the decision tree must contain. By increasing it, the decision tree can no longer chase every subtle feature of the training data, so it cannot overfit the training data as severely; in addition, a leaf node containing many samples effectively lets those samples vote on the prediction, which enhances the generalization performance of the model. You can try to keep increasing the value of this parameter and look for the best setting. Besides this parameter, you can also try tuning parameters such as min_samples_split and max_features; for their exact meaning, please refer to the sklearn documentation.
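As a rough sketch of this kind of tuning (the value 10 below is only an illustrative choice; X_train and X_test are assumed from the data preparation above), we can compare an unconstrained tree with one restricted by min_samples_leaf:

# An unconstrained tree can keep splitting until it memorizes the training set
tree_default = DecisionTreeClassifier(random_state=0)
tree_default.fit(X_train, y_train)
print(tree_default.score(X_train, y_train), tree_default.score(X_test, y_test))

# Requiring at least 10 samples per leaf limits how finely the tree can split
tree_pruned = DecisionTreeClassifier(min_samples_leaf=10, random_state=0)
tree_pruned.fit(X_train, y_train)
print(tree_pruned.score(X_train, y_train), tree_pruned.score(X_test, y_test))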
Second question:
Try adjusting other parameters and see how the decision tree performs on the test set.
Random Forest
Let's take a look at the bagging version of the decision tree, the random forest, and see how it performs!
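A minimal sketch with sklearn's default settings (again assuming the training split built earlier):

# Random forest with default parameters
rfc = RandomForestClassifier(random_state=0)
rfc.fit(X_train, y_train)
print(rfc.score(X_train, y_train), rfc.score(X_test, y_test))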
The ensemble version lives up to its reputation: even with the default parameters it achieves better performance, and its test-set accuracy is about 7% higher than that of the ordinary decision tree. However, comparing the training and test results, there is still a certain degree of overfitting, so let's try adjusting some parameters below.
After increasing the parameter n_estimators, the test-set accuracy improved by about 1%. This parameter controls how many decision trees are trained (here 20) before their results are combined. Increasing it can be loosely thought of as increasing the number of voters, so the final result is naturally more robust. You can try to keep increasing this parameter, or adjust other parameters such as max_samples: setting it somewhat below the total amount of training data increases the differences between the sub-models, which can further improve generalization. You can also adjust the parameters of the base learner (the decision tree); for the meaning of each parameter, see the sklearn documentation.
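A rough sketch of this adjustment (20 trees as mentioned above; the max_samples value of 0.8 is only an illustrative assumption):

# Train 20 trees and give each tree a subsample of the training data
rfc_tuned = RandomForestClassifier(n_estimators=20, max_samples=0.8, random_state=0)
rfc_tuned.fit(X_train, y_train)
print(rfc_tuned.score(X_train, y_train), rfc_tuned.score(X_test, y_test))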
GBDT
Next, let's look at the performance of the boosting version of the decision tree, GBDT!
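A minimal sketch with the default settings (note that GBDT can be slow on the full MNIST training set, so you may want to subsample while experimenting):

# Gradient boosted decision trees with default parameters
gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X_train, y_train)
print(gbc.score(X_train, y_train), gbc.score(X_test, y_test))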
As expected, the performance improves substantially, and the training-set metrics are basically the same as the test-set metrics, with no overfitting, so it should be possible to keep pushing this parameter further. In general, when there is no overfitting, we only need to consider increasing the complexity of the model; this is the fastest way to improve performance. Once the complexity grows to the point of overfitting, we then consider methods for reducing it.
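For example, while the model is still not overfitting, one could raise the number of boosting rounds and the tree depth (the values below are illustrative only):

# Increase model complexity: more boosting rounds and deeper trees
gbc_bigger = GradientBoostingClassifier(n_estimators=200, max_depth=4, random_state=0)
gbc_bigger.fit(X_train, y_train)
print(gbc_bigger.score(X_train, y_train), gbc_bigger.score(X_test, y_test))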
Bagging
The aforementioned random forest and GBDT are ensemble learning algorithms based on decision trees, but it should be noted that ensemble learning is not exclusive to decision trees: any other learner can serve as the base learner, such as logistic regression or support vector machines.
Bagging is short for "bootstrap aggregating". It is a meta-algorithm that draws M sub-samples (with replacement) from the initial data set and trains a prediction model on each sub-sample. The final model is obtained by averaging (or voting over) all the sub-models, which usually produces better results. The main advantage of this technique is that it comes with built-in regularization; all you need to do is choose good parameters for the base learner.
The following uses the general API provided by sklearn to construct an ensemble learning algorithm.
# Still use the decision tree as the base learner
bgc = BaggingClassifier(DecisionTreeClassifier(), max_samples=0.5, max_features=1.0, n_estimators=20)
bgc.fit(X_train, y_train)
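To use logistic regression as the base learner instead (a minimal sketch; the max_samples and n_estimators values simply mirror the decision-tree cell above):

# Bagging with logistic regression as the base learner
bgc_lr = BaggingClassifier(LogisticRegression(max_iter=1000), max_samples=0.5, max_features=1.0, n_estimators=20)
bgc_lr.fit(X_train, y_train)
print(bgc_lr.score(X_train, y_train), bgc_lr.score(X_test, y_test))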
Above we have used logistic regression as the base learner to complete the ensemble. You can try training logistic regression on its own and compare the performance of the single model with the bagging version of logistic regression.
Boosting
Boosting refers to a family of algorithms that can turn a weak learner into a strong learner. The main idea of boosting is to combine a series of weak learners (each only slightly better than random guessing). Samples that are misclassified in the early stages of training receive more attention in later rounds. The predictions are then combined by weighted majority voting (for classification) or a weighted sum (for regression) to produce the final prediction.
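One way to set up the comparison discussed next is with AdaBoostClassifier, boosting first decision trees and then logistic regression (the depths and estimator counts below are assumptions):

# Boosting shallow decision trees
ada_tree = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3), n_estimators=50, random_state=0)
ada_tree.fit(X_train, y_train)
print(ada_tree.score(X_train, y_train), ada_tree.score(X_test, y_test))

# Boosting logistic regression base learners
ada_lr = AdaBoostClassifier(LogisticRegression(max_iter=1000), n_estimators=50, random_state=0)
ada_lr.fit(X_train, y_train)
print(ada_lr.score(X_train, y_train), ada_lr.score(X_test, y_test))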
Comparing the boosting versions of the decision tree and logistic regression, we find that logistic regression generalizes better, while the decision tree overfits more easily.
In fact, overfitting is not entirely a bad thing: if your model cannot overfit, it cannot fit the training data well either. The fact that the decision tree overfits heavily at first therefore also shows its potential. As you can see, after the parameters above are adjusted, the boosting version of the decision tree easily surpasses the boosting version of logistic regression.