Machine learning basics: using decision trees to predict coupon usage
To stay close to real-world applications, this exercise centers on processing an actual dataset: real online and offline consumption records from January 1, 2016 to June 30, 2016 are used to predict whether users who receive coupons in July 2016 will use them within 15 days. Note: to protect the privacy of users and businesses, all data has been anonymized, and biased sampling and necessary filtering have been applied.
Dataset: ccf_offline_stage1_train.csv (training data)
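The excerpt never shows how ccf_offline_stage1_train.csv is read in. A minimal loading sketch, assuming the column layout of the CCF offline training set and using a couple of made-up rows in place of the real file:

```python
import pandas as pd
from io import StringIO

# Stand-in for ccf_offline_stage1_train.csv: the column names follow the
# dataset description; the two rows are illustrative, not real records.
sample = StringIO(
    "User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date\n"
    "1,10,100,200:20,1,20160528,20160601\n"
    "2,11,200,150:20,0,20160516,null\n"
)
# keep_default_na=False keeps the literal 'null' strings this dataset uses,
# so later code can test row['Date'] != 'null' directly.
data = pd.read_csv(sample, keep_default_na=False)
print(data.shape)  # (2, 7)
```

With the real file, the StringIO stand-in would simply be replaced by the CSV path.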
data['Discount_rate'] = data['Discount_rate'].apply(getDiscountType)
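getDiscountType is applied above but never defined in the excerpt. One plausible sketch (a hypothetical helper, not the original) that separates 'x:y' full-reduction coupons from plain discount rates:

```python
def getDiscountType(rate):
    # Hypothetical helper: 'null' means no coupon was received;
    # 'x:y' (e.g. '200:20') means 20 off when spending 200;
    # anything else (e.g. '0.9') is a plain discount rate.
    if rate == 'null':
        return 'null'
    elif ':' in str(rate):
        return 1
    else:
        return 0
```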
# Load the required libraries
# Import the DecisionTreeClassifier model
from sklearn.tree import DecisionTreeClassifier
# Import train_test_split, used to split the data into training and test sets
from sklearn.model_selection import train_test_split
# Import the accuracy_score metric
from sklearn.metrics import accuracy_score
Add a label column to the dataset
Labeling marks which samples are positive (y = 1) and which are negative (y = 0). Prediction target: whether the user consumes within 15 days of receiving the coupon. (Date - Date_received <= 15) means the coupon was received and used within 15 days, i.e. a positive sample, y = 1; otherwise the coupon was not used within 15 days, i.e. a negative sample, y = 0.
pandas tutorial on dates and times:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
def label(row):
    if row['Date'] != 'null':
        td = pd.to_datetime(row['Date'], format='%Y%m%d') - pd.to_datetime(row['Date_received'], format='%Y%m%d')
        if td <= pd.Timedelta(15, 'D'):
            return 1
    return 0
data['label'] = data.apply(label, axis=1)
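The later calls to model.fit and model.predict use X_train, X_test, y_train, and y_test, but the split itself is not shown in the excerpt. A sketch using the imported train_test_split on a toy labeled frame (the feature columns chosen here are an assumption):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the labeled coupon data; values are illustrative only.
toy = pd.DataFrame({
    'Discount_rate': [1, 0, 1, 0, 1, 0],
    'Distance':      [0, 1, 2, 3, 4, 5],
    'label':         [1, 0, 1, 0, 1, 0],
})

X = toy[['Discount_rate', 'Distance']]  # assumed feature columns
y = toy['label']
# Hold out 20% of the rows for testing; fix random_state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
print(X_train.shape, X_test.shape)  # (4, 2) (2, 2)
```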
Set the model's split criterion (the measure used to select features) to entropy
model = DecisionTreeClassifier(criterion='entropy', random_state=1, max_depth=2)
Model training
model.fit(X_train, y_train)
Predict
y_pred = model.predict(X_test)
Evaluate
accuracy_score(y_test, y_pred)
Beyond the key steps above, you can explore the data on your own and try any other forms of feature preprocessing and feature engineering. The aim is to understand the development process of a machine learning task; for data-processing skills and methods, you are encouraged to invest more time in exploration.
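A fitted DecisionTreeClassifier can be dumped to Graphviz dot text with sklearn.tree.export_graphviz, which produces node descriptions of the gini/samples/value/class form. The coupon features are not available in this excerpt, so the sketch below fits the same kind of model on the iris dataset (an assumption, for illustration only):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
model = DecisionTreeClassifier(criterion='entropy', random_state=1, max_depth=2)
model.fit(iris.data, iris.target)

# out_file=None makes export_graphviz return the dot source as a string.
dot = export_graphviz(model, out_file=None,
                      feature_names=iris.feature_names,
                      class_names=iris.target_names, filled=True)
print(dot.splitlines()[0])  # digraph Tree {
```

The resulting string can be rendered with the graphviz `dot` tool or the python-graphviz package.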
import numpy as np
from collections import Counter
def entropy(elements):
    # Count each distinct value once; iterating over the raw elements would
    # add each value's term multiple times and overstate the entropy.
    counter = Counter(elements)
    probabilities = [count / len(elements) for count in counter.values()]
    return -sum(p * np.log2(p) for p in probabilities)
# Try every (feature, value) binary split and score it by entropy.
for f in x_fields:
    elements = set(training_data[f])
    for e in elements:
        sub_spliter_1 = training_data[training_data[f] == e][target].tolist()
        entropy_1 = entropy(sub_spliter_1)
        sub_spliter_2 = training_data[training_data[f] != e][target].tolist()
        entropy_2 = entropy(sub_spliter_2)
        entropy_v = entropy_1 + entropy_2
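Note that entropy_1 + entropy_2 treats both branches equally regardless of their sizes; the usual split criterion weights each branch's entropy by its share of the rows. A self-contained sketch of the weighted version (column names are illustrative):

```python
from collections import Counter

import numpy as np
import pandas as pd

def entropy(elements):
    counter = Counter(elements)
    probabilities = [count / len(elements) for count in counter.values()]
    return -sum(p * np.log2(p) for p in probabilities)

def weighted_split_entropy(df, field, value, target):
    # Score the binary split df[field] == value by the size-weighted
    # average of the two branches' entropies (lower is better).
    left = df[df[field] == value][target].tolist()
    right = df[df[field] != value][target].tolist()
    n = len(df)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

# A pure split scores 0; a split that separates nothing scores 1.
df = pd.DataFrame({'Distance': [0, 0, 1, 1], 'label': [1, 1, 0, 0]})
print(weighted_split_entropy(df, 'Distance', 0, 'label'))
```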