Logistic regression to diagnose heart disease

The project source code url: Heart

load data

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

data = pd.read_csv('./data/heart.csv')
# the csv url: https://github.com/hivandu/colab/blob/master/AI_Data/data/heart.csv

# Print a brief summary of the data set

data.info()
data.shape

data.target.value_counts()

The parameters' meanings

Params    Meaning
age       age in years
sex       sex (1 = male, 0 = female)
cp        chest pain type (1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic)
trestbps  resting blood pressure
chol      serum cholesterol
fbs       fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
restecg   resting electrocardiographic results (0 = normal, 1 = ST-T wave abnormality, 2 = probable or definite left ventricular hypertrophy by Estes' criteria)
thalach   maximum heart rate achieved
exang     exercise-induced angina (1 = yes; 0 = no)
oldpeak   ST depression induced by exercise relative to rest
slope     slope of the peak exercise ST segment (1: upsloping, 2: flat, 3: downsloping)
ca        number of major vessels seen on fluoroscopy
thal      defect type (3 = normal; 6 = fixed defect; 7 = reversible defect)
target    whether the patient has heart disease (0 = no, 1 = yes)

Perform analysis


# Change the "sex" column into two columns "sex_0" and "sex_1"
sex = pd.get_dummies(data['sex'], prefix = 'sex')

# Add "sex_0" and "sex_1" to the data set.
data = pd.concat([data, sex], axis = 1)


# And delete the sex column.
data = data.drop(columns = ['sex'])


# Print out the first five lines. Check whether sex_0, sex_1 are added successfully, and whether sex is deleted successfully.
data.head()

# Get sample label
data_y = data.target.values
data_y.shape

# Get sample feature set
data_x = data.drop(['target'], axis = 1)
data_x.shape

# Divide the data set
train_x, test_x, train_y, test_y = train_test_split(data_x, data_y, test_size = 0.3, random_state=33)

Normalization

# initialize
ss = StandardScaler()

# The fit function learns the standardization parameters (mean and standard deviation) from the training set
ss.fit(train_x)

# Standardize the training set and test set
train_x = ss.transform(train_x)
test_x = ss.transform(test_x)

# Define a logistic regression model
lr = LogisticRegression()
lr.fit(train_x, train_y)

# Calculate the training set score
lr.score(train_x, train_y)

# Calculate test set score
lr.score(test_x, test_y)

# Use the classification_report function to display a text report of the main classification indicators
predict = lr.predict(test_x)
print(classification_report(test_y, predict))

Foundation of Artificial Intelligence - Lecture 1

Algorithm --> Data Structure

No obvious solution ==> algorithm engineers handle it. If there is a clear implementation path ==> the person developing the project handles it.

What is an algorithm?

{Ace of hearts, 10 of spades, 3 of spades, 9 of hearts, 9 of clubs, 4 of diamonds, J}

Sorting rule - first: Hearts > Diamonds > Spades > Clubs; second: numbers are arranged from small to large.

  1. Some people group the suits together first
  2. Some people sort by number first, then extract the suits one by one (a sketch of both approaches follows below)
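
To make the two procedures concrete, here is a small sketch of my own (the suit-less J from the example is left out, and the suit ordering is just the rule stated above):

# Illustration only: sort the six fully specified cards from the example.
SUIT_ORDER = {'hearts': 0, 'diamonds': 1, 'spades': 2, 'clubs': 3}

hand = [('hearts', 1), ('spades', 10), ('spades', 3),
        ('hearts', 9), ('clubs', 9), ('diamonds', 4)]

# Approach 1: group by suit first, then sort the numbers inside each group.
by_suit = {}
for suit, rank in hand:
    by_suit.setdefault(suit, []).append(rank)
grouped = [(suit, sorted(ranks))
           for suit, ranks in sorted(by_suit.items(), key=lambda kv: SUIT_ORDER[kv[0]])]

# Approach 2: sort everything by number first, then do a stable sort by suit.
by_rank = sorted(hand, key=lambda card: card[1])
final = sorted(by_rank, key=lambda card: SUIT_ORDER[card[0]])

print(grouped)
print(final)

Both procedures end with the same ordered hand; the point is that different step-by-step procedures (algorithms) can solve the same problem.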

\[ 1024 \approx 10^3 \rightarrow 1\text{K} \]
\[ 1024 \times 1024 \approx 10^6 \rightarrow 1\text{M} \]
\[ 1024 \times 1024 \times 1024 \approx 10^9 \rightarrow 1\text{G} \]
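
These are the usual powers-of-two approximations; a one-line check in Python:

print(2 ** 10, 2 ** 20, 2 ** 30)  # 1024 1048576 1073741824, roughly 1K, 1M and 1G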

instruction-0  00011101
instruction-1  00011111
instruction-2  00011100
instruction-3  00011101
instruction-4  00011100
instruction-5  00011001

2.6 GHz (a 2.6 GHz clock means on the order of 2.6 billion instruction cycles per second)

def fac(n): # return n!
    if n == 1:
        return 1 # return operation
    else:
        return n * fac(n-1) # multiplication operation + return operation + function call
fac(1)
> 1

fac(100)
> 93326215443944152681699238856266700490715968264381621468592963895217599993229915608941463976156518286253697920827223758251185210916864000000000000000000000000

fac_100 = """93326215443944152681699238856266700490715968264381621468592963895217599993229915608941463976156518286253697920827223758251185210916864000000000000000000000000"""

len(fac_100)
> 158
How much work does fac(N) take? Each call to fac(N) costs one multiplication, one return, and one function call on top of fac(N-1); when N == 100, there are 99 such extra levels before the recursion reaches fac(1).

\[ Time(N) - Time(N-1) = constant \]
\[ Time(N-1) - Time(N-2) = constant \]
\[ Time(N-2) - Time(N-3) = constant \]
\[ \cdots \]
\[ Time(2) - Time(1) = constant \]

Summing all of these:

\[ Time(N) - Time(1) = (N-1) \cdot constant \]
\[ Time(N) = (N-1) \cdot constant + Time(1) \]
\[ Time(N) = N \cdot constant + (Time(1) - constant) \]
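
The recurrence says the cost grows linearly with N. A quick empirical check (my own sketch, not part of the lecture code) counts the recursive calls directly:

import sys
sys.setrecursionlimit(5000)

call_count = 0

def fac_counted(n):
    global call_count
    call_count += 1              # one call (and one multiplication/return) per level
    if n == 1:
        return 1
    return n * fac_counted(n - 1)

for n in (10, 100, 1000):
    call_count = 0
    fac_counted(n)
    print(n, call_count)         # 10 -> 10, 100 -> 100, 1000 -> 1000: linear in N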

SVM-based Text Classification in Practice

The source code: SVM-based Text Classification in Practice

The 'cnews.train.txt' data file is too large to upload directly, so it is stored compressed and needs to be decompressed before importing.

We implement a simple text classifier based on a bag-of-words representation and a support vector machine.

import data

# import
import codecs
import os
import jieba

Chinese news data is prepared as the sample data set. There are 50,000 training examples and 10,000 test examples, divided into 10 categories: sports, finance, real estate, home furnishing, education, technology, fashion, current affairs, games, and entertainment. Load the training text to view the data format and a sample:


data_train = './data/cnews.train.txt' # training data file name
data_test = './data/cnews.test.txt' # test data file name
vocab = './data/cnews.vocab.txt' # dictionary

with codecs.open(data_train, 'r', 'utf-8') as f:
    lines = f.readlines()

# print sample content
label, content = lines[0].strip('\r\n').split('\t')
content

Take the first item of the training data as an example and segment the loaded news text. Here I use jieba for word segmentation (LTP's segmenter would also work), and the segmentation result is displayed with words separated by "/".

# print word segment results
segment = jieba.cut(content)
print('/'.join(segment))

To tidy up the logic above, implement helper functions that load the training and test data and perform word segmentation.

# cut data
def process_line(idx, line):
    data = tuple(line.strip('\r\n').split('\t'))
    if not len(data)==2:
        return None
    content_segged = list(jieba.cut(data[1]))
    if idx % 1000 == 0:
        print('line number: {}'.format(idx))
    return (data[0], content_segged)

# data loading method
def load_data(file):
    with codecs.open(file, 'r', 'utf-8') as f:
        lines = f.readlines()
    data_records = [process_line(idx, line) for idx, line in enumerate(lines)]
    data_records = [data for data in data_records if data is not None]
    return data_records

# load and process training data
train_data = load_data(data_train)
print('first training data: label {} segment {}'.format(train_data[0][0], '/'.join(train_data[0][1])))
# load and process testing data
test_data = load_data(data_test)
print('first testing data: label {} segment {}'.format(test_data[0][0], '/'.join(test_data[0][1])))

After spending some time on word segmentation, you can start building a dictionary. The dictionary is built from the training set and sorted by word frequency.

def build_vocab(train_data, thresh):
    vocab = {'<UNK>': 0}
    word_count = {} # word frequency
    for idx, data in enumerate(train_data):
        content = data[1]
        for word in content:
            if word in word_count:
                word_count[word] += 1
            else:
                word_count[word] = 1
    word_list = [(k, v) for k, v in word_count.items()]
    print('word list length: {}'.format(len(word_list)))
    word_list.sort(key = lambda x : x[1], reverse = True) # sorted by word frequency
    word_list_filtered = [word for word in word_list if word[1] > thresh]
    print('word list length after filtering: {}'.format(len(word_list_filtered)))
    # construct vocab
    for word in word_list_filtered:
        vocab[word[0]] = len(vocab)
    print('vocab size: {}'.format(len(vocab))) # vocab size is the filtered word list size + 1 due to the <UNK> token
    return vocab

vocab = build_vocab(train_data, 1)

In addition, the category labels themselves also form a small "dictionary" that maps each category name to an id:

def build_label_vocab(cate_file):
    label_vocab = {}
    with codecs.open(cate_file, 'r', 'utf-8') as f:
        for lines in f:
            line = lines.strip().split('\t')
            label_vocab[line[0]] = int(line[1])
    return label_vocab

label_vocab = build_label_vocab('./data/cnews.category.txt')
print(f'label vocab: {label_vocab}')

Next, construct the id-based training and test sets. Because we only consider the bag of words, word order is discarded. The data is written in the format that libsvm expects (label followed by index:value pairs); note that since the bag-of-words model ignores order, each line only records how many times each token id occurs.

def construct_trainable_matrix(corpus, vocab, label_vocab, out_file):
    records = []
    for idx, data in enumerate(corpus):
        if idx % 1000 == 0:
            print('process {} data'.format(idx))
        label = str(label_vocab[data[0]]) # label id
        token_dict = {}
        for token in data[1]:
            token_id = vocab.get(token, 0)
            if token_id in token_dict:
                token_dict[token_id] += 1
            else:
                token_dict[token_id] = 1
        feature = [str(int(k) + 1) + ':' + str(v) for k,v in token_dict.items()]
        feature_text = ' '.join(feature)
        records.append(label + ' ' + feature_text)

    with open(out_file, 'w') as f:
        f.write('\n'.join(records))

construct_trainable_matrix(train_data, vocab, label_vocab, './data/train.svm.txt')
construct_trainable_matrix(test_data, vocab, label_vocab, './data/test.svm.txt')

Training process

The remaining core model is simple: use libsvm to train the support vector machine. Feed the training and test files you just produced to libsvm's existing training routine, trying different parameter settings. The libsvm documentation can be consulted for details; the "-s", "-t", and "-c" parameters are the most important ones, since they decide which SVM formulation you use, which kernel function you choose, and the penalty coefficient.

from libsvm import svm
from libsvm.svmutil import svm_read_problem,svm_train,svm_predict,svm_save_model,svm_load_model

# train svm
train_label, train_feature = svm_read_problem('./data/train.svm.txt')
print(train_label[0], train_feature[0])
model=svm_train(train_label,train_feature,'-s 0 -c 5 -t 0 -g 0.5 -e 0.1')

# predict
test_label, test_feature = svm_read_problem('./data/test.svm.txt')
print(test_label[0], test_feature[0])
p_labs, p_acc, p_vals = svm_predict(test_label, test_feature, model)

print('accuracy: {}'.format(p_acc))

After some training time, we can inspect the experimental results. You can try different SVM types, penalty coefficients, and kernel functions to improve them; see the sketch below.
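
For instance, here is a minimal sketch of comparing a few settings with libsvm's built-in n-fold cross-validation; the parameter grid is only an illustration, not a recommendation:

# Illustrative parameter sweep; with '-v', svm_train returns the
# cross-validation accuracy instead of a trained model ('-q' keeps it quiet).
for params in ['-s 0 -t 0 -c 1', '-s 0 -t 0 -c 5', '-s 0 -t 2 -c 5 -g 0.5']:
    cv_acc = svm_train(train_label, train_feature, params + ' -v 3 -q')
    print(params, '-> 3-fold CV accuracy:', cv_acc)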

Auto operation Weibo

The code address of this article is: auto operation weibo. ChromeDriver download: Taobao Mirror; the driver version needs to match your Chrome version.

login weibo

from selenium import webdriver
import time

driver = webdriver.Chrome('/Applications/chromedriver')

# login weibo
def weibo_login(username, password):

    # open weibo index
    driver.get('https://passport.weibo.cn/signin/login')
    driver.implicitly_wait(5)
    time.sleep(1)

    # fill the info: username, password
    driver.find_element_by_id('loginName').send_keys(username)
    driver.find_element_by_id('loginPassword').send_keys(password)
    time.sleep(1)

    # click login
    driver.find_element_by_id('loginAction').click()
    time.sleep(1)

# set username, password
username = 'ivandoo75@gmail.com'
password = 'ooxx'

# Mobile phone verification is required here, so the login still can't be fully automated
weibo_login(username, password)

follow user

def add_follow(uid):
    driver.get('https://m.weibo.com/u/' + str(uid))
    time.sleep(1)

    # driver.find_element_by_id('follow').click()
    follow_button = driver.find_element_by_xpath('//div[@class="btn_bed W_fl"]')
    follow_button.click()
    time.sleep(1)

    # select group
    group_button = driver.find_element_by_xpath('//div[@class="list_content W_f14"]/ul[@class="list_ul"]/li[@class="item"][2]')
    group_button.click()
    time.sleep(1)

    # cancel the select
    cancel_button = driver.find_element_by_xpath('//div[@class="W_layer_btn S_bg1"]/a[@class="W_btn_b btn_34px"]')
    cancel_button.click()
    time.sleep(1)

# UID of the account 每天学点心理学
uid = '1890826225'
add_follow(uid)

create text and publish

def add_comment(weibo_url, content):
    driver.get(weibo_url)
    driver.implicitly_wait(5)

    content_textarea = driver.find_element_by_css_selector('textarea.W.input').clear()
    content_textarea = driver.find_element_by_css_selector('textarea.W.input').send_keys(content)

    time.sleep(2)

    comment_button = driver.find_element_by_css_selector('.W_btn_a').click()

# post the text
def post_weibo(content):
    # go to the user index
    driver.get('https://weibo.com')
    driver.implicitly_wait(5)

    # click publish button
    # post_button = driver.find_element_by_css_selector('[node-type="publish"]').click()

    # input content word to textarea
    content_textarea = driver.find_element_by_css_selector('textarea.W_input[node-type="textEl"]').send_keys(content)
    time.sleep(2)

    # click publish button
    post_button = driver.find_element_by_css_selector("[node-type='submit']").click()
    time.sleep(1)

# comment the weibo
weibo_url = 'https://weibo.com/1890826225/HjjqSahwl'
content = 'here is Hivan du, Best wish to u.'

# auto send weibo
content = 'Learning is a belief!'
post_weibo(content)

Boston house analysis

The source code: Boston House

# Import package
# Used to load the Boston housing price data set
from sklearn.datasets import load_boston
# pandas toolkit If you are unfamiliar with pandas, you can refer to the official 10-minute tutorial: https://pandas.pydata.org/pandas-docs/stable/10min.html
import pandas as pd
import numpy as np
# seaborn for drawing
import seaborn as sns
import matplotlib.pyplot as plt
# Show drawing
%matplotlib inline


data = load_boston() # load dataset

data.keys() # Fields inside data

df = pd.DataFrame(data['data'])

# Looking at the first 5 rows of the dataframe, we can see that the column names are numbers
df.head(5)

data['feature_names'] # Feature name

The table of parameters and their meanings

params    meaning
CRIM      per-capita crime rate of the town
ZN        proportion of residential land zoned for lots over 25,000 sq. ft.
INDUS     proportion of non-retail business acres per town
CHAS      Charles River dummy variable (1 if the tract bounds the river; 0 otherwise)
NOX       nitric oxide concentration
RM        average number of rooms per dwelling
AGE       proportion of owner-occupied units built before 1940
DIS       weighted distance to five Boston employment centers
RAD       index of accessibility to radial highways
TAX       full-value property tax per $10,000
PTRATIO   pupil-teacher ratio of the town
B         1000(Bk - 0.63)^2, where Bk is the proportion of Black residents in the town
LSTAT     percentage of lower-status population
MEDV      median value of owner-occupied homes (in thousands of dollars)
# Replace numeric column names with feature names
df.columns = data['feature_names']
df.head(5)

# The target is the house price, which is also our target value. We assign the target value to the dataframe
df['price'] = data['target']
df.head(5)

# View the correlation coefficient between the feature and price, positive correlation and negative correlation
sns.heatmap(df.corr(), annot=True, fmt='.1f')

plt.scatter(df['RM'], df['price'])


plt.figure(figsize=(20, 5))

# View the data distribution display of some features and price
features = ['LSTAT', 'RM']
target = df['price']

for i, col in enumerate(features):
    plt.subplot(1, len(features), i+1)
    x = df[col]
    y = target
    plt.scatter(x, y, marker = 'o')
    plt.title('{} price'.format(col))
    plt.xlabel(col)
    plt.ylabel('price')


# Simple example: univariate forecast price
x = df['RM']
y = df['price']

history_notes = {_x: _y for _x, _y in zip(x,y)}

history_notes[6.575]


# Find the top three prices that are closest to RM:6.57,
similary_ys = [y for _, y in sorted(history_notes.items(), key=lambda x_y: (x_y[0] - 6.57) ** 2)[:3]]
similary_ys


# Calculate the average of three
np.mean(similary_ys)

Using historical data to predict data that has never been seen before is the most direct method.

K-Nearest-Neighbors

def knn(query_x, history, top_n = 3):
    sorted_notes = sorted(history.items(), key = lambda x_y: (x_y[0] - query_x)**2)
    similar_notes = sorted_notes[:top_n]
    similar_ys = [y for _, y in similar_notes]

    return np.mean(similar_ys)

knn(5.4, history_notes)

In order to obtain results faster, we hope to gain predictive power by fitting a function

\[ f(rm) = k * rm + b \]

Random Approach

\[ Loss(k, b) = \frac{1}{n} \sum_{i \in N} (\hat{y_i} - y_i) ^ 2 \] \[ Loss(k, b) = \frac{1}{n} \sum_{i \in N} ((k * rm_i + b) - y_i) ^ 2 \]

def loss(y_hat, y):
    return np.mean((y_hat - y)**2)

import random

min_loss = float('inf')

best_k, best_b = None, None


for step in range(1000):
    min_v, max_v = -100, 100
    k, b = random.randrange(min_v, max_v), random.randrange(min_v, max_v)
    y_hats = [k * rm_i + b for rm_i in x]
    current_loss = loss(y_hats, y)

    if current_loss < min_loss:
        min_loss = current_loss
        best_k, best_b = k, b
        print(f'{step}, we have func f(rm) = {k} * rm + {b}, loss is: {current_loss}')

plt.scatter(x, y)
plt.scatter(x, [best_k * rm + best_b for rm in x])

Monte Carlo simulation

Supervised approach (gradient descent)

\[ Loss(k, b) = \frac{1}{n} \sum_{i \in N} ((k * rm_i + b) - y_i) ^ 2 \]

\[ \frac{\partial{loss(k, b)}}{\partial{k}} = \frac{2}{n}\sum_{i \in N}(k * rm_i + b - y_i) * rm_i \]

\[ \frac{\partial{loss(k, b)}}{\partial{b}} = \frac{2}{n}\sum_{i \in N}(k * rm_i + b - y_i)\]

def partial_k(k, b, x, y):
    return 2 * np.mean((k*x+b-y) * x)

def partial_b(k, b, x, y):
    return 2 * np.mean(k*x+b-y)

k, b = random.random(), random.random()
min_loss = float('inf')

best_k, best_b = None, None
learning_rate = 1e-2

for step in range(2000):
    k, b = k + (-1 * partial_k(k, b, x, y) * learning_rate), b + (-1 * partial_b(k, b, x, y) * learning_rate)
    y_hats = k * x + b
    current_loss = loss(y_hats, y)

    if current_loss < min_loss:
        min_loss = current_loss
        best_k, best_b = k, b
        print(f'step {step}, we have func f(rm) = {k} * rm + {b}, loss is: {current_loss}')

best_k, best_b


plt.scatter(x, y)
plt.scatter(x, [best_k * rm + best_b for rm in x])

Supervised Learning

What if we want to turn the housing-price forecast into a more expressive and sophisticated model? What should we do?

\[ f(x) = k * x + b \]

\[ f(x) = k_2 * \sigma(k_1 * x + b_1) + b_2 \]

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]

def sigmoid(x):
    return 1 / (1+np.exp(-x))

sub_x = np.linspace(-10, 10)
plt.plot(sub_x, sigmoid(sub_x))


def random_linear(x):
    k, b = random.random(), random.random()
    return k * x + b

def complex_function(x):
    return (random_linear(x))

for _ in range(10):
    index = random.randrange(0, len(sub_x))
    sub_x_1, sub_x_2 = sub_x[:index], sub_x[index:]
    new_y = np.concatenate((complex_function(sub_x_1), complex_function(sub_x_2)))
    plt.plot(sub_x, new_y)

We can implement more complex functions by repeatedly superimposing simple, basic modules; a small sketch of such a composition follows below.
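
For example, here is a minimal sketch of the composed form f(x) = k2 * σ(k1 * x + b1) + b2 from the formula above; the parameter values are arbitrary, chosen only to show that the composition already bends the straight line:

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def composed(x, k1, b1, k2, b2):
    # f(x) = k2 * sigmoid(k1 * x + b1) + b2
    return k2 * sigmoid(k1 * x + b1) + b2

sub_x = np.linspace(-10, 10)
plt.plot(sub_x, composed(sub_x, 1.5, -2.0, 3.0, 0.5))
plt.plot(sub_x, composed(sub_x, -0.8, 1.0, 2.0, -1.0))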

For more and more complex functions, how does the computer know in which direction to update?

  1. What is machine learning?
  2. The shortcomings of the KNN method, and the motivation for proposing linear fitting
  3. How to update the function weights faster through a supervised (gradient) method
  4. The combination of nonlinear and linear functions can fit very complex functions
  5. Deep learning: we can fit ever more complex functions by stacking basic function modules

Assignment:

\[ L2-Loss(y, \hat{y}) = \frac{1}{n}\sum{(\hat{y} - y)}^2 \]

\[ L1-Loss(y, \hat{y}) = \frac{1}{n}\sum{|(\hat{y} - y)|} \]

Change the L2 loss into the L1 loss and make gradient descent work with it.

Implement L1-loss gradient descent from scratch.

1. import package

import numpy as np
import pandas as pd

2. load data

from sklearn.datasets import load_boston
data = load_boston()
data.keys()

data_train = data.data
data_traget = data.target

df = pd.DataFrame(data_train, columns = data.feature_names)
df.head()

df.describe() # Data description, you can view the statistics of each variable

3. Data preprocessing

When there are many dimensions, normalization or standardization prevents one or a few dimensions from dominating the data, and it also lets the program run faster. There are many methods, such as min-max scaling, z-score standardization, p-norm scaling, and so on; which one to use depends on the characteristics of the data set. A small sketch of two of these methods follows below.

Further reading: 数据标准化的迷思之深度学习领域 (the myths of data standardization in the deep learning field)
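
For reference, a minimal sketch of min-max scaling and z-score standardization on a toy array (NumPy only; the actual pipeline below uses sklearn's StandardScaler):

import numpy as np

v = np.array([1.0, 2.0, 5.0, 10.0])

min_max = (v - v.min()) / (v.max() - v.min())   # rescale to [0, 1]
z_score = (v - v.mean()) / v.std()              # zero mean, unit standard deviation

print(min_max)   # [0.         0.11111111 0.44444444 1.        ]
print(z_score)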

from sklearn.preprocessing import StandardScaler
# z = (x-u) / s u is the mean, s is the standard deviation
ss = StandardScaler()
data_train = ss.fit_transform(data_train)
# Linear models generally require normalization or standardization, otherwise gradients can explode; tree models generally do not need it
data_train = pd.DataFrame(data_train, columns = ['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT'])
data_train.describe()

# y = Σ w_i * x_i + b
# Because the partial derivative with respect to b is always 1, add a bias column fixed to 1 as an extra feature; its weight then plays the role of b and is updated with the same gradient rule as the other w_i
data_train['bias'] = 1
data_train

Divide the data set: 20% of the data is used as the test set (test_x, test_y) and the other 80% as the training set (train_x, train_y); random_state is the random seed.

from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(data_train, data_traget, test_size = 0.2, random_state=42)

print('train_x.shape, train_y.shape', train_x.shape, train_y.shape)
print('test_x.shape, test_y.shape', test_x.shape, test_y.shape)

train_x = np.array(train_x)

Model training and gradient update

def l1_cost(x, y, theta):
    """
    x: features
    y: target values
    theta: model parameters
    """
    k = x.shape[0]
    total_cost = 0
    for i in range(k):
        total_cost += 1/k * np.abs(y[i] - theta.dot(x[i, :]))
    return total_cost

def l2_cost(x, y, theta):
    k = x.shape[0]
    total_cost = 0
    for i in range(k):
        total_cost += 1/k * (y[i] - theta.dot(x[i, :])) ** 2
    return total_cost

np.zeros(10).shape

def step_l1_gradient(x, y, learning_rate, theta):
    """
    Calculate the gradient of the MAE (L1) loss function.
    At the non-differentiable point 0, the returned (sub)gradient is 0.
    x: feature vectors
    y: target values
    learning_rate: learning rate
    theta: parameters
    """
    n = x.shape[0]
    # print(n)
    e = y - x @ theta
    gradients = - (x.T @ np.sign(e)) / n # sign is the sign function
    theta = theta - learning_rate * gradients
    return theta

def step_l2_gradient(x, y, learning_rate, theta):
    k = x.shape[0]
    n = x.shape[1]
    gradients = np.zeros(n)
    for i in range(k):
        for j in range(n):
            gradients[j] += (-2/k) * (y[i] - (theta.dot(x[i, :]))) * x[i, j]
    theta = theta - learning_rate * gradients
    return theta

# def step_gradient(X, y, learning_rate, theta):
#     """
#     X: feature vectors
#     y: target values
#     learning_rate: learning rate
#     theta: parameters
#     """
#     m_deriv = 0
#     N = len(X)
#     for i in range(N):
#         # compute the partial derivative
#         # -x(y - (mx + b)) / |mx + b|
#         m_deriv += - X[i] * (y[i] - (theta*X[i] + b)) / abs(y[i] - (theta*X[i] + b))
#     # We subtract because the derivatives point in direction of steepest ascent
#     theta -= (m_deriv / float(N)) * learning_rate
#     # theta = theta - learning_rate * gradients
#     return theta

def gradient_descent(train_x, train_y, learning_rate, iterations):
    k = train_x.shape[0]
    n = train_x.shape[1]
    theta = np.zeros(n) # initialize parameters

    loss_values = []
    # print(theta.shape)

    for i in range(iterations):
        theta = step_l1_gradient(train_x, train_y, learning_rate, theta)
        loss = l1_cost(train_x, train_y, theta)
        loss_values.append(loss)
        print(i, 'cost:', loss)
    return theta, loss_values

# Training parameters
learning_rate = 0.04 # learning rate
iterations = 300 # number of iterations
theta, loss_values = gradient_descent(train_x, train_y, learning_rate, iterations)
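
As a quick sanity check (my addition, not part of the original assignment solution), the learned theta can be evaluated on the held-out test split:

# Evaluate the learned parameters on the test split created earlier.
test_x_arr = np.array(test_x)
predictions = test_x_arr @ theta          # linear model, bias column included
mae = np.mean(np.abs(predictions - test_y))
print('test MAE:', mae)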

Boston house price CART regression tree

On to the code

# CART regression tree prediction
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.tree import DecisionTreeRegressor, export_graphviz
import graphviz
import graphviz

# Prepare data set
boston = load_boston()

# Explore data
print(boston.feature_names)

# Get feature set and price
features = boston.data
prices = boston.target


# Randomly extract 33% of the data as the test set, and the rest as the training set
train_features, test_features, train_price, test_price = train_test_split(features,prices,test_size=0.33)

# Create CART regression tree
dtr = DecisionTreeRegressor()

# Fitting and constructing CART regression tree
dtr.fit(train_features, train_price)

# Predict housing prices in the test set
predict_price = dtr.predict(test_features)

grap_data = export_graphviz(dtr, out_file=None)
graph = graphviz.Source(grap_data)

# Result evaluation of test set
print('Regression tree mean squared deviation:', mean_squared_error(test_price, predict_price))
print('Regression tree absolute value deviation mean:', mean_absolute_error(test_price, predict_price))

# Generate regression tree visualization
graph.render('Boston')

!> Before running this code, please ensure that the relevant dependencies have been installed;
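
Since r2_score is already imported in the block above, one more possible evaluation line (my addition, not in the original) is the coefficient of determination on the test set:

print('Regression tree R^2:', r2_score(test_price, predict_price))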

Digits recognition

The code address of this article is: digit recognition

Convolution operation demo

import pylab
import numpy as np
from scipy import signal

# set img
img = np.array([[10, 10, 10, 10, 10],[10, 5, 5, 5, 10], [10, 5, 5, 5, 10], [10, 5, 5, 5, 10], [10, 10, 10, 10, 10]])

# set convolution
fil = np.array([[-1, -1, 0], [-1, 0, 1], [0, 1, 1]])

# convolution the img
res = signal.convolve2d(img, fil, mode='valid')

# output the result
print(res)

output

[[ 15  10   0]
 [ 10   0 -10]
 [  0 -10 -15]]
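
As a sanity check (my own verification, not in the original), convolve2d rotates the kernel by 180 degrees before sliding it, so the top-left output value can be reproduced by hand:

import numpy as np

img_patch = np.array([[10, 10, 10], [10, 5, 5], [10, 5, 5]])  # top-left 3x3 patch of img
fil = np.array([[-1, -1, 0], [-1, 0, 1], [0, 1, 1]])

flipped = fil[::-1, ::-1]              # 180-degree rotation used by convolve2d
print(np.sum(img_patch * flipped))     # 15, matching res[0, 0]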

An image demo

import matplotlib.pyplot as plt
import pylab
import cv2
import numpy as np
from scipy import signal

# read the image
img = cv2.imread('./data/weixin.jpg', 0) # Any picture

# show the image
plt.imshow(img, cmap='gray')
pylab.show()

# set the convolution
fil = np.array([[-1,-1,0], [-1, 0, 1], [0, 1, 1]])

# convolution operation
res = signal.convolve2d(img, fil, mode='valid')
print(res)

# show convolution image
plt.imshow(res, cmap = 'gray')
pylab.show()

Use the LeNet model to recognize MNIST handwritten digits

import keras
from keras.datasets import mnist
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Dense, Flatten
from keras.models import Sequential
import warnings
warnings.filterwarnings('ignore')

# load data
(train_x, train_y), (test_x, test_y) = mnist.load_data()

train_x = train_x.reshape(train_x.shape[0], 28, 28, 1)
test_x = test_x.reshape(test_x.shape[0], 28, 28, 1)
train_x = train_x / 255
test_x = test_x / 255

train_y = keras.utils.to_categorical(train_y, 10)
test_y = keras.utils.to_categorical(test_y, 10)

# create sequential models
model = Sequential()

# The first convolutional layer: 6 convolution kernels, the size is 5*5, relu activation function
model.add(Conv2D(6, kernel_size = (5,5), activation='relu', input_shape=(28, 28, 1)))

# the second pooling layer: maximum pooling
model.add(MaxPooling2D(pool_size = (2, 2)))

# the third convolutional layer: 16 convolution kernels, the size is 5*5, relu activation function
model.add(Conv2D(16, kernel_size = (5, 5), activation = 'relu'))

# the fourth pooling layer: maximum pooling
model.add(MaxPooling2D(pool_size = (2, 2)))

# Flatten the parameters. In LeNet-5 this stage is described as a convolutional layer, but in practice it is just a one-dimensional vector feeding fully connected layers
model.add(Flatten())
model.add(Dense(120, activation = 'relu'))

# Fully connected layer, the number of output nodes is 84
model.add(Dense(84, activation = 'relu'))

# The output layer uses the softmax activation function to calculate the classification probability
model.add(Dense(10, activation='softmax'))

# set the loss function and optimizer configuration
model.compile(loss = keras.metrics.categorical_crossentropy, optimizer = keras.optimizers.Adam(), metrics = ['accuracy'])

# Incoming training data for training
model.fit(train_x, train_y, batch_size = 128, epochs = 2, verbose = 1, validation_data = (test_x, test_y))

# Evaluate the results
score = model.evaluate(test_x, test_y)
print('Error: %.4lf' % score[0])
print('Accuracy: ', score[1])
Train on 60000 samples, validate on 10000 samples
Epoch 1/2
60000/60000 [==============================] - 37s 616us/step - loss: 0.3091 - accuracy: 0.9102 - val_loss: 0.1010 - val_accuracy: 0.9696
Epoch 2/2
60000/60000 [==============================] - 36s 595us/step - loss: 0.0876 - accuracy: 0.9731 - val_loss: 0.0572 - val_accuracy: 0.9814
10000/10000 [==============================] - 3s 328us/step
Error: 0.0572
Accuracy: 0.9814000129699707

A collection of data sources and research reports for analysis

The purpose is to make it easier to find specific sources when doing data analysis on your own.

1. Data report lists from domestic consulting agency websites

2. Open data from national institutions

  1. China Academy of Information and Communications Technology (中国信通院) - research results, authoritative releases, authoritative data
  2. China Association of Metros (中国城市轨道交通协会) - ridership data for urban metro lines
  3. National public-service lookups (including the 5A scenic spot list, the small and micro enterprise directory, legal-person credit lookup, and taxi information lookup)
  4. National macroeconomic data (GDP, CPI, total population, total retail sales of consumer goods, grain output, PPI, regional administrative divisions, regional fiscal revenue and expenditure, and so on, by month, quarter, and year)
  5. National Bureau of Statistics (a staggering amount of data: annual, quarterly, national, international, and yearbook figures covering almost every indicator and its history)
  6. The World Bank's open databases (public data on health, agriculture, the public sector, population, external debt, gender, education, environment, climate change, energy, poverty, and more)
  7. World Data Atlas (world and regional statistics, country data, maps, and rankings), covering countries' open data worldwide
  8. Open data from state ministries and commissions (including the National Development and Reform Commission, Ministry of Education, Ministry of Civil Affairs, Ministry of Justice, Ministry of Finance, Ministry of Industry and Information Technology, Ministry of Transport, Ministry of Culture and Tourism, and others)
  9. Open data from individual cities (including the Zhejiang, Qingdao, Guiyang, Chengdu, Hefei, and Henan open data platforms, among others)
  10. Macroeconomic data lookups (including university finance databases, the People's Daily economic database, the Hong Kong Census and Statistics Department, the UN Statistics Division, the OECD, Eurostat, the IMF, and others)
  11. Housing price data (including China housing price indices, price trends, Taiwan housing prices, Beijing price lookups, Shenzhen transaction lookups, Shanghai metro-area prices, the Beike index, and so on)
  12. Automotive data (including the China Association of Automobile Manufacturers, Baidu Auto, the Yiche auto index, auto sales channels, the China Automobile Dealers Association data center, the German Association of the Automotive Industry, and others)
  13. Authoritative releases | China National Commercial Information Center (中华全国商业信息中心)

3. Data report websites of domestic internet companies

  1. Lecture slides - Tencent Lecture Hall (腾讯大讲堂)
  2. Tencent - earnings reports
  3. Tencent Big Data - industry reports from Tencent Cloud data analysis
  4. Baidu Open Service Platform - industry reports from Baidu Cloud data analysis
  5. Baidu Data Research Center - industry research and analysis reports
  6. AliResearch (阿里研究院) - Alibaba industry research reports
  7. Penguin Intelligence (企鹅智酷) - industry reports produced by Tencent
  8. Tencent CDC - Tencent interaction design reports
  9. Baidu User Experience Center - Baidu UED user research reports
  10. NetEase User Experience Design Center - NetEase UED user research reports
  11. Online video data reports - Youku index industry reports
  12. PP Index (PPTV 聚力) - PPTV index industry reports
  13. 360 research reports (360 Security Center) - reports from the 360 app store and other 360 products

4. Data report lists from overseas consulting agencies

Overseas consulting agencies are numerous and their data is detailed. Whether or not you work on products going overseas, their reports often contain focused research on Asia and China, so the relevant reports and trend analyses are worth a look.

5. Reports released from time to time by major companies (useful when drilling into a specific segment)

  1. AutoNavi (高德地图): 2015 traffic analysis report for major Chinese cities
  2. WeChat City Services: "2015 WeChat Government Affairs and Public Services White Paper"
  3. [Report] Taobao's 2015 China consumption trend data: what did we pay for in 2015?
  4. The first data analysis handbook for internet growth - GrowingIO's public handbook
  5. Mobile game operations data analysis metrics white paper (part 1) - TalkingData's operations metrics white paper
  6. Duoduo University (多多大学 also shares a lot of Pinduoduo operations data and offers courses worth a look)

6. Ask people inside the industry for first-hand information

  1. Follow information sources that specialize in industry insider news
    1. A recommended WeChat official account: 晚点 LatePost (search for the account in WeChat to follow it)
    2. It mainly covers internal and important industry news
  2. Book experts on the Zaihang (在行) app
    1. If you want information about certain companies, you can find experts on Zaihang who have personally run related projects or operated competing products in the industry, meet them, and get both information and methodology

7. Company information reports

  1. 新三板在线 (NEEQ Online) - China's largest NEEQ ecosystem platform (financial and executive data for NEEQ-listed companies across industries)
  2. Qichacha (企查查) - company lookup (products, brands, and legal-representative information)
  3. Company registration lookup (Tianyancha, similar to Qichacha)
  4. SEC.gov | Home (annual financial reports of US-listed companies)
  5. Cninfo (巨潮资讯网) - quarterly and annual financial reports of Chinese listed companies
  6. Baidu | Investors (quarterly reports of major listed companies; investor-relations sites follow the pattern ir.<company>.com, Baidu's being one example)
  7. Tianyancha (天眼查) - detailed company information, including headcount

8. Crawling data from websites or apps

I recently found another good source of industry information: crawling data from inside a website or app. Data obtained through this channel usually helps you understand how an industry or a competitor's site is used, what content users like, how users are distributed, and their behavior and preferences.

A crawler, simply put, is a program that fetches web pages, organizes them into a database, and then mines the data to reach analytical conclusions. For example, you can crawl shopping pages to learn which products sell well, or crawl Xiaohongshu pages to learn which KOLs are popular, crawl their categories to find that beauty and shopping KOLs perform well, and count how many such KOLs there are. If you do not have access to the other side's database (and of course you don't), crawling from the outside is the best way to understand their business data.

  1. The usual way to search: "<the website/app you want to study> + 爬虫 (crawler)" on a search engine such as Baidu

  2. Some developer-focused sites worth searching within:

    CSDN: search "<website/app> + 爬虫" within the site

    简书 (Jianshu): search "<website/app> + 爬虫" within the site

    V2EX: search "<website/app> + 爬虫" within the site

    掘金 (Juejin): search "<website/app> + 爬虫" within the site

9. Industry WeChat groups

Many good internal or otherwise hard-to-find reports are obtained by joining curated content groups and insider groups.

For example, people doing livestream e-commerce pay close attention to detailed data and trend reports on livestream selling, so they form their own sharing groups, and any report available on the market or discovered internally gets posted there.

One example is the several groups gathered by Zhao Yuanyuan, the former head of Taobao Live, after he left Taobao to start his own business; they contain a great deal of useful material on livestreaming.

There are also many similar groups on investing, trends, and startups, where reports are very first-hand; you can build such group organizations yourself as well.

10. Search engines

Search engines can still find many of the specific reports and trends you are looking for. I used to think using a search engine was trivial, but I later realized it also takes learning and practice before it really works for you.

How to use search engines

11. Financial reports of major companies

For listed companies, financial reports usually contain the most comprehensive information: users, monetization, channels, growth, business strategy, and so on. So if you want to understand a listed company, read its financial reports or SEC filings first.

Many people ask me where to find financial reports and how to read them. In fact, every company has its own IR (investor relations) page with complete PDF downloads of its reports. I also recommend listening to each quarter's conference call (where financial-report questions are answered) to hear the CEO's interpretation of the results.

Here are a few IR websites of large companies.

If there is a company you want to learn about, search "company name + IR" on Baidu or Google to locate its investor-relations page. On that page, look for the conference call or webcast to find the audio commentary on the results.

12. Statistics websites of investment institutions (useful when choosing a startup direction or making investment and financing decisions)

  1. IT 桔子 (ITjuzi) | database of IT/internet companies and business information services (Chinese startup financing data and reports)
  2. ChinaVenture Research Institute (投中研究院) - quarterly industry financing reports, plus occasional special analyses
  3. CB Insights - Blog (a series of CB Insights products, including company valuations and the unicorn list)
  4. The Downround Tracker (trends in declining company valuations)
  5. The Complete List of Unicorn Companies
  6. IPO Center: IPO Market, IPO News, IPO Calendars, IPO Pricings, IPO Voting (IPO news and trend reports)
  7. PrivCo | Private Company Financial Intelligence (a US financial data company focused on financing data for private companies worldwide, including China)
  8. Brokerage industry research reports (industry and strategy reports from domestic brokers, filterable by industry and report type)
  9. https://pitchbook.com/news/reports (PitchBook's PE, VC, and M&A industry reports)
  10. ChinaVenture Research Institute (投中研究院) - IPO and financing industry reports
  11. Dow Jones VentureSource 2Q'16 U.S. Venture Capital Report (investment industry reports from Dow Jones LP Source)
  12. NVCA Venture Investment (the US National Venture Capital Association publishes quarterly and annual investment industry reports)
  13. PwC MoneyTree Home (PwC's MoneyTree report covers the US venture capital industry each quarter)
  14. https://home.kpmg.com/xx/en/home/insights.html (KPMG insights reports, usually quarterly venture trend analyses with fairly detailed breakdowns)
  15. Mattermark - Discover, Enrich, & Analyze Companies (one-stop search for startup investment and M&A information)
  16. M&A, Private Equity & Venture Capital Database (one-stop search for startup investment and M&A information)
  17. DataFox | Prospect Sales Leads with Company Signals (one-stop search for startup investment and M&A information)
  18. CrunchBase accelerates innovation by bringing together data on companies and the people behind them (startup company database)
  19. Venture Intelligence PE/VC database
  20. Stock Market Insights | Seeking Alpha (secondary-market financial analysis site)
  21. Tencent Holdings Ltd -- Trefis (forecasts of each company's revenue model and key-driver trends; this site is excellent)

13. A local database of your own

There is a lot of useful information in the world, and search engines only cover about 20% of it. The other 80% sits in various corners, including WeChat groups, QQ, and even livestreams, none of which search engines reach.

As for search engines themselves, most people currently use less than 5% of what they can do. Search skills can be improved, but the other 80% of information channels are more hidden and cannot be obtained publicly.

I have joined many groups that are full of these reports and all kinds of information from every industry; this is something search engines cannot provide.

It is this information below the surface of the iceberg that determines whether the information you obtain is distinctive and of high quality.

Beyond the channels above, the best practice is to find reliable channels and save useful reports as you come across them, so they are at hand whenever you need them. Here is a curated folder of high-quality reports I have collected recently, which I hope is useful to you as well (updated from time to time).

Sharing the folder list of the best reports I have read

14. How to raise the level and value of the information you obtain

Finding industry reports is only one level of information acquisition. Whether the information you obtain is more valuable and more directly usable depends on the fundamentals: identifying, acquiring, accumulating, and distilling industry reports, which is very important.

But the further up you go, the closer you get to information that is more valuable, fresher, more genuine, and more direct. There are many channels for obtaining more information, not only industry reports but also methods under your own control, such as crawling, data mining, and information technology, as well as personal networks, circles, and insider channels. If you are interested, see the answer below for my interpretation of the methods at each level.

Which channels provide knowledge and information that most people don't know about

15. Others (updated irregularly)

Automatically switching the Node version for a specific project

nvm, as a Node version manager, does not switch versions automatically. Sometimes we need to switch the current Node version for a particular project, and that requires another tool, such as avn.

Example project: project

Since Node was updated to version 10, I switched my system default to 10 (the "must update or it feels wrong" compulsion), but project has to be compiled with version 8, otherwise compilation errors occur.

$ brew install nvm
$ npm i -g avn
$ avn setup

Then add a file named .node-version to the root directory of project.
