for i in range(1000):
    k = k + (-1) * partial(k) * alpha
    print(k, loss(k))
# out """ 7.959 124.32404299999999 -7.918246 122.66813714954799 show more (open the raw output data in a text editor) ... -1.1833014444482555 -14.082503185837805 """
for i, p in enumerate(original_price):
    price[i+1] = p
from functools import wraps

def memo(func):
    cache = {}
    @wraps(func)
    def _wrap(n):
        if n in cache:
            result = cache[n]
        else:
            result = func(n)
            cache[n] = result
        return result
    return _wrap
for i in range(100):
    x_star = x_star + -1 * gradient(x_star) * alpha
    steps.append(x_star)
ic(x_star, func(x_star))
fig, ax = plt.subplots()
ax.plot(x, func(x))
""" ic| x_star: 9.368, func(x_star): 1186.3702400000002 ic| x_star: 9.14864, func(x_star): 1138.732618496 show more (open the raw output data in a text editor) ... ic| x_star: -0.1157435825983131, func(x_star): 5.430171125980905 [<matplotlib.lines.Line2D at 0x7fd6d19545d0>] """
for i, s in enumerate(steps):
    ax.annotate(str(i+1), (s, func(s)))
mpl.rcParams['font.sans-serif'] = ['FangSong']  # specify the default font
mpl.rcParams['axes.unicode_minus'] = False  # keep the minus sign '-' from rendering as a square in saved images
K = 5
centers = {'{}'.format(i+1): get_random_center(all_x, all_y) for i in range(K)}
from collections import defaultdict
closest_points = defaultdict(list)
for x, y in zip(all_x, all_y):
    closest_c, closest_dis = min(
        [(k, geo_distance((x, y), centers[k])) for k in centers],
        key=lambda t: t[1]
    )
K = 5  # the original reads "K = k"; 5 is assumed here, matching the earlier cell
centers = {'{}'.format(i+1): get_random_center(all_x, all_y) for i in range(K)}
changed = True

while changed:
    closest_points = defaultdict(list)
    for x, y in zip(all_x, all_y):
        closest_c, closest_dis = min(
            [(k, geo_distance((x, y), centers[k])) for k in centers],
            key=lambda t: t[1]
        )
        closest_points[closest_c].append([x, y])
    # (the rest of the loop, not shown here, recomputes each center from its
    #  assigned points and sets changed = False once the centers stop moving)
This dataset contains news headlines published over a period of 15 years by the reputable Australian news source ABC (Australian Broadcasting Corp.). Site: http://www.abc.net.au/. Prepared by Rohit Kulkarni.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
data = pd.read_csv("./data/abcnews-date-text.csv", error_bad_lines=False, usecols=["headline_text"])
data.head()
# output """ headline_text 0 aba decides against community broadcasting lic... 1 act fire witnesses must be aware of defamation 2 a g calls for infrastructure protection summit 3 air nz staff in aust strike for pay rise 4 air nz strike to affect australian travellers """
data[data['headline_text'].duplicated(keep=False)].sort_values('headline_text').head(8)
data = data.drop_duplicates('headline_text')
Preparing data for NLP vectorization
However, when doing natural language processing, words must be
converted into vectors that machine learning algorithms can make use of.
If your goal is to do machine learning on text data, like movie reviews
or tweets or anything else, you need to convert the text data into
numbers. This process is sometimes referred to as “embedding” or
“vectorization”.
In terms of vectorization, it is important to remember that it isn’t
merely turning a single word into a single number. While words can be
transformed into numbers, an entire document can be translated into a
vector. Not only can a vector have more than one dimension, but with
text data vectors are usually high-dimensional. This is because each
dimension of your feature data will correspond to a word, and the
language in the documents you are examining will have thousands of
words.
TF-IDF
In information retrieval, tf–idf or TFIDF, short for term
frequency–inverse document frequency, is a numerical statistic that is
intended to reflect how important a word is to a document in a
collection or corpus. It is often used as a weighting factor in searches
of information retrieval, text mining, and user modeling. The tf-idf
value increases proportionally to the number of times a word appears in
the document and is offset by the frequency of the word in the corpus,
which helps to adjust for the fact that some words appear more
frequently in general. Nowadays, tf-idf is one of the most popular
term-weighting schemes; 83% of text-based recommender systems in the
domain of digital libraries use tf-idf.
Variations of the tf–idf weighting scheme are often used by search
engines as a central tool in scoring and ranking a document's relevance
given a user query. tf–idf can be successfully used for stop-words
filtering in various subject fields, including text summarization and
classification.
One of the simplest ranking functions is computed by summing the
tf–idf for each query term; many more sophisticated ranking functions
are variants of this simple model.
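As a minimal sketch of how this looks in code, the TfidfVectorizer imported earlier can turn every headline into a tf-idf weighted vector. The parameter values, and the names X3 and words (which the clustering snippets further down rely on), are assumptions; the author's full pipeline with stemming and tokenization is not shown here.

from sklearn.feature_extraction.text import TfidfVectorizer

# Each row of X3 is one headline represented as a tf-idf weighted vector over the vocabulary.
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X3 = vectorizer.fit_transform(data['headline_text'])
words = vectorizer.get_feature_names_out()  # maps a column index back to its term

print(X3.shape)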
Stemming is the process of reducing a word into its stem, i.e. its
root form. The root form is not necessarily a word by itself, but it can
be used to generate words by concatenating the right suffix. For
example, the words fish, fishes and fishing all stem into fish, which is
a correct word. On the other hand, the words study, studies and studying stem into studi, which is not an English word.
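A minimal sketch with the SnowballStemmer imported earlier (the example words are arbitrary):

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
print([stemmer.stem(w) for w in ['fish', 'fishes', 'fishing', 'study', 'studies', 'studying']])
# ['fish', 'fish', 'fish', 'studi', 'studi', 'studi']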
Tokenizing
Tokenization is the process of breaking a sentence into words and punctuation marks, as in the sketch below.
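For example, the RegexpTokenizer imported earlier can be used for this; the pattern below, which keeps only alphanumeric runs, is an assumption:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
print(tokenizer.tokenize('air nz staff in aust strike for pay rise'))
# ['air', 'nz', 'staff', 'in', 'aust', 'strike', 'for', 'pay', 'rise']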
For this, we will use the k-means clustering algorithm.
K-means clustering (source: Wikipedia)
Elbow method to select the number of clusters
This method looks at the percentage of variance explained as a
function of the number of clusters: One should choose a number of
clusters so that adding another cluster doesn't give much better
modeling of the data. More precisely, if one plots the percentage of
variance explained by the clusters against the number of clusters, the
first clusters will add much information (explain a lot of variance),
but at some point the marginal gain will drop, giving an angle in the
graph. The number of clusters is chosen at this point, hence the "elbow
criterion". This "elbow" cannot always be unambiguously identified.
Percentage of variance explained is the ratio of the between-group
variance to the total variance, also known as an F-test. A slight
variation of this method plots the curvature of the within group
variance.
Basically, the number of clusters is the x-axis value of the point at the corner of the "elbow" (the plot often looks like an elbow).
from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X3)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.savefig('elbow.png')
plt.show()
Since more than one elbow appears in the plot, I will have to select the right number of clusters by trial and error, so I will show the results for different numbers of clusters to find the right one.
kmeans = KMeans(n_clusters=3, n_init=20, n_jobs=1)  # n_init: number of random initializations; n_jobs: number of CPU cores to use
kmeans.fit(X3)

# We look at the 3 clusters generated by k-means: the top 25 terms closest to each centroid.
common_words = kmeans.cluster_centers_.argsort()[:, -1:-26:-1]
for num, centroid in enumerate(common_words):
    print(str(num) + ' : ' + ', '.join(words[word] for word in centroid))
# output """ publish_date headline_text 0 20030219 aba decides against community broadcasting lic... 1 20030219 act fire witnesses must be aware of defamation 2 20030219 a g calls for infrastructure protection summit 3 20030219 air nz staff in aust strike for pay rise 4 20030219 air nz strike to affect australian travellers """
The data set contains only two columns: the publish date and the headline text.
For simplicity, I will explore the first 10,000 rows of this dataset. Since the headlines are sorted by publish_date, these rows cover roughly two months, from February 19, 2003 to April 7, 2003.
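A minimal sketch of that subset; the variable name df is what the later snippets use:

# Keep only the first 10,000 headlines for the exploration below.
df = data.head(10000).copy()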
Number of characters present in each sentence
Visualization of text statistics is a simple but insightful
technique.
They include:
Word frequency analysis, sentence length analysis, average word
length analysis, etc.
These really help to explore the basic characteristics of text
data.
For this, we will mainly use histograms (continuous data) and bar
graphs (categorical data).
First, let me look at the number of characters in each sentence. This
can give us a rough idea of the length of news headlines.
df['headline_text'].str.len().hist()
Number of words appearing in each news headline
The histogram shows that news headlines range from 10 to 70 characters, usually between 25 and 55.
Now we will continue exploring the data word by word. Let's plot the number of words that appear in each news headline, as sketched below.
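A minimal one-liner for that plot, assuming df is the subset defined above:

df['headline_text'].str.split().map(lambda x: len(x)).hist()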
Obviously, the number of words in news headlines is in the range of 2
to 12, and most of them are between 5 and 7.
Next, let's check the average word length in each sentence.
df['headline_text'].str.split().apply(lambda x : [len(i) for i in x]).map(lambda x: np.mean(x)).hist()
The average word length is between 3 and 9, and the most common
length is 5. Does this mean that people use very short words in news
headlines?
Let us find out.
One reason this may not be the case is stop words. Stop words are the most commonly used words in any language (such as "the", "a", "an", etc.). Since these words tend to be short, they may skew the graph above to the left.
Analyzing the number and types of stop words can give us some
in-depth understanding of the data.
To get a corpus of stop words, you can use the nltk library. nltk contains stop words for multiple languages. Since we only deal with English news, I will filter English stop words from the corpus, as sketched below.
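A minimal sketch of counting stop-word occurrences in the headlines; it assumes the nltk stopwords corpus has already been downloaded with nltk.download('stopwords'):

from collections import Counter
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))

# Flatten all headlines into one word list, count only the stop words, and plot the top 10.
corpus_words = [w for headline in df['headline_text'].str.split() for w in headline]
top_stop = Counter(w for w in corpus_words if w in stop).most_common(10)

x, y = zip(*top_stop)
sns.barplot(x=list(y), y=list(x))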
We can clearly see that in the news headlines, stop words such as
"to", "in" and "for" dominate.
So now that we know which stop words appear frequently in our text,
let's check which words other than these stop words appear
frequently.
We will use the Counter class from the collections library to count the occurrence of each word and store the result as a list of tuples. This is a very useful tool when doing word-level analysis in natural language processing.
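A minimal sketch of building that list; the name most is what the snippet below filters, and corpus_words is the flattened word list from the stop-word sketch above:

counter = Counter(corpus_words)
most = counter.most_common()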
x, y = [], []
for word, count in most[:40]:
    if word not in stop:
        x.append(word)
        y.append(count)
sns.barplot(x=y,y=x)
Wow! In the past 15 years, "America", "Iraq" and "War" have dominated
the headlines.
"We" here may mean the United States or us (you and me). We are not a
stop word, but when we look at the other words in the picture, they are
all related to the United States-the Iraq War and "we" here may mean the
United States.
Ngram analysis
An n-gram is a contiguous sequence of n words, for example "river bank" or "Three Musketeers". A sequence of two words is called a bigram, a sequence of three words a trigram, and so on.
Viewing the most common n-grams can give you a better understanding
of the context in which the word is used.
Bigram analysis
To build our vocabulary, we will use CountVectorizer. CountVectorizer is a simple way to tokenize, vectorize and represent a corpus in an appropriate form. It can be found in sklearn.feature_extraction.text.
We will now look at the top bigrams across all news headlines, as in the sketch below.
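A minimal sketch of extracting the most frequent bigrams with CountVectorizer; the helper name, its parameters and the number of bigrams shown are assumptions (get_feature_names_out requires a recent scikit-learn):

from sklearn.feature_extraction.text import CountVectorizer

def top_ngrams(texts, n=2, top_k=10):
    # Build an n-gram vocabulary, count total occurrences, and return the most frequent n-grams.
    vec = CountVectorizer(ngram_range=(n, n), stop_words='english').fit(texts)
    counts = vec.transform(texts).sum(axis=0).A1
    freqs = sorted(zip(vec.get_feature_names_out(), counts), key=lambda t: t[1], reverse=True)
    return freqs[:top_k]

top_bigrams = top_ngrams(df['headline_text'], n=2, top_k=10)
x, y = zip(*top_bigrams)
sns.barplot(x=list(y), y=list(x))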
We can see that many of these bigrams are combinations such as "face the court" and "anti-war protest". This means that we should spend some effort on data cleaning to see if we can combine these synonyms into one clean token.
Topic modelling
Use pyLDAvis for topic modeling exploration
Topic modeling is the process of using unsupervised learning
techniques to extract the main topics that appear in the document
set.
Latent
Dirichlet Allocation (LDA) is an easy-to-use and efficient topic
modeling model. Each document is represented by a topic distribution,
and each topic is represented by a word distribution.
Once the documents are classified into topics, you can delve into the
data for each topic or topic group.
But before moving on to topic modeling, we must do some preprocessing of the data. We will:
Tokenization: the process of converting sentences into tokens or lists of words.
Remove stopwords.
Lemmatize: reduce each inflected word form to a common base or root.
Convert to a bag of words: a bag of words is a dictionary where the keys are the words (or ngrams/tokens) and the values are the number of times each word appears in the corpus.
# output """ [nltk_data] Downloading package punkt to /Users/xx/nltk_data... [nltk_data] Package punkt is already up-to-date! [nltk_data] Downloading package wordnet to /Users/xx/nltk_data... [nltk_data] Unzipping corpora/wordnet.zip. True """
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

def preprocess_news(df):
    corpus = []
    stem = PorterStemmer()
    lem = WordNetLemmatizer()
    for news in df['headline_text']:
        words = [w for w in word_tokenize(news) if w not in stop]
        words = [lem.lemmatize(w) for w in words if len(w) > 2]
        corpus.append(words)
    return corpus
corpus = preprocess_news(df)
# Now, let's use gensim to create a bag-of-words model
import gensim

dic = gensim.corpora.Dictionary(corpus)
bow_corpus = [dic.doc2bow(doc) for doc in corpus]
# We can finally create the LDA model:
lda_model = gensim.models.LdaMulticore(bow_corpus, num_topics=4, id2word=dic, passes=10, workers=2)
Topic 0 represents things related to the Iraq war and the police. Topic 3 shows Australia's involvement in the Iraq War.
You can print all the topics and try to understand them, but there
are tools that can help you run this data exploration more effectively.
pyLDAvis is such a tool, it can interactively visualize the results of
LDA.
Visualize the topics
import pyLDAvis
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, bow_corpus, dic)
vis
On the left, the area of each circle represents the importance of the topic relative to the corpus. Because there are four topics, we have four circles.
The distance between the circle centers indicates the similarity between topics. Here you can see that Topic 3 and Topic 4 overlap, which indicates that these topics are more similar. On the right, the histogram of each topic shows the top 30 relevant words. For example, in Topic 1, the most relevant words are "police", "new", "may", "war", etc.
Therefore, in our case, we can see many war-related words and topics
in the news headlines.
Wordclouds
Wordcloud is a great way to represent text data. The size and color
of each word appearing in the word cloud indicate its frequency or
importance.
It is easy to create a word cloud using Python, but we need to provide the data in the form of a corpus; a minimal sketch follows.
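A minimal sketch using the wordcloud package, assuming corpus is the list of token lists produced by preprocess_news above:

from wordcloud import WordCloud, STOPWORDS

# Join every token back into one long string and let WordCloud size words by frequency.
long_string = ' '.join(word for doc in corpus for word in doc)
wordcloud = WordCloud(stopwords=set(STOPWORDS), max_words=100, max_font_size=40, background_color='white').generate(long_string)

plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()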
Similarly, you can see that terms related to war are highlighted,
indicating that these words often appear in news headlines.
There are many parameters that can be adjusted. Some of the most
famous are:
stopwords: a set of words to exclude from the image.
max_words: the maximum number of words to display.
max_font_size: the maximum font size.
There are many other options to create beautiful word clouds. For
more detailed information, you can refer to here.
Text sentiment
Sentiment analysis is a very common natural language processing task
in which we determine whether the text is positive, negative or neutral.
This is very useful for finding the sentiment associated with reviews and comments, allowing us to gain some valuable insights from text data.
There are many projects that can help you use python for sentiment
analysis. I personally like TextBlob
and Vader
Sentiment.
from textblob import TextBlob

TextBlob('100 people killed in Iraq').sentiment
Textblob is a python library built on top of nltk. It has been around
for a while and is very easy to use.
The sentiment function of TextBlob returns two attributes:
Polarity: It is a floating-point number in the range of [-1,1], where
1 means a positive statement and -1 means a negative statement.
Subjectivity: refers to how personal opinions and feelings affect
someone’s judgment. The subjectivity is expressed as a floating point
value with a range of [0,1].
I will run this feature on news headlines.
TextBlob claims that the text "100 people killed in Iraq" is negative; it is not an opinion or a feeling but a statement of fact. I think we can agree with TextBlob here.
Now that we know how to calculate these sentiment scores, we can visualize them with a histogram and explore the data further, as sketched below.
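A minimal sketch of scoring every headline and plotting the polarity distribution; the helper and column names are assumptions:

from textblob import TextBlob

def polarity(text):
    # TextBlob polarity is a float in [-1, 1]; values below zero indicate negative sentiment.
    return TextBlob(text).sentiment.polarity

df['polarity_score'] = df['headline_text'].apply(polarity)
df['polarity_score'].hist()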
# output """ 7 aussie qualifier stosur wastes four memphis match 23 carews freak goal leaves roma in ruins 28 council chief executive fails to secure position 34 dargo fire threat expected to rise 40 direct anger at govt not soldiers crean urges Name: headline_text, dtype: object """
Vader
The next library we are going to discuss is VADER. Vader is better at
detecting negative emotions. It is very useful in the context of social
media text sentiment analysis.
VADER (Valence Aware Dictionary and sEntiment Reasoner) is an open-source, rule- and lexicon-based sentiment analyzer released under the MIT license.
The VADER sentiment analysis class returns a dictionary that contains
the possibility that the text appears positive, negative, and neutral.
Then, we can filter and select the emotion with the highest
probability.
We will use VADER to perform the same analysis, as sketched below, and check whether the difference is large.
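A minimal sketch using nltk's VADER interface; the classification thresholds and the column name are assumptions:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()

def vader_label(text):
    # polarity_scores returns neg/neu/pos/compound scores; classify on the compound score.
    compound = sid.polarity_scores(text)['compound']
    if compound > 0.05:
        return 'positive'
    if compound < -0.05:
        return 'negative'
    return 'neutral'

df['vader_sentiment'] = df['headline_text'].apply(vader_label)
df['vader_sentiment'].value_counts(normalize=True)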
Yes, the distribution is slightly different: even more headlines are classified as neutral (85%), and the share of negative news headlines has increased (to 13%).
Named Entity Recognition
Named entity recognition is an information extraction method in which
entities existing in the text are classified into predefined entity
types, such as "person", "location", "organization" and so on. By using
NER, we can gain insight into the entities that exist in a given text
data set of entity types.
Let us consider an example of a news article.
In the above news, the named entity recognition model should be able
to recognize Entities, such as RBI as an organization, Mumbai and India
as Places, etc.
There are several standard libraries for named entity recognition.
I will use spaCy, which is an open source library
for advanced natural language processing tasks. It is written in Cython
and is known for its industrial applications. In addition to NER, spaCy also provides many other functions, such as POS tagging, word-to-vector conversion, etc.
import spacy
from spacy import displacy
import en_core_web_sm
nlp = en_core_web_sm.load()
# nlp = spacy.load("en_core_web_sm")
# One of the advantages of Spacy is that we only need to apply the nlp function once, and the entire background pipeline will return the objects we need
doc = nlp('India and Iran have agreed to boost the economic \
viability of the strategic Chabahar port through various measures, \
including larger subsidies to merchant shipping firms using the facility, \
people familiar with the development said on Thursday.')
We can see that India and Iran are recognized as geopolitical entities (GPE), Chabahar as a person, and Thursday as a date.
We can also use the displacy module in spaCy to visualize the output.
from spacy import displacy
displacy.render(doc, style='ent')
This can make sentences with recognized entities look very neat, and
each entity type is marked with a different color.
Now that we know how to perform NER, we can further explore the data
by performing various visualizations on the named entities extracted
from the data set.
First, we will run named entity recognition on news headlines and
store entity types.
NER Analysis
def ner(text):
    doc = nlp(text)
    return [X.label_ for X in doc.ents]
ent = df['headline_text'].apply(lambda x: ner(x))
ent = [x for sub in ent for x in sub]
counter = Counter(ent)
count = counter.most_common()
# Now, we can visualize the entity frequency:
x, y = map(list, zip(*count))
sns.barplot(x=y, y=x)
Now we can see that GPE and ORG dominate the headlines, followed by
the PERSON entity.
We can also visualize the most common tokens for each entity. Let's
check which places appear the most in news headlines.
Most common GPE
defner(text,ent="GPE"): doc=nlp(text) return [X.text for X in doc.ents if X.label_ == ent]
gpe = df['headline_text'].apply(lambda x: ner(x, "GPE"))
gpe = [i for x in gpe for i in x]
counter = Counter(gpe)
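The plotting step itself is not shown above; a minimal sketch of it would be:

# Visualize the ten most frequent place names (GPE entities) in the headlines.
x, y = map(list, zip(*counter.most_common(10)))
sns.barplot(x=y, y=x)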
Saddam Hussein and George Bush were the presidents of Iraq and the United States during the war. We can also see that the model is far from perfect: it classifies "vic govt" and "nsw govt" as individuals rather than government agencies.
POS tagging
I will use nltk for part-of-speech tagging, but other libraries can do the job well too (spaCy, TextBlob).
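The snippet below relies on a helper pos that the original does not show; a plausible sketch (an assumption) would be:

import nltk
from nltk.tokenize import word_tokenize

def pos(text):
    # Tag every token in a headline and return just the POS tags.
    return [tag for word, tag in nltk.pos_tag(word_tokenize(text))]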
tags = df['headline_text'].apply(lambda x: pos(x))
tags = [x for l in tags for x in l]
counter = Counter(tags)
x, y = list(map(list, zip(*counter.most_common(7))))
sns.barplot(x=y,y=x)
We can clearly see that nouns (NN) dominate the news headlines, followed by adjectives (JJ). This is typical for news reports, while for more artistic forms of writing, higher adjective frequencies are common.
You can investigate this in more depth by investigating the most
common singular nouns in news headlines. Let us find out.
Nouns such as "war", "Iraq", and "person" dominate the news
headlines. You can use the above functions to visualize and check other
parts of the voice.
Most common Nouns
def get_adjs(text):
    # Despite its name, this helper collects singular nouns (NN tags).
    adj = []
    pos = nltk.pos_tag(word_tokenize(text))
    for word, tag in pos:
        if tag == 'NN':
            adj.append(word)
    return adj
words = df['headline_text'].apply(lambda x: get_adjs(x))
words = [x for l in words for x in l]
counter = Counter(words)
doc = nlp('She sells seashells by the seashore')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})
Text readability
Textstat
from textstat import flesch_reading_ease

df['headline_text'].apply(lambda x: flesch_reading_ease(x)).hist()
Complex headlines
Almost all readability scores exceed 60. This means that an average 11-year-old student can read and understand the news headlines. Let's check all news headlines with a readability score below 5, as in the sketch below.
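A minimal sketch of that filter, reusing flesch_reading_ease from textstat (the column name is an assumption):

# Score every headline and show the hardest-to-read ones.
df['reading_ease'] = df['headline_text'].apply(flesch_reading_ease)
df[df['reading_ease'] < 5]['headline_text'].head()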
In this article, we discussed and implemented various exploratory data analysis methods for text data. Some are common and some are little known, but all of them can be an excellent addition to your data exploration toolkit.
Hope you will find some of them useful for your current and future
projects.
To make data exploration easier, I created an "exploratory data analysis for natural language processing" template, which you can use for your work.
In addition, you may have seen that each chart in this article comes with the code snippet that created it.