Chinese email text classification (hands-on project)

Project introduction

Text classification is one of the core application areas of natural language processing, and it underlies many other NLP tasks. This project tackles the simplest case: a binary classification problem.

This project shows how to convert text data into numerical features (text feature extraction), and then uses the support vector machine algorithm to classify 10001 email samples in Python.

Knowledge points

  • Basic concepts of natural language processing

  • Support vector machine algorithm

  • TF-IDF

Introduction to text classification

Text classification plays a very important role in natural language processing. Generally speaking, it refers to automatically assigning a category to a text according to its content under certain rules. It has many applications in practice, such as spam filtering, sentiment analysis, and news categorization.

Types of text classification problems

Depending on the requirements, text classification problems fall into three categories: binary classification, multi-class classification, and multi-label classification.

  • Binary classification: the most basic type. As the name suggests, the text is assigned to one of two categories, for example deciding whether an email is spam or normal mail, or whether a film review is positive or negative.

  • Multi-class classification: the text is assigned to one of several categories, such as classifying news into politics, entertainment, lifestyle, and so on.

  • Multi-label classification: the text is tagged with several different labels at once. For example, a novel may belong to multiple genres at the same time, say both a fairy-tale novel and a fantasy novel.

Solutions to text classification problems

There are two main approaches to text classification: traditional machine learning algorithms and deep learning algorithms.

  • Traditional methods: feature extraction + classifier. The text is converted into a fixed-dimension vector, which is then fed to a classifier.

  • Deep learning methods: features are extracted automatically and the model is trained end to end, giving strong representational power. As a result, deep learning classifiers often outperform traditional methods.

Introduction to support vector machine algorithm

This section briefly describes the principle of the Support Vector Machine (SVM) and how it is used.

SVM is a very important classification algorithm in traditional machine learning. Given training samples, it finds an "optimal" separating hyperplane that divides the different classes.

Generally there are many such hyperplanes; the SVM looks for the one located in the "middle" of the two classes of training samples. To make this easier to understand, we use a simple two-dimensional example, illustrated in the figure below; in practical applications the space is usually high-dimensional.

In the figure, many lines can correctly separate the samples into two classes, but the red line lies in the "middle" of the two classes, and it is the best one.

Why not choose another line? Because of noise and the limited size of the training set, samples outside the training set may fall closer to the class boundary than the training samples do, which can lead to classification errors. The red line is the least affected by this and provides the largest margin of tolerance, so the goal of the SVM is to find that line.

The red line in the figure is the maximum-margin hyperplane, and the points on the dashed lines are the samples closest to the separating boundary; these points are called "support vectors".

Extending to a higher-dimensional space, the hyperplane can be written as the linear equation:

$$\omega^{T}x + b = 0$$

where $\omega = (\omega_{1}; \omega_{2}; \cdots; \omega_{n})$ is the weight vector, which determines the direction of the hyperplane, and $b$ is the bias (displacement) term, which determines the distance between the hyperplane and the origin.

The distance $r$ from any point $x_{i}$ in the space to the hyperplane is given by:

$$r = \frac{\left| \omega^{T}x_{i} + b \right|}{\left\| \omega \right\|}$$

where $\left\| \omega \right\|$ is the L2 norm of $\omega$.

➡️ L2 norm: for example, if $\omega = (a, b, c)$, then $\left\| \omega \right\| = \sqrt{a^2 + b^2 + c^2}$.
Tip: if this formula looks unfamiliar, compare it with the point-to-line distance formula from high-school geometry; it is the same idea extended to higher dimensions.
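As a quick numerical check (this snippet is illustrative and not part of the original project; the hyperplane coefficients are made up), the distance formula can be evaluated directly with NumPy:

import numpy as np

# A hypothetical hyperplane w^T x + b = 0 in 2-D, with w = (3, 4) and b = -5 (made-up values)
w = np.array([3.0, 4.0])
b = -5.0
x_i = np.array([2.0, 1.0])  # an arbitrary point

# r = |w^T x_i + b| / ||w||, where ||w|| = sqrt(3^2 + 4^2) = 5
r = abs(w @ x_i + b) / np.linalg.norm(w)
print(r)  # 1.0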

Suppose the hyperplane can correctly classify the training samples. Given a training set:

$$D = \left\{ (x_{1}, y_{1}), (x_{2}, y_{2}), (x_{3}, y_{3}), \cdots, (x_{n}, y_{n}) \right\}$$

where $y_{i} \in \{-1, +1\}$ (a binary classification problem) and $x_{i}$ is an $n$-dimensional vector. If $y_{i} = +1$, then $\omega^{T}x_{i} + b > 0$; if $y_{i} = -1$, then $\omega^{T}x_{i} + b < 0$. After rescaling $\omega$ and $b$, this can be written as:

$$\begin{cases} \omega^{T}x_{i} + b \geq +1, & y_{i} = +1 \\ \omega^{T}x_{i} + b \leq -1, & y_{i} = -1 \end{cases}$$

Equality holds exactly for the points of the two classes that lie on the margin boundaries, i.e. the support vectors. The "margin" between the two sets of support vectors is then:

$$\gamma = \frac{2}{\left\| \omega \right\|}$$

Obviously, to maximize the margin we need to minimize $\left\| \omega \right\|^{2}$ subject to the constraints above; this is the basic (primal) form of the support vector machine.
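To make this concrete, here is a small illustrative sketch (not from the original project; the 2-D points are made up) that fits a linear SVM with scikit-learn and inspects the learned $\omega$, $b$, support vectors and margin:

import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable 2-D samples with labels -1 / +1
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A linear SVM with a very large C approximates the hard-margin formulation above
clf = SVC(kernel='linear', C=1e6)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)       # the learned w and b of the hyperplane
print(clf.support_vectors_)            # the points lying on the margin boundaries
print(2 / np.linalg.norm(clf.coef_))   # the margin gamma = 2 / ||w||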

Introduction to text feature extraction

Feature extraction is a very important step in NLP tasks. In the vector space model, text features are elements such as words and phrases. A text data set typically contains tens of thousands or even hundreds of thousands of distinct terms, so the resulting vectors are enormous and expensive to compute with.

Feature selection therefore matters a great deal for text classification: the goal is to pick the terms that best represent the meaning of the text. Good feature selection not only reduces the size of the problem but also improves classification performance, and the choice of features strongly affects the quality of the overall system. Many feature representation methods have been proposed; the most common ones are the bag-of-words model and TF-IDF.

Bag-of-words model

The bag-of-words model is the most basic feature representation. It ignores grammar and word order and represents a text or document as an unordered collection of words. Intuitively, all the words of the whole document collection are thrown into a bag and deduplicated; each document is then represented by how many times each word occurs. For example (the sentences below are word-segmented Chinese, shown here in English translation):

Sentence 1: I / have / one / apple

Sentence 2: I / tomorrow / go / one / place

Sentence 3: you / to / one / place

Sentence 4: I / have / I / favorite / you

Throwing all the words into the bag gives: I, have, one, apple, tomorrow, go, place, you, to, favorite. These 10 distinct words appear across the four sentences.

Now fix this vocabulary order: I, have, one, apple, tomorrow, go, place, you, to, favorite. Each sentence is then represented by the count of each vocabulary word it contains.

The resulting feature vectors are:

  • Sentence 1 features: (1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
  • Sentence 2 features: (1, 0, 1, 0, 1, 1, 1, 0, 0, 0)
  • Sentence 3 features: (0, 0, 1, 0, 0, 0, 1, 1, 1, 0)
  • Sentence 4 features: (2, 1, 0, 0, 0, 0, 0, 1, 0, 1)

Such a representation is called a bag-of-words feature.
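A minimal sketch of how these vectors could be computed in Python (the vocabulary order is fixed by hand to match the list above; this code is illustrative and not part of the original project):

# Vocabulary in the same fixed order as listed above
vocab = ['I', 'have', 'one', 'apple', 'tomorrow', 'go', 'place', 'you', 'to', 'favorite']

sentences = [
    ['I', 'have', 'one', 'apple'],
    ['I', 'tomorrow', 'go', 'one', 'place'],
    ['you', 'to', 'one', 'place'],
    ['I', 'have', 'I', 'favorite', 'you'],
]

# Each sentence becomes a vector of word counts over the shared vocabulary
features = [[sentence.count(word) for word in vocab] for sentence in sentences]
for vector in features:
    print(vector)
# [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# [1, 0, 1, 0, 1, 1, 1, 0, 0, 0]
# [0, 0, 1, 0, 0, 0, 1, 1, 1, 0]
# [2, 1, 0, 0, 0, 0, 0, 1, 0, 1]

In practice the same thing is usually done with sklearn's CountVectorizer, which builds the vocabulary automatically (in alphabetical order rather than the hand-picked order used here).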

TF-IDF model

This model uses statistical properties of the vocabulary as features. TF-IDF consists of two parts: TF (term frequency) and IDF (inverse document frequency).

Both are easy to understand, so let's go straight to the formulas:

  1. TF:
    $$tf_{ij} = \frac{n_{ij}}{\sum_{k} n_{kj}}$$
    The numerator $n_{ij}$ is the number of times word $i$ occurs in document $j$; the denominator is the sum of the counts of all words in document $j$, i.e. the total number of words in document $j$.

    For example:

    Sentence 1: she / is / a / girl

    Sentence 2: table / on / is / an / apple

    Sentence 3: Xiao Ming / is / teacher

    Sentence 4: I / am / my / favorite
    

    Computing this for each word gives the TF of every word in each sentence.

  2. IDF:
    $$idf_{i} = \log\left( \frac{\left| D \right|}{1 + \left| D_{i} \right|} \right)$$
    where $\left| D \right|$ is the total number of documents and $\left| D_{i} \right|$ in the denominator is the number of documents that contain word $i$. The original formula has no $+1$ in the denominator; the $+1$ is Laplace smoothing, added so that a new word which does not appear in the corpus does not make the denominator zero.

Using this formula, the IDF value of each word in the sentences above can be calculated in the same way.

Finally, the TF-IDF value can be obtained by multiplying the TF and IDF values. Namely:

$$tf\text{-}idf(i,j) = tf_{ij} \cdot idf_{i} = \frac{n_{ij}}{\sum_{k} n_{kj}} \cdot \log\left( \frac{\left| D \right|}{1 + \left| D_{i} \right|} \right)$$

This yields the TF-IDF value of every word in each of the sentences above.

Collecting the TF-IDF values of the words of a sentence into a vector gives the TF-IDF feature of that sentence.

For example, the feature vector of sentence 1 is:
$$\left( 0.25\log(2),\ 0.25\log(1.33),\ 0.25\log(1.33),\ 0.25\log(2),\ 0, 0, 0, 0, 0, 0, 0, 0, 0 \right)$$
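The formulas above are easy to turn into code. The following illustrative snippet (not part of the original project) implements them exactly as written, natural log and the +1 smoothing included, using the word-segmented sentences from the bag-of-words section:

import math

# The word-segmented example sentences from the bag-of-words section above
docs = [
    ['I', 'have', 'one', 'apple'],
    ['I', 'tomorrow', 'go', 'one', 'place'],
    ['you', 'to', 'one', 'place'],
    ['I', 'have', 'I', 'favorite', 'you'],
]

def tf(word, doc):
    # tf_ij = n_ij / sum_k n_kj : count of the word divided by the length of the document
    return doc.count(word) / len(doc)

def idf(word, docs):
    # idf_i = log(|D| / (1 + |D_i|)), with the +1 Laplace smoothing in the denominator
    docs_with_word = sum(1 for d in docs if word in d)
    return math.log(len(docs) / (1 + docs_with_word))

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# TF-IDF weight of every word in the first sentence
print({w: round(tf_idf(w, docs[0], docs), 3) for w in docs[0]})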

Chinese mail classification

So far we have explained the basic ideas behind the support vector machine; now we apply it to the spam classification task. The overall steps of the experiment are:

  1. Load the data, segment the text into words, and remove stop words.
  2. Split the data into a training set and a test set.
  3. Convert the text data into numerical features.
  4. Build the classifier.
  5. Train the classifier.
  6. Test the classifier.

1. Load dataset

The data consists of three files: ham_data.txt contains 5000 normal email samples, spam_data.txt contains 5001 spam samples, and stop_word.txt is a stop word list. The whole dataset is hosted online.

First use the Linux wget command to download the files (on Windows, download them by other means):

!wget -nc "http://labfile.oss.aliyuncs.com/courses/1208/ham_data.txt"
!wget -nc "http://labfile.oss.aliyuncs.com/courses/1208/spam_data.txt"
!wget -nc "http://labfile.oss.aliyuncs.com/courses/1208/stop_word.txt"
--2021-09-28 20:08:48--  http://labfile.oss.aliyuncs.com/courses/1208/ham_data.txt
Resolving host labfile.oss.aliyuncs.com (labfile.oss.aliyuncs.com)... 47.110.177.159
Connecting to labfile.oss.aliyuncs.com (labfile.oss.aliyuncs.com)|47.110.177.159|:80... Connected.
Issued HTTP request, waiting for response... 200 OK
Length: 2481321 (2.4M) [text/plain]
Saving to: "ham_data.txt"

ham_data.txt        100%[===================>]   2.37M  6.92MB/s  Time 0.3s

2021-09-28 20:08:48 (6.92 MB/s) - Saved "ham_data.txt" [2481321/2481321]

--2021-09-28 20:08:49--  http://labfile.oss.aliyuncs.com/courses/1208/spam_data.txt
...
Length: 1304315 (1.2M) [text/plain]
2021-09-28 20:08:49 (4.52 MB/s) - Saved "spam_data.txt" [1304315/1304315]

--2021-09-28 20:08:49--  http://labfile.oss.aliyuncs.com/courses/1208/stop_word.txt
...
Length: 15185 (15K) [text/plain]
2021-09-28 20:08:49 (412 KB/s) - Saved "stop_word.txt" [15185/15185]
h = open('ham_data.txt', encoding='utf-8')  # Normal mail
s = open('spam_data.txt', encoding='utf-8') # Spam
h, s
(<_io.TextIOWrapper name='ham_data.txt' mode='r' encoding='utf-8'>,
 <_io.TextIOWrapper name='spam_data.txt' mode='r' encoding='utf-8'>)

The h and s above are just file objects. In our data each line corresponds to one email, so we use readlines() to read the contents line by line.

h_data = h.readlines()
s_data = s.readlines()
h_data[0:3], s_data[0:3] # Take the first three emails as an example
(['It's a story about the descendants of Confucius. An old leader came back to his hometown and had a bad relationship with his son and made peace with his greedy grandson Kong Wei. The younger brother of the old leader, Wei zongwan, drove a carriage. A foreign girl probably studied folk customs and spent the new year at their home. Kong Wei always wanted to go abroad and was educated by his grandfather. Finally, the family basically reconciled. By the way, another kind of film, Beijing Qingqing The background of the Sino Vietnamese war. A soldier was introduced to a blind date. The woman was a nurse in the military hospital. She hesitated and always recalled her boyfriend who was injured on the battlefield. It seemed that she was not dead. Finally, the man said he understood and returned to the team.\n',
  'No, there's no topic to do without this broken company? Thank you for your concern. She slept well last night. MM She has already made up her mind. Let's act according to the circumstances. When she gets the relevant materials that can come out to do the paper, she will resign immediately. Alas! Let's see. Maybe we should do it separately XDJM Help out with the idea of finding a job. MM I'm a graduate student of Harbin Engineering University. I don't want to do nothing in Harbin. That's why I came out. Thank you first.!!! I'm not good at Chinese and don't add punctuation. Those who can't understand hard XDJM What happened.\n',
  'Give birth to a child for fun. If it's not fun, give it away. First, you know, before you fall in love, your parents are meaningless to her. It's unreasonable for your parents to ask her to have children, and she has to be obedient. In other words, can you do it if your parents in law want their future children to have their mother's surname? Husband and wife are equal. If you can't promise your parents in law, why does she promise your parents in law? Second, Can you afford to have children? It doesn't mean that you want to have children. Your parents are happy. If you don't have a house and sufficient financial resources, having children will only bring you more difficulties. It's easy to have children and difficult to raise children.\n'],
 ['Every day is a festival for sentient people. A word of cold and warm, a line of noise; a word of advice, a note of legend; a piece of Acacia, one hopes for each other; a piece of love, one loves each other all his life. Search 201:::http://201.855.com I wish you all a happy Tanabata Valentine! Search 201 friendly tips::: 2005 Tanabata Valentine's Day: August 11 - don't forget to send blessings to her or him! \ n ',
  'Our company is an industrial and trade tax fixing enterprise; the cost of opening balance tickets is relatively low. This operation method can save some taxes for your company (factory). The company is based on the principle of mutual benefit,Sincerely look forward to your call!!! contact: Wang Sheng   TEL: --13528886061\n',
  'The company has some ordinary invoices (commodity sales invoices), value-added tax invoices, special payment letters for value-added tax levied by the customs and invoices of other service industries, Highway and inland river transportation invoices can be issued for your company at a low tax rate. Our company has the strength of domestic and foreign trade business to ensure the authenticity of the bills issued by our company. We hope to cooperate!common development!Welcome your call for negotiation and consultation! Contact: Mr. Li      Tel: 13632588281. If you have any trouble, please forgive me. Zhu Shangqi.\n'])

2. Data preprocessing

The raw data cannot be fed to the model directly, so we need to perform a series of preprocessing steps.

2.1 Generate samples and labels

After reading, h_data is a list of 5000 normal email strings and s_data is a list of 5001 spam strings. We need to generate the corresponding labels ourselves: we concatenate the two sample sets and label normal email as 1 and spam as 0.

import numpy as np

h_labels = np.ones(len(h_data)).tolist() # Generate positive label list of all 1
s_labels = np.zeros(len(s_data)).tolist() # Generate negative label list of all 0
# Splice positive and negative sample sets and labels together
datas = h_data + s_data
labels = h_labels + s_labels

2.2 Split into training and test sets

Use the train_test_split function from scikit-learn to randomly hold out 25% of the 10001 samples and labels as the test set; the remaining 75% form the training set used to train the classifier.

from sklearn.model_selection import train_test_split

train_d, test_d, train_y, test_y = train_test_split(datas, labels, test_size=0.25, random_state=5)

Meaning of the parameters:

  • datas: the sample set
  • labels: the label set
  • test_size: the proportion of the data assigned to the test set
  • random_state: random seed. With the same seed, the split is identical every time.

Meaning of return value:

  • train_d: Training set
  • test_d: Test set
  • train_y: Training label
  • test_y: Test label

Take a look at the first 10 training labels:

train_y[0:10]
[0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]

2.3 Word segmentation

NLP usually takes words as the basic unit of analysis, so the text must first be segmented into words. To keep the code simple and easy to follow, the segmentation is wrapped in a tokenize_words function that can be called later. We use the jieba library for Chinese word segmentation.

import jieba

def tokenize_words(corpus):
    tokenized_words = jieba.cut(corpus) # Call jieba participle
    tokenized_words = [token.strip() for token in tokenized_words] # Remove the carriage return and change to list type
    return tokenized_words
# Just enter a sentence and call the function to verify it
string = '我爱自然语言处理'  # "I love natural language processing"
b = tokenize_words(string)
b
Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/n6/fr8x2twx7v9b5c7qqz1d_d640000gn/T/jieba.cache
Loading model cost 0.644 seconds.
Prefix dict has been built successfully.





['我', '爱', '自然语言', '处理']

2.4 Removing stop words

In natural language, many words carry no real meaning, such as the Chinese particles 的 (de) and 了 (le), so they should be removed. First load the stop word list we just downloaded (similar lists can also be found online); it is UTF-8 encoded, with one stop word per line. For convenience, the stop word removal is wrapped in a remove_stopwords function.

You can open stop_word.txt to take a quick look at the stop word list. The remove_stopwords function is defined as follows:

def remove_stopwords(corpus): # The input is a single sample (used for both training and test data)
    sw = open('stop_word.txt', encoding='utf-8') # Load the stop word list
    sw_list = [l.strip() for l in sw] # Strip the newline from each stop word and store them in a list
    # Call the word segmentation function
    tokenized_data = tokenize_words(corpus)
    # Use a list comprehension to filter out the stop words
    filtered_data = [data for data in tokenized_data if data not in sw_list]
    # Join the remaining tokens with spaces into a single string
    filtered_datas = ' '.join(filtered_data)
    # Return the string with stop words removed
    return filtered_datas


Next, build a function that combines word segmentation and stop word removal for the whole dataset, using the tqdm module to display progress.

from tqdm.notebook import tqdm

def preprocessing_datas(datas):
    preprocessing_datas = []
    # Remove stop words from each sample in the dataset
    # and append the result to the preprocessing_datas list created above
    for data in tqdm(datas):
        data = remove_stopwords(data)
        preprocessing_datas.append(data)
    # Returns the preprocessed sample set
    return preprocessing_datas

Finally, call the preprocessing function on the training and test sets and inspect the processed text:

pred_train_d = preprocessing_datas(train_d)
print(pred_train_d[0])
pred_test_d = preprocessing_datas(test_d)
print(pred_test_d[0])
  0%|          | 0/7500 [00:00<?, ?it/s]


Dear users! 1798 international trade network < http : / / www.1798 . cn / >  , It is ranked No. 1 in the world.2 10000 business portals , Now it is free to release supply and demand information, talent recruitment and Internet website registration. Please click Register now < http : / / www.1798 . cn / index . asp > 



  0%|          | 0/2501 [00:00<?, ?it/s]


The world's largest Chinese entrepreneurship portal, a success secret that shocked the world!  www . zhaozhao360 . com  Are you ready to turn part-time jobs into entrepreneurship? The beauty boss digs the first bucket of gold. The 19-year-old girl makes a net profit of 300000 underwear in five years. She tries to become a platinum beauty, China's new nine profiteering industries! Bill ・ How does gates spend money? Earn 5 million because how can a dog change his salary from 400 to 40000 

After this preprocessing we obtain the word-segmented, stop-word-free training set pred_train_d and test set pred_test_d.

3. Feature extraction

After word segmentation and stop word removal we have tokenized text. Our classifier is an SVM, which requires numerical features as input, so the preprocessed text must be converted into numerical data.

There are many conversion methods, and the most classic TF-IDF method is used here.

In Python, the TF-IDF model can be implemented with scikit-learn, which makes it very simple; the main class used here is TfidfVectorizer.

Next, we begin to use this class to convert text features into corresponding TF-IDF values.

First, import the TfidfVectorizer class and define the TF-IDF vectorizer.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=1, norm='l2', smooth_idf=True, use_idf=True, ngram_range=(1, 1))

Parameter list:

  • min_df: ignore terms whose document frequency is strictly below this threshold.
  • norm: the norm used to normalize the term vectors ('l2' here).
  • smooth_idf: smooth the IDF weights by adding one to document frequencies, preventing divisions by zero.
  • use_idf: enable inverse-document-frequency reweighting.
  • ngram_range: the range of n-gram sizes to extract; (1, 1) means single words only, as in the bag-of-words model (see the short sketch below).
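Here is a tiny illustrative example (with a made-up two-sentence corpus, not part of the original project) showing how ngram_range changes the vocabulary that TfidfVectorizer extracts:

from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ['I have one apple', 'I have one favorite place']  # made-up toy corpus

# Unigrams only, as used in this project
v1 = TfidfVectorizer(ngram_range=(1, 1))
v1.fit(toy_docs)
print(sorted(v1.vocabulary_))  # single words only, e.g. 'apple', 'favorite', 'have', ...

# Unigrams plus bigrams: the vocabulary also contains two-word phrases
v2 = TfidfVectorizer(ngram_range=(1, 2))
v2.fit(toy_docs)
print(sorted(v2.vocabulary_))  # adds phrases such as 'have one', 'one apple', ...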

Extract features from the preprocessed training set pred_train_d:

tfidf_train_features = vectorizer.fit_transform(pred_train_d)
tfidf_train_features
<7500x29195 sparse matrix of type '<class 'numpy.float64'>'
	with 262786 stored elements in Compressed Sparse Row format>

This gives us 7500 vectors of 29195 dimensions as the training feature set. We can check the shape of the result:

np.array(tfidf_train_features.toarray()).shape
(7500, 29195)

The vectorizer fitted on the training set is then used to extract features from the test set:

⚠️ Note that vectorizer.fit_transform() must not be used here; use vectorizer.transform() instead. Otherwise a new TF-IDF model would be fitted on the test set alone, with a vocabulary whose size and ordering differ from the training set's, so the feature dimensions would not match and the test could not be completed.

tfidf_test_features = vectorizer.transform(pred_test_d)
tfidf_test_features
<2501x29195 sparse matrix of type '<class 'numpy.float64'>'
	with 81964 stored elements in Compressed Sparse Row format>

This gives 2501 vectors of 29195 dimensions as the test feature set.
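To see why the warning above matters, here is a short illustrative check (reusing the vectorizer and pred_test_d defined above) comparing the two calls:

# Correct: reuse the vocabulary fitted on the training set
right = vectorizer.transform(pred_test_d)
print(right.shape)   # (2501, 29195) - the same number of columns as the training features

# Wrong: fitting a fresh vectorizer on the test set builds a different vocabulary,
# so the column count differs and a classifier trained on 29195-dim vectors cannot use it
wrong = TfidfVectorizer(min_df=1, norm='l2').fit_transform(pred_test_d)
print(wrong.shape)   # (2501, <a different vocabulary size>)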

4. Classification

With the TF-IDF features in hand, we can finally perform the classification task. Instead of implementing the algorithm by hand, we call the SGDClassifier class from sklearn to train the SVM classifier.

SGDClassifier implements several linear models trained with stochastic gradient descent; with the parameter loss='hinge' it is a linear support vector machine classifier.

from sklearn.linear_model import SGDClassifier

svm = SGDClassifier(loss='hinge')

Then we feed the prepared training features and labels to the SVM classifier for training.

svm.fit(tfidf_train_features, train_y)
SGDClassifier()
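As an aside, the same linear SVM can also be trained with a batch solver instead of stochastic gradient descent; a minimal sketch (reusing the variables above) with sklearn.svm.LinearSVC:

from sklearn.svm import LinearSVC

# LinearSVC trains a linear SVM (squared hinge loss by default) with the liblinear solver
svm_batch = LinearSVC()
svm_batch.fit(tfidf_train_features, train_y)
print(svm_batch.score(tfidf_test_features, test_y))  # accuracy on the test set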

5. View the classification results

Thanks to the library calls, the implementation itself is very simple; as you can see, the most involved step is the preprocessing. Now let's check the classification results, starting with the classifier's predictions on the test set.

predictions = svm.predict(tfidf_test_features)
predictions
array([0., 1., 1., ..., 0., 1., 0.])

To quantify the classification results, we use the accuracy_score function from scikit-learn to compute the accuracy of the classifier (accuracy is the proportion of predictions that agree with the true labels in test_y).

from sklearn import metrics

accuracy_score = np.round(metrics.accuracy_score(test_y, predictions), 2)
accuracy_score
0.99

The accuracy is computed from the test labels and the predictions; np.round(x, 2) rounds the result to two decimal places.
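Beyond a single accuracy number, a confusion matrix and a per-class report give a fuller picture; a short sketch reusing the variables above (not part of the original project):

from sklearn.metrics import confusion_matrix, classification_report

# Rows are the true classes, columns the predicted classes (0 = spam, 1 = normal mail)
print(confusion_matrix(test_y, predictions))
print(classification_report(test_y, predictions, target_names=['spam', 'normal mail']))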

Our accuracy reaches 99%, which is very high. We can also pick an individual sample and check whether its prediction is correct:

id = int(input('Please enter a sample number (0~2500): '))
print('True mail type:', 'Normal mail' if test_y[id] == 1 else 'Spam')
print('Predicted mail type:', 'Normal mail' if predictions[id] == 1 else 'Spam')
print('Text:', test_d[id])
True mail type: Spam
Predicted mail type: Spam
Text: Hello! Our company has extra invoices that can be issued on behalf of others. (national tax, local tax, transportation, advertising, customs payment certificate). If your company (factory) needs to call for negotiation and consultation! Contact number:(0)13510614952 Thank you, Miss Wu. Best wishes!
