Crawling the Harry Potter novels with Python and visually analyzing the characters

Preface

The text and images in this article come from the Internet and are for learning and communication only, with no commercial use. Copyright belongs to the original author. If you have any questions, please contact us promptly.

 

First, a brief introduction to jieba, a Chinese word-segmentation package. jieba has three segmentation modes:

  • Precise mode: the default. It segments text as accurately as possible and is suitable for text analysis;
  • Full mode: every fragment that can form a word is emitted, so the output contains ambiguous overlaps;
  • Search-engine mode: on top of precise mode, long words are split again; suitable for search-engine indexing.

Common jieba calls:

  • Precise-mode segmentation: jieba.cut(text, cut_all=False); full mode when cut_all=True
  • Load a custom dictionary: jieba.load_userdict(file_name)
  • Add a word: jieba.add_word(seg, freq, flag)
  • Delete a word: jieba.del_word(seg)

Harry Potter is a series of fantasy novels by the British writer J. K. Rowling. It follows the adventures of the protagonist Harry Potter during his seven years at Hogwarts School of Witchcraft and Wizardry. Below, we use the complicated character relationships in Harry Potter as an exercise for the jieba package.

#Load required packages
import numpy as np
import pandas as pd
import jieba,codecs
import jieba.posseg as pseg  # part-of-speech tagging module
from pyecharts import Bar,WordCloud  # pyecharts 0.x API

#Load the name list, stopword list, and custom dictionary
renmings = pd.read_csv('name.txt',engine='python',encoding='utf-8',names=['renming'])['renming']
stopwords = pd.read_csv('mystopwords.txt',engine='python',encoding='utf-8',names=['stopwords'])['stopwords'].tolist()
book = open('Harry Potter.txt',encoding='utf-8').read()
jieba.load_userdict('Harry Potter Thesaurus.txt')

#Define a word segmentation function
def words_cut(book):
    words = list(jieba.cut(book))
    stopwords1 = [w for w in words if len(w)==1]  # treat single-character tokens as stopwords
    seg = set(words) - set(stopwords) - set(stopwords1)  # remove stopwords for a cleaner vocabulary
    result = [i for i in words if i in seg]  # keep original order and repetitions
    return result
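The set arithmetic inside words_cut can be illustrated with toy English tokens (hypothetical data, no jieba needed):

```python
# Toy illustration of the stopword filtering in words_cut
words = ["Harry", "the", "wand", "a", "Harry", "magic", "the"]
stopwords = ["the", "a"]

short = [w for w in words if len(w) == 1]           # single-character tokens
keep = set(words) - set(stopwords) - set(short)     # vocabulary to keep
result = [w for w in words if w in keep]            # preserve order and repeats

print(result)  # ['Harry', 'wand', 'Harry', 'magic']
```

Note that the final list comprehension keeps duplicates, so word frequencies can still be counted afterwards.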

#First segmentation pass
bookwords = words_cut(book)
renming = [i.split(' ')[0] for i in set(renmings)] #keep only the name itself, dropping the frequency and POS columns
nameswords = [i for i in bookwords if i in set(renming)]  #keep only character names

#Count word frequencies
bookwords_count = pd.Series(bookwords).value_counts().sort_values(ascending=False)
nameswords_count = pd.Series(nameswords).value_counts().sort_values(ascending=False)
bookwords_count[:100].index
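value_counts simply tallies how often each token occurs; the same statistic can be computed with the standard library (toy tokens, hypothetical data):

```python
from collections import Counter

tokens = ["Harry", "Ron", "Harry", "Hermione", "Harry", "Ron"]

# Equivalent of pd.Series(tokens).value_counts().sort_values(ascending=False)
counts = Counter(tokens).most_common()

print(counts)  # [('Harry', 3), ('Ron', 2), ('Hermione', 1)]
```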

 

After the first pass, most words were segmented correctly, but a small number of name words were not, such as 'Bree', 'Ron said', 'Voldemort', 'sney', and 'di said', while 'Umbridge' and 'Hogwarts' were each split into two words.

#Customize some words
jieba.add_word('Dumbledore',100,'nr')
jieba.add_word('Hogwarts',100,'n')
jieba.add_word('Umbridge',100,'nr')
jieba.add_word('Ratankes',100,'nr')
jieba.add_word('Voldemort',100,'nr')
jieba.del_word('Ron said')
jieba.del_word('Speak plainly')
jieba.del_word('Sne')

#Second segmentation pass
bookwords = words_cut(book)
nameswords = [i for i in bookwords if i in set(renming)]
bookwords_count = pd.Series(bookwords).value_counts().sort_values(ascending=False)
nameswords_count = pd.Series(nameswords).value_counts().sort_values(ascending=False)
bookwords_count[:100].index

 

After segmenting again, we can see that the earlier errors have been corrected. Next comes the statistical analysis.

#Plot the TOP15 most frequent words
bar = Bar('Most common words TOP15',background_color = 'white',title_pos = 'center',title_text_size = 20)
x = bookwords_count[:15].index.tolist()
y = bookwords_count[:15].values.tolist()
bar.add('',x, y,xaxis_interval = 0,xaxis_rotate = 30,is_label_show = True)
bar

 

The TOP15 most frequent words in the whole series include Harry, Hermione, Ron, Dumbledore, wand, magic, Malfoy, Snape, and Sirius.

From these we can roughly infer the main storyline: Harry, accompanied by his friends Hermione and Ron, and with the help and training of the great wizard Dumbledore, wields wand and magic to defeat the final boss Voldemort. Of course, the actual books are far more exciting than that.

#Plot the TOP20 character names
bar = Bar('Key figures Top20',background_color = 'white',title_pos = 'center',title_text_size = 20)
x = nameswords_count[:20].index.tolist()
y =nameswords_count[:20].values.tolist()
bar.add('',x, y,xaxis_interval = 0,xaxis_rotate = 30,is_label_show = True)
bar

 

Counting appearances across the whole series, we find that Harry's status as protagonist is unshakable: his count is more than 13,000 higher than that of second-place Hermione. That is hardly surprising; after all, the series is called Harry Potter, not Hermione Granger.

#Word cloud of the whole series
name = bookwords_count.index.tolist()
value = bookwords_count.values.tolist()
wc = WordCloud(background_color = 'white')
wc.add("", name, value, word_size_range=[10, 200],shape = 'diamond')
wc

 

#Character relationship analysis
names = {} 
relationships = {} 
lineNames = []
with codecs.open('Harry Potter.txt','r','utf8') as f:
    n = 0
    for line in f.readlines(): 
        n+=1
        print('Processing line {}'.format(n))
        poss = pseg.cut(line)
        lineNames.append([])
        for w in poss:
            if w.word in set(nameswords):
                lineNames[-1].append(w.word)
                if names.get(w.word) is None:
                    names[w.word] = 0
                    relationships[w.word] = {} 
                names[w.word] += 1
for line in lineNames:
    for name1 in line:
        for name2 in line:
            if name1 == name2:
                continue
            if relationships[name1].get(name2) is None:
                relationships[name1][name2]= 1
            else:
                relationships[name1][name2] = relationships[name1][name2]+ 1
node = pd.DataFrame(columns=['Id','Label','Weight'])
edge = pd.DataFrame(columns=['Source','Target','Weight'])
for name,times in names.items():
    node.loc[len(node)] = [name,name,times]
for name,edges in relationships.items():
    for v, w in edges.items():
        if w > 3:
            edge.loc[len(edge)] = [name,v,w]
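The co-occurrence logic above counts, for every line of text, each ordered pair of distinct names appearing on it. A minimal sketch on toy lines (hypothetical data):

```python
# Each inner list holds the character names found on one line of text
lineNames = [["Harry", "Ron"], ["Harry", "Ron", "Hermione"], ["Hermione"]]

relationships = {}
for line in lineNames:
    for name1 in line:
        for name2 in line:
            if name1 == name2:
                continue  # skip self-pairs
            relationships.setdefault(name1, {})
            relationships[name1][name2] = relationships[name1].get(name2, 0) + 1

print(relationships["Harry"]["Ron"])  # 2: they share two lines
```

Because both (name1, name2) and (name2, name1) are counted, each edge weight is symmetric; the `w > 3` threshold in the main code then drops incidental co-occurrences.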

After processing, we found that the same character can appear under several names, so we merge them before counting, ending up with 88 nodes.

node.loc[node['Id']=='Harry','Id'] = 'Harry Potter'
node.loc[node['Id']=='Porter','Id'] = 'Harry Potter'
node.loc[node['Id']=='Albus','Id'] = 'Dumbledore'
node.loc[node['Label']=='Harry','Label'] = 'Harry Potter'
node.loc[node['Label']=='Porter','Label'] = 'Harry Potter'
node.loc[node['Label']=='Albus','Label'] = 'Dumbledore'
edge.loc[edge['Source']=='Harry','Source'] = 'Harry Potter'
edge.loc[edge['Source']=='Porter','Source'] = 'Harry Potter'
edge.loc[edge['Source']=='Albus','Source'] = 'Dumbledore'
edge.loc[edge['Target']=='Harry','Target'] = 'Harry Potter'
edge.loc[edge['Target']=='Porter','Target'] = 'Harry Potter'
edge.loc[edge['Target']=='Albus','Target'] = 'Dumbledore'
nresult = node.groupby(['Id','Label'], as_index=False)['Weight'].sum().sort_values('Weight',ascending = False)  #sum the merged names; as_index=False keeps Id/Label as columns for to_csv
eresult = edge.sort_values('Weight',ascending = False)
nresult.to_csv('node.csv',index = False)
eresult.to_csv('edge.csv',index = False)

With the node and edge tables, we can analyze the relationships between the characters in Harry Potter in Gephi:

Tags: Python Data Analysis crawler

Posted by alext on Mon, 30 May 2022 15:56:24 +0530