Video conference abstract extraction system

Contents

1, Video text extraction

1. Video to audio

2. Audio to text

2, Abstract extraction

1. TextRank -- extractive summary

2. TF-IDF method -- extractive summary

3. Clustering -- extractive summary

4. seq2seq method -- generative summary

3, Summary

Preface

Drawing on text classification and summary extraction techniques, the author builds a system that extracts summary information from a video for reference.

1, Video text extraction

1. Video to audio

Video-to-audio conversion uses the moviepy library (pip install moviepy); three lines of code are enough:

from moviepy.editor import AudioFileClip
my_audio_clip = AudioFileClip("E:\\film\\Video material\\Spacecraft docking.mp4")
my_audio_clip.write_audiofile("E:\\film\\Video material\\Spacecraft docking audio.wav")

2. Audio to text:

Recognition uses the SpeechRecognition library (imported as speech_recognition) together with the pocketsphinx library.

Both can be installed with pip. If installing pocketsphinx directly with pip fails, first install its build dependency with conda install swig, then install pocketsphinx with pip. The reason is that pocketsphinx is implemented in C, and SWIG is the tool that wraps C/C++ code as a Python library.

Note also that pocketsphinx only supports English recognition by default. For Chinese recognition, download the Mandarin model package from the official site and install it in the same directory as the English package (D:\anaconda3\lib\site-packages\speech_recognition\pocketsphinx-data).

Official website: CMU Sphinx - Browse /Acoustic and Language Models at SourceForge.net

Mandarin package installation:
① At present only cmusphinx-zh-CN-5.2.tar.gz is available. Downloading it may require a proxy, and because it is a .tar.gz archive it has to be extracted twice (gzip first, then tar).
② After extraction, rename the folder zh_cn.cd_cont_5000 to acoustic-model, the file zh_cn.lm.bin to language-model.lm.bin, and the file zh_cn.dic to pronounciation-dictionary.dict (this is the spelling the library expects).
③ Create a folder named zh-CN next to en-US, the default English package.
④ Put the renamed items from step ② into zh-CN.
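The renaming and copying steps above can also be scripted. The following is a minimal sketch using only the standard library; the function name install_mandarin_model and both directory arguments are illustrative assumptions, and the archive is assumed to be already extracted.

```python
import shutil
from pathlib import Path

# Names inside the extracted archive -> names expected inside zh-CN
# (taken from step 2 above; "pronounciation" is spelled as the library expects).
RENAMES = {
    "zh_cn.cd_cont_5000": "acoustic-model",
    "zh_cn.lm.bin": "language-model.lm.bin",
    "zh_cn.dic": "pronounciation-dictionary.dict",
}


def install_mandarin_model(extracted_dir, pocketsphinx_data_dir):
    """Copy the extracted Mandarin model into <pocketsphinx-data>/zh-CN."""
    src = Path(extracted_dir)
    target = Path(pocketsphinx_data_dir) / "zh-CN"
    target.mkdir(parents=True, exist_ok=True)
    for old_name, new_name in RENAMES.items():
        item = src / old_name
        if not item.exists():
            raise FileNotFoundError(f"expected {item} in the extracted archive")
        dest = target / new_name
        if item.is_dir():
            # The acoustic model is a folder, so copy it recursively
            shutil.copytree(item, dest, dirs_exist_ok=True)
        else:
            shutil.copy2(item, dest)
    return target
```

Copying rather than moving leaves the extracted archive untouched, so the step can be repeated if the target path turns out to be wrong.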

import speech_recognition as sr

audio_file = 'E:\\film\\Video material\\Spacecraft docking audio.wav'
r = sr.Recognizer()

# Read the whole WAV file into memory
with sr.AudioFile(audio_file) as source:
    audio = r.record(source)

# Recognize offline with pocketsphinx, using the Mandarin model installed above
try:
    print("Recognized text:", r.recognize_sphinx(audio, language='zh-CN'))
except Exception as e:
    print(e)
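The listing above reads the entire recording in one call, which can be slow and memory-hungry for a long meeting. A possible variation, offered only as a sketch, is to recognize the file in fixed-length windows through the offset and duration arguments of Recognizer.record. The helper names and the 30-second window are assumptions, not part of the original system.

```python
import wave


def wav_duration(path):
    """Return the length of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def chunk_windows(total_seconds, chunk_seconds=30.0):
    """Return (offset, duration) pairs that cover the recording in order."""
    windows = []
    offset = 0.0
    while offset < total_seconds:
        windows.append((offset, min(chunk_seconds, total_seconds - offset)))
        offset += chunk_seconds
    return windows


def transcribe_in_chunks(path, language="zh-CN", chunk_seconds=30.0):
    """Recognize a long WAV file piece by piece; unrecognized chunks become ''."""
    # Imported lazily so the two helpers above work without the library
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    pieces = []
    for offset, duration in chunk_windows(wav_duration(path), chunk_seconds):
        with sr.AudioFile(path) as source:
            audio = recognizer.record(source, offset=offset, duration=duration)
        try:
            pieces.append(recognizer.recognize_sphinx(audio, language=language))
        except sr.UnknownValueError:
            pieces.append("")
    return pieces
```

Fixed windows can cut a word in half at a boundary, so the window length is a trade-off between memory use and recognition quality.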

2, Abstract extraction

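1. TextRank -- extractive summary

SnowNLP's summary(limit) method ranks the sentences of a text with TextRank and returns the top-ranked ones.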

from snownlp import SnowNLP

text='''Issued a statement on the event that abandoned tea was changed handsFrom this newspaper(Reporter: Liu Jun) "We are also victims!"yesterday,Some media reported that Master Kang's discarded tea was
    Resell them to bad merchants and pass them off as famous tea to enter the market,A contact of Master Kang said so. Master Kang issued a statement last night, saying that he was a waste processor
    Made "bad behavior",In addition to expressing its serious concern,We have also cooperated with the relevant government departments in conducting investigations.Suspension and production waste plant
    Contract of the supplierMaster Kang confirmed in his statement to this newspaper:,After investigation,The manufacturer that signed the production waste disposal contract with Master Kang is Ji'an Sanshi feed company,And connected
    Obtain contract contracting qualification through public bidding procedures,And signed a commitment to dispose of Master Kang's production waste through legal channels. Yesterday's media reports showed:
    Ji'an Sanshi feed company may violate the contract signed with Master Kang,From now on,Master Kang has suspended the performance of relevant contracts with Ji'an Sanshi feed company,And wait
    After the investigation results of the relevant units, the responsibility shall be strictly investigated.yesterday,Master Kang, a person familiar with the situation, told the reporter,This business started after the third quarter of last year
    It's for Master Kang to deal with production waste,The relevant contract will expire at the end of this year,The contact pointed out that Master Kang and the firm had no equity participation or any capital relationship.'''
text2="""Reporter Fu Yayu reporting from Shenyang came to Shenyang, and the National Olympic team still hasn't got rid of the rain. At 6:00 p.m. on July 31, the routine training of the National Olympic team was again disturbed by heavy rain, and the players had no choice but to jog for only 25 minutes.
  31 At 10:00 a.m. on the 12th, when the Chinese Olympic team was training in the outfield of the Olympic Sports Center, it was cloudy. The weather forecast showed that there would be heavy rain in Shenyang that afternoon. Fortunately, the team's morning training was not disturbed.
  At 6 p.m., when the team arrived at the training ground, the heavy rain had been falling for several hours, and there was no intention of stopping. With an attitude of trying, the team began the routine training that afternoon. After 25 minutes, the weather showed no sign of improving. In order to protect the players, the National Olympic team decided to stop the training that day and the whole team immediately returned to the hotel.
  Training in the rain is not unusual for the football team, but before the Olympic Games begin, the whole team has become "Petite". In the last week's training in Shenyang, the National Olympic team should first ensure that the existing players will not have unexpected injuries to avoid affecting the official game. Therefore, the team has placed a very important position in controlling training injuries and colds at this stage. After arriving in Shenyang, center back Feng Xiaoting has not trained. Feng Xiaoting caught a cold in Changchun on July 27, so he did not participate in the warm-up match with Serbia on July 29. According to the team introduction, Feng Xiaoting did not have fever symptoms, but for safety reasons, he was allowed to rest in the past two days and resume training after his cold was completely cured. Because of the example of Feng Xiaoting, the National Olympic team is particularly cautious about training in the rain, mainly because it is worried that the players will catch a cold and cause a cold, resulting in non combat attrition. The fact that Ma Xiaoxu, a female football player, was injured in the warm-up match and was not eligible for the Olympic Games has also made the National Olympic team in Shenyang particularly vigilant. "During the training, we constantly told the players to pay attention to their movements. We can't have such a thing again." A staff member said.
  From Changchun to Shenyang, the rain accompanied the National Olympic team all the way. "It's also evil. It rains everywhere we go. Several training sessions in Changchun were disrupted by heavy rain. I didn't expect to encounter this kind of thing again in Shenyang." A national Olympic player also wondered about the "favor" of rain."""
s = SnowNLP(text2)
print(s.summary(9))  # the nine highest-ranked sentences
print(s.summary(1))  # the single highest-ranked sentence
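To show what happens behind a call such as summary(1), here is a minimal pure-Python sketch of the TextRank idea: sentences are graph nodes, word overlap gives the edge weights, and a PageRank-style iteration ranks the nodes. It is an illustration under simplifying assumptions (whitespace tokenization, the overlap similarity from the original TextRank formulation); SnowNLP's own tokenization and similarity measure may differ.

```python
import math


def similarity(words_a, words_b):
    """Word overlap normalized by the log lengths of both sentences."""
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids a zero denominator for one-word sentences
    overlap = len(set(words_a) & set(words_b))
    return overlap / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Return one score per sentence using weighted PageRank."""
    tokens = [s.lower().split() for s in sentences]
    n = len(sentences)
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences, limit=1):
    """Pick the `limit` best-scoring sentences, kept in document order."""
    scores = textrank(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:limit]
    return [sentences[i] for i in sorted(best)]


demo = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cat and dog sat together on the mat and the log",
    "bananas are yellow fruit",
]
print(summarize(demo, limit=1))
```

The third sentence shares the most words with the others, so it collects the highest score, while the unrelated fourth sentence keeps only the baseline score of 1 - damping.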

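2. TF-IDF method -- extractive summary

Note that the listing ranks words by raw frequency (nltk.FreqDist) rather than by TF-IDF weights, and scores each sentence by its densest cluster of high-frequency words.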

# coding: utf-8
import nltk
import jieba
import numpy


# Split text into sentences
def sent_tokenizer(texts):
    start = 0  # index where the current sentence begins
    sentences = []
    punt_list = ',.!?:;~，。！？：；～'  # half-width and full-width punctuation

    for i, char in enumerate(texts):
        # Split only at the last mark of a run of punctuation such as "?!"
        next_is_punct = i + 1 < len(texts) and texts[i + 1] in punt_list
        if char in punt_list and not next_is_punct:
            sentences.append(texts[start:i + 1])  # sentence including its punctuation
            start = i + 1  # the next sentence starts after the punctuation
    if start < len(texts):
        sentences.append(texts[start:])  # trailing text with no closing punctuation
    return sentences

# Load stop words
def stopwordslist(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        stopwords = [line.strip() for line in f]
    return stopwords

# Score sentences
def score_sentences(sentences, topn_words):
    # sentences: the text already split into sentences; topn_words: the high-frequency words
    CLUSTER_THRESHOLD = 2  # keywords closer than this many positions share a cluster
    scores = []
    sentence_idx = -1  # incremented to 0 for the first sentence
    for s in [list(jieba.cut(s)) for s in sentences]:
        # s is the token list of one sentence, e.g. ['flower', 'Orchard', 'central business district', 'F4', 'building', 'B33', 'city', ',']
        sentence_idx += 1
        word_idx = []  # positions of the keywords in this sentence, e.g. [0, 1, 2, 4, 5, 7]
        for w in topn_words:
            try:
                word_idx.append(s.index(w))  # position of the first occurrence of the keyword
            except ValueError:  # w is not in this sentence
                pass
        word_idx.sort()
        if len(word_idx) == 0:
            continue

        # Group keyword positions into clusters using the distance threshold
        clusters = []  # all clusters, e.g. [[0, 1, 2], [4, 5], [7]]
        cluster = [word_idx[0]]  # the cluster being built, e.g. [0, 1, 2]
        i = 1
        while i < len(word_idx):
            if word_idx[i] - word_idx[i - 1] < CLUSTER_THRESHOLD:
                cluster.append(word_idx[i])  # close enough: same cluster
            else:
                clusters.append(cluster[:])  # close the current cluster
                cluster = [word_idx[i]]  # start a new one
            i += 1
        clusters.append(cluster)

        # Score every cluster; the best cluster score is the sentence score
        max_cluster_score = 0
        for c in clusters:
            significant_words_in_cluster = len(c)  # number of keywords in the cluster
            total_words_in_cluster = c[-1] - c[0] + 1  # span from the first to the last keyword
            score = 1.0 * significant_words_in_cluster * significant_words_in_cluster / total_words_in_cluster
            if score > max_cluster_score:
                max_cluster_score = score
        scores.append((sentence_idx, max_cluster_score))  # (sentence index, best cluster score)
    return scores

# Build the summary
def results(texts, topn_wordnum, n):
    # texts: input text; topn_wordnum: number of high-frequency words; n: number of sentences to return
    stopwords = stopwordslist("C:\\Users\\cuguanren\\pytorch Practical projects\\speech recognition\\data\\stopwords\\hit_stopwords.txt")  # load stop words
    sentences = sent_tokenizer(texts)  # split into sentences
    # Keep words that are not stop words, are longer than one character and are not tabs
    words = [w for sentence in sentences for w in jieba.cut(sentence)
             if w not in stopwords and len(w) > 1 and w != '\t']
    wordfre = nltk.FreqDist(words)  # word frequencies
    # The topn_wordnum most frequent words
    topn_words = [w[0] for w in sorted(wordfre.items(), key=lambda d: d[1], reverse=True)][:topn_wordnum]

    scored_sentences = score_sentences(sentences, topn_words)  # score each sentence

    # 1. Keep the sentences that score above mean + 0.5 * standard deviation
    avg = numpy.mean([s[1] for s in scored_sentences])  # mean
    std = numpy.std([s[1] for s in scored_sentences])  # standard deviation
    mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences if
                   score > (avg + 0.5 * std)]  # (sentence index, score)

    # 2. Return the top n sentences
    top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-n:]  # the n best scores
    top_n_scored = sorted(top_n_scored, key=lambda s: s[0])  # back to document order
    c = dict(mean_scoredsenteces=[sentences[idx] for (idx, score) in mean_scored])
    c1 = dict(topnsenteces=[sentences[idx] for (idx, score) in top_n_scored])
    return c, c1

if __name__=='__main__':
    texts = str(input('Please enter text:'))
    topn_wordnum=int(input('Please enter the number of high-frequency words:'))
    n=int(input('Please enter the number of sentences to return:'))
    c,c1=results(texts,topn_wordnum,n)
    print(c)
    print(c1)
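The cluster scoring inside score_sentences is easier to follow on a concrete example. The sketch below isolates the two steps, grouping keyword positions and scoring a cluster as (number of keywords)^2 / span, and runs them on the positions [0, 1, 2, 4, 5, 7] used in the comments of the listing. The function names are illustrative and are not part of the listing.

```python
def group_positions(positions, threshold=2):
    """Group sorted keyword positions whose gap is below `threshold`."""
    clusters = [[positions[0]]]
    for prev, cur in zip(positions, positions[1:]):
        if cur - prev < threshold:
            clusters[-1].append(cur)  # close enough: extend the current cluster
        else:
            clusters.append([cur])  # gap too large: start a new cluster
    return clusters


def cluster_score(cluster):
    """Score one cluster: (number of keywords)^2 / span in tokens."""
    significant = len(cluster)
    span = cluster[-1] - cluster[0] + 1
    return significant * significant / span


positions = [0, 1, 2, 4, 5, 7]

clusters = group_positions(positions, threshold=2)
print(clusters)                              # [[0, 1, 2], [4, 5], [7]]
print([cluster_score(c) for c in clusters])  # [3.0, 2.0, 1.0]

# A looser threshold merges everything into one wider, denser-scoring cluster
wide = group_positions(positions, threshold=3)
print(wide)                                  # [[0, 1, 2, 4, 5, 7]]
print([cluster_score(c) for c in wide])      # [4.5]
```

With a threshold of 2 only adjacent keywords are grouped, so the span always equals the keyword count and the score reduces to the cluster length. A larger threshold lets the span term matter.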


3. Clustering -- extractive summary

Download the spaCy model: python -m spacy download en_core_web_lg (en_core_web_lg is spaCy's English pipeline).

Note: the import from summarizer.sentence_handler import SentenceHandler has been removed; use summarizer.summary_processor.SentenceHandler instead.

from summarizer import Summarizer
# from summarizer.sentence_handler import SentenceHandler  # removed, see the note above
# spaCy is one of the most popular open-source NLP packages. Its pre-trained Chinese
# models cover part-of-speech tagging, dependency parsing and named entity recognition.
from spacy.lang.zh import Chinese
from transformers import AutoConfig, AutoModel, AutoTokenizer
import summarizer

# Load the model through Transformers
modelName = "hfl/rbt6"  # any Chinese BERT you normally use can be substituted
custom_config = AutoConfig.from_pretrained(modelName)
custom_config.output_hidden_states = True  # hidden states are used as sentence embeddings
custom_tokenizer = AutoTokenizer.from_pretrained(modelName)
custom_model = AutoModel.from_pretrained(modelName, config=custom_config)

model = Summarizer(
    custom_model=custom_model,
    custom_tokenizer=custom_tokenizer,
    sentence_handler=summarizer.summary_processor.SentenceHandler(language=Chinese)  # split sentences with spaCy's Chinese language class
)
 
texts = [
    """Sihai.com,recently,According to media reports:Zhang Ziyi is really pregnant!The report also quoted an insider as saying:,
    "Zhang Ziyi is about four or five months pregnant,The expected date of delivery is around the end of the year,I don't take up work now. " What the hell is going on?Is the news true or false?For this message,23 8:30 pm,
    The reporter of West China Metropolis Daily quickly contacted an insider who had excellent relations with Zhang Ziyi's family,This person confirmed to the reporter of West China Metropolis Daily that::"Ziyi is really pregnant this time.
    She is 36 years old,It's time to get pregnant. After Zhang Ziyi became pregnant with Wang Feng's child,Ziyi's parents were very happy. Ziyi's mother,She has begun to take good care of her daughter.
    Ziyi's due date is probably the end of December this year. " At 9:00 that night,West China Metropolis Daily reporter in order to verify Zhang Ziyi's pregnancy,Zhang Zinan, Zhang Ziyi's brother, was contacted by phone,
    But the phone went through,No one answered. News about Zhang Ziyi's pregnancy since Zhang Ziyi and Wang Feng fell in love in September 2013,It was said N All over!
    however,Time: 2015,But things have changed subtly. March 21, 2015,Zhang Ziyi's film "falling from heaven" started,
    A few group photos at the launch conference,It aroused the curiosity of netizens:"Is Zhang Ziyi really pregnant?"However, it was later confirmed that,Zhang Ziyi's "big belly photo"
    It's just a publicity stunt. Four months later, on July 22,<A new round of publicity of "Taiping wheel",Zhang Ziyi was found in poor condition again,Take a deep breath from time to time,
    I unconsciously want to cover my stomach,I feel inappropriate again. Then one day in August,Zhang Ziyi is having dinner with her friends,I was photographed by fashion studio at the entrance of the hotel,Suspected
    Pregnant!July 11 this year,Wang Feng was supposed to hold a concert in Shanghai,Later, it was cancelled due to typhoon "canhong". The source said:,Wang Feng used to play
    At the concert, I announced important news in front of Zhang Ziyi,Besides, Zhang Ziyi has gone to Shanghai to attend a concert,How did you know that there was a typhoon,Have to postpone,mutually
    I believe there will be surprises in the concert on September 26.""",
    """In view of the poor adaptability of the simplification method of the unified error calculation model,An adaptive segmentation method based on terrain features is proposed to selectively segment terrain regions
    Methods of error calculation and model simplification.In view of the large amount of terrain model data,The level of detail structure is established,And prove the spatial fast indexer
    Validity of law.In order to solve the problem of difficulty in dividing regions with gentle terrain,A feature selection algorithm based on the combination of convex points and diffusion points is proposed,And effectively control
    The density of feature points is calculated.On this basis, a multi-resolution neighborhood node searching and matching method is proposed,Fast coarse-grained segmentation of regions is realized.Surface
    Undulation calculation method,The terrain characteristics of the divided area are further evaluated,So as to subdivide some regions.Experimental research was carried out on real data,
    The results show that the performance of the algorithm and the accuracy and adaptability of the simplified model are better.""",
    """For CCTV 3·15 The chaos in the telecommunications industry exposed at the party, the Ministry of industry and information technology said in the announcement that it would strictly investigate CCTV 3·15 The evening party exposed the illegal behaviors of communication.
    The Ministry of industry and information technology said that it had interviewed the relevant responsible persons of the three major operators and instructed the three major operators and the provincial communications administration to investigate and deal with them seriously according to law and regulations overnight.""",
    """Issued a statement on the event that abandoned tea was changed handsFrom this newspaper(Reporter: Liu Jun) "We are also victims!"yesterday,Some media reported that Master Kang's discarded tea was
    Resell them to bad merchants and pass them off as famous tea to enter the market,A contact of Master Kang said so. Master Kang issued a statement last night, saying that he was a waste processor
    Made "bad behavior",In addition to expressing its serious concern,We have also cooperated with the relevant government departments in conducting investigations.Suspension and production waste plant
    Contract of the supplierMaster Kang confirmed in his statement to this newspaper:,After investigation,The manufacturer that signed the production waste disposal contract with Master Kang is Ji'an Sanshi feed company,And connected
    Obtain contract contracting qualification through public bidding procedures,And signed a commitment to dispose of Master Kang's production waste through legal channels. Yesterday's media reports showed:
    Ji'an Sanshi feed company may violate the contract signed with Master Kang,From now on,Master Kang has suspended the performance of relevant contracts with Ji'an Sanshi feed company,And wait
    After the investigation results of the relevant units, the responsibility shall be strictly investigated.yesterday,Master Kang, a person familiar with the situation, told the reporter,This business started after the third quarter of last year
    It's for Master Kang to deal with production waste,The relevant contract will expire at the end of this year,The contact pointed out that Master Kang and the firm had no equity participation or any capital relationship.
    "The manufacturer promises that the waste will be used to make pillows ""In order to ensure that there will be no trouble, we have a clear agreement with this firm in the contract,Can't use waste tea
    Do anything that violates national laws and regulations. " The above person said:,The firm promised them that tea was used to make pillows and other articles,For Master Kang:
    For safety,We also specially ask them for documents for pillow and other business,I didn't expect such a thing to happen.Master Kang is the largest instant tea drinker in the mainland
    Beverage manufacturer,Its market share is nearly 40%. According to the industrial chain of second-hand tea exposed by the media yesterday,The recycling firm sells the discarded tea soaked by Master Kang at a low price
    Sold to bad merchants,Merchants transport tea to other places for processing and sale. Every year, millions of kilograms of second-hand tea are made into famous tea and exported abroad or sold to domestic enterprises.
    as report goes,The soaked tea was first transported to Yonghe Guishan, Xintang, Zengcheng, Guangzhou,After drying, it is transported to the tea factory in Machong, Dongguan for processing,And then transported to Anji, Zhejiang Province
    Kaifeng Tea Co., Ltd((hereinafter referred to as Kaifeng tea factory))And reprocessing is performed.author:Liu Jun (source:Guangzhou Daily)""",
    """Reporter Fu Yayu reporting from Shenyang came to Shenyang, and the National Olympic team still hasn't got rid of the rain. At 6 p.m. on July 31, the routine training of the National Olympic team was once again affected
    Due to the interference of heavy rain, the team members only jogged for 25 minutes and ended up hastily. At 10:00 a.m. on the 31st, the Chinese Olympic team was training in the outfield of the Olympic Sports Center
    It was overcast. The weather forecast showed that there would be heavy rain in Shenyang that afternoon, but fortunately, the training of the team in the morning was not disturbed. At 6 pm, when
    When the team arrived at the training ground, the heavy rain had been falling for several hours, and there was no intention of stopping. With a try attitude, the team began to dominate the world
    After 25 minutes of routine training in the afternoon, the weather showed no signs of improvement. In order to protect the players, the National Olympic team decided to suspend the training on that day. The whole team stood up
    I.e. return to the hotel. Training in the rain is not unusual for the football team, but before the Olympic Games begin, the whole team has become "Petite". In Shen
    During Yang's last week of training, the National Olympic team should first ensure that existing players do not have unexpected injuries to avoid affecting the official game, so this stage is controlled
    Training injuries and controlling the emergence of cold and other diseases are placed in a very important position by the team. After arriving in Shenyang, center back Feng Xiaoting has not been trained
    ,Feng Xiaoting caught a cold in Changchun on July 27, so he did not participate in the warm-up match with Serbia on July 29. According to the team introduction, Feng Xiaoting did not come out
    Now he has fever, but for safety, let him rest in the past two days and resume training after his cold is completely cured. Because of Feng Xiaoting's example
    Therefore, the National Olympic team is particularly cautious about training in the rain, mainly because it is worried that players will catch a cold and cause non combat attrition. And the women's football team member Ma Xiao
    Xu's criminal record of missing the Olympic Games due to his injury in the warm-up match also makes the National Olympic team in Shenyang particularly vigilant now. "During training, I constantly told my team members to pay attention to their movements,
    We can't do this again. " A staff member said. From Changchun to Shenyang, the rain accompanied the National Olympic team all the way. "It's also evil. We've come to
    Wherever it rains, it rains everywhere. Several training sessions in Changchun were disrupted by heavy rain. I didn't expect to encounter this kind of thing again when I came to Shenyang. " A national Olympic player also wondered about the "favor" of rain.""",
    """Rendezvous and docking, in four simple words, includes two stages. It is the key technology for building China's space station and one of the most complex technologies for spacecraft in orbit operation.
    For example, before the launch, we can imagine that the earth rotates with the long march rocket. From the moment of taking off, they are not moving with the earth, they are directly off the surface
    Use space. Therefore, the accuracy of the take-off time determines whether the rocket is taken by the earth and deviates from the expected initial conditions, which is also a term zero window launch time error
    The difference can only be within one second of the government, and no delay or change is allowed. After that, how can the two spacecraft find each other? It will connect with what is often called a "ten thousand mile fax", but it is actually a distance
    This is not directly proportional to the difficulty, and tracking and establishment may not even consume more fuel. The key is to accurately control the altitude difference in the flight process and the period of the spacecraft itself, the new life, and the future
    The faster it will run by the end of the month. For example, as long as the track of the problem is maintained, the English space station will naturally fly to the space station at a faster speed. During the tracking process, the problem will gradually rise
    High, the relative speed of the orbital space station also gradually decreases. The orbit height of the space station is similar to that of the space station. If the relative speed of the two is zero, the docking is expected to be realized. Shanggong room. Navigation
    After the spacecraft reaches the first speed, it establishes an orbital flight. After that, the speed is only related to the orbital altitude. In short, the flight speed of rapid rendezvous and docking has become faster, but orbit control
    The manufacturing process has become shorter, so the other key word of the fast rendezvous and docking mode is autonomy in addition to speed. Autonomy is to track the aircraft without waiting for the command from the ground,
    Instead, they use their own data to determine their own direction of progress. Here, our big brother Beidou is making great efforts to use the Beidou global navigation technology to provide accurate and real-time track side information."""]
 
for i, j in enumerate(texts):
    result = model(j)
    print(str(i + 1) + ':' + result)
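The Summarizer above hides the clustering step: each sentence is embedded, the embeddings are clustered with k-means, and the sentence nearest to each centroid is kept. The following pure-Python sketch illustrates that selection on hand-made 2-D vectors. The toy embeddings, the deterministic seeding with the first k points and the function names are all assumptions for illustration; the library works on BERT hidden states and its details may differ.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return centroids


def pick_summary(sentences, embeddings, k):
    """Return the sentence closest to each centroid, in document order."""
    centroids = kmeans(embeddings, k)
    chosen = set()
    for centroid in centroids:
        order = sorted(range(len(embeddings)),
                       key=lambda i: math.dist(embeddings[i], centroid))
        # Take the closest sentence not already chosen by another centroid
        chosen.add(next(i for i in order if i not in chosen))
    return [sentences[i] for i in sorted(chosen)]


# Two obvious topics: three sentences near the origin, three near (10, 10)
toy_sentences = ["a0", "a1", "a2", "b0", "b1", "b2"]
toy_embeddings = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(pick_summary(toy_sentences, toy_embeddings, k=2))
```

Because one sentence is taken per cluster, the summary covers each topic once instead of repeating the most frequent topic.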

4. seq2seq method -- generative summary

This model is trained on THUCNews, the open-source news collection from Tsinghua University, which covers a wide range of topics and contains about 800,000 articles. It starts from the pre-trained BERT model released by Harbin Institute of Technology.
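A BERT encoder can be trained to generate text if the attention mask stops title tokens from looking ahead. The sketch below builds such a mask from token_type_ids (0 for article tokens, 1 for title tokens), assuming the UniLM-style scheme that the bert_seq2seq project describes. It is a pure-Python illustration with a hypothetical function name; the library's tensor implementation may differ.

```python
def seq2seq_attention_mask(token_type_ids):
    """Build a mask where mask[i][j] == 1 means token i may attend to token j.

    token_type_ids: 0 for source (article) tokens, 1 for target (title) tokens,
    with all source tokens placed before the target tokens.
    """
    n = len(token_type_ids)
    mask = [[0] * n for _ in range(n)]
    for i, type_i in enumerate(token_type_ids):
        for j, type_j in enumerate(token_type_ids):
            if type_i == 0:
                # Source tokens see the whole source but never the target
                mask[i][j] = 1 if type_j == 0 else 0
            else:
                # Target tokens see the source and the target tokens up to themselves
                mask[i][j] = 1 if j <= i else 0
    return mask


# Three article tokens followed by two title tokens
for row in seq2seq_attention_mask([0, 0, 0, 1, 1]):
    print(row)
# [1, 1, 1, 0, 0]
# [1, 1, 1, 0, 0]
# [1, 1, 1, 0, 0]
# [1, 1, 1, 1, 0]
# [1, 1, 1, 1, 1]
```

With this mask the article is encoded bidirectionally while the title is predicted left to right, which is why the training targets in the listing are simply the input ids shifted by one position.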

## THUCNews raw data set
# -*- coding:utf-8 -*-

import torch 
from tqdm import tqdm
import time
import glob
from torch.utils.data import Dataset, DataLoader
from bert_seq2seq import Tokenizer, load_chinese_base_vocab
from bert_seq2seq import load_bert

vocab_path = "./state_dict/roberta_wwm_vocab.txt"  # vocabulary file of the RoBERTa model
word2idx = load_chinese_base_vocab(vocab_path)
model_name = "roberta"  # model type to load
model_path = "./state_dict/roberta_wwm_pytorch_model.bin"  # pre-trained weights
recent_model_path = "./state_dict/bert_auto_title_model.bin"   # checkpoint of an already trained model, used to resume training
model_save_path = "./state_dict/bert_auto_title_model.bin"  # where the trained model is saved
batch_size = 4
lr = 1e-5
maxlen = 256

class BertDataset(Dataset):
    """
    Define a relevant data retrieval method for a specific data set
    """
    def __init__(self) :
        ## Generally, init function loads all data
        super(BertDataset, self).__init__()
        ## Get the names of all the documents
        self.txts = glob.glob('./corpus/THUCNews/*/*.txt')
        self.idx2word = {k: v for v, k in word2idx.items()}
        self.tokenizer = Tokenizer(word2idx)

    def __getitem__(self, i):
        ## Get single data
        # print(i)
        text_name = self.txts[i]
        with open(text_name, "r", encoding="utf-8") as f:
            text = f.read()
        text = text.split('\n')
        if len(text) > 1:
            title = text[0]
            content = '\n'.join(text[1:])
            token_ids, token_type_ids = self.tokenizer.encode(
                content, title, max_length=maxlen
            )
            output = {
                "token_ids": token_ids,
                "token_type_ids": token_type_ids,
            }
            return output

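        # Files without a body are skipped; wrap around so the last index stays in range
        return self.__getitem__((i + 1) % len(self.txts))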

    def __len__(self):

        return len(self.txts)
        
def collate_fn(batch):
    """
    Dynamic padding: pad every sample to the longest sequence in the batch
    """

    def padding(indice, max_length, pad_idx=0):
        """
        pad function
        """
        pad_indice = [item + [pad_idx] * max(0, max_length - len(item)) for item in indice]
        return torch.tensor(pad_indice)

    token_ids = [data["token_ids"] for data in batch]
    max_length = max([len(t) for t in token_ids])
    token_type_ids = [data["token_type_ids"] for data in batch]

    token_ids_padded = padding(token_ids, max_length)
    token_type_ids_padded = padding(token_type_ids, max_length)
    target_ids_padded = token_ids_padded[:, 1:].contiguous()

    return token_ids_padded, token_type_ids_padded, target_ids_padded

class Trainer:
    def __init__(self):
        # Use the GPU when one is available
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        print("device: " + str(self.device))
        # Define the model
        self.bert_model = load_bert(word2idx, model_name=model_name)
        # Load the parameters of the already trained model and continue training
        self.bert_model.load_pretrain_params(recent_model_path)

        # Send model to computing device (GPU or CPU)
        self.bert_model.set_device(self.device)
        # Declare parameters to be optimized
        self.optim_parameters = list(self.bert_model.parameters())
        self.optimizer = torch.optim.Adam(self.optim_parameters, lr=lr, weight_decay=1e-3)
        # Declare a custom data loader
        dataset = BertDataset()
        self.dataloader =  DataLoader(dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)

    def train(self, epoch):
        # Train for one epoch
        self.bert_model.train()
        self.iteration(epoch, dataloader=self.dataloader, train=True)
    
    def save(self, save_path):
        """
        Save model
        """
        self.bert_model.save_all_params(save_path)
        print("{} saved!".format(save_path))

    def iteration(self, epoch, dataloader, train=True):
        total_loss = 0
        start_time = time.time() ## Get the current time
        step = 0
        report_loss = 0
        for token_ids, token_type_ids, target_ids in tqdm(dataloader,position=0, leave=True):
            step += 1
            if step % 10 == 0:
                self.bert_model.eval()
                test_data = ["""Issued a statement on the event that abandoned tea was changed handsFrom this newspaper(Reporter: Liu Jun) "We are also victims!"yesterday,According to media reports, Kang Shi
                Fu's discarded tea was sold to bad merchants and passed into the market as famous tea,A contact of Master Kang said so. Master Kang issued it last night
                The statement said that the production waste processor had committed "bad behavior.",In addition to expressing its serious concern,We have also cooperated with the relevant government departments to adjust
                Check.Suspension of contract with production waste manufacturerMaster Kang confirmed in his statement to this newspaper:,After investigation,The manufacturer that signed the production waste disposal contract with Master Kang is:
                Ji'an Sanshi feed company,And obtained the contract contracting qualification through the public bidding procedure,And signed a commitment to dispose of Master Kang's production waste through legal channels.
                Yesterday, the media reported that Ji'an Sanshi feed company may violate the contract signed with Master Kang,From now on,Master Kang has suspended the contract with Ji'an Sanshi Feed Co., Ltd
                Relevant contracts of,And wait for the investigation results of the relevant units to strictly investigate the responsibility.yesterday,Master Kang, a person familiar with the situation, told the reporter,This business started from March of last year
                After the quarter, he began to treat production waste for Master Kang,The relevant contract will expire at the end of this year,The contact pointed out that Master Kang and the firm had no equity participation or any capital relationship.""",
                 """Rendezvous and docking, in four simple words, includes two stages. It is the key technology for building China's space station and one of the most complex technologies for spacecraft in orbit operation.
                 For example, before the launch, we can imagine that the earth rotates with the long march rocket. From the moment they took off, they were not following the earth's movement, leaving the surface and directly using space.
                 Therefore, the accuracy of the take-off time determines whether the rocket is taken by the earth and deviates from the expected initial conditions. Here, it is also a term. The zero window launch time error can only be within one second of the government
                 Any delay and change are allowed. After that, how can the two spacecraft find each other? The docking is often known as "ten thousand li fax". In fact, the distance is not proportional to the difficulty. The tracking is established, and even
                 And may not necessarily consume more fuel. The key is to accurately control the altitude difference in the flight process and the period of the spacecraft itself. The faster the new life will run until the end of the month. For example, just keep
                 The orbit of the problem, the English space station, naturally flies to the space station at a faster speed. During the tracking process, the problem gradually rises, and the relative speed of the orbital space station gradually decreases. Ask the sky and space
                 The orbit height of the station is similar, and the relative speed of the two is zero, so the docking is expected to be realized. Shanggong room. After the spacecraft reaches the first speed, it establishes an orbital flight. After that, the speed is only higher than the orbit
                 Degree. To put it simply, the flight speed of rapid rendezvous and docking has become faster, but the process of orbit control has become shorter, so the other key word of the rapid rendezvous and docking mode besides speed is autonomy.
                 Autonomy is to track the aircraft, not to wait for the command from the ground, but to use the data they have to determine their own forward direction. Here, our big brother Beidou is making great efforts in Beidou global navigation
                 The application of enables accurate and real-time track side.""",
                 "On August 28, reports surfaced online that user data from hotel chains of the Huazhu Group had allegedly been leaked. According to the content posted by the seller, the data covers guest information from more than 10 hotel brands under Huazhu, including Hanting, Xiyue, Orange and Ibis. The leaked information includes registration data from Huazhu's official website, identity information recorded at hotel check-in, and room booking records, with names, mobile phone numbers, email addresses, ID numbers, login account passwords, and so on. The seller is offering about 500 million records for sale as a package. The third-party security platform Threat Hunter verified the 30,000 records provided by the seller and judged the data to be highly authentic. That afternoon, Huazhu Group stated that it had promptly launched an internal verification and reported the matter to the police at once. That night, the Shanghai police announced that they had received a report from Huazhu Group and had opened an investigation."]
                for text in test_data:
                    print(self.bert_model.generate(text, beam_size=3))
                print("loss is " + str(report_loss))
                report_loss = 0
                # self.eval(epoch)
                self.bert_model.train()
            if step % 8000 == 0:
                self.save(model_save_path)

            # Because labels are passed in, the model computes and returns the loss
            predictions, loss = self.bert_model(token_ids,
                                                token_type_ids,
                                                labels=target_ids)
            report_loss += loss.item()
            # Backpropagation (training only)
            if train:
                # Clear the gradients left over from the previous step
                self.optimizer.zero_grad()
                # Backpropagate to compute new gradients
                loss.backward()
                # Update the model parameters with the new gradients
                self.optimizer.step()

            # Accumulate the loss over the current epoch (a running total, not an average)
            total_loss += loss.item()

        end_time = time.time()
        spend_time = end_time - start_time
        # Print training information
        print("epoch is " + str(epoch)+". loss is " + str(total_loss) + ". spend time is "+ str(spend_time))
        # Save model
        self.save(model_save_path)

    def testtry(self, s):
        # s is expected to be a list of strings; a bare string would be iterated character by character
        self.bert_model.eval()
        with torch.no_grad():
            for text in s:
                print(self.bert_model.generate(text, beam_size=1))

if __name__ == '__main__':

    trainer = Trainer()
    train_epoches = 1

    for epoch in range(train_epoches):
        # Train an epoch
        trainer.train(epoch)
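
The trainer above calls generate with beam_size=3 during training and beam_size=1 in testtry. As a rough illustration of what that parameter controls, here is a minimal beam search sketch over a toy next-token table. TOY_MODEL and beam_search are hypothetical names invented for this example. This is not the decoding code of the BERT seq2seq model used above, and it applies no length normalization.

```python
import math

# Hypothetical toy "model": maps the last token to next-token probabilities.
# It stands in for a real decoder purely for illustration.
TOY_MODEL = {
    "<s>": {"a": 0.5, "b": 0.4, "c": 0.1},
    "a": {"x": 0.4, "y": 0.35, "</s>": 0.25},
    "b": {"x": 0.9, "</s>": 0.1},
    "c": {"</s>": 1.0},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
}


def beam_search(model, beam_size, max_len=5, start="<s>", end="</s>"):
    """Return the highest-scoring token sequence found with the given beam width."""
    beams = [([start], 0.0)]  # (tokens, summed log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token
        candidates = []
        for tokens, score in beams:
            for tok, prob in model[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(prob)))
        # Keep only the beam_size best candidates
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == end:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    # Hypotheses still unfinished at max_len are considered as well
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]


if __name__ == "__main__":
    print(beam_search(TOY_MODEL, beam_size=1))  # greedy decoding
    print(beam_search(TOY_MODEL, beam_size=2))  # wider beam
```

With this toy table, beam_size=1 commits to the locally best first token "a" and ends with a sequence of probability 0.2, while beam_size=2 keeps "b" alive and finds a sequence of probability 0.36. A wider beam costs more computation per step, which may be why the quick test method uses beam_size=1.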

3, Summary

Of the four abstract extraction methods above, the author's tests found the third, the clustering method, to work best. It is the best choice in terms of both the time required to extract the abstract and the quality of the extracted abstract.
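
The comparison above weighs extraction time as well as quality. As a minimal sketch of how the time side could be measured, the helper below runs a summarizer several times and keeps the fastest run. lead_summary and longest_summary are hypothetical stand-ins written for this example, not the four methods from this article, and time_summarizer is an assumed helper name. Quality still has to be judged by reading the output.

```python
import time


def lead_summary(text, n=2):
    """Hypothetical baseline: keep the first n sentences."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return ". ".join(sentences[:n]) + "."


def longest_summary(text, n=2):
    """Hypothetical baseline: keep the n longest sentences, in original order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    by_length = sorted(range(len(sentences)),
                       key=lambda i: len(sentences[i]), reverse=True)
    keep = sorted(by_length[:n])
    return ". ".join(sentences[i] for i in keep) + "."


def time_summarizer(fn, text, repeats=3):
    """Run fn(text) several times; return the summary and the fastest wall time in seconds."""
    best = float("inf")
    summary = None
    for _ in range(repeats):
        start = time.perf_counter()
        summary = fn(text)
        best = min(best, time.perf_counter() - start)
    return summary, best


if __name__ == "__main__":
    sample = "One. Two is longer. Three is the longest one. Four."
    for fn in (lead_summary, longest_summary):
        summary, seconds = time_summarizer(fn, sample)
        print("{}: {:.6f}s -> {}".format(fn.__name__, seconds, summary))
```

Swapping in the real summarizers as fn and passing the transcribed video text would give comparable timings for each method on the same input.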


Posted by zwxetlp on Mon, 12 Sep 2022 22:17:31 +0530