Gods are silent - personal CSDN blog directory
This article introduces some simple keyword extraction algorithms in Python 3. For now it covers only relatively simple methods; I will study more cutting-edge algorithms later and keep updating this article.
1. Chinese keyword extraction based on the TF-IDF algorithm: implemented with the jieba package
import jieba.analyse

extracted_sentences = "With the continuous production of commodity sales, its data is of great significance for its own marketing planning, market analysis and logistics planning. However, there are many factors that affect the sales volume forecast. Traditional statistics-based measurement models, such as time series models, make too many assumptions about reality, resulting in poor forecast results. Therefore, better intelligent AI algorithms are required to improve the accuracy of prediction, so as to help enterprises reduce inventory costs, shorten the delivery cycle, and improve their anti-risk ability."
print(jieba.analyse.extract_tags(extracted_sentences, topK=20, withWeight=False, allowPOS=()))
Output:
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.457 seconds.
Prefix dict has been built successfully.
['forecast', 'Model', 'sales volume', 'Reduce inventory', 'enterprise', 'AI', 'plan', 'increase', 'accuracy', 'Assist', 'delivery', 'algorithm', 'metering', 'sequence', 'Poor', 'various', 'too much', 'hypothesis', 'shorten', 'Marketing']
Function input parameters:
- topK: the number of keywords with the largest TF-IDF weights to return (default 20)
- withWeight: whether to return each keyword's weight along with the keyword (default False)
- allowPOS: include only words with the specified parts of speech (default empty, i.e. no filtering)
The inverse document frequency (IDF) corpus used for keyword extraction can be switched to a custom corpus:
Usage: jieba.analyse.set_idf_path(file_name)  # file_name is the path to the custom corpus
Custom corpus example: https://github.com/fxsjy/jieba/blob/master/extra_dict/idf.txt.big
Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_idfpath.py
The stop-words corpus used for keyword extraction can also be switched to a custom corpus:
Usage: jieba.analyse.set_stop_words(file_name)  # file_name is the path to the custom corpus
Custom corpus example: https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_stop_words.py
2. Chinese keyword extraction based on the TextRank algorithm: implemented with the jieba package
import jieba.analyse

extracted_sentences = "With the continuous production of commodity sales, its data is of great significance for its own marketing planning, market analysis and logistics planning. However, there are many factors that affect the sales volume forecast. Traditional statistics-based measurement models, such as time series models, make too many assumptions about reality, resulting in poor forecast results. Therefore, better intelligent AI algorithms are required to improve the accuracy of prediction, so as to help enterprises reduce inventory costs, shorten the delivery cycle, and improve their anti-risk ability."
print(jieba.analyse.textrank(extracted_sentences, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')))
Output:
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.451 seconds.
Prefix dict has been built successfully.
['enterprise', 'forecast', 'Model', 'plan', 'increase', 'sales volume', 'such as', 'time', 'market', 'analysis', 'Reduce inventory', 'cost', 'shorten', 'delivery', 'influence', 'factor', 'situation', 'metering', 'reality', 'data']
The input parameters are the same as in Section 1; only the default value of allowPOS differs.
TextRank builds an undirected weighted graph whose nodes are words and whose edges are co-occurrence relationships between words within a fixed-size sliding window (5 by default, adjustable via the span parameter).
It then scores the nodes of this graph in a PageRank-like fashion.
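The graph construction and scoring described above can be sketched in plain Python. This is a simplified illustration of the idea, not jieba's actual implementation, and the sample word list is made up:

```python
from collections import defaultdict

def textrank_sketch(words, span=5, d=0.85, iters=10):
    """Simplified TextRank: co-occurrence graph + PageRank-style scoring."""
    # Build an undirected weighted graph: words co-occurring within a
    # sliding window of size `span` get their shared edge weight bumped.
    graph = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(words):
        for other in words[i + 1:i + span]:
            if other != w:
                graph[w][other] += 1
                graph[other][w] += 1
    # PageRank-style iteration: a node's score is fed by its neighbours,
    # each neighbour distributing its score in proportion to edge weight.
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        new = {}
        for w in graph:
            rank = sum(score[v] * weight / sum(graph[v].values())
                       for v, weight in graph[w].items())
            new[w] = (1 - d) + d * rank
        score = new
    # Return words sorted by descending score.
    return sorted(score, key=score.get, reverse=True)

print(textrank_sketch(["sales", "forecast", "model", "sales", "model",
                       "forecast", "cost", "forecast"]))
```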
For a deeper look at PageRank's computation and principles, see my earlier blog post: CS224W (graph machine learning) 2021 winter course notes 4, Link Analysis: PageRank (Graph as Matrix).
3. Chinese word importance (underlying algorithm not specified): implemented with the LAC package
The last list in the output gives the importance score of each corresponding word.
from LAC import LAC

extracted_sentences = "With the continuous production of commodity sales, its data is of great significance for its own marketing planning, market analysis and logistics planning. However, there are many factors that affect the sales volume forecast. Traditional statistics-based measurement models, such as time series models, make too many assumptions about reality, resulting in poor forecast results. Therefore, better intelligent AI algorithms are required to improve the accuracy of prediction, so as to help enterprises reduce inventory costs, shorten the delivery cycle, and improve their anti-risk ability."
lac = LAC(mode='rank')
seg_result = lac.run(extracted_sentences)  # takes a Unicode string as input
print(seg_result)
Output:
(Paddle's model-loading warnings and "Running IR pass" log lines are omitted here.)
[['along with', 'enterprise', 'continued', 'produce', 'of', 'commodity', 'sales volume', ',', 'his', 'data', 'about', 'oneself', 'Marketing', 'plan', ',', 'market analysis', ',', 'logistics', 'plan', 'all', 'have', 'important', 'significance', '. ', 'however', 'sales volume', 'forecast', 'of', 'influence', 'factor', 'various', ',', 'tradition', 'of', 'be based on', 'Statistics', 'of', 'metering', 'Model', ',', 'such as', 'time', 'sequence', 'Model', 'etc.', 'because', 'yes', 'reality', 'of', 'hypothesis', 'situation', 'too much', ',', 'cause', 'forecast', 'result', 'Poor', '. ', 'therefore', 'need', 'More', 'excellent', 'of', 'intelligence', 'AI algorithm', ',', 'with', 'increase', 'forecast', 'of', 'accuracy', ',', 'thus', 'Assist', 'enterprise', 'reduce', 'stock', 'cost', ',', 'shorten', 'delivery', 'cycle', ',', 'increase', 'enterprise', 'resist', 'risk', 'ability', '. '],
['p', 'n', 'vd', 'v', 'u', 'n', 'n', 'w', 'r', 'n', 'p', 'r', 'vn', 'n', 'w', 'n', 'w', 'n', 'n', 'd', 'v', 'a', 'n', 'w', 'c', 'n', 'vn', 'u', 'vn', 'n', 'a', 'w', 'a', 'u', 'p', 'v', 'u', 'vn', 'n', 'w', 'v', 'n', 'n', 'n', 'u', 'p', 'p', 'n', 'u', 'vn', 'n', 'a', 'w', 'v', 'vn', 'n', 'a', 'w', 'c', 'v', 'd', 'a', 'u', 'n', 'nz', 'w', 'p', 'v', 'vn', 'u', 'n', 'w', 'c', 'v', 'n', 'v', 'n', 'n', 'w', 'v', 'vn', 'n', 'w', 'v', 'n', 'v', 'n', 'n', 'w'],
[0, 1, 1, 1, 0, 2, 2, 0, 1, 2, 0, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 0, 0, 2, 2, 0, 2, 1, 2, 0, 2, 0, 0, 2, 0, 2, 1, 0, 1, 2, 2, 1, 0, 0, 0, 2, 0, 2, 1, 2, 0, 1, 2, 2, 2, 0, 0, 1, 1, 2, 0, 2, 2, 0, 0, 2, 2, 0, 2, 0, 0, 2, 1, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0]]
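Since LAC's rank mode returns three parallel lists (words, POS tags, and small integer importance scores, where higher means more important), a short post-processing step can pull out just the important words. The data slice below mirrors the beginning of the output above:

```python
# Sketch: filtering LAC `rank` output by importance score.
# seg_result mirrors the structure LAC returns: [words, pos_tags, ranks];
# this is a hand-copied slice of the output above, not a live LAC call.
seg_result = [
    ["along with", "enterprise", "continued", "produce", "of", "commodity"],
    ["p", "n", "vd", "v", "u", "n"],
    [0, 1, 1, 1, 0, 2],
]
words, tags, ranks = seg_result
# Keep only words whose importance score is at least 2.
keywords = [w for w, r in zip(words, ranks) if r >= 2]
print(keywords)  # -> ['commodity']
```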