Author: Daulet Nurmanbetov | Translation: VK | Source: Towards Data Science
Have you ever needed to condense a lengthy document into a short summary? As you know, this process is tedious and slow for us humans: we have to read the entire document, focus on the important sentences, and finally rewrite them into a coherent summary.
This is where automatic summarization can help. Machine learning has made great strides in summarization, but there is still plenty of room for improvement. Generally, machine summarization comes in two types:
Extractive summarization: select the important sentences that already appear in the document and copy them out verbatim.
Abstractive summarization: express the important ideas or facts contained in the document without necessarily reusing its words. This is what we usually mean when we are asked to summarize a document.
I would like to show you some recent results in abstractive summarization using BERT_Sum_Abs, from Yang Liu and Mirella Lapata's Text Summarization with Pretrained Encoders: https://arxiv.org/pdf/1908.08345.pdf
Performance of BERT_Sum_Abs
Summarization aims to compress a document into a shorter version while retaining most of its meaning. Abstractive summarization requires language-generation capability, because the summary may contain new words and phrases that are not present in the source document. Extractive summarization, by contrast, is usually framed as a binary classification task whose labels indicate whether a text span (usually a sentence) should be included in the summary.
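To make the extractive framing concrete, here is a toy sketch (not the paper's model, which learns these inclusion labels with a neural encoder): score each sentence by how frequent its words are in the whole document, and "label" the top-k sentences for inclusion.

```python
from collections import Counter
import re

def naive_extractive_summary(text, k=2):
    """Toy extractive summarizer: score each sentence by the average
    document-frequency of its words, then keep the k highest-scoring
    sentences in their original order. Illustrative only."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sent):
        toks = re.findall(r"[a-z']+", sent.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)

    # Binary "include in summary" labels: 1 for the k best sentences
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = set(ranked[:k])
    return [s for i, s in enumerate(sentences) if i in keep]
```

A learned extractive model replaces the frequency score with a classifier over sentence representations, but the selection step is the same.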
Here is how BERT_Sum_Abs performs on the standard summarization datasets, CNN and Daily Mail, which are commonly used for benchmarking. The evaluation metric is the ROUGE F1 score.
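For intuition, here is a minimal sketch of ROUGE-1 F1, the unigram-overlap variant (real evaluations use a full ROUGE package, which adds stemming, ROUGE-2, ROUGE-L, and so on):

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: clipped unigram overlap between a candidate summary
    and a reference summary, combined into an F1 score."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    if not cand or not ref or overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Precision asks how much of the candidate appears in the reference; recall asks how much of the reference the candidate covers; F1 balances the two.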
The results show that the BERT_Sum_Abs model outperforms most non-Transformer-based models. Even better, the code behind the model is open source, and the implementation is available on GitHub (https://github.com/huggingface/transformers/tree/master/examples/summarization/bertabs).
Demonstration and code
Let's generate an abstractive summary of an article as an example. The article we'll summarize is a piece on Fed officials saying central bankers are unanimous in their response to the coronavirus. Here is the full text:
The Federal Reserve Bank of New York president, John C. Williams, made clear on Thursday evening that officials viewed the emergency rate cut they approved earlier this week as part of an international push to cushion the economy as the coronavirus threatens global growth. Mr. Williams, one of the Fed's three key leaders, spoke in New York two days after the Fed slashed borrowing costs by half a point in its first emergency move since the depths of the 2008 financial crisis. The move came shortly after a call between finance ministers and central bankers from the Group of 7, which also includes Britain, Canada, France, Germany, Italy and Japan. "Tuesday's phone call between G7 finance ministers and central bank governors, the subsequent statement, and policy actions by central banks are clear indications of the close alignment at the international level," Mr. Williams said in a speech to the Foreign Policy Association. Rate cuts followed in Canada, Asia and the Middle East on Wednesday. The Bank of Japan and European Central Bank — which already have interest rates set below zero — have yet to further cut borrowing costs, but they have pledged to support their economies. Mr. Williams's statement is significant, in part because global policymakers were criticized for failing to satisfy market expectations for a coordinated rate cut among major economies. Stock prices temporarily rallied after the Fed's announcement, but quickly sank again. Central banks face challenges in offsetting the economic shock of the coronavirus. Many were already working hard to stoke stronger economic growth, so they have limited room for further action. That makes the kind of carefully orchestrated, lock step rate cut central banks undertook in October 2008 all but impossible. Interest rate cuts can also do little to soften the near-term hit from the virus, which is forcing the closure of offices and worker quarantines and delaying shipments of goods as infections spread across the globe. 
"It's up to individual countries, individual fiscal policies and individual central banks to do what they were going to do," Fed Chair Jerome H. Powell said after the cut, noting that different nations had "different situations." Mr. Williams reiterated Mr. Powell's pledge that the Fed would continue monitoring risks in the "weeks and months" ahead. Economists widely expect another quarter-point rate cut at the Fed's March 18 meeting. The New York Fed president, whose reserve bank is partly responsible for ensuring financial markets are functioning properly, also promised that the Fed stood ready to act as needed to make sure that everything is working smoothly. Since September, when an obscure but crucial corner of money markets experienced unusual volatility, the Fed has been temporarily intervening in the market to keep it calm. The goal is to keep cash flowing in the market for overnight and short-term loans between banks and other financial institutions. The central bank has also been buying short-term government debt. "We remain flexible and ready to make adjustments to our operations as needed to ensure that monetary policy is effectively implemented and transmitted to financial markets and the broader economy," Mr. Williams said Thursday.
First, we need to get the model code, install the dependencies, and download the dataset, which you can easily do on your own Linux computer as follows:
# Install Huggingface Transformers
git clone https://github.com/huggingface/transformers && cd transformers
pip install .
pip install nltk py-rouge
cd examples/summarization

#------------------------------
# Download the original summarization datasets from Google Drive on Linux
wget --save-cookies cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/Code: \1\n/p'
wget --load-cookies cookies.txt --no-check-certificate 'https://drive.google.com/uc?export=download&confirm=<CONFIRMATION CODE HERE>&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ' -O cnn_stories.tgz
wget --save-cookies cookies.txt --keep-session-cookies --no-check-certificate 'https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfM1BxdkxVaTY2bWs' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/Code: \1\n/p'
wget --load-cookies cookies.txt --no-check-certificate 'https://drive.google.com/uc?export=download&confirm=<CONFIRMATION CODE HERE>&id=0BwmD_VLjROrfM1BxdkxVaTY2bWs' -O dailymail_stories.tgz

# Unpack the archives
tar -xvf cnn_stories.tgz && tar -xvf dailymail_stories.tgz
rm cnn_stories.tgz dailymail_stories.tgz

# Move the articles into place (the archives unpack to cnn/stories and dailymail/stories)
mkdir bertabs/dataset
mkdir bertabs/summaries_out
cp -r cnn/stories/. bertabs/dataset/
cp -r dailymail/stories/. bertabs/dataset/

# Select a subset of articles to summarize
mkdir bertabs/dataset2
cd bertabs/dataset && find . -maxdepth 1 -type f | head -1000 | xargs cp -t ../dataset2/
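Each downloaded .story file contains the article text followed by @highlight blocks, which hold the reference-summary bullets. A small sketch (the filename and helper are illustrative, not part of the repository's code) to split one file's contents into the article and its highlights:

```python
def parse_story(raw):
    """Split the contents of a CNN/DailyMail .story file into the
    article text and its @highlight reference-summary lines."""
    article_lines, highlights = [], []
    next_is_highlight = False
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "@highlight":
            next_is_highlight = True  # the next non-empty line is a summary bullet
        elif next_is_highlight:
            highlights.append(line)
            next_is_highlight = False
        else:
            article_lines.append(line)
    return " ".join(article_lines), highlights
```

This is handy for spot-checking what the model is given as input versus what the ROUGE evaluation compares against.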
After executing the code above, we run the Python command shown below to produce abstractive summaries for the documents in the /dataset2 directory:
python run_summarization.py \
    --documents_dir bertabs/dataset2 \
    --summaries_output_dir bertabs/summaries_out \
    --batch_size 64 \
    --min_length 50 \
    --max_length 200 \
    --beam_size 5 \
    --alpha 0.95 \
    --block_trigram true \
    --compute_rouge true
The parameters are as follows:
documents_dir: the folder containing the documents to summarize
summaries_output_dir: the folder the summaries are written to; defaults to the documents folder
batch_size: batch size per GPU/CPU
min_length: the minimum number of tokens in each summary
max_length: the maximum number of tokens in each summary
beam_size: the number of beams used when generating each summary
alpha: the alpha value of the length penalty in beam search (the larger the value, the larger the penalty)
block_trigram: whether to block repeated trigrams in text generated by beam search
compute_rouge: compute the ROUGE metric during evaluation; only available for the CNN/DailyMail dataset
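The block_trigram option implements a common beam-search heuristic for summarization: drop any candidate that would repeat a trigram it has already generated, which curbs the repetitive output neural decoders are prone to. A sketch of the underlying check (illustrative; the repository's decoder applies the equivalent test inside beam search):

```python
def has_repeated_trigram(tokens):
    """Return True if any trigram occurs more than once in tokens.
    With trigram blocking, beam search discards hypotheses for
    which this check fires."""
    seen = set()
    for i in range(len(tokens) - 2):
        tri = tuple(tokens[i:i + 3])
        if tri in seen:
            return True
        seen.add(tri)
    return False
```

During decoding, each extended hypothesis is tested this way; those that would duplicate a trigram are pruned from the beam.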
After BERT_Sum_Abs completes, we get the following summary:
The Fed slashed borrowing costs by half a point in its first emergency move since the depths of the 2008 financial crisis. Rate cuts followed in Canada, Asia and the Middle East on Wednesday. The Bank of Japan and European Central Bank have yet to further cut borrowing costs, but they have pledged to support their economies.
Here is another English article: https://news.stonybrook.edu/newsroom/study-shows-low-carb-diet-may-prevent-reverse-age-related-effects-within-the-brain/
The resulting summary is as follows
The research team focused on the Presymptomatic period during which prevention may be most effective. They showed that communication between brain regions destabilizes with age, typically in the late 40's, and that destabilization associated with poorer cognition. The good news is that we may be able to prevent or reverse these effects with diet, mitigating the impact of encroaching Hypometabolism by exchanging glucose for ketones as fuel for neurons.
As you can see, BERT is improving every aspect of NLP. And because models like this are open source, we can watch NLP performance approach human level day by day.
Commercial NLP products are on the horizon: each new NLP model not only sets new records on benchmarks, but can be used by anyone. Just as OCR technology was commoditized a decade ago, NLP will be commoditized in the next few years.