@[TOC](In-depth BERT practice (PyTorch) --- WordPiece Embeddings)
https://www.bilibili.com/video/BV1K5411t7MD?p=5
https://www.youtube.com/channel/UCoRX98PLOsaN8PtekB9kWrw/videos
In-depth BERT practice (PyTorch) by ChrisMcCormickAI
This is Chris McCormick's PyTorch code walkthrough of WordPiece Embeddings, the second video in his 8-part BERT series on YouTube. There is a download link for the notebook under the YouTube video; if you cannot access YouTube, leave an email address and I'll send it to you once I've finished going through the whole series.
Load model
Install the Hugging Face implementation
```python
!pip install pytorch-pretrained-bert

import torch
from pytorch_pretrained_bert import BertTokenizer

# Load the pre-trained BERT tokenizer (lowercased English model).
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
```
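As a quick sanity check (just a sketch; the example sentence is arbitrary and not from the video), you can tokenize a sentence and map the resulting WordPiece tokens to their vocabulary ids:

```python
# Split a sentence into WordPiece tokens, then look up their ids.
text = "Here is the sentence I want embeddings for."
tokens = tokenizer.tokenize(text)
ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)
print(ids)
```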
View the BERT vocabulary
Retrieve the entire token list and write it to a text file so you can examine it carefully.
with open("vocabulary.txt", 'w', encoding='utf-8') as f: # For each token... Get each word for token in tokenizer.vocab.keys(): # Write it out and escape any unicode characters. # Write and escape to Unicode characters f.write(token + '\n')
Printing it out, we can observe:

- The first 999 lines are reserved placeholder positions, with names of the form [unused957].
- The special tokens sit at the following line numbers of vocabulary.txt (a lookup sketch follows below):
  - 1 - [PAD], the padding token
  - 101 - [UNK], the unknown token
  - 102 - [CLS], placed at the start of the input; its output is used for classification tasks
  - 103 - [SEP], separates the two sentences of a paired input in BERT
  - 104 - [MASK], used by the masked language modeling objective
- Lines 1000-1996 appear to be a dump of individual characters.
- They do not appear to be sorted by frequency (for example, the letters of the alphabet are in sequential order).
- The first actual word appears at position 1997.
- From here on, the words appear to be sorted by frequency.
The first 18 of these are whole words; entry 2016 is ##s, presumably the most common subword. The last whole word is "necessary", at position 29,612.
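If you want to verify these positions yourself, a minimal lookup works. Note that tokenizer.vocab maps each token to its 0-based id, so line N of vocabulary.txt corresponds to id N-1.

```python
# Look up the ids of the special tokens (0-based; add 1 to get the
# corresponding line number in vocabulary.txt).
for tok in ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']:
    print(tok, tokenizer.vocab[tok])

# Total vocabulary size.
print('Vocabulary size:', len(tokenizer.vocab))
```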
Single characters
The following code prints all single-character tokens in the vocabulary, and all single-character tokens preceded by '##'.
It turns out these are matching sets -- every individual character also has a '##' version. There are 997 single-character tokens.
The following cell iterates through the vocabulary, pulling out all single-character tokens.
```python
one_chars = []
one_chars_hashes = []

# For each token in the vocabulary...
for token in tokenizer.vocab.keys():
    # Record any single-character tokens.
    if len(token) == 1:
        one_chars.append(token)
    # Record single-character tokens preceded by the two hashes.
    elif len(token) == 3 and token[0:2] == '##':
        one_chars_hashes.append(token)
```
Print the single characters
```python
print('Number of single character tokens:', len(one_chars), '\n')

# Print all of the single characters, 40 per row.
for i in range(0, len(one_chars), 40):
    # Limit the end index so we don't go past the end of the list.
    end = min(i + 40, len(one_chars) + 1)
    # Print out the tokens, separated by a space.
    print(' '.join(one_chars[i:end]))
```
Print the '##' versions
```python
print('Number of single character tokens with hashes:', len(one_chars_hashes), '\n')

# Strip the hash marks, since they just clutter the display.
tokens = [token.replace('##', '') for token in one_chars_hashes]

# Print all of the single characters, 40 per row.
for i in range(0, len(tokens), 40):
    # Limit the end index so we don't go past the end of the list.
    end = min(i + 40, len(tokens) + 1)
    # Print out the tokens, separated by a space.
    print(' '.join(tokens[i:end]))
```
```
Number of single character tokens: 997

! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ [ \ ] ^ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~
¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ° ± ² ³ ´ µ ¶ · ¹ º » ¼ ½ ¾ ¿ × ß æ ð ÷ ø þ đ ħ ı ł ŋ œ ƒ
ɐ ɑ ɒ ɔ ɕ ə ɛ ɡ ɣ ɨ ɪ ɫ ɬ ɯ ɲ ɴ ɹ ɾ ʀ ʁ ʂ ʃ ʉ ʊ ʋ ʌ ʎ ʐ ʑ ʒ ʔ ʰ ʲ ʳ ʷ ʸ ʻ ʼ ʾ ʿ ˈ ː ˡ ˢ ˣ ˤ
α β γ δ ε ζ η θ ι κ λ μ ν ξ ο π ρ ς σ τ υ φ χ ψ ω
а б в г д е ж з и к л м н о п р с т у ф х ц ч ш щ ъ ы ь э ю я ђ є і ј љ њ ћ ӏ
ա բ գ դ ե թ ի լ կ հ մ յ ն ո պ ս վ տ ր ւ ք
־ א ב ג ד ה ו ז ח ט י ך כ ל ם מ ן נ ס ע ף פ ץ צ ק ר ש ת
، ء ا ب ة ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ـ ف ق ك ل م ن ه و ى ي ٹ پ چ ک گ ں ھ ہ ی ے
अ आ उ ए क ख ग च ज ट ड ण त थ द ध न प ब भ म य र ल व श ष स ह ा ि ी ो । ॥
ং অ আ ই উ এ ও ক খ গ চ ছ জ ট ড ণ ত থ দ ধ ন প ব ভ ম য র ল শ ষ স হ া ি ী ে
க ச ட த ந ன ப ம ய ர ல ள வ ா ி ு ே ை ನ ರ ಾ ක ය ර ල ව ා
ก ง ต ท น พ ม ย ร ล ว ส อ า เ ་ ། ག ང ད ན པ བ མ འ ར ལ ས မ
ა ბ გ დ ე ვ თ ი კ ლ მ ნ ო რ ს ტ უ
ᄀ ᄂ ᄃ ᄅ ᄆ ᄇ ᄉ ᄊ ᄋ ᄌ ᄎ ᄏ ᄐ ᄑ ᄒ ᅡ ᅢ ᅥ ᅦ ᅧ ᅩ ᅪ ᅭ ᅮ ᅯ ᅲ ᅳ ᅴ ᅵ ᆨ ᆫ ᆯ ᆷ ᆸ ᆼ
ᴬ ᴮ ᴰ ᴵ ᴺ ᵀ ᵃ ᵇ ᵈ ᵉ ᵍ ᵏ ᵐ ᵒ ᵖ ᵗ ᵘ ᵢ ᵣ ᵤ ᵥ ᶜ ᶠ
‐ ‑ ‒ – — ― ‖ ‘ ’ ‚ “ ” „ † ‡ • … ‰ ′ ″ › ‿ ⁄
⁰ ⁱ ⁴ ⁵ ⁶ ⁷ ⁸ ⁹ ⁺ ⁻ ⁿ ₀ ₁ ₂ ₃ ₄ ₅ ₆ ₇ ₈ ₉ ₊ ₍ ₎ ₐ ₑ ₒ ₓ ₕ ₖ ₗ ₘ ₙ ₚ ₛ ₜ
₤ ₩ € ₱ ₹ ℓ № ℝ ™ ⅓ ⅔ ← ↑ → ↓ ↔ ↦ ⇄ ⇌ ⇒
∂ ∅ ∆ ∇ ∈ − ∗ ∘ √ ∞ ∧ ∨ ∩ ∪ ≈ ≡ ≤ ≥ ⊂ ⊆ ⊕ ⊗ ⋅
─ │ ■ ▪ ● ★ ☆ ☉ ♠ ♣ ♥ ♦ ♭ ♯ ⟨ ⟩ ⱼ
[CJK punctuation marks, garbled in translation, omitted] 〈 〉 < > 「 」 『 』 〜
あ い う え お か き く け こ さ し す せ そ た ち っ つ て と な に ぬ ね の は ひ ふ へ ほ ま み む め も や ゆ よ ら り る れ ろ を ん
ァ ア ィ イ ウ ェ エ オ カ キ ク ケ コ サ シ ス セ タ チ ッ ツ テ ト ナ ニ ノ ハ ヒ フ ヘ ホ マ ミ ム メ モ ャ ュ ョ ラ リ ル レ ロ ワ ン ・ ー
[several hundred CJK ideograph tokens, garbled by machine translation, omitted]
fi fl ! ( ) , - . / : ? ~
```
The two lists above -- the single characters and the '##' single characters -- turn out to contain exactly the same characters:
```python
# This prints True: the two sets are identical.
print('Are the two sets identical?', set(one_chars) == set(tokens))
```
Subwords vs. Whole-words
Print some vocabulary statistics.
```python
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.set(style='darkgrid')

# Increase the plot size and font size.
sns.set(font_scale=1.5)
plt.rcParams["figure.figsize"] = (10, 5)

# Measure the length of every token in the vocab.
token_lengths = [len(token) for token in tokenizer.vocab.keys()]

# Plot the number of tokens of each length.
sns.countplot(token_lengths)
plt.title('Vocab Token Lengths')
plt.xlabel('Token Length')
plt.ylabel('# of Tokens')

print('Maximum token length:', max(token_lengths))
```
Count the tokens that begin with '##'.
```python
num_subwords = 0
subword_lengths = []

# For each token in the vocabulary...
for token in tokenizer.vocab.keys():
    # If it's a subword...
    if len(token) >= 2 and token[0:2] == '##':
        # Tally all subwords.
        num_subwords += 1
        # Measure the subword length (without the hashes).
        length = len(token) - 2
        # Record the length.
        subword_lengths.append(length)
```
How large a share of the full vocabulary do they make up?
```python
vocab_size = len(tokenizer.vocab.keys())

print('Number of subwords: {:,} of {:,}'.format(num_subwords, vocab_size))

# Calculate the percentage of words that are '##' subwords.
prcnt = float(num_subwords) / vocab_size * 100.0
print('%.1f%%' % prcnt)
```
```
Number of subwords: 5,828 of 30,522
19.1%
```
Plot the distribution of subword lengths:
```python
sns.countplot(subword_lengths)
plt.title('Subword Token Lengths (w/o "##")')
plt.xlabel('Subword Length')
plt.ylabel('# of ## Subwords')
```
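To see the split in action, here is a small sketch (the two input words are just illustrative): "necessary" is a whole-word token as noted above and should come back unchanged, while "embeddings" is not in the vocabulary and gets assembled from '##' subword pieces.

```python
# A whole word that is in the vocabulary stays intact;
# a word that isn't gets broken into subword pieces.
print(tokenizer.tokenize('necessary'))
print(tokenizer.tokenize('embeddings'))
```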
You can check for yourself which spellings are in the vocabulary:
```python
'misspelled' in tokenizer.vocab   # correct spelling
'mispelled' in tokenizer.vocab    # misspelling
'government' in tokenizer.vocab   # correct spelling
'goverment' in tokenizer.vocab    # misspelling
'beginning' in tokenizer.vocab    # correct spelling
'begining' in tokenizer.vocab     # misspelling
'separate' in tokenizer.vocab     # correct spelling
'seperate' in tokenizer.vocab     # misspelling
```
For contractions:
"can't" in tokenizer.vocab # False "cant" in tokenizer.vocab # False
Start-of-word vs. mid-word subwords
For single characters, both the plain and the '##' version exist for every character. Is the same true for longer subwords?
```python
# For each token in the vocabulary...
for token in tokenizer.vocab.keys():
    # If it's a subword...
    if len(token) >= 2 and token[0:2] == '##':
        # Check whether the un-prefixed version also exists.
        if not token[2:] in tokenizer.vocab:
            print('Did not find a token for', token[2:])
            break
```
You can see that the first mismatch is "ly": '##ly' is in the vocabulary, but 'ly' on its own is not.
```
Did not find a token for ly
```

```python
'##ly' in tokenizer.vocab   # True
'ly' in tokenizer.vocab     # False
```
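So '##ly' can only appear as a continuation inside a longer word. If the bare string "ly" ever shows up as its own word, WordPiece has to assemble it from other pieces, as this small sketch shows:

```python
# 'ly' is not a standalone token, so the tokenizer has to build it
# from a word-initial piece plus '##' continuation pieces.
print(tokenizer.tokenize('ly'))
```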
For names
Download data
```python
!pip install wget

import wget
import random

print('Beginning file download with wget module')

url = 'http://www.gutenberg.org/files/3201/files/NAMES.TXT'
wget.download(url, 'first-names.txt')
```
Decode the names, convert them to lowercase, and print how many there are:
```python
# Read them in.
with open('first-names.txt', 'rb') as f:
    names_encoded = f.readlines()

names = []

# Decode the names, convert to lowercase, and strip newlines.
for name in names_encoded:
    try:
        names.append(name.rstrip().lower().decode('utf-8'))
    except:
        continue

print('Number of names: {:,}'.format(len(names)))
print('Example:', random.choice(names))
```
See how many of these names are in BERT's vocabulary:
```python
num_names = 0

# For each name in our list...
for name in names:
    # If it's in the vocab...
    if name in tokenizer.vocab:
        # Tally it.
        num_names += 1

print('{:,} names in the vocabulary'.format(num_names))
```
For numbers
```python
# Count how many numbers are in the vocabulary.
count = 0

# For each token in the vocabulary...
for token in tokenizer.vocab.keys():
    # Tally if it's a number.
    if token.isdigit():
        count += 1

        # Any numbers >= 10,000?
        if len(token) > 4:
            print(token)

print('Vocab includes {:,} numbers.'.format(count))
```
Count how many of the years between 1600 and 2021 are in the vocabulary:
```python
# Count how many dates between 1600 and 2021 are included.
count = 0
for i in range(1600, 2021):
    if str(i) in tokenizer.vocab:
        count += 1

print('Vocab includes {:,} of 421 dates from 1600 - 2021'.format(count))
```