In-Depth BERT Practice (PyTorch)

@[TOC](In-Depth BERT Practice (PyTorch): WordPiece Embeddings)
In-Depth BERT Practice (PyTorch) by ChrisMcCormickAI

This post walks through the code from ChrisMcCormickAI's PyTorch notebook on WordPiece Embeddings, the second in his 8-episode YouTube series on BERT. There is a download link for the notebook under the YouTube video. If you can't access it, leave an email address and I'll send it to you once I've finished working through the series.

Load model

Install the Hugging Face implementation

!pip install pytorch-pretrained-bert

import torch
from pytorch_pretrained_bert import BertTokenizer

# Load the pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

View the vocabulary in BERT

Retrieve the entire "tokens" list and write them to a text file so you can read them carefully.

with open("vocabulary.txt", 'w', encoding='utf-8') as f:
    # For each token in the vocabulary...
    for token in tokenizer.vocab.keys():
        # Write it out, one token per line.
        f.write(token + '\n')
  • Opening the file, a few observations:

  • The first 999 positions are reserved, with tokens of the form [unused957].

  • The special tokens and their ids:
    0 - [PAD], padding
    100 - [UNK], unknown token
    101 - [CLS], marks the start of the input, used for classification tasks
    102 - [SEP], separator between the two input sentences in BERT
    103 - [MASK], the token used by the masking mechanism

  • Rows 1000-1996 appear to be a dump of individual characters.

  • They do not appear to be sorted by frequency (for example, the letters of the alphabet appear in alphabetical order).

  • The first whole word, "the", is at position 1997.

  • From here on, the tokens appear to be sorted by frequency.
    The first 18 tokens are whole words; token 2016 is ##s, presumably the most common subword.
    The last whole word is "necessary", at position 29,612.
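The special tokens above frame every BERT input. As a minimal sketch of how they are used (the token lists here are made up for illustration, not taken from the tutorial), a sentence pair is assembled like this:

```python
# Hypothetical, already-tokenized sentence pair (for illustration only).
tokens_a = ['the', 'cat', 'sat']
tokens_b = ['it', 'was', 'tired']

# BERT's input format: [CLS] sentence A [SEP] sentence B [SEP]
tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']

# Segment ids mark which sentence each token belongs to:
# 0 for sentence A (including its [SEP]), 1 for sentence B.
segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

print(tokens)
print(segment_ids)
```

For a single sentence the pattern is the same, just without the second segment: [CLS] sentence [SEP].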

Single characters

The following code prints all single-character tokens in the vocabulary, and then all single-character tokens preceded by '##'.

It turns out these are matching sets: every individual character also has a '##' version. There are 997 single-character tokens.

The following cell iterates through the vocabulary, pulling out all single-character tokens.

one_chars = []
one_chars_hashes = []

# For each token in the vocabulary...
for token in tokenizer.vocab.keys():
    # Record any single-character tokens.
    if len(token) == 1:
        one_chars.append(token)
    # Record single-character tokens preceded by the two hashes.
    elif len(token) == 3 and token[0:2] == '##':
        one_chars_hashes.append(token)

Print the single characters

print('Number of single character tokens:', len(one_chars), '\n')

# Print all of the single characters, 40 per row.

# For every batch of 40 tokens...
for i in range(0, len(one_chars), 40):
    # Limit the end index so we don't go past the end of the list.
    end = min(i + 40, len(one_chars))
    # Print out the tokens, separated by a space.
    print(' '.join(one_chars[i:end]))

Print the '##' versions

print('Number of single character tokens with hashes:', len(one_chars_hashes), '\n')

# Print all of the single characters, 40 per row.

# Strip the hash marks, since they just clutter the display.
tokens = [token.replace('##', '') for token in one_chars_hashes]

# For every batch of 40 tokens...
for i in range(0, len(tokens), 40):
    # Limit the end index so we don't go past the end of the list.
    end = min(i + 40, len(tokens))
    # Print out the tokens, separated by a space.
    print(' '.join(tokens[i:end]))
Number of single character tokens: 997 

! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ [ \ ] ^ _ ` a b
c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬
® ° ± ² ³ ´ µ ¶ · ¹ º » ¼ ½ ¾ ¿ × ß æ ð ÷ ø þ đ ħ ı ł ŋ œ ƒ ɐ ɑ ɒ ɔ ɕ ə ɛ ɡ ɣ ɨ
ɪ ɫ ɬ ɯ ɲ ɴ ɹ ɾ ʀ ʁ ʂ ʃ ʉ ʊ ʋ ʌ ʎ ʐ ʑ ʒ ʔ ʰ ʲ ʳ ʷ ʸ ʻ ʼ ʾ ʿ ˈ ː ˡ ˢ ˣ ˤ α β γ δ
ε ζ η θ ι κ λ μ ν ξ ο π ρ ς σ τ υ φ χ ψ ω а б в г д е ж з и к л м н о п р с т у
ф х ц ч ш щ ъ ы ь э ю я ђ є і ј љ њ ћ ӏ ա բ գ դ ե թ ի լ կ հ մ յ ն ո պ ս վ տ ր ւ
ք ־ א ב ג ד ה ו ז ח ט י ך כ ל ם מ ן נ ס ע ף פ ץ צ ק ר ש ת ، ء ا ب ة ت ث ج ح خ د
ذ ر ز س ش ص ض ط ظ ع غ ـ ف ق ك ل م ن ه و ى ي ٹ پ چ ک گ ں ھ ہ ی ے अ आ उ ए क ख ग च
ज ट ड ण त थ द ध न प ब भ म य र ल व श ष स ह ा ि ी ो । ॥ ং অ আ ই উ এ ও ক খ গ চ ছ জ
ট ড ণ ত থ দ ধ ন প ব ভ ম য র ল শ ষ স হ া ি ী ে க ச ட த ந ன ப ம ய ர ல ள வ ா ி ு ே
ை ನ ರ ಾ ක ය ර ල ව ා ก ง ต ท น พ ม ย ร ล ว ส อ า เ ་ ། ག ང ད ན པ བ མ འ ར ལ ས မ ა
ბ გ დ ე ვ თ ი კ ლ მ ნ ო რ ს ტ უ ᄀ ᄂ ᄃ ᄅ ᄆ ᄇ ᄉ ᄊ ᄋ ᄌ ᄎ ᄏ ᄐ ᄑ ᄒ ᅡ ᅢ ᅥ ᅦ ᅧ ᅩ ᅪ ᅭ ᅮ
ᅯ ᅲ ᅳ ᅴ ᅵ ᆨ ᆫ ᆯ ᆷ ᆸ ᆼ ᴬ ᴮ ᴰ ᴵ ᴺ ᵀ ᵃ ᵇ ᵈ ᵉ ᵍ ᵏ ᵐ ᵒ ᵖ ᵗ ᵘ ᵢ ᵣ ᵤ ᵥ ᶜ ᶠ ‐ ‑ ‒ – — ―
‖ ' ' ‚ " " „ † ‡ • ... ‰ ′ ″ › ‿ ⁄ ⁰ ⁱ ⁴ ⁵ ⁶ ⁷ ⁸ ⁹ ⁺ ⁻ ⁿ ₀ ₁ ₂ ₃ ₄ ₅ ₆ ₇ ₈ ₉ ₊ ₍
₎ ₐ ₑ ₒ ₓ ₕ ₖ ₗ ₘ ₙ ₚ ₛ ₜ ₤ ₩ € ₱ ₹ ℓ № ℝ ™ ⅓ ⅔ ← ↑ → ↓ ↔ ↦ ⇄ ⇌ ⇒ ∂ ∅ ∆ ∇ ∈ − ∗
∘ √ ∞ ∧ ∨ ∩ ∪ ≈ ≡ ≤ ≥ ⊂ ⊆ ⊕ ⊗ ⋅ ─ │ ■ ▪ ● ★ ☆ ☉ ♠ ♣ ♥ ♦ ♭ ♯ ⟨ ⟩ ⱼ 〈 〉
< > 「 」 『 』 〜 あ い う え お か き く け こ さ し す せ そ た ち っ つ て と な に ぬ ね の は ひ ふ へ ほ ま み
む め も や ゆ よ ら り る れ ろ を ん ァ ア ィ イ ウ ェ エ オ カ キ ク ケ コ サ シ ス セ タ チ ッ ツ テ ト ナ ニ ノ ハ
ヒ フ ヘ ホ マ ミ ム メ モ ャ ュ ョ ラ リ ル レ ロ ワ ン ・ ー [several rows of CJK (Chinese character) tokens, garbled by machine translation in this copy, omitted] fi fl ! ( ) , - . / : ? ~

The two lists above match: stripping the '##' from the hashed single characters yields exactly the same set as the plain single characters.

# Returns True
print('Are the two sets identical?', set(one_chars) == set(tokens))

Subwords vs. Whole-words

Print some vocabulary statistics.

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


# Increase the plot size and font size.
sns.set(font_scale=1.5)
plt.rcParams["figure.figsize"] = (10, 5)

# Measure the length of every token in the vocab.
token_lengths = [len(token) for token in tokenizer.vocab.keys()]

# Plot the number of tokens of each length.
sns.countplot(token_lengths)
plt.title('Vocab Token Lengths')
plt.xlabel('Token Length')
plt.ylabel('# of Tokens')
plt.show()

print('Maximum token length:', max(token_lengths))

Count the tokens beginning with '##'.

num_subwords = 0

subword_lengths = []

# For each token in the vocabulary...
for token in tokenizer.vocab.keys():
    # If it's a subword...
    if len(token) >= 2 and token[0:2] == '##':
        # Tally all subwords.
        num_subwords += 1

        # Measure the subword length (without the hashes).
        length = len(token) - 2

        # Record the length.
        subword_lengths.append(length)

How many subwords are there relative to the full vocabulary?

vocab_size = len(tokenizer.vocab.keys())

print('Number of subwords: {:,} of {:,}'.format(num_subwords, vocab_size))

# Calculate the percentage of words that are '##' subwords.
prcnt = float(num_subwords) / vocab_size * 100.0

print('%.1f%%' % prcnt)
Number of subwords: 5,828 of 30,522
19.1%

Plot the subword length statistics

sns.countplot(subword_lengths)
plt.title('Subword Token Lengths (w/o "##")')
plt.xlabel('Subword Length')
plt.ylabel('# of ## Subwords')
plt.show()

You can check some common misspellings yourself:

'misspelled' in tokenizer.vocab  # correct spelling
'mispelled' in tokenizer.vocab   # misspelling
'government' in tokenizer.vocab  # correct spelling
'goverment' in tokenizer.vocab   # misspelling
'beginning' in tokenizer.vocab   # correct spelling
'begining' in tokenizer.vocab    # misspelling
'separate' in tokenizer.vocab    # correct spelling
'seperate' in tokenizer.vocab    # misspelling

For contractions

"can't" in tokenizer.vocab    # False
"cant" in tokenizer.vocab    # False

Leading and intermediate subwords

For single characters, the vocabulary contains both the plain character and a '##' version of each one. Is the same true for longer subwords?

# For each token in the vocabulary...
for token in tokenizer.vocab.keys():
    # If it's a subword...
    if len(token) >= 2 and token[0:2] == '##':
        if not token[2:] in tokenizer.vocab:
            print('Did not find a token for', token[2:])

The first result shows that '##ly' is in the vocabulary, but a standalone 'ly' is not:

Did not find a token for ly

'##ly' in tokenizer.vocab    # True
'ly' in tokenizer.vocab    # False
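This leading/continuation split is what WordPiece's greedy longest-match-first algorithm relies on. The sketch below is a simplified reimplementation over a tiny made-up vocabulary (`toy_vocab` and the function name are my own, not from the tutorial), but it shows why the tokenizer only ever needs '##ly' and never a bare 'ly':

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first WordPiece over a single word.

    The first piece is looked up as-is; every later piece is looked up
    with the '##' continuation prefix.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        cur = None
        # Try the longest remaining substring first, shrinking until a match.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = '##' + piece
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ['[UNK]']  # no piece matched: the whole word is unknown
        pieces.append(cur)
        start = end
    return pieces

# Hypothetical mini-vocabulary, just for illustration.
toy_vocab = {'quick', '##ly', 'embed', '##ding', '##s', '[UNK]'}

print(wordpiece_tokenize('quickly', toy_vocab))     # ['quick', '##ly']
print(wordpiece_tokenize('embeddings', toy_vocab))  # ['embed', '##ding', '##s']
```

A real tokenizer also lowercases, splits on punctuation, and caps the word length before this loop runs, but the matching logic is the heart of it.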

For names

Download data

!pip install wget

import wget
import random 

print('Beginning file download with wget module')

url = ''  # the download URL is omitted in the original post
wget.download(url, 'first-names.txt')

Decode, lowercase, and count the names

# Read them in.
with open('first-names.txt', 'rb') as f:
    names_encoded = f.readlines()

names = []

# Decode the names, convert to lowercase, and strip newlines.
for name in names_encoded:
    names.append(name.decode('utf-8').lower().strip())
print('Number of names: {:,}'.format(len(names)))
print('Example:', random.choice(names))

See how many of the names are in BERT's vocabulary

num_names = 0

# For each name in our list...
for name in names:

    # If it's in the vocab...
    if name in tokenizer.vocab:
        # Tally it.
        num_names += 1

print('{:,} names in the vocabulary'.format(num_names))

For numbers

# Count how many numbers are in the vocabulary.
count = 0

# For each token in the vocabulary...
for token in tokenizer.vocab.keys():

    # Tally if it's a number.
    if token.isdigit():
        count += 1
        # Any numbers >= 10,000?
        if len(token) > 4:
            print(token)

print('Vocab includes {:,} numbers.'.format(count))

Count how many of the years 1600-2021 are in the vocabulary

# Count how many dates between 1600 and 2021 are included.
count = 0 
for i in range(1600, 2021):
    if str(i) in tokenizer.vocab:
        count += 1

print('Vocab includes {:,} of 421 dates from 1600 - 2021'.format(count))

Tags: NLP

Posted by jumpenjuhosaphat on Wed, 27 Jul 2022 21:42:48 +0530