Character encoding and file handling

1, Background knowledge

1. Relationship between running a program and the three core hardware components

A program is first stored on the hard disk.
When the program runs, it is loaded from the hard disk into memory,
then the CPU reads the instructions from memory and executes them.

2. Data generated by a running program is initially stored in memory
3. Three steps for running a Python program:
For example: python3 D:\a.py

① Start the Python interpreter
② The interpreter loads the contents of a.py from the hard disk into memory as plain text (Python syntax has no meaning at this stage)
③ The interpreter interprets and executes the contents just read into memory; only now is Python syntax recognized
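
Steps ② and ③ can be imitated by hand as a rough sketch. The script contents and the temporary a.py below are made up for illustration, and compile/exec only approximate what the interpreter does internally:

```python
import os
import tempfile

# Hypothetical script contents, written to a temporary a.py for the demo.
source_text = "x = 1 + 2\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    script_path = os.path.join(tmp_dir, "a.py")
    with open(script_path, "w", encoding="utf-8") as f:
        f.write(source_text)

    # Step 2: the file is read from disk into memory as ordinary text.
    with open(script_path, "r", encoding="utf-8") as f:
        loaded_text = f.read()

# At this point loaded_text is just a str; nothing has been executed yet.
print(type(loaded_text), repr(loaded_text))

# Step 3: the text is parsed as Python syntax and then executed.
namespace = {}
exec(compile(loaded_text, "a.py", "exec"), namespace)
print(namespace["x"])  # 3
```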

2, Character encoding

Character ----------- (standard) ------- number
Character encoding table: a table that stores the mapping between characters and numbers
1. ASCII: only recognizes English characters
Features: one English character is stored in 8 bits, i.e. one byte (the table itself only defines the 7-bit codes 0-127)
2. GBK: recognizes Chinese and English characters
Features: 16 bits (two bytes) are used for one Chinese character; English characters stay at one byte, the same as in ASCII
3. Shift JIS: recognizes Japanese and English characters
4. Unicode: recognizes the characters of all languages (universal characters)
Features: in its original fixed-width form (UCS-2), two bytes correspond to one character

Characters -------> Unicode numbers (Unicode also keeps a mapping to the numbers used by GBK, Shift JIS, ASCII and the other encoding tables)

Chinese characters, English characters------------>Unicode binary number----------->GBK binary number
Japanese characters, English characters------------>Unicode binary number----------->Shift JIS binary number
Korean characters, English characters------------>Unicode binary number----------->EUC-KR binary number
Universal characters-------------------->Unicode binary number----------->UTF-8 binary number
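
The chain in the diagram above can be checked from Python, where ord() gives the Unicode number of a character and str.encode() converts it into the binary form of a given table. This is only a small sketch; the three sample characters are arbitrary picks, and the codec names are the ones Python's standard library uses:

```python
# One sample character per language, paired with its legacy encoding table.
samples = {
    "好": "gbk",        # Chinese
    "あ": "shift_jis",  # Japanese
    "한": "euc_kr",     # Korean
}

for char, legacy_codec in samples.items():
    code_point = ord(char)                     # character -> Unicode number
    legacy_bytes = char.encode(legacy_codec)   # Unicode -> legacy binary form
    utf8_bytes = char.encode("utf-8")          # Unicode -> UTF-8 binary form
    print(char, hex(code_point), legacy_codec, legacy_bytes, utf8_bytes)

    # Decoding with the matching table gives the same character back.
    assert legacy_bytes.decode(legacy_codec) == char
    assert utf8_bytes.decode("utf-8") == char
```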

5. UTF-8
Features:
1 byte corresponds to one English character
3 bytes correspond to one common Chinese character
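
The byte counts are easy to verify by encoding a character and measuring the result. A quick sketch with two arbitrary sample characters; note that Python's gbk codec keeps English letters at one byte:

```python
english = "a"
chinese = "好"

# Number of bytes each character occupies under each encoding.
sizes = {
    ("a", "ascii"): len(english.encode("ascii")),
    ("a", "utf-8"): len(english.encode("utf-8")),
    ("a", "gbk"): len(english.encode("gbk")),
    ("好", "utf-8"): len(chinese.encode("utf-8")),
    ("好", "gbk"): len(chinese.encode("gbk")),
}

for (char, codec), size in sizes.items():
    print(f"{char!r} in {codec}: {size} byte(s)")
```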

3, Development history of character encoding

1. Fragmentation stage
Each country used its own encoding; the encodings were not unified and were mutually incompatible

English characters-----------------Memory: ASCII binary number--------------->Hard disk: ASCII binary number
Chinese and English characters--------------Memory: GBK binary number----------------->Hard disk: GBK binary number
Japanese and English characters--------------Memory: Shift JIS binary number------------>Hard disk: Shift JIS binary number
Korean and English characters--------------Memory: EUC-KR binary number-------------->Hard disk: EUC-KR binary number

2. Transition phase
Memory always uses Unicode,
while the encoding used when writing from memory to the hard disk can be chosen by the user

Chinese and English characters------------Memory: Unicode=========GBK============>Hard disk: GBK binary number
Japanese and English characters------------Memory: Unicode=========Shift JIS=======>Hard disk: Shift JIS binary number
Korean and English characters------------Memory: Unicode=========EUC-KR=========>Hard disk: EUC-KR binary number
Universal characters---------------Memory: Unicode=========UTF-8==========>Hard disk: UTF-8 binary number
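
In Python terms, the choice of on-disk encoding is the encoding argument of open(). The sketch below writes the same in-memory string twice and then reads back the raw bytes; the file names and the temporary directory are made up for the demo:

```python
import os
import tempfile

text = "好"  # held in memory as a Unicode str

on_disk = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    for codec in ("gbk", "utf-8"):
        path = os.path.join(tmp_dir, f"sample_{codec}.txt")
        # The encoding argument decides which binary form reaches the disk.
        with open(path, "w", encoding=codec) as f:
            f.write(text)
        # Binary mode shows the raw bytes that were actually stored.
        with open(path, "rb") as f:
            on_disk[codec] = f.read()

print(on_disk)  # {'gbk': b'\xba\xc3', 'utf-8': b'\xe5\xa5\xbd'}
```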

3. Unification (what has long been divided must unite)
At this stage, some old programs and data are still stored in GBK, ASCII and other encodings, so memory currently stays in Unicode for compatibility.
Once the transition phase becomes history, the default encoding in memory is expected to change to UTF-8

Now: Universal characters----------------Memory: Unicode=========UTF-8============>Hard disk: UTF-8 binary number
Later: Universal characters----------------Memory: UTF-8=====written and read directly without transcoding=====>Hard disk: UTF-8 binary number

4, Garbled text problems

1. Garbled at save time
The character encoding table used does not recognize the input characters
Text that is garbled when saved cannot be recovered; it will still be garbled when it is decoded on retrieval
Solution: save to the hard disk in UTF-8
2. Not garbled at save time
The character encoding table used can recognize the input characters
But the text is garbled on retrieval: the encoding table used for decoding is not the one used when saving
Solution: decode with the same encoding that was used when saving
Namely:

character-----encode-------->Unicode number------encode----->GBK number
character<-----decode--------Unicode number<------decode-----GBK number
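
Both failure cases can be reproduced in a few lines. This is only an illustrative sketch: the sample string is arbitrary, ASCII stands in for "a table that does not recognize the characters", and errors="replace" is used so the demo prints something instead of raising:

```python
original = "你好"

# Case 1: the table used at save time cannot represent the characters.
try:
    original.encode("ascii")
except UnicodeEncodeError as exc:
    print("cannot store with ASCII:", exc.reason)

# Forcing the save anyway replaces each character with "?"; the data is lost.
lossy = original.encode("ascii", errors="replace")
print(lossy)  # b'??'

# Case 2: stored correctly, but decoded with a different table.
stored = original.encode("utf-8")
garbled = stored.decode("gbk", errors="replace")
recovered = stored.decode("utf-8")  # same table as when saving
print(garbled)    # unreadable text
print(recovered)  # 你好
```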

3. Garbled text related to Python programs
① Ensure the first two steps of running a Python program do not produce garbled text:

Add a line at the top of the file:

# coding: <the encoding used when the file was saved>

② Ensure the third step of running a Python program does not produce garbled text:
Use Python 3 (str values in Python 3 are stored as Unicode)
If Python 2 is used, prefix string literals with u
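
To see how the declaration from ① is picked up, the standard library's tokenize.detect_encoding can be pointed at some source bytes. A minimal sketch; the file contents below are invented and held in memory instead of in a real file:

```python
import io
import tokenize

# Hypothetical file contents, saved as GBK with a matching declaration.
source_bytes = "# coding: gbk\nname = '好'\n".encode("gbk")

# detect_encoding reads the declaration from the first lines of the source.
declared, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
print(declared)  # gbk

# Without a declaration, Python 3 source is assumed to be UTF-8.
default, _ = tokenize.detect_encoding(io.BytesIO(b"x = 1\n").readline)
print(default)  # utf-8

# Decoding with the declared encoding recovers the text without garbling.
print(source_bytes.decode(declared))
```
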
4. bytes

The bytes type can be understood as the native format stored on the hard disk
x = '好'
res1 = x.encode('gbk')    # str -> bytes using GBK
res2 = x.encode('utf-8')  # str -> bytes using UTF-8
print(res1)  # b'\xba\xc3'
print(res2)  # b'\xe5\xa5\xbd'

data1 = res1.decode('gbk')    # bytes -> str, same table as when encoding
data2 = res2.decode('utf-8')
print(data1)  # 好
print(data2)  # 好

5, File operations

1. What is a file?

A file is a virtual unit that the operating system provides to users/applications for operating the hard disk
Read and write operations on the hard disk by users/applications are all calls to the operating system
After receiving a call, the operating system converts the request into concrete hard disk operations

2. Why use files?

Accessing the hard disk requires files
Applications operate on files to permanently save in-memory data to the hard disk

3. How do I use files?

f = open(file_path, mode)
f.write(data)
f.close()  # Close the handle when finished to release operating system resources
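
A common way to make sure the handle is always closed is a with block, which calls close() automatically when the block ends. A minimal sketch, assuming a throwaway a.txt inside a temporary directory:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "a.txt")  # hypothetical file

    # The with block closes the handle automatically, even if an error occurs.
    with open(file_path, mode="wt", encoding="utf-8") as f:
        written = f.write("hello 你好\n")  # returns the number of characters

    print(f.closed)  # True
    print(written)   # 9

    with open(file_path, mode="rt", encoding="utf-8") as f:
        content = f.read()

print(content)
```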

Relationship between file operations and the three-layer structure of a computer:

user/application----------file object/handle----------->remote control
operating system---------------the opened file a.txt --------->air conditioner
computer hardware--------------hard disk

(1) File path (the address used to locate the file)

Absolute path:
		On Windows, e.g. r'D:\py\data\a.txt'; the r prefix keeps backslashes in the path from being treated as escape characters
		On Linux, e.g. /a/b/c/d.txt
Relative path: resolved starting from the current working directory (usually the folder the program is run from), e.g. r'a.txt' (the file a.txt in the current folder)
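
The effect of the r prefix and the absolute/relative distinction can both be checked without touching the disk. A small sketch; D:\new.txt is a made-up path chosen because it contains \n, and pathlib's "pure" path classes are used so the checks behave the same on any operating system:

```python
from pathlib import PurePosixPath, PureWindowsPath

# Without the r prefix, "\n" inside the literal becomes a newline character.
plain = 'D:\new.txt'
raw = r'D:\new.txt'
print(len(plain), len(raw))        # 9 10
print('\n' in plain, '\n' in raw)  # True False

# Absolute paths start from a root; relative paths do not.
print(PureWindowsPath(r'D:\py\data\a.txt').is_absolute())  # True
print(PurePosixPath('/a/b/c/d.txt').is_absolute())         # True
print(PurePosixPath('a.txt').is_absolute())                # False
```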

(2) File handle

f = open(r'a.txt', mode='rt', encoding='utf-8')
data = f.read()
print(data)
f.close()  
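
The t in mode='rt' means text mode: the bytes on disk are decoded into str using the given encoding. For comparison, the sketch below reads the same file in binary mode, which returns the raw bytes; the file and its contents are invented for the demo:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "a.txt")  # hypothetical file
    with open(file_path, mode="wt", encoding="utf-8") as f:
        f.write("好")

    # Text mode: bytes from disk are decoded into str using `encoding`.
    with open(file_path, mode="rt", encoding="utf-8") as f:
        as_text = f.read()

    # Binary mode: raw bytes are returned; no encoding argument is allowed.
    with open(file_path, mode="rb") as f:
        as_bytes = f.read()

print(as_text)   # 好
print(as_bytes)  # b'\xe5\xa5\xbd'
```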

Tags: Python

Posted by MartiniMan on Mon, 30 May 2022 03:43:57 +0530