Python learning notes: character encoding

1: Background knowledge

1. How running a program relates to the three core hardware components

  • A program is first stored on the hard disk

  • To run, the program's code is loaded from the hard disk into memory

  • The CPU then fetches instructions from memory and executes them

2. Data generated while a program runs is placed in memory first

3. Three stages of running a Python program

python3 D:\a.py
  • 1. The Python interpreter starts first
  • 2. The interpreter reads the contents of a.py from the hard disk into memory as plain text. At this stage the contents have no syntactic meaning
  • 3. The interpreter interprets and executes the code it just loaded into memory, and only now begins to recognize Python syntax
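
The three stages can be imitated by hand in a rough sketch. This is not how the interpreter is implemented internally; the temporary file is a hypothetical stand-in for D:\a.py, and its one-line content is made up for illustration.

```python
import os
import tempfile

# Hypothetical stand-in for D:\a.py, created in a temporary directory.
source_text = "result = 1 + 2\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "a.py")
    with open(path, "w", encoding="utf-8") as f:
        f.write(source_text)

    # Stage 2: read the file from disk into memory as raw bytes.
    # At this point it is just data with no syntactic meaning.
    with open(path, "rb") as f:
        raw = f.read()

# Still stage 2: turn the bytes into text using the file's encoding.
text = raw.decode("utf-8")

# Stage 3: only now is the text parsed and executed as Python code.
namespace = {}
exec(compile(text, "a.py", "exec"), namespace)
print(namespace["result"])
```

If the decoding step used the wrong encoding, stage 2 is where garbled text would first appear, before any Python syntax is recognized.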

2: What is character encoding

Characters ------------ (standard) -------------- numbers

Character encoding table: stores the mapping between characters and numbers

1. ASCII: can only represent English characters

    Features: each English character is stored in 8 bits (the code itself uses only 7 of them)
            8 bits => 1 byte

2. GBK: can represent Chinese and English characters

    Features: a Chinese character takes 16 bits (2 bytes); English characters keep their 1-byte ASCII codes

3. Shift-JIS: can represent Japanese and English characters

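4. Unicode: can represent characters from all languages

  • Feature: in its common 2-byte form (UTF-16), 2 bytes correspond to one character; characters outside the Basic Multilingual Plane need 4 bytes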

5. UTF-8: a variable-length encoding of Unicode (1 to 4 bytes per character)

  • 1 byte corresponds to one English character
  • 3 bytes correspond to one common Chinese character
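
The byte counts listed above can be checked with `str.encode`. This is a small sketch; the sample characters are arbitrary choices, and `utf-16-le` is used as a 2-byte Unicode form without a byte order mark.

```python
# (character, codec name) pairs; the characters are arbitrary samples.
samples = [
    ("a", "ascii"),
    ("a", "gbk"),
    ("中", "gbk"),
    ("あ", "shift_jis"),
    ("中", "utf-16-le"),
    ("a", "utf-8"),
    ("中", "utf-8"),
]

# Encode each character and record how many bytes it occupies.
sizes = {(ch, enc): len(ch.encode(enc)) for ch, enc in samples}

for (ch, enc), n in sizes.items():
    print(f"{ch!r} in {enc}: {n} byte(s)")
```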

3: Unicode and UTF-8

  • Unicode features: 2 bytes correspond to one character (in its common 2-byte form)
  • UTF-8 features: 1 byte corresponds to an English character and 3 bytes to a common Chinese character

Unicode has a mapping to each of the other encodings, so it acts as the intermediate format for conversion:

Chinese characters, English characters------------>Unicode binary number----------->GBK binary number
Japanese characters, English characters------------>Unicode binary number----------->Shift-JIS binary number
Korean characters, English characters------------>Unicode binary number----------->EUC-KR binary number
Universal characters------------>Unicode binary number----------->UTF-8 binary number

1. Fragmentation stage (every region uses its own encoding, in memory and on disk):

English characters--------------Memory: ASCII binary number--------------->Hard disk: ASCII binary number
Chinese and English characters--------------Memory: GBK binary number--------------->Hard disk: GBK binary number
Japanese and English characters--------------Memory: Shift-JIS binary number--------------->Hard disk: Shift-JIS binary number
Korean and English characters--------------Memory: EUC-KR binary number--------------->Hard disk: EUC-KR binary number

2. Transition stage:

Chinese and English characters------------Memory: Unicode=========GBK============>Hard disk: GBK binary number
Japanese and English characters------------Memory: Unicode=========Shift-JIS========>Hard disk: Shift-JIS binary number
Korean and English characters------------Memory: Unicode=========EUC-KR=========>Hard disk: EUC-KR binary number
Universal characters----------------Memory: Unicode=========UTF-8==========>Hard disk: UTF-8 binary number
  • Memory always uses Unicode
    What we can change is the encoding used when writing from memory to the hard disk
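
In Python 3 this choice is the `encoding` argument of `open`. The sketch below writes the same in-memory string with two encodings and reads the raw bytes back; the file names are hypothetical and live in a temporary directory.

```python
import os
import tempfile

text = "中文"  # in memory this is Unicode, whatever ends up on disk

on_disk = {}
with tempfile.TemporaryDirectory() as tmp:
    for enc in ("gbk", "utf-8"):
        path = os.path.join(tmp, f"demo_{enc}.txt")
        # The encoding argument decides the bytes written to disk.
        with open(path, "w", encoding=enc) as f:
            f.write(text)
        # Read back in binary mode to see the raw bytes.
        with open(path, "rb") as f:
            on_disk[enc] = f.read()

for enc, data in on_disk.items():
    print(enc, len(data), data)
```

The string in memory is identical in both cases; only the bytes on disk differ.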

3. After long division, unification:

Currently:

Universal characters----------------Memory: Unicode=========UTF-8==========>Hard disk: UTF-8 binary number

Possibly one day:

Universal characters----------------Memory: UTF-8==========================>Hard disk: UTF-8 binary number

Garbled text (mojibake) problems

Garbled when saving

  • Garbled on save: the character encoding table in use cannot represent the characters being entered

  • Text garbled at save time cannot be repaired: it will still be garbled when read back

Solution: save to the hard disk in UTF-8
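
A minimal sketch of the save-time failure, assuming ASCII as the encoding that cannot represent the input. Python refuses the conversion by default; forcing it with `errors="replace"` stores question marks, and the original characters are gone for good.

```python
text = "中文"

# ASCII has no codes for Chinese characters, so strict encoding fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Forcing the save replaces each unknown character with "?".
lossy = text.encode("ascii", errors="replace")
recovered = lossy.decode("ascii")

# UTF-8 can represent the characters, so the round trip is lossless.
utf8_round_trip = text.encode("utf-8").decode("utf-8")

print(ascii_ok, lossy, recovered, utf8_round_trip)
```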

Garbled when reading

Nothing is garbled when saving, because the encoding table in use can represent the characters entered
But the text is garbled when read back, because the encoding table used for reading differs from the one used for saving

  • Solution: use the same encoding for saving and reading
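
The read-time case can be sketched as follows: the bytes are saved correctly as UTF-8, then decoded with GBK. `errors="replace"` is used here only so that byte pairs with no GBK meaning do not raise an exception; the data itself is intact and decodes correctly with the matching encoding.

```python
text = "中文"

# Saved correctly: UTF-8 can represent these characters.
saved = text.encode("utf-8")

# Read with a different encoding table: the result is garbled.
wrong = saved.decode("gbk", errors="replace")

# Read with the same encoding used for saving: the text is restored.
right = saved.decode("utf-8")

print(repr(wrong), repr(right))
```

Unlike the save-time case, nothing is lost here: the bytes on disk are still valid, so switching to the right encoding fixes the problem.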

Garbled text when running Python programs:

To make sure the first two stages of running a Python program do not produce garbled text,
add a line at the top of the Python file:

  • # coding: <the encoding used when saving the file>

Without this line, Python 3 assumes UTF-8 and Python 2 assumes ASCII.
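
The effect of the header line can be observed without creating a file, as a rough sketch: `tokenize.detect_encoding` reports the declared encoding, and `compile` honors the declaration when it is given source as bytes. The variable `msg` and the name demo.py are made up for illustration.

```python
import io
import tokenize

# Source saved as GBK, with a header declaring that encoding.
with_header = '# coding: gbk\nmsg = "中文"\n'.encode("gbk")

# The standard library reads the declaration from the first lines.
declared, _ = tokenize.detect_encoding(io.BytesIO(with_header).readline)

# compile() decodes the bytes using the declared encoding.
namespace = {}
exec(compile(with_header, "demo.py", "exec"), namespace)

# Same GBK bytes without the header: Python 3 assumes UTF-8 and fails.
without_header = 'msg = "中文"\n'.encode("gbk")
try:
    compile(without_header, "demo.py", "exec")
    failed = False
except (SyntaxError, UnicodeDecodeError):
    failed = True

print(declared, namespace["msg"], failed)
```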

Differences between interpreter versions

In Python 3, string (str) values are stored in memory as Unicode

In Python 2, string (str) values are stored in memory as bytes in the encoding specified by the file header

In Python 2, prefixing a string literal with u forces it to be stored as Unicode, which is recommended
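
A Python 3 sketch of the difference: `str` holds Unicode characters, while `bytes` holds the encoded form, which is roughly what a plain Python 2 string contained. GBK is assumed here as the example file encoding. The u prefix is still accepted in Python 3 but changes nothing.

```python
s = "中文"            # Python 3 str: Unicode characters
b = s.encode("gbk")   # bytes: similar to a Python 2 str under a gbk header

print(type(s).__name__, len(s))   # length counted in characters
print(type(b).__name__, len(b))   # length counted in bytes

# In Python 3 the u prefix is allowed but has no effect.
same = (u"中文" == s)
print(same)
```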

Tags: Python

Posted by Grizzzzzzzzzz on Sun, 29 May 2022 23:09:55 +0530