Handling garbled characters when converting UTF-8 text to Shift-JIS in C#

On a recent project, the client required that exported CSV files be Shift-JIS encoded. Our database stores text as Unicode, so the exported files contained many "?" characters. The reason is explained below, starting with the code table.

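Shift_JIS code table (each row is the high nibble of the byte, each column the low nibble):

      0    1    2    3    4    5    6    7    8    9    A    B    C    D    E    F
00    NUL  SOH  STX  ETX  EOT  ENQ  ACK  BEL  BS   HT   LF   VT   FF   CR   SO   SI
10    DLE  DC1  DC2  DC3  DC4  NAK  SYN  ETB  CAN  EM   SUB  ESC  FS   GS   RS   US
20    SP   !    "    #    $    %    &    '    (    )    *    +    ,    -    .    /
30    0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ?
40    @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O
50    P    Q    R    S    T    U    V    W    X    Y    Z    [    ¥    ]    ^    _
60    `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o
70    p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~    DEL

80-90  first byte of a two-byte character (0x81-0x9F)
A0-D0  half-width punctuation and katakana (0xA1-0xDF), for example ソ in row B0
E0     first byte of a two-byte character (0xE0-0xEF)
F0     user-defined area (0xF0-0xFC); 0xFD-0xFF are not used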

Shift_JIS is a code table commonly used in Japanese computer systems. It can accommodate full-width and half-width Latin letters, Hiragana, Katakana, symbols, and Japanese Kanji.

It is named Shift_JIS because the full-width characters are shifted around the half-width kana that already occupied 0xA1-0xDF.

Microsoft and IBM use an extended version of this code table in their Japanese computer systems, where it is called CP932.

Byte structure

The following characters are represented by one byte in Shift_JIS.

ASCII characters (0x20-0x7E), but "\" (0x5C) is replaced by "¥"

ASCII control characters (0x00-0x1F, 0x7F)

Half-width punctuation and katakana (0xA1-0xDF) in the JIS X 0201 standard

In some operating systems, 0xA0 is used for the non-breaking space.

The following characters are represented by two bytes in Shift_JIS.

All characters of JIS X 0208 character set

"The first byte" uses 0x81-0x9F, 0xE0-0xEF (total 47)

"Second byte" uses 0x40-0x7E, 0x80-0xFC (total 188)

User-defined area

"The first byte" uses 0xF0-0xFC (total 13)

"Second byte" uses 0x40-0x7E, 0x80-0xFC (total 188)

In the Shift_JIS code table, 0xFD, 0xFE and 0xFF are not used.

In the Japanese computer systems of Microsoft and IBM, 388 symbols and kanji that are not included in JIS X 0208 are added to the two-byte areas whose first byte is 0xFA, 0xFB, or 0xFC.
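The byte ranges listed above can be expressed as a small classifier. This is only a sketch and not part of the original code: the names ShiftJisByteKind and ShiftJisLayout are made up for illustration, and the ranges are copied from the list above.

```csharp
using System;

// Hypothetical helper; the type and member names are illustrative only.
public enum ShiftJisByteKind
{
    SingleByteAscii,      // 0x00-0x7F: control characters and ASCII
    SingleByteKatakana,   // 0xA1-0xDF: JIS X 0201 half-width punctuation and katakana
    LeadByteJisX0208,     // 0x81-0x9F, 0xE0-0xEF: first byte of a two-byte character
    LeadByteUserDefined,  // 0xF0-0xFC: first byte in the user-defined area
    Unused                // everything else (0x80, 0xA0, 0xFD-0xFF)
}

public static class ShiftJisLayout
{
    // Classifies a byte that appears at the start of a character.
    public static ShiftJisByteKind ClassifyFirstByte(byte b)
    {
        if (b <= 0x7F) return ShiftJisByteKind.SingleByteAscii;
        if (b >= 0xA1 && b <= 0xDF) return ShiftJisByteKind.SingleByteKatakana;
        if ((b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xEF))
            return ShiftJisByteKind.LeadByteJisX0208;
        if (b >= 0xF0 && b <= 0xFC) return ShiftJisByteKind.LeadByteUserDefined;
        // 0xA0 is system-dependent (see above) and is treated as unused here.
        return ShiftJisByteKind.Unused;
    }

    // A second byte is valid in 0x40-0x7E or 0x80-0xFC.
    public static bool IsValidSecondByte(byte b)
    {
        return (b >= 0x40 && b <= 0x7E) || (b >= 0x80 && b <= 0xFC);
    }
}
```

Counting the bytes that these two methods accept gives 47 JIS X 0208 first bytes and 188 second bytes, which matches the totals in the list.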

 

Unicode contains many characters that shift-jis does not define. When such a character is converted, there is no corresponding shift-jis code for it, so it is written as byte 63, which is displayed as "?". We therefore need to replace those characters in the original string with characters that shift-jis can display.
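One way to see which characters are affected is to ask the encoder to throw instead of silently writing "?". The sketch below is not from the original project: ShiftJisProbe and FindUnencodable are hypothetical names, and the RegisterProvider call assumes .NET Core or .NET 5+, where code page 932 has to be registered first (on .NET Framework that line can be dropped).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper; the names are illustrative only.
public static class ShiftJisProbe
{
    // Returns every character of the input that shift-jis cannot encode.
    public static List<string> FindUnencodable(string content)
    {
        // Assumption: running on .NET Core / .NET 5+, where the code page
        // encodings must be registered before "shift_jis" can be resolved.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Exception fallbacks make the encoder throw instead of emitting "?".
        Encoding strictShiftJis = Encoding.GetEncoding(
            "shift_jis",
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());

        var result = new List<string>();
        for (int i = 0; i < content.Length; i++)
        {
            // Keep surrogate pairs together so they are tested as one character.
            int length = char.IsSurrogatePair(content, i) ? 2 : 1;
            string piece = content.Substring(i, length);
            try
            {
                strictShiftJis.GetBytes(piece);
            }
            catch (EncoderFallbackException)
            {
                result.Add(piece);
            }
            i += length - 1;
        }
        return result;
    }
}
```

Calling FindUnencodable on the text of an export before writing the CSV lists the characters that would need an entry in a replacement table.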

 

Our design is as follows:

1. Use a conversion table that stores the codes to be replaced and the characters that replace them.

2. Process each entry in one of two ways.

a: Replace by encoding. Some characters are present in the string but not visible, such as the null character and 0xa0, and have no usable counterpart in shift-jis. Another example is the UTF-8 byte order mark, new byte[] {0xef, 0xbb, 0xbf}, which displays as an empty string. These are stored in the table as byte codes.

b: Replace the text directly before conversion. This suits characters that can be stored in the table as ordinary text. For example, if ~ is to be replaced by ~, call Replace directly.

 

This raises a problem: the table can only store a string such as "0xef, 0xbb, 0xbf", so how do we turn it into new byte[] {0xef, 0xbb, 0xbf}?

We handle it as follows:

        // Parses a comma-separated list of hex byte values, such as
        // "0xef, 0xbb, 0xbf", into the corresponding byte array.
        private byte[] ConvertStringToByte(string originalStr)
        {
            if (string.IsNullOrEmpty(originalStr)) return null;
            string[] originalSplit = originalStr.Split(',');
            byte[] resultByte = new byte[originalSplit.Length];
            for (int i = 0; i < originalSplit.Length; i++)
            {
                // With base 16, Convert.ToByte accepts an optional "0x" prefix
                // and throws if the value does not fit in one byte.
                resultByte[i] = Convert.ToByte(originalSplit[i].Trim(), 16);
            }
            return resultByte;
        }

 

 

 

This converts the code stored in the table into the corresponding byte array. The replacement itself then follows the processing logic described earlier.

The code is as follows:

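        // Applies every rule in the conversion table to the content.
        public string ReplaceString(string content)
        {
            List<MessyCodeHandleBE> messyCodeHandleBEList = RetrieveAll();

            foreach (MessyCodeHandleBE entity in messyCodeHandleBEList)
            {
                if (entity.ConvertType == MessyCodeHandleConvertTypeChoices.ENCODEREPLACE)
                {
                    // OriginalCode holds UTF-8 byte values such as "0xef, 0xbb, 0xbf".
                    byte[] originalBytes = ConvertStringToByte(entity.OriginalCode);
                    // string.Replace rejects an empty search string, so skip empty rules.
                    if (originalBytes == null || originalBytes.Length == 0) continue;
                    content = content.Replace(Encoding.UTF8.GetString(originalBytes), entity.ReplaceCode);
                }
                else
                {
                    // OriginalCode holds the literal text to replace.
                    content = content.Replace(entity.OriginalCode, entity.ReplaceCode);
                }
            }
            return content;
        }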
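ReplaceString depends on MessyCodeHandleBE, MessyCodeHandleConvertTypeChoices and RetrieveAll, none of which are shown in this article. The following is a guess at a minimal shape for them, for readers who want to try the code. Only the ENCODEREPLACE member and the three property names come from the snippet above; the DIRECTREPLACE name, the MessyCodeHandleTable class, the in-memory list and the sample rows are all assumptions.

```csharp
using System;
using System.Collections.Generic;

// Assumed shape; only ENCODEREPLACE appears in the article's code.
public enum MessyCodeHandleConvertTypeChoices
{
    ENCODEREPLACE,  // OriginalCode is a list of UTF-8 byte values
    DIRECTREPLACE   // hypothetical name: OriginalCode is literal text
}

// One row of the conversion table.
public class MessyCodeHandleBE
{
    public string OriginalCode { get; set; }
    public string ReplaceCode { get; set; }
    public MessyCodeHandleConvertTypeChoices ConvertType { get; set; }
}

public static class MessyCodeHandleTable
{
    // Stand-in for the real table lookup; the rows are illustrative samples.
    public static List<MessyCodeHandleBE> RetrieveAll()
    {
        return new List<MessyCodeHandleBE>
        {
            // UTF-8 byte order mark: remove it.
            new MessyCodeHandleBE
            {
                OriginalCode = "0xef, 0xbb, 0xbf",
                ReplaceCode = "",
                ConvertType = MessyCodeHandleConvertTypeChoices.ENCODEREPLACE
            },
            // Non-breaking space (UTF-8 bytes 0xc2, 0xa0): use a plain space.
            new MessyCodeHandleBE
            {
                OriginalCode = "0xc2, 0xa0",
                ReplaceCode = " ",
                ConvertType = MessyCodeHandleConvertTypeChoices.ENCODEREPLACE
            },
            // Copyright sign, stored as ordinary text.
            new MessyCodeHandleBE
            {
                OriginalCode = "\u00a9",
                ReplaceCode = "(c)",
                ConvertType = MessyCodeHandleConvertTypeChoices.DIRECTREPLACE
            }
        };
    }
}
```

Keeping the rules in a table means a newly discovered problem character can be handled by adding a row rather than changing code.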

 

To find out how a particular special character is encoded, run it through the following code:

        // Round-trips the content through Shift-JIS. Characters that Shift-JIS
        // cannot represent come back as "?" (byte 63).
        private string ConvertToShiftJis(string content)
        {
            Encoding utf8 = Encoding.UTF8;
            Encoding shiftJis = Encoding.GetEncoding("shift_jis");
            byte[] utf8Bytes = utf8.GetBytes(content);
            byte[] shiftJisBytes = Encoding.Convert(utf8, shiftJis, utf8Bytes);
            return shiftJis.GetString(shiftJisBytes);
        }

 

View its byte encoding during debugging, as shown in the figure:

 

The hexadecimal of 239 is 0xef, the hexadecimal of 187 is 0xbb, and the hexadecimal of 191 is 0xbf.
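Instead of reading the decimal values in the debugger and converting them by hand, the table code can be generated directly. This is a small sketch; TableCodeFormatter and ToTableCode are hypothetical names, and the output format simply mirrors the "0xef, 0xbb, 0xbf" strings used in the conversion table.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper; the names are illustrative only.
public static class TableCodeFormatter
{
    // Formats the UTF-8 bytes of a string as "0xef, 0xbb, 0xbf".
    public static string ToTableCode(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return string.Join(", ", utf8Bytes.Select(b => "0x" + b.ToString("x2")));
    }
}
```

For example, TableCodeFormatter.ToTableCode("\uFEFF") returns "0xef, 0xbb, 0xbf", the three bytes 239, 187 and 191 discussed above.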

 

Summary

The approach is to find the byte[] of each character that shift-jis encodes as 63 ("?"), and then swap that character for a suitable substitute with Replace before converting. If you discover anything new, please leave a comment so we can exchange ideas.

 

 

author: spring yang

source: http://www.cnblogs.com/springyangwc/

The copyright of this article belongs to the author and cnblogs (Blog Garden). Reprinting is welcome, but unless the author agrees otherwise, this statement must be retained and a link to the original text must be given in a prominent place on the article page; otherwise the right to pursue legal liability is reserved.

Reprinted in: https://www.cnblogs.com/springyangwc/archive/2011/07/05/2098053.html


Posted by Dimensional on Sun, 20 Nov 2022 22:52:59 +0530