background
As a programmer, I have always had a dream of using a compiler in my heart, but I have not been able to put it into practice due to insufficient technology. Although JSON is not a language, it is very suitable for training with a compiler. The reason is that
- Fewer keywords, simple structure
- The grammar is simple, there is no judgment, loop and other advanced language grammar
- Text format, more convenient for testing
Although writing code and hard parsing can also be done, it is not scientific after all. For complex grammar, hard parsing cannot solve it at all. Learned from Ruan Yifeng's blog the-super-tiny-compiler This project, this project is a mini compiler. When converting lisp expressions into c language expressions, the code removes comments and less than 200 lines, which is very suitable for learning. This project gave me a lot of inspiration, and I started to write a json parser This series of blog posts will record a complete implementation process of the json parser. Of course, I am also a novice, not a compiler expert. I just hope to give you a little reference, and the master can make a detour.
illustrate
How to parse a json string into a java object, roughly divided into the following steps
- The tokenizer decomposes the json string into individual units, such as the following simple json string
{ "name": "asan", "age": 32 }
After word segmentation, it will be decomposed into the following format
[ {"type":"object","value":"{","valueType":"object"},{"type":"key","value":"name","valueType":"string"},{"type":"value","value":"asan","valueType":"string"},{"type":"key","value":"age","valueType":"string"},{"type":"value","value":32,"valueType":"number"},{"type":"object","value":"}","valueType":"object"} ]
During this period, useless characters will be filtered and decomposed into token s
- Parsing abstract syntax tree (AST): tokenizer just decomposes and tiles strings, and AST is responsible for turning each tiled token into a tree according to semantics, with a hierarchical structure. For example, the token parsing abstract syntax tree above is as follows
{ "items": [ { "name": "name", "type": "value", "value": "asan" }, { "name": "age", "type": "value", "value": 32 } ], "type": "object" }
- Object generation, generating objects based on the abstract syntax tree
Whether it is tokenizer or ast, the format is not fixed. The above is just a reference, but the functions are similar. Basically, the parser has to go through two steps of tokenizer and ast.
word segmentation (tokenizer)
example json
Unless otherwise specified, subsequent programs are tested based on the following json
{ "name": "asan", "age": 32, "mail": null, "married": true, "birthday": "1992-02-08", "salary": 1234.56, "deposit": -342.34, "description": "a \"hundsome\" man", "tags": [ "coder", "mid-age" ], "location": { "province": "Fujian Province", "city": "Quanzhou", "area": "Jinjiang City" }, "family": [ { "relation": "couple", "name": "Helen" }, { "relation": "daughter", "name": "XiaoMan" } ] }
This example basically contains all common elements of json, which can satisfy basic tests
- Basic data types: string, integer, float, date, null, boolean
- object (location)
- array (family)
- Basic type array (tags)
Participle
We first define a structure to save word segmentation results, which needs to contain at least the following two fields
- type: token type, including object (object), array (array), key (field name), value (field value), kvSymbol (colon between key-value:)
- value: token value
But one type may not be enough to describe json. For example, the value of json has string, integer, floating point, etc., but the type is all value. You may say why not define a type for each. If each defines a type , then it will be more troublesome to judge whether the token is a value type. It needs to be judged in turn whether it is a string, an integer, a floating point type, etc., so we add a field valueType to store the value type
- type: token type
- value: token value
- valueType: value type (string, bool, null, number)
We don't define the enumeration for the time being, first implement the parser and then refactor the code, and don't consider the rationality of the code for the time being.
Here is the first version of the parser
import java.util.ArrayList; import java.util.List; /** * @Description: * @author: jianfeng.zheng * @since: 2022/12/20 11:56 * @history: 1.2022/12/20 created by jianfeng.zheng */ public class JSONParser { //currentIndex saves the position of the current string scan, and the string is scanned character by character private int currentIndex = 0; /** * Tokenize the json string * * @param json json string * @return token the list */ public List<Token> tokenizer(String json) { // Save word segmentation results List<Token> tokens = new ArrayList<>(); while (currentIndex < json.length()) { char c = json.charAt(currentIndex); //Blanks can be skipped directly, if there are more blanks, just add a new judgment if (c == ' ' || c == '\r' || c == '\n' || c == ',') { //As long as the character has been processed, the current position must be moved to the next ++currentIndex; continue; } else if (c == '{' || c == '}') { //object tokens.add(new Token("object", c)); } else if (c == '[' || c == ']') { //array tokens.add(new Token("array", c)); } else if (c == '"') { //string StringBuffer value = new StringBuffer(); char cc = json.charAt(++currentIndex); // Here " is used as the sign of the end of the string // Of course, this is not rigorous because escaping is not considered, but this problem will be solved later, we will ignore it for the time being while (cc != '"') { value.append(cc); cc = json.charAt(++currentIndex); } tokens.add(new Token("string", value.toString())); } else if (c == ':') { // The separator between key-value tokens.add(new Token("kvSymbol", "kvSymbol", c)); } else if (c >= '0' && c <= '9') { //number StringBuffer value = new StringBuffer(); value.append(c); char cc = json.charAt(++currentIndex); //Here a floating point number with a decimal point is considered while (cc == '.' || (cc >= '0' && cc <= '9')) { value.append(cc); cc = json.charAt(++currentIndex); } //Numbers are temporarily represented by floating-point numbers tokens.add(new Token("value", "number", Float.parseFloat(value.toString()))); } ++currentIndex; } return tokens; } }
The code flow is as follows
- loop through json string
- Detect keywords, and identify keywords and save them as token s
- For the processing of strings, whether it is a key or a value, the string type is currently stored uniformly.
- The processing of digital types is currently stored in Float floating point numbers
test
We write a program program to test
public class Main { public static void main(String[] args) { String json = "{\"name\": \"asan\", \"age\": 32}"; JSONParser parser = new JSONParser(); List<Token> tokens = parser.tokenizer(json); System.out.println(String.format("|%-12s|%-12s|%-15s|", "type", "valueType", "value")); System.out.println("-------------------------------------------"); for (Token t : tokens) { System.out.println(String.format("|%-12s|%-12s|%-15s|", t.getType(), t.getValueType(), t.getValue())); } System.out.println("-------------------------------------------"); } }
Let's take a relatively simple json{"name": "asan", "age": 32} to test first, and the test results are as follows
|type |valueType |value | ------------------------------------------- |object |object |{ | |string |string |name | |kvSymbol |kvSymbol |: | |string |string |asan | |string |string |age | |kvSymbol |kvSymbol |: | |value |number |32.0 | -------------------------------------------
The result so far is in line with our expectations.
optimization
other basic types
At present, the value type of the program only processes strings and numbers, and the bool and null types are not processed. Since the program scans the string character by character, but to judge bool and null, it must be scanned later.
- judge null
if ((c == 'n') && json.startsWith("null", currentIndex)) { tokens.add(new Token("value", "null", null)); //If you read a null value, you need to move the current pointer forward by 3 characters (null occupies 4 characters, and you need to move 3 characters after removing the 1 string that has been read) currentIndex += 3; }
- Judgment bool value
The judgment of the bool value is to judge the two strings of true and false, which is similar to judging the null value
if ((c == 't') && json.startsWith("true", currentIndex)) { tokens.add(new Token("value", "bool", true)); currentIndex += 3; } if ((c == 'f') && json.startsWith("false", currentIndex)) { tokens.add(new Token("value", "bool", false)); //false is 5 characters so need to shift 4 bits currentIndex += 4; }
We modify the test json string to {"name": "asan", "age": 32,"mail": null,"married": true} and then test, the results are as follows
|type |valueType |value | ------------------------------------------- |object |object |{ | |string |string |name | |kvSymbol |kvSymbol |: | |string |string |asan | |string |string |age | |kvSymbol |kvSymbol |: | |value |number |32.0 | |string |string |mail | |kvSymbol |kvSymbol |: | |value |null |null | |string |string |married | |kvSymbol |kvSymbol |: | |value |bool |true | |object |object |} | -------------------------------------------
The result is as expected.
string handling
The current processing method of the string is to detect "as a string until the next " appears, but this processing method is not rigorous, it is possible that the string itself contains ", so it is necessary to escape the character Processing, we modify the processing function of the string
if (c == '"') { //string StringBuffer value = new StringBuffer(); char cc = json.charAt(++currentIndex); // Here " is used as the sign of the end of the string while (cc != '"') { if (cc == '\\') { cc = json.charAt(++currentIndex); } value.append(cc); cc = json.charAt(++currentIndex); } tokens.add(new Token("string", value.toString())); }
We changed the test string to {"name": "asan", "age": 32,"description": "a \"hudsom\" man","married": true} and tested again, the results are as follows
|type |valueType |value | ------------------------------------------- |object |object |{ | |string |string |name | |kvSymbol |kvSymbol |: | |string |string |asan | |string |string |age | |kvSymbol |kvSymbol |: | |value |number |32.0 | |string |string |description | |kvSymbol |kvSymbol |: | |string |string |a "hudsom" man | |string |string |married | |kvSymbol |kvSymbol |: | |value |bool |true | |object |object |} | -------------------------------------------
Successfully recognized the escaped string
digital processing
Digital processing currently has some problems
- Negative case not handled
- Does not deal with science and technology law
- There is no distinction between floating point and integers, all of which are floating point numbers
The program is modified as follows
if ((c >= '0' && c <= '9') || c == '-') { // number StringBuffer value = new StringBuffer(); value.append(c); // Judging whether it is a floating point number boolean isFloat = false; //If json is an integer such as: 1, then an error will be reported if there is no judgment here if (currentIndex + 1 < json.length()) { char cc = json.charAt(++currentIndex); // Judgment includes floating point type, integer type, science and technology method while (cc == '.' || (cc >= '0' && cc <= '9') || cc == 'e' || cc == 'E' || cc == '+' || cc == '-') { value.append(cc); if (cc == '.') { isFloat = true; } cc = json.charAt(++currentIndex); } } if (isFloat) { //floating point number tokens.add(new Token("value", "float", Float.parseFloat(value.toString()))); } else { //integer tokens.add(new Token("value", "long", Long.parseLong(value.toString()))); } }
We use the string {"age":32,"deposit": -342.34} to test, and the test results are as follows
|type |valueType |value | ------------------------------------------- |object |object |{ | |string |string |age | |kvSymbol |kvSymbol |: | |value |long |32 | |string |string |deposit | |kvSymbol |kvSymbol |: | |value |float |-342.34 | -------------------------------------------
full test
We test with the complete string and the results are as follows
|type |valueType |value | ------------------------------------------- |object |object |{ | |string |string |name | |kvSymbol |kvSymbol |: | |string |string |asan | |string |string |age | |kvSymbol |kvSymbol |: | |value |long |32 | |string |string |married | |kvSymbol |kvSymbol |: | |value |bool |true | |string |string |birthday | |kvSymbol |kvSymbol |: | |string |string |1992-02-08 | |string |string |salary | |kvSymbol |kvSymbol |: | |value |float |1234.56 | |string |string |description | |kvSymbol |kvSymbol |: | |string |string |a "hudsom" man | |string |string |tags | |kvSymbol |kvSymbol |: | |array |array |[ | |string |string |coder | |string |string |mid-age | |array |array |] | |string |string |location | |kvSymbol |kvSymbol |: | |object |object |{ | |string |string |province | |kvSymbol |kvSymbol |: | |string |string |Fujian Province | |string |string |city | |kvSymbol |kvSymbol |: | |string |string |Quanzhou | |string |string |area | |kvSymbol |kvSymbol |: | |string |string |Jinjiang City | |object |object |} | |string |string |family | |kvSymbol |kvSymbol |: | |array |array |[ | |object |object |{ | |string |string |relation | |kvSymbol |kvSymbol |: | |string |string |couple | |string |string |name | |kvSymbol |kvSymbol |: | |string |string |Helen | |object |object |} | |object |object |{ | |string |string |relation | |kvSymbol |kvSymbol |: | |string |string |daughter | |string |string |name | |kvSymbol |kvSymbol |: | |string |string |XiaoMan | |object |object |} | |array |array |] | |object |object |} | -------------------------------------------
full code
public class JSONParser { //currentIndex saves the position of the current string scan, and the string is scanned character by character private int currentIndex = 0; /** * Tokenize the json string * * @param json json string * @return token the list */ public List<Token> tokenizer(String json) { // Save word segmentation results List<Token> tokens = new ArrayList<>(); while (currentIndex < json.length()) { char c = json.charAt(currentIndex); //Blanks can be skipped directly, if there are more blanks, just add a new judgment if (c == ' ' || c == '\r' || c == '\n' || c == ',') { //As long as the character has been processed, the current position must be moved to the next ++currentIndex; continue; } else if (c == '{' || c == '}') { //object tokens.add(new Token("object", c)); } else if (c == '[' || c == ']') { //array tokens.add(new Token("array", c)); } else if (c == '"') { //string StringBuffer value = new StringBuffer(); char cc = json.charAt(++currentIndex); // Here " is used as the sign of the end of the string while (cc != '"') { if (cc == '\\') { cc = json.charAt(++currentIndex); } value.append(cc); cc = json.charAt(++currentIndex); } tokens.add(new Token("string", value.toString())); } else if (c == ':') { // The separator between key-value tokens.add(new Token("kvSymbol", "kvSymbol", c)); } else if ((c >= '0' && c <= '9') || c == '-') { // number StringBuffer value = new StringBuffer(); value.append(c); // Judging whether it is a floating point number boolean isFloat = false; //If json is an integer such as: 1, then an error will be reported if there is no judgment here if (currentIndex + 1 < json.length()) { char cc = json.charAt(++currentIndex); // Judgment includes floating point type, integer type, science and technology method while (cc == '.' || (cc >= '0' && cc <= '9') || cc == 'e' || cc == 'E' || cc == '+' || cc == '-') { value.append(cc); if (cc == '.') { isFloat = true; } cc = json.charAt(++currentIndex); } } if (isFloat) { //floating point number tokens.add(new Token("value", "float", Float.parseFloat(value.toString()))); } else { //integer tokens.add(new Token("value", "long", Long.parseLong(value.toString()))); } } else if ((c == 'n') && json.startsWith("null", currentIndex)) { tokens.add(new Token("value", "null", null)); currentIndex += 3; } else if ((c == 't') && json.startsWith("true", currentIndex)) { tokens.add(new Token("value", "bool", true)); currentIndex += 3; } else if ((c == 'f') && json.startsWith("false", currentIndex)) { tokens.add(new Token("value", "bool", false)); //false is 5 characters so need to shift 4 bits currentIndex += 4; } ++currentIndex; } //reset current location currentIndex = 0; return tokens; } }
the code
Please refer to the project for the complete code https://github.com/wls1036/tiny-json-parser 0x01 branch