Write your own json parser 0x01-word segmentation

background

As a programmer, I have always had a dream of using a compiler in my heart, but I have not been able to put it into practice due to insufficient technology. Although JSON is not a language, it is very suitable for training with a compiler. The reason is that

  • Fewer keywords, simple structure
  • The grammar is simple, there is no judgment, loop and other advanced language grammar
  • Text format, more convenient for testing

Although writing code and hard parsing can also be done, it is not scientific after all. For complex grammar, hard parsing cannot solve it at all. Learned from Ruan Yifeng's blog the-super-tiny-compiler This project, this project is a mini compiler. When converting lisp expressions into c language expressions, the code removes comments and less than 200 lines, which is very suitable for learning. This project gave me a lot of inspiration, and I started to write a json parser This series of blog posts will record a complete implementation process of the json parser. Of course, I am also a novice, not a compiler expert. I just hope to give you a little reference, and the master can make a detour.

illustrate

How to parse a json string into a java object, roughly divided into the following steps

  • The tokenizer decomposes the json string into individual units, such as the following simple json string
{
  "name": "asan",
  "age": 32
}

After word segmentation, it will be decomposed into the following format

[
{"type":"object","value":"{","valueType":"object"},{"type":"key","value":"name","valueType":"string"},{"type":"value","value":"asan","valueType":"string"},{"type":"key","value":"age","valueType":"string"},{"type":"value","value":32,"valueType":"number"},{"type":"object","value":"}","valueType":"object"}
]

During this period, useless characters will be filtered and decomposed into token s

  • Parsing abstract syntax tree (AST): tokenizer just decomposes and tiles strings, and AST is responsible for turning each tiled token into a tree according to semantics, with a hierarchical structure. For example, the token parsing abstract syntax tree above is as follows
{
    "items": [
        {
            "name": "name",
            "type": "value",
            "value": "asan"
        },
        {
            "name": "age",
            "type": "value",
            "value": 32
        }
    ],
    "type": "object"
}
  • Object generation, generating objects based on the abstract syntax tree

Whether it is tokenizer or ast, the format is not fixed. The above is just a reference, but the functions are similar. Basically, the parser has to go through two steps of tokenizer and ast.

word segmentation (tokenizer)

example json

Unless otherwise specified, subsequent programs are tested based on the following json

{
  "name": "asan",
  "age": 32,
  "mail": null,
  "married": true,
  "birthday": "1992-02-08",
  "salary": 1234.56,
  "deposit": -342.34,
  "description": "a \"hundsome\" man",
  "tags": [
    "coder",
    "mid-age"
  ],
  "location": {
    "province": "Fujian Province",
    "city": "Quanzhou",
    "area": "Jinjiang City"
  },
  "family": [
    {
      "relation": "couple",
      "name": "Helen"
    },
    {
      "relation": "daughter",
      "name": "XiaoMan"
    }
  ]
}

This example basically contains all common elements of json, which can satisfy basic tests

  • Basic data types: string, integer, float, date, null, boolean
  • object (location)
  • array (family)
  • Basic type array (tags)

Participle

We first define a structure to save word segmentation results, which needs to contain at least the following two fields

  • type: token type, including object (object), array (array), key (field name), value (field value), kvSymbol (colon between key-value:)
  • value: token value

But one type may not be enough to describe json. For example, the value of json has string, integer, floating point, etc., but the type is all value. You may say why not define a type for each. If each defines a type , then it will be more troublesome to judge whether the token is a value type. It needs to be judged in turn whether it is a string, an integer, a floating point type, etc., so we add a field valueType to store the value type

  • type: token type
  • value: token value
  • valueType: value type (string, bool, null, number)

We don't define the enumeration for the time being, first implement the parser and then refactor the code, and don't consider the rationality of the code for the time being.

Here is the first version of the parser

import java.util.ArrayList;
import java.util.List;

/**
 * @Description:
 * @author: jianfeng.zheng
 * @since: 2022/12/20 11:56
 * @history: 1.2022/12/20 created by jianfeng.zheng
 */
public class JSONParser {

    //currentIndex saves the position of the current string scan, and the string is scanned character by character
    private int currentIndex = 0;

    /**
     * Tokenize the json string
     *
     * @param json json string
     * @return token the list
     */
    public List<Token> tokenizer(String json) {
        // Save word segmentation results
        List<Token> tokens = new ArrayList<>();

        while (currentIndex < json.length()) {
            char c = json.charAt(currentIndex);

            //Blanks can be skipped directly, if there are more blanks, just add a new judgment
            if (c == ' ' || c == '\r' || c == '\n' || c == ',') {
                //As long as the character has been processed, the current position must be moved to the next
                ++currentIndex;
                continue;
            } else if (c == '{' || c == '}') {
                //object
                tokens.add(new Token("object", c));
            } else if (c == '[' || c == ']') {
                //array
                tokens.add(new Token("array", c));
            } else if (c == '"') {
                //string
                StringBuffer value = new StringBuffer();
                char cc = json.charAt(++currentIndex);

                // Here " is used as the sign of the end of the string
                // Of course, this is not rigorous because escaping is not considered, but this problem will be solved later, we will ignore it for the time being
                while (cc != '"') {
                    value.append(cc);
                    cc = json.charAt(++currentIndex);
                }
                tokens.add(new Token("string", value.toString()));
            } else if (c == ':') {
                // The separator between key-value
                tokens.add(new Token("kvSymbol", "kvSymbol", c));
            } else if (c >= '0' && c <= '9') {
                //number
                StringBuffer value = new StringBuffer();
                value.append(c);
                char cc = json.charAt(++currentIndex);
                //Here a floating point number with a decimal point is considered
                while (cc == '.' || (cc >= '0' && cc <= '9')) {
                    value.append(cc);
                    cc = json.charAt(++currentIndex);
                }
                //Numbers are temporarily represented by floating-point numbers
                tokens.add(new Token("value", "number", Float.parseFloat(value.toString())));
            }
            ++currentIndex;
        }
        return tokens;
    }
}

The code flow is as follows

  • loop through json string
  • Detect keywords, and identify keywords and save them as token s
  • For the processing of strings, whether it is a key or a value, the string type is currently stored uniformly.
  • The processing of digital types is currently stored in Float floating point numbers

test

We write a program program to test

public class Main {

    public static void main(String[] args) {
        String json = "{\"name\": \"asan\", \"age\": 32}";
        JSONParser parser = new JSONParser();
        List<Token> tokens = parser.tokenizer(json);
        System.out.println(String.format("|%-12s|%-12s|%-15s|", "type", "valueType", "value"));
        System.out.println("-------------------------------------------");
        for (Token t : tokens) {
            System.out.println(String.format("|%-12s|%-12s|%-15s|",
                    t.getType(),
                    t.getValueType(),
                    t.getValue()));
        }
        System.out.println("-------------------------------------------");
    }
}

Let's take a relatively simple json{"name": "asan", "age": 32} to test first, and the test results are as follows

|type        |valueType   |value          |
-------------------------------------------
|object      |object      |{              |
|string      |string      |name           |
|kvSymbol    |kvSymbol    |:              |
|string      |string      |asan           |
|string      |string      |age            |
|kvSymbol    |kvSymbol    |:              |
|value       |number      |32.0           |
-------------------------------------------

The result so far is in line with our expectations.

optimization

other basic types

At present, the value type of the program only processes strings and numbers, and the bool and null types are not processed. Since the program scans the string character by character, but to judge bool and null, it must be scanned later.

  • judge null
if ((c == 'n') &&
    json.startsWith("null", currentIndex)) {
  tokens.add(new Token("value", "null", null));
  //If you read a null value, you need to move the current pointer forward by 3 characters (null occupies 4 characters, and you need to move 3 characters after removing the 1 string that has been read)
  currentIndex += 3;
}
  • Judgment bool value

The judgment of the bool value is to judge the two strings of true and false, which is similar to judging the null value

if ((c == 't') &&
    json.startsWith("true", currentIndex)) {
  tokens.add(new Token("value", "bool", true));
  currentIndex += 3;
}
if ((c == 'f') &&
    json.startsWith("false", currentIndex)) {
  tokens.add(new Token("value", "bool", false));
  //false is 5 characters so need to shift 4 bits
  currentIndex += 4;
}

We modify the test json string to {"name": "asan", "age": 32,"mail": null,"married": true} and then test, the results are as follows

|type        |valueType   |value          |
-------------------------------------------
|object      |object      |{              |
|string      |string      |name           |
|kvSymbol    |kvSymbol    |:              |
|string      |string      |asan           |
|string      |string      |age            |
|kvSymbol    |kvSymbol    |:              |
|value       |number      |32.0           |
|string      |string      |mail           |
|kvSymbol    |kvSymbol    |:              |
|value       |null        |null           |
|string      |string      |married        |
|kvSymbol    |kvSymbol    |:              |
|value       |bool        |true           |
|object      |object      |}              |
-------------------------------------------

The result is as expected.

string handling

The current processing method of the string is to detect "as a string until the next " appears, but this processing method is not rigorous, it is possible that the string itself contains ", so it is necessary to escape the character Processing, we modify the processing function of the string

if (c == '"') {
  //string
  StringBuffer value = new StringBuffer();
  char cc = json.charAt(++currentIndex);
  // Here " is used as the sign of the end of the string
  while (cc != '"') {
      if (cc == '\\') {
          cc = json.charAt(++currentIndex);
      }
      value.append(cc);
      cc = json.charAt(++currentIndex);
  }
  tokens.add(new Token("string", value.toString()));
 }

We changed the test string to {"name": "asan", "age": 32,"description": "a \"hudsom\" man","married": true} and tested again, the results are as follows

|type        |valueType   |value          |
-------------------------------------------
|object      |object      |{              |
|string      |string      |name           |
|kvSymbol    |kvSymbol    |:              |
|string      |string      |asan           |
|string      |string      |age            |
|kvSymbol    |kvSymbol    |:              |
|value       |number      |32.0           |
|string      |string      |description    |
|kvSymbol    |kvSymbol    |:              |
|string      |string      |a "hudsom" man |
|string      |string      |married        |
|kvSymbol    |kvSymbol    |:              |
|value       |bool        |true           |
|object      |object      |}              |
-------------------------------------------

Successfully recognized the escaped string

digital processing

Digital processing currently has some problems

  • Negative case not handled
  • Does not deal with science and technology law
  • There is no distinction between floating point and integers, all of which are floating point numbers

The program is modified as follows

if ((c >= '0' && c <= '9') || c == '-') {
  // number
  StringBuffer value = new StringBuffer();
  value.append(c);
  // Judging whether it is a floating point number
  boolean isFloat = false;
  //If json is an integer such as: 1, then an error will be reported if there is no judgment here
  if (currentIndex + 1 < json.length()) {
      char cc = json.charAt(++currentIndex);
      // Judgment includes floating point type, integer type, science and technology method
      while (cc == '.' || (cc >= '0' && cc <= '9') || cc == 'e' || cc == 'E' || cc == '+' || cc == '-') {
          value.append(cc);
          if (cc == '.') {
              isFloat = true;
          }
          cc = json.charAt(++currentIndex);
      }
  }
  if (isFloat) {
      //floating point number
      tokens.add(new Token("value", "float", Float.parseFloat(value.toString())));
  } else {
      //integer
      tokens.add(new Token("value", "long", Long.parseLong(value.toString())));
  }
}

We use the string {"age":32,"deposit": -342.34} to test, and the test results are as follows

|type        |valueType   |value          |
-------------------------------------------
|object      |object      |{              |
|string      |string      |age            |
|kvSymbol    |kvSymbol    |:              |
|value       |long        |32             |
|string      |string      |deposit        |
|kvSymbol    |kvSymbol    |:              |
|value       |float       |-342.34        |
-------------------------------------------

full test

We test with the complete string and the results are as follows

|type        |valueType   |value          |
-------------------------------------------
|object      |object      |{              |
|string      |string      |name           |
|kvSymbol    |kvSymbol    |:              |
|string      |string      |asan           |
|string      |string      |age            |
|kvSymbol    |kvSymbol    |:              |
|value       |long        |32             |
|string      |string      |married        |
|kvSymbol    |kvSymbol    |:              |
|value       |bool        |true           |
|string      |string      |birthday       |
|kvSymbol    |kvSymbol    |:              |
|string      |string      |1992-02-08     |
|string      |string      |salary         |
|kvSymbol    |kvSymbol    |:              |
|value       |float       |1234.56        |
|string      |string      |description    |
|kvSymbol    |kvSymbol    |:              |
|string      |string      |a "hudsom" man |
|string      |string      |tags           |
|kvSymbol    |kvSymbol    |:              |
|array       |array       |[              |
|string      |string      |coder          |
|string      |string      |mid-age        |
|array       |array       |]              |
|string      |string      |location       |
|kvSymbol    |kvSymbol    |:              |
|object      |object      |{              |
|string      |string      |province       |
|kvSymbol    |kvSymbol    |:              |
|string      |string      |Fujian Province            |
|string      |string      |city           |
|kvSymbol    |kvSymbol    |:              |
|string      |string      |Quanzhou            |
|string      |string      |area           |
|kvSymbol    |kvSymbol    |:              |
|string      |string      |Jinjiang City            |
|object      |object      |}              |
|string      |string      |family         |
|kvSymbol    |kvSymbol    |:              |
|array       |array       |[              |
|object      |object      |{              |
|string      |string      |relation       |
|kvSymbol    |kvSymbol    |:              |
|string      |string      |couple         |
|string      |string      |name           |
|kvSymbol    |kvSymbol    |:              |
|string      |string      |Helen          |
|object      |object      |}              |
|object      |object      |{              |
|string      |string      |relation       |
|kvSymbol    |kvSymbol    |:              |
|string      |string      |daughter       |
|string      |string      |name           |
|kvSymbol    |kvSymbol    |:              |
|string      |string      |XiaoMan        |
|object      |object      |}              |
|array       |array       |]              |
|object      |object      |}              |
-------------------------------------------

full code

public class JSONParser {

    //currentIndex saves the position of the current string scan, and the string is scanned character by character
    private int currentIndex = 0;

    /**
     * Tokenize the json string
     *
     * @param json json string
     * @return token the list
     */
    public List<Token> tokenizer(String json) {
        // Save word segmentation results
        List<Token> tokens = new ArrayList<>();
        while (currentIndex < json.length()) {
            char c = json.charAt(currentIndex);
            //Blanks can be skipped directly, if there are more blanks, just add a new judgment
            if (c == ' ' || c == '\r' || c == '\n' || c == ',') {
                //As long as the character has been processed, the current position must be moved to the next
                ++currentIndex;
                continue;
            } else if (c == '{' || c == '}') {
                //object
                tokens.add(new Token("object", c));
            } else if (c == '[' || c == ']') {
                //array
                tokens.add(new Token("array", c));
            } else if (c == '"') {
                //string
                StringBuffer value = new StringBuffer();
                char cc = json.charAt(++currentIndex);
                // Here " is used as the sign of the end of the string
                while (cc != '"') {
                    if (cc == '\\') {
                        cc = json.charAt(++currentIndex);
                    }
                    value.append(cc);
                    cc = json.charAt(++currentIndex);
                }
                tokens.add(new Token("string", value.toString()));
            } else if (c == ':') {
                // The separator between key-value
                tokens.add(new Token("kvSymbol", "kvSymbol", c));
            } else if ((c >= '0' && c <= '9') || c == '-') {
                // number
                StringBuffer value = new StringBuffer();
                value.append(c);
                // Judging whether it is a floating point number
                boolean isFloat = false;
                //If json is an integer such as: 1, then an error will be reported if there is no judgment here
                if (currentIndex + 1 < json.length()) {
                    char cc = json.charAt(++currentIndex);
                    // Judgment includes floating point type, integer type, science and technology method
                    while (cc == '.' || (cc >= '0' && cc <= '9') || cc == 'e' || cc == 'E' || cc == '+' || cc == '-') {
                        value.append(cc);
                        if (cc == '.') {
                            isFloat = true;
                        }
                        cc = json.charAt(++currentIndex);
                    }
                }
                if (isFloat) {
                    //floating point number
                    tokens.add(new Token("value", "float", Float.parseFloat(value.toString())));
                } else {
                    //integer
                    tokens.add(new Token("value", "long", Long.parseLong(value.toString())));
                }
            } else if ((c == 'n') && json.startsWith("null", currentIndex)) {
                tokens.add(new Token("value", "null", null));
                currentIndex += 3;
            } else if ((c == 't') &&
                    json.startsWith("true", currentIndex)) {
                tokens.add(new Token("value", "bool", true));
                currentIndex += 3;
            } else if ((c == 'f') &&
                    json.startsWith("false", currentIndex)) {
                tokens.add(new Token("value", "bool", false));
                //false is 5 characters so need to shift 4 bits
                currentIndex += 4;
            }
            ++currentIndex;
        }
        //reset current location
        currentIndex = 0;
        return tokens;
    }
}

the code

Please refer to the project for the complete code https://github.com/wls1036/tiny-json-parser 0x01 branch

Tags: compiler

Posted by tolputt-craig on Thu, 22 Dec 2022 08:02:02 +0530