Serialize the third part of "explanation of Chrome V8 principles", look at the V8 compilation process and learn lexical analysis

Original source:   See the V8 compilation process and learn lexical analysis - Security guest, security information platform

Content of this article

This is the third part to explain the implementation of lexical analysis (scanner) in V8, which involves several important data structures and some related compilation knowledge. This paper also tries to comprehensively explain the relevant compilation knowledge, so as to give readers a comprehensive understanding. Note: This article does not cover the optimized compilation of V8

1.V8 compilation process

Generally speaking, the V8 compilation process is implemented synchronously. The main process is to scan a token word during initialization, put it into the cache, and then start analysis (parser), take a token from the cache for analysis, then generate a node of the abstract syntax tree, and then take a token for analysis. Such a cycle. If the cache misses, a scan is started to generate a token, as shown in Figure 1.

Let's take a look at the code implementation in V8. The Parser::ParseProgram method is the entry point of the scanner in Figure 1. In this code, you can see scanner_ Initialize() and FunctionLiteral* result = DoParseProgram(isolate, info). The functions of these two methods are:

1. Initialize scanner   Generate the first token word, scanner.c0_ Point to the first character to start, that is, the first character to scan.

2.DoParseProgram(isolate, info):   Read the token word and generate the AST body. Finally, all bodies are aggregated into the AST syntax tree.

void Parser::ParseProgram(Isolate* isolate, Handle<Script> script,
                          ParseInfo* info,
                          MaybeHandle<ScopeInfo> maybe_outer_scope_info) {
  DCHECK_EQ(script->id(), flags().script_id());

  // It's OK to use the Isolate & counters here, since this function is only
  // called in the main thread.
  DCHECK(parsing_on_main_thread_);
  RCS_SCOPE(runtime_call_stats_, flags().is_eval()
                                     ? RuntimeCallCounterId::kParseEval
                                     : RuntimeCallCounterId::kParseProgram);
  TRACE_EVENT0(TRACE_DISABLED_BY_DEFAULT("v8.compile"), "V8.ParseProgram");
  base::ElapsedTimer timer;
  if (V8_UNLIKELY(FLAG_log_function_events)) timer.Start();

  // Initialize parser state.
  DeserializeScopeChain(isolate, info, maybe_outer_scope_info,
                        Scope::DeserializationMode::kIncludingVariables);

  DCHECK_EQ(script->is_wrapped(), info->is_wrapped_as_function());
  if (script->is_wrapped()) {
    maybe_wrapped_arguments_ = handle(script->wrapped_arguments(), isolate);
  }

  scanner_.Initialize();
  FunctionLiteral* result = DoParseProgram(isolate, info);
  MaybeResetCharacterStream(info, result);
  MaybeProcessSourceRanges(info, result, stack_limit_);
  PostProcessParseResult(isolate, info, result);

  HandleSourceURLComments(isolate, script);

  if (V8_UNLIKELY(FLAG_log_function_events) && result != nullptr) {
    double ms = timer.Elapsed().InMillisecondsF();
    const char* event_name = "parse-eval";
    int start = -1;
    int end = -1;
    if (!flags().is_eval()) {
      event_name = "parse-script";
      start = 0;
      end = String::cast(script->source()).length();
    }
    LOG(isolate,
        FunctionEvent(event_name, flags().script_id(), ms, start, end, "", 0));
  }
}

The above code (Parser::ParseProgram) is the nearest entry point for analyzing the compiled code in V8. Enter the trace from here and you can see the whole compilation process. There are also many callers in the outer layer of this method, but the task of these callers is only to make preparations. This method has entered the real compilation. In particular, it should be noted that the first scan has been started in the scanner. Initialize phase.

2.V8 lexical analysis (scanner) workflow

The first step of the compiler is called lexical analysis or scanning. In the course of compilation principle, my teacher uses the term "lexical analysis". When I look at the compiler source code and face the engineering implementation, I see the name "scanning". One of these two names focuses on principle and the other on implementation. The lexical analyzer reads the character stream of the source program, organizes them into a sequence of meaningful morphemes, and regenerates them into corresponding lexical units (tokens).
In V8, the character stream of the source program is UTF-16 encoded, and the data structure for storing the character stream is as follows:

// ---------------------------------------------------------------------
// Buffered stream of UTF-16 code units, using an internal UTF-16 buffer.
// A code unit is a 16 bit value representing either a 16 bit code point
// or one part of a surrogate pair that make a single 21 bit code point.
class Utf16CharacterStream {
 public:
  static constexpr base::uc32 kEndOfInput = static_cast<base::uc32>(-1);

  virtual ~Utf16CharacterStream() = default;

  V8_INLINE void set_parser_error() {
    buffer_cursor_ = buffer_end_;
    has_parser_error_ = true;
  }
  V8_INLINE void reset_parser_error_flag() { has_parser_error_ = false; }
  V8_INLINE bool has_parser_error() const { return has_parser_error_; }

  inline base::uc32 Peek() {
    if (V8_LIKELY(buffer_cursor_ < buffer_end_)) {
      return static_cast<base::uc32>(*buffer_cursor_);
    } else if (ReadBlockChecked()) {
      return static_cast<base::uc32>(*buffer_cursor_);
    } else {
      return kEndOfInput;
    }
  }

  // Returns and advances past the next UTF-16 code unit in the input
  // stream. If there are no more code units it returns kEndOfInput.
  inline base::uc32 Advance() {
    base::uc32 result = Peek();
    buffer_cursor_++;
    return result;
  }

//...............
//Omit a lot of code in the middle
//..............



  // Read more data, and update buffer_*_ to point to it.
  // Returns true if more data was available.
  //
  // ReadBlock() may modify any of the buffer_*_ members, but must sure that
  // the result of pos() remains unaffected.
  //
  // Examples:
  // - a stream could either fill a separate buffer. Then buffer_start_ and
  //   buffer_cursor_ would point to the beginning of the buffer, and
  //   buffer_pos would be the old pos().
  // - a stream with existing buffer chunks would set buffer_start_ and
  //   buffer_end_ to cover the full chunk, and then buffer_cursor_ would
  //   point into the middle of the buffer, while buffer_pos_ would describe
  //   the start of the buffer.
  virtual bool ReadBlock() = 0;

  const uint16_t* buffer_start_;
  const uint16_t* buffer_cursor_;
  const uint16_t* buffer_end_;
  size_t buffer_pos_;
  RuntimeCallStats* runtime_call_stats_;
  bool has_parser_error_ = false;
};

In the above code, many intermediate codes are omitted. I keep the initial English comments and the English comments of the last important members. These important members are bufferstart, buffercursor and bufferend, which are used to record the source code stream. See the comments in the code for details. Let's take a look at the application scenario of class Utf16CharacterStream through an example.

In Figure 2, the js source code is on the left and the source code is read into memory and ready for compilation. The following code is handed over to bufferstart for management.

  bool ReadBlock() final {
    size_t position = pos();
    buffer_pos_ = position;
    buffer_start_ = &buffer_[0];
    buffer_cursor_ = buffer_start_;

    DisallowGarbageCollection no_gc;
    Range<uint8_t> range =
        byte_stream_.GetDataAt(position, runtime_call_stats(), &no_gc);
    if (range.length() == 0) {
      buffer_end_ = buffer_start_;
      return false;
    }

    size_t length = std::min({kBufferSize, range.length()});
    i::CopyChars(buffer_, range.start, length);
    buffer_end_ = &buffer_[length];
    return true;
  }

bufferstart of code  = & Buffer [0], this sentence uses buffer_start points to the first character of the source code to be compiled. Figure 3 shows the important functions in the Utf16CharacterStream class. This function reads a block code (code unit) and is the input string for subsequent token generation.

The previously mentioned member scanner.c0 always points to the string that will start token generation. The following code also explains the role of C0.

  void Advance() {
    if (capture_raw) {
      AddRawLiteralChar(c0_);
    }
    c0_ = source_->Advance();//Look here
  }

In the last step of scanner initialization, the following method is called to generate a token.

void Scanner::Scan(TokenDesc* next_desc) {
  DCHECK_EQ(next_desc, &next());

  next_desc->token = ScanSingleToken();
  DCHECK_IMPLIES(has_parser_error(), next_desc->token == Token::ILLEGAL);
  next_desc->location.end_pos = source_pos();

#ifdef DEBUG
  SanityCheckTokenDesc(current());
  SanityCheckTokenDesc(next());
  SanityCheckTokenDesc(next_next());
#endif
}

We can see next_desc - > token = scansingletoken() this code is generating a token. You can see next_desc is the key structure describing token word. It is a structure defined in Scanner. The code is as follows:

  struct TokenDesc {
    Location location = {0, 0};
    LiteralBuffer literal_chars;
    LiteralBuffer raw_literal_chars;
    Token::Value token = Token::UNINITIALIZED;
    MessageTemplate invalid_template_escape_message = MessageTemplate::kNone;
    Location invalid_template_escape_location;
    uint32_t smi_value_ = 0;
    bool after_line_terminator = false;

#ifdef DEBUG
    bool CanAccessLiteral() const {
      return token == Token::PRIVATE_NAME || token == Token::ILLEGAL ||
             token == Token::ESCAPED_KEYWORD || token == Token::UNINITIALIZED ||
             token == Token::REGEXP_LITERAL ||
             base::IsInRange(token, Token::NUMBER, Token::STRING) ||
             Token::IsAnyIdentifier(token) || Token::IsKeyword(token) ||
             base::IsInRange(token, Token::TEMPLATE_SPAN, Token::TEMPLATE_TAIL);
    }
    bool CanAccessRawLiteral() const {
      return token == Token::ILLEGAL || token == Token::UNINITIALIZED ||
             base::IsInRange(token, Token::TEMPLATE_SPAN, Token::TEMPLATE_TAIL);
    }
#endif  // DEBUG
  };

At this point, the initialization of lexical analysis is completed, and Parse drives it to work. The entry points of lexical analysis are as follows.

void Scanner::Scan(TokenDesc* next_desc) {
  DCHECK_EQ(next_desc, &next());

  next_desc->token = ScanSingleToken();
  DCHECK_IMPLIES(has_parser_error(), next_desc->token == Token::ILLEGAL);
  next_desc->location.end_pos = source_pos();

#ifdef DEBUG
  SanityCheckTokenDesc(current());
  SanityCheckTokenDesc(next());
  SanityCheckTokenDesc(next_next());
#endif
}

3. Knowledge points of V8 lexical analysis

Compilation technology is a huge field of knowledge, involving many aspects of theoretical knowledge. The specific implementation of a compiler mainly includes: lexical analysis, semantic analysis, intermediate code, machine independent code, machine code and other stages. In addition, there are various optimization technologies, such as control flow optimization, data flow optimization, register optimization and so on, These optimization techniques are interleaved with these stages to generate the target program. From the perspective of compilation technology, each piece of code compiled by V8 has corresponding theoretical support. After understanding these theories, it will be easy to look at the V8 compilation source code. Let's take a look at the knowledge points of V8 lexical analysis.
Lexical analysis is the first stage of compilation. The main task of the lexical analyzer is to read in the input characters of the source program, form them into lexeme s, generate and output a token, each lexical unit corresponds to a morpheme, and the lexical unit is output to the parser for syntax analysis. When the lexical analyzer finds an identifier, it will also save it using the symbol table. The workflow of lexical analysis and the interaction with the symbol table are driven by the parser, as shown in Figure 4.

In Figure 4, getNextToken is a command that drives the lexical analyzer to ask for a token until it recognizes the next morpheme, and then returns the generated token to the parser. Take a look at the implementation of this process in V8.

void ParserBase<Impl>::ParseStatementList(StatementListT* body,
                                          Token::Value end_token) {
  // StatementList ::
  //   (StatementListItem)* <end_token>
  DCHECK_NOT_NULL(body);

  while (peek() == Token::STRING) {
    bool use_strict = false;
#if V8_ENABLE_WEBASSEMBLY
    bool use_asm = false;
#endif  // V8_ENABLE_WEBASSEMBLY

    Scanner::Location token_loc = scanner()->peek_location();

    if (scanner()->NextLiteralExactlyEquals("use strict")) {
      use_strict = true;
#if V8_ENABLE_WEBASSEMBLY
    } else if (scanner()->NextLiteralExactlyEquals("use asm")) {
      use_asm = true;
#endif  // V8_ENABLE_WEBASSEMBLY
    }

    StatementT stat = ParseStatementListItem();
//.......................
//Omit a lot
//.......................
  }

  while (peek() != end_token) {
    StatementT stat = ParseStatementListItem();
    if (impl()->IsNull(stat)) return;
    if (stat->IsEmptyStatement()) continue;
    body->Add(stat);
  }
}

This code is very long, and we may not be able to cover every statement due to the different debugging cases selected during the debugging process. Let's just look at the last part. My debugging code triggers the execution of the last while(), of which the most important is ParseStatementListItem(). Figure 5 shows the call stack of Parse driven schanner and the function call process in the stack, that is, the specific implementation of getNextToken in Figure 4.

You can also see the cache mentioned earlier (Figure 1) in Figure 5. Its implementation in V8 is the Consum method. The following is a detailed implementation of Token generation in V8.

V8_INLINE Token::Value Scanner::ScanSingleToken() {
  Token::Value token;
  do {
    next().location.beg_pos = source_pos();

    if (V8_LIKELY(static_cast<unsigned>(c0_) <= kMaxAscii)) {
      token = one_char_tokens[c0_];

      switch (token) {
//..............
//The code is too long and omitted a lot
//..............

        case Token::CONDITIONAL:
          // ? ?. ?? ??=
          Advance();
          if (c0_ == '.') {
            Advance();
            if (!IsDecimalDigit(c0_)) return Token::QUESTION_PERIOD;
            PushBack('.');
          } else if (c0_ == '?') {
            return Select('=', Token::ASSIGN_NULLISH, Token::NULLISH);
          }
          return Token::CONDITIONAL;

        case Token::STRING:
          return ScanString();

        case Token::LT:
          // < <= << <<= <!--
          Advance();
          if (c0_ == '=') return Select(Token::LTE);
          if (c0_ == '<') return Select('=', Token::ASSIGN_SHL, Token::SHL);
          if (c0_ == '!') {
            token = ScanHtmlComment();
            continue;
          }
          return Token::LT;
        case Token::ASSIGN:
          // = == === =>
          Advance();
          if (c0_ == '=') return Select('=', Token::EQ_STRICT, Token::EQ);
          if (c0_ == '>') return Select(Token::ARROW);
          return Token::ASSIGN;

        case Token::NOT:
          // ! != !==
          Advance();
          if (c0_ == '=') return Select('=', Token::NE_STRICT, Token::NE);
          return Token::NOT;

        case Token::ADD:
          // + ++ +=
          Advance();
          if (c0_ == '+') return Select(Token::INC);
          if (c0_ == '=') return Select(Token::ASSIGN_ADD);
          return Token::ADD;
//..............
//The code is too long and omitted a lot
//..............

        default:
          UNREACHABLE();
      }
    }

    if (IsIdentifierStart(c0_) ||
        (CombineSurrogatePair() && IsIdentifierStart(c0_))) {
      return ScanIdentifierOrKeyword();
    }
    if (c0_ == kEndOfInput) {
      return source_->has_parser_error() ? Token::ILLEGAL : Token::EOS;
    }
    token = SkipWhiteSpace();

    // Continue scanning for tokens as long as we're just skipping whitespace.
  } while (token == Token::WHITESPACE);

  return token;
}

As can be seen from the code, the process of generating token is the switch case, which is an automata, and regular expression is also widely used in the process of generating token. To sum up: the generation process of token is the process of character matching. The token template (TOKEN_LIST) is pre-defined in V8, and then the character matching is completed by using switch case to generate token.
Well, that's all for today. See you next time.
Wechat: qq9123013 remarks: v8 exchange learning email: v8blink@outlook.com

Tags: Javascript Web Security chrome compiler

Posted by martinacevedo on Mon, 27 Sep 2021 04:21:46 +0530