A record of a CPU spike caused by Nacos!

Preface

This afternoon the CPU in the test environment suddenly spiked to 60%, and the response times of other projects got noticeably longer... A bit scary; I don't want to take the blame.

Background of the project

The project in question needs to connect to different Nacos instances and different namespaces and perform the corresponding operations. All operations against Nacos are API calls made through httpClient. "There is nothing wrong with the httpClient approach itself, so don't question that part."

Locating the problem

Since the CPU was high, the first step was to check the busy threads with top -Hp.

After locating the process id, run jstack <pid> > 1.txt to dump the thread stacks.

Looking at the dump, a large number of entries like the following appear:

"com.alibaba.nacos.client.config.security.updater" #2269 daemon prio=5 os_prio=0 tid=0x00007fa3ec401800 nid=0x8d85 waiting on condition [0x00007fa314396000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000f7f3eae0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1093)
        at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

But the stack above shows that the thread is parked inside the Nacos client itself.

That made me very uncomfortable. I have always called Nacos over HTTP precisely to avoid opening useless threads... "How can this be?"

So I went to find out where that thread is created, searching by the keyword "com.alibaba.nacos.client.config.security.updater".

In the ServerHttpAgent class:

// init executorService
this.executorService = new ScheduledThreadPoolExecutor(1, new ThreadFactory() {
    @Override
    public Thread newThread(Runnable r) {
        Thread t = new Thread(r);
        t.setName("com.alibaba.nacos.client.config.security.updater");
        t.setDaemon(true);
        return t;
    }
});

This is in the constructor, so it should only be initialized once. Debugging my way up, well, it turns out it is called from the NacosConfigService class.

It is called once when the project initializes; the business system depends on Nacos, so that is understandable.

Then came a long wait. About 30 seconds later another call showed up. No way, how is that possible...

Going back to debug, the code is as follows:

scheduler.schedule("Timing proofread grayscale nacos configuration", () -> loadGrayConfig(grayFileName),
    1800, 1800, TimeUnit.SECONDS);
/**
 * Gray-release configuration refresh, to guard against network partition issues
 *
 * @param grayFileName the name of the gray-release configuration file
 */
private void loadGrayConfig(String grayFileName) {
    synchronized (this) {
        System.err.println("loadGrayConfig datetime: " + DateUtils.formatDate(new Date()));
        //Refresh the cache to re-acquire the nacos content assignment
        grayConfigManager.loadNoCache(grayFileName);
    }
}

That task was written for the gray-release feature: a thread pool runs a periodic data check. At the time I thought using a thread pool there was perfectly appropriate... I even deliberately called the loadNoCache method so that it would create a brand-new Nacos config object for the comparison.

"But every time NacosFactory.createConfigService(properties) is called, the nacos config constructor will open a thread, which leads to this problem"

You might ask at this point: you said this scheduled task was added to guard against network partitions, so what is a network partition?

I first heard about this concept when learning Raft

Suppose a Raft cluster has three nodes and node 3 becomes partitioned from the network. Under a basic Raft implementation, the cluster behaves as follows:

  • Because it is partitioned, node 3 cannot receive Heartbeat and AppendEntries RPCs from the leader, so it enters the election process. Of course it cannot collect any votes, so its elections keep timing out and its term keeps increasing
  • Nodes 1 and 2 keep working normally and stay on the current term

After the network recovers and the leader sends RPCs to node 3, node 3 rejects them because the sender's term is lower than its own.

On receiving node 3's rejection, the leader updates its term and steps down to follower.

Afterwards the cluster starts a new election, and with high probability the original leader becomes leader again.
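That step-down follows the standard Raft rule that any message carrying a higher term forces the receiver back to follower. A minimal sketch with made-up types (not taken from any particular Raft library):

// Illustrative only: the class and field names below are invented for this sketch.
public class StepDownDemo {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    static class RaftNode {
        long currentTerm;
        Role role = Role.FOLLOWER;

        // Called for every RPC or RPC response received from a peer.
        void onPeerTerm(long peerTerm) {
            if (peerTerm > currentTerm) {
                // A higher term (e.g. from the partitioned node 3 after it rejoins)
                // forces this node to adopt that term and step down to follower.
                currentTerm = peerTerm;
                role = Role.FOLLOWER;
            }
        }
    }

    public static void main(String[] args) {
        RaftNode leader = new RaftNode();
        leader.currentTerm = 5;
        leader.role = Role.LEADER;
        leader.onPeerTerm(12);           // node 3 rejoined with a much higher term
        System.out.println(leader.role); // FOLLOWER
    }
}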

So how can network partitions like this be handled?

The safety issue with multiple rounds of voting is tricky: we must avoid committing two different blocks at the same height in different rounds. In Tendermint this problem is solved by a locking mechanism.

Locking rules: "Prevote-the-Lock":

Validators may only prevote the block they are locked on. This prevents a validator from precommitting one block in an earlier round and then prevoting a different block in a later round.

Unlock-on-Polka: a validator may only release its lock after seeing a polka from a round higher than the round it is currently locked on. This lets validators unlock when they precommitted a block that the rest of the network does not want to commit, keeping the network live without compromising safety.
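An illustrative sketch of the two rules, with invented names (not Tendermint's actual code):

public class LockingRulesDemo {
    record Block(String id) {}

    static class Validator {
        Block lockedBlock;     // block precommitted in an earlier round, null if unlocked
        int lockedRound = -1;

        // Prevote-the-Lock: while locked, only the locked block may be prevoted.
        Block choosePrevote(Block proposed) {
            return lockedBlock != null ? lockedBlock : proposed;
        }

        // Unlock-on-Polka: a polka (2/3+ prevotes) from a round higher than the
        // locked round releases the lock.
        void onPolka(int polkaRound) {
            if (polkaRound > lockedRound) {
                lockedBlock = null;
                lockedRound = -1;
            }
        }
    }
}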

"The solution is to replace term with (term, nodeid), and compare the size in lexicographical order (a > b === a.term > b.term || a.term == b.term && a.nodeid > b . node_id). This is the practice in paxos, to ensure that there will be no conflicts in raft.”

The principle is that Raft already describes the voting state with two values: the term, and which node_id has been voted for in that term, i.e. [term, voted_for]. Because Raft does not allow a node to vote for two different nodes within one term, a vote request is only granted when vote_req.term > local.term, or when vote_req.term == local.term && vote_req.node_id == local.voted_for.

After replacing term with (term, node_id), the comparison in the vote phase becomes: vote_req.term > local.term || vote_req.term == local.term && vote_req.node_id >= local.node_id. The condition is looser: within the same term, a candidate with a larger node_id can take leadership away from an established leader with a smaller node_id.

The term recorded in the log also needs to be replaced with (term, node_id), because that pair uniquely identifies a leader; previously a single term was enough to uniquely identify a leader in Raft.

When comparing the last log id during voting, switch from comparing the tuple (term, index) to comparing the tuple (term, node_id, index).

Just a little modification.

"To sum up, sorting according to the dictionary and pre-voting locks guarantee that when multiple candidate s with the same term meet, there will definitely be one that wins the majority vote."

The idea

If we come back after an abnormal network partition, the data may be inconsistent. The solution above is not suitable for us because it is relatively heavy, so we simply introduced a "scheduled task that periodically re-reads and compares the configuration (much like reconciliation)".

The fix

Now I traverse the existing Nacos config connections and check whether each one is still alive; only a dead connection is shut down and re-created, instead of re-creating all of them. After all, the constructor starts a thread every time...
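A minimal sketch of this fix, assuming a cache of ConfigService instances keyed by server address plus namespace; the map, the "UP" status check and the propertiesFor helper are assumptions for illustration, not the project's actual code:

import java.util.Map;
import java.util.Properties;

import com.alibaba.nacos.api.NacosFactory;
import com.alibaba.nacos.api.config.ConfigService;
import com.alibaba.nacos.api.exception.NacosException;

public class ConfigServiceReconciler {

    // Re-create only the dead connections instead of rebuilding every one of them.
    void refreshDeadConnections(Map<String, ConfigService> cache) throws NacosException {
        for (Map.Entry<String, ConfigService> entry : cache.entrySet()) {
            ConfigService cs = entry.getValue();
            if (!"UP".equals(cs.getServerStatus())) {
                // Shut the dead instance down first so its background threads
                // (including the security updater) are released...
                cs.shutDown();
                // ...then build a single replacement for this key only.
                entry.setValue(NacosFactory.createConfigService(propertiesFor(entry.getKey())));
            }
        }
    }

    // Hypothetical helper: rebuilds the Properties (server address, namespace, ...)
    // for the given cache key.
    private Properties propertiesFor(String cacheKey) {
        Properties p = new Properties();
        // ... fill in the server address / namespace derived from cacheKey
        return p;
    }
}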

Looking back, this happened because I was over-confident at the time and never looked into what this call does; the thread is opened deep inside the client. Ha, fine, fix it and push it to the test environment to watch the effect on the CPU.

Easter egg

As for the test environment's response time getting longer, that had nothing to do with me... it was someone else's load test eating up the bandwidth... "See through it, but don't say it."


Tags: node.js raft

Posted by pouncer on Thu, 23 Feb 2023 06:18:04 +0530