Java Development Engineer Technical Notes (Consistent Hashing Algorithm)

1, Usage scenario:

        1. Use of Redis cluster:

    To ensure Redis high availability and improve Redis read and write performance, the simplest approach is master-slave replication,
in the form of Master-Master (dual-master mode) or Master-Slave (master-slave mode). Alternatively, build
a Redis cluster for read-write separation, similar to master-slave replication and read-write separation in a database.

    Also as with a database, where a table holding more than about 5 million rows (500W, where W = 10,000) usually has to be split across databases and tables,
when the amount of data in Redis becomes very large (the threshold for Redis differs from that of a database), we can shard Redis in the same way.


    Hypothesis: a social networking site uses Redis to store image resources as
key-value pairs (key: name --> value: path). We need to look up a file's path
on the server by its name. The data volume is about 20 million entries (2000W). The sharding rule is
random allocation. Suppose 8 cache servers are deployed as master-slave pairs (4 masters, 4 slaves),
with each master holding about 5 million entries (500W).

    Then: since the rule is random, a piece of data may be stored on any group of Redis servers.
We therefore have to traverse all Redis servers in turn to find the expected result. Obviously,
traversal is very inefficient.

                      

  2. Sharding the Redis cluster by hash value:

[The assumption above cannot achieve the expected result, so the random allocation rule is not suitable. Instead, imitate database sharding:

allocating by a common rule such as hash value, modulo, category or the value of a field improves efficiency.]

Suppose we are looking for "a.png". Because there are four servers (excluding the slaves), the formula is:
 Hash(a.png) % 4 = 2, so the data is located on server 2. This avoids
 traversing all servers and greatly improves performance.

[Core idea: the hash value of each image, taken modulo the number of servers, yields a server number, which locates the server that holds the image.]
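
As a minimal sketch of the modulo rule above: the text does not say which hash function is used, so CRC32 from the JDK stands in for Hash() here, and the class and method names are hypothetical. The index you get for "a.png" depends on the hash function; the value 2 in the text is only an illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch of hash-modulo sharding. CRC32 is an assumed stand-in for Hash().
class ModuloSharding {

    // Returns a non-negative 32-bit hash of the key (0 .. 2^32 - 1).
    static long hash(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Maps a key to a server index in [0, serverCount).
    static int serverFor(String key, int serverCount) {
        return (int) (hash(key) % serverCount);
    }

    public static void main(String[] args) {
        // Every client computes the same index, so no traversal is needed.
        System.out.println("a.png -> server " + serverFor("a.png", 4));
    }
}
```

Because the index depends only on the key and the server count, any client can find the right server in one step, as long as the server count stays fixed.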

                        

  3. Drawbacks of using only hash values with modulo:

1, [Problem 1] Following the assumption above: if four servers can no longer meet our caching needs, we have to add one server.
   After a Redis node is added, the formula becomes Hash(key) % 5, so only the keys whose hash gives the same result modulo 4 and modulo 5
(4 out of every 20 hash values) stay on their original node. Requests for all other keys
miss the cache (cache invalidation) and fall through to the backend database at the same time, which causes a cache avalanche.


2, [Problem 2] Under the same assumption: if one of the four servers suddenly fails, we have to remove it.
When the number of servers drops from 4 to 3, the same cache avalanche occurs.

Original 4 servers:

  Cache capacity is insufficient; after adding one, 5 servers:
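
The scale of the problem can be checked with a few lines of code. The sketch below assumes hash values are spread evenly and simulates them with the integers 0 to N-1; the class and method names are hypothetical.

```java
// Counts how many hash values change server when the number of servers changes.
class ModuloRemapDemo {

    // Number of hash values in [0, total) whose server index differs
    // between a cluster of oldCount servers and one of newCount servers.
    static long countMoved(long total, int oldCount, int newCount) {
        long moved = 0;
        for (long h = 0; h < total; h++) {
            if (h % oldCount != h % newCount) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        long total = 20_000;
        // Prints: 4 -> 5 servers: 16000 of 20000 hash values move
        System.out.println("4 -> 5 servers: " + countMoved(total, 4, 5) + " of " + total + " hash values move");
        // Prints: 4 -> 3 servers: 9000 of 12000 hash values move
        System.out.println("4 -> 3 servers: " + countMoved(12_000, 4, 3) + " of 12000 hash values move");
    }
}
```

Under this assumption, about 80% of keys change server when going from 4 to 5 nodes and 75% when going from 4 to 3, which is why so many requests miss the cache at once.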

  2, Demystifying the consistent hashing algorithm:

1, How the consistent hashing algorithm takes the modulus:

    1, The Redis cluster approach described above:

        The hash value is taken modulo the number of servers.

    2, The consistent hashing approach:

        The hash value is taken modulo 2^32, so that the whole hash value space (0 to (2^32)-1) forms a virtual ring.

    3, Hash ring: the whole space is organized clockwise, starting from 0 at the top and running to (2^32)-1. The ring
formed by these 2^32 points is called the hash ring; its start point and end point coincide at the zero position.

Suppose the value space of a hash function H is 0 ~ (2^32)-1 [that is, the hash value is a 32-bit unsigned integer]:
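
A small sketch of how a key can be turned into a point on this ring in Java. The text does not name H, so CRC32 is used as an assumed example; the class and method names are hypothetical. Java has no unsigned 32-bit type, so ring positions are held in a long.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Maps a string to a point on the hash ring, i.e. an unsigned 32-bit value.
class RingPosition {

    static final long RING_SIZE = 1L << 32; // 2^32 points: 0 .. 2^32 - 1

    // CRC32 is an assumed example of H; it already yields a value in [0, 2^32).
    static long of(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return crc.getValue() % RING_SIZE; // explicit modulo, a no-op for a 32-bit hash
    }

    // A signed Java int hash (for example String.hashCode()) can be placed
    // on the ring by reinterpreting it as an unsigned 32-bit value.
    static long ofSignedInt(int hash) {
        return Integer.toUnsignedLong(hash);
    }
}
```

For instance, RingPosition.ofSignedInt(-1) gives 4294967295, the last point on the ring just before it wraps back to 0.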

                  

 

2, Server location assignment:

    1, Compute a hash value for each server:

        Hash the server's IP address or host name as the key, which
determines the position of each machine on the hash ring.

Suppose the four servers above are placed in the ring space as follows after hashing their IP addresses:
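
As a rough illustration of this placement, the sketch below hashes four server identifiers and stores them in a sorted map keyed by ring position. The IP addresses are hypothetical, CRC32 is an assumed hash function, and the class name is made up for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Places servers on the ring by hashing their identifiers.
class ServerPlacement {

    // Assumed hash function: CRC32, giving a value in [0, 2^32).
    static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Builds the ring: position on the ring -> server identifier, sorted by position.
    static TreeMap<Long, String> place(String... servers) {
        TreeMap<Long, String> ring = new TreeMap<>();
        for (String server : servers) {
            ring.put(hash(server), server);
        }
        return ring;
    }

    public static void main(String[] args) {
        // Hypothetical IP addresses, used only for illustration.
        TreeMap<Long, String> ring =
                place("192.168.0.1", "192.168.0.2", "192.168.0.3", "192.168.0.4");
        ring.forEach((position, server) -> System.out.println(position + " -> " + server));
    }
}
```

A sorted map is a convenient choice here because walking the ring clockwise corresponds to moving to the next larger key.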

3, How data is mapped to a server (locating the server from the data):

    1, Hash the data's key with the same function Hash to determine the position of the data on the ring.

    2, From that position, walk clockwise along the ring; the first server encountered is the server on which the data should be placed.

    3, In plain terms: the data is partitioned into different arcs of the ring,
and the server at the clockwise end of each arc stores the data in that arc.

Suppose we have four objects: Object A, Object B, Object C and Object D. After hashing, their positions in the ring space are as follows. According to the consistent hashing algorithm, Object A is located on Node A.
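
The clockwise walk described above can be sketched with a sorted map. To keep the example easy to follow, nodes and objects are given explicit ring positions instead of hashed ones; the positions, class name and method names are all made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Looks up the first node clockwise from a position on the ring.
class RingLookup {

    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Places a node at an explicit ring position (normally the hash of its IP or host name).
    void addNodeAt(long position, String node) {
        ring.put(position, node);
    }

    // Walks clockwise from the given position and returns the first node met.
    String locate(long position) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(position);
        if (entry == null) {
            entry = ring.firstEntry(); // past the last node: wrap around to the start of the ring
        }
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        RingLookup ring = new RingLookup();
        ring.addNodeAt(1000, "Node A");
        ring.addNodeAt(2000, "Node B");
        ring.addNodeAt(3000, "Node C");
        ring.addNodeAt(4000, "Node D");
        System.out.println(ring.locate(500));  // Node A
        System.out.println(ring.locate(4500)); // Node A (wraps around the ring)
    }
}
```

In a real ring the position passed to locate would be the hash of the data's key, computed with the same function used for the servers.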

  3, Fault tolerance and scalability of consistent hash algorithm:

1, Fault tolerance:

    Suppose Node C goes down. Objects A, B and D are not affected; only Object C is relocated to Node D.
In general, only the data between the failed server and the previous server on the ring (walking counterclockwise, here Node B) is affected; all other data stays where it is.

2, Scalability:

    Suppose Node X is added between Node B and Node C. Only Object C, the data closest to X in the counterclockwise direction, is relocated to X; the other data objects
A, B and D are not affected.

  To sum up: when nodes are added or removed, the consistent hashing algorithm only needs to relocate a small part of the data in the ring space, which gives it good fault tolerance and scalability.
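
The two cases above can be replayed in a short sketch. As before, ring positions are chosen by hand rather than hashed so the outcome is easy to verify; the positions and names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Shows which objects move when a node is removed from or added to the ring.
class RingMembershipDemo {

    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNodeAt(long position, String node) {
        ring.put(position, node);
    }

    void removeNodeAt(long position) {
        ring.remove(position);
    }

    // First node clockwise from the position, wrapping around at the end of the ring.
    String locate(long position) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(position);
        if (entry == null) {
            entry = ring.firstEntry();
        }
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        RingMembershipDemo ring = new RingMembershipDemo();
        ring.addNodeAt(1000, "Node A");
        ring.addNodeAt(2000, "Node B");
        ring.addNodeAt(3000, "Node C");
        ring.addNodeAt(4000, "Node D");
        long objectC = 2500; // lies between Node B and Node C

        System.out.println(ring.locate(objectC)); // Node C
        ring.removeNodeAt(3000);                  // Node C fails
        System.out.println(ring.locate(objectC)); // Node D
        ring.addNodeAt(3000, "Node C");           // Node C is back
        ring.addNodeAt(2700, "Node X");           // Node X joins between Object C and Node C
        System.out.println(ring.locate(objectC)); // Node X
    }
}
```

Objects placed at 500, 1500 and 3500 in this layout keep resolving to Node A, Node B and Node D throughout, which is the property the modulo scheme lacks.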

4, Data skew of hash ring:

1, When there are too few server nodes, the nodes tend to be unevenly distributed on the ring, which easily causes data skew.

2, Most of the data then lands on a single node, which comes under heavy load.

Suppose there are only 2 servers in the system:

                          

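  To solve this, the consistent hashing algorithm introduces virtual nodes: several hash values are computed for each server, and a virtual node for that server is placed at each resulting position. In the case above, suppose three virtual nodes are computed for each server, that is, the hash values of

"Node A1, Node A2, Node A3, Node B1, Node B2 and Node B3", which gives six virtual nodes:

The data location algorithm itself remains unchanged; only a mapping from virtual nodes to actual nodes is added. For example, data located on the virtual nodes Node A1, Node A2 and Node A3 is all stored on Node A. Because the virtual nodes are spread around the ring, data is distributed more evenly across the servers, which solves the problem of data skew.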
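
A minimal sketch of the virtual node mechanism, assuming CRC32 as the hash function and the naming scheme from the text ("Node A" plus an index); the class and method names are hypothetical. CRC32 keeps the example short, but it may spread such similar names less evenly than a stronger hash would.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Hash ring in which every physical node owns several virtual nodes.
class VirtualNodeRing {

    // Position of a virtual node on the ring -> physical node it belongs to.
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Assumed hash function: CRC32, giving a value in [0, 2^32).
    static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Registers virtual nodes "<node>1" .. "<node><replicas>", all mapped back to the physical node.
    void addNode(String node, int replicas) {
        for (int i = 1; i <= replicas; i++) {
            ring.put(hash(node + i), node);
        }
    }

    // Same clockwise lookup as before; the map value is already the physical node.
    String locate(String key) {
        if (ring.isEmpty()) {
            return null;
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    int virtualNodeCount() {
        return ring.size();
    }

    public static void main(String[] args) {
        VirtualNodeRing ring = new VirtualNodeRing();
        ring.addNode("Node A", 3); // Node A1, Node A2, Node A3
        ring.addNode("Node B", 3); // Node B1, Node B2, Node B3
        System.out.println(ring.virtualNodeCount() + " virtual nodes");
        System.out.println("a.png -> " + ring.locate("a.png"));
    }
}
```

The lookup code is unchanged from the plain ring; only the registration step differs, which matches the description that just one extra mapping is added. Real deployments typically use far more than three virtual nodes per server.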

  5, Conclusion:

The consistent hashing algorithm is designed for distributed systems in which any node may fail and nodes may be added or removed dynamically. Its benefit is that the system can keep providing good service when the number of nodes changes, because only a small part of the data has to be relocated.

Reference blog post 1: Principle of consistent hash algorithm - lpfuture - cnblogs.com

Reference blog post 2: Redis distributed algorithm principle - Hash consistency understanding - CSDN blog

Reference blog post 3: Interview prerequisites: what is consistency hash algorithm? - javabackend technology - CSDN blog

Posted by Wildhalf on Mon, 20 Sep 2021 10:55:06 +0530