Foreword
Following the previous article on the evolution of Internet architecture, this one covers:
1. ServiceMesh architecture design
2. High-availability design techniques
3. High-concurrency design techniques
First, a recommendation for a handy tool for drawing software design diagrams:
gitmind
(This picture only shows what drawing with gitmind looks like; please ignore its content.)
Concise, clear, and effective

Main text
ServiceMesh Architecture Design
Common business logic sinks into the infrastructure

Business programs should not have to care about communication components
Today these are imported as jar packages
This makes infrastructure components difficult to upgrade

Upgrading a service component's version
requires every business team to replace the jar in its package
This limits the infrastructure team's delivery capability and speed

Communication across multiple programming languages
Writing a separate set of infrastructure for every language is costly
because the communication component is coupled to the application
The business R&D teams and the infrastructure team should be physically decoupled
so that one set of infrastructure supports development in many languages
and applications can be written in multiple languages
Infrastructure capabilities sink from the application into a separate process
Service mesh
1. An independent process
2. Handles communication between services
3. Both stateful and stateless services will increasingly run on cloud-native platforms (Docker, k8s)
4. A lightweight network proxy
5. Deployed alongside the application and transparent to it
Service mesh architecture

The sidecar is an RPC service
application A -> sidecar A -> sidecar B -> application B
and the reverse direction by analogy
The communication protocol on both sides is TCP
The data protocol is protobuf
Whatever language the application is written in, a single sidecar implementation suffices
Service upgrades no longer depend on business teams
so business teams can iterate fast
Why must it be deployed together with the application?
It runs on the same physical machine or in the same k8s pod
TCP-based and lightweight
The application no longer needs to handle load balancing, retries, service registration, the configuration center, or routing
The earliest ServiceMesh

Case 1 - Baidu Space

It used a pull model
Baidu Space has large data volumes and relatively loose data-consistency requirements, so it is an asynchronous architecture

Case 2 - Social IM

The PC desktop client uses a TCP long-lived connection
The web client cannot hold a raw TCP long connection,
so it simulates one over HTTP with long polling
As with live streaming, the difficulty is withstanding simultaneous concurrency
A synchronous architecture, for real-time messaging

The routing layer is specific to IM; the other layers need no special attention
Split horizontally

At the very beginning
the business is simple and the team small, so keeping everything in one process is easy to maintain
Coarse granularity
Split vertically

Common logic layer
1. Componentize it as a jar package
2. Or turn it into a service: sink it into an independent service that provides a compatible interface
Then continue to split horizontally

Internet core technology practice
1. High-availability design techniques
2. High-concurrency design techniques (from architecture, code, and algorithms)
3. Service statelessness (one of the means to high availability)
4. Load balancing (common load-balancing algorithms, from a more general perspective)
5. Service idempotency (relies on distributed locks)
a. A request is repeated multiple times; the service guarantees the final result is exactly the same
b. Flash sale: only one item left in stock, multiple users snap it up at once; guarantee the product is not oversold
c. A user places an order; before it completes, duplicate orders are not allowed
d. A message is posted to MQ multiple times; downstream consumers guarantee deduplication
6. Distributed transactions
7. Service degradation, rate limiting, circuit breaking
8. Gray release
9. Full-link stress testing of services
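A minimal sketch of the idempotency idea from point 5, assuming each request carries a unique id. The in-process table here is illustrative; a real system would keep the seen-set in Redis (SETNX with a TTL) or behind a distributed lock, as the text notes.

```java
import java.util.concurrent.ConcurrentHashMap;

// Request-level idempotency sketch: only the first arrival of a given
// request id is processed; retries become no-ops. Class and method names
// are illustrative, not from any specific framework.
class IdempotentHandler {
    private final ConcurrentHashMap<String, Boolean> seen = new ConcurrentHashMap<>();

    /** Returns true only if this call actually processed the request. */
    boolean handle(String requestId) {
        // putIfAbsent is atomic: exactly one caller wins for a given id.
        if (seen.putIfAbsent(requestId, Boolean.TRUE) != null) {
            return false; // duplicate - skip the business operation
        }
        // ... perform the real business operation exactly once ...
        return true;
    }
}
```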
High Availability Design
Hardware always has a life cycle
1.
An x86 server (32 CPUs, 128 GB of memory, a 10-gigabit NIC, 1 TB of disk) costs about 50,000 (RMB)
and is typically used for about 3 years
Downtime is related to the number of servers: the more machines, the greater the chance that one is down
2.
Distributed systems face CAP
On a single machine only CA exists, because there is no network partition
The real choice is between AP and CP: across machine rooms a network partition can occur while the machines themselves are still available
Software always has bugs
Evaluation dimensions
- Unscientific (because traffic has low and high peaks):
percentage of downtime per year / quarter / month
Two nines (99%) allow 365*24*1% ≈ 88 hours of downtime per year
One nine (90%) allows 365*24*10% = 876 hours
- Scientific:
total requests affected by downtime / total requests
Redundancy: deploy multiple copies
Services are deployed in different cabinets and different racks
Stateless (fully peer-to-peer) services allow rapid scale-out and elastic scale-in
Counter-example: two machines, each holding a full set of session data, are not stateless;
if one service hangs and reboots, its sessions are gone, so the copies are not equivalent
Load balancing
gateway -> business logic tier instance 1 and instance 2
Process:
if business-logic instance 1 hangs,
the load-balancing component learns that 1 is down,
kicks 1 out of the pool,
transfers traffic to 2,
and restores 1 later
Asynchrony is a means to both high concurrency and high availability
Use it where you don't care much about the return result
and the work is not on the request's critical path
Core traffic stays synchronous; non-core traffic goes asynchronous
High availability is about more than the service layer:
the data layer must be highly available too
Real-time service monitoring

Implementation logic for real-time monitoring:
it is based on logs
A log record is written every 5 seconds
Since logging itself takes time,
write to the local disk,
collect the data with Flume,
send it to a Kafka message queue,
then compute real-time latency statistics with Spark
Service tiering - reducing and avoiding service failures

How to stop an online service seamlessly
Goal: downtime does no harm to users
i.e. if your request was accepted, it is 100% finished
New requests are rejected at the gateway level
Gateway hot-switch function
Hot-switch toggling
For example, to stop at 8 o'clock:
all requests that came in before 8 are processed
After 8, the switch flips from 0 to 1 and all new requests are denied
When are the requests from before 8 o'clock guaranteed to be processed?
1. Inelegant: log at every layer and check whether any layer is still processing
2. Elegant: front-end requests have a timeout
If the timeout is 5 seconds,
then by 8:00:05 everything in flight has either finished or timed out
Shutdown proceeds from the upper layers to the lower layers
Without a hot switch:
configure the machine's firewall so traffic is only allowed in and out until a certain point in time
request process
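The gateway hot switch described above can be sketched as a single atomic flag: flipping it makes the gateway reject all new requests, while requests admitted before the flip are allowed to finish (bounded by the front-end timeout). Names here are illustrative, not from a specific gateway product.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal hot-switch sketch for graceful shutdown at the gateway.
class GatewayKillSwitch {
    private final AtomicBoolean draining = new AtomicBoolean(false);

    // Flip from 0 to 1, e.g. at 8 o'clock via an admin endpoint.
    void flip() { draining.set(true); }

    // Called per request: true = accept and fully process, false = deny.
    boolean admit() { return !draining.get(); }
}
```

After `flip()`, waiting one front-end timeout (e.g. 5 seconds) guarantees every accepted request has completed or timed out, so the process can be stopped layer by layer.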

High Concurrency Design
Reduce latency
Increase throughput
Keep the system in a reasonable state

Trade space for time
Cache database data,
because a database access costs far more than a memory access
Trade time for space
Network transmission: HTTP communication with gzip compression; decompression consumes CPU time
Rarely-changing data
1. The shopping categories on an app page (level 1: phones; level 2: computers)
change rarely, so you should not pull them on every login;
judge by version number which ones were updated and download only those
2. The friends list is pulled once per login and changes infrequently
The list data carries a version number kept on both the server and the client
Check whether the version number changed, and pull only if it did
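The version-number check above can be sketched as follows. The types, the initial version, and the sample data are all illustrative; the point is that the server returns the list only when the client's version is stale.

```java
import java.util.List;

// Version-based incremental sync for rarely-changing lists
// (shopping categories, friend lists).
class VersionedList {
    private long version = 3;                        // bumped on every change
    private List<String> data = List.of("phone", "computer");

    /** Returns null when the client is already up to date. */
    synchronized List<String> pullIfNewer(long clientVersion) {
        return clientVersion >= version ? null : data;
    }

    synchronized long version() { return version; }
}
```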
Which requests take a lot of time?
1. If a service cluster handles 40,000 QPS,
optimize the top 5 interfaces that account for 90% of the traffic
The remaining QPS cannot be ignored either; neither can slow queries
2. Look at how many RPC requests a service makes
If data, algorithms, and other core steps are synchronous,
execute the non-core processes asynchronously,
splitting them into separate modules, for example behind message queues
3. One piece of logic calls multiple RPC interfaces
If there are no data dependencies between the interfaces, consider calling them in parallel
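Point 3 in miniature: when the RPC calls have no data dependencies, issuing them in parallel cuts the total wait to roughly the slowest call instead of the sum. `fetchUser` and `fetchOrders` are stand-ins for real RPC stubs.

```java
import java.util.concurrent.CompletableFuture;

// Parallel fan-out over independent RPC calls via CompletableFuture.
class ParallelCalls {
    static String fetchUser()   { return "user"; }   // pretend remote call
    static String fetchOrders() { return "orders"; } // pretend remote call

    static String aggregate() {
        CompletableFuture<String> user =
                CompletableFuture.supplyAsync(ParallelCalls::fetchUser);
        CompletableFuture<String> orders =
                CompletableFuture.supplyAsync(ParallelCalls::fetchOrders);
        // join() waits for both; total wait ~ max(latency), not the sum.
        return user.join() + "+" + orders.join();
    }
}
```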
Optimization levels

Code level
1. Don't make RPC requests in a loop;
call a batch interface instead and assemble the data
2. Avoid generating too many useless objects: for example, check isDebugEnabled() before building a debug message rather than calling log.debug() unconditionally
3. Are the initial capacities of ArrayList and HashMap set reasonably?
Resizing is expensive
For example, initialize directly to 1 million according to the actual business volume: it consumes some memory, but performance is guaranteed
4. Reuse data already queried through RPC interfaces
5. Copy the data instead of modifying it in place
For read-heavy, write-light data, use CopyOnWriteArrayList
6. With capacity preallocated, StringBuilder performs roughly 15x better than String concatenation
7. Is data initialized at the right time? Globally shared data should be eagerly initialized ("hungry" singleton style),
that is, initialized before any user access
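Two of the code-level points above (3 and 6) in miniature: presize collections so they never resize while filling, and preallocate StringBuilder capacity when the final size is roughly known. The sizes and names here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Presizing sketch: avoid repeated ArrayList resizes and StringBuilder
// buffer growth by stating the expected capacity up front.
class Presizing {
    static List<Integer> collect(int n) {
        List<Integer> out = new ArrayList<>(n); // no resizes while filling
        for (int i = 0; i < n; i++) out.add(i);
        return out;
    }

    static String join(List<Integer> xs) {
        StringBuilder sb = new StringBuilder(xs.size() * 4); // rough estimate
        for (int x : xs) sb.append(x).append(',');
        return sb.toString();
    }
}
```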
Database
1. For status values within the 0-255 range, use unsigned tinyint; store IPs as int instead of varchar
2. Where you would use enum, use tinyint instead, because extending an enum requires altering the table
3. Prohibit select *: it wastes IO, memory, CPU, and network
4. Analyze query scenarios and build appropriate indexes
Analyze field selectivity and index length; for long varchar columns use prefix indexes
5. Make fields NOT NULL
Nullable fields require additional storage to handle NULL
and are difficult to optimize
The goal is to bring down server CPU usage, IO traffic, memory usage, and network consumption, and to reduce response time
Locality principle

A two-dimensional array is laid out as a one-dimensional array in memory
First snippet (row-by-row): about 140 ms
Second snippet (column-by-column): about 2700 ms

The closer to the CPU, the faster
1. Speed increases up the hierarchy: memory -> L3 -> L2 -> L1 multi-level caches
2. In memory it is one large one-dimensional array:
a 2D array is arranged in memory row by row,
storing row a[0] first, then row a[1]
3. Traversing by row exploits the locality principle: cache hits (a high cache-hit rate)
4. Traversing by column: elements in the same column but adjacent rows are not contiguous in memory,
which is likely to cause cache misses
5. The CPU must load data from main memory, which is far slower than the L1 cache
(main memory ~100 ns, L1 cache ~0.5 ns)
6. Use a cache whenever you can, whether a local cache or a distributed cache
7. High-frequency access with low freshness requirements suits caching, for example ad slots
Highly time-sensitive data needs cache-coherency consideration and is not suitable for caching, for example transaction data
The code logic needs to adapt to how the data changes
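The two timed snippets above can be sketched like this: both orders compute the same sum over the same 2D array, but row-by-row access walks memory sequentially (cache hits) while column-by-column jumps across rows (cache misses), which is where the 140 ms vs 2700 ms gap comes from.

```java
// Row-major vs column-major traversal of the same 2D array.
// Java stores each row contiguously, so only the access pattern differs.
class Locality {
    static long sumByRow(int[][] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++)          // sequential memory walk
            for (int j = 0; j < a[i].length; j++) s += a[i][j];
        return s;
    }

    static long sumByCol(int[][] a) {
        long s = 0;
        for (int j = 0; j < a[0].length; j++)       // strided access: likely misses
            for (int i = 0; i < a.length; i++) s += a[i][j];
        return s;
    }
}
```

On a large array (say 10,000 x 10,000), timing the two methods reproduces the gap described above; exact numbers depend on the machine.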

1. explain: shows the SQL execution plan
2. possible_keys: idx_addtime
key = NULL means no index was used
Once a query covers more than about 30% of the table's data, the index is skipped and a full table scan is done
Report queries:
compute only the incremental data and merge it with the previous results
Concurrency and lock optimization
CAS-based lock-free approaches (reads take no lock; writes lock or retry via CAS) perform better than a mutex (where both reads and writes take the lock)
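A minimal sketch of the CAS idea: readers never block, and a writer retries only when another writer got in first. The retry loop below is equivalent to `AtomicInteger.incrementAndGet()` and is spelled out just to show the compare-and-swap step.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Lock-free counter via compare-and-swap.
class CasCounter {
    private final AtomicInteger value = new AtomicInteger();

    int increment() {
        while (true) {
            int current = value.get();
            // CAS succeeds only if nobody changed the value in between;
            // on failure, re-read and retry instead of blocking.
            if (value.compareAndSet(current, current + 1)) return current + 1;
        }
    }

    int get() { return value.get(); } // reads take no lock at all
}
```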
Case 1 - E-commerce flash-sale (seckill) system

Layered data checks
The upper layers filter out as many invalid requests as possible
The filtering may be inexact
Rate-limit layer by layer; the last layer does the data-consistency check and deducts inventory
Funnel pattern

1. Static data (HTML, JS, CSS files) is put on a CDN and cached on the client (app/browser)
2. Non-real-time dynamic data is cached at a location on the access link close to the user (product title, product description, whether the user is eligible for the flash sale, whether the flash sale has ended)
3. Real-time data: marketing data (red envelopes, discounts) and commodity inventory; filter out ineligible users
How to make sure you don't oversell:
DB transactions guarantee consistency
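The "never oversell" guarantee boils down to an atomic check-and-deduct. In the real system this is a conditional update inside a DB transaction (e.g. `UPDATE item SET stock = stock - 1 WHERE id = ? AND stock > 0`); the sketch below uses an in-process AtomicInteger as a stand-in for that atomic step.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Oversell-safe inventory sketch: deduct stock only if some remains,
// and do the check and the deduction as one atomic operation.
class Inventory {
    private final AtomicInteger stock;

    Inventory(int initial) { stock = new AtomicInteger(initial); }

    boolean tryBuy() {
        while (true) {
            int s = stock.get();
            if (s <= 0) return false;                   // sold out: reject
            if (stock.compareAndSet(s, s - 1)) return true; // atomic deduct
        }
    }
}
```

Even if many threads call `tryBuy()` concurrently on a stock of 1, exactly one succeeds.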

Case 2 - Feed System

Hotspot data is cached where the calling link is closer to the user

1. Memory stores the most active data
2. The L1 cache is small in capacity and handles the hottest data
The L2 cache's design goal is capacity: it caches a wider range of data,
such as ordinary users' timelines
Extremely hot data is cached separately: for example, a whitelist so that big-V users' data gets its own cache
3. The first 3 pages of a feed account for about 97% of requests, so the first few pages are cached as hotspot data in the L1 cache
4. The business logic layer often also keeps some caches of hot data, such as big-V user ids

Push mode
Pushes go only to active users
For example, 10,000 users need a push at the business logic layer:
split them into batches of 100 users
and push the 100 batches in parallel
Optimize strategically:
active users first
How do you distinguish active users?
Keep an active-user list, say 1 million entries long:
when a user comes online, write them into the list; when they go offline, delete them
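The batching step above can be sketched as follows: split the target users into fixed-size batches so each batch can be handed to one worker for a parallel push. Only the partitioning is shown; the push RPC itself is out of scope, and the names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Partition a user list into batches of batchSize (the last batch may
// be smaller). Note: subList returns views backed by the source list.
class Batching {
    static <T> List<List<T>> partition(List<T> users, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < users.size(); i += batchSize)
            batches.add(users.subList(i, Math.min(i + batchSize, users.size())));
        return batches;
    }
}
```

With 10,000 users and a batch size of 100, this yields 100 batches to push in parallel.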
Weibo data storage solution

1. Pika: key-value persistent storage
2. Object storage: Ceph, FastDFS
WeChat Moments use a combination of push and pull:
1. The message-reminder badge on the Discover tab is push
2. Clicking to open Moments is pull
Weibo's latest-data display logic
Say you have 500 friends: take each one's latest 100 posts, 50,000 items in total, and sort them in reverse timeline order in the business logic layer
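The pull-model merge above can be sketched like this: gather each friend's latest posts, merge them in the logic layer, and sort by timestamp descending to form the timeline. `Post` and the method names are illustrative.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Merge per-friend post lists into one reverse-chronological timeline.
class TimelineMerge {
    record Post(String author, long ts) {}

    static List<Post> merge(List<List<Post>> perFriend, int limit) {
        List<Post> all = new ArrayList<>();
        perFriend.forEach(all::addAll);                       // gather 500 x 100 posts
        all.sort(Comparator.comparingLong(Post::ts).reversed()); // newest first
        return all.subList(0, Math.min(limit, all.size()));   // first page only
    }
}
```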
Essentially there is no difference between WebSocket and a long-lived connection:
under the hood, WebSocket is also a long-lived connection
The web cannot use raw TCP,
so WebSocket encapsulates a long-lived connection on top of HTTP
Postscript
I will continue to share ServiceMesh-related practice
If you found this useful, give it a like 😄