regular expression
Matching symbols
-
What is a regular expression
Regular expressions are a tool that makes complex string-processing problems easier to solve.
Regular expression syntax is not specific to Python: most high-level programming languages support regular expressions, and the core syntax is largely the same across them.
Whatever problem you solve with a regular expression, writing one always means describing a rule that strings must follow.
-
Python's re module
The re module is Python's standard library module for regular expressions. It contains all the functions related to regular expressions.
fullmatch(regular expression, string) - checks whether the regular expression matches the specified string exactly (that is, whether the entire string conforms to the rule described by the regular expression). It returns a match object on success and None on failure.
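Because fullmatch returns either a match object or None, its result can be tested directly. The snippet below is a minimal sketch of that idea; the helper name `is_exact_match` is hypothetical and only used for illustration.

```python
from re import fullmatch


def is_exact_match(pattern: str, text: str) -> bool:
    """Hypothetical helper: True only if the whole text matches the pattern."""
    return fullmatch(pattern, text) is not None


whole = is_exact_match(r'abc', 'abc')     # the pattern covers the entire string
partial = is_exact_match(r'abc', 'abcd')  # the trailing 'd' is not covered

print(whole)    # True
print(partial)  # False
```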
-
Regular Grammar - Matching Class Symbols
from re import fullmatch # 1) Ordinary symbols - symbols that represent the symbols themselves in regular expressions result = fullmatch(r'abc', 'abcd') print(result) # 2). - matches any character result = fullmatch(r'a.c', 'a it is good c') print(result) result = fullmatch(r'..abc', '12dabc') print(result) # 3) \d - matches any digit character result = fullmatch(r'a\dc', 'a5c') print(result) # 4) \s - matches any whitespace character # whitespace - characters that produce whitespace effects, such as spaces, newlines, horizontal tabs result = fullmatch(r'a\sc', 'ac') print(result) # 5) \D - matches any non-digit character result = fullmatch(r'a\Dc', 'a1c') print(result) # 6)\S - matches any non-whitespace character result = fullmatch(r'a\Sc', 'a1c') print(result) # 7) [Character Set] - matches any character in the charset """ [abc] - match a or b or c [abc\d] - match a or b or c or any number [1-9] - matches any number from 1 to 9 [a-z] - matches any lowercase letter [A-Z] - matches any capital letter [a-zA-Z] - matches any letter [a-zA-Z\d_] - Match alphanumeric or underscore [\u4e00-\u9fa5] - Match any Chinese character Notice:[]The minus sign is placed between two characters to indicate who is going to whom (the way of determination is determined according to the size of the character encoding value); if the minus sign is not between the two characters, it means an ordinary minus sign """ result = fullmatch(r'a[MN12]b', 'aNb') print(result) result = fullmatch(r'a[MN\d]b', 'a8b') print(result) result = fullmatch(r'a[\u4e00-\u9fa5]c', 'a it is good c') print(result) result = fullmatch(r'a[A-Z]c', 'aBc') print(result) # 8) [^Charset] - matches any character not in the charset result = fullmatch(r'a[^MN]b', 'a_b') print(result) result = fullmatch(r'a[^\u4e00-\u9fa5]c', 'a0c') print(result)
number of matches
from re import fullmatch

# 1) * - 0 or more times (any number of times)
# Note: * controls how many times the item immediately before it is repeated
# a* - zero or more a
# \d* - zero or more digits
result = fullmatch(r'a*123', 'aaaaaaa123')
print(result)

result = fullmatch(r'\d*abc', '11234abc')
print(result)

result = fullmatch(r'[MN]*abc', 'NMabc')
print(result)

# 2) + - 1 or more times (at least once)
result = fullmatch(r'a+123', 'a123')
print(result)

# 3) ? - 0 or 1 time
result = fullmatch(r'A?123', 'A123')
print(result)

# 4) {}
# {M,N} - M to N times
# {M,} - at least M times
# {,N} - at most N times
# {N} - exactly N times
result = fullmatch(r'[a-z]{3,5}123', 'absds123')
print(result)

result = fullmatch(r'[a-z]{3,}123', 'absdsdfgsdf123')
print(result)

result = fullmatch(r'[a-z]{,3}123', 'a123')
print(result)

result = fullmatch(r'[a-z]{8}123', 'absdasdf123')
print(result)


# Exercise: write a regular expression that checks whether the input is a legal QQ number
# (5 to 12 digits, and the first digit cannot be 0)
def f1(qq: str):
    return bool(fullmatch(r'[1-9]\d{4,11}', qq))


qq = input('please enter a qq:')
print(f1(qq))


# Exercise: check whether the input is a legal identifier
# (made up of letters, digits and underscores, and not starting with a digit)
def f2(str1: str):
    return bool(fullmatch(r'[a-zA-Z_][\da-zA-Z_]*', str1))


str1 = input('Please enter an identifier:')
print(f2(str1))
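The two exercise patterns can also be checked without typing input by running them over a list of sample strings. This is only a sketch: the constant names and the sample values are made up for illustration.

```python
from re import fullmatch

# Same patterns as in the exercises; the constant names are illustrative.
QQ_PATTERN = r'[1-9]\d{4,11}'
IDENTIFIER_PATTERN = r'[a-zA-Z_][\da-zA-Z_]*'

# Sample inputs chosen to hit each rule: valid, leading 0, too short, too long.
qq_cases = ['10001', '0123456', '1234', '1234567890123']
qq_results = [bool(fullmatch(QQ_PATTERN, case)) for case in qq_cases]
print(qq_results)    # [True, False, False, False]

# Valid, leading underscore, leading digit, illegal character.
id_cases = ['abc_1', '_x', '1abc', 'a-b']
id_results = [bool(fullmatch(IDENTIFIER_PATTERN, case)) for case in id_cases]
print(id_results)    # [True, True, False, False]
```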
Greedy and non-greedy
from re import match

# match(regular expression, string) - checks whether the beginning of the string matches the rule described by the regular expression
result = match(r'\d{3}', '234hkdfjk')
print(result)
-
Greedy and non-greedy matching
When the number of repetitions is not fixed (*, +, ?, {M,N}, {M,}, {,N}), matching works in one of two modes, greedy or non-greedy. The default is greedy.
Greedy and non-greedy: where the repetition count is not fixed, several different counts may all lead to a successful match. Greedy takes the result with the most repetitions, and non-greedy takes the result with the fewest.
Greedy mode: *, +, ?, {M,N}, {M,}, {,N}
Non-greedy mode: *?, +?, ??, {M,N}?, {M,}?, {,N}?
# 1) Example 1:
# greedy mode
result = match(r'a.+b', 'ambcbdb')
print(result)    # matches 'ambcbdb'

# non-greedy mode
result = match(r'a.+?b', 'ambcbdb')
print(result)    # matches 'amb'

# Note: if only one match result is possible, greedy and non-greedy give the same result
# greedy mode
result = match(r'a.+b', 'ambc')
print(result)    # matches 'amb'

# non-greedy mode
result = match(r'a.+?b', 'ambc')
print(result)    # matches 'amb'
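The same greedy/non-greedy difference applies to the range and optional quantifiers, not only to +. A small sketch, using a made-up sample string:

```python
from re import match

text = '12345abc'

# {M,N}: greedy takes as many digits as allowed, non-greedy as few as allowed.
greedy_range = match(r'\d{2,4}', text).group()
lazy_range = match(r'\d{2,4}?', text).group()
print(greedy_range)  # 1234
print(lazy_range)    # 12

# ?: greedy takes the optional digit, non-greedy (??) skips it.
greedy_optional = match(r'\d?', text).group()
lazy_optional = match(r'\d??', text).group()
print(repr(greedy_optional))  # '1'
print(repr(lazy_optional))    # ''
```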
grouping and branching
-
Grouping - ()
1) Treat as a whole - apply an operation (for example a repetition count) to part of the regular expression as a single unit
2) Repeat - \M inside the regular expression repeats the text matched by the Mth group that appears before it
3) Capture - get only part of what the regular expression matched (capture is either manual or automatic)
from re import fullmatch, findall

# findall(regular expression, string) - gets all substrings of the string that match the regular expression

# '12DF45ER65ER45WE'
result = fullmatch(r'(\d\d[A-Z]{2})+', '12DF45ER65ER45WE')
print(result)

# 23m23,98k98,12p12 - True
# 23m34,98k18 - False
result = fullmatch(r'(\d{2})[a-z]\1[a-z]\1', '12d12d12')
print(result)

result = fullmatch(r'(\d{2})([a-z]{3})=\2\1{3}', '12abc=abc121212')
print(result)

# \M can only repeat the content of a group that appears before it, not one that appears after it
# result = fullmatch(r'(\d{2})\1=\2([a-z]{3})', '12abc=abc12')    # Error!
result = fullmatch(r'(\d{2})\1=([a-z]{3})\2', '1212=abcabc')
print(result)

# Extract the numeric substrings that correspond to amounts of money in the message
message = 'I am 18 years old, with a monthly salary of 500000 yuan, height 180, weight 70 kg, and 8-pack abs. I pay a Tencent membership fee of 300 yuan per year, a monthly mortgage of 3000 yuan, and a car loan of 2200 yuan per month.'
result = findall(r'(\d+) yuan', message)
print(result)    # ['500000', '300', '3000', '2200']
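Repetition (\1) and automatic capture can be combined. The sketch below finds words that are accidentally written twice in a row; the sample sentence is made up for illustration.

```python
from re import findall

sentence = 'it was was a very very nice trip'

# ([a-z]+) captures a word, and \1 requires the same word again after one space.
# Because the pattern contains a group, findall returns only the captured part.
doubled_words = findall(r'([a-z]+) \1', sentence)
print(doubled_words)    # ['was', 'very']
```

Without the parentheses there would be nothing for \1 to refer to, so the group is doing both jobs here: defining what to repeat and what to capture.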
-
Branch - |
regex1|regex2|regex3|… - matches a string that matches any one of the alternative regular expressions
# Match a string of three digits or two lowercase letters
result = fullmatch(r'\d{3}|[a-z]{2}', '234')
print(result)

# a236b,amvb
result = fullmatch(r'a\d{3}b|a[a-z]{2}b', 'amvb')
print(result)

# Note: if only part of the regular expression should offer a choice, put the part that varies in a group
result = fullmatch(r'a(\d{3}|[a-z]{2})b', 'amvb')
print(result)
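The reason for the grouping is that | splits the whole pattern, not just the symbols next to it. A short sketch comparing an ungrouped and a grouped pattern (the variable names are illustrative):

```python
from re import fullmatch

# Without a group, the alternatives are 'a\d{3}' and '[a-z]{2}b'.
ungrouped = r'a\d{3}|[a-z]{2}b'
# With a group, only the middle part varies: 'a' + (choice) + 'b'.
grouped = r'a(\d{3}|[a-z]{2})b'

ungrouped_short = bool(fullmatch(ungrouped, 'a123'))   # left alternative matches
ungrouped_long = bool(fullmatch(ungrouped, 'a123b'))   # neither alternative covers it
grouped_short = bool(fullmatch(grouped, 'a123'))       # the final 'b' is missing
grouped_long = bool(fullmatch(grouped, 'a123b'))       # 'a' + three digits + 'b'

print(ungrouped_short, ungrouped_long)  # True False
print(grouped_short, grouped_long)      # False True
```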
escape symbol
-
escape symbol
Escaping in a regular expression means adding '\' before a symbol that has a special function or special meaning, so that the symbol becomes an ordinary symbol.
# Match any string that represents a decimal number
result = fullmatch(r'\d+\.\d+', '23.897')
print(result)

# +123,+5456
result = fullmatch(r'\+\d+', '+123456')
print(result)

# (mv),(ahsjkd)
result = fullmatch(r'\([a-z]+\)', '(msk)')
print(result)

# A literal backslash followed by digits
result = fullmatch(r'\\\d+', r'\465654654')
print(result)
-
Escape symbols inside []
Symbols that have a special meaning on their own (+, *, ?, ., etc.) automatically lose that special meaning inside [].
result = fullmatch(r'\d+[.]\d+', '23.897')
print(result)

# Symbols that do have a special function inside square brackets (- between two characters, ^ at the start)
# must be preceded by '\' to be treated as ordinary symbols, or be placed where they are not special
result = fullmatch(r'a[M\-N]b', 'a-b')
print(result)

result = fullmatch(r'a[MN-]b', 'a-b')
print(result)

result = fullmatch(r'a[\^MN]b', 'a^b')
print(result)

result = fullmatch(r'a[MN^]b', 'a^b')
print(result)
Detection symbols
-
\b - checks for a word boundary
Word boundary: a position that separates two words, such as next to whitespace or punctuation, or at the beginning or end of the string. \b matches a position, not a character: one side is a word character (letter, digit or underscore) and the other side is not.
result = fullmatch(r'abc\b mn', 'abc mn')
print(result)

message = '203mn45,89 mn12de;99mll==910,230 90='
result = findall(r'\d+', message)
print(result)    # ['203', '45', '89', '12', '99', '910', '230', '90']

result = findall(r'\d+\b', message)
print(result)    # ['45', '89', '910', '230', '90']

result = findall(r'\b\d+', message)
print(result)    # ['203', '89', '99', '910', '230', '90']

result = findall(r'\b\d+\b', message)
print(result)    # ['89', '910', '230', '90']
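A common use of \b is searching for a whole word rather than a fragment of a longer word. A minimal sketch with a made-up sample string:

```python
from re import findall

text = 'cat concat cats cat.'

# Plain 'cat' also matches inside 'concat' and 'cats'.
anywhere = findall(r'cat', text)
# \bcat\b only matches 'cat' when both sides are word boundaries.
whole_word = findall(r'\bcat\b', text)

print(len(anywhere))    # 4
print(len(whole_word))  # 2
```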
-
\B - checks that the position is not a word boundary
# Only digits with a word character on both sides can match
result = findall(r'\B\d+\B', message)
print(result)    # ['03', '4', '12', '9', '1', '3']
-
^ - checks for the beginning of the string
result = findall(r'^\d+', message)
print(result)    # ['203']

# Extract the first 5 characters of a string
result = findall(r'^.{5}', message)
print(result)    # ['203mn']
-
$ - checks for the end of the string
# Extract the last 5 characters of a string
result = findall(r'.{5}$', message)
print(result)    # ['0 90=']
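Using ^ and $ together pins a pattern to both ends of the string, which behaves much like fullmatch. One difference worth knowing is that $ also matches just before a trailing newline. A sketch with made-up sample strings:

```python
from re import findall, fullmatch

# Anchored at both ends: the whole string must be digits.
only_digits = findall(r'^\d+$', '2023')
with_letter = findall(r'^\d+$', '2023a')
print(only_digits)   # ['2023']
print(with_letter)   # []

# $ tolerates one trailing newline, fullmatch does not.
anchored_newline = findall(r'^\d+$', '2023\n')
full_newline = fullmatch(r'\d+', '2023\n')
print(anchored_newline)  # ['2023']
print(full_newline)      # None
```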
re module
-
Common functions
from re import *

# 1) fullmatch(regular expression, string) - complete match: checks whether the entire string conforms to the rule
# described by the regular expression. Returns a match object if the match succeeds and None if it fails.
result = fullmatch(r'abc', 'abc')
print(result)

# 2) match(regular expression, string) - matches at the beginning of the string: checks whether the beginning of
# the string conforms to the rule described by the regular expression. Returns a match object if the match
# succeeds and None if it fails.
result = match(r'abc', 'abc12345')
print(result)

# 3) search(regular expression, string) - gets the first substring of the string that matches the regular
# expression. Returns a match object if one is found and None otherwise.
result = search(r'1', '1a1b1c')
print(result)

# 4) findall(regular expression, string) - gets all substrings of the string that match the regular expression.
# Returns a list whose elements are strings.
# Note: if the regular expression contains a group, the group is captured automatically
# (only the part matched by the group is returned)
result = findall(r'1', 'a1b1c1')
print(result)    # ['1', '1', '1']

# 5) finditer(regular expression, string) - gets all substrings of the string that match the regular expression.
# Returns an iterator whose elements are the match objects of the matched substrings.
result = finditer(r'1', 'a1b1c1')
print(result)

# 6) split(regular expression, string) - cuts the string, using every substring that matches the regular
# expression as a cutting point
result = split(r'\\\d+', r'\1abc\2abc\56abc\789abc')
print(result)    # ['', 'abc', 'abc', 'abc', 'abc']

# 7) sub(regular expression, string 1, string 2) - replaces every substring of string 2 that matches the
# regular expression with string 1
result = sub(r'\\\d+', '-', r'\1abc\2abc\56abc\789abc')
print(result)    # -abc-abc-abc-abc

print('Dividing line:', '==' * 50)
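The note about automatic capture in findall, and the iterator returned by finditer, are easier to see side by side. A small sketch using a made-up sample string:

```python
from re import findall, finditer

text = 'x=10, y=25, z=7'

# No group: findall returns the whole matched substrings.
whole = findall(r'[a-z]=\d+', text)
# One group: findall returns only what the group matched.
one_group = findall(r'[a-z]=(\d+)', text)
# Several groups: findall returns one tuple per match.
two_groups = findall(r'([a-z])=(\d+)', text)

print(whole)       # ['x=10', 'y=25', 'z=7']
print(one_group)   # ['10', '25', '7']
print(two_groups)  # [('x', '10'), ('y', '25'), ('z', '7')]

# finditer yields match objects, so the full match and its position stay available.
spans = [(m.group(), m.span()) for m in finditer(r'([a-z])=(\d+)', text)]
print(spans)       # [('x=10', (0, 4)), ('y=25', (6, 10)), ('z=7', (12, 15))]
```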
-
Match object
result = search(r'(\d{3})([A-Z]{2})', 'jsdha123mn123ASasfda125OPp')
print(result)

# 1) Get the match result for the entire regular expression: match_object.group()
print(result.group())    # 123AS

# 2) Manually capture the match result of one group: match_object.group(group number)
print(result.group(1))    # 123
print(result.group(2))    # AS

# 3) Get the position of the match result in the original string: match_object.span()
print(result.span())    # (10, 15)
print(result.span(2))    # (13, 15)
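A match object offers a few more accessors of the same kind, and because search returns None when nothing matches, it is safer to check the result before calling them. A minimal sketch reusing the string from the example above:

```python
from re import search

found = search(r'(\d{3})([A-Z]{2})', 'jsdha123mn123ASasfda125OPp')

all_groups = found.groups()   # every captured group as a tuple
start = found.start()         # same as found.span()[0]
end = found.end()             # same as found.span()[1]
print(all_groups)  # ('123', 'AS')
print(start, end)  # 10 15

# Calling .group() on None would raise AttributeError, so test first.
missing = search(r'\d{5}', 'abc')
text_or_default = missing.group() if missing else 'no match'
print(text_or_default)  # no match
```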
-
Parameters (inline flags)
# 1) Ignore case when matching: (?i)
result = fullmatch(r'(?i)abc', 'aBC')
print(result)

# 2) Single-line matching: (?s)
# Multi-line matching (the default): . cannot match a newline (\n)
# Single-line matching: . can match a newline (\n)
result = fullmatch(r'(?s)abc.123', 'abc\n123')
print(result)
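The same two settings can also be passed through the flags argument of the re functions instead of being written inside the pattern. A short sketch of the equivalent calls:

```python
import re

# re.IGNORECASE corresponds to (?i), re.DOTALL corresponds to (?s).
ignore_case = re.fullmatch(r'abc', 'aBC', flags=re.IGNORECASE)
dot_all = re.fullmatch(r'abc.123', 'abc\n123', flags=re.DOTALL)
# Flags are combined with |.
both = re.fullmatch(r'abc.123', 'ABC\n123', flags=re.IGNORECASE | re.DOTALL)
# Without DOTALL, . does not match the newline.
default = re.fullmatch(r'abc.123', 'abc\n123')

print(ignore_case is not None)  # True
print(dot_all is not None)      # True
print(both is not None)         # True
print(default)                  # None
```

Keeping the flags outside the pattern can be handy when the same pattern string is reused with different settings.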