# day-17 regular expression

## matching symbols

1. What is a regular expression

A regular expression (regex) is a tool that makes complex string problems easier to solve.
Regular expressions are not Python-specific syntax: most high-level programming languages support them, and the core syntax is largely the same everywhere.
Whatever problem you are solving, writing a regular expression means describing the rules that a string must follow.

2. Python's re module

The re module is the module Python provides to support regular expressions. It contains the functions related to regular expressions.

fullmatch(regular expression, string) - checks whether the regular expression matches the entire string (that is, whether the whole string conforms to the rules the expression describes). It returns a match object on success and None on failure.
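
Because the result is either a match object or None, it can be turned into a plain yes/no answer. The following is only a minimal sketch; the helper name `is_abc` is made up for this illustration.

```python
from re import fullmatch


def is_abc(text: str) -> bool:
    """Hypothetical helper: True only if the whole string is exactly 'abc'."""
    # fullmatch returns a match object on success and None on failure
    return fullmatch(r'abc', text) is not None


print(is_abc('abc'))   # True
print(is_abc('abcd'))  # False: the whole string must match, not just a prefix
print(is_abc('xabc'))  # False
```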

3. Regex syntax - matching symbols

```from re import fullmatch
# 1) Ordinary symbols - symbols that represent the symbols themselves in regular expressions
result = fullmatch(r'abc', 'abcd')
print(result)
# 2). - matches any character
result = fullmatch(r'a.c', 'a it is good c')
print(result)
result = fullmatch(r'..abc', '12dabc')
print(result)
# 3) \d - matches any digit character
result = fullmatch(r'a\dc', 'a5c')
print(result)
# 4) \s - matches any whitespace character
# whitespace - characters that render as blank space, such as spaces, newlines and horizontal tabs
result = fullmatch(r'a\sc', 'ac')
print(result)  # None: there is no whitespace character between 'a' and 'c'
# 5) \D - matches any non-digit character
result = fullmatch(r'a\Dc', 'a1c')
print(result)  # None: '1' is a digit
# 6) \S - matches any non-whitespace character
result = fullmatch(r'a\Sc', 'a1c')
print(result)
# 7) [character set] - matches any one character in the set
# [abc] - matches a, b or c
# [abc\d] - matches a, b, c or any digit
# [1-9] - matches any digit from 1 to 9
# [a-z] - matches any lowercase letter
# [A-Z] - matches any uppercase letter
# [a-zA-Z] - matches any letter
# [a-zA-Z\d_] - matches a letter, a digit or an underscore
# [\u4e00-\u9fa5] - matches any Chinese character in the range \u4e00 to \u9fa5
# Note: inside [], a minus sign between two characters denotes the range from the first
# character to the second (ordered by character code value, so the first must not be
# larger than the second); a minus sign that is not between two characters is an ordinary minus sign
result = fullmatch(r'a[MN12]b', 'aNb')
print(result)
result = fullmatch(r'a[MN\d]b', 'a8b')
print(result)
result = fullmatch(r'a[\u4e00-\u9fa5]c', 'a it is good c')
print(result)
result = fullmatch(r'a[A-Z]c', 'aBc')
print(result)
# 8) [^Charset] - matches any character not in the charset
result = fullmatch(r'a[^MN]b', 'a_b')
print(result)
result = fullmatch(r'a[^\u4e00-\u9fa5]c', 'a0c')
print(result)
```
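
As a small illustration of character sets, the sketch below checks a short hex colour such as `#0fA`. The helper name `is_short_hex_color` and the colour format are assumptions chosen for this example, not part of the original notes. Since quantifiers have not been introduced yet, the set is simply repeated three times by hand.

```python
from re import fullmatch


def is_short_hex_color(text: str) -> bool:
    """Hypothetical helper: True for '#' followed by exactly three hex digits."""
    hex_digit = r'[0-9a-fA-F]'      # one character from the ranges 0-9, a-f, A-F
    pattern = '#' + hex_digit * 3   # the same set written three times in a row
    return fullmatch(pattern, text) is not None


print(is_short_hex_color('#0fA'))  # True
print(is_short_hex_color('#0fG'))  # False: 'G' is outside every range in the set
print(is_short_hex_color('#0f'))   # False: one character is missing
```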

## number of matches

```python
from re import fullmatch

# 1. * - 0 or more times (any number of times)
# Note: a quantifier controls how many times the item immediately before it is repeated
# a* - zero or more a
# \d* - zero or more digits
result = fullmatch(r'a*123', 'aaaaaaa123')
print(result)
result = fullmatch(r'\d*abc', '11234abc')
print(result)
result = fullmatch(r'[MN]*abc', 'NMabc')
print(result)
# 2.+ - 1 or more times (at least once)
result = fullmatch(r'a+123', 'a123')
print(result)
# 3.? - 0 or 1 time
result = fullmatch(r'A?123', 'A123')
print(result)
# 4. {}
# {M,N} - M to N times
# {M,}  - at least M times
# {,N}  - at most N times
# {N}   - exactly N times
result = fullmatch(r'[a-z]{3,5}123', 'absds123')
print(result)
result = fullmatch(r'[a-z]{3,}123', 'absdsdfgsdf123')
print(result)
result = fullmatch(r'[a-z]{,3}123', 'a123')
print(result)
result = fullmatch(r'[a-z]{8}123', 'absdasdf123')
print(result)

# Exercise: write a regex that checks whether the input is a legal QQ number (5 to 12 digits, and the first digit cannot be 0)
def f1(qq: str):
    return bool(fullmatch(r'[1-9]\d{4,11}', qq))

qq = input('Please enter a QQ number:')
print(f1(qq))

# Exercise: check whether the input is a legal identifier (made up of letters, digits and underscores, and not starting with a digit)
def f2(str1: str):
    return bool(fullmatch(r'[a-zA-Z_][\da-zA-Z_]*', str1))

str1 = input('Please enter an identifier:')
print(f2(str1))
```

## greedy and non-greedy

```python
from re import match

# match(regular expression, string) - checks whether the beginning of the string matches the regular expression
result = match(r'\d{3}', '234hkdfjk')
print(result)
```
1. Greedy and non-greedy

When the number of repetitions is not fixed (*, +, ?, {M,N}, {M,}, {,N}), matching works in one of two modes, greedy or non-greedy. Greedy is the default.

Greedy versus non-greedy: when several repetition counts would all let the match succeed, greedy takes the result with the most repetitions and non-greedy takes the result with the fewest.

Greedy mode: *, +, ?, {M,N}, {M,}, {,N}

Non-greedy mode: *?, +?, ??, {M,N}?, {M,}?, {,N}?

```python
# 1) Example 1:
# greedy mode
result = match(r'a.+b', 'ambcbdb')
print(result)  # matches 'ambcbdb'
# non-greedy mode
result = match(r'a.+?b', 'ambcbdb')
print(result)  # matches 'amb'
# Note: if only one match is possible, greedy and non-greedy give the same result
# greedy mode
result = match(r'a.+b', 'ambc')
print(result)  # matches 'amb'
# non-greedy mode
result = match(r'a.+?b', 'ambc')
print(result)  # matches 'amb'
```
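
A common place where the difference matters is pulling out delimited pieces of text. The sketch below uses a made-up sentence and `findall` (described later in these notes) to show how the greedy form swallows everything between the first and last quote, while the non-greedy form stops at the nearest closing quote.

```python
from re import findall

text = 'say "hello" and "bye"'

# Greedy: .+ runs to the last quote in the string
greedy = findall(r'".+"', text)

# Non-greedy: .+? stops at the first quote that lets the match succeed
lazy = findall(r'".+?"', text)

print(greedy)  # ['"hello" and "bye"']
print(lazy)    # ['"hello"', '"bye"']
```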

## grouping and branching

1. Grouping - ()

1) Whole - treat part of a regular expression as a single unit, so that an operation such as a quantifier applies to all of it
2) Repeat - use \M in the regular expression to match again the exact text that the Mth group matched earlier
3) Capture - extract only part of what the regular expression matched (capture is either manual or automatic)

```python
from re import fullmatch, findall

# findall(regular expression, string) - gets all substrings of the string that match the regular expression

# '12DF45ER65ER45WE'
result = fullmatch(r'(\d\d[A-Z]{2})+', '12DF45ER65ER45WE')
print(result)
# 23m23,98k98,12p12 - True
# 23m34,98k18 - False
result = fullmatch(r'(\d{2})[a-z]\1[a-z]\1', '12d12d12')
print(result)
result = fullmatch(r'(\d{2})([a-z]{3})=\2\1{3}', '12abc=abc121212')
print(result)
# \M can only repeat the content of the group before it, not the content that appears after it
# result = fullmatch(r'(\d{2})\1=\2([a-z]{3})', '12abc=abc12')  # Error!
result = fullmatch(r'(\d{2})\1=([a-z]{3})\2', '1212=abcabc')
print(result)
# Extract the numeric substrings that are amounts of money in the message
message = 'I am 18 years old, with a monthly salary of 500000 yuan, height 180, weight 70 kg, and 8-pack abs. I pay a Tencent membership fee of 300 yuan per year, a mortgage of 3000 yuan per month, and a car loan of 2200 yuan per month.'
result = findall(r'(\d+) yuan', message)
print(result)  # ['500000', '300', '3000', '2200']
```
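
When a pattern contains more than one group, `findall` returns a list of tuples, one element per group. The following is a small sketch with made-up sample text to show the difference between a pattern with groups and the same pattern without them.

```python
from re import findall

text = 'width=30, height=45'

# Two groups: each result is a (name, value) tuple
pairs = findall(r'([a-z]+)=(\d+)', text)

# No groups: each result is the whole matched substring
whole = findall(r'[a-z]+=\d+', text)

print(pairs)  # [('width', '30'), ('height', '45')]
print(whole)  # ['width=30', 'height=45']
```
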
2. branch - |

regex1|regex2|regex3|… - matches a string that matches any one of the alternatives

For example, match a string made of three digits or of two lowercase letters:

```python
result = fullmatch(r'\d{3}|[a-z]{2}', '234')
print(result)
# a236b,amvb
result = fullmatch(r'a\d{3}b|a[a-z]{2}b', 'amvb')
print(result)
# Note: if only part of the regular expression should offer a choice, put the alternatives for that part in a group
result = fullmatch(r'a(\d{3}|[a-z]{2})b', 'amvb')
print(result)
```
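
The reason grouping matters is that `|` splits the entire pattern on either side of it. The sketch below, using made-up sample words, contrasts an ungrouped branch with a grouped one.

```python
from re import fullmatch

# Ungrouped: the alternatives are 'gra' and 'ey', not 'gray' and 'grey'
ungrouped_grey = fullmatch(r'gra|ey', 'grey')
ungrouped_ey = fullmatch(r'gra|ey', 'ey')

# Grouped: only the middle letter varies
grouped_grey = fullmatch(r'gr(a|e)y', 'grey')
grouped_gray = fullmatch(r'gr(a|e)y', 'gray')

print(ungrouped_grey)              # None
print(ungrouped_ey is not None)    # True
print(grouped_grey is not None)    # True
print(grouped_gray is not None)    # True
```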

## escape symbol

1. escape symbol

Escaping in a regular expression means adding '\' before a symbol that has a special function or meaning, so that the symbol is treated as an ordinary character.

```python
# Match any string that represents a decimal number
result = fullmatch(r'\d+\.\d+', '23.897')
print(result)
# +123,+5456
result = fullmatch(r'\+\d+', '+123456')
print(result)
# (mv),(ahsjkd)
result = fullmatch(r'\([a-z]+\)', '(msk)')
print(result)
result = fullmatch(r'\\\d+', r'\465654654')
print(result)
```
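
When the text to be matched literally comes from somewhere else, the backslashes do not have to be added by hand: the re module provides `escape()` for this. The sketch below uses a made-up literal to show the difference.

```python
from re import escape, fullmatch

literal = '1+1=2?'  # contains the special symbols + and ?

# Used directly as a pattern, + and ? act as quantifiers, so the text does not match itself
unescaped_result = fullmatch(literal, literal)

# escape() adds the backslashes, so every character is matched literally
escaped_result = fullmatch(escape(literal), literal)

print(unescaped_result)            # None
print(escaped_result is not None)  # True
```
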
2. Escape symbols inside []

Symbols that have a special meaning on their own (+, *, ?, ., etc.) lose that meaning automatically inside [].

```python
result = fullmatch(r'\d+[.]\d+', '23.897')
print(result)
# Symbols that are special inside square brackets (- and ^) must either be escaped with '\'
# or be placed where they cannot be special (- at the end, ^ anywhere but the start)
result = fullmatch(r'a[M\-N]b', 'a-b')
print(result)
result = fullmatch(r'a[MN-]b', 'a-b')
print(result)
result = fullmatch(r'a[\^MN]b', 'a^b')
print(result)
result = fullmatch(r'a[MN^]b', 'a^b')
print(result)
```

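## detection symbols

1. \b - check for a word boundary

Word boundary: any position that separates two words, for example next to whitespace or English punctuation, or at the beginning or end of the string. More precisely, \b matches the zero-width position between a word character (letter, digit or underscore) and a non-word character or the start/end of the string. Detection symbols only check a position; they do not consume any characters.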

```python
result = fullmatch(r'abc\b mn', 'abc mn')
print(result)
message = '203mn45,89 mn12de;99mll==910,230 90='
result = findall(r'\d+', message)
print(result)  # ['203', '45', '89', '12', '99', '910', '230', '90']
result = findall(r'\d+\b', message)
print(result)  # ['45', '89', '910', '230', '90']
result = findall(r'\b\d+', message)
print(result)  # ['203', '89', '99', '910', '230', '90']
result = findall(r'\b\d+\b', message)
print(result)  # ['89', '910', '230', '90']
```
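
A typical use of \b is searching for a whole word rather than any occurrence of its letters. The sketch below uses a made-up sentence to compare the two.

```python
from re import findall

text = 'cat concat cat. category'

# Without boundaries: every occurrence of the letters c-a-t
anywhere = findall(r'cat', text)

# With boundaries: only 'cat' standing alone as a word
whole_word = findall(r'\bcat\b', text)

print(len(anywhere))  # 4
print(whole_word)     # ['cat', 'cat']
```
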
2. \B - check that the position is not a word boundary

```python
result = findall(r'\B\d+\B', message)
print(result)  # ['03', '4', '12', '9', '1', '3']
```
3. ^ - check if it is the beginning of a string

```python
result = findall(r'^\d+', message)
print(result)  # ['203']
# Extract the first 5 characters of a string
result = findall(r'^.{5}', message)
print(result)  # ['203mn']
```
4. $ - check for the end of the string

```python
# Extract the last 5 characters of a string
result = findall(r'.{5}$', message)
print(result)  # ['0 90=']
```

## re module

1. Common functions

```python
from re import fullmatch, match, search, findall, finditer, split, sub

# 1) fullmatch(regular expression, string) - full match: checks whether the entire string conforms to the regular expression. Returns a match object if the match succeeds and None if it fails.
result = fullmatch(r'abc', 'abc')
print(result)
# 2) match(regular expression, string) - matches at the beginning of the string: checks whether the beginning of the string conforms to the regular expression. Returns a match object on success and None on failure.
result = match(r'abc', 'abc12345')
print(result)
# 3) search(regular expression, string) - finds the first substring that matches the regular expression. Returns a match object if one is found and None otherwise.
result = search(r'1', '1a1b1c')
print(result)
# 4) findall(regular expression, string) - gets all substrings that match the regular expression and returns them as a list of strings.
# Note: if the regular expression contains a group, findall captures automatically and returns only what the group matched
result = findall(r'1', 'a1b1c1')
print(result)  # ['1', '1', '1']
# 5) finditer(regular expression, string) - gets all substrings that match the regular expression and returns an iterator whose elements are the match objects for each substring
result = finditer(r'1', 'a1b1c1')
print(result)  # an iterator object; loop over it or call list() to see the match objects
# 6) split(regular expression, string) - splits the string, using every substring that matches the regular expression as a cut point
result = split(r'\\\d+', r'\1abc\2abc\56abc\789abc')
print(result)  # ['', 'abc', 'abc', 'abc', 'abc']
# 7) sub(regular expression, string 1, string 2) - replaces every substring of string 2 that matches the regular expression with string 1
result = sub(r'\\\d+', '-', r'\1abc\2abc\56abc\789abc')
print(result)  # -abc-abc-abc-abc
print('Dividing line:', '==' * 50)
```
2. match object

```python
result = search(r'(\d{3})([A-Z]{2})', 'jsdha123mn123ASasfda125OPp')
print(result)
# 1) Get the text matched by the entire regular expression: match object.group()
print(result.group())  # 123AS
# 2) Manually capture the text matched by one group: match object.group(group number)
print(result.group(1))  # 123
print(result.group(2))  # AS
# 3) Get the position of the match in the original string: match object.span()
print(result.span())  # (10, 15)
print(result.span(2))  # (13, 15)
```
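
The match object methods combine naturally with `finditer`, which yields one match object per hit. The following is a minimal sketch with a made-up input string that collects each matched text together with its position.

```python
from re import finditer

text = 'a1b22c333'

# Each element produced by finditer is a match object,
# so group() and span() are available for every hit
hits = [(m.group(), m.span()) for m in finditer(r'\d+', text)]

for matched_text, position in hits:
    print(matched_text, position)
# 1 (1, 2)
# 22 (3, 5)
# 333 (6, 9)
```
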
3. flags

```python
# 1) Ignore case: (?i)
result = fullmatch(r'(?i)abc', 'aBC')
print(result)
# 2) Single-line matching: (?s)
# Default (multi-line) matching: . cannot match a newline (\n)
# Single-line matching: . can also match a newline (\n)
result = fullmatch(r'(?s)abc.123', 'abc\n123')
print(result)
```
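
The same two behaviours can also be requested through the `flags` argument instead of writing them inside the pattern. The sketch below compares the inline form with the re module constants `re.IGNORECASE` and `re.DOTALL`.

```python
import re

# Ignore case: inline (?i) and the IGNORECASE flag behave the same way here
inline_i = re.fullmatch(r'(?i)abc', 'aBC')
flag_i = re.fullmatch(r'abc', 'aBC', flags=re.IGNORECASE)

# Single-line matching: by default . does not match '\n'; with DOTALL it does
dot_default = re.fullmatch(r'abc.123', 'abc\n123')
dot_all = re.fullmatch(r'abc.123', 'abc\n123', flags=re.DOTALL)

print(inline_i is not None, flag_i is not None)  # True True
print(dot_default)                               # None
print(dot_all is not None)                       # True
```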

Posted by synstealth on Fri, 16 Sep 2022 22:58:37 +0530