day-17 regular expression

regular expression

match class symbol

  1. what is regular expression

    Regular expressions are a tool that makes complex string problems easier to solve.
    Regular expressions are not Python-specific syntax (they do not belong to Python): nearly all high-level programming languages support them, and the core syntax is largely the same everywhere.
    Whatever problem you solve with them, writing a regular expression means describing a rule that strings must follow.

  2. Python's re module

    The re module is the module Python provides to support regular expressions. It contains the functions related to regular expressions.

    fullmatch(regular expression, string) - determines whether the regular expression matches the specified string exactly (that is, whether the entire string conforms to the rule described by the regular expression)
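
As a quick illustration of the behaviour described above, the sketch below calls fullmatch on two made-up inputs. The pattern and sample strings are assumptions chosen for this example; the point is only that a successful match gives back a match object and a failed one gives back None.

```python
from re import fullmatch

# The whole string must fit the pattern, not just the start of it.
ok = fullmatch(r'abc', 'abc')     # a match object
bad = fullmatch(r'abc', 'abcd')   # None, because of the trailing 'd'

print(ok)
print(bad)
```

Because None is falsy and a match object is truthy, the result can be used directly in an if statement or wrapped in bool().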

  3. Regular Grammar - Matching Class Symbols

    from re import fullmatch
    # 1) Ordinary symbols - symbols that represent the symbols themselves in regular expressions
    result = fullmatch(r'abc', 'abcd')
    # 2). - matches any character
    result = fullmatch(r'a.c', 'a it is good c')
    result = fullmatch(r'', '12dabc')
    # 3) \d - matches any digit character
    result = fullmatch(r'a\dc', 'a5c')
    # 4) \s - matches any whitespace character
    # whitespace - characters that produce whitespace effects, such as spaces, newlines, horizontal tabs
    result = fullmatch(r'a\sc', 'ac')
    # 5) \D - matches any non-digit character
    result = fullmatch(r'a\Dc', 'a1c')
    # 6)\S - matches any non-whitespace character
    result = fullmatch(r'a\Sc', 'a1c')
    # 7) [Character Set] - matches any character in the charset
    [abc] - match a or b or c
    [abc\d] - match a or b or c or any number
    [1-9] - matches any number from 1 to 9
    [a-z] - matches any lowercase letter
    [A-Z] - matches any capital letter
    [a-zA-Z] - matches any letter
    [a-zA-Z\d_] - Match alphanumeric or underscore
    [\u4e00-\u9fa5] - Match any Chinese character
     Notice:[]The minus sign is placed between two characters to indicate who is going to whom (the way of determination is determined according to the size of the character encoding value); if the minus sign is not between the two characters, it means an ordinary minus sign
    result = fullmatch(r'a[MN12]b', 'aNb')
    result = fullmatch(r'a[MN\d]b', 'a8b')
    result = fullmatch(r'a[\u4e00-\u9fa5]c', 'a it is good c')
    result = fullmatch(r'a[A-Z]c', 'aBc')
    # 8) [^Charset] - matches any character not in the charset
    result = fullmatch(r'a[^MN]b', 'a_b')
    result = fullmatch(r'a[^\u4e00-\u9fa5]c', 'a0c')
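
The note about the minus sign inside [] is easy to get wrong, so here is a minimal sketch of the difference. The patterns and test strings are invented for illustration only.

```python
from re import fullmatch

# '-' between two characters forms a range; at the end of the set it is a literal minus sign.
range_set = r'x[a-c]y'     # x, then a, b or c, then y
literal_set = r'x[ac-]y'   # x, then a, c or '-', then y

in_range = bool(fullmatch(range_set, 'xby'))
not_in_literal = bool(fullmatch(literal_set, 'xby'))
minus_in_literal = bool(fullmatch(literal_set, 'x-y'))

print(in_range)          # True
print(not_in_literal)    # False
print(minus_in_literal)  # True
```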

number of matches

from re import fullmatch

# 1. * - 0 or more times (any number of times)
# Note: * controls how many times the item directly before it is repeated
# a* - zero or more a
# \d* - zero or more digits
result = fullmatch(r'a*123', 'aaaaaaa123')
result = fullmatch(r'\d*abc', '11234abc')
result = fullmatch(r'[MN]*abc', 'NMabc')
# 2. + - 1 or more times (at least once)
result = fullmatch(r'a+123', 'a123')
# 3. ? - 0 or 1 time
result = fullmatch(r'A?123', 'A123')
# 4. {}
# {M,N} - M to N times
# {M,}  - at least M times
# {,N}  - at most N times
# {N}   - exactly N times
result = fullmatch(r'[a-z]{3,5}123', 'absds123')
result = fullmatch(r'[a-z]{3,}123', 'absdsdfgsdf123')
result = fullmatch(r'[a-z]{,3}123', 'a123')
result = fullmatch(r'[a-z]{8}123', 'absdasdf123')

# Exercise: write a regular expression to check whether the input is a legal QQ number (5 to 12 digits long, and the first digit cannot be 0)
def f1(qq: str):
    return bool(fullmatch(r'[1-9]\d{4,11}', qq))

qq = input('Please enter a QQ number:')
print(f1(qq))

# Exercise: check whether the input is a legal identifier (made up of letters, digits and underscores, and it cannot start with a digit)
def f2(str1: str):
    return bool(fullmatch(r'[a-zA-Z_][\da-zA-Z_]*', str1))

str1 = input('Please enter an identifier:')
print(f2(str1))
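
To try the two exercise patterns without typing input, the sketch below wraps them in functions and runs them on a few sample strings. The function names is_qq and is_identifier and the sample values are hypothetical, introduced only for this example; the patterns are the ones from the exercises.

```python
from re import fullmatch

def is_qq(text: str) -> bool:
    # First digit 1-9, then 4 to 11 more digits: 5 to 12 digits in total.
    return bool(fullmatch(r'[1-9]\d{4,11}', text))

def is_identifier(text: str) -> bool:
    # A letter or underscore first, then any number of letters, digits or underscores.
    return bool(fullmatch(r'[a-zA-Z_][\da-zA-Z_]*', text))

for sample in ['10001', '01234', '1234']:
    print(sample, is_qq(sample))

for sample in ['_name1', '1name']:
    print(sample, is_identifier(sample))
```

Note that {4,11} counts only the digits after the first one, which is why the total length comes out as 5 to 12.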

greedy and non-greedy

from re import match

# match(regular expression, string) - checks whether the beginning of the string matches the rule described by the regular expression
result = match(r'\d{3}', '234hkdfjk')
  1. Greedy and non-greedy

    When the number of repetitions is not fixed (*, +, ?, {M,N}, {M,}, {,N}), matching works in one of two modes, greedy or non-greedy. The default is greedy.
    Greedy and non-greedy: when several different repetition counts would all let the match succeed, greedy takes the result with the most repetitions and non-greedy takes the result with the fewest.
    Greedy mode: *, +, ?, {M,N}, {M,}, {,N}
    Non-greedy mode: *?, +?, ??, {M,N}?, {M,}?, {,N}?

    # 1) Example 1:
    # greedy mode
    result = match(r'a.+b', 'ambcbdb')   # matches 'ambcbdb'
    # non-greedy mode
    result = match(r'a.+?b', 'ambcbdb')  # matches 'amb'
    # Note: if only one match is possible, greedy and non-greedy give the same result
    # greedy mode
    result = match(r'a.+b', 'ambc')      # matches 'amb'
    # non-greedy mode
    result = match(r'a.+?b', 'ambc')     # matches 'amb'
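
A common place where the difference shows up is text wrapped in delimiters. The following is a small sketch with an invented sample string; it uses the match object's group() method to read back the matched text.

```python
from re import match

text = '<b>bold</b> tail'

greedy = match(r'<.+>', text)    # .+ takes as much as it can, up to the last '>'
lazy = match(r'<.+?>', text)     # .+? stops at the first '>' that lets the match succeed

print(greedy.group())  # <b>bold</b>
print(lazy.group())    # <b>
```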

grouping and branching

  1. Grouping - ()

    1) Whole - treat part of a regular expression as a single unit for other operations (such as repetition)
    2) Repeat - \M inside a regular expression repeats whatever the Mth group before it matched
    3) Capture - get only part of what the regular expression matched (capturing is either manual or automatic)

    from re import fullmatch, findall
    # findall(regular expression, string) - get all substrings of the string that satisfy the regular expression
    # '12DF45ER65ER45WE'
    result = fullmatch(r'(\d\d[A-Z]{2})+', '12DF45ER65ER45WE')
    # With r'(\d{2})[a-z]\1': 23m23, 98k98, 12p12 match
    # 23m34, 98k18 do not match
    result = fullmatch(r'(\d{2})[a-z]\1[a-z]\1', '12d12d12')
    result = fullmatch(r'(\d{2})([a-z]{3})=\2\1{3}', '12abc=abc121212')
    # \M can only repeat a group that appears before it, not one that appears after it
    # result = fullmatch(r'(\d{2})\1=\2([a-z]{3})', '12abc=abc12')  # Error!
    result = fullmatch(r'(\d{2})\1=([a-z]{3})\2', '1212=abcabc')
    # Extract the numeric substrings for the amounts of money in the message
    message = 'I am 18 years old, with a monthly salary of 500000 yuan, height 180, weight 70 kg, and 8-pack abs. I pay a Tencent membership fee of 300 yuan per year, a monthly mortgage payment of 3000 yuan, and a car loan payment of 2200 yuan per month.'
    result = findall(r'(\d+) yuan', message)
    print(result)  # ['500000', '300', '3000', '2200']
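
Two grouping behaviours are worth seeing side by side: automatic capture in findall when there is more than one group, and a backreference forcing repeated text. This is a sketch with invented sample strings.

```python
from re import findall, fullmatch

# With two groups, findall returns one tuple per match, one item per group.
pairs = findall(r'(\d+)-(\d+)', '10-20, 30-40')
print(pairs)  # [('10', '20'), ('30', '40')]

# \1 must repeat exactly what group 1 matched, not just the same kind of character.
doubled = fullmatch(r'([a-z])\1', 'aa')   # match object
mixed = fullmatch(r'([a-z])\1', 'ab')     # None
print(bool(doubled), bool(mixed))  # True False
```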
  2. Branch - |

    regex1|regex2|regex3|... - matches a string that matches any one of the alternatives

    Example: match a string of three digits or two lowercase letters

    result = fullmatch(r'\d{3}|[a-z]{2}', '234')
    # a236b,amvb
    result = fullmatch(r'a\d{3}b|a[a-z]{2}b', 'amvb')
    # Note: to make only part of the regular expression a choice between alternatives, put the part that varies in a group
    result = fullmatch(r'a(\d{3}|[a-z]{2})b', 'amvb')
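
The reason the group matters is that | splits the whole expression unless parentheses limit it. A minimal sketch, using made-up patterns, makes the difference visible.

```python
from re import fullmatch

ungrouped = r'a\d|xb'    # either 'a' + digit, or 'xb'
grouped = r'a(\d|x)b'    # 'a', then a digit or 'x', then 'b'

r1 = bool(fullmatch(ungrouped, 'a1'))    # True: the left alternative matches on its own
r2 = bool(fullmatch(ungrouped, 'a1b'))   # False: neither alternative covers the whole string
r3 = bool(fullmatch(grouped, 'a1b'))     # True
r4 = bool(fullmatch(grouped, 'axb'))     # True

print(r1, r2, r3, r4)  # True False True True
```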

escape symbol

  1. escape symbol

    Escaping in a regular expression means adding '\' before a symbol that has a special function or meaning, so that the symbol becomes an ordinary symbol.

    # Match any string that represents a decimal number
    result = fullmatch(r'\d+\.\d+', '23.897')
    # +123,+5456
    result = fullmatch(r'\+\d+', '+123456')
    # (mv),(ahsjkd)
    result = fullmatch(r'\([a-z]+\)', '(msk)')
    result = fullmatch(r'\\\d+', r'\465654654')
  2. Escaping inside []

    Symbols that have a special meaning on their own (+, *, ?, ., etc.) automatically lose that meaning inside []

    result = fullmatch(r'\d+[.]\d+', '23.897')
    # Symbols that have a special function inside square brackets (- and ^) need '\' in front of them to be ordinary symbols, unless they are placed where that function does not apply ('-' at the end of the set, '^' anywhere but the start)
    result = fullmatch(r'a[M\-N]b', 'a-b')
    result = fullmatch(r'a[MN-]b', 'a-b')
    result = fullmatch(r'a[\^MN]b', 'a^b')
    result = fullmatch(r'a[MN^]b', 'a^b')
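
When the text to match literally comes from a variable, adding the backslashes by hand is error-prone. The re module also provides escape(), which is not covered in this post; the sketch below, with an invented sample string, shows it as one possible shortcut.

```python
from re import escape, fullmatch

literal = '1+1=2'

unescaped = fullmatch(literal, literal)         # None: '+' acts as a quantifier here
escaped = fullmatch(escape(literal), literal)   # match: every special symbol is escaped

print(unescaped)
print(bool(escaped))  # True
```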

detection class symbol

  1. \b - checks for a word boundary

    Word boundary: a position with a word character (letter, digit or underscore) on one side and a non-word character, or the beginning or end of the string, on the other. Whitespace and English punctuation next to a word create such boundaries.

    result = fullmatch(r'abc\b mn', 'abc mn')
    message = '203mn45,89 mn12de;99mll==910,230 90='
    result = findall(r'\d+', message)
    print(result)  # ['203', '45', '89', '12', '99', '910', '230', '90']
    result = findall(r'\d+\b', message)
    print(result)  # ['45', '89', '910', '230', '90']
    result = findall(r'\b\d+', message)
    print(result)  # ['203', '89', '99', '910', '230', '90']
    result = findall(r'\b\d+\b', message)
    print(result)  # ['89', '910', '230', '90']
  2. \B - checks for a position that is not a word boundary

    result = findall(r'\B\d+\B', message)
    print(result)  # ['03', '4', '12', '9', '1', '3']
  3. ^ - check if it is the beginning of a string

    result = findall(r'^\d+', message)
    print(result)  # ['203']
    # Extract the first 5 characters of a string
    result = findall(r'^.{5}', message)
    print(result)  # ['203mn']
  4. $ - Check for end of string

    # Extract the last 5 characters of a string
    result = findall(r'.{5}$', message)
    print(result)  # ['0 90=']
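
A typical use of \b is searching for a whole word rather than any occurrence of its letters. The sample sentence below is made up for the purpose of this sketch.

```python
from re import findall

text = 'cat concatenate cat.'

anywhere = findall(r'cat', text)          # also finds the 'cat' inside 'concatenate'
whole_words = findall(r'\bcat\b', text)   # only 'cat' standing as its own word

print(len(anywhere))     # 3
print(len(whole_words))  # 2
```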

re module

  1. Common functions

    from re import *
    # 1) fullmatch(regular expression, string) - full match: checks whether the entire string conforms to the rule described by the regular expression; returns a match object on success and None on failure
    result = fullmatch(r'abc', 'abc')
    # 2) match(regular expression, string) - matches at the beginning of the string: checks whether the beginning of the string conforms to the rule described by the regular expression; returns a match object on success and None on failure
    result = match(r'abc', 'abc12345')
    # 3) search(regular expression, string) - finds the first substring of the string that matches the regular expression; returns a match object if one is found and None otherwise
    result = search(r'1', '1a1b1c')
    # 4) findall(regular expression, string) - gets all substrings of the string that match the regular expression; returns a list whose elements are strings
    # Note: if the regular expression contains a group, findall captures automatically (only what the group matched is returned)
    result = findall(r'1', 'a1b1c1')
    # 5) finditer(regular expression, string) - gets all substrings of the string that match the regular expression; returns an iterator whose elements are the match objects for each matching substring
    result = finditer(r'1', 'a1b1c1')
    # 6) split(regular expression, string) - cuts the string at every substring that matches the regular expression
    result = split(r'\\\d+', r'\1abc\2abc\56abc\789abc')
    # 7) sub(regular expression, string 1, string 2) - replaces every substring of string 2 that matches the regular expression with string 1
    result = sub(r'\\\d+', '-', r'\1abc\2abc\56abc\789abc')
    print('Dividing line:', '==' * 50)
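
To compare what these functions return, the sketch below runs several of them with the same pattern on one invented string. It imports the names explicitly instead of using import *, which is a choice made for this example only.

```python
from re import search, finditer, split, sub

text = 'a1b22c333'

first = search(r'\d+', text)                         # match object for the first run of digits
runs = [m.group() for m in finditer(r'\d+', text)]   # every run, read from the match objects
pieces = split(r'\d+', text)                         # cut the string at every run of digits
masked = sub(r'\d+', '-', text)                      # replace every run of digits with '-'

print(first.group())  # 1
print(runs)           # ['1', '22', '333']
print(pieces)         # ['a', 'b', 'c', '']
print(masked)         # a-b-c-
```

The empty string at the end of the split result appears because the text ends with a match, so there is nothing after the last cutting point.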
  2. match object

    result = search(r'(\d{3})([A-Z]{2})', 'jsdha123mn123ASasfda125OPp')
    # 1) Get the match for the whole regular expression directly: match object.group()
    print(result.group())  # 123AS
    # 2) Manually capture what a given group matched: match object.group(group number)
    print(result.group(1))  # 123
    print(result.group(2))  # AS
    # 3) Get the position of the match in the original string: match object.span()
    print(result.span())  # (10, 15)
    print(result.span(2))  # (13, 15)
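
Since search returns None when nothing matches, calling a method on the result without checking can raise an error. Here is a minimal sketch of a guarded version using the same pattern and string; groups(), start() and end() are further match object methods not shown in this post.

```python
from re import search

result = search(r'(\d{3})([A-Z]{2})', 'jsdha123mn123ASasfda125OPp')

if result is None:
    print('no match')
else:
    print(result.groups())               # ('123', 'AS')
    print(result.start(), result.end())  # 10 15
```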
  3. parameter

    # 1) Ignore case when matching: (?i)
    result = fullmatch(r'(?i)abc', 'aBC')
    # 2) Single-line matching: (?s)
    # Multi-line matching (the default): . cannot match a newline (\n)
    # Single-line matching: . can match a newline (\n)
    result = fullmatch(r'(?s)abc.123', 'abc\n123')
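
The same two settings can also be passed through the flags argument that the re functions accept, instead of writing (?i) or (?s) inside the pattern. The sketch below compares both spellings on the strings used above.

```python
from re import fullmatch, IGNORECASE, DOTALL

inline_i = fullmatch(r'(?i)abc', 'aBC')
flag_i = fullmatch(r'abc', 'aBC', flags=IGNORECASE)

inline_s = fullmatch(r'(?s)abc.123', 'abc\n123')
flag_s = fullmatch(r'abc.123', 'abc\n123', flags=DOTALL)
no_flag = fullmatch(r'abc.123', 'abc\n123')   # None: by default . does not match \n

print(bool(inline_i), bool(flag_i))   # True True
print(bool(inline_s), bool(flag_s))   # True True
print(no_flag)                        # None
```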

Tags: Python regex programming language

Posted by synstealth on Fri, 16 Sep 2022 22:58:37 +0530