Python Learning Basics: Matching and Regular Expressions

Article directory

Foreword

1. Search without regular expressions

2. Use regular expressions to match and search

3. More matching methods of regular expressions

1. Use parentheses to group

2. Use a pipe to match multiple groups

3. Use question marks for optional matching

4. Match zero or more times with an asterisk

5. Match one or more times with the plus sign

6. Use curly braces to match a specific number of times with greedy and non-greedy matches

7. The findall() method

8. Abbreviated codes for common character classifications

9. Match all characters with dot-star

10. Match newlines with period characters

Foreword

Regular expressions are very useful, but few people outside of programming know about them. Most modern text editors and word processors (such as Microsoft Word or OpenOffice) have search-and-replace features that can search with regular expressions. Regular expressions can save a lot of time, not only for software users but also for programmers. This article covers character matching and regular expressions in Python.

1. Search without regular expressions

String matching and extraction problems come up often when learning Python, for example:

Q1: Suppose the mobile phone numbers of a certain country have the form xxxx-xxx-xxx-xxxx (where each x is a digit, for example 1234-123-123-1234). Given a string, determine whether it is a phone number.

How would you solve it? I would first check the length of the string: if it is not 17 characters long, it cannot be a phone number. Then I would check that each '-' is in the correct position, and finally that every other character is a digit. The code example is as follows:

def checkIsPhone(num):
    if len(num) != 17:
        return False
    for i in range(0, 4):
        if not num[i].isdecimal():
            return False
    for i in range(5, 8):
        if not num[i].isdecimal():
            return False
    for i in range(9, 12):
        if not num[i].isdecimal():
            return False
    for i in range(13, 17):
        if not num[i].isdecimal():
            return False
    if num[4] != '-' or num[8] != '-' or num[12] != '-':
        return False
    return True
if __name__ == '__main__':
    phone = '1234-123-123-1234'
    phone1 = '1235-abc-123-1234'
    print(checkIsPhone(phone))
    #>>>True
    print(checkIsPhone(phone1))
    #>>>False

Because the number is only 17 characters long, the code doesn't look too complicated. But with a longer format, there would be many more digit loops and '-' checks, and the code would become long and repetitive, making it difficult for others to read.

Q2: Given a string that may contain phone numbers, extract every substring that is a phone number. Add the following function to the code above:

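def getPhone(str0):
    # Collect every 17-character window that passes checkIsPhone().
    # A string shorter than 17 characters yields an empty list.
    phones = []
    # The + 1 makes sure the last window (ending at the end of str0) is checked
    for i in range(0, len(str0) - 17 + 1):
        str1 = str0[i:i + 17]
        if checkIsPhone(str1):
            phones.append(str1)
    return phones

For a short string the result comes back quickly, but every position is checked against a 17-character window, so the running time grows with the length of the string.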

2. Use regular expressions to match and search

The same problem can be handled with the regular expression \d{4}-\d{3}-\d{3}-\d{4}. The code example is as follows:

import re


string = 'This is my phone number:1234-123-123-1234,4321-321-321-4321 is my bank card number'
# Create a regular expression object
Reg = re.compile(r'\d{4}-\d{3}-\d{3}-\d{4}')
# Find all matching substrings of a string, return a list of substrings
match = Reg.findall(string)
print(match)
# >>>['1234-123-123-1234', '4321-321-321-4321']
# search() finds the first matching substring and returns a Match object.
# The Match object's group() method returns the actual matched text.
match1 = Reg.search(string).group()
print(match1)
# >>>1234-123-123-1234

As you can see, matching and searching with regular expressions gives the same results with much shorter and clearer code.
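Q1 (deciding whether a whole string is a phone number) can be handled with the same pattern. The following is a minimal sketch, not part of the original code: the name is_phone is made up for illustration, and it relies on fullmatch(), which only succeeds when the entire string matches the pattern.

```python
import re

# Same pattern as above. fullmatch() succeeds only if the WHOLE string matches.
PHONE_RE = re.compile(r'\d{4}-\d{3}-\d{3}-\d{4}')


def is_phone(text):
    """Return True if text is exactly one number of the form xxxx-xxx-xxx-xxxx.

    is_phone is a hypothetical helper name used only for this example.
    """
    return PHONE_RE.fullmatch(text) is not None


print(is_phone('1234-123-123-1234'))   # True
print(is_phone('1235-abc-123-1234'))   # False
print(is_phone('1234-123-123-12345'))  # False: one digit too many
```

Using search() here instead would also accept longer strings that merely contain a phone number somewhere inside them.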

3. More matching methods of regular expressions

1. Use parentheses to group

If you need to extract the birthday from an ID number in a sentence, add parentheses to the regular expression to create groups, then use group() to extract the part you need. The code example is as follows:

import re

# where 20000125 is the birthday number
str0 = 'This is my ID number:510211-20000125-3910'
Reg = re.compile(r'(\d{6})-(\d{8})-(\d{4})')
match = Reg.search(str0)
print(match.group(0))
# >>>510211-20000125-3910
print(match.group(1))
# >>>510211
print(match.group(2))
# >>>20000125
print(match.group(3))
# >>>3910

Pass the integer 1, 2, or 3 to the Match object's group() method to get the corresponding part of the matched text. Passing 0 or no argument to group() returns the entire matched text.
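If you want all the groups at once, the Match object also has a groups() method that returns them as a tuple. The sketch below reuses the ID example; the variable names region, birthday, and serial are only illustrative labels, not an official description of the ID format. It also checks for None first, because search() returns None when nothing matches.

```python
import re

id_regex = re.compile(r'(\d{6})-(\d{8})-(\d{4})')
match = id_regex.search('This is my ID number:510211-20000125-3910')

# search() returns None when nothing matches, so check before using the result.
if match is not None:
    # groups() returns every group as a tuple, in order.
    region, birthday, serial = match.groups()
    print(birthday)
    # >>>20000125

no_match = id_regex.search('There is no ID number here')
print(no_match)
# >>>None
```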

2. Use a pipe to match multiple groups

The | character is called a "pipe". It can be used when you want to match one of several expressions. For example, the regular expression r'Tom|Jerry' will match either 'Tom' or 'Jerry'. The code example is as follows:

import re

str1 = 'Ponyo likes Sousuke, and the intermediary likes to make the difference'
Reg1 = re.compile(r'Ponyo|Sousuke')
match = Reg1.findall(str1)
print(match)
# >>>['Ponyo', 'Sousuke']
batRegex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = batRegex.search('Batmobile lost a wheel')
print(mo.group())
# >>>Batmobile
print(mo.group(1))
# >>>mobile

If you want to match a literal '|' character, escape it with a backslash: \|.
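As a small illustration of the escaped pipe, here is a sketch with a made-up input string in which numbers are separated by a literal '|':

```python
import re

# \| matches a literal '|' character; an unescaped | would mean "or".
literal_pipe = re.compile(r'\d+\|\d+')
pairs = literal_pipe.findall('rows: 10|20, 30|40')
print(pairs)
# >>>['10|20', '30|40']
```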

3. Use question marks for optional matching

The ? character marks the group before it as optional. The code example is as follows:

Reg = re.compile(r'super(wo)?man')
mo1 = Reg.search('The china of superman')
print(mo1.group())
# >>>superman
mo2 = Reg.search('The china of superwoman')
print(mo2.group())
# >>>superwoman

The (wo)? part of the regular expression indicates that the pattern wo is an optional grouping. In the text matched by this regular expression, wo will appear zero or one time.

If you need to match a real question mark character, use the escape character \?.

4. Match zero or more times with an asterisk

* (asterisk) means "match zero or more times": the group before the asterisk can occur any number of times in the text. It can be completely absent or repeated over and over again. The code example is as follows:

batRegex = re.compile(r'super(wo)*man')
mo1 = batRegex.search('The china of superman')
print(mo1.group())
# >>>superman

mo2 = batRegex.search('The china of superwoman')
print(mo2.group())
# >>>superwoman

mo3 = batRegex.search('The china of superwowowowoman')
print(mo3.group())
# >>>superwowowowoman

If you need to match the real asterisk character, add a backslash before the regular expression asterisk character, that is, \*.
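A short sketch of the escaped asterisk, using a made-up sentence that contains a multiplication written with '*':

```python
import re

# \* matches a literal '*'; without the backslash it would be a repetition operator.
star_regex = re.compile(r'\d+\*\d+')
mo = star_regex.search('The area is 3*4 square meters')
print(mo.group())
# >>>3*4
```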

5. Match one or more times with the plus sign

While * means "match zero or more times", + (plus sign) means "match one or more times". With the asterisk, the group does not have to appear in the matched string at all; with the plus sign, the group before it must appear at least once. It is not optional. The code example is as follows:

batRegex = re.compile(r'super(wo)+man')
mo1 = batRegex.search('The china of superman')
print(mo1)
# >>>None

mo2 = batRegex.search('The china of superwoman')
print(mo2.group())
# >>>superwoman

mo3 = batRegex.search('The china of superwowowowoman')
print(mo3.group())
# >>>superwowowowoman

6. Use curly braces to match a specific number of times with greedy and non-greedy matches

If you want a group to repeat a specific number of times, follow the group in the regular expression with a number in curly braces. For example, the regular expression (Ha){3} will match the string 'HaHaHa', but not 'HaHa', because the latter repeats the (Ha) group only twice.

In addition to a number, you can specify a range by writing a minimum value, a comma, and a maximum value in curly braces. For example, the regular expression (Ha){3,5} will match 'HaHaHa', 'HaHaHaHa', and 'HaHaHaHaHa'. You can also leave out the first or second number in curly braces, and limit the minimum or maximum value. For example, (Ha){3,} will match 3 or more instances, and (Ha){,5} will match 0 to 5 instances. Curly braces make regular expressions shorter.
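The forms described above can be tried out as follows. This is a minimal sketch; the variable names are chosen only for this example.

```python
import re

text = 'HaHaHaHa'  # four repetitions of 'Ha'

exactly_three = re.compile(r'(Ha){3}')
three_or_more = re.compile(r'(Ha){3,}')
up_to_five = re.compile(r'(Ha){,5}')

print(exactly_three.search(text).group())
# >>>HaHaHa
print(three_or_more.search(text).group())
# >>>HaHaHaHa
print(up_to_five.search(text).group())
# >>>HaHaHaHa
print(exactly_three.search('HaHa'))
# >>>None
```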

Python's regular expressions are "greedy" by default, which means that in ambiguous situations they will match the longest possible string. The "non-greedy" version of curly braces, written as the closing curly brace followed by a question mark, matches the shortest possible string. Run the following code and note the difference between the greedy and non-greedy forms of curly braces when searching the same string:

Regex = re.compile(r'(Ha){3,5}')
mo1 = Regex.search('HaHaHaHaHa')
print(mo1.group())
# >>>HaHaHaHaHa
Regex = re.compile(r'(Ha){3,5}?')
mo2 = Regex.search('HaHaHaHaHa')
print(mo2.group())
# >>>HaHaHa

7. The findall() method

The search() method returns a Match object containing the text of the "first" match in the searched string, while the findall() method returns every match in the searched string. Instead of a Match object, findall() returns a list of strings, as long as there are no groups in the regular expression. Each string in the list is a piece of the searched text that matched the regular expression. The code example is as follows:

reg = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')  # has no groups
mo = reg.findall('Cell: 415-555-9999 Work: 212-555-0000')
print(mo)
# >>>['415-555-9999', '212-555-0000']

If there is grouping in the regular expression, then findall will return a list of tuples. Each tuple represents a found match, where the items are the matched strings for each grouping in the regular expression. The code example is as follows:

reg = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')  # has groups
mo = reg.findall('Cell: 415-555-9999 Work: 212-555-0000')
print(mo)
# >>>[('415', '555', '9999'), ('212', '555', '0000')]

As a summary of the results returned by the findall() method, keep the following two points in mind:

1. If called on a regular expression without groups, such as \d\d\d-\d\d\d-\d\d\d\d, findall() will return a list of matching strings, such as ['415-555-9999', '212-555-0000'].

2. If called on a regular expression with groups, such as (\d\d\d)-(\d\d\d)-(\d\d\d\d), findall() will return a list of tuples of strings (one string per group), such as [('415', '555', '9999'), ('212', '555', '0000')].

8. Abbreviated codes for common character classifications

Shorthand    Represents

\w           Any letter, digit, or underscore character (can be thought of as matching "word" characters)
\W           Any character other than a letter, digit, or underscore
\s           A space, tab, or newline (can be thought of as matching "whitespace" characters)
\S           Any character except a space, tab, or newline
\d           Any digit from 0 to 9
\D           Any character other than a digit from 0 to 9
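To see a few of these shorthand classes in action, here is a small sketch. The input strings are made up for illustration.

```python
import re

text = 'Order 66: 12 drummers, 11 pipers\t10 lords'

# \d+ matches runs of digits
digits = re.findall(r'\d+', text)
print(digits)
# >>>['66', '12', '11', '10']

# digits, then one whitespace character (space or tab), then a word
counted = re.findall(r'\d+\s\w+', text)
print(counted)
# >>>['12 drummers', '11 pipers', '10 lords']

# \S+ matches runs of non-whitespace characters
words = re.findall(r'\S+', 'one two\tthree\nfour')
print(words)
# >>>['one', 'two', 'three', 'four']
```

Note that '66' is not in the second result, because it is followed by ':' rather than a whitespace character.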

9. Match all characters with dot-star

Sometimes you want to match any text at all. For example, suppose you want to match the string 'First Name:', followed by any text, then 'Last Name:', and then any text again. A dot-star (.*) can be used to represent "arbitrary text". The sample code is as follows:

# 9. Match all characters with dot-star
nameRegex = re.compile(r'First Name: (.*) Last Name: (.*)')
mo = nameRegex.search('First Name: Ponyo Last Name: Sousuke')
print(mo.group(1))
# >>>Ponyo
print(mo.group(2))
# >>>Sousuke

The dot-star uses "greedy" mode: it always matches as much text as possible. To match any text in "non-greedy" mode, use dot-star followed by a question mark (.*?). As with curly braces, the question mark tells Python to use non-greedy matching.
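The difference between the greedy and non-greedy dot-star can be seen in the following sketch, which uses a made-up string containing two closing angle brackets:

```python
import re

text = '<To serve man> for dinner.>'

# Greedy: .* runs as far as possible, up to the LAST '>'
greedy = re.compile(r'<.*>')
print(greedy.search(text).group())
# >>><To serve man> for dinner.>

# Non-greedy: .*? stops at the FIRST '>'
non_greedy = re.compile(r'<.*?>')
print(non_greedy.search(text).group())
# >>><To serve man>
```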

10. Match newlines with period characters

The dot-star will match all characters except a newline. By passing re.DOTALL as the second argument to re.compile(), you can make the period character match all characters, including newline characters. The sample code is as follows:

# 10. Match newlines with period characters
noNewlineRegex = re.compile('.*')
n = noNewlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
print(n)
# >>>Serve the public trust.
newlineRegex = re.compile('.*', re.DOTALL)
m = newlineRegex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()
print(m)
# >>>Serve the public trust.
# Protect the innocent.
# Uphold the law.

Tags: Python

Posted by adguru on Wed, 21 Sep 2022 22:37:23 +0530