Python: pattern matching with regular expressions

Table of contents

Foreword

1. Project content

2. Practical steps

2.1 Create a regular expression for phone numbers

2.2 Create a regular expression for E-mail addresses

2.3 Find all matches in clipboard text

2.4 Concatenate all matches into a string and copy to clipboard

2.5 Run the program

2.6 Concept of a similar program

2.7 Summary

Foreword

The content and code in this article come from the book "Getting Started with Python Programming - Automating Trivial Work". The purpose is to record what I learned for later review; this installment covers pattern matching and regular expressions.

This part consolidates the regular expression knowledge covered earlier through a practice project.

1. Project content

Find all phone numbers and email addresses in a long web page or article.

To run the program on some text, press Ctrl-A to select all of it, press Ctrl-C to copy it to the clipboard, and then run the program. It finds the phone numbers and E-mail addresses in the clipboard text and replaces the clipboard contents with them.

The phone number and E-mail address extraction program needs to complete the following tasks:

  • Get text from the clipboard.
  • Find all phone numbers and E-mail addresses in the text.
  • Copy them back to the clipboard.

What the code needs to do:

  • Copy and paste matching strings using the pyperclip module.
  • Create two regular expressions, one to match phone numbers and the other to match E-mail addresses.
  • For both regular expressions, find all matches, not just the first one.
  • Format the matched strings into a single string for pasting.
  • If no matches are found in the text, display some kind of message.

2. Practical steps

2.1 Create a regular expression for phone numbers

  1. Phone numbers start with an optional area code, so the area code group is followed by a question mark. Since the area code may be just 3 digits (i.e. \d{3}) or 3 digits in parentheses (i.e. \(\d{3}\)), the two alternatives are joined with a pipe (|).
  2. The separator can be a space (\s), a dash (-), or a period (\.), so these alternatives are also joined with pipes.
  3. The next few parts of the regular expression are simple: 3 digits, followed by another separator, followed by 4 digits.
  4. The final part is an optional extension: any number of spaces, followed by ext, x, or ext., followed by 2 to 5 digits.
#Create a regular expression for phone numbers
import re

phoneRegex = re.compile(r'''(
 (\d{3}|\(\d{3}\))? # area code
 (\s|-|\.)? # phone number separator
 (\d{3}) # first 3 digits
 (\s|-|\.) # delimiter
 (\d{4}) # last 4 digits
 (\s*(ext|x|ext\.)\s*(\d{2,5}))? # optional extension
 )''', re.VERBOSE)
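
As a quick check of the four rules above, the pattern can be tried on a short sample string. This is only a sketch: the sample text and the names phone_regex, sample and found are made up for illustration, and the dot in ext. is escaped here so that it matches a literal period.

```python
import re

# Phone pattern from this section; the dot in "ext." is escaped (assumption of this sketch)
phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?               # area code, with or without parentheses
    (\s|-|\.)?                       # optional separator
    (\d{3})                          # first 3 digits
    (\s|-|\.)                        # separator
    (\d{4})                          # last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?  # optional extension
    )''', re.VERBOSE)

# Made-up sample text with two phone numbers
sample = 'Call 415-555-1011 or (415) 555-9999 ext. 123.'

# group(1) is the outermost group, i.e. the whole phone number
found = [m.group(1) for m in phone_regex.finditer(sample)]
print(found)  # ['415-555-1011', '(415) 555-9999 ext. 123']
```

The second number shows both the parenthesized area code and the optional extension being picked up by the same pattern.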

2.2 Create a regular expression for E-mail addresses

  1. The username portion of an E-mail address is one or more characters, which can include: lowercase and uppercase letters, numbers, periods, underscores, percent signs, plus signs, or dashes. All of these can be put into one character class: [a-zA-Z0-9._%+-]
  2. The username and the domain name are separated by the @ symbol. The domain name allows a smaller set of characters: only letters, numbers, periods, and dashes, i.e. [a-zA-Z0-9.-].
  3. Finally there's the "dot-com" part (technically called the "top-level domain"), which can actually be "dot-anything". This pattern allows 2 to 4 letters for it.
  4. There are many strange rules for the format of E-mail addresses. This regular expression won't match every valid E-mail address, but it will match most typical addresses you encounter.
#Create a regular expression for E-mail addresses
emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+ # username
    @ # @ sign
    [a-zA-Z0-9.-]+ # domain name
    (\.[a-zA-Z]{2,4}) # top level domain 
    )''', re.VERBOSE)
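
The E-mail pattern can be tried in the same way. The sketch below uses made-up addresses on the reserved example.com and example.org domains; the names email_regex, sample, emails and truncated are illustrative only.

```python
import re

# E-mail pattern from this section
email_regex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+    # username
    @                    # @ sign
    [a-zA-Z0-9.-]+       # domain name
    (\.[a-zA-Z]{2,4})    # top-level domain, 2 to 4 letters
    )''', re.VERBOSE)

# Made-up sample text with two addresses
sample = 'Write to info@example.com or jane.doe+news@mail.example.org today.'
emails = [m.group(1) for m in email_regex.finditer(sample)]
print(emails)  # ['info@example.com', 'jane.doe+news@mail.example.org']

# A top-level domain longer than 4 letters is cut short by {2,4}
truncated = email_regex.search('curator@example.museum').group(1)
print(truncated)  # curator@example.muse
```

The last line illustrates the limitation mentioned in point 4: the pattern is a practical approximation, not a full validator.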

2.3 Find all matches in clipboard text

  • Having specified the regular expressions for phone numbers and email addresses above, you can let Python's re module do the hard work of finding all matches in the clipboard text.
  • The pyperclip.paste() function returns a string containing the text on the clipboard.
  • The regular expression's findall() method returns a list of tuples.
  • Each match corresponds to one tuple, and each tuple contains one string for each group in the regular expression.
#Find all matches in clipboard text
text = str(pyperclip.paste())
matches = []
#Save all phone number matches in matches
for groups in phoneRegex.findall(text):
    #Unify phone numbers into a single, standard format
    phoneNum = '-'.join([groups[1],groups[3],groups[5]])
    if groups[8] != '':
        phoneNum += ' x'+groups[8]
    #Append every phone number, with or without an extension
    matches.append(phoneNum)
#Save all E-mail address matches in matches
for groups in emailRegex.findall(text):
    matches.append(groups[0])
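
The indexes 1, 3, 5 and 8 used in the loop follow from the order of the opening parentheses in the phone pattern. The sketch below prints the tuple that findall() produces for one made-up phone number, so that each position can be seen; it assumes the pattern from section 2.1 with the dot in ext. escaped, and the names phone_regex, rows and phone_num are illustrative.

```python
import re

# Phone pattern from section 2.1 (dot in "ext." escaped, an assumption of this sketch)
phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?               # index 1: area code
    (\s|-|\.)?                       # index 2: separator
    (\d{3})                          # index 3: first 3 digits
    (\s|-|\.)                        # index 4: separator
    (\d{4})                          # index 5: last 4 digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?  # index 6: whole extension, 7: keyword, 8: digits
    )''', re.VERBOSE)

# With groups in the pattern, findall() returns one tuple per match;
# index 0 is the outermost group, i.e. the whole match
rows = phone_regex.findall('Office: 800.420.7240 x55')
print(rows)
# [('800.420.7240 x55', '800', '.', '420', '.', '7240', ' x55', 'x', '55')]

# Rebuild the number in one standard format from the relevant positions
groups = rows[0]
phone_num = '-'.join([groups[1], groups[3], groups[5]])
if groups[8] != '':
    phone_num += ' x' + groups[8]
print(phone_num)  # 800-420-7240 x55
```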

2.4 Concatenate all matches into a string and copy to clipboard

  • The E-mail addresses and phone numbers are now stored as a list of strings in matches, ready to be copied to the clipboard.
  • The pyperclip.copy() function accepts only a single string value, not a list of strings, so the list is first combined into one string with the join() method.
#All matches are concatenated into one string, copied to clipboard
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')
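
The joining step can be tried without the clipboard. The helper name format_matches below is hypothetical and not part of the original program; it only isolates the "join or show a message" decision so it can be checked on its own.

```python
def format_matches(matches):
    """Return one match per line, or a fallback message for an empty list."""
    if matches:
        # '\n' is the newline escape; 'n' without the backslash would glue
        # the items together with the letter n instead
        return '\n'.join(matches)
    return 'No phone numbers or email addresses found.'

print(format_matches(['415-555-1011', 'info@example.com']))
print(format_matches([]))
```

The string returned for a non-empty list is the kind of single string that pyperclip.copy() expects.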

2.5 Run the program

Open a web page, press Ctrl-A to select all the text on the page, press Ctrl-C to copy it to the clipboard, then run the program. The complete program is:
import re
import pyperclip

#Create a regular expression for phone numbers
phoneRegex = re.compile(r'''(
 (\d{3}|\(\d{3}\))? # area code
 (\s|-|\.)? # phone number separator
 (\d{3}) # first 3 digits
 (\s|-|\.) # delimiter
 (\d{4}) # last 4 numbers
 (\s*(ext|x|ext\.)\s*(\d{2,5}))? # optional extension
 )''', re.VERBOSE)

#Create a regular expression for E-mail addresses
emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+ # username
    @ # @ sign
    [a-zA-Z0-9.-]+ # domain name
    (\.[a-zA-Z]{2,4}) # top level domain 
    )''', re.VERBOSE)

#Find all matches in clipboard text
text = str(pyperclip.paste())
matches = []
#Save all phone number matches in matches
for groups in phoneRegex.findall(text):
    #Unify phone numbers into a single, standard format
    phoneNum = '-'.join([groups[1],groups[3],groups[5]])
    if groups[8] != '':
        phoneNum += ' x'+groups[8]
    #Append every phone number, with or without an extension
    matches.append(phoneNum)
#Save all E-mail address matches in matches
for groups in emailRegex.findall(text):
    matches.append(groups[0])

#All matches are concatenated into one string, copied to clipboard
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard:')
    print('\n'.join(matches))
else:
    print('No phone numbers or email addresses found.')

2.6 Concept of a similar program

Recognizing patterns of text (and possibly replacing them with the sub() method) has many different potential applications:
  • Find website URLs that start with http:// or https://.
  • Clean up dates in different formats (such as 3/14/2015, 03-14-2015 and 2015/3/14) by replacing them with dates in a single, standard format.
  • Delete sensitive information, such as social security numbers or credit card numbers.
  • Look for common typos such as multiple spaces between words, accidentally repeated words, or multiple exclamation marks at the end of sentences.
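
The four ideas in the list could be sketched roughly as follows. These patterns are deliberately simple illustrations, not robust solutions; all sample strings and the names urls, date_regex, to_iso, dates, masked and fixed are made up.

```python
import re

# 1. URLs that start with http:// or https:// (up to the next whitespace)
urls = re.findall(r'https?://\S+',
                  'Docs: https://example.com/docs and http://example.org/faq here')

# 2. Rewrite M/D/YYYY and MM-DD-YYYY dates as YYYY-MM-DD.
#    Year-first input such as 2015/3/14 is not handled by this sketch.
date_regex = re.compile(r'\b(\d{1,2})[/-](\d{1,2})[/-](\d{4})\b')

def to_iso(match):
    """Build a YYYY-MM-DD string from the month, day and year groups."""
    month, day, year = match.groups()
    return f'{year}-{int(month):02d}-{int(day):02d}'

dates = date_regex.sub(to_iso, 'Due 3/14/2015, shipped 03-14-2015')

# 3. Mask anything shaped like a US social security number
masked = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', 'XXX-XX-XXXX', 'SSN: 123-45-6789')

# 4. Common typos: repeated words, runs of spaces, repeated exclamation marks
fixed = re.sub(r'\b(\w+)\s+\1\b', r'\1', 'This is is  a test!!!')
fixed = re.sub(r' {2,}', ' ', fixed)
fixed = re.sub(r'!{2,}', '!', fixed)

print(urls)    # ['https://example.com/docs', 'http://example.org/faq']
print(dates)   # Due 2015-03-14, shipped 2015-03-14
print(masked)  # SSN: XXX-XX-XXXX
print(fixed)   # This is a test!
```

Passing a function to sub(), as in the date example, lets the replacement be computed from the matched groups instead of being a fixed string.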

2.7 Summary

While a computer can search text quickly, it must be told exactly what to look for. Regular expressions let us specify exactly the text pattern we're looking for. In fact, some word processing and spreadsheet applications provide find-and-replace features that let you search using regular expressions.
Python's built-in re module lets us compile Regex objects. These objects have several methods: search() finds a single match, findall() finds all matching instances, and sub() finds and replaces text.
Beyond the syntax introduced earlier, there is more regular expression syntax. You can find out more in the official Python documentation: http://docs.python.org/3/library/re.html . The tutorial site http://www.regular-expressions.info/ is also a useful resource.
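
As a compact reminder of the three methods named in this summary, here is a minimal sketch on a made-up sentence; the pattern and the names digits, text, first, every and replaced are chosen only for illustration.

```python
import re

digits = re.compile(r'\d+')         # one or more digits
text = 'Order 66 shipped 12 boxes'  # made-up sample text

# search() returns a Match object for the first match, or None if there is none
first = digits.search(text)
print(first.group())  # 66

# findall() returns every match; without groups these are plain strings
every = digits.findall(text)
print(every)  # ['66', '12']

# sub() returns a new string with every match replaced
replaced = digits.sub('#', text)
print(replaced)  # Order # shipped # boxes
```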


 

  

Tags: Python

Posted by smonsivaes on Thu, 02 Jun 2022 07:27:35 +0530