Remove duplicate images from directories at the same level
Code
    # config.py
    class Config(object):
        data_dir = './structure/'
        save_dir = './save/'
        isRename = True
        isRemove = False
        isSave = False
    # remove_duplicated.py
    import os
    import hashlib

    import numpy as np
    from PIL import Image

    from config import Config


    def remove_duplicated(data_dir, save_dir, isRename, isRemove, isSave):
        temp = set()  # MD5 digests of the images seen so far
        count = 0     # number of duplicate images found
        if isSave:
            os.makedirs(save_dir, exist_ok=True)  # make sure the save folder exists
        for cls in os.listdir(data_dir):
            cls_dir = os.path.join(data_dir, cls)
            for file in os.listdir(cls_dir):
                file_path = os.path.join(cls_dir, file)  # full path of the image
                rename_path = os.path.join(cls_dir, 'duplicated' + file)
                with Image.open(file_path) as img:  # open the image
                    img_array = np.array(img)  # decode it into a pixel array
                    # MD5 digest of the decoded pixel data
                    digest = hashlib.md5(img_array.tobytes()).hexdigest()
                    is_new = digest not in temp
                    if is_new:  # first time this digest is seen
                        temp.add(digest)
                        if isSave:  # keep a copy of the unique image
                            img.save(os.path.join(save_dir, file))
                if not is_new:  # the file is closed here, so it is safe to touch
                    count += 1  # one more duplicate
                    if isRemove:
                        os.remove(file_path)
                    elif isRename:
                        os.rename(file_path, rename_path)
        print('total duplicated images:', count)
        print('total non duplicated images:', len(temp))


    if __name__ == '__main__':
        opt = Config()
        remove_duplicated(opt.data_dir, opt.save_dir, opt.isRename, opt.isRemove, opt.isSave)
Features
- Delete duplicate images found in directories at the same level
- Or rename the duplicates so that they are clearly marked
- Save all unique images to a specified folder
- Print the number of duplicate images and the number of unique images
Instructions for use
- Sample layout: the structure folder contains several subfolders, and each subfolder holds the images to be checked; remove_duplicated.py and config.py sit at the same level as the structure folder
- Configure the parameters in config.py
  - data_dir: data path (e.g. "./structure/", with a trailing "/")
  - save_dir: save path (e.g. "./save/", with a trailing "/")
  - isRemove: whether to delete duplicate images (isRemove=True deletes them)
  - isRename: whether to rename duplicate images (isRename=True prefixes the file name with "duplicated"; only applies when isRemove is False)
  - isSave: whether to save all unique images (isSave=True saves them to the save path)
- Run remove_duplicated.py
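The interaction between these settings can be checked before running the script. The following is a minimal sketch, not part of the original project: `check_config` and `DeleteConfig` are hypothetical names introduced here for illustration. It assumes the behaviour shown in the code above, where `isRemove` is tested before `isRename`, so deletion wins when both are True.

```python
class Config(object):
    data_dir = './structure/'   # folder whose subfolders hold the images
    save_dir = './save/'        # target folder used when isSave is True
    isRename = True
    isRemove = False
    isSave = False


def check_config(cfg):
    """Return a list of problems found in cfg (hypothetical helper)."""
    problems = []
    # The README asks for a trailing "/" on both paths.
    for name in ('data_dir', 'save_dir'):
        if not getattr(cfg, name).endswith('/'):
            problems.append(name + ' should end with "/"')
    # The script checks isRemove first, so isRename is ignored when both are set.
    if cfg.isRemove and cfg.isRename:
        problems.append('isRemove and isRename are both True; '
                        'duplicates will be deleted, not renamed')
    return problems


class DeleteConfig(Config):
    save_dir = './save'   # trailing "/" left out on purpose
    isRemove = True       # combined with the inherited isRename = True


print(check_config(Config()))        # the defaults pass both checks
print(check_config(DeleteConfig()))  # reports both problems
```

Since deletion cannot be undone, it may be worth running once with the default isRename=True and inspecting the renamed files before switching to isRemove=True.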
Introduction
Two images are treated as duplicates when their MD5 values are identical. The MD5 value is computed from the decoded pixel data of each image, and every image is compared against the values already seen.
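The core idea can be sketched without any image libraries. The example below is an illustration under simplifying assumptions: it hashes raw byte strings standing in for pixel data, and `find_duplicates` and the sample names are hypothetical, not part of the script.

```python
import hashlib


def find_duplicates(named_buffers):
    """Split (name, bytes) pairs into unique and duplicate names by MD5 digest."""
    seen = set()               # digests encountered so far
    unique, duplicates = [], []
    for name, data in named_buffers:
        digest = hashlib.md5(data).hexdigest()
        if digest in seen:     # same digest as an earlier item
            duplicates.append(name)
        else:
            seen.add(digest)
            unique.append(name)
    return unique, duplicates


samples = [
    ("a.png", b"\x00\x01\x02"),
    ("b.png", b"\xff\xff\xff"),
    ("a_copy.png", b"\x00\x01\x02"),  # same bytes as a.png
]
unique, duplicates = find_duplicates(samples)
print(unique)      # ['a.png', 'b.png']
print(duplicates)  # ['a_copy.png']
```

The first occurrence of each digest is kept as the unique item, so which file counts as the "original" depends on the order in which files are visited.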
Background
MD5 (MD5 Message-Digest Algorithm) is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value and is used to check that information has been transmitted completely and consistently. It was designed by the American cryptographer Ronald Linn Rivest and published in 1992 to replace the MD4 algorithm. The algorithm is specified in RFC 1321. After 1996, weaknesses were demonstrated in the algorithm, and for data that requires a high degree of security, experts generally recommend other algorithms such as SHA-2. In 2004 it was confirmed that MD5 is not collision resistant, so it is unsuitable for security purposes such as SSL public key authentication or digital signatures.
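The digest size mentioned above can be verified with Python's standard `hashlib` module. This is a small illustrative check, using the "abc" test input from the RFC 1321 test suite; SHA-256 is shown only as one member of the SHA-2 family for comparison.

```python
import hashlib

md5 = hashlib.md5(b"abc")
print(md5.digest_size)         # 16 bytes
print(md5.digest_size * 8)     # 128 bits
print(md5.hexdigest())         # 900150983cd24fb0d6963f7d28e17f72

sha256 = hashlib.sha256(b"abc")
print(sha256.digest_size * 8)  # 256 bits
```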
Also
If you like this project, please give it a star, follow my CSDN blog and my Bilibili channel, and subscribe to my public account, CV companion reading club.
