Remove duplicate images in the same level directory


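# config.py
class Config(object):
    data_dir = './structure/'  # folder whose subfolders hold the images; keep the trailing '/'
    save_dir = './save/'       # where unique images go when isSave is True; keep the trailing '/'
    isRename = True            # mark duplicates by prefixing the file name with 'duplicated'
    isRemove = False           # delete duplicates (checked before isRename)
    isSave = False             # save every unique image to save_dir


# Main script
import os
import hashlib
import numpy as np
from PIL import Image
from config import Config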

def remove_duplicated(data_dir, save_dir, isRename, isRemove, isSave):
    temp = set()
    count = 0
    classes = os.listdir(data_dir)

    for cls in classes:
        files = os.listdir(data_dir + cls)
        for file in files:
            file_path = data_dir + cls + '/' + file  # full path of the image
            rename_path = data_dir + cls + '/' + 'duplicated' + file  # name used to mark a duplicate
            img = Image.open(file_path)  # open the image
            img_array = np.array(img)  # convert it to a pixel array
            md5 = hashlib.md5()  # create a hash object
            md5.update(img_array)  # hash the pixel data of the current file
            digest = md5.hexdigest()
            if digest not in temp:  # if this MD5 value has not been seen yet
                temp.add(digest)  # then add it to the set
                if isSave:
                    img.save(save_dir + file)  # and save the current image to the save path
            else:
                count += 1  # otherwise count one more duplicate
                if isRemove:
                    os.remove(file_path)  # delete the duplicate
                elif isRename:
                    os.rename(file_path, rename_path)  # or mark it by renaming

    print('total duplicated images:', count)
    print('total non duplicated images', len(temp))

if __name__ == '__main__':
    opt = Config()
    remove_duplicated(opt.data_dir, opt.save_dir, opt.isRename, opt.isRemove, opt.isSave)

Features

  1. Delete duplicate images found in directories at the same level
  2. Or rename duplicate images so that they are clearly marked
  3. Save all unique images to a specified folder
  4. Print the number of duplicate images and the number of unique images
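
Before deleting or renaming anything, it can help to run a read-only pass that only counts. The sketch below is a rough illustration, not the script above: the names `group_by_md5` and `report` are hypothetical, and it hashes the raw file bytes with the standard library instead of the decoded pixels, so the same picture stored in two different file formats would not be matched.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_by_md5(data_dir):
    """Map each MD5 hex digest to the list of files whose bytes produce it."""
    groups = defaultdict(list)
    for path in sorted(Path(data_dir).rglob('*')):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return groups


def report(data_dir):
    """Return (unique_count, duplicate_count) without modifying any file."""
    groups = group_by_md5(data_dir)
    duplicates = sum(len(paths) - 1 for paths in groups.values())
    return len(groups), duplicates
```

Calling `report('./structure/')` returns the two counts that the script prints, assuming the files are byte-identical copies of each other.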

Usage

  1. Data layout expected by the sample code: the structure folder contains several subfolders, each holding the images to be checked, and the script sits at the same level as the structure folder
  2. Configure the parameters in the Config class
    • Data path (e.g. "./structure/", with a trailing "/")
    • Save path (e.g. "./save/", with a trailing "/")
    • Whether to delete duplicate images (isRemove=True deletes them)
    • Whether to rename duplicate images (isRename=True renames them)
    • Whether to save all unique images (isSave=True saves them)
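
The script checks isRemove in an `if` and isRename in the following `elif`, so deletion takes priority when both flags are set. A minimal sketch of that precedence, using a hypothetical helper `action_for_duplicate` that is not part of the original script:

```python
def action_for_duplicate(is_remove, is_rename):
    """Mirror the script's if/elif: deletion is checked before renaming."""
    if is_remove:
        return 'remove'
    elif is_rename:
        return 'rename'
    return 'keep'


# Show the outcome for every combination of the two flags.
for is_remove in (False, True):
    for is_rename in (False, True):
        print(is_remove, is_rename, action_for_duplicate(is_remove, is_rename))
```

With the default configuration shown above (isRename=True, isRemove=False), duplicates are renamed and nothing is deleted. isSave is independent of both flags because it only applies to unique images.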


How it works: the script computes the MD5 value of each image's pixel data and treats two images as duplicates when their MD5 values are identical.
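
One detail of this approach is that only the raw pixel buffer goes into the hash, not the image dimensions. Two images with different shapes but the same sequence of bytes would therefore get the same MD5 value. The sketch below illustrates this with plain bytes standing in for pixel data; `pixel_md5` and the idea of mixing in the size are assumptions for illustration, not something the original script does.

```python
import hashlib


def pixel_md5(pixels, size=None):
    """MD5 of raw pixel bytes; optionally hash the (width, height) first."""
    md5 = hashlib.md5()
    if size is not None:
        md5.update(repr(size).encode('ascii'))
    md5.update(pixels)
    return md5.hexdigest()


pixels = bytes(range(6))  # six grayscale pixel values

# Same bytes, interpreted as a 2x3 image and as a 3x2 image.
same_without_size = pixel_md5(pixels) == pixel_md5(pixels)
same_with_size = pixel_md5(pixels, (2, 3)) == pixel_md5(pixels, (3, 2))
print(same_without_size, same_with_size)  # True False
```

For folders of ordinary photos this rarely matters, but it is a cheap safeguard when the images are small or synthetic.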

Background

The MD5 Message-Digest Algorithm is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value, used to check that information is transmitted completely and consistently. MD5 was designed by the American cryptographer Ronald Linn Rivest and published in 1992 to replace the MD4 algorithm. The algorithm is specified in RFC 1321. From 1996 onward it was shown to have weaknesses that make it possible to break, and for data that requires a high degree of security experts generally recommend other algorithms such as SHA-2. In 2004 it was confirmed that MD5 is not collision resistant, so it is not suitable for security authentication such as SSL public key authentication or digital signatures.
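
Finding accidental duplicates is not a security task, so MD5 is usually adequate here, but switching to a SHA-2 function is a small change. A minimal sketch, assuming a hypothetical helper `hex_digest` built on the standard `hashlib.new` constructor:

```python
import hashlib


def hex_digest(data, algorithm='md5'):
    """Hash bytes with any algorithm name that hashlib supports."""
    return hashlib.new(algorithm, data).hexdigest()


data = b'example pixel bytes'

# Each hex character encodes 4 bits of the digest.
print(len(hex_digest(data, 'md5')) * 4)     # 128
print(len(hex_digest(data, 'sha256')) * 4)  # 256
```

The longer SHA-256 digest costs a little more time and memory per image, which only becomes noticeable on very large collections.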



Tags: Python image processing script

Posted by simon13 on Wed, 01 Jun 2022 05:23:12 +0530