Remove Duplicate Files Using Python

Emre Can Kuran
2 min read · Nov 12, 2021

Duplicate files waste storage space on a device, which can become a significant problem. In this article, I explain how to get rid of them. The method uses the MD5 hash function from the “hashlib” library and operating system operations from the “os” library, both of which are built into Python.

MD5 Hashing

MD5 is a widely used technique for hashing files. If two files are exactly the same, MD5 generates the same hash for both. Two different files can in principle share a hash, but an accidental collision is extremely unlikely. The hash is 128 bits long and is usually written as 32 hexadecimal digits. A sample MD5 for a sample text can be given as:

Text: A sample text

MD5: 787ab1c96890ad5c0f2be916a89ef6c4
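To try this yourself, here is a minimal sketch that hashes a string with “hashlib”. The helper name md5_of_text is my own choice for illustration, and I assume the text is encoded as UTF-8 before hashing:

```python
import hashlib


def md5_of_text(text: str) -> str:
    """Return the MD5 digest of a string as 32 hexadecimal characters."""
    # hashlib works on bytes, so the text is encoded first (UTF-8 assumed).
    return hashlib.md5(text.encode("utf-8")).hexdigest()


digest = md5_of_text("A sample text")
print(digest)       # the hexadecimal digest
print(len(digest))  # number of hexadecimal characters in the digest
```

The same input always gives the same digest, which is what makes the hash usable as a dictionary key in the next section.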

Algorithm

1. Create a dictionary that maps each MD5 hash (the key) to the list of file paths that have this hash (the value).
2. Traverse the dictionary. If the list for a key has more than one element, duplicates exist.
3. For each such list, keep the oldest file and remove the others.
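The three steps can be sketched on in-memory data before touching any real files. This is only an illustration: the function names and the (name, timestamp, content) tuples are assumptions made for the example, and nothing is deleted here.

```python
import hashlib
from collections import defaultdict


def group_by_hash(items):
    """Step 1: map each MD5 hash to the (timestamp, name) pairs that share it.

    `items` is an iterable of (name, timestamp, content_bytes) tuples.
    """
    groups = defaultdict(list)
    for name, timestamp, content in items:
        groups[hashlib.md5(content).hexdigest()].append((timestamp, name))
    return groups


def names_to_remove(groups):
    """Steps 2 and 3: for every hash with more than one entry,
    keep the oldest entry and report the others for removal."""
    doomed = []
    for entries in groups.values():
        if len(entries) > 1:
            oldest_first = sorted(entries)  # smallest timestamp first
            doomed.extend(name for _, name in oldest_first[1:])
    return doomed


items = [
    ("a.txt", 1, b"same content"),
    ("b.txt", 2, b"same content"),
    ("c.txt", 3, b"different content"),
]
print(names_to_remove(group_by_hash(items)))
```

Here a.txt and b.txt have identical content, so only the newer one (b.txt) is reported, while c.txt is left alone.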

Implementation & Explanation

First, import the built-in “hashlib” and “os” libraries.

We will also use a function named “calcSizeAndDel” that deletes a given file and returns its size:
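A minimal sketch of such a helper could look like the following. Only the name calcSizeAndDel comes from the article; the body is my assumption of what it does, with the size reported in bytes:

```python
import os


def calcSizeAndDel(path):
    """Delete the file at `path` and return the size it occupied, in bytes."""
    # The size has to be read first, because it is gone once the file is removed.
    size = os.path.getsize(path)
    os.remove(path)
    return size
```

Returning the size makes it easy to sum up how much space was freed in total.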

The function given below builds (or updates) a dictionary for the files in the given directory, in order to keep the hashes and the files that match them:
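One possible shape for this step is sketched here. The names file_md5 and build_hash_dict, the 64 KiB chunk size, and the choice to walk subfolders with os.walk are all assumptions for illustration. Files are read in chunks so that large files do not have to fit into memory:

```python
import hashlib
import os

CHUNK_SIZE = 64 * 1024  # assumed value; any reasonable chunk size works


def file_md5(path, chunk_size=CHUNK_SIZE):
    """Return the MD5 hex digest of a file, reading it chunk by chunk."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def build_hash_dict(folder, hash_dict=None):
    """Map each MD5 hash to the list of file paths under `folder` that have it.

    Passing an existing dictionary updates it instead of starting a new one.
    """
    if hash_dict is None:
        hash_dict = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            hash_dict.setdefault(file_md5(path), []).append(path)
    return hash_dict
```

Because the dictionary can be passed back in, the same function can be called for several folders to collect all matches in one place.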

The function given below helps to remove these duplicates:
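A sketch of the removal step, under two assumptions: “oldest” is taken to mean the smallest modification time (os.path.getmtime), and the function name remove_duplicates is hypothetical. The size lookup and deletion are written inline here so the example runs on its own; they do the same work as calcSizeAndDel:

```python
import os


def remove_duplicates(hash_dict):
    """Delete all but the oldest file in each group of duplicates.

    `hash_dict` maps a hash to a list of file paths.
    Returns (number of files removed, bytes freed).
    """
    removed, freed = 0, 0
    for paths in hash_dict.values():
        if len(paths) < 2:
            continue  # a single file has no duplicates
        oldest_first = sorted(paths, key=os.path.getmtime)
        for path in oldest_first[1:]:  # skip the oldest, delete the rest
            freed += os.path.getsize(path)
            os.remove(path)
            removed += 1
    return removed, freed
```

If you would rather keep the newest copy, sorting with reverse=True is enough to change the behavior.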

Finally, we will call these functions from main:
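Putting the pieces together, a compact end-to-end sketch might look like this. It restates short versions of the helpers so that it runs on its own. Apart from calcSizeAndDel, the function names are hypothetical, and main takes the folder as a parameter here instead of a hard-coded location:

```python
import hashlib
import os


def calcSizeAndDel(path):
    """Delete a file and return its size in bytes."""
    size = os.path.getsize(path)
    os.remove(path)
    return size


def build_hash_dict(folder, chunk_size=64 * 1024):
    """Map each MD5 hash to the list of files under `folder` that have it."""
    hash_dict = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            md5 = hashlib.md5()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    md5.update(chunk)
            hash_dict.setdefault(md5.hexdigest(), []).append(path)
    return hash_dict


def remove_duplicates(hash_dict):
    """Keep the oldest file (by modification time) of each group, delete the rest."""
    removed, freed = 0, 0
    for paths in hash_dict.values():
        for path in sorted(paths, key=os.path.getmtime)[1:]:
            freed += calcSizeAndDel(path)
            removed += 1
    return removed, freed


def main(folder):
    """Scan `folder`, remove duplicates, and report what was freed."""
    removed, freed = remove_duplicates(build_hash_dict(folder))
    print(f"Removed {removed} duplicate file(s), freed {freed} bytes.")
    return removed, freed
```

Calling main with the path of your folder starts the cleanup. Since files are really deleted, it is a good idea to try it on a copy of the folder first.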

Recommendations and Notes:

  • The chunk size is up to you, or you might use Python’s memory-related functions to select an appropriate value.
  • Do not forget to change the folder location. A GUI might be helpful for selecting the folder.
  • You can skip large files to speed up the process.
  • Be careful when you are using the code, so that you do not delete your important files.
  • Other optimizations can increase performance, such as recording file sizes and comparing them first, since files of different sizes cannot be duplicates.
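The last two speed-ups can be sketched together: group files by size first, optionally skip files above a size limit, and hash only the groups that still have more than one member. The function name candidate_groups and the max_size parameter are assumptions for illustration:

```python
import os
from collections import defaultdict


def candidate_groups(folder, max_size=None):
    """Return lists of paths that share the same size and may be duplicates.

    Files larger than `max_size` bytes are skipped when a limit is given.
    """
    by_size = defaultdict(list)
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            if max_size is not None and size > max_size:
                continue  # ignore big files to save time
            by_size[size].append(path)
    # A file with a unique size cannot have a duplicate, so it is dropped here.
    return [paths for paths in by_size.values() if len(paths) > 1]
```

Only the files in the returned groups need to be hashed, which avoids reading every file in the folder.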

Contact for questions:

LinkedIn: https://www.linkedin.com/in/emre-can-kuran-8470b01b0/

E-mail: emrecankuran21@gmail.com
