
Remove Duplicate Files Using Python
Duplicate files waste storage space on a device, which can become a significant problem. In this article, I explain how to get rid of them. The method uses the MD5 hash function from the “hashlib” library and file-system operations from the “os” library, both of which are part of Python’s standard library.
MD5 Hashing
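MD5 is a widely used function for hashing files. If two files have exactly the same content, MD5 produces the same hash for both. The method relies on the reverse as well: two different files producing the same hash by accident is extremely unlikely, although MD5 is not collision-resistant against deliberately crafted files. An MD5 hash is 128 bits long and is usually written as 32 hexadecimal digits. A sample MD5 hash for a sample text is: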
Text: A sample text
MD5: 787ab1c96890ad5c0f2be916a89ef6c4
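As a small sketch of how such a hash can be computed with “hashlib”, the snippet below hashes the sample text. The UTF-8 encoding is an assumption; a different encoding or a trailing newline would change the result.

```python
import hashlib

text = "A sample text"

# hashlib works on bytes, so the string is encoded first (UTF-8 assumed here).
digest = hashlib.md5(text.encode("utf-8")).hexdigest()

print(digest)       # the hash as a hexadecimal string
print(len(digest))  # number of hexadecimal digits in the hash
```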
Algorithm
1. Create a dictionary that maps each MD5 hash (the key) to the list of file paths that produce that hash (the value).
2. Traverse the dictionary. If the list for a key has more than one element, those files are duplicates of each other.
3. For each such list, keep the oldest file and remove the others.
Implementation & Explanation
First, import the following built-in libraries:
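A minimal sketch of these imports, assuming only the two libraries named in the introduction are needed:

```python
import hashlib  # provides the MD5 hash function
import os       # provides directory traversal and file removal
```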
We will also use a function named “calcSizeAndDel” that deletes the given file and returns its size:
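One possible shape for this function is sketched below. Only the name “calcSizeAndDel” comes from the text; the parameter name and the choice to report the size in bytes are assumptions.

```python
import os


def calcSizeAndDel(path):
    """Delete the file at `path` and return the size it had, in bytes."""
    # The size has to be read before the file is removed.
    size = os.path.getsize(path)
    os.remove(path)
    return size
```

Returning the size makes it easy to add up how much space was freed in total.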
The function below builds (or updates) a dictionary for the files in the given directory, mapping each hash to the paths of the files that produce it:
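A sketch of such a function might look like the following. The names `md5_of_file`, `build_hash_dict` and `CHUNK_SIZE` are hypothetical, as is the chunk size of 64 KiB and the decision to descend into subfolders with `os.walk`.

```python
import hashlib
import os

CHUNK_SIZE = 65536  # assumed read size in bytes; adjust as needed


def md5_of_file(path, chunk_size=CHUNK_SIZE):
    """Return the MD5 hex digest of a file, reading it chunk by chunk."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Reading in chunks keeps memory use low for large files.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def build_hash_dict(folder, hash_dict=None):
    """Add every file under `folder` to `hash_dict` as hash -> [paths]."""
    if hash_dict is None:
        hash_dict = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            hash_dict.setdefault(md5_of_file(path), []).append(path)
    return hash_dict
```

Passing an existing dictionary as the second argument lets several folders be merged into one table before any file is removed.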
The function below then removes the duplicates:
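A sketch of the removal step is given here under two assumptions: “oldest” is taken to mean the earliest modification time (`os.path.getmtime`), and the function name `remove_duplicates` is hypothetical. “calcSizeAndDel” is defined inline so that the block runs on its own.

```python
import os


def calcSizeAndDel(path):
    """Delete the file at `path` and return the size it had, in bytes."""
    size = os.path.getsize(path)
    os.remove(path)
    return size


def remove_duplicates(hash_dict):
    """Keep the oldest file for each hash, delete the rest, return bytes freed."""
    freed = 0
    for paths in hash_dict.values():
        if len(paths) < 2:
            continue  # only one file has this hash, nothing to remove
        # Sort by modification time so the oldest file comes first.
        oldest_first = sorted(paths, key=os.path.getmtime)
        for path in oldest_first[1:]:
            freed += calcSizeAndDel(path)
    return freed
```

Replacing the call to `calcSizeAndDel` with a `print` is a simple way to do a dry run before deleting anything.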
Finally, we will call these functions from main:
Full code:
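Putting the steps together, a complete script could look like the sketch below. It is an illustration under stated assumptions, not the original listing: apart from “calcSizeAndDel”, all names are hypothetical, `FOLDER` is a placeholder that must be changed, the chunk size is an arbitrary choice, and “oldest” is taken to mean the earliest modification time.

```python
import hashlib
import os

FOLDER = "/path/to/your/folder"  # hypothetical placeholder: change this
CHUNK_SIZE = 65536               # assumed read size in bytes


def calcSizeAndDel(path):
    """Delete the file at `path` and return the size it had, in bytes."""
    size = os.path.getsize(path)
    os.remove(path)
    return size


def md5_of_file(path):
    """Return the MD5 hex digest of a file, reading it chunk by chunk."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b""):
            md5.update(chunk)
    return md5.hexdigest()


def build_hash_dict(folder):
    """Map each MD5 hash to the list of files under `folder` that have it."""
    hash_dict = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            hash_dict.setdefault(md5_of_file(path), []).append(path)
    return hash_dict


def remove_duplicates(hash_dict):
    """Keep the oldest file for each hash, delete the rest, return bytes freed."""
    freed = 0
    for paths in hash_dict.values():
        # The slice skips the oldest file; single-file lists yield nothing.
        for path in sorted(paths, key=os.path.getmtime)[1:]:
            freed += calcSizeAndDel(path)
    return freed


def main(folder=FOLDER):
    # Refuse to run on a path that is not an existing folder.
    if not os.path.isdir(folder):
        print(f"Not a folder: {folder}")
        return 0
    freed = remove_duplicates(build_hash_dict(folder))
    print(f"Freed {freed} bytes")
    return freed


if __name__ == "__main__":
    main()
```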
GitHub repository:
Recommendations and Notes:
- The chunk size is up to you; you might also use Python’s memory-related functions to choose an appropriate value.
- Do not forget to change the folder location. A GUI might be helpful for selecting the folder.
- You can skip large files to speed up the process.
- Be careful when using the code; do not delete your important files.
- Other optimizations can improve performance, such as comparing file sizes first and hashing only the files whose sizes match.
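The size-based notes above can be sketched as a pre-filter that runs before any hashing. The function name `same_size_groups` and its `max_size` parameter are hypothetical. Files with different sizes cannot be identical, so only the groups returned here would need to be hashed.

```python
import os
from collections import defaultdict


def same_size_groups(folder, max_size=None):
    """Return lists of paths under `folder` that share the same file size.

    Files larger than `max_size` bytes are skipped when it is given.
    """
    by_size = defaultdict(list)
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            if max_size is not None and size > max_size:
                continue  # ignore big files to speed things up
            by_size[size].append(path)
    # A size that occurs only once cannot belong to a duplicate.
    return [paths for paths in by_size.values() if len(paths) > 1]
```

Reading a file size is much cheaper than reading the whole file, so this step can avoid most of the hashing work in folders where few files share a size.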
Contact for questions:
LinkedIn: https://www.linkedin.com/in/emre-can-kuran-8470b01b0/
E-mail: emrecankuran21@gmail.com