Find duplicate copies of files
October 8, 2005
Posted by Carthik in applications, ubuntu.
fdupes is a command-line program for finding duplicate files within specified directories.
I have quite a few mp3s and ebooks and I suspected that at least a few of them were copies – you know – as your collection grows by leaps and bounds, thanks to friends, it becomes difficult to individually check each file to see if it is already there on your computer. So I started looking for a script that checks for duplicate files in an intelligent fashion. I didn’t find the script but I did find fdupes.
fdupes compares file sizes and MD5 hashes, then confirms each match with a byte-by-byte comparison, so the program identifies duplicates correctly. I let it run recursively in the directory which contains my files (which makes it check for duplicates across different directories within the specified directory), and saved the output to a file by doing:
$ fdupes -r ./stuff > dupes.txt
Then, deleting the duplicates was as easy as checking dupes.txt and deleting the offending directories. fdupes can also prompt you to delete the duplicates as you go along, but I had way too many files and wanted to do the deleting at my own pace. The deletion function is useful if you are only checking for duplicates in a given directory with a few files in it.
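Checking dupes.txt by hand gets tedious for a long list. fdupes writes each set of duplicates as a block of paths, with a blank line between blocks, so the file can be reduced to just the redundant copies. Here is a minimal sketch, assuming that layout and that no filename contains a newline. The name extra_copies is a made-up helper, not part of fdupes.

```shell
# extra_copies FILE
# Print every path in each duplicate group except the first one.
# Assumes groups are separated by blank lines, as in fdupes output.
extra_copies() {
    awk '
        BEGIN { keep = 1 }
        /^$/  { keep = 1; next }   # a blank line ends the current group
        keep  { keep = 0; next }   # first path of a group: the copy to keep
              { print }            # every further path is a redundant copy
    ' "$1"
}
```

Running something like extra_copies dupes.txt > to_delete.txt gives a list that can still be reviewed and trimmed before anything is removed. The sketch always keeps the first path in each group, which may not be the copy you prefer.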

What do you know, you learn something new every day. Thanks for this.
Pascal
You should also try FSlint by Pádraig Brady. It displays duplicate files in a nice GUI.
I’ve used it for years and it’s dead useful!
Although MD5 is more advanced than CRC, hash values can still collide. NoClone uses true byte-by-byte comparison to avoid such cases: http://noclone.net
Thanks Donncha, Alan. I think I will try out fslint and noclone soon – the next time I want to clean out my collection of files.
Hey Alan, the NoClone program requires Windows. Why would you post a link to a Windows program on a Linux blog?
I have been looking for a good Linux tool for this all night. Great! Can’t wait to try it when I get in front of my Ubuntu boxes.
what are you using to read ebooks?
Justin,
Some of them are pdf versions of books, like some O’Reilly books. Some are comics, which I read using Comical. Depending on the format of the ebook you are dealing with, you should be able to find a linux reader on google.
fdupes made errors (too many open files) on my huge harddisk.
It is rather handy that the fslint site has an RPM, a .deb as well as a tarball. Trying it out on openSUSE with the pre-built RPM, it requires the RPMs pygtk and pyglade, which are actually provided by the python-gtk RPM in SUSE. It’s a shame the RPM’s dependencies were not specified by file.
I might (depending on the success or failure of ignoring the warnings about conflicts for this package under YaST) build a new RPM using CheckInstall, and submit that as feedback for the guy (or pop it on RPMBone).
The GUI itself loaded no problems from the RPM, despite the warnings. After some serious disk thrashing – problem solved.
I had spent some time during my weeks without internet (different story), trying to figure out scripts to do this, and found it a harder problem than it seemed. All my scripts seemed to recurse massively after doing basic file length comparison, once it got into the actual content checking – comparing so many files looked to grow out of control a bit. So my hat off to the chaps behind fslint.
Simple and perfect!
I’ve tried five different tools under Windows without finding a good solution. I’m definitely happy to use a Linux box, and thank you for explaining this so clearly.
This rocks. Along these lines, I’ve found command line tools are indispensable when dealing with large numbers of files. Here’s a trick to count the number of files within a directory:
ls -1Ra | wc | awk '{printf("There are %s files in this directory!\n",$1-2)}'
Danny thanks for the feedback on the FSlint rpm.
It’s a pity that the package names differ among distributions. A quick look around suggests the following
should be the dependencies:
fedora/redhat: pygtk2-libglade, pygtk2
mandriva: pygtk2.0-libglade, pygtk2.0
opensuse/suse: python-gtk
One can’t create an RPM that depends on package1 | package2.
The next best thing, I think, is to automatically support the correct
dependencies when built from the source RPM.
I’ve done this for fedora and mandriva as of 2.16,
so I’ll look at supporting [open]suse also.
thanks.
Hi Albert. Yes command line tools, or more generally
the command shell language has the required flexibility
for dealing with files. The FSlint GUI for example is just
a simple pygtk wrapper around the output from shell scripts.
One can invoke the shell scripts directly by adding
the fslint scripts directory to the path like:
export PATH="$PATH:/usr/share/fslint/fslint"
Then you can do `findup --help` etc.
Note a more robust/accurate/fast version of the example
you gave above is:
printf "There are %'d files in this directory\n" `find | wc -l`
You might find the following of use:
http://www.pixelbeat.org/cmdline.html
Robert – You could also just use:
\ls -1 | wc -l
to count files in a directory.
FSlint rocks! This is a handy tool for all my pictures.
Thanks Brady!
If you need to search for similar music and graphic files on Windows, you can use this duplicate file finder.
Andrew:
>Why would you post a link to a Windows program on a Linux blog?
Because you can use WINE to emulate the Windows program.
Because some folks use Linux and Windows simultaneously.
Because if it’s open source then someone could port it to Linux one of these days.
Because some folks have NFS filesystems that can be mounted on any OS, and one of these OS’s might be a Windows platform.
Because a Windows user googling ‘find duplicate copies of files’ might find this page, which could save them a couple of minutes of searching for a solution.
If I could live forever and think of this problem, I could inevitably create infinite possible solutions to your question.
-Alan
>Because if it’s open source then someone could port it to Linux one of these days.
noclone isn’t open source
Thanks for the mention of fdupes, perfect timing, as I needed to clean out a bunch of stuff, and fdupes is in Ubuntu’s repository.
[...] of them, over 1GB. I decided to ask Google whether it knew of anything and found this blog: Find duplicate copies of files, and in a comment I found the [...]
Related to this is deleting duplicated files (in my case desktop.ini and thumbs.db).
I wrote a howto for deleting these files recursively.
Find it here: http://en.tuxero.com/2007/09/how-to-delete-useless-windows-files-in.html
Cheers!
I was wondering if you could add a size option to your “very useful” program. Sometimes we just can’t waste time with small files.
Thanks
Fslint is pretty nice, but the interface is not very useful. It requires you to delete each file by hand. Even this free simple Windows program is better: http://www.geocities.com/hirak_99/goodies/finddups.html
This one is quite useful. You never know every useful utility there is. Thanks.
This is gonna take a while… 15min and still at zero %. At least it’s at [317/605437] so I know it’s moving
Thanks for the tip, just what I was looking for. I could just apt-get it from debian sid by the way.
Have to agree with endolith.
Fslint has zero usable functions for removing duplicate files. Toggling between Select All and Select None serves little purpose on its own!
Plus it seems to only compare filenames, not file contents, returning multiple false positives.
fdupes + shell script wins hands down
[...] the FSlint application under Applications → System Tools [...]
exactly what I was after many thanks
while IFS= read -r line; do rm -f -- "$line"; done < dupes.txt
This command removes ALL files listed in the generated dupes.txt file (be sure to first remove the lines for the files you want to keep).
In FSlint you can select by groups -> all but newer, for example. It’s the best selection system I’ve ever seen. Don’t judge the app before you read the manual :p
Very nice, very useful.
Any ideas on how to not just delete the dups but replace them with a symlink to the original?
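One possible approach, sketched here rather than taken from fdupes or FSlint: confirm that the two files really are identical, then swap the duplicate for a link. The function name link_duplicate is hypothetical, and the sketch assumes bash (or a shell whose test supports -ef) and GNU coreutils.

```shell
# link_duplicate ORIGINAL DUPLICATE
# Replace DUPLICATE with a symlink to ORIGINAL, but only when both are
# regular files with byte-for-byte identical contents.
link_duplicate() {
    # Both must be regular files, and DUPLICATE must not already be a symlink.
    [ -f "$1" ] && [ -f "$2" ] && [ ! -L "$2" ] || return 1
    # Refuse if both names refer to the same file; linking would destroy it.
    if [ "$1" -ef "$2" ]; then return 1; fi
    # cmp -s prints nothing and exits 0 only when the contents match exactly.
    cmp -s -- "$1" "$2" || return 1
    # Use an absolute target so the link resolves from any directory.
    local target
    target=$(cd "$(dirname -- "$1")" && pwd)/$(basename -- "$1")
    ln -sf -- "$target" "$2"
}
```

For example, link_duplicate ./stuff/a.mp3 ./stuff/copies/a.mp3 would leave a.mp3 in place and turn the copy into a link. An absolute target is a design choice; a relative link survives moving the whole tree but takes more work to compute.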
There are issues with this. As previously mentioned, an MD5 hash has a chance of a collision. That means you might end up deleting files that are unique. Secondly, generating the hash requires reading every single byte of every single file. This is time consuming. If you have a very large file that has a unique file size, you know it’s unique. The best way to do this is to generate a table of files with their sizes, sort the table by size, throw out the files that have a unique size, and then just compare files that have the same size.
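The size-table idea above can be sketched in a few lines of shell. This is only an illustration of the filtering step, assuming GNU find (for -printf) and filenames without newlines. The name same_size_candidates is made up for the example.

```shell
# same_size_candidates DIR
# Print "SIZE<TAB>PATH" for every regular file under DIR whose size is
# shared by at least one other file. Files with a unique size are dropped.
same_size_candidates() {
    find "$1" -type f -printf '%s\t%p\n' |
        sort -n |
        awk -F '\t' '
            # Remember each line and count how many files have each size.
            { size[NR] = $1; line[NR] = $0; count[$1]++ }
            # Emit only the lines whose size occurs more than once.
            END { for (i = 1; i <= NR; i++) if (count[size[i]] > 1) print line[i] }
        '
}
```

Only the files this prints need their contents read at all. Everything else has already been ruled out by size alone.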
[...] [Vía] Ubuntu blog – Find duplicate copies of files [...]
Wow, even more than 3 years later this information is proving very useful. Thank you very much!
Albert, ls -1Ra | wc | awk '{printf("There are %s files in this directory!\n",$1-2)}'
does not always work. If the directory has subdirectories, the count is wrong because folders are counted too.
Instead, you need:
find . -type f | wc -l
@those talking about md5 collisions ..
From the fdupes man page
DESCRIPTION
Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.
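To illustrate the order of checks the man page describes, here is a rough sketch for a single pair of files. It is not how fdupes is implemented internally, just the same three stages run cheapest first. are_duplicates is a hypothetical name, and md5sum and cmp are assumed to be the GNU versions.

```shell
# are_duplicates FILE_A FILE_B
# Exit status 0 only if the two files pass all three checks.
are_duplicates() {
    # Stage 1: the byte counts must match.
    [ "$(wc -c < "$1")" -eq "$(wc -c < "$2")" ] || return 1
    # Stage 2: the MD5 digests must match. Reading from stdin keeps the
    # filename out of the md5sum output, so the lines compare directly.
    [ "$(md5sum < "$1")" = "$(md5sum < "$2")" ] || return 1
    # Stage 3: a byte-by-byte comparison rules out a hash collision.
    cmp -s -- "$1" "$2"
}
```

Because the final stage compares the actual bytes, a collision in stage 2 cannot cause two different files to be reported as duplicates.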
I’d give a look to komparator, does hash and binary comparison.
md5 is known to have issues. I have just finished creating a tool that uses sha-224 as a checksum tool to find duplicates in a given directory:
http://code.google.com/p/liten2
For windows I use Fast Duplicate File Finder…very nice free tool
fslint is one way to find and eliminate duplicates……
3 easy steps to resolving the hassle of manual duplicate file cleanup in your iTunes library, thanks to fslint-gui
……