Find duplicate copies of files October 8, 2005
Posted by Carthik in applications, ubuntu.
fdupes is a command-line program for finding duplicate files within specified directories.
I have quite a few mp3s and ebooks and I suspected that at least a few of them were copies – you know – as your collection grows by leaps and bounds, thanks to friends, it becomes difficult to individually check each file to see if it is already there on your computer. So I started looking for a script that checks for duplicate files in an intelligent fashion. I didn’t find the script but I did find fdupes.
fdupes compares files by size and MD5 hash, and according to its man page it then confirms each match with a byte-by-byte comparison, so the program identifies duplicates correctly. I let it run recursively in the directory which contains my files (which makes it check for duplicates across different directories within the specified directory), and saved the output to a file by doing:
$ fdupes -r ./stuff > dupes.txt
Then, deleting the duplicates was as easy as checking dupes.txt and deleting the offending directories. fdupes also can prompt you to delete the duplicates as you go along, but I had way too many files, and wanted to do the deleting at my own pace. The deletion function is useful if you are only checking for duplicates in a given directory with a few files in it.
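Before deleting anything, it can help to see how much dupes.txt actually lists. Here is a minimal sketch, assuming the default fdupes output format (one path per line, with a blank line between sets of identical files). The function name summarize_dupes is made up for illustration:

```shell
# Summarise an fdupes listing: each set of identical files is printed as
# consecutive lines, and a blank line separates one set from the next.
summarize_dupes() {
    awk '
        $0 == "" { in_set = 0; next }          # blank line ends the current set
        !in_set  { sets++; in_set = 1; next }  # first file of a set: the copy to keep
                 { extra++ }                   # every further line is a redundant copy
        END { printf "%d sets, %d redundant copies\n", sets, extra }
    ' "$1"
}
```

Running `summarize_dupes dupes.txt` prints a single summary line, which gives a rough idea of how much manual checking lies ahead.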







What do you know, you learn something new every day. Thanks for this.
Pascal
You should also try FSlint by Pádraig Brady. It displays duplicate files in a nice GUI.
I’ve used it for years and it’s dead useful!
Although MD5 is more advanced than CRC, its hash values can still collide. NoClone uses true byte-by-byte comparison to avoid such cases: http://noclone.net
Thanks Donncha, Alan. I think I will try out fslint and noclone soon – the next time I want to clean out my collection of files.
Hey, Alan the NoClone program requires Windows. Why would you post a link to a Windows program on a Linux blog?
I have been looking for a good linux tool for this all night. Great! Can’t wait to try it when I get in front of my ubuntu boxes.
what are you using to read ebooks?
Justin,
Some of them are pdf versions of books, like some O’Reilly books. Some are comics, which I read using Comical. Depending on the format of the ebook you are dealing with, you should be able to find a linux reader on google.
fdupes made errors (too many open files) on my huge harddisk.
It is rather handy that the fslint site has an RPM, a .deb as well as a tarball. I tried it out on openSUSE with the pre-built RPM, which requires the pygtk and pyglade RPMs; on SUSE these are actually listed under the python-gtk RPM. It’s a shame the RPM was not built with file-based dependencies.
I might (depending on the success or failure of ignoring the warnings about conflicts for this package under YaST) build a new RPM using CheckInstall and submit that as feedback for the guy (or pop it on RPMBone).
The GUI itself loaded no problems from the RPM, despite the warnings. After some serious disk thrashing – problem solved.
I had spent some time during my weeks without internet (different story), trying to figure out scripts to do this, and found it a harder problem than it seemed. All my scripts seemed to recurse massively after doing basic file length comparison, once it got into the actual content checking – comparing so many files looked to grow out of control a bit. So my hat off to the chaps behind fslint.
Simple and perfect!
I’ve tried 5 different tools under Windows without finding a good solution. I’m definitely happy to use a Linux box, and I thank you for being so easy to understand.
This rocks. Along these lines, I’ve found command line tools are indispensable when dealing with large amounts of files. Here’s a trick to count the number of files within a directory:
ls -1Ra | wc | awk '{printf("There are %s files in this directory!\n",$1-2)}'
ditto Peter Basil
and this also does not work if a file name or directory contains a space.
example:
jdu@igneous:~$ mkdir test
jdu@igneous:~$ cd test/
jdu@igneous:~/test$ touch a b 'c d'
jdu@igneous:~/test$ ls -1
a
b
c d
jdu@igneous:~/test$ ls -1Ra | wc | awk '{printf("There are %s files in this directory!\n",$1-2)}'
There are 4 files in this directory!
jdu@igneous:~/test$
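A sketch of a count that is not thrown off by spaces (or even newlines) in names: it counts one NUL terminator per regular file instead of counting lines of ls output. It assumes a find that supports -print0 (GNU and BSD find do), and count_files is a hypothetical helper name:

```shell
# Count regular files under a directory (default: the current one).
# find ends every path with a NUL byte; tr keeps only those bytes and
# wc counts them, so the characters in the names never matter.
count_files() {
    find "${1:-.}" -type f -print0 | tr -cd '\0' | wc -c | tr -d ' '
}
```

Because only regular files are matched, directories and the `.` and `..` entries are left out of the total.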
Danny thanks for the feedback on the FSlint rpm.
It’s a pity that the package names are different among distributions. A quick look around suggests the following
should be the dependencies:
fedora/redhat: pygtk2-libglade, pygtk2
mandriva: pygtk2.0-libglade, pygtk2.0
opensuse/suse: python-gtk
One can’t create an RPM to check package1 | package2.
The next best thing, I think, is to automatically support the correct
dependencies when built from the source RPM.
I’ve done this for fedora and mandriva as of 2.16,
so I’ll look at supporting [open]suse also.
thanks.
Hi Albert. Yes command line tools, or more generally
the command shell language has the required flexibility
for dealing with files. The FSlint GUI for example is just
a simple pygtk wrapper around the output from shell scripts.
One can invoke the shell scripts directly by adding
the fslint scripts directory to the path like:
export PATH="$PATH:/usr/share/fslint/fslint"
Then you can do `findup --help` etc.
Note a more robust/accurate/fast version of the example
you gave above is: printf "There are %'d files in this directory\n" `find | wc -l`
You might find the following of use:
http://www.pixelbeat.org/cmdline.html
Robert – You could also just use:
\ls -1 | wc -l
to count files in a directory.
FSlint rocks! This is a handy tool for all my pictures.
Thanks Brady!
If you need to search for similar music and graphics files on Windows, you can use this duplicate file finder.
Andrew:
>Why would you post a link to a Windows program on a Linux blog?
Because you can use WINE to emulate the Windows program.
Because some folks use Linux and Windows simultaneously.
Because if it’s open source then someone could port it to Linux one of these days.
Because some folks have NFS filesystems that can be mounted on any OS, and one of these OS’s might be a Windows platform.
Because a Windows user googling ‘find duplicate copies of files’ might find this page, saving them perhaps a couple of minutes of searching for a solution.
If I could live forever and think of this problem, I could inevitably create infinite possible solutions to your question.
-Alan
Nice answers…
Because if it’s open source then someone could port it to Linux one of these days.
noclone isn’t open source
Thanks for the mention of fdupes, perfect timing, as I needed to clean out a bunch of stuff, and fdupes is in Ubuntu’s repository.
[...] of them, at more than 1GB. I decided to ask Google whether it knew of anything and found this blog: Find duplicate copies of files, and in a comment I found the [...]
related to this is deleting duplicated files (in my case desktop.ini and thumbs.db)
I wrote a howto for deleting these files recursively:
find it here: http://en.tuxero.com/2007/09/how-to-delete-useless-windows-files-in.html
Cheers!
I was wondering if you could add a size option to your “very useful” program. Sometimes we just can’t waste time with small files.
Thanks
FSlint is pretty nice, but the interface is not very useful: it requires you to delete each file by hand. Even this free simple Windows program is better: http://www.geocities.com/hirak_99/goodies/finddups.html
This one is quite useful. You never know every useful utility there is. Thanks.
This is gonna take a while… 15min and still at zero %. At least it’s at [317/605437] so I know it’s moving
Thanks for the tip, just what I was looking for. I could just apt-get it from debian sid by the way.
Have to agree with endolith.
FSlint has zero usable functions for removing duplicate files. Toggling between Select All and Select None serves little purpose on its own!
Plus it seems to only compare filenames, not file contents, returning multiple false positives.
fdupes + shell script wins hands down
[...] the FSlint application under Applications → System Tools [...]
exactly what I was after many thanks
while IFS= read -r line; do [ -n "$line" ] && rm -f -- "$line"; done < dupes.txt
This command removes ALL files listed in the generated dupes.txt file, so first be sure to delete from it the lines for the files you want to keep.
In FSlint you can select by groups -> all but newer, for example; it’s the best selection system I’ve ever seen. Don’t judge the app before you read the manual :p
Very nice, very useful.
Any ideas on how to not just delete the dups but replace them with a symlink to the original?
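One possible approach, sketched under some assumptions: the listing is a file such as dupes.txt in the default fdupes format (sets separated by blank lines), the paths in it are absolute (which you should get by running fdupes on an absolute directory), and no file name contains a newline. The function name link_dupes is hypothetical:

```shell
# Replace every duplicate in an fdupes listing with a symlink to the
# first file of its set. Relative paths would produce broken links,
# so the listing is assumed to hold absolute paths.
link_dupes() {
    keep=""
    while IFS= read -r path; do
        if [ -z "$path" ]; then
            keep=""                      # blank line: the next path starts a new set
        elif [ -z "$keep" ]; then
            keep=$path                   # first file of the set stays as the original
        elif cmp -s -- "$keep" "$path"; then
            ln -sf -- "$keep" "$path"    # verified copy becomes a symlink
        fi
    done < "$1"
}
```

The extra cmp check means a file is only replaced if it is still byte-for-byte identical to the kept copy at the moment the script runs. Try it on a scratch directory first.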
There are issues with this. As previously mentioned, an MD5 hash has a chance of a collision. That means you might end up deleting files that are unique. Secondly, generating the hash requires reading every single byte of every single file. This is time consuming. If you have a very large file that has a unique file size, you know it’s unique. The best way to do this is to generate a table of files with their size, sort the table based on size, throw out the files that have a unique size, and then just compare files that have the same size.
That’s right. But can you give us the commands we should run in a terminal, with some details and explanations of how to do all those things?
Theory is OK but we need the commands
Thanks.
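The size-first idea described above can be sketched roughly like this. It is only an illustration, not how fdupes or FSlint is implemented: it assumes GNU find, xargs, md5sum and uniq (as shipped on Ubuntu) and file names without newlines, and find_dupes_by_size is a made-up name. Files whose size is unique are never read; only the remaining candidates are checksummed.

```shell
# Print groups of files under a directory that share both size and MD5 sum.
find_dupes_by_size() {
    # Table of "size<TAB>path", one row per regular file.
    find "${1:-.}" -type f -printf '%s\t%p\n' |
    awk -F'\t' '
        { count[$1]++; rows[NR] = $0 }
        END {
            # Keep only paths whose size is shared with another file.
            for (i = 1; i <= NR; i++) {
                split(rows[i], f, "\t")
                if (count[f[1]] > 1) print substr(rows[i], length(f[1]) + 2)
            }
        }' |
    tr '\n' '\0' |
    xargs -0 -r md5sum |                  # checksum the same-size candidates only
    sort |                                # bring equal checksums together
    uniq -w32 --all-repeated=separate     # show repeated checksums, one group per block
}
```

Each printed block is a set of likely duplicates. To rule out the collision concern, the files in a block can be confirmed with `cmp` before anything is deleted.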
[...] [Vía] Ubuntu blog – Find duplicate copies of files [...]
Wow, even more than 3 years later this information is proving very useful. Thank you very much!
Albert, ls -1Ra | wc | awk '{printf("There are %s files in this directory!\n",$1-2)}'
does not always work. If the directory has subdirectories, the result is wrong because folders are counted as well.
Instead, you need:
find . -type f | wc -l
@those talking about md5 collisions ..
From the fdupes man page
DESCRIPTION
Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures,
followed by a byte-by-byte comparison.
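To make the order of those checks concrete, here is a small sketch of the same three stages for a single pair of files. It is not taken from the fdupes source; it just chains standard tools (wc, md5sum from GNU coreutils, and cmp), and same_file is a hypothetical name:

```shell
# Succeed only if the two files are identical, ruling pairs out early.
same_file() {
    # 1. Files of different sizes can never be duplicates.
    [ "$(wc -c < "$1")" -eq "$(wc -c < "$2")" ] || return 1
    # 2. Different MD5 signatures rule the pair out as well.
    [ "$(md5sum < "$1")" = "$(md5sum < "$2")" ] || return 1
    # 3. A byte-by-byte comparison settles it, so a hash collision
    #    cannot produce a false match.
    cmp -s -- "$1" "$2"
}
```

Used as `same_file one.mp3 two.mp3 && echo duplicate`, it reports a match only when the final byte-by-byte comparison passes.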
I’d take a look at Komparator; it does hash and binary comparison.
md5 is known to have issues. I have just finished creating a tool that uses sha-224 as a checksum tool to find duplicates in a given directory:
http://code.google.com/p/liten2
For windows I use Fast Duplicate File Finder…very nice free tool
fslint is one way to find and eliminate duplicates……
3 easy steps to resolving the hassle of manual duplicate file cleanup in your iTunes library, thanks to fslint-gui
……
thankkss ouuu
much obliged!!
This is very useful tool to delete duplicate files from the system, i use duplicate finder 2009
Thanks a lot for this tip!
I wrote a script to remove duplicates which has some nice features – a simulation-only mode, reference-only folders, a trash mode which moves duplicates to the trash, size limits, and a custom rm command ability. You can see the details and download it here…
http://igurublog.wordpress.com/downloads/script-rmdupe/
I figured there were other tools to do this but I wanted to write my own with the features I wanted. It has worked well for me. It also does a full compare, not just checksums (which as one person pointed out can result in false matches). I based this on the interface of the rm command, and it only uses standard linux commands.
fslint also looks good, but sometimes a command line approach is helpful.
Thanks for your tips, but I’ll give FSlint a try as comment #2 (Donncha) suggests. It’s a lightweight app (only about 100kb) with a simple, user-friendly GUI, but powerful!
Thanks to both of you
Thank you for posting this. fdupes is actually in my distribution, but was not installed. I would never have found it without your hint.
[...] You can have a look at this example using script, this one using fdupes or this one using fslint. All of this I found using Google in 0.31 seconds. It took [...]
True byte-by-byte comparison to avoid these cases: http://www.ashisoft.com
FDupes uses md5sums *and then* a byte by byte comparison to find duplicate files within a set of directories. It has several useful options including recursion.
Fdupes is very nice. I would however like to scan several external HDs where I store backups and photos. Is there a GUI? Any suggestions?
Nice find
Very useful, so using this now.
simple to install from synaptic package manager.
thanks
Garvin Timmann – PR International Ltd
3 Kingley Park, Station Road, Kings Langley, Hertfordshire, WD4 8GW, UK
Tel: +44 (0) 1923 270508 Fax: +44 (0)1923 269134
web: http://www.printernational.co.uk skype: printernational
Co.Reg: 1785226 England/ Wales VAT No: GB 449 4437
try this http://www.dublicatefilesdeleter.com/ very nice tool to remove any duplicate
picked up a book about quantum physics and super-string theory I have been meaning to
there’s some strange comments here.. looks like the SPAM bots are testing your blog.. be afraid. Soon this page could be filled with URL links to dodgy sites unless you fix the comment posting system.
Yeah. Remove the spam. And the Windows programs, it makes it easier to use this as a quick guide
Ah, or instead, just remove the spam, and let the comments be, but also mention FSlint from the comments, it looks really nice.
I have a quick piece of advice for all those who are looking to clean their computers of duplicate files. Do not delete any system file which is marked as a duplicate. I used a duplicate file finder to do this and my system crashed. Instead, limit this software to deleting user-created files and downloads. You are not going to save a lot of space by deleting these system files anyway, so they are best left alone.
There is also ‘rmlint’ ( https://github.com/sahib/rmlint ),
which beats fdupes in terms of speed, options and scriptability.
It outputs a log and a ready to use script, which is more useful than plain output.
GOOD
Thanks for the information, well done
Antalya home pest control
Something to be aware of (since this site came up high on a Google search): FDupes apparently *does not compare filenames*. Only sizes/hashes. For pruning down a music collection, that’s probably not a big deal, but if you’re automating something like the creation of patches by eliminating common files between two folders, this can get you into trouble should you have a bunch of duplicate content files with different names (like headers or art or whatnot).
[...] http://embraceubuntu.com has links to lots of useful programs. It’s an old blog entry, but still very useful. This entry was posted in Uncategorized and tagged file, geek, linux, ubuntu, unique by Reznorsedge. Bookmark the permalink. [...]