This is the second novice challenge. I’ll give you a problem that you should be able to solve with just the stuff we present in Learning Perl (including using modules, so, most of Perl). A week or so later, I’ll post a solution.
For this one, given a single directory that may contain duplicated files, find the files that might be duplicates. You only need to find the duplicated files and print their names. If you want to remove the duplicated files, make sure you have a backup first!
There are some modules that might be helpful:
If you are especially motivated, also search through any subdirectories that you find.
You can see a list of all Challenges and my summaries as well as the programs that I created and put in the Learning Perl Challenges GitHub repository.
I don’t particularly like my solution and I’m looking forward to a more “perlish” one…
I just digest potential files since digests are kind of expensive…
I’d like to take some time now and comment on my first solution as I didn’t manage to when I first posted it.
Exactly one directory is allowed as a parameter since I use File::Find to do a recursive search.
Only plain files are of interest, and since equal sizes are a necessity for two files being duplicates, I build a hash with the names as the keys and the sizes as the values.
This hash of files is then used to build a hash of arrays, with the sizes as the keys and arrays of the names of the same-sized files as the values.
Hashes are deleted when not needed anymore to save memory.
Obviously only the sizes for which more than one file exists matter. For those files a digest is calculated and used as the key of a new hash. At the end this new hash has arrays as its values, and only the arrays with more than one element are of interest, since those are the duplicates.
Since MD5 and SHA1 have their issues (have a look at the sixth comment at https://freedom-to-tinker.com/blog/felten/report-crypto-2004 😉), SHA-256 is used as the digest algorithm.
A digest in hexadecimal form is not really needed; memory could be saved by using a binary digest, which could be displayed with unpack when necessary.
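In outline, the approach looks roughly like this sketch (the same steps as above, not the exact code I posted):

#!/usr/bin/perl
use strict;
use warnings;

use File::Find;
use Digest::SHA;

my $dir = shift // die "Usage: $0 directory\n";

# plain files only: name => size
my %size_of;
find( sub {
    return unless -f $_;
    $size_of{$File::Find::name} = -s $_;
}, $dir );

# invert it: size => [ names ], since equal size is a prerequisite
my %names_by_size;
push @{ $names_by_size{ $size_of{$_} } }, $_ for keys %size_of;
undef %size_of;    # not needed anymore

# digest only the files whose size shows up more than once
my %names_by_digest;
for my $names ( grep { @$_ > 1 } values %names_by_size ) {
    for my $name ( @$names ) {
        # binary digest to save memory; unpack 'H*', $digest gives hex for display
        my $digest = Digest::SHA->new(256)->addfile($name)->digest;
        push @{ $names_by_digest{$digest} }, $name;
    }
}

# any digest with more than one name is a set of duplicates
for my $names ( grep { @$_ > 1 } values %names_by_digest ) {
    print join( "\n", @$names ), "\n\n";
}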
Anyway, I’ve rewritten my solution, although I am still not very satisfied:
This second version takes the ideas of the first solution and combines them directly in the wanted sub. I hope it’s more straightforward.
I actually had a bit of fun throwing this one together. It not only prints out the duplicates, but it prints them out in a way that you can see how the duplicates are partitioned (i.e., if you have files A, B, C, D, and E, with A, B, and C duplicates; D and E duplicates; and A not a duplicate of D, then it groups the two clusters of files together).
There are up to three command line options: -r for a recursive subdirectory search, -l to do a line-by-line comparison of files (if not enabled, then files are compared via a SHA1 hash), and a directory name (default of .).
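The grouping part might look something like this sketch (the -r and -l handling is left out, and the names are just placeholders):

use strict;
use warnings;
use Digest::SHA;

my $dir = shift // '.';

# files with the same SHA1 digest land in the same group, so every
# group with two or more members is one cluster of duplicates
my %cluster;
for my $file ( grep { -f } glob "$dir/*" ) {
    my $digest = Digest::SHA->new(1)->addfile($file)->hexdigest;
    push @{ $cluster{$digest} }, $file;
}

for my $files ( grep { @$_ > 1 } values %cluster ) {
    print join( ', ', @$files ), "\n";
}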
after compressing:
The main idea behind the following code is to first classify the files by size so that we can skip every file with a unique size. This way we avoid calculating the SHA1 hash for any file that doesn’t have another file of the same size, hopefully skipping most of them. Then we calculate the SHA1 only for the files inside each same-sized class.
I used File::Slurp to make it concise. It would be better to use Digest::SHA1’s OO interface so that we could read the files in chunks and avoid the danger of slurping huge files into memory.
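The chunked version could be a short routine along these lines (the chunk size is arbitrary):

use strict;
use warnings;
use Digest::SHA1;

# feed the digest object one block at a time instead of slurping
sub sha1_of {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Can't open $path: $!";
    my $sha1 = Digest::SHA1->new;
    while ( read $fh, my $buffer, 64 * 1024 ) {
        $sha1->add($buffer);
    }
    close $fh;
    return $sha1->hexdigest;
}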
Sorry, but I can’t resist making it a little bit shorter and clearer:
“-f” will also find symlinks to regular files if I’m not mistaken.
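One way to exclude them, just as an illustration, is to test -l before -f:

use strict;
use warnings;

for my $file ( glob '*' ) {
    next if -l $file;        # skip symbolic links
    next unless -f $file;    # plain files only
    print "$file\n";
}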
I took the simplicity approach. I skipped using Getopt::Long and just assumed the first two arguments were the directories. One less part of the program to worry about.
I decided to use File::Find although I prefer File::Find::Object. File::Find is just freaky. You use the subroutine to find your files, but you have to use a global (non-subroutine-defined) list to store what you find if you want to use it outside the subroutine. Plus, File::Find uses package variables. However, since all of the other modules I used are part of the standard Perl distribution, I decided to use File::Find since it’s also a standard module. I wish File::Find::Object became a standard module.
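The usual workaround is to declare a lexical outside the anonymous subroutine and let the subroutine close over it, something like:

use strict;
use warnings;
use File::Find;

my @files;    # declared outside the wanted sub
find( sub { push @files, $File::Find::name if -f $_ }, '.' );

print "$_\n" for @files;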
I decided to use Digest::SHA over Digest::MD5 because Digest::SHA’s addfile method lets me add the file without opening it first. It probably makes no difference in the efficiency of the code (the file has to be opened somewhere, whether in Digest::SHA->addfile or in my code), but it makes my code cleaner.
One of the things you can do with File::Find::find’s wanted function is to embed it in the call. For small wanted subroutines, the readability isn’t harmed, and you don’t have to search for the wanted code.
By the way, I make the assumption that file bar/foo in directory #1 is the same file as bar/foo in directory #2, but that bar/bar/foo in directory #1 is not the same file as bar/foo in directory #2 since it’s in a different subdirectory.
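Put together, that comparison might look roughly like this sketch: key each file by its path relative to its top directory, then compare the digests of the files that share a relative path (the SHA-256 choice and the names here are illustrative, not my actual code).

use strict;
use warnings;
use File::Find;
use Digest::SHA;

die "Usage: $0 dir1 dir2\n" unless @ARGV == 2;
my ( $dir1, $dir2 ) = @ARGV;

sub digests_under {
    my ($top) = @_;
    my %digest;

    find( sub {
        return unless -f $_;
        ( my $relative = $File::Find::name ) =~ s/^\Q$top\E\/*//;
        # addfile takes the file name directly, so no separate open
        $digest{$relative} = Digest::SHA->new(256)->addfile($_)->hexdigest;
    }, $top );

    return %digest;
}

my %first  = digests_under( $dir1 );
my %second = digests_under( $dir2 );

for my $path ( sort keys %first ) {
    next unless exists $second{$path};
    print "$path\n" if $first{$path} eq $second{$path};
}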
TIMTOWTDI:
Actually, I realized that I could cut my processing time in half by not comparing a pair of files twice. The modified code is up at http://pastebin.com/DzDWYdG6
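The idea in miniature, with compare() as just a stand-in:

use strict;
use warnings;

my @files = glob '*';

# visit each unordered pair once: N*(N-1)/2 comparisons instead of N*(N-1)
for my $i ( 0 .. $#files - 1 ) {
    for my $j ( $i + 1 .. $#files ) {
        compare( $files[$i], $files[$j] );
    }
}

sub compare { printf "comparing %s and %s\n", @_ }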
I know the challenge is over, but I wanted to add one that uses my File::chdir::WalkDir which is like File::Find, but has an (IMO) unique interface.