Learning Perl Challenge: directory with the most files

Find the top 10 directories on your system that have the most files in them. For each directory, count only the files immediately under it; that is, don’t count files in subdirectories. By “file”, I mean just about anything that isn’t a symbolic link or a weird device. Think about how you want to show the person running your program what it’s doing.
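
If you want a nudge without spoiling the whole thing, here’s a sketch of the counting rule for a single directory, named in $dir here (the opendir and the plain-file test are just one way to read the rule):

# a sketch of the counting rule for one directory in $dir:
# skip symbolic links, count what's left that is a plain file
opendir my $dh, $dir or die "Could not open $dir: $!";
my $count = grep { ! -l "$dir/$_" and -f "$dir/$_" } readdir $dh;
closedir $dh;
print "$dir: $count\n";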

This challenge isn’t about counting so much as traversing, remembering, and displaying. How do you know what you need to handle next, and which directories have the most files?

I’m actually writing my own version of this right now because I’m running some benchmarks of opendir versus glob and need some test cases. I could just create some new directories and make a bunch of fake files in them, but that’s no fun.

I don’t care how long your program takes, although you might. Let it run in a window (or screen) on its own. Test it on a small directory first (so, there’s a hint there).

I made a Curses version, but don’t look at it until you’ve tried your own solution!

You can see a list of all Challenges and my summaries as well as the programs that I created and put in the Learning Perl Challenges GitHub repository.

Learning Perl Challenge: popular history

For June’s challenge, write a program to read the history of shell commands and determine which ones you use the most.

In bash, you can set the HISTFILE and HISTFILESIZE variables to control history’s behavior. To get the commands to count, you can read the right file or shell out to the history command. From there, you have to decide how to split up each line to figure out what the command is. Can you handle commands with relative paths and absolute paths so you don’t count them separately?
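
Here’s a rough sketch of the counting part, assuming a plain bash history file with one command per line and no timestamps (the file location is the usual default, but check your own settings):

my $histfile = "$ENV{HOME}/.bash_history";   # location is an assumption
open my $fh, '<', $histfile or die "Could not open $histfile: $!";

my %count;
while( <$fh> ) {
	my( $command ) = split;        # first whitespace-separated word
	next unless defined $command;
	$command =~ s|.*/||;           # treat /bin/ls and ls as the same command
	$count{$command}++;
	}

foreach my $command ( sort { $count{$b} <=> $count{$a} } keys %count ) {
	printf "%5d %s\n", $count{$command}, $command;
	}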

This was a meme for a little while, and I wish I could find my answer to it. If you find my answer, please let me know.

You can see a list of all Challenges and my summaries as well as the programs that I created and put in the Learning Perl Challenges GitHub repository.

Learning Perl Challenge: finding duplicates (Answer)

March’s challenge was to find duplicate files. I often write a quick and dirty program to do this, having forgotten what I called the program or having failed to copy it to all of my accounts. This program helps me find extra copies of files that are taking up my space on Dropbox. Did I really mean to have two copies of the OS X Lion installer? I had already downloaded it, moved it into the wrong folder, and forgotten about it. I downloaded it again, moved it into the right folder, and took up another 2% of my Dropbox space. Or, iTunes stupidly stores the same file twice, in the same folder even, with slightly different names (and that’s in Dropbox too).

Before I go through the posted solutions for the challenge, I’ll show you mine. It’s not at all impressive and not that efficient. It’s not even pretty, but this is the program I hacked up a long time ago to do this. I spent five minutes thinking about it, let it run, and moved on:

use File::Find;
use Digest::MD5;

find( \&wanted, grep { -e } @ARGV );   # only descend into arguments that exist

our %digests;
sub wanted {
	# skip directories and symbolic links
	return if( -d $File::Find::name or -l $File::Find::name );
	my $fh;
	unless( open $fh, '<', $File::Find::name ) {
		warn "$File::Find::name: $!";
		return;
		}

	# digest the file contents
	my $digest = Digest::MD5->new->addfile($fh)->hexdigest;

	# remember every file with this digest, separated by null bytes
	$digests{$digest} .= "$File::Find::name\000";
	}

# any digest with more than one file is a set of duplicates
foreach my $digest ( keys %digests ) {
	my @files = split /\000/, $digests{$digest};
	next unless @files > 1;
	print join( "\n\t", @files ), "\n\n";
	}

As some people noted in their solutions, various digests have problems. We want a digest that is a short representation of content and is unlikely to be the same for different content. The chances of a “collision” depend on the algorithm as well as the number of times we try it. If we only digest one file, we never have a collision. If we digest ten files, we expect a collision to be extremely rare. When (not if) we get to billions of files, weak algorithms will fail us. Since I don’t care that much, I use MD5. But, I’ve segregated all of those details in a subroutine, so I can easily change that.
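
That subroutine might look something like this sketch; the get_digest name is the one the unofficial solution later in this post calls, and the details inside are just an illustration:

use Digest::MD5;

sub get_digest {
	my( $file ) = @_;
	open my $fh, '<', $file or do { warn "$file: $!"; return };
	binmode $fh;                  # digest the raw bytes
	return Digest::MD5->new->addfile( $fh )->hexdigest;
	}

Everything that knows about MD5 stays inside it, so switching to another Digest module only means editing these few lines.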

Aside from that, this is essentially an exercise in file system traversal and hashes. However I decide to compare files, I need to store earlier results. I limit myself to the concepts in Learning Perl, so I don’t use an array reference as the value in %digests. Instead, I store all filenames in a single string by separating them with a null byte (\000). Later, I split them again to get the list of files.

Since I’m writing a simple program, I merely report the duplicates as I find them. I want to inspect the files myself before I delete or move them.

My solution has problems. I have to wait for find to do its work before I start seeing a report of the duplicates. That’s rather annoying. However, when I use this, I usually don’t care how fast I get the results. It is a bit of ugly programming, but it works.
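
If the waiting bothered me, a small change inside wanted would report a duplicate as soon as its digest repeats, instead of saving everything for a final pass (a sketch of the idea, not what the program above does):

	# inside wanted(), just before storing the new name
	if( exists $digests{$digest} ) {
		print "$File::Find::name duplicates:\n\t",
			join( "\n\t", split /\000/, $digests{$digest} ), "\n\n";
		}
	$digests{$digest} .= "$File::Find::name\000";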

The solutions

Although people have their particular coding styles, there are three ways these programs can be conceptually different:

  1. Finding the files
  2. Digesting the files
  3. Reporting the duplicates

Finding files

There are a few ways to find the files. The easiest is File::Find, which does a depth-first search by using a callback. It’s the easiest thing to start with, and many people did. It’s what I used.

Tudor makes his own recursive subroutine to get all the files out of a directory and its subdirectories, effectively reinventing File::Find:

sub _process_folder{
    my $folder_name = shift;

    foreach my $file_name ( glob  "$folder_name/*" ){
        #nothing to do if current, or parent directory
        next if $file_name ~~ [ qw/. ../ ];

        # $file_name might actually be a folder
        if ( -d $file_name ){
            _process_folder($file_name);
            next ;
        };

        my $file_content = read_file( $file_name, binmode => ':raw' );
        if ( defined($duplicate_files->{ md5_hex( $file_content ) }) ){
            push @{ $duplicate_files->{ md5_hex( $file_content ) } }, $file_name;
        } else {
            $duplicate_files->{ md5_hex( $file_content ) } = [ $file_name ];
        };
    }

    return;
}

Mr. Nicholas uses a glob to get all the files in the current working directory:

while (<*>) {
    push @{$data{md5_hex(read_file($_))}},$_ if -f $_;
}

I’d rather use glob when I want a list of files so I can save the angle brackets for reading lines. That’s what ulric did:

my @files=glob'*';

Eric used readdir, which reads all the filenames at once:

opendir DIR, $ARGV[0]
    or die "opendir error: $!";
my @files = readdir DIR;
my %fp;

The other solutions that don’t use File::Find only work with the current directory, which is what I specified for the basic parts of this problem. I’ll give those solutions no points for this part: neither extra points nor demerits. They pass, while other solutions did the extra work to handle subdirectories.

Digesting files

On his first go, Anonymous Coward went for Digest, which handles many types of digests through the same interface, and chose SHA-256. He hardcoded the digest and used a variable name tied to that particular algorithm:

    my $sha256 = Digest->new("SHA-256");
    if (open my $handle, $name) {
      $sha256->addfile($handle);
      my $digest = $sha256->hexdigest;

That hexdigest call is important, and one of the details I usually forget. The digest by itself is just a big number. To turn it into what I usually expect, a string of hexadecimal digits, I call the right method. The digest method returns a binary string, and I could have used b64digest to get a Base-64 encoded version.
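
As a quick illustration (the input string and the choice of module here are just examples), the same digest in its three output forms:

use Digest::MD5;

my $md5 = Digest::MD5->new;
$md5->add( 'Hello, World' );

# each *digest call resets the object, so clone before asking
my $raw = $md5->clone->digest;        # 16 raw bytes
my $hex = $md5->clone->hexdigest;     # 32 hexadecimal characters
my $b64 = $md5->clone->b64digest;     # Base-64, without trailing padding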

On his next go around, he moved those details higher in the program, making way for an eventual configuration outside the program:

my $algorithm = 'SHA-256';
my $digest    = Digest->new($algorithm);

Most people reached directly for either Digest::MD5 or Digest::SHA1, both of which export convenience functions (md5_hex and sha1_hex) for this.

Jack Maney put his digest details in a subroutine, which compartmentalizes them in a part of the program the rest doesn’t need to know about. If he wants to change the digest, he changes the subroutine:

sub compare_files
{
    my ($file1,$file2)=@_;
 
    if($line_by_line) #using File::Compare::compare
    {
        my $ret_val=eval{compare($file1,$file2)};
 
        die "File::Compare::compare encountered an error: " . $@ if $@;
 
        return 1 if $ret_val==0; #compare() returns 0 if the files are the same...
 
        return undef;
    }
    else #Otherwise, we use Digest::SHA1.
    {
        open(my $fh1,"< ",$file1) or die $!;
        open(my $fh2,"<",$file2) or die $!;
 
        my $sha1=Digest::SHA1->new;
 
        $sha1->addfile($fh1); #Reads file.
        my $hex1=$sha1->hexdigest; #40 byte hex string.
 
        $sha1->reset;
        $sha1->addfile($fh2);
        my $hex2=$sha1->hexdigest;
 
        close($fh1);
        close($fh2);
 
        return $hex1 eq $hex2;
    }
}

Tudor went for Digest::MD5, which exports md5_hex. I don’t particularly like this function because I always think it should take a filename; instead, I have to give it the actual file contents, so Tudor uses File::Slurp’s read_file to get them. Tudor also has a bit of a problem because he does the computation twice for each file:

        my $file_content = read_file( $file_name, binmode => ':raw' );
        if ( defined($duplicate_files->{ md5_hex( $file_content ) }) ){
            push @{ $duplicate_files->{ md5_hex( $file_content ) } }, $file_name;
        } else {
            $duplicate_files->{ md5_hex( $file_content ) } = [ $file_name ];
        };

Ulrich does the same thing, but does the computation once and stores it in a variable:

  my $digest=md5_hex();
  if (exists $filehashes{$digest}) {
    $dupfree=0;
    push @{$filehashes{$digest}}, $file;
    print "duplicates detected: ";
    foreach $file (@{$filehashes{$digest}}) {
      print "$file  ";
    }

Instead of Digest::MD5, Gustavo used Digest::SHA1, which has a similar interface. He wraps it all up in a single statement:

    push @{$sha2files{sha1_hex(read_file($file))}}, $file;

Mr. Nicholas did the same with Digest::MD5:

    push @{$data{md5_hex(read_file($_))}},$_ if -f $_;

Javier used the object form of Digest::MD5 that takes a filehandle. He has a subroutine that just returns the digest:

sub get_hash($)
{
   open(FILE, $_);
   return Digest::MD5->new->addfile(*FILE)->hexdigest;
}

Eric also used the object form, but used the module directly in the statement:

    $fp{$_} = Digest::MD5->new->addfile(*FILE)->hexdigest;

The winner for this part of the problem would have to be a tie between Anonymous Coward, for setting up a flexible digest system, and Javier, for creating a short subroutine. Anonymous Coward comes out slightly ahead, I think.

Finding duplicates

Finding the duplicates is the final part of the problem. Most answers did what Anonymous Coward did: when he computed a digest, he used it as the key in a hash and made the value an array reference. In his first go, he used the v5.14 feature that automatically dereferences the first argument to push:

      if (exists $duplicates{$digest}) {
        push $duplicates{$digest}, $name;
      } else {
        $duplicates{$digest} = [$name];
      }
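
That feature isn’t necessary, though. An explicit dereference works on older perls too, and autovivification creates the array reference the first time a digest shows up, so the exists check can go away:

# the anonymous array springs into existence the first time
# a digest is seen, so no exists check is needed
push @{ $duplicates{$digest} }, $name;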

Jack Maney used Set::Scalar to keep track of duplicates. He goes through the list of files and compares each one with every other file in a double foreach loop, which is a lot of work. If the two files are the same, he looks through all the sets he’s stored so far for a set that already contains one of the filenames so he can add the new filename:

foreach my $file1(@files)
{
	foreach my $file2(@files)
	{
		next if $file1 eq $file2; #only comparing distinct pairs of files!

		if(compare_files($file1,$file2)) #If they're the same...
		{
			#first, see if $file1 is in any element of @duplicates.
			my $found=0; #flag to see if we found $file1 or $file2

			foreach my $set (@duplicates)
			{
				if($set->has($file1))
				{
					$set->insert($file2);
					$found=1;
					last;
				}
				elsif($set->has($file2))
				{
					$set->insert($file1);
					$found=1;
					last;
				}
			}

			unless($found) #If we didn't find $file1 or $file2 in @duplicates, add a new set!
			{
				push @duplicates,Set::Scalar->new($file1,$file2);
			}
		}
	}
}

There are some good ideas there, but to improve it I’d have to go back to references, keeping the sets as values in a hash where the key is the digest. However, I’m limiting myself to whatever we have in Learning Perl for my official solution. For my unofficial solution, I would have made a single pass over @files to digest them and another pass over %digests to report the duplicates:

foreach my $file ( @files ) {
	my $digest = get_digest( $file );
	$digests{$digest} = Set::Scalar->new
		unless defined $digests{$digest};
	$digests{$digest}->insert( $file );
	}

foreach my $digest ( keys %digests ) {
	next unless $digests{$digest}->size > 1;
	my @dupes = $digests{$digest}->members;
	print join( "\n\t", @dupes ), "\n\n";
	}

Most everyone did the same thing, so points go to Anonymous Coward for getting there first.

The results

I’m not assigning a winner to the first part, which involved finding files. Anonymous Coward wins the second part by setting up his digest to be flexible through Digest’s interface. Anonymous Coward also narrowly pips the other solutions for reporting the duplicates because he was first; most people used a hash with array references for values.

Learning Perl Challenge: tripwire

The previous Learning Perl Challenge asked you to find duplicate files. This challenge needs some of what you did there, but for a different purpose.

Write a program that monitors a directory to find any file changes. Programs such as tripwire do this by recording meta-information about each file on the first run, then checking later that the same information still matches. For instance, the size and SHA1 digest should stay the same. You could also just store the original content, but that’s not very convenient.

Since you’re at the Learning Perl level, we can’t ask too much here or judge you too harshly. A lot of the problem is storing the data and reading it later. Here’s a hint: create a flat file to store the “good” data on the first run, then read this file on the second run:

#name:size:SHA1
file.txt:1023:53a0935982ae11a4784d51aa696733c947c0614f

How are you going to handle the security on this file after you create it? As an example, you might look at CPAN::Checksums, which handles the same task for the modules on CPAN.

There are many ways that you can use this. You can run it periodically from cron, for instance, but you might also make a daemon that runs continually and is always checking. Once you find a change, you can report it in many ways, but we’ll only ask you to print something to the terminal that might look like this:

file.txt changed.

Was:
Size: 1023 bytes
SHA1: 53a0935982ae11a4784d51aa696733c947c0614f

Is now:
Size: 2001 bytes
SHA1: 730c6983bb9f942ef5cf6c174d76ad0c1594c1a7
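
Here’s a minimal sketch of that record-then-compare idea, assuming the flat file format above and a list of files on the command line (the records.txt name is just an example, and filenames with colons in them would need a smarter format):

use Digest::SHA;

my $record_file = 'records.txt';   # this name is just an example

# read any previous records in the #name:size:SHA1 format
my %old;
if( open my $fh, '<', $record_file ) {
	while( <$fh> ) {
		chomp;
		next if /\A#/;
		my( $name, $size, $sha1 ) = split /:/;
		$old{$name} = [ $size, $sha1 ];
		}
	}

# record the current state, comparing against the previous run as we go
open my $out, '>', $record_file or die "Could not write $record_file: $!";
print { $out } "#name:size:SHA1\n";

foreach my $file ( @ARGV ) {
	next unless -f $file;          # skip anything that isn't a plain file
	my $size = -s $file;
	my $sha1 = Digest::SHA->new(1)->addfile( $file )->hexdigest;
	print { $out } "$file:$size:$sha1\n";

	next unless exists $old{$file};
	my( $old_size, $old_sha1 ) = @{ $old{$file} };
	next if $old_sha1 eq $sha1;

	print "$file changed.\n\n",
		"Was:\nSize: $old_size bytes\nSHA1: $old_sha1\n\n",
		"Is now:\nSize: $size bytes\nSHA1: $sha1\n\n";
	}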

You can see a list of all Challenges and my summaries as well as the programs that I created and put in the Learning Perl Challenges GitHub repository.

Learning Perl Challenge: Find duplicates

This is the second novice challenge. I’ll give you a problem that you should be able to solve with just the stuff we present in Learning Perl (including using modules, so most of Perl). A week or so later, I’ll post a solution.

For this one, given a single directory that might contain duplicated files, find the files that are duplicates. You only need to find the duplicates and print their names. If you want to remove the duplicated files, make sure that you have a backup first!

There are some modules that might be helpful, such as File::Find and Digest::MD5.

If you are especially motivated, also search through any subdirectories that you find.

You can see a list of all Challenges and my summaries as well as the programs that I created and put in the Learning Perl Challenges GitHub repository.