Learning Perl

Learning Perl Giveaway

I’ve found another extra copy of Learning Perl, 6th Edition, and I’m going to give it away. In previous giveaways I’ve been creative, sometimes asking people to send me a postcard, donate to a charity, or something else. I’m not going to give people a book without getting something from them.

This time, I’ll send this book to someone who wants to replace their current Learning Perl or other beginning Perl book. Post a picture of yourself with the book you’d like to replace and link to this blog post, along with a short explanation why you need to replace your Perl book. The picture has to be available to anyone without logging in to any sort of account, and licensed under a share-alike agreement so I can repost my favorite photos. Don’t hide your picture in Facebook!

A photo that Schwern might submit if he wanted a free book

By the end of July, I’ll choose the picture that I like the most and send that person a free Learning Perl signed by me.

I might favor some items in your photo, and give even more favor to photos that combine those elements.

Famous landmarks
Bigfoot
Scenery used in a movie (leaning toward New Zealand here)
Very large beers
Military matériel
Llamas

Learning Perl Challenge: finding duplicates (Answer)

March’s challenge was to find duplicate files. I often write a quick and dirty program to do this, having forgotten what I called the program or having failed to copy it to all of my accounts. This program helps me find extra copies of files that are taking up my space on Dropbox. Did I really mean to have two copies of the OS X Lion installer? I had already downloaded it, moved it into the wrong folder, and forgotten about it. I downloaded it again, moved it into the right folder, and took up another 2% of my Dropbox space. Or, iTunes stupidly stores the same file twice, in the same folder even, with slightly different names (and that’s in Dropbox too).

Before I go through the posted solutions for the challenge, I’ll show you mine. It’s not at all impressive and not that efficient. It’s not even pretty, but this is the program I hacked up a long time ago to do this. I spend five minutes thinking about it, let it run, and moved on:

use File::Find;
use Digest::MD5;

find( \&wanted, grep { -e } @ARGV );

our %digests;
sub wanted {
	return if( -d $File::Find::name or -l $File::Find::name );
	my $fh;
	unless( open $fh, '<', $File::Find::name ) {
		warn "$File::Find::name: $!";
		return;
		}

	my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
	
	$digests{$digest} .= "$File::Find::name\000";
	}

foreach my $digest ( keys %digests ) {
	my @files = split /\000/, $digests{$digest};
	next unless @files > 1;
	print join( "\n\t", @files ), "\n\n";
	}

As some people noted in their solutions, various digests have problems. We want a digest that is a short representation of content and is unlikely to be the same for different content. The chances of a “collision” depend on the algorithm as well as the number of times we try it. If we only digest one file, we never have a collision. If digest ten files, we expect a collision to be extremely rare. When (not if) we get to billions of files, weak algorithms will fail us. Since I don’t care that much, I use MD5. But, I’ve segregated all of those details in a subroutine, so I can easily change that.

Aside from that, this is essentially an exercise in file system traversal and hashes. However I decide to compare files, I need to store earlier results. I limit myself to the concepts in Learning Perl, so I don’t use an array reference as the value in %digests. Instead, I store all filenames in a single string by separating them with a null byte (\000). Later,
I split them again to get the list of files.

Since I’m writing a simple program, I merely report the duplicates as I find them. I want to inspect the files myself before I delete or move them.

My solution has problems. I have to wait for find to do its work before I start seeing a report of the duplicates. That’s rather annoying. However, when I use this, I usually don’t care how fast I get the results. It is a bit of ugly programming, but it works.

The solutions

Although people have their particular coding styles, there are three ways these programs can be conceptually different:

Finding the files
Digesting the files
Reporting the duplicates

Finding files

There are a few ways to find the files. The easiest is File::Find, which does a depth-first search by using a callback. It’s the easiest thing to start with, and many people did. It’s what I used.

Tudor makes his own recursive subroutine to get all the files out of a directory and its subdirectories, effectively reinventing File::Find:

sub _process_folder{
    my $folder_name = shift;

    foreach my $file_name ( glob  "$folder_name/*" ){
        #nothing to do if current, or parent directory
        next if $file_name ~~ [ qw/. ../ ];

        # $file_name might actually be a folder
        if ( -d $file_name ){
            _process_folder($file_name);
            next ;
        };

        my $file_content = read_file( $file_name, binmode => ':raw' );
        if ( defined($duplicate_files->{ md5_hex( $file_content ) }) ){
            push @{ $duplicate_files->{ md5_hex( $file_content ) } }, $file_name;
        } else {
            $duplicate_files->{ md5_hex( $file_content ) } = [ $file_name ];
        };
    }

    return;
}

Mr. Nicholas uses a glob to get all the files in the current working directory:

while (<*>) {
    push @{$data{md5_hex(read_file($_))}},$_ if -f $_;
}

I’d rather use glob if I wanted a list of files so I can save the angle brackets for reading lines. That’s what ulric did:

my @files=glob'*';

Eric used readdir, and gets all the files at once:

opendir DIR, $ARGV[0]
    or die "opendir error: $!";
my @files = readdir DIR;
my %fp;

The other solutions that don’t use File::Find only work with the current directory, which is what I specified for the basic parts of this problem. I’ll have to give no points to those solutions for this part—neither extra points or demerits. They pass, while other solutions did the extra parts to handle subdirectories.

Digesting files

On his first go, Anonymous Coward went for Digest, which handles many types of digests with the same interface. He went with SHA-256. On his first try, he hardcoded the digest and used a variable name tied to that particular digest:

    my $sha256 = Digest->new("SHA-256");
    if (open my $handle, $name) {
      $sha256->addfile($handle);
      my $digest = $sha256->hexdigest;

That hexdigest call is important, and one of the issues that I usually forget. The digest by itself is just a big number. To turn it into what I usually expect, a string of hexadecimal digits, I call the right method. The digest method returns a binary string. I could have used b64digest to get a Base-64 encoded version.

On his next go around, he moved those details higher in the program, making way for an eventual configuration outside the program:

my $algorithm = 'SHA-256';

        my $digest = Digest->new($algorithm);

Most people reached directly for either Digest::MD5 or Digest::SHA1, both which export functions for these:

Jack Maney put his digest details in a subroutine, which compartmentalizes them in a part the rest of the program doesn’t need to know about. If he wants to change the digest, he changes the subroutine:

sub compare_files
{
    my ($file1,$file2)=@_;
 
    if($line_by_line) #using File::Compare::compare
    {
        my $ret_val=eval{compare($file1,$file2)};
 
        die "File::Compare::compare encountered an error: " . $@ if $@;
 
        return 1 if $ret_val==0; #compare() returns 0 if the files are the same...
 
        return undef;
    }
    else #Otherwise, we use Digest::SHA1.
    {
        open(my $fh1,"< ",$file1) or die $!;
        open(my $fh2,"<",$file2) or die $!;
 
        my $sha1=Digest::SHA1->new;
 
        $sha1->addfile($fh1); #Reads file.
        my $hex1=$sha1->hexdigest; #40 byte hex string.
 
        $sha1->reset;
        $sha1->addfile($fh2);
        my $hex2=$sha1->hexdigest;
 
        close($fh1);
        close($fh2);
 
        return $hex1 eq $hex2;
    }
}

Tudor went for Digest::MD5, which exports md5_hex. I don’t particularly like this function because I always think it should take a filename. Tudor uses File::Slurp‘s read_file Instead I have to give it the actual file contents. Tudor has a bit of a problem because he does the computation twice for each file:

        my $file_content = read_file( $file_name, binmode => ':raw' );
        if ( defined($duplicate_files->{ md5_hex( $file_content ) }) ){
            push @{ $duplicate_files->{ md5_hex( $file_content ) } }, $file_name;
        } else {
            $duplicate_files->{ md5_hex( $file_content ) } = [ $file_name ];
        };

Ulrich does the same thing, but does the computation once and stores it in a variable:

  my $digest=md5_hex();
  if (exists $filehashes{$digest}) {
    $dupfree=0;
    push @{$filehashes{$digest}}, $file;
    print "duplicates detected: ";
    foreach $file (@{$filehashes{$digest}}) {
      print "$file  ";
    }

Instead of Digest::MD5, Gustavo used Digest::SHA1, which has a similar interface . He wraps it all up in a single statement:

    push @{$sha2files{sha1_hex(read_file($file))}}, $file;

Mr. Nicholas did the same with Digest::MD5:

    push @{$data{md5_hex(read_file($_))}},$_ if -f $_;

Javier used the object form of Digest::MD5 that takes a filehandle. He has a subroutine that just returns the digest:

sub get_hash($)
{
   open(FILE, $_);
   return Digest::MD5->new->addfile(*FILE)->hexdigest;
}

Eric also used the object form, but used the module directly in the statement:

    $fp{$_} = Digest::MD5->new->addfile(*FILE)->hexdigest;

The winner for this part of the problem would have to be a tie between Anonymous Coward setting up a flexible digest system and Javier for creating a short subroutine. Anonymous Coward comes out slightly ahead I think.

Finding duplicates

Finding the duplicates is the final part of the problem. Most answers did what Anonymous Coward did. When he computed a digest, he used it as the key in a hash, and made the value an array reference. In his first go, he uses the v5.14 feature that automatically dereferences the first argument to push:

      if (exists $duplicates{$digest}) {
        push $duplicates{$digest}, $name;
      } else {
        $duplicates{$digest} = [$name];
      }

Jack Maney used Set::Scalar to keep track of duplicates. He goes through the list of files and compares them with every other file in a double foreach loop, which is a lot of work. If the two files are the same, he looks through all the sets he’s stored so far looking for a set that already contains one of the filenames so he can add the new filename:

foreach my $file1(@files)
{
	foreach my $file2(@files)
	{
		next if $file1 eq $file2; #only comparing distinct pairs of files!

		if(compare_files($file1,$file2)) #If they're the same...
		{
			#first, see if $file1 is in any element of @duplicates.
			my $found=0; #flag to see if we found $file1 or $file2

			foreach my $set (@duplicates)
			{
				if($set->has($file1))
				{
					$set->insert($file2);
					$found=1;
					last;
				}
				elsif($set->has($file2))
				{
					$set->insert($file1);
					$found=1;
					last;
				}
			}

			unless($found) #If we didn't find $file1 or $file2 in @duplicates, add a new set!
			{
				push @duplicates,Set::Scalar->new($file1,$file2);
			}
		}
	}
}

There are some good ideas there, but I’d have to revert to references to improve it by keeping the sets as values in a hash where the key is the digest. However, I’m limiting myself to whatever we have in Learning Perl for my official solution. For my unofficial solution, I would have made a single pass over @files to digest it and another pass over %digests to report them:

foreach my $file ( @files ) {
	my $digest = get_digest( $file );
	$digests{$digest} = Set::Scalar->new 
		unless defined $digests{$digest};
	$digests{$digest}->insert( $file );
	}
	
foreach my $digest ( keys %digests ) {
	next unless $digests{$digest}->size > 1;
	my $dupes = $digests{$digest}->members;
	}

Most everyone did the same thing, so points go to Anonymous Coward for getting their first.

The results

I’m not assigning a winner to first part that involved finding files. Anonymous Coward wins the second part by setting up his digest to be flexible through Digest‘s interface. Anonymous Coward narrowly pips the other solutions for reporting the duplicates because he was first, since most people used a hash with array references for values.

Learning Perl Challenge: tripwire

The previous Learning Perl Challenge asked you to find duplicate files. This challenge needs some of which you did there, but for a different purpose.

Write a program that monitors a directory to find any file changes. Programs such as tripwire do this by recording meta information about a file on its first run then checking that the same information matches later. For instance, the size and SHA1 digest should stay the same. You could also just store the original content, but that’s not very convenient.

Since you’re at the Learning Perl level, we can’t ask too much here or judge you too harshly. A lot of the problem is storing the data and reading it later. Here’s a hint: create a flat file to store the “good” data on the first run, then read this file on the second run:

#name:size:SHA1
file.txt:1023:53a0935982ae11a4784d51aa696733c947c0614f

How are you going to handle the security on this file after you create it? As an example, you might look at CPAN::Checksums, which handles the same task for the modules on CPAN.

There are many ways that you can employ use this. You can run it periodically from cron, for instance, but you might also make a daemon that runs continually and is always checking. Once you find a change, you can report it in many ways, but we’ll only ask you to print a line to the terminal, that might look something like:

file.txt changed.

Was:
Size: 1023 bytes
SHA1: 53a0935982ae11a4784d51aa696733c947c0614f

Is now:
Size: 2001 bytes
SHA1: 730c6983bb9f942ef5cf6c174d76ad0c1594c1a7

You can see a list of all Challenges and my summaries as well as the programs that I created and put in the Learning Perl Challenges GitHub repository.

Learning Perl Challenge: Find duplicates

This is the second novice challenge. I’ll give you some problem which you should be able to solve with just the stuff we present in Learning Perl (including using modules, so, most of Perl). A week or so later, I’ll post a solution.

For this one, given a single directory of files containing possible duplicated files, find the files that might be duplicates. You only need to print the duplicated files and print their names. If you want to remove the duplicated files, ensure that you have a backup!

There are some modules that might be helpful:

If you are especially motivated, also search through any subdirectories that you find.

You can see a list of all Challenges and my summaries as well as the programs that I created and put in the Learning Perl Challenges GitHub repository.

Can you learn Perl from an old Learning Perl?

A question at Yahoo! Answers asks “Is it okay to learn perl with an out-dated book?”. This showed up in my Google Alerts in a new link attractor site that wants to get referral and ad revenue. It’s a four year old question, which puts it in the desert between Perl 5.8 and Perl 5.10.

I’ve seen this question several times, and I always wonder why people want to settle for such an old book. This questioner got it from a thrift store for a penny, but most people don’t say why. Is it hard to get books where they are? Did they get it cheap in a used book store, or find it in the trash? As someone who wants to make money writing about Perl, I wonder why people either don’t have the opportunity or are reluctant to pay $20 to $40 to get the latest information and the old information usually explained better. I fully realize that prices and availability outside the United States can be a problem. If you can’t get the print book, you might be able to get the ebook version, but you need a credit card to complete the purchase from either the O’Reilly website or Google Books. As with my answers to technical questions, I’d like to know why the old-book situation exists, but people often don’t tell us why they want to settle for an ancient edition.

Before I give my answer, you should remember that I get royalties from Learning Perl, from the fourth edition on, and with increasing participation in each subsequent edition, I get a greater share. It’s in my interest that everyone gets the latest version of this book.

The reader’s interest is usually learning enough Perl to use the Perl they have available to them. Unless someone is turning on a particular computer for the first time in 10 years and want to stick with that ancient version of the operating system, I’d expect the earliest Perl they might have is something in the 5.8 series.

The questioner specifically asks about a book the covers Perl 5.004. That’s a version from 1997 to 1998, depending on the subversion you use. If the book is Learning Perl, that’s probably the second edition, released in the summer of 1997. That’s 15 years ago.

If your book is pink, it’s too old

I don’t think that you need to learn Perl from the latest book. Learning Perl, Sixth Edition covers up to Perl 5.14, but if you aren’t using Perl 5.14, you probably don’t need that edition. You might get by with Learning Perl, Fifth Edition, which covers up to Perl 5.10. If you aren’t using Perl 5.10, which is the latest unsupported version, at least for another couple of months, you might get by Learning Perl, Fourth Edition, which covers up through Perl 5.8. Even though the perl developers now release a new version of Perl every year, Perl users upgrade much more slowly, especially since each new version since Perl 5.10 has few, if none, compelling features for most people.

The further back in time you go, however, the more current practice you miss. When we update a book, we also roll in the changes to style and idiom that develop as practicing Perl programmers learn more about the consequences of the old idioms. For instance, the three-argument open was introduced in Perl 5.6, but wasn’t stressed by the community until much later. Along with that, Perl 5.6 introduced autovivified scalar filehandles. I mentioned some of these in Why we teach bareword file handles. This version of Perl was a big change in Perl practice, and as such, Perl 5.6, released in 2000, is the earliest version that I think you should even consider learning. That means that Learning Perl, Third Edition, a version in which I have no financial stake, is the minimum edition you should use.

Most of the things that you’ll learn from earlier books are still perfectly good Perl, as far as the latest compiler is concerned. The perl developers have been very keen toward supporting old Perl scripts so that the code you wrote in 1995 still works with the latest perl. Very few features (such as pseudohashes) that have ever disappeared.

There’s another point that comes up in one of the answers. The “best answer” says that it takes about a year to write a book and that by the time it hits the shelves, it’s already out of date. Neither of those things is true for Learning Perl. We do a new edition in about six months, and we plan it according to the next version of perl by following the development of the experimental version. By the time the book hits the shelf, we’re usually right on time for the latest version of Perl. For Programming Perl, Fourth Edition, we’re even a little ahead since we cover some Perl 5.16 features, and that’s not even out yet.