Length of words in a list

Task: given a list of words, how many of them consist of one, two, three, … n characters? Additional constraint: the words may contain any Unicode characters, so we have to count characters, not bytes.

Everyday job: read in a file

First we make sure Perl treats input and output as Unicode text, then we read the file into an array (one line per element) and chomp the array (i.e. remove the trailing newline from every element):

#!/usr/bin/perl -w
# cntstr.pl -- count characters in Unicode string
# usage: perl cntstr.pl filename
use strict;
use warnings;
use utf8;
 
binmode STDOUT, ":utf8";
my $filename = $ARGV[0];
 
open my $fh, "< :encoding(UTF-8)", $filename or die "open: $!";
 
# file into array:
my @str = <$fh>;
# chomp array:
chomp (@str);
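
Why bother with the encoding layer at all? A minimal sketch, not part of the script above: `length` counts characters only when Perl holds a decoded string. The sample word and the use of the core `Encode` module are my own choices for illustration, and the sketch assumes the source file is saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                    # the literal below is written in UTF-8
use Encode qw(encode);

my $word  = "riječ";                   # decoded string: 5 characters
my $bytes = encode("UTF-8", $word);    # the same word as raw UTF-8 bytes

print length($word),  "\n";   # prints 5
print length($bytes), "\n";   # prints 6, because "č" takes two bytes
```

Without the `:encoding(UTF-8)` layer on the file handle, the words would arrive as bytes and every non-ASCII letter would be counted more than once.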

Magic: sort words by length

We sort the words by length to get the range (from shortest to longest); out of curiosity, we also print the shortest, the second-longest, and the longest word. The recipe comes from Stack Overflow.

# sort by length, from the shortest string to the longest:
my @sorted = sort { length($a) <=> length($b) } @str;
print "Shortest word: ", $sorted[0], ", ", length($sorted[0]), "\n";
# negative indices count from the end of the array:
print "Second-longest word: ", $sorted[-2], ", ", length($sorted[-2]), "\n";
print "Longest word: ", $sorted[-1], ", ", length($sorted[-1]), "\n";
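
To see the sort in isolation, here is a small sketch with a made-up word list. The tie-breaker (`or $a cmp $b`) is my addition; it only makes the order of equally long words predictable and is not needed for finding the range.

```perl
use strict;
use warnings;

my @words = qw(pear fig banana kiwi);

# shortest first; words of equal length are ordered alphabetically
my @sorted = sort { length($a) <=> length($b) or $a cmp $b } @words;

print "@sorted\n";                                     # prints: fig kiwi pear banana
print "shortest: $sorted[0], longest: $sorted[-1]\n";  # prints: shortest: fig, longest: banana
```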

Challenge: array of arrays

Once we know the range, we want a separate list for the words with one character, another for the words with two characters, then three, four, and so on, all the way to 27, the upper end of the range found above. Then we count the elements in each list.

In Perl, a list of lists is called an array of arrays. Quite a challenge: understanding the concept was not the hard part; working out how it is actually written was.

# we want an array of arrays: 5s, 6s, 7s etc.
# initialize top array:
my @wordlengths = ();
# create top array, holding 27 lists:
foreach my $i ( 0 .. 26 ) {
    # loop over what we got from the sorted list: its number of elements, its values:
    foreach my $singleword ( 0 .. scalar(@sorted) - 1 ) {
        # test whether the given length fits in the actual category:
        if ( length($sorted[$singleword]) == $i + 1 ) {
            # if so, push it into the current subarray;
            # mind the curly brackets!
            push @{ $wordlengths[$i] }, $sorted[$singleword];
        }
    }
}
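
The nested loop above visits every word 27 times. One possible alternative, sketched here with a made-up word list and a hypothetical `@by_length` array: use the word's length directly as the index, so a single pass is enough and the upper bound does not have to be hard-coded. Perl creates the inner arrays on demand (autovivification).

```perl
use strict;
use warnings;

my @words = qw(a to be cat dog horse);

my @by_length;
for my $word (@words) {
    next unless length $word;    # skip empty lines; index -1 would be an error here
    # slot 0 holds the 1-character words, slot 1 the 2-character words, ...
    push @{ $by_length[ length($word) - 1 ] }, $word;
}

for my $i ( 0 .. $#by_length ) {
    # a length that never occurred leaves an undefined slot
    my @group = @{ $by_length[$i] // [] };
    printf "%d characters: %d word(s)\n", $i + 1, scalar @group;
}
```

With this list the loop reports 1, 2, 2, 0 and 1 words for the lengths 1 to 5. The `//` operator needs Perl 5.10 or newer.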

Let's see what we have

Finally, we have to print something to see where we are and what we've got.

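# $i instead of $b: $a and $b are reserved for sort blocks
foreach my $i ( 0 .. 26 ) {
    # a length with no words leaves an undefined slot, so fall back to an empty list:
    my @words = @{ $wordlengths[$i] // [] };
    print "Number of words with ", ($i + 1), " characters: ", scalar(@words), "\n";
    print "First word with ", ($i + 1), " characters: ", $words[0], "\n" if @words;
}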

Think I'll dream about foreach loops. And accessing the array of arrays. Have to do it several more times to get used to it.
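
If only the counts are needed, and not the words themselves, a hash keyed by length is probably the shortest route. This is a sketch with a made-up word list and a hypothetical `%count` hash, not what the script above does.

```perl
use strict;
use warnings;

my @words = qw(a to be cat dog horse);

# key: word length, value: how many words have that length
my %count;
$count{ length $_ }++ for @words;

# hash keys come back unordered, so sort them numerically
for my $len ( sort { $a <=> $b } keys %count ) {
    print "Number of words with $len characters: $count{$len}\n";
}
```

Lengths that never occur simply have no key, so there is nothing undefined to guard against.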

The original script

Here's what I originally wrote, with Croatian variable names and messages. About 35 lines of code.

#!/usr/bin/perl -w
# cntstr.pl -- count characters in Unicode string
use strict;
use warnings;
use utf8;
 
binmode STDOUT, ":utf8";
my $filename = $ARGV[0];
 
open my $fh, "< :encoding(UTF-8)", $filename or die "open: $!";
my @str = <$fh>;
# chomp array:
chomp (@str);
# sort by length, from the shortest string to the longest:
my @sorted = sort { length $a <=> length $b } @str;
print "Najkraća riječ: ", $sorted[0], ", ", length($sorted[0]), "\n";
print "Predzadnja najduža riječ: ", $sorted[scalar(@sorted) - 2], ", ", length($sorted[scalar(@sorted) - 2]), "\n";
print "Najduža riječ: ", $sorted[scalar(@sorted) - 1], ", ", length($sorted[scalar(@sorted) - 1]), "\n";
# here we should have an array of arrays: 5s, 6s, 7s etc.
# initialize top array
my @brojevi = ();
foreach my $i ( 0 .. 26 ) {
    foreach my $duzina ( 0 .. scalar(@sorted) - 1 ) {
        if ( length($sorted[$duzina]) == $i + 1 ) {
            push @{ $brojevi[$i] }, $sorted[$duzina];
        }
    }
}
foreach my $b ( 0 .. 26 ) {
    print "Broj riječi od ", ($b + 1), " slova: ", scalar(@{$brojevi[$b]}), "\n";
    print "Prva riječ s ", ($b + 1), " slova: ", $brojevi[$b][0], "\n";
}