Julie Smith wrote:
> One of our lecturers is submitting an academic Web site for the RAE and
> needs to give a fairly accurate word count, the site is quite large and it
> would be a long job to import everything into word! Has anyone heard of a
> smart way to calculate the plain text word count?

If the whole site is in one directory (and its subdirectories) and the pages
end in .htm or .html, this Perl script will do the job. I use it to count the
number of words in student assignments. Save it as, e.g., wc.pl in the site's
root directory and run it from there, because it starts counting from the
current directory.

#!/usr/bin/perl
# counts text words in files in directory and subdirectories
$/ = '<'; # split on tag beginnings
#$dsep = ':'; # pathname separator of Mac O/S
$dsep = '/'; # pathname separator for UNIX and MSDOS
$wc = 0; # for word count
&dirlist('.'); # list files in current and sub-directories
print "Total\t$wc\n";

sub dirlist { # counts words in the .htm/.html files of a directory tree
    my ($dname) = @_;
    my (@filelist, $dwc);
    $dwc = 0; # word count for the files directly in this directory
    opendir(THISDIR, $dname) || die "Couldn't open directory '$dname': $!";
    @filelist = readdir(THISDIR);
    closedir THISDIR;
    foreach my $fname (@filelist) {
        next if ($fname =~ /^[\.:]*$/); # skip ., ..
        my $pname = &pathname($dname, $fname);
        if (-d $pname) { # is a directory
            &dirlist($pname);
        } elsif (-f $pname) {
            if ($fname =~ /\.html?$/i) {
                $dwc += &countwords($pname);
            }
        }
    }
    print "$dname total\t$dwc\n";
    $wc += $dwc;
}#endsub dirlist

sub pathname { # generates full pathname from directory and file name
    my ($dname, $fname) = @_;
    # an empty $dname is joined with the separator only when it is ':'
    if ($dname || ($dsep eq ':')) {
        return $dname . $dsep . $fname;
    }
    return $fname;
}
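
sub countwords { # counts the words outside HTML tags in one file
    my ($pname) = @_;
    my (@words, $fwc);
    $fwc = 0;
    open(FIN, '<', $pname) || die "Couldn't open $pname: $!";
    while (<FIN>) { # each record runs up to the next '<' (see $/ above)
        s/^[^>]*>\s*//; # remove the rest of the HTML tag
        s/\s*<?$//; # and the trailing '<' that ends the record
        @words = split(' '); # split on white space, ignoring leading blanks
        $fwc += @words;
    }
    close(FIN);
    print "$pname\t$fwc\n";
    return $fwc;
}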
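
If you want to check the counting logic before running it over a whole site,
here is a minimal sketch that applies the same idea (read up to each '<',
throw away the tag, count what is left) to a string held in memory. The name
count_words_in_html and the sample string are my own inventions for
illustration and are not part of the script above. Like the script, it counts
everything outside tags, so the contents of <script> or <style> elements and
entities such as &nbsp; are counted as words too.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: counts words outside tags in an HTML string,
# using the same "split on '<'" approach as countwords.
sub count_words_in_html {
    my ($html) = @_;
    my $count = 0;
    # Open the string as an in-memory file so it can be read record by record.
    open(my $fh, '<', \$html) or die "Couldn't open in-memory string: $!";
    local $/ = '<';                    # each record ends where the next tag starts
    while (my $chunk = <$fh>) {
        $chunk =~ s/^[^>]*>//;         # drop the rest of the tag that began this record
        $chunk =~ s/<$//;              # drop the trailing '<' record terminator
        my @words = split ' ', $chunk; # split on white space, ignoring leading blanks
        $count += @words;
    }
    close($fh);
    return $count;
}

# Made-up sample page: "Word count" (2) + "Three short words." (3)
my $sample = "<html><body><h1>Word count</h1>\n<p>Three short words.</p></body></html>";
print count_words_in_html($sample), "\n";
```

Run on its own, this should print 5. A word broken up by inline markup, such
as <b>W</b>ord, is counted as two words by this approach, so treat the totals
as a close estimate rather than an exact figure.
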
--
Dr. David R. Newman, Queen's University Belfast, School of
Management and Economics, Belfast BT7 1NN, Northern Ireland (UK)
Tel. 028 90335011 FAX: 028 90249881
http://www.qub.ac.uk/mgt/staff/dave/