
System: Bash script to generate broken links report

One of the most effective ways to find broken links is by analyzing Apache log files for errors. This is faster and less resource-intensive than spidering your whole website, and has the added advantage of revealing broken inbound links.

The script presented below can be run from the command line, or from cron, to scan a single logfile and generate a basic report.

The Bash script v0.9

This script requires a single argument: the name of a logfile. Commonly this might be something like access.log, combined.log or sitename-combined_log.

The listed options can be used to add details to the report and to specify an email recipient and sender. Without further ado, here's the script:

/usr/local/bin/broken-links-report:

#!/bin/bash
## Original shell script by Chirp Internet: chirpinternet.eu
## Please acknowledge use of this code by including this header.
##
## Usage: broken-links-report [options] LOGFILE
##
## Options:
##  -n site name for report heading
##  -d domain name
##  -r recipient email
##  -s sender/from address
##

LOGPATH=/var/log/apache2
EMAILREGEX="^[^ ]+@[^ ]+$"

SITENAME=
DOMAIN=
TARGET=
SENDER=

while getopts 'n:d:r:s:' OPTION
do
  case $OPTION in
    n) SITENAME="$OPTARG"
      ;;
    d) DOMAIN="$OPTARG"
      ;;
    r) if [[ "$OPTARG" =~ $EMAILREGEX ]]
      then
        TARGET="$OPTARG"
      else
        echo "Invalid email: $OPTARG"
        exit 2
      fi
      ;;
    s) if [[ "$OPTARG" =~ $EMAILREGEX ]]
      then
        SENDER="$OPTARG"
      else
        echo "Invalid email: $OPTARG"
        exit 2
      fi
      ;;
    ?) printf "Usage: %s [-n name] [-d domain] [-r target] [-s sender] logfile\n" $(basename $0) >&2
      exit 2
      ;;
  esac
done
shift $(($OPTIND - 1))

if [ ! "$#" -eq 1 ]; then
  printf "Usage: %s [-n name] [-d domain] [-r target] [-s sender] logfile\n" $(basename $0) >&2
  exit 2
fi

LOGNAME=$1
LOGFILE="${LOGPATH}/${LOGNAME}"

if [ ! -r "$LOGFILE" ]; then
  echo "File not found: $LOGFILE"
  exit 1
fi

# scan logfile for broken links
RETVAL=`awk '($6 ~ /GET/ && $9 !~ /200|204|206|301|302|304|401|403|-/){print $7}' $LOGFILE \
  | sort | uniq -c | awk '($1 > 1){print $2}'`

if [ "$RETVAL" ]; then

  SUBJECT="Missing (404) and deleted (410) URL report"
  if [ "$SITENAME" ]; then
    SUBJECT="${SUBJECT} for ${SITENAME}"
  fi

  body="*** Apache log file format: http://httpd.apache.org/docs/current/logs.html\n\n"

  for i in $RETVAL
  do
    if [ "$DOMAIN" ]; then
      body="${body}URL: http://${DOMAIN}${i}\n\n"
    else
      body="${body}URI: ${i}\n\n"
    fi
    OUTPUT=`grep -F "GET $i " $LOGFILE`
    body="${body}$OUTPUT\n\n"
  done

  if [ "$TARGET" ]; then
    if [ "$SENDER" ]; then
      echo -e "${body}" | mail ${TARGET} -s "${SUBJECT}" -- -r "${SENDER}"
    else
      echo -e "${body}" | mail ${TARGET} -s "${SUBJECT}"
    fi
  else
    echo "$SUBJECT"
    echo
    echo -e "${body}"
  fi

else
  echo "No broken links found in $LOGFILE"
fi


The script will only report broken links that appear more than once in the logfile: the ($1 > 1) test in the second awk command. This condition could be removed, or changed to a larger number, according to your traffic patterns.
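To see how the threshold works in isolation, here's the counting pipeline run against a tiny, made-up log excerpt (the IP address, dates and URLs are all invented for illustration):

```shell
# Hypothetical combined-log excerpt: two hits on a missing page, one
# one-off 404, and one successful request.
cat > /tmp/sample.log <<'EOF'
1.2.3.4 - - [19/Jan/2015:10:00:00 +0000] "GET /missing-page HTTP/1.1" 404 209
1.2.3.4 - - [19/Jan/2015:10:01:00 +0000] "GET /missing-page HTTP/1.1" 404 209
1.2.3.4 - - [19/Jan/2015:10:02:00 +0000] "GET /one-off HTTP/1.1" 404 209
1.2.3.4 - - [19/Jan/2015:10:03:00 +0000] "GET /index.html HTTP/1.1" 200 1043
EOF

# extract error URIs, count them, report only those seen more than $MIN times
MIN=1
awk '($6 ~ /GET/ && $9 !~ /200|204|206|301|302|304|401|403|-/){print $7}' /tmp/sample.log \
  | sort | uniq -c | awk -v min="$MIN" '($1 > min){print $2}'
# prints: /missing-page
```

Raising MIN to 2 or more would suppress /missing-page as well, which is the knob to turn for high-traffic sites.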

You can find a new and improved version below.

Calling the script from CRON

In our case we want to run the script daily, scanning a full 24 hours of traffic, so it's triggered from the Apache logrotate configuration file. It could just as easily be called from a stand-alone cron or crontab entry using a similar command.

/etc/logrotate.d/apache2:

/var/log/apache2/*log {
        ...
        daily
        ...
        compress
        delaycompress
        firstaction
                /usr/local/bin/broken-links-report -n 'Example Site' \
                        -d www.example.net \
                        -s do-not-reply@example.net \
                        -r webmaster@example.net \
                        example-combined_log
                ...
        endscript
        lastaction
                ...
        endscript
}

By calling the script from firstaction it runs before any logs are rotated. If you place it under lastaction or postrotate then you will need to feed it the rolled-over (.1) logfile.
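If you do opt for lastaction or a stand-alone cron job, the call might look something like this sketch (the schedule, site details and filename are illustrative, not prescriptive):

```shell
# Hypothetical /etc/crontab entry: run daily after the nightly rotation
# and scan the rolled-over logfile (note the .1 suffix).
# 30 6 * * * root /usr/local/bin/broken-links-report -n 'Example Site' -d www.example.net -r webmaster@example.net example-combined_log.1
```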

Future improvements

Filtering the output

You may want to filter the output to ignore certain URLs or requests. Any established website will build up some broken inbound links over time and it doesn't make sense to set up a Redirect (301) in all cases.

There are also requests by various user agents for 'unnecessary' files, such as /sitemap.xml and /apple-touch-icon-precomposed.png, which you might want to leave out of the report, as well as requests from bad robots for admin.php, register.php and other common exploit targets.
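One possible approach, assuming you maintain your own exclusion pattern, is to filter the URI list with grep -Ev before counting. The pattern and sample URIs below are only examples:

```shell
# Hypothetical exclusion pattern - extend to suit your own site.
IGNORE='^/(sitemap\.xml|apple-touch-icon(-precomposed)?\.png|admin\.php|register\.php)$'

# Sample URI stream standing in for the awk output; noise entries are
# dropped, the repeated genuine 404 survives the ($1 > 1) threshold.
printf '%s\n' /sitemap.xml /old-page /admin.php /old-page \
  | grep -Ev "$IGNORE" \
  | sort | uniq -c | awk '($1 > 1){print $2}'
# prints: /old-page
```

In the script itself the grep -Ev stage would slot into the pipeline between the first awk and the sort.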

Convert to shorthand

The code could be made a lot shorter by using shorthand for if/then and case statements.
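For example, the email check from v0.9 can be collapsed into a single line using the && / || shorthand (the variable values here are illustrative):

```shell
#!/bin/bash
EMAILREGEX="^[^ ]+@[^ ]+$"
OPTARG="user@example.net"

# long form, as in v0.9
if [[ "$OPTARG" =~ $EMAILREGEX ]]
then
  TARGET="$OPTARG"
else
  echo "Invalid email: $OPTARG"
  exit 2
fi

# shorthand, as in v1.0: && runs on success, || { ...; } on failure
[[ "$OPTARG" =~ $EMAILREGEX ]] && TARGET="$OPTARG" || { echo "Invalid email: $OPTARG"; exit 2; }

echo "$TARGET"
# prints: user@example.net
```

One caveat with the shorthand: if the command after && itself fails, the || branch also runs, so it's not a strict if/else replacement. For simple assignments, as here, it behaves identically.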

Input validation

The script has some basic validation of input, but will still let you specify files outside the LOGPATH directory for scanning, which is far from ideal.
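One way to tighten this, sketched below, is a stricter regex that rejects any argument containing a slash or a leading dot, so the filename cannot escape LOGPATH. The checkname helper is hypothetical, purely for demonstration:

```shell
#!/bin/bash
# Reject filenames with a leading dot, a space, or any slash at all.
FILEREGEX='^[^./ ][^/ ]*$'

checkname() {
  [[ "$1" =~ $FILEREGEX ]] && echo "ok: $1" || echo "rejected: $1"
}

checkname "access.log"        # ok
checkname "../../etc/passwd"  # rejected
checkname "/etc/passwd"       # rejected
```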

Improved code v1.0

We've now addressed some of the above issues, as well as cleaning up and compressing the code to make it more bash-like:

/usr/local/bin/broken-links-report:

#!/bin/bash
## Original shell script by Chirp Internet: chirpinternet.eu
## Please acknowledge use of this code by including this header.
##
## Usage: broken-links-report [options] LOGFILE
##
## Options:
##  -d domain name
##  -n site name for report heading
##  -r recipient email
##  -s sender/from address
##

LOGPATH="/var/log/apache2"
EMAILREGEX="^[^ ]+@[^ ]+$"
FILEREGEX="^[^. /][^ ]+$"
SUBJECT="Missing (404) and deleted (410) URL report"
USAGE=$( printf "Usage: %s [-n name] [-d domain] [-r target] [-s sender] logfile" $(basename $0) )

DOMAIN=
SITENAME=
TARGET=
SENDER=

while getopts 'n:d:r:s:' OPTION
do
  case $OPTION in
    d) DOMAIN="$OPTARG"
      ;;
    n) SITENAME="$OPTARG"
      ;;
    r) [[ "$OPTARG" =~ $EMAILREGEX ]] && TARGET="$OPTARG" || { printf "Invalid email: %s\n" "$OPTARG" 1>&2; exit 2; }
      ;;
    s) [[ "$OPTARG" =~ $EMAILREGEX ]] && SENDER="$OPTARG" || { printf "Invalid email: %s\n" "$OPTARG" 1>&2; exit 2; }
      ;;
    ?) printf "$USAGE\n" 1>&2
      exit 2
      ;;
  esac
done
shift $(($OPTIND - 1))

[ "$#" -eq 1 ] || { printf "$USAGE\n"; exit 2; }

LOGNAME=$1
[[ "$LOGNAME" =~ $FILEREGEX ]] || { printf "Invalid filename: %s\n" "$LOGNAME"; exit 1; }

LOGFILE="${LOGPATH}/${LOGNAME}"
[ -r "$LOGFILE" ] || { printf "File not found or not readable: %s\n" "$LOGFILE"; exit 1; }

BROKENLINKS=$( awk '($6 ~ /GET/ && $9 !~ /200|204|206|301|302|304|401|403|-/){print $7}' $LOGFILE \
  | sort | uniq -c | awk '($1 > 1){print $2}' )

[ "$BROKENLINKS" ] || { printf "No broken links found in %s\n" "$LOGFILE"; exit 0; }

[ "$SITENAME" ] && SUBJECT="${SUBJECT} for ${SITENAME}"

REPORT=$(
  printf "*** Apache log file format: http://httpd.apache.org/docs/current/logs.html\n\n"
  for i in $BROKENLINKS; do
    [ "$DOMAIN" ] && printf "URL: http://${DOMAIN}%s\n\n" $i || printf "URI: %s\n\n" $i
    OUTPUT=$(grep -F "GET $i " $LOGFILE)
    printf "%s\n\n" "$OUTPUT"
  done
)

[ "$TARGET" ] || { printf "%s\n\n%s\n\n" "$SUBJECT" "$REPORT"; exit 0; }

if [ "$SENDER" ]; then
  echo -e "$REPORT" | mail "$TARGET" -s "$SUBJECT" -- -r "$SENDER"
else
  echo -e "$REPORT" | mail "$TARGET" -s "$SUBJECT"
fi



Please let us know using the Feedback form below if you find this script useful or want to suggest bug fixes or improvements.


User Comments


19 January, 2015

Hi,

Thanks for this script. It works perfectly!
But it also shows the sitemap.xml and apple-touch-icon-precomposed URLs in the output. Can you be more specific about how to filter them so they are NOT in the outgoing email?

A quick solution would be to add a line after:

for i in $BROKENLINKS; do

where you issue a continue command if $i matches one of those file names.
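For example (the file list and loop body below are illustrative):

```shell
# Stand-in for the list produced by the awk pipeline
BROKENLINKS="/sitemap.xml /old-page /apple-touch-icon-precomposed.png"

for i in $BROKENLINKS; do
  case "$i" in
    /sitemap.xml|/apple-touch-icon*.png) continue ;;  # skip noise entries
  esac
  echo "report: $i"   # the real script builds the report body here
done
# prints: report: /old-page
```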
