Checking iTunes podcast XML for missing files
I had some difficulties moving a podcast from MobileMe to a new location. The podcasting software just wasn’t republishing correctly to the new place, and too many 404s for individual audio files will quickly get you kicked out of the iTunes podcast directory.
So, given podcast.xml:
Let’s grab the contents of all the <guid> tags, which will give us a list of our audio URLs.
sed -n -e 's/.*<guid>\(.*\)<\/guid>.*/\1/p' podcast.xml >> urls.txt
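To see what that extraction does, here is a minimal sketch run against a made-up feed. The file names sample-podcast.xml and sample-urls.txt and the example.com URLs are placeholders, not the real podcast, and it assumes each <guid> element sits on its own line with no attributes.

```shell
# Hypothetical two-episode feed; the example.com URLs are placeholders.
cat > sample-podcast.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example Podcast</title>
    <item>
      <title>Episode 1</title>
      <guid>http://example.com/audio/episode1.mp3</guid>
    </item>
    <item>
      <title>Episode 2</title>
      <guid>http://example.com/audio/episode2.mp3</guid>
    </item>
  </channel>
</rss>
EOF

# -n suppresses normal output; the trailing p prints only lines where the
# substitution matched, reduced to the text between <guid> and </guid>.
sed -n -e 's/.*<guid>\(.*\)<\/guid>.*/\1/p' sample-podcast.xml > sample-urls.txt

cat sample-urls.txt
# http://example.com/audio/episode1.mp3
# http://example.com/audio/episode2.mp3
```

A feed that writes the tag as <guid isPermaLink="false"> would not match this pattern, so check how your own feed is written first.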
Now let’s use wget as a web spider, which means that it will not download the files, just check that they are there. We’ll get a full report in wget.log, including a timestamp, headers, file size, and, most importantly, the response code.
wget --spider -o wget.log -e robots=off --wait 1 -i urls.txt
I am, of course, too lazy to read the log. I just want a quick check to see if any 404s were returned.
grep -B 2 '404' wget.log
-B num is the same as --before-context=num. It tells grep to print num lines of leading context before each matching line, so we also see the URL that 404’d.
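If you would rather skip the log altogether, a rough alternative is to check the URLs one at a time and print only the ones that fail. This is a sketch, not what I actually ran: check_urls and the CHECK variable are names made up for this example, and it leans on wget exiting with a non-zero status when the server answers with an error such as a 404. A non-zero exit also covers DNS and network failures, so a reported URL is not necessarily a 404.

```shell
# CHECK is the command used to probe a single URL (a hypothetical knob for
# this sketch). By default it is wget in spider mode with output silenced.
CHECK="${CHECK:-wget --spider -q}"

# check_urls FILE: read one URL per line and report every URL whose probe fails.
check_urls() {
  while IFS= read -r url; do
    # Skip blank lines.
    [ -n "$url" ] || continue
    # $CHECK is left unquoted on purpose so the command and its flags split into words.
    if ! $CHECK "$url"; then
      echo "FAILED: $url"
    fi
  done < "$1"
}
```

Calling check_urls urls.txt then prints one FAILED line per problem URL and nothing at all if every file is reachable.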