I’m currently working on finishing up redesigns on two e-commerce sites at my day job. This is the culmination of many months of work and significant resources. We’re within days of launch. I’m pretty excited, but I decided that I needed to do another review of my 301 redirects to make sure I didn’t miss anything.
This analysis showed me that I had.
When designing and architecting new sites, it’s really easy to only think about how awesome things will be with the new platform/design/whatever. However, it’s projects like organizing 301 redirects for old URLs to new URLs that require you not to forget how bad things once were. For me, this analysis forced me to recall how many problems these sites have had over the years with canonical URLs.
We were able to overcome our internal canonicalization issues by using the rel=canonical tag and fixing internal linking. However, there were years before where several different URLs were used for any one page. Some of those alternate URLs are still haunting us today.
My challenge is to dig up all those old URL versions and make sure we have 301 redirects in place so that when we launch our new sites, visitors and search engines see our 404 page as little as possible. The first step is to build a list of all those URLs.
Sources of URLs
Here are the sources I used to compile the list of URLs to check. I wanted a variety of sources, on the theory that a wider net is better.
- HTML Sitemap
- XML Sitemap
- PPC Account Landing Pages
- Google Webmaster Tools
- Blekko In-Bound Link Report (see my bookmarklet for easy data extraction)
- Server Logs
- Google Analytics
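Some of these sources can be scraped from the command line. As a rough sketch for the XML sitemap, the function below pulls out the contents of every <loc> element. The name sitemap_urls and the file names in the usage comment are my own placeholders, and it assumes each <loc>...</loc> pair sits on a single line.

```shell
# sitemap_urls: read an XML sitemap on stdin and print each <loc> URL, one per line.
sitemap_urls() {
  # grep -o prints every match on its own line, so minified sitemaps work too.
  grep -o '<loc>[^<]*</loc>' |
    sed -e 's/<loc>[[:space:]]*//' \
        -e 's/[[:space:]]*<\/loc>//' \
        -e 's/&amp;/\&/g'   # decode the XML entity for "&" in query strings
}

# Usage (file names are placeholders):
# sitemap_urls < sitemap.xml >> all-urls.txt
```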
It isn’t important to me to keep URLs separated by source. Once I had extracted URLs from one source, I just dumped them into an Excel worksheet. Later on you can use Excel to remove duplicates.
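If you would rather skip Excel for the de-duplication step, sort can do it. This is a minimal sketch, assuming the combined URLs were saved one per line in a text file. The function name dedupe_urls and the file names are placeholders of mine.

```shell
# dedupe_urls: read URLs on stdin, one per line, and print each unique URL once (sorted).
dedupe_urls() {
  # Strip Windows line endings (common in spreadsheet exports), drop blank lines,
  # then sort and keep one copy of each URL.
  tr -d '\r' | grep -v '^[[:space:]]*$' | sort -u
}

# Usage (file names are placeholders):
# dedupe_urls < all-urls.txt > url-file.txt
```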
Checking Server Responses on URLs
Once you have your list of pages you’ll need an efficient (automated) way to check the server response for each. The server response code (e.g. 200, 301, 302, 404, 500) is part of the HTTP header that is sent by the server to the web browser for each page request. I decided to write a short script to extract just the HTTP header.
Here’s the Bash one-liner:
wget --spider -S "$url_to_check" 2>&1;
Now just replace $url_to_check with the actual URL, like http://www.google.com. Here’s an example:
wget --spider -S "http://www.google.com" 2>&1;
This is what you’ll get back (Google sends a rather large header):
Spider mode enabled. Check if remote file exists.
--2012-07-24 16:11:15-- http://www.google.com/
Resolving www.google.com (www.google.com)... 74.125.225.176, 74.125.225.180, 74.125.225.178, ...
Connecting to www.google.com (www.google.com)|74.125.225.176|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Set-Cookie: NID=62=GZrQmPOvK5AyPhgA1RYRKP3KCxVdFL_QZ_GptmYGQrOI2d9nUqQETovH7MhtWroeeFOL_xKGt1w-YffuGhmP5IjF38IcR6IbNlTVBLLU_t35rQwaVZFW7H7jKGVqRIr3; expires=Wed, 23-Jan-2013 20:12:05 GMT; path=/; domain=.google.com; HttpOnly
Date: Tue, 24 Jul 2012 20:12:05 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com
Set-Cookie: path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com
Set-Cookie: domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com
Set-Cookie: expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com
Set-Cookie: path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com
Set-Cookie: domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com
Set-Cookie: expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com
Set-Cookie: path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com
Set-Cookie: domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com
Set-Cookie: expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com
Set-Cookie: path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com
Set-Cookie: domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com
Set-Cookie: expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com
Set-Cookie: path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com
Set-Cookie: domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com
Set-Cookie: expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com
Set-Cookie: path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com
Set-Cookie: domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com
Set-Cookie: expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com
Set-Cookie: path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com
Set-Cookie: domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com
Set-Cookie: expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com
Set-Cookie: path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com
Set-Cookie: domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com
Set-Cookie: PREF=ID=d72f33e49b899251:FF=0:TM=1343160725:LM=1343160725:S=B8YfflyvIQvhgjE8; expires=Thu, 24-Jul-2014 20:12:05 GMT; path=/; domain=.google.com
Set-Cookie: NID=62=d-OSrg2MYi6_7kbY5lHpW3qQ5ASiMMblUeUfBqppHahxDRXQL4qqPuI1nNxbgV3MoQnwxOuD6mPSTRrCK4xZk_ApFTi0KSsyDibrQ6-KHRXKZlPpECBr6AW39QNfhC9G; expires=Wed, 23-Jan-2013 20:12:05 GMT; path=/; domain=.google.com; HttpOnly
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Transfer-Encoding: chunked
Length: unspecified [text/html]
Remote file exists and could contain further links,
but recursion is disabled -- not retrieving.
All you should really care about is the line that says “HTTP/1.1 200 OK”. That is the server response that we want to verify. To extract just the server response from the output, let’s pipe the HTTP header out to grep to filter it.
Run this…
wget --spider -S "http://www.google.com" 2>&1 | grep "HTTP/"
…and you’ll get…
HTTP/1.1 200 OK
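If you only care about the numeric codes, awk can trim the status lines further. The sketch below is one way to do it. The function name status_chain is my own, and it assumes the header lines look like the wget output shown earlier.

```shell
# status_chain: read `wget --spider -S` output on stdin and print the numeric
# status codes in the order they were received, separated by spaces.
status_chain() {
  awk '$1 ~ /^HTTP\// { printf "%s%s", sep, $2; sep = " " } END { print "" }'
}

# Usage (needs network access):
# wget --spider -S "http://www.google.com" 2>&1 | status_chain
```

A URL that redirects once before resolving would print two codes, one for each hop.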
Perfect, except that it would be very tedious to do this one URL at a time. Let’s use a Bash for loop to finish this up. Since you should be doing all of this PRIOR to launching your new site, you’ll need to replace your productiondomain in the URLs with the stage.productiondomain for the development site. sed is perfect for that. Also, update the /path/to/url-file.txt to match the actual path to the file containing URLs. This assumes that there isn’t any other data in the text file except for the URLs you want to check.
for URL in $(sed 's/productiondomain\.com/stage.productiondomain.com/g' "/path/to/url-file.txt"); do echo "$URL" - $(wget --spider -S "$URL" 2>&1 | grep "HTTP/"); done
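The one-liner splits the file on any whitespace, which is fine for clean URL lists. If your list is messier, a while-read loop handles it line by line. Here is a sketch of the same idea split into two small functions. The names fetch_headers and check_urls are mine, not part of wget.

```shell
# fetch_headers: print the response headers wget receives for one URL (needs network access).
fetch_headers() {
  # Redirect stdin so wget cannot swallow the URL list the loop is reading.
  wget --spider -S "$1" 2>&1 < /dev/null
}

# check_urls: read URLs on stdin, one per line, and print "URL - status lines" for each.
check_urls() {
  while IFS= read -r url || [ -n "$url" ]; do
    [ -z "$url" ] && continue   # skip blank lines
    # Join every status line onto one row, as the one-liner does.
    statuses=$(fetch_headers "$url" |
      awk '/HTTP\// { $1 = $1; printf "%s%s", sep, $0; sep = " " }')
    printf '%s - %s\n' "$url" "$statuses"
  done
}

# Usage, with the same placeholder path and domains as the one-liner:
# sed 's/productiondomain\.com/stage.productiondomain.com/g' /path/to/url-file.txt | check_urls
```

Keeping the wget call in its own function also makes it easy to swap in a different fetch command later.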
For this example, let’s take the following URLs:
http://www.slipxsolutions.com
http://www.slipxsolutions.com/slip-X_NEXT.php
http://www.slipxsolutions.com/product/41/Safety_TreadsTub_Appliqu_and_233s/75/Bath__and__Shower_Appliques/
http://www.slipxsolutions.com/product/41/Safety_TreadsTub_Appliqu_and_233s/
http://www.slipxsolutions.com/product/41/Safety_TreadsTub_Appliqu_and_233s/134/75_Safety_Treads/
http://www.slipxsolutions.com/product/41/Safety_TreadsTub_Appliqu_and_233s/78/145_and_quot_Safety_Treads/
http://www.slipxsolutions.com/product/23/Drain_PlugsDrain_Products/46/Snug_Plug_Drain_Stopper_/
http://www.slipxsolutions.com/product/23/Drain_PlugsDrain_Products/48/StopAClog_Drain_Protector/
http://www.slipxsolutions.com/product/23/Drain_PlugsDrain_Products/126/Bottomless_Bath/
http://www.slipxsolutions.com/product/30/Bath__and__Home_Accessories/113/Shower_Splash_Guard/
Once I run the tool I get this as output:
http://www.slip-xsolutions.com - HTTP/1.1 200 OK
http://www.slip-xsolutions.com/slip-X_NEXT.php - HTTP/1.1 301 Moved Permanently HTTP/1.1 200 OK
http://www.slip-xsolutions.com/product/41/Safety_TreadsTub_Appliqu_and_233s/75/Bath__and__Shower_Appliques/ - HTTP/1.1 301 Moved Permanently HTTP/1.1 200 OK
http://www.slip-xsolutions.com/product/41/Safety_TreadsTub_Appliqu_and_233s/ - HTTP/1.1 301 Moved Permanently HTTP/1.1 200 OK
http://www.slip-xsolutions.com/product/41/Safety_TreadsTub_Appliqu_and_233s/134/75_Safety_Treads/ - HTTP/1.1 301 Moved Permanently HTTP/1.1 200 OK
http://www.slip-xsolutions.com/product/41/Safety_TreadsTub_Appliqu_and_233s/78/145_and_quot_Safety_Treads/ - HTTP/1.1 301 Moved Permanently HTTP/1.1 200 OK
http://www.slip-xsolutions.com/product/23/Drain_PlugsDrain_Products/46/Snug_Plug_Drain_Stopper_/ - HTTP/1.1 301 Moved Permanently HTTP/1.1 200 OK
http://www.slip-xsolutions.com/product/23/Drain_PlugsDrain_Products/48/StopAClog_Drain_Protector/ - HTTP/1.1 301 Moved Permanently HTTP/1.1 200 OK
Now all you have to do is scan the output for server response codes that don’t match your expectations. Notice that wget, when run with the --spider option, checks each page along the redirect path. If there are 6 redirects, it should follow every one and output each server response.
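With a long list, scanning by eye gets old quickly. As a sketch, the filter below keeps only the rows whose final response is not 200. The function name flag_problems and the file name results.txt are placeholders of mine, and it assumes rows in the "URL - status lines" format shown above.

```shell
# flag_problems: read "URL - status lines" rows on stdin and print the rows
# whose final status code is not 200 (rows with no status at all are printed too).
flag_problems() {
  awk '{
    final = ""
    for (i = 1; i < NF; i++)
      if ($i ~ /^HTTP\/[0-9.]+$/) final = $(i + 1)   # code that follows each HTTP/x.y token
    if (final != "200") print
  }'
}

# Usage (results.txt is a placeholder for the saved loop output):
# flag_problems < results.txt
```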
Let me know if you have any questions. Happy redirecting.