Automate Everything w/ Bash, Linux & Command Line

Month

July 2012

4 posts

Facebook, the Coolest Click Fraudster

Yup, but I’m not the first or the only one to say it. It’s very likely that Facebook is participating in planned mass click fraud. This topic is only top of mind because of the post I read a few minutes ago about how 80% of clicks on Facebook ads are bots. Since that was a Facebook post and the author claims to be deleting the page/account soon, I decided to add the text here as well for archiving purposes. This isn’t something I’ve worried about in years primarily because I swore off Facebook ads (except in very rare cases) back when Facebook ads were still in beta.

I wrote about a very similar experience I had just a few months ago with click fraud at Shopzilla. I hate click fraud. If you read my post about Shopzilla, you’ll also get some of my memories about dealing with Yahoo when they used to get a ton of invalid click activity (that’s the PC way of referring to click fraud) from publishers on their network. The good thing about Yahoo was that they used to give refunds if you could prove that the activity was “invalid”, that is, if you even paid enough attention to your analytics to notice. Yahoo wouldn’t call it a refund though; that would be an admission of guilt. They’d be very clear when referring to the monies paid as a “courtesy credit”, even though I’ve actually gotten them to send a check as opposed to crediting back to the account balance.

At least Yahoo had the courtesy to refund fraudulent click spend. Facebook & Shopzilla, well, it seems as though they’re in denial mode.

A Different Perspective…

I think that most of the people who write about this in the coming days will be bashing Facebook over this. They probably deserve it. Also, I bet there are a few who will take Facebook’s side and defend them. However, there’s a third perspective that I’d like to share. How digital marketing agencies fit into the mix.

Most large online, digital marketing agencies make their revenue based on a percent of the client’s advertising spend. The more their client spends on advertising, the more the agency brings in for fees. This is logically defensible because if the agency wants to convince their client to spend more money on advertising, then the agency also has to return more sales for that client. You know, provide them a good return on investment.

Here’s the thing about ROI… a good return on investment is different for every business. One of the accounts I manage every day would be doing poorly if the Google Paid Search account returned less than 1000% ROI averaged over a week. However, I’ve managed plenty of accounts over the years where 200% ROI was a frigg’n miracle. The agency trick is convincing the client that 500% ROI is somehow better long term, even if 1000% is possible now. Who would deny that a return of $5 for every $1 spent on advertising is an amazing return? If I didn’t know better, I wouldn’t.

If you’re currently working with an agency that takes its fee based on percent of ad spend, keep your ears open for terms/phrases like this:

  • “Cast a wide net”
  • “Branding”
  • “Long conversion cycle”
  • “Building awareness”
  • “prospecting keywords”, usually referring to broad match keywords in AdWords.

I’m not saying all those are bullshit, I’m just saying that you need to be able to definitively track and prove results. You owe it to yourself as the client to hold the agency to a high standard/burden of proof. If you don’t care, they won’t either. In my experience, more often than not, those strategies are birthed by pressure to grow client budgets.

Agencies Care More About Their Own ROI

Advertising account managers at agencies are pressured by those above them to grow budgets to increase ROI for the agency. Imagine that. Yeah, I said it.

As long as the client is happy, grow spend, work hard to justify it (consulting), twist the graphs and grow those budgets. What happens if clients become unhappy? Convince them they’re wrong. How do I know this? I’ve been there. I was the guy who cared about my clients and took shit from my superiors for it. I don’t believe that the agency I came from was unique. I took over many accounts from bigger, more well known agencies that were being blatantly negligent with account management. And why? Because it benefits them. As a matter of fact, I was just talking with a friend who’s also an ex-agency guy/girl about me writing this post and this is a quote (I cut out a couple of names…):

“I wasted thousands of dollars “testing” FB ads for clients. They don’t work after three months? “Keep testing and trying new ad copy.” Roger that <redacted>”

“FB ads for the most part were the biggest goddamn waste of time and money ever. <redacted> seemed to like them. I think because they burned through available budget.”

I left the agency world (in part) for that reason. I went to work in-house for a company who needed a full time online marketing guy. It was the best move of my life. I’m now truly incentivized to optimize advertising accounts for ROI. I’m rewarded when I kick more ass. I like that arrangement.

Why Throw Agencies Under the Bus?

I think they deserve it because they’re trusted advisers and they should know better. In many cases, it’s like taking candy from a baby. Agencies are the ones who know about click fraud. They’re the people who’re the experts. They see the large data needed to pick out the oddities in analytics that just don’t seem right. Agencies should be advocates against shit bombs like Facebook, Shopzilla and Yahoo (like I was and still am). But, more often, agencies are too busy writing blog posts and whitepapers about how social media is the next big way to target prospective buyers. Agencies cite one another’s published “research” about how social media “advertising” will “grow your brand”. These agencies have one interest in common - getting more of your ad budget.

If you’re a smart brand, small business or startup then you should tell agencies to DIAF. Hire a consultant who trains you or someone in your company who has a vested interest in your success. Hire someone who has real experience. Don’t be pushed around by buzzwords and skewed market research.

Take control of your advertising spend and make it return for you.

Jul 30, 2012
1 note
#click fraud #facebook #social media

Verifying 301 Redirects During Site Redesign

I’m currently working on finishing up redesigns on two e-commerce sites at my day job. This is the culmination of many months of work and significant resources. We’re within days of launch. I’m pretty excited, but I decided that I needed to do another review of my 301 redirects to make sure I didn’t miss anything.

This analysis showed me that I had.

When designing and architecting new sites, it’s really easy to only think about how awesome things will be with the new platform/design/whatever. However, it’s projects like organizing 301 redirects for old URLs to new URLs that require you not to forget how bad things once were. For me, this analysis forced me to recall how many problems these sites have had over the years with canonical URLs.

We were able to overcome our internal canonicalization issues by using the rel=canonical tag and fixing internal linking. However, there were years before where several different URLs were used for any one page. Some of those alternate URLs are still haunting us today.

My challenge is to dig up all those old URL versions and make sure we have 301 redirects in place so that when we launch our new sites, visitors and search engines see our 404 page as little as possible. We just need to get a list of all those URLs to get started with.

Sources of URLs

Here’s a list of several sources I used to get together a list of URLs to check. I wanted to make sure I used a variety of sources with the thinking that a wider net is better.

  1. HTML Sitemap
  2. XML Sitemap
  3. PPC Account Landing Pages
  4. Google Webmaster Tools
  5. Blekko In-Bound Link Report (see my bookmarklet for easy data extraction)
  6. Server Logs
  7. Google Analytics

It isn’t important to me to keep URLs separated by source. Once I had extracted URLs from one source, I just dumped them into an Excel worksheet. Later on you can use Excel to remove duplicates.
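If you’d rather stay on the command line than open Excel, here’s a rough sketch of the same de-duplication step. It assumes you’ve saved each source as a plain text file with one URL per line; the function name merge_urls and the file names in the usage note are made up for illustration.

```shell
# Merge several URL lists into one de-duplicated, sorted file.
# Usage: merge_urls OUTPUT_FILE INPUT_FILE...
# (merge_urls is a hypothetical helper name, not part of any tool.)
merge_urls() {
    out="$1"
    shift
    # Strip carriage returns (common in Excel/Windows exports),
    # drop blank lines, then sort and remove exact duplicates.
    cat "$@" | tr -d '\r' | grep -v '^[[:space:]]*$' | sort -u > "$out"
}
```

You’d call it like merge_urls all-urls.txt sitemap.txt ppc.txt logs.txt and then feed all-urls.txt to the checks that follow. Note that it only removes exact duplicates, so URLs that differ by a trailing slash or by letter case are kept as separate entries.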

Checking Server Responses on URLs

Once you have your list of pages you’ll need an efficient (automated) way to check the server response for each. The server response code, i.e. 200, 301, 302, 404, 500, etc. is a part of the HTTP header that is sent by the server to the web browser for each page request. I decided to write a short script to extract just the HTTP header.

Here’s the Bash one-liner:

wget --spider -S "$url_to_check" 2>&1;

Now just replace $url_to_check with the actual URL, like http://www.google.com. Here’s an example:

wget --spider -S "http://www.google.com" 2>&1;

This is what you’ll get back (Google sends a rather large header):

Spider mode enabled. Check if remote file exists.
--2012-07-24 16:11:15--  http://www.google.com/
Resolving www.google.com (www.google.com)... 74.125.225.176, 74.125.225.180, 74.125.225.178, ...
Connecting to www.google.com (www.google.com)|74.125.225.176|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Set-Cookie: NID=62=GZrQmPOvK5AyPhgA1RYRKP3KCxVdFL_QZ_GptmYGQrOI2d9nUqQETovH7MhtWroeeFOL_xKGt1w-YffuGhmP5IjF38IcR6IbNlTVBLLU_t35rQwaVZFW7H7jKGVqRIr3; expires=Wed, 23-Jan-2013 20:12:05 GMT; path=/; domain=.google.com; HttpOnly
  Date: Tue, 24 Jul 2012 20:12:05 GMT
  Expires: -1
  Cache-Control: private, max-age=0
  Content-Type: text/html; charset=ISO-8859-1
  Set-Cookie: expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com
  Set-Cookie: path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com
  Set-Cookie: domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com
  Set-Cookie: expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com
  Set-Cookie: path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com
  Set-Cookie: domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=www.google.com
  Set-Cookie: expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com
  Set-Cookie: path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com
  Set-Cookie: domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com
  Set-Cookie: expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com
  Set-Cookie: path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com
  Set-Cookie: domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.www.google.com
  Set-Cookie: expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com
  Set-Cookie: path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com
  Set-Cookie: domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com
  Set-Cookie: expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com
  Set-Cookie: path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com
  Set-Cookie: domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=google.com
  Set-Cookie: expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com
  Set-Cookie: path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com
  Set-Cookie: domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com
  Set-Cookie: expires=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com
  Set-Cookie: path=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com
  Set-Cookie: domain=; expires=Mon, 01-Jan-1990 00:00:00 GMT; path=/; domain=.google.com
  Set-Cookie: PREF=ID=d72f33e49b899251:FF=0:TM=1343160725:LM=1343160725:S=B8YfflyvIQvhgjE8; expires=Thu, 24-Jul-2014 20:12:05 GMT; path=/; domain=.google.com
  Set-Cookie: NID=62=d-OSrg2MYi6_7kbY5lHpW3qQ5ASiMMblUeUfBqppHahxDRXQL4qqPuI1nNxbgV3MoQnwxOuD6mPSTRrCK4xZk_ApFTi0KSsyDibrQ6-KHRXKZlPpECBr6AW39QNfhC9G; expires=Wed, 23-Jan-2013 20:12:05 GMT; path=/; domain=.google.com; HttpOnly
  P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
  Server: gws
  X-XSS-Protection: 1; mode=block
  X-Frame-Options: SAMEORIGIN
  Transfer-Encoding: chunked
Length: unspecified [text/html]
Remote file exists and could contain further links,
but recursion is disabled -- not retrieving.

All you should really care about is the line that says “HTTP/1.1 200 OK”. That is the server response that we want to verify. To extract just the server response from the output, let’s pipe the HTTP header out to grep to filter it.

Run this…

wget --spider -S "http://www.google.com" 2>&1 | grep "HTTP/"

…and you’ll get…

HTTP/1.1 200 OK
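If you only want the numeric code (handy for sorting or counting later), a small awk filter can pull it out of the same output. This is just a sketch; status_codes is a name I made up, and it assumes the indented “HTTP/1.1 200 OK” style lines shown above.

```shell
# Read `wget --spider -S` output on stdin and print only the numeric
# status code of each response, one per line.
# (status_codes is a hypothetical helper name.)
status_codes() {
    # awk ignores the leading indentation, so the first field is
    # "HTTP/1.1" and the second field is the status code.
    awk '$1 ~ /^HTTP\// { print $2 }'
}
```

Usage would look like: wget --spider -S "http://www.google.com" 2>&1 | status_codes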

Perfect, except that it would be very tedious to do this one URL at a time. Let’s use a Bash for loop to finish this up. Since you should be doing all of this PRIOR to launching your new site, you’ll need to replace your productiondomain in the URLs with the stage.productiondomain for the development site. sed is perfect for that. Also, update the /path/to/url-file.txt to match the actual path to the file containing URLs. This assumes that there isn’t any other data in the text file except for the URLs you want to check.

 for URL in `cat "/path/to/url-file.txt" | sed 's/productiondomain.com/stage.productiondomain.com/g'`; do echo "$URL" - `wget --spider -S "$URL" 2>&1 | grep "HTTP/"` ; done
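For a longer list, the same idea may be easier to maintain as two small functions. This is a sketch, not a drop-in replacement: to_stage and check_urls are names I invented, the domains are the same placeholders as above, and it assumes one URL per line.

```shell
# Rewrite production URLs to their staging equivalents.
# "productiondomain.com" and "stage.productiondomain.com" are
# placeholders; swap in your own. The dot is escaped so it only
# matches a literal dot.
to_stage() {
    sed 's/productiondomain\.com/stage.productiondomain.com/g'
}

# Check every URL in a file (one per line) and print "URL - responses".
check_urls() {
    to_stage < "$1" | while IFS= read -r url; do
        [ -n "$url" ] || continue   # skip blank lines
        # The command substitution is left unquoted on purpose: the shell
        # joins wget's response lines with single spaces.
        echo "$url" - $(wget --spider -S "$url" 2>&1 | grep "HTTP/")
    done
}
```

Run it as check_urls /path/to/url-file.txt. Reading with while/read keeps each line intact even if a URL somehow contains a space.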

For this example, let’s take the following URLs:

http://www.slipxsolutions.com
http://www.slipxsolutions.com/slip-X_NEXT.php
http://www.slipxsolutions.com/product/41/Safety_TreadsTub_Appliqu_and_233s/75/Bath__and__Shower_Appliques/
http://www.slipxsolutions.com/product/41/Safety_TreadsTub_Appliqu_and_233s/
http://www.slipxsolutions.com/product/41/Safety_TreadsTub_Appliqu_and_233s/134/75_Safety_Treads/
http://www.slipxsolutions.com/product/41/Safety_TreadsTub_Appliqu_and_233s/78/145_and_quot_Safety_Treads/
http://www.slipxsolutions.com/product/23/Drain_PlugsDrain_Products/46/Snug_Plug_Drain_Stopper_/
http://www.slipxsolutions.com/product/23/Drain_PlugsDrain_Products/48/StopAClog_Drain_Protector/
http://www.slipxsolutions.com/product/23/Drain_PlugsDrain_Products/126/Bottomless_Bath/
http://www.slipxsolutions.com/product/30/Bath__and__Home_Accessories/113/Shower_Splash_Guard/

Once I run the tool I get this as output:

http://www.slip-xsolutions.com - HTTP/1.1 200 OK
http://www.slip-xsolutions.com/slip-X_NEXT.php - HTTP/1.1 301 Moved Permanently HTTP/1.1 200 OK
http://www.slip-xsolutions.com/product/41/Safety_TreadsTub_Appliqu_and_233s/75/Bath__and__Shower_Appliques/ - HTTP/1.1 301 Moved Permanently HTTP/1.1 200 OK
http://www.slip-xsolutions.com/product/41/Safety_TreadsTub_Appliqu_and_233s/ - HTTP/1.1 301 Moved Permanently HTTP/1.1 200 OK
http://www.slip-xsolutions.com/product/41/Safety_TreadsTub_Appliqu_and_233s/134/75_Safety_Treads/ - HTTP/1.1 301 Moved Permanently HTTP/1.1 200 OK
http://www.slip-xsolutions.com/product/41/Safety_TreadsTub_Appliqu_and_233s/78/145_and_quot_Safety_Treads/ - HTTP/1.1 301 Moved Permanently HTTP/1.1 200 OK
http://www.slip-xsolutions.com/product/23/Drain_PlugsDrain_Products/46/Snug_Plug_Drain_Stopper_/ - HTTP/1.1 301 Moved Permanently HTTP/1.1 200 OK
http://www.slip-xsolutions.com/product/23/Drain_PlugsDrain_Products/48/StopAClog_Drain_Protector/ - HTTP/1.1 301 Moved Permanently HTTP/1.1 200 OK

Now all you have to do is look for server response codes that don’t match your expectations. Notice that wget, when using the --spider option, will check each page along the redirect path. If there are 6 redirects, it should follow every one and output the server responses.
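With a few hundred URLs, eyeballing the output gets old fast. Here’s a rough filter, assuming the output is in the “URL - HTTP/1.1 301 Moved Permanently HTTP/1.1 200 OK” format shown above and that a final 200 is what you expect. The name flag_problems is hypothetical.

```shell
# Read "URL - HTTP/1.1 ... HTTP/1.1 200 OK" lines on stdin and print
# only the lines whose final response is not 200 (or has no response).
# (flag_problems is a hypothetical helper name.)
flag_problems() {
    awk '{
        code = ""
        # Remember the status code that follows the last HTTP/ token.
        for (i = 1; i < NF; i++) if ($i ~ /^HTTP\//) code = $(i + 1)
        if (code != "200") print
    }'
}
```

Save the loop’s output to a file, say results.txt, and run flag_problems < results.txt to see only the URLs that need attention.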

Let me know if you have any questions. Happy redirecting.

Jul 24, 2012
#redesign #seo #redirects #server response codes

WTF Is Security?

I’m not really sure how much this is related to automation, but I just have to get this out.

I feel kind of insulted right now. If the speed and force of my typing is any indication of my mood, then I’m sure the whole office knows I’m pissed (I’m typing this on an IBM Model M keyboard). Why am I pissed? Well, at my day job, I’m working with a new vendor to help do some e-commerce integration with our accounting package and a set of new websites we’re developing. The experience I’m about to share is pretty close to my first impression of them.

If I Give You Remote Access, Respect It!

This company was trying to troubleshoot a bug and couldn’t figure it out without getting access to our main server in the office. So, with them being hundreds of miles away, naturally they ask for remote access. I agree.

They then send me an email with instructions for allowing the remote connection. I’ll just paste what they sent me, minus the important bits, and let’s see if you can spot what’s so fucked up about what they’re doing…

Hi Adam,

One of our developers requires a remote connection to your machine to investigate the on-premise component of <redacted>.

Please find below instructions for making your computer available for remote connection via GoToMyPC.com:

eMail:              <redacted>
Password:        <redacted>
NickName:        **Company Name**
AccessCode:    <redacted>

If GoToMyPC is already installed on your computer, there should be a little green and white MYPC logo by the clock in the bottom right hand corner.  If it is there, please right click on the icon and choose ‘Register’.  You will be prompted for the information above.

If GoToMyPC is not already installed on your computer, please go to www.gotomypc.com and log in using the eMail address and password provided above.  When you log in, you will see a list of computer with a ‘Add Computer’ button at the bottom.  Please click on the button and allow the GoToMyPC software to install (should only take a minute)

[PLEASE NOTE: When installing GoToMyPC and registering the computer, this must be done on the machine you are physically in front of (not over a network).  This is a built in security feature.]

Once the software is installed, you will be asked to re-start.  This is not necessary for the product to function.

You will then be prompted for the information provided above.

Please let us know when you are available for us to connect.

Thank you and Kind Regards,

<redacted>

<redacted> Help Desk Administrator

Let me just say that I don’t have any real experience with GoToMyPc.com. Fortunately for me, I didn’t really need it to be able to tell how ridiculous this is.

This company is providing the login credentials for their main account and asking me to log into it and add my server. Please remember that I’m their client. They don’t know me. I don’t work for them. And, they emailed these credentials.

I logged in. Guess what I see? I see a list of all their clients’ servers who’ve also followed these instructions. Next to each computer on the list is a button that would allow me to connect to them! WTF! They essentially gave me access to every one of their clients’ servers where their backend accounting software is installed. What the fucking fuck? Are they brain dead? This is where all their clients’ customers’ financial and PII data is stored.

I’m not an idiot, I didn’t connect to any of them.

Quit Emailing Credentials

I’m a good guy. I’m not going to do anything with this information besides complain. Yes, I’ll complain to them too. But there are plenty of people in the world who don’t have such good intentions. Seriously, we read about big companies getting breached every week.

Quit emailing passwords and other sensitive information. When you send an email it naturally gets copied several times over. And, more people than your intended recipient have access to the computers where that email is replicated. That’s the reason why secure protocols exist.

How Long Has This Been Going On?

There’s no way to know the answer to that question for sure, but I can make a few guesses based on how people choose passwords. I didn’t include the password above, but let’s just say that it ended in 2011, like the year. It’s currently July, 2012. So my guess is that they’ve been using GoToMyPc.com like this for a year or more and they haven’t changed the password.

It’s GoToMyPc, Not LetMeGoToYourPc

I’m fairly certain that the GoToMyPc.com service wasn’t designed to be used this way. There are client/support style remote access services out there that allow tech support personnel to access client computers in the right way (certainly a more secure way).

Other Interesting Bits…

When I logged into their account (per their instructions) there were roughly 20 client computers available for connection, including the Help Desk Administrator that sent me the email! Apparently you can access their clients’ computers as well as several computers within their company network! This isn’t a small company either, which makes this even more scary.

It may be in their best interest to give a shit after all if people can get into their network (probably not though)!

Anyway, I’m done ranting. Even though it probably won’t change their behavior, I’ll let them know that I don’t think this is a good idea. That’s the least I can do.

Jul 17, 2012
1 note
#security #remote access #wtf

Updating Google Analytics Code on Many Static Pages

I’m a relative newb. I have been fortunate enough that when I’ve had to work on sites there’s almost always been some sort of templating system in place. If I needed to make a site-wide code update, I could just make the change in some templated code block and that would push the update throughout the site. It makes sense, and thank bejesus it’s now the norm.

What about when the whole site is static HTML files? If the site is tens of files, no problem. But what do you do when the site is several hundred? Before today, I had a few theories but had never had to test them out. Here’s my story.

Please Read, Very Important

These instructions will not work in every case; they aren’t designed to. This is a hack I put together to work with my very specific use case. I’m sharing it because I think you could adapt my code to your needs. Please be careful and always make backups. Enjoy!

The Goal

I was asked by a couple of friends to review their site and try to help them figure out why Google Analytics was tracking things strangely. Here were a few symptoms:

  1. Event tracking wasn’t registering in the GA account even though everything looked correct in the GET request for __utm.gif.
  2. Keyword and query data wasn’t coming in for any search visits. Not for Paid Search and not for Organic Search.
  3. No social data.

I checked on a few things and saw that they had an old version of GA installed. This is the code.

<script type="text/javascript"> 
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript"> 
var pageTracker = _gat._getTracker("UA-123456-12");
pageTracker._trackPageview();
</script>

I didn’t really see anything wrong with it, but I decided to just try the new code on one page and see if it fixed the problem. I got the following code from their GA profile:

<script type="text/javascript">

  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-1235526-26']);
  _gaq.push(['_setDomainName', 'rfsystemlab.us']);
  _gaq.push(['_trackPageview']);

  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();

</script>

I uploaded the updated page to their server and opened it in a web browser. By looking at the source, I could tell…

  1. I was getting the new page with the updated tracking code.
  2. The new code was working to track pageviews because I could see the GET requests.
  3. After triggering a few events, I could see that the GET requests for those were sent correctly.

I waited a few minutes and checked their GA account again to see that now the Events were correctly showing up in their account! Awesome…

Except not.

The New Goal

Now that I think I know how to fix their tracking problems, I also have to figure out how to…

  1. Remove all the old GA tracking code.
  2. Add in the new code.
  3. Avoid doing it one page at a time.

I could have just stopped at this point and said, “You do it”, but I wouldn’t do that. Besides, this is fun anyway.

My Tools

It wasn’t exactly easy to do this. Through a bunch of Googling, experimentation and a bit of experience with the command line in Linux, I was able to update several hundred pages fairly easily.

I used the following standard command line Linux tools to complete this task:

  • sed
  • cut
  • for loops
  • echo

That’s honestly it… However, I did use many resources to help. Here’s my list:

  1. Stackoverflow - Here, here and here.
  2. DBASpot.com
  3. LinuxQuestions.org

I found each of these by asking Google.

Step 1 - Find all the files containing the old Google Analytics code

First, before you do anything, make a backup and then a backup of that backup. Seriously.
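For what it’s worth, here’s one way to script that backup so there’s no excuse to skip it. It’s a sketch that assumes the site lives in a single local directory; backup_site is a made-up name.

```shell
# Copy a site directory to a timestamped sibling before editing in place.
# Usage: backup_site path/to/site
# (backup_site is a hypothetical helper name. The body runs in a
# subshell so its variables don't leak into your session.)
backup_site() (
    src="${1%/}"                                  # drop any trailing slash
    dest="${src}.bak.$(date +%Y%m%d-%H%M%S)"      # e.g. site.bak.20120702-101500
    cp -a "$src" "$dest" && echo "$dest"          # print where the copy went
)
```

Run it once before each destructive step; the timestamp keeps the copies from overwriting each other.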

I then opened a command line and moved into the root directory for the local copy of the site on my hard drive. Then I ran the following command:

grep -n -r "var gaJsHost" .

This command will search recursively for the string "var gaJsHost" and then output the file name that was matched along with the full text from the line where the match was found. The -n option tells grep to also print out the line number where the match was found. And, the string we’re searching for is one of the main variable names in the old GA code that we need to remove. (Don’t forget that dot at the end of the command. It’s very important!)

From there, we have our list of files we can operate on. By putting the previous command in a Bash for loop, we can manipulate the results and perform additional operations one row at a time. Here’s the basic loop:

# Split grep's output on newlines only, so each match stays in one piece.
IFS=$'\n'
for match in `grep -n -r "var gaJsHost" .`; do

    # do some stuff here...

done

In a nutshell we’ll need to know the following pieces of information for each match:

  • file name.
  • row number where the match is found.
  • row number where the old GA code starts.
  • row number where the old GA code ends.

With that data, we should be able to delete the matching rows as an operation in this loop, for each file. This is the next set of code in this process, combining what we’ve done above with the loop.

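# Split grep's output on newlines only, so each match stays in one piece.
IFS=$'\n'
for match in `grep -n -r "var gaJsHost" .`; do
    file=`echo "$match" | cut -d":" -f1`    # file name from grep
    line=`echo "$match" | cut -d":" -f2`    # line number of the match
    start=$(($line-1))                      # first line of the old GA code
    end=$(($start+7))                       # last line of the old GA code
done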

Each piece of information we need is stored in a separate variable. The file variable is cut from the original grep command output, which is equal to the file name where the match was found. The line variable was also from the grep command and is the line number in the file where the match was made.

We have two remaining variables that are getting values assigned, start and end, which represent the first line in the old GA code and the last line. I know that the string "var gaJsHost" is found in the second line of the old GA code and that the entire code block is eight lines long. So, the variable start can be calculated by subtracting one from the line where the match was found and the end variable can be calculated by adding seven to start!
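Before deleting anything, it’s worth printing the lines that start and end actually point at. Here’s a dry-run sketch that uses the same arithmetic but only reads the files. It assumes, like the rest of this post, that the match sits one line below the start of the old snippet and that the snippet ends seven lines after start; preview_old_ga is a made-up name.

```shell
# Dry run: show the lines that would be deleted, without changing any file.
# Usage: preview_old_ga [directory]   (defaults to the current directory)
# (preview_old_ga is a hypothetical helper name.)
preview_old_ga() {
    # grep prints file:line:text; read splits that on the first two colons.
    grep -n -r "var gaJsHost" "${1:-.}" | while IFS=: read -r file line rest; do
        start=$((line - 1))      # one line above the match
        end=$((start + 7))       # seven lines after start
        echo "== $file: lines $start-$end"
        sed -n "${start},${end}p" "$file"
    done
}
```

If what it prints for a file isn’t exactly the old tracking snippet, that file needs a manual look before you run the delete.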

Now, we just have to use this information to delete those lines. Here’s the command you’d add as the next command within the loop:

sed -i -e "${start},${end}d" "$file"

Then your code would look like this…

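# Split grep's output on newlines only, so each match stays in one piece.
IFS=$'\n'
for match in `grep -n -r "var gaJsHost" .`; do
    file=`echo "$match" | cut -d":" -f1`    # file name from grep
    line=`echo "$match" | cut -d":" -f2`    # line number of the match
    start=$(($line-1))                      # first line of the old GA code
    end=$(($start+7))                       # last line of the old GA code
    sed -i -e "${start},${end}d" "$file"    # delete that range in place
done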

After putting this into a Bash file and running it, I had successfully removed all the old Google Analytics tracking code from all the static HTML files on my local copy of my buddies’ site. We aren’t done. We still need to add all the new code to the pages.
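A quick sanity check at this point doesn’t hurt. This sketch just re-runs the search and reports any files that still contain the old variable name; check_old_ga_removed is a name I made up.

```shell
# List any files that still contain the old snippet; print a short
# confirmation when none are left. Returns non-zero if leftovers exist.
# Usage: check_old_ga_removed [directory]
# (check_old_ga_removed is a hypothetical helper name.)
check_old_ga_removed() {
    # grep exits non-zero when nothing matches, which is the good case here.
    leftovers=$(grep -r -l "var gaJsHost" "${1:-.}" || true)
    if [ -n "$leftovers" ]; then
        echo "Old code still present in:"
        echo "$leftovers"
        return 1
    fi
    echo "No old tracking code found."
}
```

If it lists anything, those files probably had the snippet laid out differently and are worth fixing by hand.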

Step 2 - Adding new code to the head tag

Honestly I’m a bit tired of writing the post at this point. Besides, most of the rest of what needs to happen is done by simple modification of the concepts above. So, here’s the rest of my code:

# set placeholder text before the </head> tag for the code to be replaced with later.

# Split grep's output on newlines only, so each match stays in one piece.
IFS=$'\n'
for match in `grep -n -r "</head>" .`; do
    file=`echo "$match" | cut -d":" -f1`
    line=`echo "$match" | cut -d":" -f2`
    sed -i "${line}s/.*/NEW_GOOGLE_ANALYTICS_CODE_HERE\n<\/head>/" "$file"
done

# add in the new code..

# Note: no leading $ when assigning, and the closing quote stays on the same
# line so sed doesn't see a stray newline in the replacement.
GACODE='<\!-- Start Google Analytics Installation --><script type="text\/javascript">var _gaq=_gaq\|\|\[\]\;_gaq.push\(\["_setAccount"\,"UA-123456-12"\]\)\;_gaq.push\(\["_setDomainName"\,"example.com"\]\)\;_gaq.push\(\["_trackPageview"\]\)\;\(function\(\)\{var a=document.createElement\("script"\)\;a.type="text\/javascript"\;a.async=true\;a.src=\("https\:"==document.location.protocol\?"https\:\/\/ssl"\:"http:\/\/www"\)\+".google-analytics.com\/ga.js"\;var b=document.getElementsByTagName\("script"\)\[0\]\;b.parentNode.insertBefore\(a\,b\)\}\)\(\)<\/script><\!-- End Google Analytics Installation -->'
for match in `grep -n -r "NEW_GOOGLE_ANALYTICS_CODE_HERE" .`; do
    file=`echo "$match" | cut -d":" -f1`
    sed -i 's/NEW_GOOGLE_ANALYTICS_CODE_HERE/'"$GACODE"'/' "$file"
done
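Once that’s run, a last check can confirm every page picked up the new snippet exactly once. This is a sketch under a couple of assumptions: the pages end in .html, and the new code sits on a single line (as it does in the GACODE string above), so counting lines that contain _setAccount is good enough. check_new_ga is a made-up name.

```shell
# Report every .html file where the new snippet does not appear exactly once.
# Prints nothing when all files look right.
# Usage: check_new_ga [directory]
# (check_new_ga is a hypothetical helper name.)
check_new_ga() {
    find "${1:-.}" -type f -name '*.html' | while IFS= read -r file; do
        # grep -c prints the number of matching lines (0 when there are none).
        count=$(grep -c '_setAccount' "$file" || true)
        [ "$count" -eq 1 ] || echo "$file: $count occurrence(s)"
    done
}
```

An empty result means you’re good to upload. Anything it prints either never had a </head> line to hook into or got the code twice.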

I hope this is helpful and not too confusing. I would love to answer any questions you may have. Leave a comment, or contact me on Google+ or LinkedIn.

Happy automation!

Jul 2, 2012
#google analytics #sed #bash