Automate Everything w/ Bash, Linux & Command Line
  1. NexTag.com Keyword Suggest Scraper

    In writing the short script for this post I learned that NexTag.com is the holy grail of keyword suggest results. Most sites limit their suggestions to about 10, whereas NexTag.com returns 100! That’s truly amazing and great for us.

    Let’s cut to the chase. Here’s the script:

    #!/bin/bash
    
    # This script will scrape NexTag.com's keyword suggestion tools.
    
    query=$1
    f_query=$(echo "$query" | sed 's/ /%20/g')
    curl -s "http://www.nextag.com/buyer/opensearch.jsp?suggest=complete&perpage=100&search=$f_query" | sed 's/\[//g;s/\]//g;s/","/\n/g;s/"//g'
    echo
    

    You run it like this:

    bash nextag.com.sh "shoes"
    

    Then enjoy the results!
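
    One caveat: the sed call in the script only encodes spaces, so a query containing characters such as & or ' would reach the site malformed. Below is a minimal sketch of a fuller encoder in pure Bash. The name urlencode is a hypothetical helper (it is not part of the script above), and the sketch assumes plain ASCII input.

```bash
#!/bin/bash

# urlencode (hypothetical helper): percent-encode every character that is not
# an RFC 3986 "unreserved" character. Assumes plain ASCII input.
urlencode() {
    local s=$1 out='' c hex i
    for (( i = 0; i < ${#s}; i++ )); do
        c=${s:i:1}
        case $c in
            [a-zA-Z0-9.~_-]) out+=$c ;;
            *) printf -v hex '%%%02X' "'$c"   # a leading ' makes printf yield the character code
               out+=$hex ;;
        esac
    done
    printf '%s\n' "$out"
}
```

    With that in place, the f_query line could become f_query=$(urlencode "$query") and the rest of the script stays the same.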

    If you have any questions or feedback please leave a comment. Otherwise, happy automation.

     
  2. Shopping.com Keyword Suggest Scraper

    This is a really quick post about another keyword tool. As the title suggests, this is a tool to scrape keyword suggest on Shopping.com. The script is very simple and is just an adaptation of the tools discussed here, here, here and here.

    This is the full script…

    #!/bin/bash
    
    # This script will scrape Shopping.com's keyword suggestion tools.
    
    query=$1
    f_query=$(echo "$query" | sed 's/ /%20/g')
    curl -s "http://www.shopping.com/ajaxSearchAssistant?q=$f_query" | grep "<S>" | cut -d'>' -f2 | cut -d'<' -f1
    echo
    

    Then you run it like this…

    bash shopping.com.sh "keyword"
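
    The grep and cut pipeline in the script relies on every <S> element sitting on its own line. If the response ever arrived as a single line, it would print nothing useful. Here is a hedged sketch of a parser that copes with both layouts. parse_suggestions is a hypothetical helper name, and the sample markup in the comment is made up for illustration; only the <S> element name comes from the script above.

```bash
#!/bin/bash

# parse_suggestions (hypothetical helper): read the response on stdin and
# print the text of every <S> element, one suggestion per line.
parse_suggestions() {
    grep -o '<S>[^<]*</S>' |   # one match per output line, even on a one-line response
        sed 's/<[^>]*>//g'     # strip the surrounding tags
}

# Example with made-up markup:
#   printf '<R><S>red shoes</S><S>running shoes</S></R>\n' | parse_suggestions
```

    In the script it would replace everything after the first pipe: curl -s "..." | parse_suggestions.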
    

    Using niche sites like Shopping.com (or any other shopping engine) is a good idea when doing keyword research specifically for campaigns trying to sell very common consumer products. Shopping engines like Shopping.com can be a very helpful resource.

    Happy automation!

     
  3. How to Automate Shopzilla Cost Reports

    I have a sincere love/hate relationship with reporting. Many, many times I’ve found myself day-dreaming of having a clone whose sole purpose is to put together reporting for me to base my daily business decisions on. Then I realized that I could!

    I’ve already written about and provided several examples of how I automate reporting. So far I’ve written posts on how to automate reporting for the following services:

    1. Google Analytics
    2. Google AdWords
    3. Facebook Likes
    4. Pinterest Pins
    5. Twitter Tweets
    6. Google +1s

    My solution for not having my own reporting clone is to write one using Linux and Bash. Today, I’ll show you how you can pull your cost per click spend reports from your Shopzilla.com account.

    Where to Start?

    That’s a very good question. I’m writing this post because automating reporting for Shopzilla is not easy. Shopzilla doesn’t provide any of the normal tools or methods for downloading and saving reports. Most online advertising channels provide an API (like Google), scheduling via email (not ideal but it works) or delivery via FTP (old school but consistent). Even if the service doesn’t offer the ability to configure one of these options in your account, most times you can send a request for it by opening a ticket or just contacting customer service. Not Shopzilla.com.

    I emailed several times and Shopzilla basically just acted like I was strange for even requesting it. I even provided example use cases and tried to make it seem beneficial for them (me spending more money…) to give me the reporting. But in the end, I gave up on waiting for them to help me.

    Solution = Automated Login + Crawling + Scraping + Trickery!

    I say, if you won’t provide me the appropriate tools then I’ll create them (then share them with anyone who wants them)!

    Let’s get started by first understanding each of the steps that will take place.

    1. Request the log in page.
    2. Emulate exactly what happens when you sign in by sending an HTTP POST with your username, password and any other fields that may be required.
    3. Go to the reporting page.
    4. Request a report.
    5. Search the page for the reporting data we need.
    6. Write the data somewhere.

    To be able to complete the 6 steps above, you’ll need to understand a few concepts that will help with this type of automation.

    1. Most sites don’t want you to run automated scripts against them. Typically there are measures put into place to prevent it. We have to be aware so we can figure out how they work and then get around them.
    2. Anytime you have to authenticate with a site it’s a good idea to store cookies.
    3. Be aware of hidden form fields with default values. Typically those are in place to prevent spam bots and/or malicious hackers from auto-submitting forms. The good news is that most times those field values are static so all you have to do is inspect the HTML source.
    4. It’s good practice to do as much as possible to make your requests feel natural. You don’t want your requests to feel like a bot. Doing things like using a common browser user-agent, passing accurate referrers, saving cookies, pausing between page requests and following browsing paths available to users are all ways to do that.
    5. In many cases, if JavaScript is required to load the page (like if JSON data is being fetched via AJAX) then you won’t be able to use Bash to crawl and scrape the page. Sorry. You can determine this before you start by trying to use the site with JavaScript turned off.
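
    The advice in point 4 can be bundled into one small wrapper so that every request gets the same treatment. This is only a sketch under assumptions: build_request and polite_fetch are hypothetical helper names, the random 2 to 4 second pause is an arbitrary choice, and the cookie file name simply mirrors the one used later in this post.

```bash
#!/bin/bash

agent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1'
jar='cookies.txt'

# build_request (hypothetical helper): fill the global array "request" with a
# curl command that sends a browser user-agent, a referrer and stored cookies.
# Usage: build_request REFERER URL [extra curl options...]
build_request() {
    local referer=$1 url=$2
    shift 2
    request=(curl --silent --user-agent "$agent" --cookie-jar "$jar" --cookie "$jar")
    if [ -n "$referer" ]; then
        request+=(--referer "$referer")
    fi
    request+=("$@" "$url")
}

# polite_fetch (hypothetical helper): run the request, then pause for a random
# 2 to 4 seconds so page loads are not perfectly regular.
polite_fetch() {
    build_request "$@"
    "${request[@]}"
    sleep $(( RANDOM % 3 + 2 ))
}
```

    A call would look like polite_fetch "$referer_url" "$page_url", with any extra curl options (such as -d for a POST) placed after the two URLs.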

    Each of the code snippets below will use this advice and should allow you to fully automate Shopzilla cost reports.

    Before we tackle the steps, let’s set up some variables we’ll use throughout the script.

    uname='your_shopzilla_username'
    passw='your_shopzilla_password'
    agent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1' # a popular user-agent currently
    base='https://merchant.shopzilla.com' # we'll use this so we don't have to type it again and again.
    date=$(date -d yesterday +%F) # This command outputs yesterday's date; we will request yesterday's data.
    cse_report='report_name.tsv' # whatever you want this to be.
    

    Step #1 - Request Shopzilla Log In Page

    curl --user-agent "$agent" --cookie-jar cookies.txt "$base"/index.xpml
    sleep 2
    

    This is a simple cURL request that passes in the user-agent using the variable and saves cookies to the cookie jar. It will output the HTML source code of the requested page to the terminal. You can ignore the output; we don’t need it, although it is a helpful cross-reference to make sure we got the page we wanted.

    Step #2 - Try to Log In to Shopzilla

    It seems obvious that we’ll need to pass our uname and passw. However, the form also sends data for the parameters login_country_code, submit_type.x and submit_type.y. If we don’t provide values for those fields then we risk failing the log in attempt or being detected as a bot. The easiest way to get the typical values is by logging in with a browser and using Firebug or Chrome’s debugging tools to inspect the POST. You’ll see what’s actually being passed as values through the form.

    curl --user-agent "$agent" --referer "$base"/index.xpml --cookie-jar cookies.txt --cookie cookies.txt -d "login_country_code=US&uname=$uname&passw=$passw&submit_type.x=0&submit_type.y=0" "$base"/ssl/login.xpml
    sleep 2
    

    This time we’ll use the --referer option to add a referral URL, send our cookies, POST the required parameters and values, and then sleep for 2 seconds afterwards.
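
    If you would rather not copy hidden field values by hand, they can usually be pulled from the saved login page. The following is a rough sketch, not a general HTML parser: hidden_fields is a hypothetical helper name, it assumes each input tag sits on one line with name before value and double-quoted attributes, and the markup in the comment is invented for illustration rather than taken from Shopzilla.

```bash
#!/bin/bash

# hidden_fields (hypothetical helper): read HTML on stdin and print the hidden
# inputs as name=value pairs joined with "&", ready to prepend to POST data.
# Values are passed through as-is; anything with special characters would
# still need URL encoding.
hidden_fields() {
    grep -o '<input[^>]*type="hidden"[^>]*>' |
        sed -n 's/.*name="\([^"]*\)".*value="\([^"]*\)".*/\1=\2/p' |
        paste -sd '&' -
}

# Example with invented markup:
#   <input type="hidden" name="login_country_code" value="US">
#   <input type="hidden" name="token" value="abc123">
# would print: login_country_code=US&token=abc123
```

    To use it, save the login page from Step #1 to a file instead of letting it print to the terminal, then run hidden_fields < that_file.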

    Step #3 - Request the Shopzilla Reporting Page

    You will only be able to get the reporting page if the previous log in attempt was successful. If your log in attempt failed then you should get a 404.

    curl --user-agent "$agent" --referer "$base"/pp/index.xpml --cookie-jar cookies.txt --cookie cookies.txt "$base"/ra/reporting/index.xpml
    sleep 2
    

    This request just updates the referrer and requests the reporting page. Nothing too complicated with that one.
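
    Since a failed log in shows up as a 404 here, this is a natural place to stop the script early instead of parsing an error page later. A minimal sketch follows, using curl’s --write-out '%{http_code}' and --output options. fetch_or_fail and status_ok are hypothetical helper names, and treating only 200 as success is an assumption you may need to loosen.

```bash
#!/bin/bash

# status_ok: succeed only when the HTTP status code is exactly 200.
status_ok() {
    [ "$1" = 200 ]
}

# fetch_or_fail (hypothetical helper): save the response body to the file named
# in the first argument, pass all remaining arguments to curl, and return
# non-zero unless the final HTTP status is 200.
fetch_or_fail() {
    local out=$1 status
    shift
    status=$(curl --silent --output "$out" --write-out '%{http_code}' "$@")
    status_ok "$status"
}

# Example (same options as the request above, body saved to reporting.html):
#   fetch_or_fail reporting.html --user-agent "$agent" --cookie cookies.txt \
#       "$base"/ra/reporting/index.xpml || { echo "log in failed" >&2; exit 1; }
```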

    Step #4 - Time to Request Your Shopzilla Report

    This example will request a cost report for yesterday. I use the variable set before we got started to assign yesterday’s date. This means that I run this script every day for each of the Shopzilla accounts I manage. This HTTP POST has a ton of different values in it. Again you can see them in your own account by requesting a report and using the debugger to view the data being POSTed.

    curl --user-agent "$agent" --referer "$base"/ra/reporting/index.xpml --cookie-jar cookies.txt --cookie cookies.txt -d "daily_url=%2Fra%2Freporting%2Fcostandperformance%2Findex.xpml&category_url=%2Fra%2Freporting%2Fcategory%2Findex.html&subcategory_url=%2Fra%2Freporting%2Fsubcategory%2Findex.html&product_url=%2Fra%2Freporting%2Fproduct%2Findex.xpml&collapse_tooltip=Click+to+hide+subcategories&expand_tooltip=Click+to+show+subcategories&timePeriod=CUSTOM&cpStartDate_max=`date -d yesterday +%D | sed 's/\//%2F/g'`&cpStartDate_min=02%2F16%2F2011&cpStartDate=`date -d yesterday +%D | sed 's/\//%2F/g'`&cpEndDate_max=`date -d yesterday +%D | sed 's/\//%2F/g'`&cpEndDate_min=02%2F16%2F2011&cpEndDate=`date -d yesterday +%D | sed 's/\//%2F/g'`&level=daily" "$base"/ra/reporting/costandperformance/index.xpml > "$date".report.html
    sleep 2
    

    The only thing different about this command from those before it is that instead of letting the HTML source of the page flood our terminal window, I’ve redirected the output to an HTML file (see > "$date".report.html). I’ve chosen to do this for two reasons.

    1. Shopzilla renders the reporting data in the HTML of the resulting page.
    2. I have the option of storing resulting HTML in a variable and then parsing the data from it, but then I lose the ability to troubleshoot errors later if they come up because the actual data will be long gone. Since this will eventually be run without anyone watching it, it’s a good idea to be able to recreate errors.
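
    The request above calls date four times inside backticks. Computing the encoded date once keeps the command shorter and avoids the unlikely case where the calls straddle midnight. A small sketch, with encode_slashes and report_date_params as hypothetical helper names; the field names are the ones from the POST data above.

```bash
#!/bin/bash

# encode_slashes (hypothetical helper): replace every "/" with "%2F".
encode_slashes() {
    printf '%s\n' "${1//\//%2F}"
}

# report_date_params (hypothetical helper): build the four date fields of the
# POST body from one already formatted date such as 01/15/12.
report_date_params() {
    local d
    d=$(encode_slashes "$1")
    printf 'cpStartDate_max=%s&cpStartDate=%s&cpEndDate_max=%s&cpEndDate=%s\n' \
        "$d" "$d" "$d" "$d"
}

# Example (GNU date, as used elsewhere in this post):
#   dates=$(report_date_params "$(date -d yesterday +%D)")
```

    The result can then be spliced into the -d string alongside the remaining fields, which stay exactly as captured from the browser.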

    Step #5 - Parse the HTML

    I’m looking for the total cost for the day and the average CPC (cost per click). The following two commands get that data and store it in variables cpc and ttl_cost.

    cpc=$(grep -A 4 '<tr class="totals" id="grandTotalRow">' "$date".report.html | tail -n 1 | cut -d'$' -f2 | cut -d'<' -f1)
    ttl_cost=$(grep -A 3 '<tr class="totals" id="grandTotalRow">' "$date".report.html | tail -n 1 | cut -d'$' -f2 | cut -d'<' -f1)
    

    Keep in mind that if Shopzilla changes anything about the HTML on the page that we’re matching against then our parsing will fail and we won’t have the data we need. This is another reason why it’s a good idea to save the report to file so that we can figure out when the page changed and fix our scripts.
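
    One way to notice such a change quickly is to check that the parsed values actually look like dollar amounts before writing them anywhere. The sketch below wraps the same grep, tail and cut pipeline in a function and adds a format check. total_field and is_amount are hypothetical helper names, and the accepted number format (digits, optional commas, optional decimals) is an assumption about how the report prints money.

```bash
#!/bin/bash

# total_field (hypothetical helper): read report HTML on stdin and print the
# dollar amount found N lines below the grand total row.
total_field() {
    grep -A "$1" '<tr class="totals" id="grandTotalRow">' |
        tail -n 1 | cut -d'$' -f2 | cut -d'<' -f1
}

# is_amount (hypothetical helper): succeed only for values such as 0.25 or 1,204.50.
is_amount() {
    [[ $1 =~ ^[0-9][0-9,]*(\.[0-9]+)?$ ]]
}

# Example:
#   cpc=$(total_field 4 < "$date".report.html)
#   is_amount "$cpc" || { echo "could not parse CPC, check the saved HTML" >&2; exit 1; }
```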

    Step #6 - Write the Shopzilla Cost Report Data to a File

    I like tab-separated formatting, so that’s what I’m going with. The \t is equal to a tab character. In my output report, I combine data from all the shopping engines and several sites that I manage. For that reason, you’ll see variables that we haven’t discussed or set in the commands above. That’s ok, just remove them if you don’t want to set them yourself.

    echo -e "$site\t$date\t$engine\t$cpc\t$ttl_cost" >> "$cse_report"
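
    A slightly more defensive variant uses printf, which leaves any backslashes in the data alone, and writes a header line the first time the file is created. This is a sketch only: append_row is a hypothetical helper name and the header column names are my own labels for the five fields above.

```bash
#!/bin/bash

# append_row (hypothetical helper): append one tab-separated row to a report
# file, writing a header line first if the file is missing or empty.
# Usage: append_row FILE FIELD...
append_row() {
    local file=$1
    shift
    [ -s "$file" ] || printf 'site\tdate\tengine\tcpc\tttl_cost\n' > "$file"
    local IFS=$'\t'              # "$*" joins the fields with the first IFS character
    printf '%s\n' "$*" >> "$file"
}

# Example:
#   append_row "$cse_report" "$site" "$date" "$engine" "$cpc" "$ttl_cost"
```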
    

    The Full Shopzilla Reporting Script

    This is a recap, combining each of the steps and commands discussed above.

    #!/bin/bash
    
    uname=''
    passw=''
    agent='Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1'
    base='https://merchant.shopzilla.com'
    date=$(date -d yesterday +%F)
    cse_report=''
    
    # 1. get the login page, make sure to save all set cookies.
    
    curl --user-agent "$agent" --cookie-jar cookies.txt "$base"/index.xpml
    sleep 2
    
    # 2. try to login.
    
    curl --user-agent "$agent" --referer "$base"/index.xpml --cookie-jar cookies.txt --cookie cookies.txt -d "login_country_code=US&uname=$uname&passw=$passw&submit_type.x=0&submit_type.y=0" "$base"/ssl/login.xpml
    sleep 2
    
    # 3. go to the reporting page.
    
    curl --user-agent "$agent" --referer "$base"/pp/index.xpml --cookie-jar cookies.txt --cookie cookies.txt "$base"/ra/reporting/index.xpml
    sleep 2
    
    # 4. request the report.
    
    curl --user-agent "$agent" --referer "$base"/ra/reporting/index.xpml --cookie-jar cookies.txt --cookie cookies.txt -d "daily_url=%2Fra%2Freporting%2Fcostandperformance%2Findex.xpml&category_url=%2Fra%2Freporting%2Fcategory%2Findex.html&subcategory_url=%2Fra%2Freporting%2Fsubcategory%2Findex.html&product_url=%2Fra%2Freporting%2Fproduct%2Findex.xpml&collapse_tooltip=Click+to+hide+subcategories&expand_tooltip=Click+to+show+subcategories&timePeriod=CUSTOM&cpStartDate_max=`date -d yesterday +%D | sed 's/\//%2F/g'`&cpStartDate_min=02%2F16%2F2011&cpStartDate=`date -d yesterday +%D | sed 's/\//%2F/g'`&cpEndDate_max=`date -d yesterday +%D | sed 's/\//%2F/g'`&cpEndDate_min=02%2F16%2F2011&cpEndDate=`date -d yesterday +%D | sed 's/\//%2F/g'`&level=daily" "$base"/ra/reporting/costandperformance/index.xpml > "$date".report.html
    sleep 2
    
    # 5. parse the HTML to get the data needed.
    
    cpc=$(grep -A 4 '<tr class="totals" id="grandTotalRow">' "$date".report.html | tail -n 1 | cut -d'$' -f2 | cut -d'<' -f1)
    ttl_cost=$(grep -A 3 '<tr class="totals" id="grandTotalRow">' "$date".report.html | tail -n 1 | cut -d'$' -f2 | cut -d'<' -f1)
    
    # 6. write out the data...
    
    echo -e "$site\t$date\t$engine\t$cpc\t$ttl_cost" >> "$cse_report"
    

    And that’s it. I’d like to also point out that these same steps and methods can be adapted to fit many situations. Maybe one day Shopzilla will wake up and remember who their customers truly are. Until then, happy automation!

     
  4. A Shopping Engine Click Fraud Story

    First, let me just say that this story is non-fiction. Besides a few specifics like names the emails below are word for word.

    I’ve personally managed cost per click advertising of various types since January of 2007. I realize that 5+ years isn’t a humongous amount of time and I’m not quite the “grey beard” I sometimes feel like. With that said, I’ve seen enough to know when something just isn’t quite right in the numbers. It starts with a suspicion that is either validated or debunked with PROOF.

    This click fraud story started when I discovered that one or two products in one of my shopping engine feeds were getting a larger than normal amount of traffic and had a lower than normal conversion rate. I use Google Analytics. I track my shopping engine feeds by manually tagging the landing page URLs with the product name as the campaign and the product SKU as the content. I’ve done it this way for years and it’s always worked out well for me.

    I used this data to track, in Google Analytics, the performance trends back for the past several months. For the products in question, traffic data in Google Analytics has been steady and predictable and so has the poor conversion rate. Having made no major changes to my feed data during that time, I thought it would be important to make sure that the data was all correct.

    I checked my data feeds to make sure all the landing page URLs, prices, titles, images, descriptions, etc. were correct (they were). After that, I wanted to dive in and see how my return on investment was doing for those low converting products. Let me just say, ROI wasn’t even break-even.

    That’s when I noticed something startling. Not only were the metrics horrible when looking at conversion rate per Google Analytics’ numbers, the shopping engine was charging me for many more clicks than I was receiving. I’m not talking about 10% or 15% more, I’m talking about 200%.

    To recap - I’m being charged for two times more clicks than visits. Conversion rate for the products in question is less than 1%, whereas the average for all other products is over 7%. This looked fishy in Google Analytics but after looking at the cost report and data reported by this particular shopping engine it looks a lot like click fraud.
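
    This kind of comparison is easy to automate once billed clicks and analytics visits sit side by side per product. A minimal sketch, assuming a tab-separated input of SKU, billed clicks and visits that you would have to assemble yourself from the two reports; flag_discrepancies is a hypothetical helper name and the numbers in the comment are made up.

```bash
#!/bin/bash

# flag_discrepancies (hypothetical helper): read "sku<TAB>billed_clicks<TAB>visits"
# rows on stdin and print each SKU whose billed clicks, expressed as a
# percentage of visits, exceed the given limit.
flag_discrepancies() {
    awk -F'\t' -v limit="$1" '
        $3 > 0 {
            pct = $2 / $3 * 100
            if (pct > limit) printf "%s\t%.0f%%\n", $1, pct
        }'
}

# Example with made-up numbers, flagging anything billed above 115% of visits:
#   printf 'SKU-A\t400\t200\nSKU-B\t105\t100\n' | flag_discrepancies 115
```

    A limit of 115 tolerates the 10% or 15% gap that differing analytics tools routinely produce, while a product billed at double its visits stands out immediately.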

    I decided to submit a ticket to get more information. If you aren’t aware, most of the big cost per click advertising programs make monthly adjustments to account for click fraud and other billing mistakes. This shopping engine doesn’t. While I’m not going to say which engine it is, it just happens to be one of the biggest.

    My Click Fraud Ticket

    This is the original ticket that I submitted:

    This problem is relevant to my two most popular products.
    
    It appears that I'm being charged for over twice as many clicks as what's being delivered to my site. I tag all of my product links using Google Analytics campaign tracking. 
    
    This does not affect the majority of our products, just the top two. How can we reconcile charges to represent actual traffic sent to my site? If this isn't possible then I may have to remove those items from my submitted products because they aren't profitable.
    
    Thanks in advance.
    

    This is the first response (edited to remove the customer service person’s name and company name):

    Dear Adam, 
    
    Thank you for your message. Please keep in mind that Shopping Engine XYZ takes click fraud very seriously and thus adheres to the strict IAB guidelines regarding click spam and bots. By following these rules, our search engine and framework system is built to make click fraud on the site virtually impossible. 
    
    With that said, Google analytics results differ greatly from other analytics programs. The parameters and settings that each program uses are proprietary and therefore employ different parameters and cookies. This discrepancy is so widespread and prevalent that Google in fact released their own documentation explaining why their reports will differ from others: 
    
    http://www.google.com/support/googleanalytics/bin/answer.py?answer=55614&topic=11018 
    
    In addition, we find that Google's analytics do not always reflect the clicks/traffic that is coming from our affiliate network. We display your product offers on Shopping Engine XYZ, Shopping Engine XYZ, in addition to a number of affiliate sites. The affiliate network is regularly monitored very closely to ensure that quality traffic is being driven to our merchant partners.
    
    Also, in the event that someone is just repeatedly clicking on one product over and over, Shopping Engine XYZ does not charge merchants for clicks from any user who clicks on them more than once in any session. Subsequently, in order for anyone, or any bot program, to attempt click fraud, that person, people, or program would have to time out their session, erase every protocol on their computer, find and erase all of their cookies, reset the log on their IP address, scrub their hard drive and computer drivers, and basically find and erase all of the markers that our engine uses to identify and prevent click fraud. All for the ability to register, at the maximum, two more clicks onto a site. 
    
    If you find that the traffic for two of your listings is not profitable for your store, it might be best to remove these items from your feed or to zero bid  them. This will help minimize CPC costs. 
    
    I would also strongly recommend that you install our free ROI performance tracker. While reviewing your feed, I noticed that you did not have the tracker installed. This tool will provide valuable performance data to help you further optimize your account/improve your performance. You will receive detailed performance data in your cost & performance reports.The data gathered by our ROI tracker will help you adjust your bidding to maximize your campaign.
    
    It is very easy to implement and setup should only take 15 minutes. To get started, please log into your account and click the 'Manage Listings' tab. Then, click on 'Performance Tracker'. 
    
    If you have further questions, please do not hesitate to reach out!
    
    Regards, 
    
    Customer Service Rep
    
    Shopping Engine XYZ
    

    This is not the first time I’ve had to fight click fraud battles. Back when Yahoo used to run its own Paid Search offering, I used to submit click fraud investigation requests on a monthly basis for many, many clients. Most of the time I won. The fact is that most advertisers don’t pay enough attention to their data to notice issues like this. And then even when they do, they’re not confident enough to argue their case. I can proudly claim responsibility for several instances where as much as 40% of monthly click spend was refunded from Yahoo to my client due to click fraud on Yahoo’s content network. Over the years, the industry as a whole has improved so much (in my experience). That is why I was so surprised by this particular response.

    There’s so much that’s wrong with the response I received from them. Here was my response to them:

    I understand that Google Analytics (or any other 3rd party analytics provider) data will not match XYZ Shopping Engine's data exactly. However, I'm not talking about small discrepancies that I see with every product. I'm talking about one or two of my top trafficked products that have HUGE differences, where all the other products are within a few clicks. By a huge difference, I mean that you're charging me for double the amount of clicks for which I can account for.
    
    Also, I call BS on several of your claims about click fraud. The most ridiculous claim of them all, "scrub their hard drive and computer drivers". Seriously? What does tracking website usage behavior have to do with drivers installed on a computer? If XYZ Shopping Engine is serious then I'd like to know more about about your official click fraud prevention policies and what 'markers' are used for detection. Frankly, the explanation below is insulting.
    
    I hope to hear back soon.
    

    Seriously? What the fuck. Do they really expect me to believe that a click fraudster would have to “scrub their hard drive and computer drivers, and basically find and erase all of the markers that our engine uses”? You don’t have to try that hard to remove viruses from a hard drive. Are they really saying that they install unauthorized software and/or drivers on the machines of their users to prevent click fraud?

    No, they hope the claims made in this email will sound impressive. They expect that you’ll have no idea what they’re talking about and accept these claims as fact. But, in fact, it’s bullshit. There’s no flipp’n way they’re installing “drivers” as markers to detect multiple clicks in one session.

    The fact is that they are in the business of charging for clicks, the more the better. They sure as hell aren’t going to break their back trying to prevent click fraud. Apparently they’re trying harder to prevent giving due credits to their paying customers than they are to prevent click fraud in the first place.

    In the initial response, they claim to adhere to IAB guidelines. I’m starting to doubt it. In fact, if this company really cared about preventing click fraud then it’s likely I would see their name on the list of companies who participated in creating the IAB guidelines. Unfortunately they didn’t and I’ve been managing accounts on their platform since before 2009.

    I’m not giving up. I plan to post updates as the story continues to write itself. If anyone has advice please let me know!