Referer Spam from Live Search

Occasionally something a bit 'odd' shows up in our logfiles that we can't identify or find a satisfactory explanation for on the web. Here's the latest, which seems to originate from inside Redmond (our old friend Microsoft Corporation).

Description of the event

This is a strange pattern - it looks like log or referer spam, except that it comes from the Microsoft corporate network. Similar logfile entries have been reported on WebmasterWorld, but apart from a cryptic message from msndude describing it as 'part of a quality check we run on selected pages', no one has come up with a sensible explanation. This is how it all began:
The agent appears to be a normal Internet Explorer web browser - it downloads CSS, JavaScript and even MP3 files that are included from the webpage - but with image loading disabled. The referer strings, however, are not valid searches - at least not on the public version of Live Search. They seem to be targeting 'spammy' keywords, but the websites in question (and each table row is a separate website) don't mention those keywords and are not even in related industries. The same pattern goes back at least a couple of weeks and probably longer. Anyone with a theory is welcome to get in touch. It's put a bee in my bonnet because our search engine traffic reports are showing these 'spammy' keywords to our clients.
Update: This pattern has changed since this article was written. The referrals have changed from LVSP to LIVSOP and the keywords are no longer so offensive or spammy. More on this, and a means of blocking these referrals, below.

Checking your log files

To display logfile entries of this type you can use the command:
grep LVSP combined_log
And the following awk command will show you the keywords being passed in the HTTP Referer string:
awk -F\" '($4 ~ /LVSP$/){print $4}' combined_log | awk -F[=\&] '{print $2}'
Just replace combined_log with a reference to one or more combined log files on your server.

More hijinks from Live Search LIVSOP

Thankfully the stream of inappropriate search terms from Microsoft's network seems to have stopped for now. They have, however, been replaced by an almost identical series of 'fake' search referrals flagged as 'LIVSOP', which obviously relates to 'LVSP' in some way. The new referrals are coming from a range of IP addresses inside Microsoft Corporation. In the last 30 hours our server has received 180 requests from this source, from 85 IP addresses in the block 65.55.165.0/25 (65.55.165.0 - 65.55.165.127). The user agent in each case is:
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)
You can find these entries in your log file using:
awk -F\" '($4 ~ /LIVSOP$/){print $4}' combined_log | awk -F[=\&] '{print $2}'
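For anyone who would rather do this check in a script, here is a minimal Python sketch of the same idea as the awk commands above. It assumes the standard Apache combined log format and a referer of the form `...results.aspx?q=keyword&FORM=LIVSOP`. The sample lines, function names and the regular expression are illustrative and not taken from our server.

```python
import ipaddress
import re
from urllib.parse import parse_qs, urlsplit

# Apache combined log format: IP, identity, user, [date], "request",
# status, bytes, "referer", "user agent".
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Made-up sample entries, for illustration only.
SAMPLE = [
    '65.55.165.12 - - [10/Jun/2008:10:00:00 +0000] "GET /page.html HTTP/1.1" '
    '200 1234 "http://search.live.com/results.aspx?q=cookies&FORM=LIVSOP" '
    '"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)"',
    '203.0.113.7 - - [10/Jun/2008:10:00:05 +0000] "GET /page.html HTTP/1.1" '
    '200 1234 "http://www.google.com/search?q=css+border" "Mozilla/5.0"',
]


def fake_search_terms(lines, form_code="LIVSOP"):
    """Yield (ip, keyword) for referers whose FORM parameter equals form_code."""
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in combined log format
        query = parse_qs(urlsplit(match.group("referer")).query)
        # Lower-case the parameter names so FORM= and form= are both found.
        params = {name.lower(): values for name, values in query.items()}
        if form_code in params.get("form", []):
            for term in params.get("q", []):
                yield match.group("ip"), term


def in_block(ip, block="65.55.165.0/25"):
    """Return True if ip lies inside the given CIDR block."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(block)
```

To run it against a real file, pass the open file instead of SAMPLE, for example `fake_search_terms(open("combined_log"))`, and use `form_code="LVSP"` for the earlier pattern.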
In each case the search term referred to is a single word, one that would never bring up that particular website or web page, but that does seem to always appear in the TITLE tag of the target page. And there are plenty of (rightly, in my opinion) pissed off webmasters, as you can see from the links below.

Given that log files are the most accurate record of the performance of a website, it's difficult to see how Microsoft can justify inserting fake search referrals. The excuse put forward by msndude is that this is some kind of 'quality check', but surely they could do this without passing a referrer that is so similar to 'real' search referrals that it pollutes web traffic reports and gives a false impression that Live Search is being used to find a website.

On this website (The Art of Web) we have had over 12,000 search referrals from Google in the last month, compared to just 110 from Live Search. Looking deeper, however, shows that only about 20% (one in five!) of the referrals from Live Search are real. In other words they are sending us four times as many fake referrals as real ones!!

Here are the kind of search terms we're talking about:
agents, array, border, browser, browsers, button, class, client, codes, collapse, colours, combined, command, content, cookies, credit, definition, download, email, example, examples, function, green, input, javascript, media, number, october, parent, password, random, report, request, rewrite, robots, search, september, server, shell, startdate, system, using, validator, value, warning, world

If you do want to block these referrals without blocking all traffic from the same network, you can use mod_rewrite as follows:
RewriteCond %{REMOTE_ADDR} ^65\.55\.165
RewriteCond %{HTTP_REFERER} FORM=LIVSOP$
RewriteRule .* - [F]
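Translation: if the request comes from an IP address beginning with 65.55.165 and its referer ends with FORM=LIVSOP, refuse it with a 403 Forbidden response.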
To use this code you will need to be able to edit the httpd.conf or .htaccess file for your website and have mod_rewrite enabled. This will not stop the referrals from showing up in your log files (they will appear as 403 - Forbidden), but it will prevent the loading of related files (CSS, JavaScript, MP3, etc.). That solves the problem of having your AdSense statistics spoiled by this agent, which is a common complaint from webmasters.

More Pharmaceutical Referrals

It seems that I spoke too soon regarding the ending of sexually explicit and pharmaceutical search terms from Live Search. There are still a few coming from a different IP block - namely the addresses 131.107.0.95 and 131.107.0.96. From those addresses (also inside Microsoft Corporation) we're seeing search terms including:
adult, codeine, diflucan, nextel, nokia, phendimetrazine, sex

So if you want to block them as well, the modified code is as follows:
RewriteCond %{REMOTE_ADDR} ^65\.55\.165 [OR]
RewriteCond %{REMOTE_ADDR} ^131\.107\.0\.9[56]
RewriteCond %{HTTP_REFERER} FORM=LIVSOP$
RewriteRule .* - [F]
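Translation: if the request comes from an IP address beginning with 65.55.165, or from 131.107.0.95 or 131.107.0.96, and its referer ends with FORM=LIVSOP, refuse it with a 403 Forbidden response.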
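The way the conditions combine can be easy to misread, so here is a rough Python sketch of the logic as I understand mod_rewrite to apply it: the conditions joined by [OR] form one group, and that group is then ANDed with the referer condition. The function name and the test values are made up for illustration. This is not how Apache is implemented, just a way to try addresses and referers against the same patterns.

```python
import re

# The same patterns as the RewriteCond lines above. Like mod_rewrite, we use
# an unanchored search, so only the explicit ^ and $ anchors restrict the match.
ADDRESS_PATTERNS = [
    re.compile(r"^65\.55\.165"),
    re.compile(r"^131\.107\.0\.9[56]"),
]
REFERER_PATTERN = re.compile(r"FORM=LIVSOP$")


def is_forbidden(remote_addr, referer):
    """Return True if the rule set above would answer with 403 Forbidden."""
    # [OR] group: any one of the address patterns is enough.
    address_matches = any(p.search(remote_addr) for p in ADDRESS_PATTERNS)
    # Implicit AND with the referer condition.
    return address_matches and REFERER_PATTERN.search(referer) is not None
```

A request from 131.107.0.96 with a referer ending in FORM=LIVSOP is refused, while the same referer from 131.107.0.97 is let through, as is any request from the listed addresses with a different referer.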
New IPs to Block

Today the LIVSOP hits have started coming from a new IP block, so let's update the filter:
# block spurious referrals from microsoft LIVSOP
RewriteCond %{REMOTE_ADDR} ^65\.55\.165 [OR]
RewriteCond %{REMOTE_ADDR} ^65\.55\.232 [OR]
RewriteCond %{REMOTE_ADDR} ^131\.107\.0\.9[56]
RewriteCond %{HTTP_REFERER} FORM=LIVSOP$
RewriteRule .* - [F]
As much as I'd like our sites to appear in the search results for single generic keywords, it's really not feasible. Here are just some of the search terms for which we're seeing dud referrals from Microsoft:
about, achievement, amsterdam, apology, argenton, august, australia, backpacker, bridgewater, canberra, chicken, corowa, council, courtesy, darwin, detention, einasleigh, emerald, facials, fields, foster, functions, geographic, government, guantanamo, hicks, hotel, information, judicial, justice, kylie, ladies, legal, melbourne, military, motel, north, northern, nullarbor, photo, prahran, railway, region, restaurant, rugby, search, semester, sentencing, simulation, society, south, spatial, submissions, sydney, systems, taxation, terrorism, title, wollongong, zealand

What do Microsoft have to say about this? "We have now optimized the tool to use only keywords that are relevant to your website"

What a joke. How is the word 'about' or 'region' relevant to any website?!? Sure they're in a tight spot playing catch-up with Google, but that's no excuse for spamming our websites. Follow the links below for more in-depth reporting of the problem, including (finally) some response from Microsoft.

A rose by any other name ... QBHP

They just don't give up, do they? It's June 2008 and suddenly we're getting Microsoft referer spam using the code QBHP instead of LIVSOP. Maybe too many people were blocking the old name? They're using a new range of IP addresses:
AND they've discovered lower-case, so FORM is now form:
http://search.live.com/results.aspx?q=search&form=QBHP
So here are the new mod_rewrite instructions to block them:
# block spurious referrals from microsoft LIVSOP or QBHP
RewriteCond %{REMOTE_ADDR} ^65\.55\.(109|110|165|232) [OR]
RewriteCond %{REMOTE_ADDR} ^131\.107\.0\.9[56]
RewriteCond %{HTTP_REFERER} FORM=(LIVSOP|QBHP)$ [NC]
RewriteRule .* - [F]
Note: The [NC] after the final RewriteCond indicates that the match is case-insensitive.

There's a vague reference to QBHP on the Microsoft Privacy website here. Something about being able to do searches and go to a site without passing the search string. WTF?!? Is that in case you searched for your own credit card number or some other private/personal information?!? Or because you're too stupid to copy and paste or re-type the link yourself? Sheesh.

Dear Microsoft, just because you can't build an operating system that protects your users, and can't assume that they have even basic common sense, please don't mess with the web developers and webmasters who rely on log analysis every day to build good websites. Just don't!

They just won't stop!

I guess I shouldn't be surprised that Microsoft have now introduced yet another format for their log spamming user agent, but I am. This time they've stripped off all the extra parameters and are just passing the query string - to make it hard to identify, I guess.
This traffic started on 13 August 2008 and has been seen on our server coming from the following IP addresses:
Again, this can only be classified as log spam and it's unbelievable that an organisation as large as Microsoft would be so stupid as to think it's ok to spam the millions of websites in their index - on whatever pretext.

The robot follows msnbot as before, but uses the user agent Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322). It 'pretends' to have done a search to find your website and also downloads .js and .css files - including those from external sites and of course without caching. It doesn't appear to check robots.txt on those sites.

We spotted it after search traffic from Live to one of our sites shot up again from a typical 1-2 search referrals per day to closer to 20. Why do you think they want to inflate your search numbers from Live Search?

I can already hear you asking: how do we block them now?
# more spurious referrals from microsoft
RewriteCond %{REMOTE_ADDR} ^65\.55\.232
RewriteCond %{HTTP_REFERER} search\.live\.com
RewriteCond %{HTTP_REFERER} !\&
RewriteRule .* - [F]
At the time of writing the 'fake' traffic in this format has slowed down and we haven't had any hits for a couple of hours, but we're monitoring the logs to be sure... Yep, they're still at it, but now being blocked (403) by our new rewrite rule.

Update: These hits are now coming from other IP blocks, so the blocking script needs to be changed as follows:
# more spurious referrals from microsoft
RewriteCond %{REMOTE_ADDR} ^65\.55\.(109|110|165|232)
RewriteCond %{HTTP_REFERER} search\.live\.com
RewriteCond %{HTTP_REFERER} !\&
RewriteRule .* - [F]
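If you want to see how much of your Live Search traffic this rule would catch before you switch it on, something like the following Python sketch can help. It mirrors the three conditions above: the address prefix, a referer containing search.live.com, and no '&' anywhere in the referer. The function names, the sample hits and the extra `mkt` parameter are invented for illustration.

```python
import re
from collections import Counter

# Same address prefixes as the RewriteCond above.
BLOCKED_ADDRESS = re.compile(r"^65\.55\.(109|110|165|232)")


def classify(remote_addr, referer):
    """Label one hit as 'blocked', 'allowed' or 'other'."""
    if "search.live.com" not in referer:
        return "other"  # not a Live Search referral at all
    # The negated condition !\& means: the referer contains no ampersand,
    # i.e. only the bare query string was passed.
    if BLOCKED_ADDRESS.search(remote_addr) and "&" not in referer:
        return "blocked"
    return "allowed"


def tally(hits):
    """Count the labels for an iterable of (remote_addr, referer) pairs."""
    return Counter(classify(ip, referer) for ip, referer in hits)


# Made-up hits for illustration only.
HITS = [
    ("65.55.232.10", "http://search.live.com/results.aspx?q=search"),
    ("203.0.113.5", "http://search.live.com/results.aspx?q=css+border&mkt=en-US"),
    ("203.0.113.9", "http://www.google.com/search?q=css"),
]
```

Calling `tally(HITS)` on the sample data gives one hit in each category. Note that a genuine visitor whose referer has no extra parameters is only refused if they also come from one of the listed address blocks.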
Update 10/2008: The spamming campaign has been expanded yet again and now includes addresses in the range 65.55.107.* in addition to those listed above.
Feedback and Questions

2008-06-11: Jeff Walker says: I just wanted to thank you for your comprehensive documentation of this issue. I too have been plagued by these Microsoft bots across several websites; they are particularly annoying on the e-commerce websites.

That's the same pattern we're seeing on all our sites. First "msnbot/1.1" makes a request, with no referer string, and apparently with a valid If-Modified-Since request-header as they receive a 304 response. That's followed by another request with a fake referer (QBHP) and the user agent "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322)", which we block (403) using the above rewrite rule.

2008-06-17: John K says: I hate MS just as much as the rest but give them credit. They appear to be looking for a better way to search and rank web content (one that is different from Google's). If that messes up our "canned" web stat programs, then we have to change with the times. To stay on top, we as web page content providers and web admins have to adjust. Way back when Google sent out bots to understand the web, if we all blocked them then we would be out in the cold now. Re-write your stats software, we just did; or risk loss of traffic in the future. Monitor, adjust, repeat; that’s the art of the web.

Sorry, but I can't give them credit for what they haven't done. And I don't think sending referer spam, which is what they're doing, is going to get them any closer to having a half-decent search engine.

2009-05-20: Penguin Pete says: Most excellent write-up! I run a Linux-centric site anyway, so when I was dealing with this, I just plain blocked all traffic with a live.com in it and I'm done in one step.
© Copyright 2009 Chirp Internet
- Page Last Modified: 29 October 2008