Biting the Hand That Feeds

On July 26, 2004, something nobody thought could happen, happened: Google went down. Or so it seemed to those confronted with the lackluster "Service error -27" message displayed on Google’s website early that day (US time).

Reports started flying across the globe by 15:30 GMT, as web users noticed that their searches were returning an error message instead of the expected results from Google, with Yahoo, AltaVista and Lycos affected by a similar problem. Throughout the day, it emerged that rather than being hacked or suffering other internal problems, the search engines were all being attacked by a new strain of the MyDoom worm, which was automatically using them to locate new targets.

Experts say that the new version of MyDoom, propagating via email, differed from previous versions "because it uses the search engines to verify and locate additional e-mail domains to infect" (Taylor, quoted in eWeek). This process of repeatedly querying the search engines to locate new email addresses is believed to have caused the availability problems reported across the world. An alert email sent to the MessageLabs email security list at 21:13 GMT provided more detail on the operation and background of the worm, explaining how it locates domain names on infected machines and then searches the Internet for more variants of that domain.

The attack came at the worst possible time for Google, hitting on the same day the company released the highly anticipated financial details of its pending IPO. After announcing that it expected to float for up to $3.8 billion, Google was struck down in the widest-ranging outage it has had in recent times. Despite the timing, the temporary outages don’t appear to have affected the generally positive market sentiment towards the Google IPO. Existing skeptics do not appear to have seized on the attack as a major issue, although they continue to suggest that Google’s growth is in doubt, that its opening price is too high, and that instant-millionaire employees will lead to increased staff turnover in the near future.

Meanwhile, security experts were busy warning that a second phase of the worm’s infection was beginning to emerge. "The new attack uses MyDoom-infected systems to launch a denial-of-service attack against Microsoft’s Web site, says Ken Dunham, director of malicious code at security firm iDefense [sic] Inc." The attack is launched through a companion program called "Zindos", which resides on previously-infected computers and starts bombarding Microsoft’s site with requests.

Interestingly, publicity surrounding these events appeared to focus more on the fact that major search engines were being attacked than on the worm itself or its methods of propagation. This shift in attention raises questions about the future direction of viruses, the vulnerability of search engines and the continued problem of email-borne viruses in general.

Clicking All The Way to the Bank

One of the most popular forms of online advertising is in danger of becoming so plagued with fraud as to be useless to most companies. Advertising a website through targeted listings against specific keywords in a search engine is not a new process, and neither is the fraud which eats up to 20% of the clicks paid for by advertisers. Under this model, advertisers pay a fee to the search engine every time someone clicks on their advertisement, which normally appears alongside or above related search results. This fee may range from a few cents up to several dollars in certain market verticals.

Click fraud refers to ‘fake’ clicks on these paid listings, each of which costs the advertiser a small amount. The fraud takes place in a number of ways, the most common of which is to target the listings of competitors and click their advertisements, costing them money. Competitors are said to be paying either groups of people or automated software robots (bots) to repeatedly click on the listings, costing the advertiser larger and larger amounts. All of the major search engines which accept paid listings say that they are aware of the issue, take it very seriously, and are aggressively pursuing it, but they also concede that it still occurs. A recent article in The Times of India even outlined an entire industry in India built around employing housewives, college students and some business people who get paid to click on online advertisements.
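Ad networks don’t publish their fraud filters, but a minimal sketch of the kind of heuristic they might apply is easy to imagine: flag any source that clicks the same listing suspiciously often within a short time window. The function, field names and thresholds below are illustrative assumptions, not how Google or Overture actually detect fraud.

```python
from collections import defaultdict

# Illustrative thresholds: more than MAX_CLICKS from one source
# on one listing inside WINDOW seconds is treated as suspicious.
MAX_CLICKS = 5
WINDOW = 60  # seconds

def flag_suspicious(clicks):
    """clicks: iterable of (timestamp, source_ip, listing_id) tuples,
    each source's clicks in time order. Returns the set of
    (source_ip, listing_id) pairs that exceed the threshold."""
    recent = defaultdict(list)  # (ip, listing) -> timestamps in window
    suspicious = set()
    for ts, ip, listing in clicks:
        key = (ip, listing)
        # keep only this source's timestamps inside the sliding window
        recent[key] = [t for t in recent[key] if ts - t < WINDOW] + [ts]
        if len(recent[key]) > MAX_CLICKS:
            suspicious.add(key)
    return suspicious

# Seven clicks in seven seconds from one IP trips the filter;
# two well-spaced clicks from another IP do not.
clicks = [(i, "1.2.3.4", "ad1") for i in range(7)]
clicks += [(0, "5.6.7.8", "ad1"), (10, "5.6.7.8", "ad1")]
print(flag_suspicious(clicks))  # -> {('1.2.3.4', 'ad1')}
```

Real systems would, of course, have to weigh many more signals than raw click rate, since a patient fraudster (or a distributed bot network) can stay under any single fixed threshold.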

With services such as FindWhat results and Google AdSense advertisements able to be included directly on other websites, another form of fraud has developed. In this system, website owners are paid a portion of the value of each click which originates from their website. Obviously there is an incentive for the website owner to generate more clicks through the advertisements on their site, which leads some of them to act fraudulently to inflate the figures, and their paycheck.

The first publicized case of click fraud was recorded in 2001 by Jessie Stricchiola, President of Alchemist Media, LLC, a company which specializes in search engine marketing, pay-per-click positioning and search engine optimization. Three years later, the advertiser-paid search results business is worth around $3 billion, and the problem of click fraud is only getting bigger. Google has even acknowledged, in the SEC filing leading up to its initial public offering, that click fraud and other fraudulent behavior are a significant risk to its continued operation and profitability. "If we are unable to stop this fraudulent activity, these refunds may increase," Google said. "If we find new evidence of past fraudulent clicks we may have to issue refunds retroactively of amounts previously paid to our Google Network members." (as quoted in WebProNews)

Although only directly affecting paid listings such as those in the Google AdWords/AdSense programs and Yahoo’s Overture service, the impact of click fraud may become a major issue for everyone. With most search engines deriving their primary income from paid listings, if click fraud turns advertisers off, search engines may soon find themselves without the revenue required to support the expensive business of indexing the Web and handling millions of searches a day.

Dense Approaches to SEO

Alongside Google’s PageRank, keyword density is one of the most touted metrics in Search Engine Optimization (SEO) and Search Engine Marketing. Christopher Herg of TheSiteWizard calls it "[o]ne of the simplest ways to improve your site’s placement in the search engine results". Keyword density refers to the number of times a specific keyword appears on a webpage relative to the total number of words. Search Engine Marketers believe that a higher keyword density for carefully-selected keywords will result in a web page being listed in a higher position for those keywords. This belief leads to the practice of ‘keyword spamming’, whereby a webmaster places a keyword on a page more often than would otherwise be necessary. Some webmasters go as far as to create entire pages which are nothing but a collection of keywords (doorway pages), in an attempt to draw the attention of search engine spiders. Google considers this problem so important that it specifically noted it as a "Risk Factor" in its SEC filing as part of the lead-up to its imminent initial public offering (IPO) and listing on the NASDAQ.
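As commonly defined, keyword density is simply the number of occurrences of a term divided by the page’s total word count. A rough sketch of the calculation (the tokenization here is deliberately naive; real search engines certainly do far more):

```python
import re

def keyword_density(text, keyword):
    """Fraction of the words in `text` that match `keyword`
    (case-insensitive, simple punctuation-stripping tokenization)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)

# A keyword-stuffed page scores far higher than natural prose would:
page = "cheap flights cheap hotels cheap deals on cheap travel"
print(round(keyword_density(page, "cheap"), 2))  # 4 of 9 words -> 0.44
```

A density of 44% is the kind of figure only a doorway page produces; the open question, as discussed below, is where a search engine should draw the line.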

Wikipedia contains an extensive entry on keyword spamming (spamdexing) which includes details of some of the approaches taken by webmasters to affect their keyword density (hidden or invisible text, meta tag stuffing, hidden links etc.) and discusses some related problems and implications of the practice. The entry even includes "[h]idden text and keywords" under a list of "[o]bjectives to avoid" when "optimizing your website for search engines." Keyword spamming is hardly a new problem for search engines, and yet it continues to be an issue when presenting fairly-weighted results for a user’s search.

Consider for a moment that you are a search engine spider: a piece of software sent out to scour the Web for new content and report back to your search engine on what you find, and how it should be ranked against certain keywords. Without reading and understanding the content of a page, how would you be able to tell that it shouldn’t have a certain keyword listed as many times as it does? Assuming you determined that there were ‘too many occurrences’ of a keyword, what would the punishment be? What if the author’s writing style simply called for the use of a word a number of times? Without being able to somehow determine that a page has used a keyword ‘too many times’, a search engine is likely to assume that the page is heavily related to that term, and thus rank it well for searches containing that keyword.

Evolving search ranking algorithms are getting better at detecting and punishing keyword spamming, but there will always be some acceptable level of keyword density, and no doubt people will figure out what it is. Webmasters eager for traffic will make sure that their pages sit right on that line, maximizing the benefit while avoiding punishment. This may just be a permanent part of the search engine scene, whether we like it or not.

Farming For A Better Listing

You’ve probably seen them before: mysterious "Links" pages, or small text links at the bottom or side of a page with curious collections of words used to link to another website. You may have wondered at the time what purpose they really served, or you may not have given them another thought; after all, they’re not part of what you were looking for, so why pay them any attention? If you were a search engine spider, however (the piece of software which ‘crawls’ the web on behalf of search engines, reading web pages and indexing their content), you wouldn’t realize that those links were any different from other links on the page. You would follow them, reporting back to your search engine that they were just as important as every other link on the page.

This is what some search engine spammers and ‘link exchangers’ or ‘reciprocal linkers’ rely on to promote their websites within search engines. Part of the ranking algorithm which decides where a website appears in a Google results page (and possibly those of other search engines) relies on the links between websites to determine which sites are more important or relevant to particular keywords. To do this, links coming in to a page are analyzed and weighted based on where they come from and what text the link contains. If, for example, a website had multiple incoming links using the words "online corporate training", then that site’s ranking for those keywords would likely be increased. If the sites providing the links were themselves ranked highly in a related field (say "business training"), then the ranking of the first site might be boosted even further.
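Google’s actual algorithm is unpublished, but the mechanism just described can be sketched in miniature: each incoming link whose anchor text mentions the keyword contributes its source page’s weight to the target’s score for that keyword. The function name, the weights and the example links below are all illustrative assumptions.

```python
def anchor_text_score(incoming_links, keyword):
    """incoming_links: list of (anchor_text, source_weight) pairs.
    A link counts towards the score only if its anchor text
    mentions the keyword; better-ranked sources contribute more."""
    kw = keyword.lower()
    return sum(weight for anchor, weight in incoming_links
               if kw in anchor.lower())

links = [
    ("online corporate training", 3),  # from a well-ranked business site
    ("corporate training courses", 1),
    ("click here", 5),                 # anchor never mentions the term
]
print(anchor_text_score(links, "corporate training"))  # -> 4
```

Note that the highest-weighted link contributes nothing here because its anchor text ("click here") never mentions the term, which is exactly why link farmers trade links with carefully chosen anchor text rather than generic ones.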

Companies now offer link-exchanging networks, in which sites willing to provide a link to another site receive a link back in return. In this way, both sites benefit from the value of links from other websites. There are even software programs which allow you to search for competitors and the people linking to them, manage the process of acquiring links (via email) from other websites, and handle the creation of a page on your website dedicated to linking to other sites.

Another form of link-farming, a little more public and lacking the financial motives, is blatantly called "Google-bombing". A Google-bomb is when a group of websites work together to artificially place a website in Google under a specific search term. Although this technique has relatively harmless motives (more of a joke than a business strategy), it is the same technique used by commercial operators to get themselves listed in a good position against a search term their customers might use.

Search engines will continue to combat efforts like this to affect their listings, and search engine spammers will no doubt continue to come up with new and more effective ways of getting to the top of the results. Unfortunately the open nature of the web and the sheer volume of information available means that search engines will always be subject to some form of abuse unless they allow only selected sites to be displayed. With Google indexing 4,285,199,774 pages at last count, that’s a tall order.

Artificial Search Engine Positioning Schemes

Search engines form an integral part of our view of information online, with 39% of Americans using a search engine during January of 2004. With search playing such a central role in our access to the Information Superhighway, it is critical that search engine results pages (SERPs) contain relevant, appropriate results which correspond with what the user is searching for.

While search engines such as Yahoo, MSN and Google are constantly striving to improve the quality of their SERPs, businesses and individuals are striving just as hard to improve their sites’ listings, often for search terms which don’t necessarily relate to the page in question. In fact, artificially influencing SERPs is such a problem that recent Microsoft research indicates that, of their selected sample of the web, 8% of pages are spam, created to artificially affect positioning.

Major search engines are attacking this problem in a variety of ways. Google engineers are "working to improve every aspect of Google on a daily basis", MSN has their research team developing new identification techniques to prevent spam making its way into their listings, and Yahoo continues to employ methods such as paid inclusion to clean up heavily-spammed market verticals such as adult entertainment and online gambling. This ongoing battle means that search engines are in a constant state of flux, and search engine owners are continually playing a game of cat and mouse with the spammers and those who are attempting to improve their listings in an effort to keep SERPs relevant.

In their efforts to improve website listings, search engine spammers analyze the methods that the search engines use to rank normal sites, and then attempt to take advantage of those processes. One of the most common techniques, called ‘link-spamming’ or ‘link-farming’, involves acquiring (or trading) numerous links from other (preferably related) websites, all pointing to a target site. This has the effect of making the page look more popular or important to search engines, thus affecting its position for certain keywords. Although this practice is also seen to some extent amongst ‘honest’ webmasters, it is virulent amongst those attempting to artificially boost their rankings, to the extent that people pay for links from other sites, never intending to attract visitors via those links, only search engines.

The problem for search engine owners is in identifying spam attempts without penalizing honest linking, keyword use and other elements of web development which can be used for spamming just as easily as they can be for normal web pages. Search engine spam is an ongoing problem, and given the importance of search engines to the process of finding anything on the Internet, one that’s not likely to go away quickly or easily.

XooMLe Download Finally Available!

After a very long wait, a downloadable version of XooMLe is now available!

This means that you can now download XooMLe and install it on your own PHP-enabled server so that you can use it locally for integrating Google search results (and cache and spelling suggestion power) into your web-based applications.

In the near future, a ‘XooMLe In Action’ page will also be added to the site, showcasing some of the ways that people are making use of XooMLe and Google.