Overall, phrase searching proved an effective strategy on the three search engines. Using an exact phrase helped to limit the very large number of keyword combinations that these search engines are able to retrieve. A drawback of exact phrase searching was the loss of documents in which the word order was slightly altered or plurals were used. However, as the initial recall was sufficient, precision was prioritized over higher recall.
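The trade-off between precision and recall described above can be illustrated with a minimal sketch. The document identifiers and result sets below are hypothetical and were not measured in this study; they only show how a narrower phrase search can raise precision while lowering recall.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: documents 1-10 are the relevant ones.
relevant = set(range(1, 11))

# Hypothetical loose keyword search: 8 relevant documents plus 32 irrelevant ones.
keyword_results = set(range(1, 9)) | set(range(100, 132))

# Hypothetical exact phrase search: 6 relevant documents plus 2 irrelevant ones.
phrase_results = set(range(1, 7)) | {100, 101}

keyword_scores = precision_recall(keyword_results, relevant)
phrase_scores = precision_recall(phrase_results, relevant)
print("keyword search (precision, recall):", keyword_scores)
print("phrase search  (precision, recall):", phrase_scores)
```

In this invented example the phrase search loses two relevant documents but discards most of the irrelevant ones, which is the kind of exchange accepted in this study.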
To evaluate the results of the searches, an arbitrary relevancy ranking was assigned. Far from considering the concept of relevancy an objective measure, this study considered it “a dynamic concept that depends on users’ judgements of the quality of the relationship between information and information needs at a certain point in time” (Schamber et al. qtd. in Walker and Jones 269). All three search engines retrieved relevant results, including a large number of policy statements from American academic libraries. Two annotated bibliographies were also retrieved. These bibliographies were assigned a higher relevancy score because they constitute a carefully crafted selection of useful information from reputable sources on the topic.
The fact that both Google and Ask.com retrieved a message to a listserv with very little information on the topic deserves special attention. Various factors might have affected the retrieval of this document. First, the document is very short and includes the phrase in the first paragraph, which may lead the search engine to assign it a higher relevancy ranking. Second, in both Ask.com and Google, many query-independent factors are involved in determining relevancy. As Hawking notes, “A page with a high query-independent score has a higher a priori probability of retrieval than others that match the query equally well” (par. 18). Among these factors, link structure is key in determining relevancy. Because this listserv message is hosted within a site that has hundreds of links to library-related websites (webjunction.org), it makes sense that it scored so high in terms of relevancy, even though its content is not useful for our search.
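The effect Hawking describes can be sketched in a few lines. Neither Google nor Ask.com publishes its ranking formula, so the linear blend, the weight, and all scores below are assumptions made purely for illustration; the page labels are hypothetical stand-ins for the two kinds of documents discussed above.

```python
def combined_score(query_score, prior_score, prior_weight=0.5):
    """Blend a query-dependent match score with a query-independent prior.

    The linear blend and the default weight are illustrative assumptions,
    not the formula of any real search engine.
    """
    return (1 - prior_weight) * query_score + prior_weight * prior_score


# (label, query-dependent score, query-independent score), all values invented.
pages = [
    ("detailed policy statement on a little-linked site", 0.8, 0.2),
    ("listserv message on a heavily linked site", 0.8, 0.9),
]

# Both pages match the query equally well (0.8), so the prior decides the order.
ranked = sorted(pages, key=lambda p: combined_score(p[1], p[2]), reverse=True)
for label, query_score, prior_score in ranked:
    print(round(combined_score(query_score, prior_score), 2), label)
```

With the prior weight set to zero the two pages tie, which is one way to see that the difference in ranking comes entirely from the query-independent component.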
Going back to the search strategy, additional keywords and limits could have been used in conjunction with the selected phrases. For example, we could have used the terms ‘academic library’ or ‘university’ to limit our search, but this was not necessary since, without using these words, most retrieved documents corresponded to pages from university libraries. A reason for this might be that these types of institutions are most likely to make such policy statements public via their websites.
Another possible strategy for limiting results to academic institutions would have been to restrict the search to .edu domains. As we saw in the review of search engine features, all three engines offer this feature. This limit was not used because many universities do not use the .edu domain; instead, they use .com or their country's domain extension. Also, as pages from US-based sites outnumber those from all other countries in the English-speaking world, additional country limits were not considered necessary.
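The reason for avoiding the .edu limit can be shown with a minimal sketch. The URLs are hypothetical, and the function only imitates a domain limit by checking the end of the host name; it does not reproduce how any of the three engines implements the feature.

```python
from urllib.parse import urlsplit


def matches_edu_limit(url):
    """Return True if the URL's host name ends in .edu (an imitation of a domain limit)."""
    host = urlsplit(url).hostname or ""
    return host.endswith(".edu")


# Hypothetical university library pages under different domain extensions.
urls = [
    "http://library.example.edu/policy.html",
    "http://library.example.ac.uk/policy.html",
    "http://www.example-university.com/library/",
    "http://biblioteca.example.cl/politica.html",
]

kept = [u for u in urls if matches_edu_limit(u)]
dropped = [u for u in urls if not matches_edu_limit(u)]
print("kept:", kept)
print("dropped:", dropped)
```

Only the first page survives the limit, even though all four are, by assumption, university library pages.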
Multiple Boolean nesting was tested, but its use was, predictably, not as effective as it usually is in exact match systems. In fact, multiple Boolean nesting caused Gigablast to crash, returning the following statement: “Error = Bad engineer”. Similarly, Google and Ask.com either ignored the operators or displayed a “no results found” message followed by suggestions on query syntax.
Ask.com and Google retrieved similar results among the first five results; however, the numbers of hits were quite different. While Google brought up 244 hits, Ask.com retrieved 1,290. A quick scan of the results shows much more duplication in Ask.com than in Google, which might explain the difference in numbers.
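How duplication can inflate a hit count is easy to sketch. The normalization rule and the URLs below are assumptions chosen for illustration; they do not describe how Google or Ask.com actually detect duplicates.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Build a comparison key: lowercase host without 'www.', path without a trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def count_unique(urls):
    """Count distinct pages after normalization."""
    return len({normalize(u) for u in urls})


# Hypothetical result list in which the same two pages appear under several URLs.
hits = [
    "http://www.example.edu/library/policy.html",
    "http://example.edu/library/policy.html",
    "http://EXAMPLE.edu/library/policy.html",
    "http://www.example.org/collections/",
    "http://www.example.org/collections",
]

print("reported hits:", len(hits))
print("unique pages:", count_unique(hits))
```

An engine that reports every variant separately would show five hits here, while one that collapses them would show two.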
As a different strategy was used for Gigablast, it is reasonable that the engine retrieved a higher number of hits: 288,274. However, a quick look at the first pages of results shows much duplication and spam. It should be noted that Google proved far superior to the other two search engines in spam control.
In Gigablast, the display of search results in thematic clusters sounded promising, but in practice it was disappointing. Clusters did not seem to represent real thematic units; instead, they seemed to reflect word occurrence. As a result, some clusters were real thematic units, while others did not represent the ‘aboutness’ of the search results. For example, when we tried the original search strategy used for Google and Ask.com, Gigablast offered the following clusters: "Library is in the process", "institution". These clusters are far from being thematic units and even further from making any sense in the context of electronic collection development.
In terms of navigation within search results, Gigablast showed another drawback compared to the other two engines. Gigablast only allows moving forward to the next page of results, which can be set to display up to 100 records. This is inconvenient in searches that retrieve a large number of hits. If a searcher wanted to look at the last 100 records to see whether they are still relevant, she would have to click on “next 100 results” many times before reaching the last page.
This paper reviewed a selection of search features in Google, Ask.com and Gigablast. Though in principle all three systems looked similar, performance differed. Ask.com and Google retrieved similar results, but Google performed better in removal of duplication and spam control. Gigablast proved to be the most vulnerable to spam and the least successful at exact phrase searching. However, once the query was reformulated in Gigablast, this engine also retrieved highly relevant results.
Overall, all three systems were successful in retrieving relevant documents. This success was partially a result of a very specific query, controlled by the use of quotation marks. However, the importance of query-independent factors, such as link structure, was also noticeable. The importance of query-independent factors and the limited support for multiple Boolean nesting, combined with a lack of knowledge of the hundreds of factors involved in search engine algorithms, limit the degree of control that even expert searchers can achieve in web searching.