
This study compares three web search engines through a search on the
topic of digital collection development. Search results are
recorded and ranked according to relevancy. A general approach
to web searching is discussed, as well as specific strategies for
each individual search engine. For each search engine, query syntax,
limits, and ranking of results are discussed, according to findings
gathered in a review of the literature.
The manager of a medium-sized American academic library needs to locate information to help her redefine the library’s digital collection policy. With the increasing demand for electronic resources over print, this manager must revamp the library’s electronic collection development policy to better reflect the new reality.
The new policy should consider:
Selection criteria: which formats to include among the variety of electronic resources (e.g., e-journals, e-books, websites, government documents, working papers, conference proceedings, databases, image files); what equipment or software is needed to provide access to these resources; how to choose interfaces that are user-friendly.
Guidelines for dealing with websites: many resources are available for free on the open web. Should some of these be incorporated into the catalogue? Or should they simply be linked from the library website?
Updates: given the changing nature of electronic resources, the new policy should provide a framework to guide staff on how to keep the collection up to date.
The client would like to consider how other libraries are dealing with these issues in order to decide on a best approach. Ideally, she would like to review other libraries’ policy statements. Alternatively, articles reflecting on the issues mentioned above would also be useful.
Among the many search engines available on the market, Google, Ask.com, and Gigablast were selected for specific reasons. Google was included because it is a market leader and could provide an interesting comparison with other, less popular search engines. Ask.com was chosen because it has recently undergone a transformation and re-branding (NewsMarket). Its inclusion would provide a good opportunity to explore how the re-branding has affected the engine's core search technology. Finally, Gigablast was new to me, so its inclusion provided an opportunity to explore it.
In contrast to human-powered information organization systems, Web search engines automatically generate their results listings based on information retrieved by “crawlers” or “spiders”. The crawler is the software employed to follow links from one web page to others and to collect the information to be added to an index. Search engines use indexes to match the terms in a search query and retrieve corresponding results (Sullivan 2002).
Search engines program the crawlers to visit pages periodically, so that they can find new pages or modifications to existing ones. Contrary to popular belief, web search engines cover only a portion of the web. Some index only the top- or second-level pages on each site, while others follow the complete URL extension for each site (Bopp and Smith 136).
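The crawl-and-index process described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the in-memory `PAGES` dictionary stands in for the web, the URLs are invented, and the `max_depth` parameter is a hypothetical way of modelling engines that stop at top- or second-level pages. It does not represent how any of the three engines is actually implemented.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "a.example/": ("library collection policy", ["a.example/policy", "b.example/"]),
    "a.example/policy": ("electronic collection development policy", []),
    "b.example/": ("digital resources guide", ["a.example/"]),
}

def crawl(seed, max_depth):
    """Follow links breadth-first from the seed, up to max_depth link hops."""
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        text, links = PAGES[url]
        yield url, text
        if depth < max_depth:
            for link in links:
                if link in PAGES and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))

def build_index(seed, max_depth):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in crawl(seed, max_depth):
        for term in text.split():
            index[term].add(url)
    return index

def search(index, term):
    """Return the sorted URLs indexed under a single term."""
    return sorted(index.get(term, set()))

shallow = build_index("a.example/", max_depth=0)  # seed page only
deeper = build_index("a.example/", max_depth=1)   # seed plus linked pages
print(search(shallow, "development"))
print(search(deeper, "development"))
```

With a depth of zero the crawler never reaches the policy page, so a search for "development" finds nothing; allowing one more hop brings that page into the index. This mirrors the point that an engine can only retrieve what its crawler has visited.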
In addition to word indexing, search engines use other mechanisms to rank search results, such as term frequency and position, document length, inlink score, anchor text matches, phrase matches, and more. Search engines may also calculate the frequency of incoming links and assign link popularity scores accordingly (Hawking par. 16). Other factors affecting retrieval are URL brevity, spam score, and how frequently users click on links.
Overall, search engine ranking is a complex process which may involve dozens or even hundreds of factors. Many of the exact elements involved in search engine algorithms are closely guarded business secrets. In the following sections, a selection of the publicly known search features of Google, Ask.com, and Gigablast is discussed.
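To make the idea of combining several ranking signals concrete, the sketch below scores a document from three of the factors named above: term frequency normalised by document length, an exact phrase match, and the number of incoming links. The function name `score`, the weights, and the way the signals are added together are all assumptions chosen for illustration; real engines keep their formulas secret and use many more factors.

```python
import math

def score(doc_text, query, inlinks, w_tf=1.0, w_phrase=2.0, w_links=0.5):
    """Toy relevance score; weights are arbitrary illustrative values."""
    words = doc_text.lower().split()
    terms = query.lower().split()
    if not words or not terms:
        return 0.0
    # Term frequency, divided by document length so long pages gain no edge.
    tf = sum(words.count(t) for t in terms) / len(words)
    # Phrase match: the query terms appear contiguously and in order.
    n = len(terms)
    phrase = 1.0 if any(words[i:i + n] == terms
                        for i in range(len(words) - n + 1)) else 0.0
    # Link popularity, dampened so each extra inlink counts for less.
    link_pop = math.log1p(inlinks)
    return w_tf * tf + w_phrase * phrase + w_links * link_pop

query = "collection development"
exact = score("collection development policy", query, inlinks=0)
scattered = score("development of a collection", query, inlinks=0)
print(exact > scattered)
```

Under these assumed weights, a page containing the exact phrase outranks one where the same words are scattered, and additional incoming links raise the score of either page.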
A common search strategy was used to approach all search engines. Instead of searching for concepts, words more likely to appear in natural language on the target pages were used.
Phrase searching was chosen over simple keyword searching, in order to increase precision by only retrieving documents where the terms appeared in exactly the specified order. Finally, this search strategy was slightly modified for Gigablast for reasons we will explain in the next section.
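The difference between keyword and phrase searching can be shown with a small sketch. The two sample documents and the helper names `keyword_match` and `phrase_match` are invented for illustration, and the matching is simplified to lowercase, whitespace-separated words.

```python
def tokens(text):
    """Split text into lowercase words."""
    return text.lower().split()

def keyword_match(doc, query):
    """True if every query word occurs somewhere in the document."""
    doc_words = set(tokens(doc))
    return all(word in doc_words for word in tokens(query))

def phrase_match(doc, phrase):
    """True if the phrase words occur contiguously and in the given order."""
    d, p = tokens(doc), tokens(phrase)
    return any(d[i:i + len(p)] == p for i in range(len(d) - len(p) + 1))

DOCS = [
    "electronic collection development policy of the library",
    "policy on development of the electronic collection",
]
query = "electronic collection development policy"

for doc in DOCS:
    print(keyword_match(doc, query), phrase_match(doc, query))
```

Both documents satisfy the keyword search, but only the first satisfies the phrase search. The second document, which uses the same words in a different order and sense, is filtered out, which is the gain in precision the strategy relies on.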
The selected phrases were “electronic collection development policy” and “collection development policy for electronic resources”. The phrases were combined with the Boolean operator OR, to retrieve instances of either phrase. This strategy aimed to match the title most academic libraries would be likely to give such a policy. A test was also done with the phrase “digital collection development policy”, but this phrase proved to be less commonly used in libraries' policy statements.
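The query described above can be assembled programmatically, as in the following sketch. The quoting and the uppercase OR follow the strategy described in the text; the helper names and the `search.example` address with its `q` parameter are hypothetical placeholders, not the actual endpoint of any of the three engines.

```python
from urllib.parse import urlencode

PHRASES = [
    "electronic collection development policy",
    "collection development policy for electronic resources",
]

def build_query(phrases):
    """Quote each phrase and join the phrases with the Boolean OR operator."""
    return " OR ".join(f'"{phrase}"' for phrase in phrases)

def search_url(query, base="https://search.example/search"):
    """Attach the query to a hypothetical search URL, percent-encoding it."""
    return base + "?" + urlencode({"q": query})

query = build_query(PHRASES)
print(query)
print(search_url(query))
```

Keeping the phrase list separate from the query-building step makes it straightforward to swap in the tested alternative, “digital collection development policy”, or to adjust the syntax for an engine that handles OR differently.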