Searching the web – Part II & III

In my continuing series on searching the web, I will look at the ARC (Automatic Resource Compilation) algorithm and the SALSA algorithm.

ARC Algorithm
ARC is an extension of the HITS algorithm and uses the same notion of hubs and authorities. Like HITS, it uses a term-based search engine to create a root set. The main difference is that ARC also performs textual analysis of the web pages and uses the result to weight the hub and authority scores.
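To make the weighting idea concrete, here is a minimal Python sketch. It is not the actual ARC implementation: I am assuming that the textual analysis boils down to counting query terms in the anchor text of each link and using 1 plus that count as the link weight. The function names (anchor_weight, weighted_hits), the tuple format and the toy links are all made up for illustration.

```python
import math


def anchor_weight(anchor_text, query_terms):
    """Assumed weighting: 1 plus the number of query-term matches in the anchor text."""
    words = anchor_text.lower().split()
    return 1 + sum(words.count(term.lower()) for term in query_terms)


def normalize(scores):
    """Scale the scores so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def weighted_hits(links, query_terms, iterations=50):
    """HITS-style iteration over weighted links.

    links is a list of (source, target, anchor_text) tuples (hypothetical format).
    Returns (hub_scores, authority_scores) as dictionaries keyed by page.
    """
    pages = sorted({p for s, t, _ in links for p in (s, t)})
    weights = {}
    for s, t, anchor in links:
        # Repeated links between the same pair of pages add up.
        weights[(s, t)] = weights.get((s, t), 0) + anchor_weight(anchor, query_terms)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: weighted sum of the hubs pointing at the page.
        new_auth = {p: 0.0 for p in pages}
        for (s, t), w in weights.items():
            new_auth[t] += w * hub[s]
        # Hub score: weighted sum of the authorities the page points at.
        new_hub = {p: 0.0 for p in pages}
        for (s, t), w in weights.items():
            new_hub[s] += w * new_auth[t]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Toy data: two hub pages linking to two candidate authorities.
example_links = [
    ("h1", "a", "agile development guide"),
    ("h1", "b", "click here"),
    ("h2", "a", "agile methods"),
    ("h2", "b", "home"),
]
hub_scores, auth_scores = weighted_hits(example_links, ["agile", "development"])
```

In this toy graph both targets receive the same number of links, so the anchor text is what pushes page "a" ahead of page "b" as an authority. With every weight set to 1, the loop is plain HITS.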

SALSA Algorithm
The Stochastic Approach for Link Structure Analysis (SALSA) algorithm is also an extension of the HITS algorithm. It uses the same concepts of hub and authority pages; however, it uses the theory of Markov chains to perform two random walks on the web graph. One walk is conducted on the authority side of the web graph (the authority chain) and the other on the hub side (the hub chain). The algorithm creates a matrix from the links between pages and applies it to the hub and authority matrices in an iterative manner. What is produced are the principal eigenvectors of the hub and authority matrices. The web pages with the largest entries in these eigenvectors are ranked highest.
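The two random walks are easier to follow in code. Below is a rough Python sketch under a few assumptions of my own: the graph is a plain list of (source, target) links, it forms a single connected component, and I simulate the walks by repeatedly pushing probability mass along the links instead of building the matrices explicitly. The names salsa and step are hypothetical.

```python
def step(dist, neighbours):
    """Move each node's probability mass uniformly to its neighbours."""
    nxt = {}
    for node, mass in dist.items():
        share = mass / len(neighbours[node])
        for n in neighbours[node]:
            nxt[n] = nxt.get(n, 0.0) + share
    return nxt


def salsa(links, iterations=100):
    """Approximate the stationary distributions of the hub and authority chains.

    links is an iterable of (source, target) pairs (hypothetical format).
    Returns (hub_scores, authority_scores) as dictionaries keyed by page.
    """
    out_links, in_links = {}, {}
    for s, t in set(links):  # ignore duplicate links
        out_links.setdefault(s, []).append(t)
        in_links.setdefault(t, []).append(s)

    # Start from a uniform distribution on each side of the graph.
    auth = {p: 1.0 / len(in_links) for p in in_links}
    hub = {p: 1.0 / len(out_links) for p in out_links}

    for _ in range(iterations):
        # Authority chain: follow a link backwards to a hub, then forwards again.
        auth = step(step(auth, in_links), out_links)
        # Hub chain: follow a link forwards to an authority, then backwards again.
        hub = step(step(hub, out_links), in_links)
    return hub, auth


# Toy graph: three hubs, three authorities, one connected component.
example_links = [("h1", "a"), ("h2", "a"), ("h3", "a"), ("h1", "b"), ("h2", "c")]
hub_scores, auth_scores = salsa(example_links)
```

On a connected graph like this one, the authority scores settle at values proportional to each page's number of in-links, and the hub scores at values proportional to the number of out-links. Graphs with several components need extra care when combining the per-component results, which this sketch leaves out.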

I have not found any practical applications that use these algorithms. As soon as I find some, I will post the links.

Searching the web – Part I

This week I took a one-week cuil challenge. I was curious to find out how this search engine stacks up against Google’s search engine. My aim was to use no search engine other than cuil. However, within three days, I had to switch. Search results were poor: a search for, say, ‘agile development’ returned no Wikipedia hits. There was no spell checker and there were no add-ons for Firefox.

One reason for the abrupt end to my one-week cuil challenge was that I was introduced to another search engine called clusty. Clusty not only returned better results than cuil, but instead of delivering millions of search results in one long list, it grouped similar results together into clusters. Clusters help you see your search results by topic so you can zero in on exactly what you’re looking for or discover unexpected relationships between items. What is great is that you can also search within clusters.

Clusty used a clustering algorithm for its search engine. Everyone is familiar with Google’s PageRank algorithm. Clustering, by contrast, involves separating unrelated documents and grouping related documents together. Using the contents of web pages and their link information, the content-link hypertext clustering algorithm groups similar web pages into clusters that can be searched or combined into larger clusters. To generate clusters, the algorithm uses two similarity functions: one that examines the hyperlinks of the pages and one that examines the contents of the pages. Combining the hyperlink and content similarity functions iteratively produces clusters of similar web pages.
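Here is a small Python sketch of how two such similarity functions might be combined. It is only an illustration and not what clusty actually ran: I am assuming Jaccard overlap for both the content and the hyperlink similarity, an equal weighting between the two, and a simple single-pass grouping in place of the iterative procedure. The page format, the threshold and the function names are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def combined_similarity(page_a, page_b, link_weight=0.5):
    """Blend hyperlink similarity and content similarity into one score."""
    content = jaccard(page_a["terms"], page_b["terms"])
    links = jaccard(page_a["links"], page_b["links"])
    return link_weight * links + (1 - link_weight) * content


def cluster(pages, threshold=0.3):
    """Put each page into the first cluster containing a similar enough page."""
    clusters = []
    for name, page in pages.items():
        for members in clusters:
            if any(combined_similarity(page, pages[m]) >= threshold for m in members):
                members.append(name)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append([name])
    return clusters


# Toy pages described by their terms and outgoing links (made-up data).
example_pages = {
    "p1": {"terms": {"agile", "scrum", "sprint"}, "links": {"x.com", "y.com"}},
    "p2": {"terms": {"agile", "scrum", "kanban"}, "links": {"x.com", "z.com"}},
    "p3": {"terms": {"salsa", "dance", "music"}, "links": {"d.com"}},
}
example_clusters = cluster(example_pages)
```

The two agile pages share both terms and a link target, so they end up in one cluster, while the dance page has nothing in common with them and gets a cluster of its own. Raising link_weight makes the shared links count for more than the shared words.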

Other web search algorithms of note are:

HITS – Hyperlink-Induced Topic Search

ARC – Automatic resource compilation

SALSA – Stochastic Approach for Link Structure Analysis

These I will discuss in my next blog post.