Sunday, January 31, 2010

Using Haskell to calculate PageRank

The PageRank algorithm was originally developed by Google's founders for ranking pages by their links. But the idea of ranking items based on references to them can be used not only on the Web, but also in other fields that have a directed graph structure. This article, which is simultaneously Literate Haskell code, shows how to calculate PageRank with Haskell.

PageRank algorithm

The PageRank algorithm uses the random surfer model. It calculates the probability that a particular page is visited by a surfer who randomly clicks on links.

Originally, the PageRank of page A is:

PR(A)=PR(P1)/LC(P1)+PR(P2)/LC(P2)+...+PR(Pn)/LC(Pn)

where LC(Pi) is the number of outgoing links from Pi, and P1, ..., Pn are all the pages that link to A.

When a page has no outgoing links, we suppose that it links to every other page in the Web.

PageRank also includes a damping factor: the probability that the random surfer will continue clicking links at each step. Usually the damping factor equals 0.85.

With damping factor D, the previous PageRank formula becomes:

PR(A)=(1-D)+D(PR(P1)/LC(P1)+PR(P2)/LC(P2)+...+PR(Pn)/LC(Pn))
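
For example, assuming D = 0.85 and a page A whose only inbound link comes from P1 with PR(P1) = 1 and LC(P1) = 2, a single application of the formula gives:

PR(A) = (1 - 0.85) + 0.85 * (1/2) = 0.575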

Because the link graph has cycles (i.e. PR(A) can indirectly depend on PR(A)), the PageRank calculation is iterative: on each step we recalculate the whole graph, getting it closer to the real PageRank values.

Thus the algorithm should look like:

  1. Initialize PR(Ai) for every Ai with an arbitrary value, for example simply 1

  2. Calculate PR(Ai) for all Ai

  3. Repeat step 2 a specified number of times, or until the error becomes relatively small (a possible convergence check is sketched right after this list)
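
A possible convergence check, shown here only as a plain sketch (it is not part of the literate source, and the converged name and eps threshold are purely illustrative); it relies on the RankGraph type and the fromRankGraph helper defined below:

-- stop iterating when the largest per-vertex rank change drops below eps
converged :: Double -> RankGraph -> RankGraph -> Bool
converged eps old new =
    maximum (zipWith (\(_, a) (_, b) -> abs (a - b))
                     (fromRankGraph old) (fromRankGraph new)) < eps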

Implementation

> module Main where
>
> import System.Random
> import Control.Monad
> import Control.Applicative
> import Data.Maybe
> import System.Environment
>
> import qualified Data.IntMap as M
> import qualified Data.IntSet as S

Convert input data to a more convenient structure

Our initial input data will have the form:

[(vertex, [vertexes to which current vertex links])]

Vertexes are integers, for a more efficient implementation.
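
For instance, a tiny three-page graph of my own (used again in the examples below), where page 1 links to pages 2 and 3, page 2 links to page 3, and page 3 has no outgoing links, would be written as:

[(1, [2, 3]), (2, [3]), (3, [])]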

> -- Each vertex has its rank, its number of outbound links and
> -- the set of its inbound links
> data VertexProps = VertexProps {
>       vertexRank       :: Double,
>       numOutboundLinks :: Int,
>       inboundLinks     :: S.IntSet
>     } deriving Show
>
> -- A vertex is dangling if it has no outbound links
> isDangling :: VertexProps -> Bool
> isDangling (VertexProps _ 0 _) = True
> isDangling _                   = False
>
> -- Our graph is a map between a vertex index and its properties
> type RankGraph = M.IntMap VertexProps
>
> fromGraph :: [(Int, [Int])] -> RankGraph
> fromGraph gr =
>     let depPairs (v, l) = (v, VertexProps 1 (length l) S.empty) :
>                           [(x, VertexProps 0 0 (S.singleton v)) | x <- l]
>         depList = concatMap depPairs gr
>         catVertexes (VertexProps r1 l1 s1) (VertexProps r2 l2 s2) =
>             VertexProps (r1 + r2) (l1 + l2) (s1 `S.union` s2)
>     in M.fromListWith catVertexes depList
>
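
For the tiny three-page graph from the example above, fromGraph builds roughly the following map (a GHCi sketch; the exact Show output may differ slightly and is reformatted here for readability):

*Main> fromGraph [(1, [2, 3]), (2, [3]), (3, [])]
fromList [(1, VertexProps {vertexRank = 1.0, numOutboundLinks = 2, inboundLinks = fromList []}),
          (2, VertexProps {vertexRank = 1.0, numOutboundLinks = 1, inboundLinks = fromList [1]}),
          (3, VertexProps {vertexRank = 1.0, numOutboundLinks = 0, inboundLinks = fromList [1,2]})]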

Implement the single-step PageRank algorithm

> pageRank :: Double -> RankGraph -> RankGraph
> pageRank df gr = M.mapWithKey transformRank gr
>   where transformRank v (VertexProps r l s) =
>             VertexProps (calcRank s) l s
>         -- calculate the rank of a single vertex from its inbound links
>         calcRank s = (1 - df) + df * (danglingScore + S.fold incomR 0 s)
>         -- sum the link votes of all inbound links
>         incomR vrt r = r + linkVote vrt
>         -- the vote of a single inbound link: its rank divided by
>         -- its number of outbound links
>         linkVote v = let lvf (VertexProps r l _) = r / fromIntegral l
>                      in maybe 0 lvf (v `M.lookup` gr)
>         -- calculate the dangling score:
>         -- sum all dangling page ranks and divide them by the graph size
>         danglingPages = filter (isDangling . snd) (M.toList gr)
>         danglingScore = sum (map (vertexRank . snd) danglingPages) /
>                         fromIntegral (M.size gr)
>
> pageRankList :: Double -> RankGraph -> [RankGraph]
> pageRankList df gr = iterate (pageRank df) gr
>
> -- Convert from RankGraph to a list of (id, rank) pairs
> fromRankGraph :: RankGraph -> [(Int, Double)]
> fromRankGraph = map (\(v, r) -> (v, vertexRank r)) . M.toList
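
As a quick sanity check, a GHCi sketch on the tiny example graph from above:

*Main> let g = fromGraph [(1, [2, 3]), (2, [3]), (3, [])]
*Main> fromRankGraph (pageRankList 0.85 g !! 20)

The ranks converge toward approximately [(1, 0.59), (2, 0.84), (3, 1.56)] (these are the rounded fixed-point values, so the 20th iteration will be close to, but not exactly, them): the dangling page 3 collects the most rank, because both other pages link to it.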

Testing

Let's test our creation. I create a simple main program that reads the graph size and the maximum number of links per page from the command line and calculates the rank with 20 iterations. To force evaluation, we'll just print the last element from the result of the last iteration. Here I'm only interested in the algorithm's efficiency.

Generate test graph

> createTestGraph :: Int -> Int -> IO [(Int, [Int])]
> createTestGraph numPages maxLinks =
>     sequence [(,) p <$> outboundLinks p | p <- [1..numPages]]
>   where
>     -- generate a random number of outbound links for page p
>     outboundLinks p = do
>         outLinksNum <- randomRIO (0, maxLinks)
>         sequence [randomRangeNot p (1, numPages) |
>                   x <- replicate outLinksNum 0]
>
>     -- random value in the range [a, b] that is not equal to p
>     randomRangeNot p (a, b) = do
>         v <- randomRIO (a, b - 1)
>         return (if v == p then b else v)
>
> main :: IO ()
> main = do
>     (gSize:gLinks:_) <- getArgs
>     rank <- pageRankList 0.85 . fromGraph <$>
>             createTestGraph (read gSize) (read gLinks)
>     putStrLn ("Graph size: " ++ gSize ++ "x" ++ gLinks)
>     putStrLn ("20th iteration result: " ++
>               show (M.toList (rank !! 20) !! ((read gSize) - 1)))
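
Assuming the post is saved as pagerank.lhs (the file name is only my choice), it can be compiled and run like this; the program prints the graph parameters and the 20th-iteration rank of the last vertex:

$ ghc -O2 --make pagerank.lhs -o pagerank
$ ./pagerank 10000 10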

My algorithm takes 7 seconds to calculate 20 iterations of PageRank for a graph with 10000 vertexes. Not bad, but I would be glad to reduce the memory usage, because 150MB is too much for such a tiny data set.

Comments:

  1. "danglingPages = filter (isDangling . snd) (M.toList gr)" is probably faster as " = snd . unzip . M.toList . M.filter isDangling gr" not having to allocate list cells that will be filtered out.

    Also, folding a function which calls IntMap.lookup over a another IntMap or IntSet is usually sub-optimal. The two tries can be intersected in time that is mostly linear in the size of the smaller, then you can fold over this intersected trie. Check out IntMap.intersectionWithKey. Although, you may have to convert your Set to a Map, but this is fast (M.fromAscList . map ((,) ()) . S.toAscList). If you find this conversion to use significant time then replace IntSet with IntMap ().

    Also, M.size is not constant time, it's linear. The IntSet/Map tries do not store size info at nodes as Data.Set and Data.Map do, so M.size is essentially (M.foldl' (+) 0).
