Sunday, January 31, 2010

Using Haskell to calculate PageRank

The PageRank algorithm was originally developed by Google's founders for ranking web pages by the links between them. But the idea of ranking items based on references to them can be applied not only to the Web, but to any other field that has a directed graph structure. This article, which is at the same time a Literate Haskell program, shows how to calculate PageRank with Haskell.

PageRank algorithm

The PageRank algorithm uses the random surfer model: it calculates the probability that a surfer who randomly clicks on links will end up visiting a particular page.

In its original form, the PageRank of a page A is:

PR(A)=PR(P1)/LC(P1)+PR(P2)/LC(P2)+...+PR(Pn)/LC(Pn)

where the Pi are all pages that link to A, and LC(Pi) is the number of outgoing links from Pi.

When a page has no outgoing links, we assume that it links to every other page on the Web.

PageRank also includes a damping factor: the probability that the random surfer keeps clicking links at each step. The damping factor is usually set to 0.85.

With a damping factor D, the PageRank formula becomes:

PR(A)=(1-D)+D(PR(P1)/LC(P1)+PR(P2)/LC(P2)+...+PR(Pn)/LC(Pn))
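
For example, suppose a hypothetical page A has two inbound pages: P1 with PR(P1)=1.0 and 2 outgoing links, and P2 with PR(P2)=0.5 and a single outgoing link. With D=0.85 the damped formula gives:

PR(A)=(1-0.85)+0.85*(1.0/2+0.5/1)=0.15+0.85*1.0=1.0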

Because the link graph has cycles (i.e. PR(A) can indirectly depend on PR(A) itself), PageRank is calculated iteratively: on each step we recalculate the whole graph, bringing it closer to the real PageRank values.

Thus the algorithm looks like this:

  1. Initialize PR(Ai) for all Ai with an arbitrary value, e.g. simply 1

  2. Calculate PR(Ai) for all Ai

  3. Repeat step 2 a fixed number of times, or until the error becomes relatively small (a possible convergence check is sketched below)
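
The implementation below simply runs a fixed number of iterations, but the "relatively small error" condition from step 3 could be checked with something like the following helper. This is only a sketch and is not part of the compiled program; it compares the per-page ranks produced by two successive iterations:

converged :: Double -> [Double] -> [Double] -> Bool
converged eps old new =
  maximum (zipWith (\a b -> abs (a - b)) old new) < eps

Applied to the rank values of two consecutive iterations, it tells us when we may stop.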

Implementation

> module Main where
>
> import System.Random
> import Control.Monad
> import Control.Applicative
> import Data.Maybe
> import System.Environment
>
> import qualified Data.IntMap as M
> import qualified Data.IntSet as S

Convert the input data to a more convenient structure

Our initial input data will have the form:

[(vertex, [vertexes to which current vertex links])]

Vertexes are represented by integers, for a more efficient implementation.
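
For example, a tiny three-page graph where page 1 links to pages 2 and 3, page 2 links to page 3, and page 3 has no outgoing links at all, would be written as:

[(1, [2, 3]), (2, [3]), (3, [])]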

> -- Each vertex has its rank, number of outbound links and
> -- set of inbound links
> data VertexProps = VertexProps {
>       vertexRank :: Double,
>       numOutboundLinks :: Int,
>       inboundLinks :: S.IntSet
>     } deriving Show
>
> isDangling (VertexProps _ 0 _) = True
> isDangling _ = False
>
> -- Our graph is a map between a vertex index and its properties
> type RankGraph = M.IntMap VertexProps
>
> fromGraph :: [(Int, [Int])] -> RankGraph
> fromGraph gr =
>   let depPairs (v, l) = (v, VertexProps 1 (length l) S.empty) :
>                         [(x, VertexProps 0 0 (S.singleton v)) | x <- l]
>       depList = concatMap depPairs gr
>       catVertexes (VertexProps r1 l1 s1) (VertexProps r2 l2 s2) =
>         VertexProps (r1 + r2) (l1 + l2) (s1 `S.union` s2)
>   in M.fromListWith catVertexes depList
>
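
Loading this file in GHCi, we can check fromGraph on the tiny three-page graph from above (the output is wrapped here for readability):

ghci> fromGraph [(1, [2, 3]), (2, [3]), (3, [])]
fromList [(1,VertexProps {vertexRank = 1.0, numOutboundLinks = 2, inboundLinks = fromList []}),
          (2,VertexProps {vertexRank = 1.0, numOutboundLinks = 1, inboundLinks = fromList [1]}),
          (3,VertexProps {vertexRank = 1.0, numOutboundLinks = 0, inboundLinks = fromList [1,2]})]

Every page starts with rank 1 and knows both how many pages it links to and which pages link to it; page 3, having no outgoing links, is a dangling page.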

Implement the single-step PageRank algorithm

> pageRank :: Double -> RankGraph -> RankGraph
> pageRank df gr = M.mapWithKey transformRank gr
>   where transformRank v (VertexProps r l s) =
>           VertexProps (calcRank s) l s
>         -- calculate rank for a single vertex
>         calcRank s = (1 - df) + df * (danglingScore + (S.fold incomR 0 s))
>         -- sum of all link votes for vertex vrt
>         incomR vrt r = r + (linkVote vrt)
>         linkVote v = let lvf (VertexProps r l _) = r / (fromIntegral l)
>                      in maybe 0 lvf (v `M.lookup` gr)
>         -- calculate the dangling score:
>         -- sum all dangling page ranks and divide them by the graph size
>         danglingPages = filter (isDangling . snd) (M.toList gr)
>         danglingScore = (sum (map (vertexRank . snd) danglingPages)) /
>                         (fromIntegral (M.size gr))
>
> pageRankList df gr = iterate (pageRank df) gr
>
> -- Convert from RankGraph to a list [(id, rank)]
> fromRankGraph = map (\(v, r) -> (v, vertexRank r)) . M.toList
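
Continuing the GHCi session from above, a single application of pageRank on the tiny three-page graph gives (values rounded here to three digits):

ghci> let g = fromGraph [(1, [2, 3]), (2, [3]), (3, [])]
ghci> fromRankGraph (pageRank 0.85 g)
[(1,0.433),(2,0.858),(3,1.708)]

Page 1 has no inbound links, so it gets only the (1-D) base and the dangling contribution from page 3; page 3, which both other pages link to, ends up with the highest rank.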

Testing

Let's test our creation. I'll write a simple main program that reads the graph size and the maximum number of links per page from the command line and calculates the rank with 20 iterations. To force evaluation, we'll just print the last element of the result of the last iteration. Here I'm only interested in the algorithm's efficiency.

Generate test graph

> -- Generate a random graph: for each of numPages pages, a random
> -- number (up to maxLinks) of outbound links to other pages
> createTestGraph :: Int -> Int -> IO [(Int, [Int])]
> createTestGraph numPages maxLinks =
>     sequence [(,) p <$> outboundLinks p | p <- [1..numPages]]
>   where
>     outboundLinks p = do
>       outLinksNum <- randomRIO (0, maxLinks)
>       sequence [randomRangeNot p (1, numPages) |
>                 x <- replicate outLinksNum 0]
>
>     -- a random value from the range [a, b] that is never equal to p
>     randomRangeNot p (a, b) = do
>       v <- randomRIO (a, b - 1)
>       return (if v == p then b else v)
>
> main :: IO ()
> main = do
>   (gSize:gLinks:_) <- getArgs
>   rank <- pageRankList 0.85 . fromGraph <$>
>           createTestGraph (read gSize) (read gLinks)
>   putStrLn ("Graph size: " ++ gSize ++ "x" ++ gLinks)
>   putStrLn ("20th iteration result: " ++
>             (show (M.toList (rank !! 20) !! ((read gSize) - 1))))

My algorithm takes 7 seconds to calculate 20 iterations of PageRank for a graph with 10000 vertexes. Not bad, but I would be glad to reduce the memory usage, because 150 MB is too much for such a tiny data set.