Monday, February 1, 2010

Using PageRank to calculate HackageRank

Previous article was introductory to the PageRank algorithm. Today I want to show one of the practical use of previous library.

Objectives

Hackage grows every day. Because of its centralized structure and the cabal tool, it is easy to install packages with all dependencies with single command.

But I always wonder, what packages are most important, and obviously present at almost all PC’s with Haskell. Sometimes looks strange, when simple library with several functions depends on more than 5 packages. Even when I have installed most popular packages, because my Haskell environment haven’t changed last year. Also I always have troubles with choosing the right library for Web server, or for work with database.

So I decided to apply PageRank algorithm to the Hackage. PageRank shows the probability of the random surfer to visit single page. Similarly HackageRank must show the probability for single package to be present on the random PC of the random haskeller.

The code

> module Main where
>
> -- use distribution package to read dependencies
> import Distribution.PackageDescription
> import Distribution.PackageDescription.Parse
> import Distribution.PackageDescription.Configuration
> import Distribution.Verbosity
> import Distribution.Version
> import Distribution.Package
>
> -- file utils, to enumerate directories
> import System.Directory
> import System.FilePath
> import System.Posix.Files
>
> import System.Environment
>
> import Control.Monad
> import Control.Applicative
> import Control.Arrow
>
> import qualified Data.ByteString.Char8 as B
>
> import qualified Data.Map as M
> import Data.Map (size, keys, fromListWith, toList)
>
> import qualified Data.Set as S
> import Data.Set (empty, union, fromList, toList)
>
> import Data.Maybe
> import Data.List (sortBy)
>
> import Text.Printf
>
> import Rank

Gathering all cabal files

All packages descriptions could be downloaded from the Hackage.

The path to the unpacked archive will be send through command line argument.

> isDir = liftA isDirectory . getFileStatus
> notDir = liftA not . isDir
>
> directoryListing p = (map (p </>) . filterUpperDirs) <$>
> getDirectoryContents p
> where
> filterUpperDirs = filter (`notElem` [".", ".."])
>
> cabalFiles root = do
> dir <- directoryListing root
> directories <- filterM isDir dir
> files <- filterM notDir dir
> subfiles <- concat <$> mapM cabalFiles directories
> return (files ++ subfiles)

Read package names and dependencies

I pack all strings into byte strings, to reduce memory usage (the PageRank algorithm itself heavily uses memory).

> packName pkg = let (PackageName pkgName) = packageName pkg
> in B.pack pkgName
>
> dependencyName (Dependency (PackageName name) _) = B.pack name
>
> packageDependencies pkg =
> let directDeps = buildDepends pkg
> allBuildInfoDeps =
> concatMap (uncurry (++) . (buildTools &&& pkgconfigDepends))
> (allBuildInfo pkg)
> in map dependencyName (directDeps ++ allBuildInfoDeps)
>
> parsePackageFile cabal = do
> pkg <- flattenPackageDescription <$> readPackageDescription silent cabal
> let deps = packageDependencies pkg
> return ((packName pkg, S.fromList deps) : [(x, S.empty) | x <- deps])
>

Generate the list of all packages and their dependencies

enumeratePackages simply maps each package name to integer value. Previous versions of PageRank algorithm run out of memory, when I tried to use strings in vertex names.

> -- we have a list of elements in the form of: [(name, deps set)]
> -- the name could be repeated in the list, so let's convert it to
> -- map, and combine all sets
> packages root = do
> largeList <- concat <$> (mapM parsePackageFile =<< cabalFiles root)
> return (M.fromListWith (S.union) (filter (not . B.null . fst) largeList))
>
> nameToIdx pkgsMap = M.fromList (zip (M.keys pkgsMap) ([0..]::[Int]))
> idxToName pkgsMap = M.fromList (zip ([0..]::[Int]) (M.keys pkgsMap))
>
> enumeratePackages pkgsMap =
> let indexes = nameToIdx pkgsMap
> getIndex name = fromJust (name `M.lookup` indexes)
> enumerator (n, d) = (getIndex n, map getIndex d)
> in map enumerator (M.toList (M.map S.toList pkgsMap))

Decode indexes to package names

Last function just simply decodes the integer values to the corresponding packages names.

> decodePackages pkgsMap ranks = 
> let names = idxToName pkgsMap
> getName idx = fromJust (idx `M.lookup` names)
> in map (\(i, r) -> (getName i, r)) ranks
>

Main function

> main :: IO ()
> main = do
> -- take the packages directory and number of iterations
> (dir:iters:_) <- getArgs
> -- read packages to the map
> packagesMap <- packages dir
> let hackageRankList = pageRankList 0.85
> (fromGraph (enumeratePackages packagesMap))
> lastRank = decodePackages packagesMap
> (fromRankGraph (hackageRankList !! (read iters)))
> sortedRank = sortBy (\(_, r1) (_, r2) -> compare r2 r1) lastRank
> -- printing rank to the screen
> let formatString (n, r) = printf "%10.5f - %s\n" r (B.unpack n)
> mapM_ formatString sortedRank

Some results

Here the Top20 of packages, by HackageRank:

 470.84702 - base
90.73473 - syb
68.48688 - ghc-prim
57.82218 - integer
57.56195 - integer-gmp
57.55739 - rts
57.52356 - integer-simple
45.63995 - array
34.83558 - mtl
34.49726 - bytestring
33.62577 - containers
20.91788 - directory
17.19401 - haskell98
16.18344 - old-time
14.68120 - filepath
14.65891 - old-locale
13.04511 - unix
13.03026 - parsec
11.40593 - random
10.29807 - process

Hm. Doesn’t tell anything :) Let’s try to rank packages, which provides interface to sqlite database:

   0.41222 - HDBC-sqlite3
0.33478 - hsql-sqlite
0.33478 - hsql-sqlite3
0.33024 - bindings-sqlite3
0.31262 - sqlite
0.29853 - haskelldb-hdbc-sqlite3
0.29853 - haskelldb-hsql-sqlite
0.29853 - haskelldb-hsql-sqlite3
0.29853 - hsSqlite3

Much better, now I know, that best tool to work with sqlite is HDBC.

I think, there could be another uses of these calculations, waiting for your ideas.

The complete list of HackageRank calculated for Jan 31, 2010 you can download here.