Saturday, May 29, 2010

Released new version of ProjectTemplate. Now on Windows.

Recently I've uploaded new, more powerful version of my ProjectTemplate tool. Now it could create from templates single files, and don't require the project directory to be created. More info in the README file.

Also I've added support for Windows. It is not so beauty as for Linux platforms (on Windows it works without realine functionality). The Windows binaries are stored on Google Code.

Ironically, but ProjectTemplate helps me more in C++ coding, not in Haskell ;)

Tuesday, March 9, 2010

On creating Haskell projects. Cabal init.

In comments to my previous article on reddit I read about new feature in the cabal-install package (version 0.8.0) - cabal init. This is more convenient successor for mkcabal and I glad it now seamlessly integrated to this powerful project management tool.

This tool also supports rich command line interface, to avoid entering some common properties (such as license, version etc.) every time.

Friday, March 5, 2010

ProjectTemplate - a template-based tool for creating Haskell projects

Yesterday I created 5 projects, and found many boilerplate code on preparing cabal files, Setup.hs and directory structure. So I decided to simplify project creation for haskellers and created ProjectTemplate.

I was inspired by the Don’s mkcabal tool, and created the similar dialog-based application.

mkcabal, created by Don, is very simple tool, and it helps, when you start using Haskell, but this tool hasn’t any configuration and personalization of Haskell project, so I every time after creating project with mkcabal make some changes in the directory structure and file names.

ProjectTemplate is the template-based project creation tool. It was developed for ease creation of Haskell projects, but now I see, it could ease creation different directory structures, based on templates (for example web sites).

Templates are directories with files it them, which will be copied to the newly created project. Each template could contain complex directory hierarchy. Each file/directory name could be the plain name, or template name. Template name is created from the double underscore and parameter name, and in the project directory is substituted to the parameter value.

For example __ProjectName.cabal file with ProjectName parameter set to test, will be substituted to test.cabal. The same is true for directories.

ProjectTemplate also fills template files, where substitutes parameters to their values. The template file should have .template extension, which will be cut of in the destination file. In the template file parameters are enclosed in $ character. The example cabal file, with template parameters could look like:

name:                $ProjectName$
version: 0.1
category: $Category$
license: BSD3
license-file: LICENSE
author: $Author$
maintainer: $Maintainer$
build-depends: base
build-type: Simple

executable: $ProjectName$
main-is: src/Main.hs

ProjectTemplate defines following parameters: ProjectName, Day, Month, Year. All other parameters will be asked to user during the project creation.

You could look at sample templates in the ProjectTemplate repository, to better understanding how this tool works.

Monday, February 1, 2010

Using PageRank to calculate HackageRank

Previous article was introductory to the PageRank algorithm. Today I want to show one of the practical use of previous library.


Hackage grows every day. Because of its centralized structure and the cabal tool, it is easy to install packages with all dependencies with single command.

But I always wonder, what packages are most important, and obviously present at almost all PC’s with Haskell. Sometimes looks strange, when simple library with several functions depends on more than 5 packages. Even when I have installed most popular packages, because my Haskell environment haven’t changed last year. Also I always have troubles with choosing the right library for Web server, or for work with database.

So I decided to apply PageRank algorithm to the Hackage. PageRank shows the probability of the random surfer to visit single page. Similarly HackageRank must show the probability for single package to be present on the random PC of the random haskeller.

The code

> module Main where
> -- use distribution package to read dependencies
> import Distribution.PackageDescription
> import Distribution.PackageDescription.Parse
> import Distribution.PackageDescription.Configuration
> import Distribution.Verbosity
> import Distribution.Version
> import Distribution.Package
> -- file utils, to enumerate directories
> import System.Directory
> import System.FilePath
> import System.Posix.Files
> import System.Environment
> import Control.Monad
> import Control.Applicative
> import Control.Arrow
> import qualified Data.ByteString.Char8 as B
> import qualified Data.Map as M
> import Data.Map (size, keys, fromListWith, toList)
> import qualified Data.Set as S
> import Data.Set (empty, union, fromList, toList)
> import Data.Maybe
> import Data.List (sortBy)
> import Text.Printf
> import Rank

Gathering all cabal files

All packages descriptions could be downloaded from the Hackage.

The path to the unpacked archive will be send through command line argument.

> isDir = liftA isDirectory . getFileStatus
> notDir = liftA not . isDir
> directoryListing p = (map (p </>) . filterUpperDirs) <$>
> getDirectoryContents p
> where
> filterUpperDirs = filter (`notElem` [".", ".."])
> cabalFiles root = do
> dir <- directoryListing root
> directories <- filterM isDir dir
> files <- filterM notDir dir
> subfiles <- concat <$> mapM cabalFiles directories
> return (files ++ subfiles)

Read package names and dependencies

I pack all strings into byte strings, to reduce memory usage (the PageRank algorithm itself heavily uses memory).

> packName pkg = let (PackageName pkgName) = packageName pkg
> in B.pack pkgName
> dependencyName (Dependency (PackageName name) _) = B.pack name
> packageDependencies pkg =
> let directDeps = buildDepends pkg
> allBuildInfoDeps =
> concatMap (uncurry (++) . (buildTools &&& pkgconfigDepends))
> (allBuildInfo pkg)
> in map dependencyName (directDeps ++ allBuildInfoDeps)
> parsePackageFile cabal = do
> pkg <- flattenPackageDescription <$> readPackageDescription silent cabal
> let deps = packageDependencies pkg
> return ((packName pkg, S.fromList deps) : [(x, S.empty) | x <- deps])

Generate the list of all packages and their dependencies

enumeratePackages simply maps each package name to integer value. Previous versions of PageRank algorithm run out of memory, when I tried to use strings in vertex names.

> -- we have a list of elements in the form of: [(name, deps set)]
> -- the name could be repeated in the list, so let's convert it to
> -- map, and combine all sets
> packages root = do
> largeList <- concat <$> (mapM parsePackageFile =<< cabalFiles root)
> return (M.fromListWith (S.union) (filter (not . B.null . fst) largeList))
> nameToIdx pkgsMap = M.fromList (zip (M.keys pkgsMap) ([0..]::[Int]))
> idxToName pkgsMap = M.fromList (zip ([0..]::[Int]) (M.keys pkgsMap))
> enumeratePackages pkgsMap =
> let indexes = nameToIdx pkgsMap
> getIndex name = fromJust (name `M.lookup` indexes)
> enumerator (n, d) = (getIndex n, map getIndex d)
> in map enumerator (M.toList ( S.toList pkgsMap))

Decode indexes to package names

Last function just simply decodes the integer values to the corresponding packages names.

> decodePackages pkgsMap ranks = 
> let names = idxToName pkgsMap
> getName idx = fromJust (idx `M.lookup` names)
> in map (\(i, r) -> (getName i, r)) ranks

Main function

> main :: IO ()
> main = do
> -- take the packages directory and number of iterations
> (dir:iters:_) <- getArgs
> -- read packages to the map
> packagesMap <- packages dir
> let hackageRankList = pageRankList 0.85
> (fromGraph (enumeratePackages packagesMap))
> lastRank = decodePackages packagesMap
> (fromRankGraph (hackageRankList !! (read iters)))
> sortedRank = sortBy (\(_, r1) (_, r2) -> compare r2 r1) lastRank
> -- printing rank to the screen
> let formatString (n, r) = printf "%10.5f - %s\n" r (B.unpack n)
> mapM_ formatString sortedRank

Some results

Here the Top20 of packages, by HackageRank:

 470.84702 - base
90.73473 - syb
68.48688 - ghc-prim
57.82218 - integer
57.56195 - integer-gmp
57.55739 - rts
57.52356 - integer-simple
45.63995 - array
34.83558 - mtl
34.49726 - bytestring
33.62577 - containers
20.91788 - directory
17.19401 - haskell98
16.18344 - old-time
14.68120 - filepath
14.65891 - old-locale
13.04511 - unix
13.03026 - parsec
11.40593 - random
10.29807 - process

Hm. Doesn’t tell anything :) Let’s try to rank packages, which provides interface to sqlite database:

   0.41222 - HDBC-sqlite3
0.33478 - hsql-sqlite
0.33478 - hsql-sqlite3
0.33024 - bindings-sqlite3
0.31262 - sqlite
0.29853 - haskelldb-hdbc-sqlite3
0.29853 - haskelldb-hsql-sqlite
0.29853 - haskelldb-hsql-sqlite3
0.29853 - hsSqlite3

Much better, now I know, that best tool to work with sqlite is HDBC.

I think, there could be another uses of these calculations, waiting for your ideas.

The complete list of HackageRank calculated for Jan 31, 2010 you can download here.

Sunday, January 31, 2010

Using Haskell to calculate PageRank

PageRank algorithm was originally developed by the Google’s founders for ranking pages by their links. But the idea of ranking, based on references to the items could be used not only in Web, but in other different fields, which have directed graph structure. This article, and simultaneously Literate Haskell code, shows how to calculate page rank with Haskell.

PageRank algorithm

PageRank algorithm uses random surfer model. It calculates the probability of visiting concrete page by some surfer, who randomly clicks on different links on the page.

Originally PageRank for page A is the:


where LC(Pi) — is number of outgoing links from Pi, and Pi is all pages, which links to the A.

When page hasn’t outgoing links, that we suppose, that it connects to every other page in the Web.

Also PageRank includes damping factor — a probability, that random surfer will continue clicking links on each step. Usually damping factor equals to 0.85.

The previous PageRank formula with damping factor D will look:


Because, the link graph have cycles (i.e. PR(A) could indirectly depend on PR(A)), than calculating on page rank is iterative — on each step we recalculate whole graph, getting it closer to real PageRank value.

Thus the algorithm should look like:

  1. Initialize PR(Ai) for all Ai with arbitrary value, simply 1

  2. Calculate PR(Ai) for all Ai

  3. Repeat step 2 specified times, or until the error will be relatively small


> module Main where
> import System.Random
> import Control.Monad
> import Control.Applicative
> import Data.Maybe
> import System.Environment
> import qualified Data.IntMap as M
> import qualified Data.IntSet as S

Convert input data to more convenient structure

Our initial input data will have the form:

[(vertex, [vertexes to which current vertex links])]

vertexes, are integers, for more efficient implementation.

> -- Each vertex has its rank, number of outbound links and 
> -- set of inbound links
> data VertexProps = VertexProps {
> vertexRank :: Double,
> numOutboundLinks :: Int,
> inboundLinks :: S.IntSet
> } deriving Show
> isDangling (VertexProps _ 0 _) = True
> isDangling _ = False
> -- Our graph is map between vertex index and it properties
> type RankGraph = M.IntMap VertexProps
> fromGraph :: [(Int, [Int])] -> RankGraph
> fromGraph gr =
> let depPairs (v, l) = (v, VertexProps 1 (length l) S.empty) :
> [(x, VertexProps 0 0 (S.singleton v)) |
> x <- l]
> depList = concatMap depPairs gr
> catVertexes (VertexProps r1 l1 s1) (VertexProps r2 l2 s2) =
> VertexProps (r1+r2) (l1+l2) (s1 `S.union` s2)
> in M.fromListWith catVertexes depList

Implement single-step PageRank algorithm

> pageRank :: Double -> RankGraph -> RankGraph
> pageRank df gr = M.mapWithKey transformRank gr
> where transformRank v (VertexProps r l s) =
> VertexProps (calcRank s) l s
> -- calculate rank for a single vertex
> calcRank s = (1 - df) + df * (danglingScore + (S.fold incomR 0 s))
> -- sum of all linkVotes for vertixe vrt
> incomR vrt r = r + (linkVote vrt)
> linkVote v = let lvf (VertexProps r l _) = r / (fromIntegral l)
> in maybe 0 lvf (v `M.lookup` gr)
> -- calculate the dangling score
> -- sum all dangling page ranks and divide them to graph size
> danglingPages = filter (isDangling . snd) (M.toList gr)
> danglingScore = (sum (map (vertexRank . snd) danglingPages)) /
> (fromIntegral (M.size gr))
> pageRankList df gr = iterate (pageRank df) gr
> -- Convert from RankGraph to list [(id, rank)]
> fromRankGraph = map (\(v, r) -> (v, vertexRank r)) . M.toList


Let’s test our creation. I create simple main program, that will read graph size from the command-line and calculate rank with 20 iterations. To force evaluation, we’ll just print last element from the result of the last iteration. Here I only interested in algorithm’s efficiency.

Generate test graph

> createTestGraph :: Int -> Int -> IO [(Int, [Int])]
> createTestGraph numPages maxLinks =
> sequence [(,) p <$> outboundLinks p | p <- [1..numPages]]
> where
> outboundLinks p = do
> outLinksNum <- randomRIO (0, maxLinks)
> sequence [randomRangeNot p (1, numPages) |
> x <- replicate outLinksNum 0]
> randomRangeNot p (a, b) = do
> v <- randomRIO (a, b - 1)
> return (if v == p then b else v)
> main :: IO ()
> main = do
> (gSize:gLinks:_) <- getArgs
> rank <- pageRankList 0.85 . fromGraph <$>
> createTestGraph (read gSize) (read gLinks)
> putStrLn ("Graph size: " ++ gSize ++ "x" ++ gLinks)
> putStrLn ("20th iteration result: " ++
> (show (M.toList (rank !! 20) !! ((read gSize) - 1))))

My algorithm works 7 seconds to calculate 20 iterations of PageRank for a graph with 10000 vertexes. Not bad, but I will be glad to reduce the memory usage because 150MB is too big for the tiny set of data.