Wednesday, 11 July 2007

PageRanking in SEO

This article aims to give a working understanding of PageRank in SEO. PageRank is one of the methods Google uses to determine a page’s relevance or importance. Before going into detail, here is the shorthand used in this article.



PR: the PageRank of a page.

Backlink: If page A links out to page B, then page B is said to have a “backlink” from page A.



What is PageRank?


In short, PageRank is a “vote”, by all the other pages on the Web, about how important a page is. A link to a page from any other page counts as a vote of support. The absence of a link is not a vote against the page; it is simply a missing vote of support.


Quoting from the original Google paper, PageRank is defined like this:


Let’s assume page A has pages T1...Tn which point to it (i.e., are backlinks to page A).

The parameter d is a damping factor which can be set between 0 and 1. Usually d is set to 0.85.
C(A) is defined as the number of links going out of page A.



The PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))


Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.



Let us dig deeper.


PR(Tn) : Each page has a notion of its own self-importance. That’s “PR(T1)” for the first page in the web, all the way up to “PR(Tn)” for the last page.
C(Tn) : Each page spreads its vote out evenly amongst all of its outgoing links. The count of outgoing links for page 1 is “C(T1)”, for page n it is “C(Tn)”, and so on for all pages.

PR(Tn)/C(Tn) : If our page (page A) has a backlink from page “n”, the share of the vote page A gets is “PR(Tn)/C(Tn)”.

d : All these fractions of votes are added together but, to stop the other pages having too much influence, this total vote is “damped down” by multiplying it by 0.85 (the factor “d”).

(1 - d) : The (1 - d) bit at the beginning is a bit of probability math magic so the “sum of all web pages' PageRanks will be one”: it adds back the bit lost by the d. It also means that even a page with no links to it (no backlinks) still gets a small PR of 0.15 (i.e. 1 - 0.85). (Aside: the Google paper says “the sum of all pages” but they mean “the normalised sum”, otherwise known as “the average”.)
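To make the pieces of the formula concrete, here is a minimal Python sketch that applies it once to a single page. The function name and the backlink numbers are made up for illustration; only the formula itself comes from the text above.

```python
def pagerank_of(backlinks, d=0.85):
    """Apply PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) once.

    backlinks: list of (pr, outgoing_link_count) pairs, one pair for each
    page T1..Tn that links to page A.
    """
    votes = sum(pr / count for pr, count in backlinks)
    return (1 - d) + d * votes

# Hypothetical page A with two backlinks (made-up numbers):
# T1 has PR 1.0 and 2 outgoing links, T2 has PR 0.5 and 1 outgoing link.
pr_a = pagerank_of([(1.0, 2), (0.5, 1)])
print(pr_a)  # 0.15 + 0.85 * (0.5 + 0.5), i.e. about 1.0

# A page with no backlinks still receives the (1 - d) share, about 0.15.
print(pagerank_of([]))
```

Each backlink contributes its PR divided by its own outgoing link count, so a link from a page with many outgoing links is worth less than a link from an equally ranked page with few.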



How is PageRank Calculated?

It’s clear from the formula that the PR of a page depends on the PR of the pages pointing to it. But we won’t know what PR those pages have until the pages pointing to them have their PR calculated, and so on. And when you consider that page links can form circles, it seems impossible to do this calculation!
It’s really not as difficult as it seems. According to the Google paper, we can just go ahead and calculate a page’s PR without knowing the final value of the PR of the other pages. That seems strange but, basically, each time we run the calculation we get a closer estimate of the final value. So all we need to do is remember each value we calculate and repeat the calculations many times until the numbers stop changing much.
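The "repeat until the numbers stop changing much" idea can be sketched in a few lines of Python. This is only an illustration, not Google's implementation: the function name, the tolerance, and the three-page link structure are assumptions. It also updates all pages at once from the previous round's values, which is one of several valid update orders.

```python
def iterate_pagerank(links, d=0.85, start=1.0, tolerance=1e-9, max_rounds=1000):
    """links maps every page to the list of pages it links to."""
    pr = {page: start for page in links}
    for round_number in range(1, max_rounds + 1):
        new_pr = {}
        for page in links:
            # Sum PR(T)/C(T) over every page T that links to this page.
            votes = sum(pr[q] / len(links[q]) for q in links if page in links[q])
            new_pr[page] = (1 - d) + d * votes
        # Largest change of any single page in this round.
        change = max(abs(new_pr[p] - pr[p]) for p in links)
        pr = new_pr
        if change < tolerance:
            break
    return pr, round_number

# Hypothetical three-page site whose links form circles.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks, rounds = iterate_pagerank(links)
print(rounds, ranks)
```

Even though A depends on C, C depends on A and B, and B depends on A, the loop simply keeps re-estimating until the largest change in a round falls below the tolerance.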




Let’s take the simplest example network: two pages, each pointing to the other:





Each page has one outgoing link (the outgoing count is 1, i.e. C(A) = 1 and C(B) = 1).



Guess 1



We don’t know what their PR should be to begin with, so let’s take a guess at 1.0 and do some calculations:

PR(A) = 0.15 + 0.85 * 1

= 1

PR(B) = 0.15 + 0.85 * 1

= 1

The numbers aren’t changing at all! So it looks like our starting guess was already the settled value.


Guess 2


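Let’s start the guess at 0 instead and re-calculate:

PR(A) = 0.15 + 0.85 * 0

= 0.15


PR(B) = 0.15 + 0.85 * 0.15

= 0.2775


And again:

PR(A) = 0.15 + 0.85 * 0.2775

= 0.385875

PR(B) = 0.15 + 0.85 * 0.385875

= 0.47799375


And again:

PR(A) = 0.15 + 0.85 * 0.47799375

= 0.5562946875

PR(B) = 0.15 + 0.85 * 0.5562946875

= 0.622850484375


And so on. The numbers just keep going up. But will the numbers stop increasing when they get to 1.0? What if a calculation over-shoots and goes above 1.0?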


Guess 3


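Well, let’s see. Let’s start the guess at 40 each and do a few cycles:


PR(A) = 40 PR(B) = 40


First calculation:

PR(A) = 0.15 + 0.85 * 40

= 34.15

PR(B) = 0.15 + 0.85 * 34.15

= 29.1775


And again:

PR(A) = 0.15 + 0.85 * 29.1775

= 24.950875

PR(B) = 0.15 + 0.85 * 24.950875

= 21.35824375


Clearly those numbers are heading down. It sure looks like the numbers will get to 1.0 and stop.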


Principle: it doesn’t matter where you start your guess; once the PageRank calculations have settled down, the “normalized probability distribution” (the average PageRank for all pages) will be 1.0.
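The three guesses can be replayed with a short Python sketch. It assumes the same update order as the worked numbers for Guess 2, where PR(B) is calculated from the freshly updated PR(A); the function name and the number of rounds are arbitrary choices.

```python
def two_page_pagerank(guess, d=0.85, rounds=200):
    """Pages A and B link only to each other, so C(A) = C(B) = 1."""
    pr_a = pr_b = guess
    history = []
    for _ in range(rounds):
        pr_a = (1 - d) + d * pr_b  # A's only backlink is B
        pr_b = (1 - d) + d * pr_a  # B uses the freshly updated PR(A)
        history.append((pr_a, pr_b))
    return history

for guess in (1.0, 0.0, 40.0):
    history = two_page_pagerank(guess)
    print(f"start {guess}: first round {history[0]}, last round {history[-1]}")
```

Whatever the starting guess, the gap between each value and 1.0 shrinks by a factor of 0.85 at every step, which is why all three runs settle at 1.0.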



So let’s take a look at some examples and study how PageRank is affected in various scenarios. The PR values mentioned in the examples are calculated according to the formula given above.

Example 2
A simple hierarchy with some outgoing links





As you’d expect, the home page has the most PR; after all, it has the most incoming links. But something has gone wrong: the average PR is not 1.0, as the principle above said it should be.


Why is it so? Take a look at the “external site” pages – what’s happening to their PageRank? They’re not passing it on, they’re not voting for anyone, they’re wasting theirs.

Example 3


Let’s link those external sites back into our home page just so we can see what happens to the average…




Look at the PR of our home page! All those incoming links sure make a difference.
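The effect in Examples 2 and 3 can be sketched in Python. The five-page layout below is a small made-up site, not the layout from the original diagrams, and the function names are assumptions; the point is only to compare the average PR before and after the external pages link back.

```python
def pagerank(links, d=0.85, rounds=200):
    """links maps each page to the pages it links to; every page starts at 1.0."""
    pages = set(links) | {t for out in links.values() for t in out}
    pr = dict.fromkeys(pages, 1.0)
    for _ in range(rounds):
        votes = dict.fromkeys(pages, 0.0)
        for page, out in links.items():
            for target in out:
                votes[target] += pr[page] / len(out)  # even split of the vote
        pr = {p: (1 - d) + d * votes[p] for p in pages}
    return pr

def average(pr):
    return sum(pr.values()) / len(pr)

# Hypothetical site: the two external pages receive votes but cast none.
leaky = {
    "home": ["about", "links"],
    "about": ["home"],
    "links": ["home", "ext1", "ext2"],
}
# Same site, but the external pages now link back to the home page.
closed = dict(leaky, ext1=["home"], ext2=["home"])

for name, site in (("leaky", leaky), ("closed", closed)):
    pr = pagerank(site)
    print(name, "average:", round(average(pr), 3), "home:", round(pr["home"], 3))
```

In the first layout the votes handed to the external pages go nowhere, so the average falls well below 1.0. Once those pages vote again, the average returns to 1.0 and most of the recovered PR lands on the home page.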


Example 4


A simple hierarchy






Our home page has 2 and a half times as much PR as the child pages.


Observation: a hierarchy concentrates votes and PR into one page
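A quick Python check of the simple hierarchy, assuming the layout is a home page linking to three child pages ("About", "Product" and "More") that each link straight back to it. The original diagram is not reproduced here, so that layout is an assumption, as is the function name.

```python
def pagerank(links, d=0.85, rounds=200):
    """links maps each page to the pages it links to; every page starts at 1.0."""
    pages = set(links) | {t for out in links.values() for t in out}
    pr = dict.fromkeys(pages, 1.0)
    for _ in range(rounds):
        votes = dict.fromkeys(pages, 0.0)
        for page, out in links.items():
            for target in out:
                votes[target] += pr[page] / len(out)
        pr = {p: (1 - d) + d * votes[p] for p in pages}
    return pr

# Assumed layout: Home links to three children, each child links back to Home.
hierarchy = {
    "Home": ["About", "Product", "More"],
    "About": ["Home"],
    "Product": ["Home"],
    "More": ["Home"],
}
ranks = pagerank(hierarchy)
for page in hierarchy:
    print(page, round(ranks[page], 2))
```

Under this assumption the home page settles at about 1.92, the figure quoted for example 4 later in this article, while each child ends up with roughly 0.69.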




Example 5


Looping





All the pages have the same number of incoming links, all pages are of equal importance to each other, all pages get the same PR of 1.0 (i.e. the “average” probability).


Example 6


Extensive Interlinking – or Fully Meshed




The results are the same as the Looping example above and for the same reasons.
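Both the loop and the fully meshed layout can be checked with a small Python sketch. The four page names and the starting guess of 0.0 are assumptions chosen for illustration.

```python
def pagerank(links, d=0.85, start=0.0, rounds=200):
    """links maps each page to the pages it links to."""
    pages = set(links) | {t for out in links.values() for t in out}
    pr = dict.fromkeys(pages, start)
    for _ in range(rounds):
        votes = dict.fromkeys(pages, 0.0)
        for page, out in links.items():
            for target in out:
                votes[target] += pr[page] / len(out)
        pr = {p: (1 - d) + d * votes[p] for p in pages}
    return pr

pages = ["Home", "About", "Product", "More"]
# Loop: each page links only to the next one, and the last links to the first.
loop = {p: [pages[(i + 1) % len(pages)]] for i, p in enumerate(pages)}
# Fully meshed: each page links to every other page.
mesh = {p: [q for q in pages if q != p] for p in pages}

print("loop:", pagerank(loop))
print("mesh:", pagerank(mesh))
```

In both layouts every page receives exactly one full vote in total (one whole vote in the loop, three thirds in the mesh), so every page settles at 1.0.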


Example 7


Hierarchical – but with a link in and one out.


We’ll assume there’s an external site that has lots of pages and links with the result that one of the pages has the average PR of 1.0. We’ll also assume that there’s just one link from that page and it’s pointing at our home page.





In example 4 the home page only had a PR of 1.92 but now it is 3.31!
Not only has site A contributed 0.85 PR to us, but the raised PR in the “About”, “Product” and “More” pages has had a lovely “feedback” effect, pushing up the home page’s PR even further!

Principle: a well structured site will amplify the effect of any contributed PR.
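The 3.31 figure can be reproduced with a Python sketch under two assumptions that the missing diagram would normally settle: the outbound link leaves from the “Product” page (as the discussion of examples 8 and 9 suggests), and Site A is modelled as a fixed outside vote of 1.0 rather than as a page in the calculation. The `incoming` parameter is a hypothetical helper for that outside vote.

```python
def pagerank(links, incoming=None, d=0.85, rounds=200):
    """incoming maps a page to votes arriving from pages outside `links`."""
    incoming = incoming or {}
    pages = set(links) | {t for out in links.values() for t in out}
    pr = dict.fromkeys(pages, 1.0)
    for _ in range(rounds):
        votes = {p: incoming.get(p, 0.0) for p in pages}
        for page, out in links.items():
            for target in out:
                votes[target] += pr[page] / len(out)
        pr = {p: (1 - d) + d * votes[p] for p in pages}
    return pr

site = {
    "Home": ["About", "Product", "More"],
    "About": ["Home"],
    "Product": ["Home", "Site B"],  # assumption: the outbound link is here
    "More": ["Home"],
}
# Site A's page has PR 1.0 and a single link, so it passes a vote of 1.0 / 1.
ranks = pagerank(site, incoming={"Home": 1.0})
print(round(ranks["Home"], 2))
```

Site A's direct contribution is only 0.85 (the vote of 1.0 after damping), yet the home page rises from 1.92 to about 3.31 because the child pages pass most of their own raised PR back up.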

Example 8


Looping – but with a link in and a link out






Well, the PR of our home page has gone up a little, but what’s happened to the “More” page?
The vote of the “Product” page has been split evenly between it and the external site. We now value the external Site B equally with our “More” page. The “More” page is getting only half the vote it had before – this is good for Site B but very bad for us!


Example 9


Fully meshed – but with one vote in and one vote out




That’s much better. The “More” page is still getting a smaller share of the vote than in example 7 of course, but now the “Product” page has kept three quarters of its vote within our site, unlike example 8 where it was giving away fully half of its vote to the external site!
Keeping just this small extra fraction of the vote within our site has had a very nice effect on the home page too: a PR of 2.28 compared with just 1.66 in example 8.


Observation: increasing the internal links in your site can minimize the damage to your PR when you give away votes by linking to external sites.
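Examples 8 and 9 can be compared side by side in Python. The layouts are reconstructed from the descriptions above, so treat them as assumptions: a four-page loop versus a four-page full mesh, each with a vote of 1.0 arriving at the home page from Site A and one extra link from “Product” to Site B. The `incoming` parameter is a hypothetical helper for the outside vote.

```python
def pagerank(links, incoming=None, d=0.85, rounds=200):
    """incoming maps a page to votes arriving from pages outside `links`."""
    incoming = incoming or {}
    pages = set(links) | {t for out in links.values() for t in out}
    pr = dict.fromkeys(pages, 1.0)
    for _ in range(rounds):
        votes = {p: incoming.get(p, 0.0) for p in pages}
        for page, out in links.items():
            for target in out:
                votes[target] += pr[page] / len(out)
        pr = {p: (1 - d) + d * votes[p] for p in pages}
    return pr

pages = ["Home", "About", "Product", "More"]

# Example 8 (assumed): a loop, with Product also linking out to Site B.
loop = {"Home": ["About"], "About": ["Product"],
        "Product": ["More", "Site B"], "More": ["Home"]}

# Example 9 (assumed): fully meshed, with Product also linking out to Site B.
mesh = {p: [q for q in pages if q != p] for p in pages}
mesh["Product"] = mesh["Product"] + ["Site B"]

loop_pr = pagerank(loop, incoming={"Home": 1.0})
mesh_pr = pagerank(mesh, incoming={"Home": 1.0})
for page in pages:
    print(page, round(loop_pr[page], 2), round(mesh_pr[page], 2))
```

In the loop, “Product” has two outgoing links and so gives half its vote away. In the mesh it has four, so the same external link costs only a quarter, and the home page ends up at about 2.28 instead of 1.66.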


Principle: If a particular page is highly important – use a hierarchical structure with the important page at the “top”.
Where a group of pages may contain outward links – increase the number of internal links to retain as much PR as possible.
Where a group of pages do not contain outward links – the number of internal links in the site has no effect on the site’s average PR. You might as well use a link structure that gives the user the best navigational experience.

Site Maps


Site maps are useful in at least two ways:
If a user types in a bad URL, most websites return a really unhelpful “404 – page not found” error page. This can be discouraging. Why not configure your server to return a page that shows an error has been made, but also gives the site map? This can help the user enormously.
Linking to a site map on each page increases the number of internal links in the site, spreading the PR out and protecting you against your vote “donations”.


Example 10


A common web layout for long documentation is to split the document into many pages with a “Previous” and “Next” link on each plus a link back to the home page. The home page then only needs to point to the first page of the document.



In this simple example, where there’s only one document, the first page of the document has a higher PR than the Home Page! This is because page B is getting all the vote from page A, but page A is only getting fractions of pages B, C and D.
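This layout can be sketched in Python, assuming a three-page document: page A is the home page, and pages B, C and D are the document pages with “Previous”, “Next” and home links. The exact layout of the original diagram is not available, so this structure is an assumption.

```python
def pagerank(links, d=0.85, rounds=200):
    """links maps each page to the pages it links to; every page starts at 1.0."""
    pages = set(links) | {t for out in links.values() for t in out}
    pr = dict.fromkeys(pages, 1.0)
    for _ in range(rounds):
        votes = dict.fromkeys(pages, 0.0)
        for page, out in links.items():
            for target in out:
                votes[target] += pr[page] / len(out)
        pr = {p: (1 - d) + d * votes[p] for p in pages}
    return pr

document_site = {
    "A": ["B"],            # home page points only at the first document page
    "B": ["A", "C"],       # home, next
    "C": ["A", "B", "D"],  # home, previous, next
    "D": ["A", "C"],       # home, previous
}
ranks = pagerank(document_site)
for page in sorted(ranks):
    print(page, round(ranks[page], 2))
```

Page B receives the whole of page A's vote plus a third of page C's, while page A only collects a half or a third of each document page's vote, which is why B comes out ahead in this sketch.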


Principle: in order to give users of our site a good experience, we may have to take a hit against our PR. There’s nothing we can do about this - and neither should we try to or worry about it! If our site is a pleasure to use lots of other webmasters will link to it and we’ll get back much more PR than we lost.


We can also see a trend between this and the previous example. As we add more internal links to a site, it gets closer to the Fully Meshed example where every page gets the average PR for the mesh.

Observation: as we add more internal links in our site, the PR will be spread out more evenly between the pages.


Example 11


Getting high PR the wrong way and the right way.
Just as an experiment, let’s see if we can get 1,000 pages pointing to our home page, but only have one link leaving it…








Those Spam pages are pretty worthless but they sure add up!
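The arithmetic behind this experiment can be sketched in Python. The layout is an assumption, since the original diagram is not reproduced here: the home page's single link goes to a hypothetical "Hub" page, the hub links to 1,000 spam pages, and every spam page links back to the home page.

```python
def pagerank(links, d=0.85, rounds=200):
    """links maps each page to the pages it links to; every page starts at 1.0."""
    pages = set(links) | {t for out in links.values() for t in out}
    pr = dict.fromkeys(pages, 1.0)
    for _ in range(rounds):
        votes = dict.fromkeys(pages, 0.0)
        for page, out in links.items():
            for target in out:
                votes[target] += pr[page] / len(out)
        pr = {p: (1 - d) + d * votes[p] for p in pages}
    return pr

spam_pages = [f"Spam {i}" for i in range(1, 1001)]
links = {"Home": ["Hub"], "Hub": spam_pages}      # assumed layout
links.update({page: ["Home"] for page in spam_pages})

ranks = pagerank(links)
average = sum(ranks.values()) / len(ranks)
print("home:", round(ranks["Home"], 1))
print("one spam page:", round(ranks["Spam 1"], 3))
print("average:", round(average, 3))
```

Each spam page on its own is worth well under 1.0, but a thousand of them voting only for the home page push it into the hundreds, while the average over all 1,002 pages stays at 1.0.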


Observation: it doesn’t matter how many pages you have in your site, your average PR will always be 1.0 at best. But a hierarchical layout can strongly concentrate votes, and therefore the PR, into the home page!
This is a technique used by some disreputable sites (mostly adult content sites). If Google’s robots decide you’re doing this there’s a good chance you’ll be banned from Google!

On the other hand there are at least two right ways to do this:


1. Be a Mega-site


Mega-sites, like http://news.bbc.co.uk/, have tens or hundreds of editors writing new content (i.e. new pages) all day long! Each one of those pages has rich, worthwhile content of its own and a link back to its parent or the home page! That’s why the home page Toolbar PR of these sites is 9/10 and the rest of us just get pushed lower and lower by comparison…
Principle: Content Is King! There really is no substitute for lots of good content…


2. Give away something useful


http://www.phpbb.com/ has a Toolbar PR of 8/10 and it has no big money or marketing behind it! How can this be?
What the group has done is write a very useful bulletin board system that is becoming very popular on many websites. And at the bottom of every page, in every installation, is this HTML code:
Powered by phpBB
The administrator of each installation can remove that link, but most don’t because they want to return the favour…
Imagine all those millions of pages giving a fraction of a vote to http://www.phpbb.com/!
Principle: make it worth other people’s while to use your content or tools. If your give-away is good enough, other site admins will gladly give you a link back.
Principle: it’s probably better to get lots (perhaps thousands) of links from sites with small PR than to spend any time or money desperately trying to get just the one link from a high PR page.



Finally


PageRank is, in fact, very simple. But when a simple calculation is applied hundreds (or billions) of times over the results can seem complicated.



Reference:
This article is extracted from a paper written by Ian Rogers. He has been a Senior Research Fellow in User Interface Design and a consultant in Network Security and Database Backed Websites.
It was sponsored by IPR Computing Ltd – specialists in Secure Networks and Database Backed Websites