Google PageRank ::

PageRank algorithm for large data sets

“The PageRank Citation Ranking: Bringing Order to the Web.” Technical Report, 1998

Our setting, for which we want to come up with an algorithm ::

We want to build a search engine ::

  • Query comes in
  • We need to find all documents matching the query
  • We need to decide which selection to display at the top of the list. ==We require a ranking system to evaluate the importance / relevance of answers==

Early attempts in the 1990s looked at the content of the document (website) and counted how often the keyword occurs >> prone to manipulation by inserting invisible headers stuffed with keywords

The new idea by google founders was to look at the link structure of each webpage instead.

Definition PageRank #def :

Primary Idea ::

  • A webpage is important if many important links point to that page
  • A link is important if it comes from an important page

Results of a search query should then be ranked according to importance


Given a directed graph G = (V, E), potentially with edge weights $s_{ij}$, denote the in- and out-degrees of a vertex v by $d_{in}(v)$ and $d_{out}(v)$, where the degree is the number of links going into and out of the website.

Each vertex in the graph is a potential website.

We define the ranking score r(j) for each vertex j: $r(j) = \sum\limits_{i \in \text{parents}(j)} \frac{r(i)}{d_{out}(i)}$

This is an implicit definition. We need to find a way to solve this problem for r(j), for each j >> each value depends on the others, so we cannot evaluate this function without solving the whole system at once.

The higher the score r(v), the more relevant the page v.
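The implicit definition can be solved numerically by fixed-point iteration: start from uniform scores and re-apply the recursion until the values stop changing. A minimal sketch on a hypothetical three-page web (the graph is made up for illustration):

```python
# Hypothetical 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
parents = {0: [2], 1: [0], 2: [0, 1]}   # parents[j] = pages linking to j
d_out   = {0: 2, 1: 1, 2: 1}            # out-degree of each page

r = {j: 1.0 / 3 for j in parents}       # start from the uniform score
for _ in range(100):                     # fixed-point iteration
    r = {j: sum(r[i] / d_out[i] for i in parents[j]) for j in parents}
```

For this graph the iteration converges to r = (0.4, 0.2, 0.4): one can check that these values satisfy the recursion, e.g. r(0) = r(2)/1 = 0.4 and r(1) = r(0)/2 = 0.2.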

Using eigenvectors to rewrite PageRank ::

We define a matrix P with entries

$p_{ij} = \begin{cases} \frac{1}{d_{out}(i)} & \text{if } i \to j \\ 0 & \text{otherwise} \end{cases}$ ![[Pasted image 20221212143030.png]]

Denote by r the row vector with the relevance scores as entries. Observe that the implicit definition above can be written as the matrix multiplication r P, so we are looking for a vector r such that the equation r = r P holds. In other words, we look for an eigenvector, since this property means that multiplication by P only rescales the vector by a constant.

So r is a left eigenvector of P with eigenvalue 1. We extract this eigenvector from our matrix.

The PageRank idea consists of ranking vertices according to this eigenvector. The score at position j is then the value that determines the ranking of page j.
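For a small graph this left eigenvector can be read off directly with NumPy; a sketch, assuming a hypothetical three-page link graph:

```python
import numpy as np

# Hypothetical row-stochastic link matrix P (rows sum to 1)
P = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

# Left eigenvectors of P are right eigenvectors of P.T
vals, vecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(vals - 1.0))   # pick the eigenvalue closest to 1
r = np.real(vecs[:, k])
r = r / r.sum()                      # normalise scores to sum to 1

assert np.allclose(r @ P, r)         # r is indeed a left eigenvector
```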

Random Walk | Irrfahrt :

Random walk on a graph: we start at some vertex and then repeatedly walk to a randomly chosen neighbouring vertex, along an edge. Each outgoing edge is chosen with the same probability.

To give the random surfer interpretation to pagerank, we need to learn about random walks on a graph ::

  1. Consider a directed graph G = (V,E) with n vertices.
  2. A random walk (Irrfahrt) on the graph is a particular stochastic process on the graph. At each point in time, we randomly jump from one vertex to a neighbouring vertex. The probability of jumping to each of the neighbours depends only on the current vertex and on nothing else; in particular, not on anything we have seen in the past.
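A walk with this Markov property can be simulated directly; a sketch on a hypothetical three-vertex graph (names and seed are illustrative choices):

```python
import random
from collections import Counter

# Adjacency list of a small directed graph (hypothetical example)
neighbours = {0: [1, 2], 1: [2], 2: [0]}

def random_walk(start, steps, seed=0):
    rng = random.Random(seed)
    v = start
    visits = Counter()
    for _ in range(steps):
        # Markov property: the next vertex depends only on the current one
        v = rng.choice(neighbours[v])
        visits[v] += 1
    return visits

steps = 100_000
visits = random_walk(0, steps)
freq = {v: c / steps for v, c in visits.items()}
# the relative visit frequencies approximate the stationary distribution
```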

Transition Matrix and initial state ::

A random walk is fully described by a transition matrix P with entries

$p_{ij} = \begin{cases} \frac{1}{d_{out}(i)} & \text{if } i \to j \\ 0 & \text{otherwise} \end{cases}$

For a weighted graph with similarity edge weights $s_{ij}$ we define $p_{ij} = \frac{s_{ij}}{\sum_k s_{ik}}$, and in particular $p_{ij} = 0$ if there is no edge from i to j.

The probability is 0 if we cannot get to the vertex, i.e. if it is not reachable in one step.
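Building the transition matrix from similarity weights amounts to normalising each row; a sketch with a made-up weight matrix:

```python
import numpy as np

# Hypothetical similarity weights s_ij (0 = no edge)
S = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 4.0],
              [1.0, 4.0, 0.0]])

degrees = S.sum(axis=1)       # weighted out-degree d(i) = sum_k s_ik
P = S / degrees[:, None]      # p_ij = s_ij / d(i)

assert np.allclose(P.sum(axis=1), 1.0)  # every row is a probability distribution
```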

Continue at 836 - 840

The random surfer model ::

Recall the definition of the matrix P for pagerank. Observe ::

  • the matrix P is the transition matrix of a random walk on the graph of the internet
  • the ranking vector r is its stationary distribution

If done naively, two big problems could occur:

  • dangling nodes - a page that does not reference any other page, so the walk gets stuck there
  • disconnected components - a page that is not referenced by anyone, yet should be visible and findable

==Solution== :: We introduce a random restart - ==“teleportation”== ::

  • With probability $\alpha$ close to 1, we walk along edges of the graph.
  • With probability $1 - \alpha$ we teleport: this is really rare, so we don’t teleport that often.
    • In the simplest case, we jump to any other webpage with the same probability. The transition matrix of this random walk is then given as $\tilde{P} = \alpha P + (1 - \alpha) \frac{1}{n} \mathbf{1}$, where n is the number of vertices and $\mathbf{1}$ the constant-one matrix. $1 - \alpha$ is a really small probability, so we barely teleport.

With disconnected websites, we only reach them if we teleport to them, so their ranking will usually be low.
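The teleporting transition matrix can be sketched as follows ($\alpha$ = 0.85 is a commonly used damping value, chosen here for illustration):

```python
import numpy as np

alpha = 0.85                         # walk along an edge with probability alpha
P = np.array([[0.0, 0.5, 0.5],       # hypothetical link transition matrix
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
n = P.shape[0]

# teleportation: with probability 1 - alpha jump to a uniformly random page
P_tilde = alpha * P + (1 - alpha) * np.ones((n, n)) / n

assert np.allclose(P_tilde.sum(axis=1), 1.0)  # still a stochastic matrix
```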

Instead of teleporting with a fixed probability in every step, we could also teleport after every N steps, but that makes the analysis more complex because we would have to keep track of which step we are currently at.

We can also introduce a personalized teleportation vector v that specifies the teleportation preferences of the individual user. The transition matrix is then:

$\tilde{P} = \alpha P + (1 - \alpha)\, \mathbf{1} v^{\top}$

This creates the option to personalize a user's rankings, based on their preferences etc.

This is the adaptation influenced by the vector v.

The ranking is then the stationary distribution of this matrix.
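A sketch of the personalized variant, with a made-up preference vector v (the graph and $\alpha$ = 0.85 are also illustrative assumptions):

```python
import numpy as np

alpha = 0.85
P = np.array([[0.0, 0.5, 0.5],       # hypothetical link transition matrix
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
v = np.array([0.7, 0.2, 0.1])        # hypothetical user preference distribution

# each teleport jump now lands on page j with probability v_j
P_pers = alpha * P + (1 - alpha) * np.outer(np.ones(3), v)

assert np.allclose(P_pers.sum(axis=1), 1.0)  # rows remain probability distributions
```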

Power method for computing eigenvectors ::

We need to compute an eigenvector of a matrix of size n x n, where n is the number of webpages on the internet (as of 2018, around two billion webpages).

Computing an eigenvector of a symmetric matrix has a worst-case complexity of about $O(n^3)$ - and even worse, our current matrix is not symmetric.

Our matrix is very sparse >> many websites are not linked to each other.

==simplest way to compute eigenvectors: power method==

  • Let A be any diagonalizable matrix.
  • Goal : we want to compute the eigenvector corresponding to the largest eigenvalue (in absolute value).
  • Observe : Denote by $v_1, \dots, v_n$ a basis of eigenvectors of the matrix A, with eigenvalues $|\lambda_1| > |\lambda_2| \geq \dots \geq |\lambda_n|$. Any vector v can be written as a linear combination $v = \sum_i a_i v_i$ of the eigenvectors. Then
    $A^k v = \sum_i a_i \lambda_i^k v_i = \lambda_1^k \Big( a_1 v_1 + \sum_{i \geq 2} a_i \big( \tfrac{\lambda_i}{\lambda_1} \big)^k v_i \Big)$
    ![[Pasted image 20221212154131.png]] The sum over $i \geq 2$ vanishes because each ratio $|\lambda_i / \lambda_1|$ is a constant smaller than 1, so its k-th power goes to 0 and the influence of all other eigenvectors diminishes. Thus, after normalisation, $A^k v$ converges to the direction of the dominant eigenvector $v_1$.

Power method, vanilla Version ::

Initialize with any random vector $v_0$ with $\|v_0\| = 1$ >> unit norm. While not converged: $v_{t+1} = \frac{A v_t}{\|A v_t\|}$
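The vanilla power method can be sketched as follows (the test matrix and iteration count are illustrative choices):

```python
import numpy as np

def power_method(A, iters=200, seed=0):
    """Approximate the eigenvector of the eigenvalue largest in magnitude."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=A.shape[0])
    v /= np.linalg.norm(v)            # start from a random unit vector
    for _ in range(iters):
        v = A @ v
        v /= np.linalg.norm(v)        # renormalise to avoid over-/underflow
    return v

A = np.array([[2.0, 0.0],
              [0.0, 0.5]])            # eigenvalues 2 and 0.5
v = power_method(A)
# v aligns (up to sign) with the dominant eigenvector (1, 0)
```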

Caveats ::

  • Won’t work if the start vector is orthogonal to the first eigenvector (i.e. $a_1 = 0$)
  • Does not necessarily converge if the multiplicity of the largest eigenvalue is larger than 1.
  • Speed of convergence depends on the gap between the first and second eigenvalue, namely on the ratio $|\lambda_2 / \lambda_1|$

Result of power method for utilization in pagerank ::

Implementation of pagerank is a simple power iteration :

  • Given : random walk matrix P of the graph, personalization vector v of the user
  • We initialise r with the constant vector $(1/n, \dots, 1/n)$
  • We iterate until convergence ::

$r^{(t+1)} = \alpha\, r^{(t)} P + (1 - \alpha)\, v^{\top}$

![[Pasted image 20221212154656.png]] The teleportation vector is simply added in each step, because it appears as a separate term in the transition matrix.
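The PageRank power iteration can be sketched as follows (the graph, $\alpha$ = 0.85, and the uniform teleportation vector are illustrative assumptions):

```python
import numpy as np

def pagerank(P, v, alpha=0.85, iters=100):
    """Power iteration for the teleporting walk: r <- alpha * r P + (1 - alpha) * v."""
    n = P.shape[0]
    r = np.full(n, 1.0 / n)           # initialise with the constant vector
    for _ in range(iters):
        r = alpha * (r @ P) + (1 - alpha) * v
    return r

P = np.array([[0.0, 0.5, 0.5],       # hypothetical link transition matrix
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
v = np.full(3, 1.0 / 3)              # uniform teleportation vector
r = pagerank(P, v)
assert np.isclose(r.sum(), 1.0)      # r stays a probability distribution
```

Note that the sparse product r @ P plus the cheap teleportation correction is exactly why this scales to very large graphs: the dense teleportation matrix is never materialised.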


Result of PageRank ::

We acquired a ranking that scores websites across the internet based on their outgoing / incoming links. However, this alone is not a complete search engine: PageRank still has to be combined with an [[111.99_algo_invertedIndex|Inverted Index]] that retrieves the documents matching the query in the first place.