Computes the personalized pagerank for specified seeds using the
ApproximatePageRank
algorithm of Andersen et al. (2006). Computes
degree-adjustments and degree-regularization of personalized
pagerank vectors as described in Algorithms 3 and 4 of Chen et al. (2019).
These algorithms are randomized; if results are unstable across
multiple runs, decrease epsilon
.
Usage
# S3 method for neocache_graph
appr(
graph,
seeds,
...,
alpha = 0.15,
epsilon = 1e-06,
tau = NULL,
max_visits = Inf
)
Arguments
- graph
An
abstract_graph()
object, such as that created byrtweet_graph()
. This argument is required.- seeds
A character vector of seeds for the personalized pagerank. The personalized pagerank will return to each of these seeds with probability
alpha
at each node transition. At the moment, all seeds are given equal weighting. This argument is required.- ...
Ignored. Passing arguments to
...
results in a warning.- alpha
Teleportation constant. The teleportation constant is the probability of returning to a seed node at each node transition.
alpha
must be a valid probabilty; that is, between zero and one. Defaults to0.15
. This is the inverse of the "dampening factor" in the original PageRank paper, soalpha = 0.15
corresponds to a dampening factor of0.85
. Runtime is proportional to1 / (epsilon * alpha)
, so smallalpha
can result in long runtimes.- epsilon
Desired accuracy of approximation.
epsilon
must be a small positive number. Defaults to1e-6
.aPPR
guarantees that approximated personalized pageranks are uniformly withinepsilon
of their true value. That is, the approximation is guaranteed to be good in an L-infinity sense. This does not guarantee, however, that a ranking of nodes by aPPR is close to a ranking of nodes by PPR.For Twitter graphs, we recommend testing your code with
1e-4
or1e-5
, using1e-6
for exploration, and1e-7
to1e-8
for final results, although these numbers are very rough. It also perfectly reasonable to runaPPR
for a given number of steps (set viamax_visits
), and then note the approximation accuracy of your results. Internally,aPPR
keeps a running estimate of achieved accuracy that is always valid.Anytime you would like to explore more of the graph, you can simply decrease
epsilon
. So you can start withepsilon = 1e-5
and then gradually decreaseepsilon
until you have a sample of the graph that you are happy with.Also note that runtime is proportional to
1 / (epsilon * alpha)
, so smallepsilon
can result in long runtimes.- tau
Regularization term. Additionally inflates the in-degree of each observation by this term by performing the degree adjustment described in Algorithm 3 and Algorithm 4, which are described in
vignette("Mathematical details")
. Defaults toNULL
, in which casetau
is set to the average in-degree of the observed nodes. In general, setting it's reasonable to settau
to the average in-degree of the graph.- max_visits
Maximum number of unique nodes to visit. Should be a positive integer. Defaults to
Inf
, such that there is no upper bound on the number of unique nodes to visit. Useful when you want to specify a fixed amount of computation (or API calls) to use rather than an error tolerance. We recommend debugging withmax_visits ~ 20
, exploration withmax_visits
in the hundreds, andmax_visits
in the thousands to ten of thousands for precise results, although this is a very rough heuristic.
Value
A Tracker()
object. Most relevant is the stats
field,
a tibble::tibble()
with the following columns:
name
: Name of a node (character).p
: Current estimate of residual per out-degree for a node.r
: Estimated error of pagerank estimate for a node.in_degree
: Number of incoming edges to a node.out_degree
: Number of outcoming edges from a node.degree_adjusted
: The personalized pagerank divided by the node in-degree.regularized
: The personalized pagerank divide by the node in-degree plustau
.
When computing personalized pageranks for Twitter users (either
via rtweet_graph()
, name
is given
as a user ID, not a screen name, regardless of how the seed nodes
were specified.
References
Chen, Fan, Yini Zhang, and Karl Rohe. “Targeted Sampling from Massive Block Model Graphs with Personalized PageRank.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82, no. 1 (February 2020): 99–126. https://doi.org/10.1111/rssb.12349.
Andersen, Reid, Fan Chung, and Kevin Lang. “Local Graph Partitioning Using PageRank Vectors.” In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), 475–86. Berkeley, CA, USA: IEEE, 2006. https://doi.org/10.1109/FOCS.2006.44.