Use spectral node embeddings in ordinary least squares regression

A helper function that exposes the adjacency matrix A, normalized graph Laplacian L, and regularized graph Laplacian L_tau to model formulas for convenient network regression. Primarily designed to work with tidygraph::tbl_graph() objects, but can also be used with a matrix representation of a graph together with a data.frame() of nodal covariates.

Usage

nodelm(formula, graph = NULL, data = NULL, attr = NULL, ...)

Arguments

formula

A regression formula that can include ase_specials and vsp_specials, which encode node embeddings. Data for non-embedding terms can come from the global environment, from data, or from named attributes of an igraph object. It is likely most convenient and intuitive to put nodal covariates in the nodes table of a tidygraph::tbl_graph() object. See reddit, addhealth and smoking for examples.
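
As a rough illustration of what an embedding term contributes to a regression, the sketch below computes the leading left singular vectors of a small adjacency matrix with base R's svd() and uses them as regressors in stats::lm(). It is an assumption that a term such as U(A, k) behaves like these truncated singular vectors; the package may scale, rotate, or otherwise post-process them. The toy matrix, the response y, and the helper left_embedding() are hypothetical and not part of the package.

```r
# Toy directed graph on 6 nodes (hypothetical data, for illustration only)
A <- matrix(0, nrow = 6, ncol = 6)
A[cbind(c(1, 1, 2, 3, 4, 5, 6), c(2, 3, 3, 4, 5, 6, 1))] <- 1

# Hypothetical helper: leading k left singular vectors of a matrix
left_embedding <- function(M, k) {
  s <- svd(M, nu = k, nv = 0)
  s$u
}

U2 <- left_embedding(A, 2)  # 6 x 2 matrix, one row per node

set.seed(1)
y <- rnorm(6)               # hypothetical nodal response

# A matrix-valued term expands into one coefficient per column,
# much like the U(A, 5)1, ..., U(A, 5)5 coefficients in the Examples
fit <- lm(y ~ U2)
coef(fit)
```

With nodelm(), the same idea is written inside the formula itself, so the embedding does not have to be computed by hand.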

graph

An optional igraph::graph() or tidygraph::tbl_graph() object. If specified, the graph adjacency matrix A, normalized graph Laplacian L, and regularized graph Laplacian L_tau are injected into the environment of formula, so these matrices may be used freely in formula. See igraph::as_adjacency_matrix() for details about the construction of A, and invertiforms::NormalizedLaplacian() and invertiforms::RegularizedLaplacian() for details about the construction of L and L_tau. Note that you can also use node embeddings based on arbitrary matrix representations of a graph; see the examples.
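
To make the three matrices concrete, here is a hand-rolled sketch for a small undirected graph. It assumes the normalized Laplacian has the degree-scaled form D^{-1/2} A D^{-1/2} and that the regularized version adds a constant tau to every degree before scaling; the choice tau = mean degree is likewise an assumption. The invertiforms documentation is the authority on the actual definitions and defaults, and this code does not call invertiforms.

```r
library(igraph)
library(Matrix)

g <- make_ring(6)                           # toy undirected graph

# Sparse adjacency matrix
A <- as_adjacency_matrix(g, sparse = TRUE)

d <- Matrix::rowSums(A)                     # node degrees

# Assumed form of the normalized Laplacian: D^{-1/2} A D^{-1/2}
D_inv_sqrt <- Diagonal(x = 1 / sqrt(d))
L <- D_inv_sqrt %*% A %*% D_inv_sqrt

# Assumed form of the regularized Laplacian: degrees inflated by tau
tau <- mean(d)                              # assumed regularization level
D_tau_inv_sqrt <- Diagonal(x = 1 / sqrt(d + tau))
L_tau <- D_tau_inv_sqrt %*% A %*% D_tau_inv_sqrt
```

Regularization of this kind shrinks the influence of low-degree nodes, which is why L_tau is often preferred for sparse graphs.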

data

A data.frame() with one row for each node in the graph.

attr

Either NULL or a character string giving an edge attribute name. If NULL a traditional adjacency matrix is returned. If not NULL then the values of the given edge attribute are included in the adjacency matrix. If the graph has multiple edges, the edge attribute of an arbitrarily chosen edge (for the multiple edges) is included. This argument is ignored if edges is TRUE.

Note that this works only for certain attribute types. If the sparse argument is TRUE, then the attribute must be either logical or numeric. If the sparse argument is FALSE, then character is also allowed. The reason for the difference is that the Matrix package does not support character sparse matrices yet.
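
The effect of attr can be seen directly with igraph::as_adjacency_matrix(). The sketch below uses a toy graph with a made-up numeric edge attribute named weight; the graph and its weights are hypothetical.

```r
library(igraph)

g <- make_ring(4)              # toy graph with 4 undirected edges
E(g)$weight <- c(1, 2, 3, 4)   # hypothetical numeric edge attribute

# attr = NULL (the default): a traditional 0/1 adjacency matrix
A_binary <- as_adjacency_matrix(g)

# attr = "weight": entries hold the edge attribute values instead
A_weighted <- as_adjacency_matrix(g, attr = "weight")

A_binary
A_weighted
```

A binary version of a weighted matrix can be recovered with sign(), as the final example on this page does with U(sign(B), 10).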

...

Arguments passed on to stats::lm

subset: an optional vector specifying a subset of observations to be used in the fitting process. (See additional details about how this argument interacts with data-dependent bases in the ‘Details’ section of the model.frame documentation.)
weights: an optional vector of weights to be used in the fitting process. Should be NULL or a numeric vector. If non-NULL, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used. See also ‘Details’.
na.action: a function which indicates what should happen when the data contain NAs. The default is set by the na.action setting of options, and is na.fail if that is unset. The ‘factory-fresh’ default is na.omit. Another possible value is NULL, no action. Value na.exclude can be useful.
method: the method to be used; for fitting, currently only method = "qr" is supported; method = "model.frame" returns the model frame (the same as with model = TRUE, see below).
model,x,y,qr: logicals. If TRUE the corresponding components of the fit (the model frame, the model matrix, the response, the QR decomposition) are returned.
singular.ok: logical. If FALSE (the default in S but not in R) a singular fit is an error.
contrasts: an optional list. See the contrasts.arg of model.matrix.default.
offset: this can be used to specify an a priori known component to be included in the linear predictor during fitting. This should be NULL or a numeric vector or matrix of extents matching those of the response. One or more offset terms can be included in the formula instead or as well, and if more than one are specified their sum is used. See model.offset.
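
As a small illustration of one of these arguments, the sketch below contrasts na.omit with na.exclude using stats::lm() directly on a made-up data frame. The data are hypothetical, and the example deliberately avoids nodelm() so that it only relies on documented stats::lm() behavior.

```r
# Hypothetical data with one missing predictor value
df <- data.frame(
  x = c(1, 2, 3, 4, NA, 6),
  y = c(1.1, 1.9, 3.2, 3.8, 5.1, 6.0)
)

fit_omit    <- lm(y ~ x, data = df, na.action = na.omit)
fit_exclude <- lm(y ~ x, data = df, na.action = na.exclude)

# Both fits drop the incomplete row when estimating coefficients,
# but na.exclude pads residuals with NA so they align with the rows of df
length(residuals(fit_omit))
length(residuals(fit_exclude))
```

Padded residuals are convenient in network settings, where results usually need to be matched back to the full set of nodes.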

Value

An object of class lm. See stats::lm() for details.

Examples


data(addhealth, package = "latentnetmediate")
data(smoking, package = "latentnetmediate")

### some examples where data is specified as a tidygraph

# a regression that does not use any node embeddings
nodelm(grade ~ sex, graph = addhealth[[36]])
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)      sexmale  
#>     9.79575     -0.04575  
#> 

# a regression including left and right singular embeddings of
# the adjacency matrix and the normalized graph Laplacian
nodelm(grade ~ sex + U(A, 5) + V(L, 3), graph = addhealth[[36]])
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)      sexmale     U(A, 5)1     U(A, 5)2     U(A, 5)3     U(A, 5)4  
#>     9.87934     -0.03804     -2.27030      0.33828    -14.13248      1.85827  
#>    U(A, 5)5     V(L, 3)1     V(L, 3)2     V(L, 3)3  
#>     5.56796      0.52628      0.85442     47.54782  
#> 

nodelm(as.integer(smokes) ~ sex + U(A, 5), graph = smoking)
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Coefficients:
#> (Intercept)      sexmale     U(A, 5)1     U(A, 5)2     U(A, 5)3     U(A, 5)4  
#>     0.96025      0.40651      1.57252      0.64105     -0.17947     -0.09439  
#>    U(A, 5)5  
#>     1.10853  
#> 

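### an example where the graph is specified as a matrix and the
### nodal covariates as a data frame

library(Matrix)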
library(tidygraph)

B <- igraph::as_adjacency_matrix(addhealth[[36]], attr = "weight")

node <- addhealth[[36]] |>
  as_tibble() |>
  mutate(level = rowSums(B))

# set one value of sex to missing
node[5, "sex"] <- NA
node
#> # A tibble: 2,209 × 5
#>    sex    race     grade school level
#>    <fct>  <fct>    <int> <fct>  <dbl>
#>  1 male   hispanic    10 B          5
#>  2 male   hispanic    10 B          0
#>  3 female hispanic     9 B          0
#>  4 male   hispanic    10 B         19
#>  5 NA     hispanic     9 B         16
#>  6 female hispanic    11 B          0
#>  7 female hispanic     9 B          7
#>  8 male   white        9 B         19
#>  9 male   white       11 B          8
#> 10 female white       10 B          6
#> # ℹ 2,199 more rows

fit <- nodelm(level ~ sex + grade + race + U(sign(B), 10), data = node)
summary(fit)
#> 
#> Call:
#> stats::lm(formula = formula, data = data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -29.221  -6.024  -1.228   4.687  34.829 
#> 
#> Coefficients:
#>                    Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)         3.70846    1.62265   2.285 0.022385 *  
#> sexmale            -2.87030    0.33382  -8.598  < 2e-16 ***
#> grade               0.48221    0.14445   3.338 0.000858 ***
#> raceblack           0.09635    0.94038   0.102 0.918404    
#> racehispanic        1.66458    0.79837   2.085 0.037191 *  
#> racemixed/other     2.48689    1.11826   2.224 0.026260 *  
#> racewhite           1.43935    0.84573   1.702 0.088919 .  
#> U(sign(B), 10)1   194.77884    8.38885  23.219  < 2e-16 ***
#> U(sign(B), 10)2   119.71510    8.71828  13.732  < 2e-16 ***
#> U(sign(B), 10)3   185.53943    8.51792  21.782  < 2e-16 ***
#> U(sign(B), 10)4   -89.70215    7.91825 -11.329  < 2e-16 ***
#> U(sign(B), 10)5   -76.51199    9.25422  -8.268 2.37e-16 ***
#> U(sign(B), 10)6  -155.05165    8.64150 -17.943  < 2e-16 ***
#> U(sign(B), 10)7    19.97444    8.13131   2.456 0.014110 *  
#> U(sign(B), 10)8    89.03533    8.31167  10.712  < 2e-16 ***
#> U(sign(B), 10)9    -8.72362    7.86190  -1.110 0.267294    
#> U(sign(B), 10)10  -28.35095    8.06069  -3.517 0.000445 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> Residual standard error: 7.711 on 2132 degrees of freedom
#>   (60 observations deleted due to missingness)
#> Multiple R-squared:  0.4453,	Adjusted R-squared:  0.4411 
#> F-statistic: 106.9 on 16 and 2132 DF,  p-value: < 2.2e-16
#>