Use spectral node embeddings in OLS with robust standard errors

A helper function that exposes the adjacency matrix A, normalized graph Laplacian L, and regularized graph Laplacian L_tau to model formulas for convenient network regression. Primarily designed to work with tidygraph::tbl_graph() objects, but can also be used with a matrix representation of a graph together with a data.frame() of nodal covariates.

Usage

nodelm_robust(formula, graph = NULL, data = NULL, attr = NULL, ...)

Arguments

formula

A regression formula that can include ase_specials and vsp_specials, which encode node embeddings. Data for non-embedding terms can come from the global environment, from data, or from named attributes of an igraph object. It is likely most convenient and intuitive to put nodal covariates in the nodes table of a tidygraph::tbl_graph() object to expose nodal data. See reddit, addhealth and smoking for examples.
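
To make the idea of an embedding term concrete, here is a minimal base-R sketch of a rank-k left singular embedding used as a block of regressors. It is an illustration only, not the package's implementation: the simulated matrix A, the outcome y, and the name U_k are all hypothetical, and the package's own specials may scale or sign the vectors differently.

```r
# Illustration only: a rank-k left singular embedding computed with base R.
# A, y and U_k are hypothetical; the package's U() special may use a
# different scaling or sign convention.
set.seed(1)
n <- 8
k <- 2

# small symmetric 0/1 adjacency matrix with no self-loops
A <- matrix(0, n, n)
A[upper.tri(A)] <- rbinom(n * (n - 1) / 2, size = 1, prob = 0.4)
A <- A + t(A)

# leading k left singular vectors: one row per node, one column per dimension
s <- svd(A, nu = k, nv = k)
U_k <- s$u

# simulated nodal outcome regressed on the embedding columns
y <- rnorm(n)
fit <- lm(y ~ U_k)

dim(U_k)
names(coef(fit))
```

Each embedding dimension enters the model as its own column, which is why an embedding term contributes one coefficient per dimension.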

graph

An optional igraph::graph() or tidygraph::tbl_graph() object. If specified, the graph adjacency matrix A, normalized graph Laplacian L, and regularized graph Laplacian L_tau are injected into the environment of formula, so these matrices may be used freely in formula. See igraph::as_adjacency_matrix() for details about the construction of A, and invertiforms::NormalizedLaplacian() and invertiforms::RegularizedLaplacian() for details about the construction of L and L_tau. Note that you can also use node embeddings based on arbitrary matrix representations of a graph; see the examples.

data

A data.frame() with one row for each node in the graph.

attr

Either NULL or a character string giving an edge attribute name. If NULL, a traditional adjacency matrix is returned. If not NULL, the values of the given edge attribute are included in the adjacency matrix. If the graph has multiple edges, the edge attribute of an arbitrarily chosen edge (for the multiple edges) is included. This argument is ignored if edges is TRUE.

Note that this works only for certain attribute types. If the sparse argument is TRUE, then the attribute must be either logical or numeric. If the sparse argument is FALSE, then character is also allowed. The reason for the difference is that the Matrix package does not support character sparse matrices yet.
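
The effect of attr can be seen by calling igraph::as_adjacency_matrix() directly on a toy graph. This is a small sketch, assuming a hypothetical 4-node ring with a numeric edge attribute named "weight"; it does not use nodelm_robust() itself.

```r
library(igraph)
library(Matrix)

# hypothetical toy graph: a 4-node ring with a numeric "weight" edge attribute
g <- make_ring(4)
E(g)$weight <- c(2, 5, 1, 3)

# attr = NULL (the default): plain 0/1 adjacency matrix
A_plain <- as_adjacency_matrix(g)

# attr = "weight": entries hold the edge attribute values instead of 1
A_weighted <- as_adjacency_matrix(g, attr = "weight")

as.matrix(A_plain)
as.matrix(A_weighted)
```

Both results are sparse by default, so the attribute has to be logical or numeric, as noted above.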

...

Arguments passed on to estimatr::lm_robust

weights: The bare (unquoted) name of the weights variable in the supplied data.
subset: An optional bare (unquoted) expression specifying a subset of observations to be used.
clusters: An optional bare (unquoted) name of the variable that corresponds to the clusters in the data.
fixed_effects: An optional right-sided formula containing the fixed effects that will be projected out of the data, such as ~ blockID. Do not pass multiple fixed effects with intersecting groups. Speed gains are greatest for variables with large numbers of groups and when using "HC1" or "stata" standard errors. See 'Details'.
se_type: The sort of standard error sought. If clusters is not specified the options are "HC0", "HC1" (or "stata", the equivalent), "HC2" (default), "HC3", or "classical". If clusters is specified the options are "CR0", "CR2" (default), or "stata". Can also specify "none", which may speed up estimation of the coefficients.
ci: logical. Whether to compute and return p-values and confidence intervals, TRUE by default.
alpha: The significance level, 0.05 by default.
return_vcov: logical. Whether to return the variance-covariance matrix for later usage, TRUE by default.
try_cholesky: logical. Whether to try using a Cholesky decomposition to solve least squares instead of a QR decomposition, FALSE by default. Using a Cholesky decomposition may result in speed gains, but should only be used if users are sure their model is full-rank (i.e., there is no perfect multi-collinearity).
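
As a rough sketch of how these options behave, the block below calls estimatr::lm_robust() directly on the built-in mtcars data, which is used here only to keep the example self-contained. Since the arguments above are documented as being passed on through ..., the same named arguments would presumably be supplied in a nodelm_robust() call; that forwarding is not exercised here.

```r
library(estimatr)

# mtcars is a stand-in dataset, not part of this package's examples

# heteroskedasticity-robust HC3 standard errors with 90% confidence intervals
fit_hc3 <- lm_robust(mpg ~ wt, data = mtcars, se_type = "HC3", alpha = 0.10)

# cluster-robust standard errors; clusters takes a bare variable name
fit_cl <- lm_robust(mpg ~ wt, data = mtcars, clusters = cyl)

# the standard error type actually used is stored on the fitted object
fit_hc3$se_type
fit_cl$se_type

# coefficient table as a data frame
tidy(fit_hc3)
```

The returned objects have class lm_robust, so summary() and tidy() work on them in the same way as on the fits shown in the examples.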

Value

An object of class lm_robust. See estimatr::lm_robust() for details.

Examples


### some examples where data is specified as a tidygraph

data(addhealth, package = "latentnetmediate")
data(smoking, package = "latentnetmediate")

# a regression that does not use any node embeddings
nodelm_robust(grade ~ sex, graph = addhealth[[36]])
#>                Estimate Std. Error     t value  Pr(>|t|)  CI Lower   CI Upper
#> (Intercept)  9.79574861 0.04440072 220.6213812 0.0000000  9.708676 9.88282122
#> sexmale     -0.04574861 0.06299104  -0.7262717 0.4677509 -0.169278 0.07778078
#>               DF
#> (Intercept) 2160
#> sexmale     2160

# a regression including left and right singular embeddings of
# the adjacency matrix and the normalized graph Laplacian
nodelm_robust(grade ~ sex + U(A, 5) + V(L, 3), graph = addhealth[[36]])
#>                 Estimate Std. Error     t value      Pr(>|t|)    CI Lower
#> (Intercept)   9.87934123 0.03360119 294.0176290  0.000000e+00   9.8134471
#> sexmale      -0.03804066 0.04422524  -0.8601570  3.897983e-01  -0.1247693
#> U(A, 5)1     -2.27029557 0.58329181  -3.8922123  1.023563e-04  -3.4141699
#> U(A, 5)2      0.33828108 1.24915054   0.2708089  7.865640e-01  -2.1113868
#> U(A, 5)3    -14.13248367 0.67382758 -20.9734419  5.097222e-89 -15.4539047
#> U(A, 5)4      1.85826748 0.46787645   3.9717055  7.370557e-05   0.9407304
#> U(A, 5)5      5.56795840 1.32618364   4.1984822  2.796484e-05   2.9672235
#> V(L, 3)1      0.52627629 1.64213265   0.3204834  7.486330e-01  -2.6940558
#> V(L, 3)2      0.85441741 1.53932913   0.5550583  5.789125e-01  -2.1643101
#> V(L, 3)3     47.54782188 1.54149212  30.8453227 2.555583e-173  44.5248526
#>                 CI Upper   DF
#> (Intercept)   9.94523541 2152
#> sexmale       0.04868801 2152
#> U(A, 5)1     -1.12642127 2152
#> U(A, 5)2      2.78794893 2152
#> U(A, 5)3    -12.81106267 2152
#> U(A, 5)4      2.77580454 2152
#> U(A, 5)5      8.16869331 2152
#> V(L, 3)1      3.74660835 2152
#> V(L, 3)2      3.87314490 2152
#> V(L, 3)3     50.57079113 2152

nodelm_robust(as.integer(smokes) ~ sex + U(A, 5), graph = smoking)
#>                Estimate Std. Error    t value     Pr(>|t|)   CI Lower  CI Upper
#> (Intercept)  0.96024972 0.09895098  9.7042975 1.083297e-17  0.7647832 1.1557162
#> sexmale      0.40650863 0.07779719  5.2252357 5.533918e-07  0.2528291 0.5601882
#> U(A, 5)1     1.57251864 1.50442927  1.0452593 2.975305e-01 -1.3993116 4.5443489
#> U(A, 5)2     0.64105255 0.52974082  1.2101249 2.280732e-01 -0.4053907 1.6874958
#> U(A, 5)3    -0.17947427 0.40233025 -0.4460869 6.561570e-01 -0.9742323 0.6152837
#> U(A, 5)4    -0.09438519 0.33939577 -0.2780977 7.813080e-01 -0.7648232 0.5760529
#> U(A, 5)5     1.10853421 0.33943585  3.2658135 1.343659e-03  0.4380170 1.7790514
#>              DF
#> (Intercept) 155
#> sexmale     155
#> U(A, 5)1    155
#> U(A, 5)2    155
#> U(A, 5)3    155
#> U(A, 5)4    155
#> U(A, 5)5    155

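### an example where the graph is specified as a matrix and
### nodal covariates are supplied in a data frame

library(Matrix)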
library(tidygraph)

B <- igraph::as_adjacency_matrix(addhealth[[36]], attr = "weight")

node <- addhealth[[36]] |>
  as_tibble() |>
  mutate(level = rowSums(B))

# set one covariate value to missing
node[5, "sex"] <- NA
node
#> # A tibble: 2,209 × 5
#>    sex    race     grade school level
#>    <fct>  <fct>    <int> <fct>  <dbl>
#>  1 male   hispanic    10 B          5
#>  2 male   hispanic    10 B          0
#>  3 female hispanic     9 B          0
#>  4 male   hispanic    10 B         19
#>  5 NA     hispanic     9 B         16
#>  6 female hispanic    11 B          0
#>  7 female hispanic     9 B          7
#>  8 male   white        9 B         19
#>  9 male   white       11 B          8
#> 10 female white       10 B          6
#> # ℹ 2,199 more rows

fit <- nodelm_robust(level ~ sex + grade + race + U(sign(B), 10), data = node)
summary(fit)
#> 
#> Call:
#> estimatr::lm_robust(formula = formula, data = data)
#> 
#> Standard error type:  HC2 
#> 
#> Coefficients:
#>                    Estimate Std. Error  t value  Pr(>|t|)  CI Lower  CI Upper
#> (Intercept)         3.70846     1.5597   2.3776 1.751e-02    0.6497    6.7672
#> sexmale            -2.87030     0.3345  -8.5816 1.771e-17   -3.5262   -2.2144
#> grade               0.48221     0.1375   3.5081 4.607e-04    0.2126    0.7518
#> raceblack           0.09635     0.8779   0.1098 9.126e-01   -1.6252    1.8179
#> racehispanic        1.66458     0.7881   2.1121 3.480e-02    0.1190    3.2102
#> racemixed/other     2.48689     1.0766   2.3099 2.099e-02    0.3756    4.5982
#> racewhite           1.43935     0.8365   1.7207 8.545e-02   -0.2011    3.0797
#> U(sign(B), 10)1   194.77884     9.1004  21.4032 3.198e-92  176.9322  212.6255
#> U(sign(B), 10)2   119.71510     9.2763  12.9055 9.586e-37  101.5236  137.9066
#> U(sign(B), 10)3   185.53943     8.8289  21.0150 2.911e-89  168.2253  202.8536
#> U(sign(B), 10)4   -89.70215     8.2256 -10.9052 5.536e-27 -105.8333  -73.5710
#> U(sign(B), 10)5   -76.51199     7.7617  -9.8577 1.902e-22  -91.7332  -61.2908
#> U(sign(B), 10)6  -155.05165    10.6856 -14.5104 1.437e-45 -176.0069 -134.0964
#> U(sign(B), 10)7    19.97444     6.5063   3.0700 2.168e-03    7.2151   32.7338
#> U(sign(B), 10)8    89.03533    11.4875   7.7506 1.405e-14   66.5074  111.5633
#> U(sign(B), 10)9    -8.72362     9.0506  -0.9639 3.352e-01  -26.4725    9.0252
#> U(sign(B), 10)10  -28.35095     8.0183  -3.5358 4.153e-04  -44.0755  -12.6264
#>                    DF
#> (Intercept)      2132
#> sexmale          2132
#> grade            2132
#> raceblack        2132
#> racehispanic     2132
#> racemixed/other  2132
#> racewhite        2132
#> U(sign(B), 10)1  2132
#> U(sign(B), 10)2  2132
#> U(sign(B), 10)3  2132
#> U(sign(B), 10)4  2132
#> U(sign(B), 10)5  2132
#> U(sign(B), 10)6  2132
#> U(sign(B), 10)7  2132
#> U(sign(B), 10)8  2132
#> U(sign(B), 10)9  2132
#> U(sign(B), 10)10 2132
#> 
#> Multiple R-squared:  0.4453 ,	Adjusted R-squared:  0.4411 
#> F-statistic: 91.57 on 16 and 2132 DF,  p-value: < 2.2e-16