Design • neocache

neocache is designed to cache information from the Twitter API to a Neo4J database to facilitate long term data collection, in particular of the following graph. We use Neo4J to enable fast graphical operations such as neighborhood queries. A previous iteration of neocache called twittercache used a traditional SQL database to store the following graph, and ran into scaling issues. Since Neo4J can be complicated to install, we run it through a local Docker container. You will need to install docker to use neocache. We use the stevedore R package to programmatically interact with Docker.

Neo4J Data Model

Neo4J uses the labelled property graph model. The elements of the labeled property graph model are nodes, relationships, properties, and labels.

Nodes contain properties. Properties are key-value pairs where the key is a string and the value is a semi-arbitrary value
Nodes can have one or more labels. Labels indicate the role of nodes within a dataset
Relationships connect nodes. Relationships are directed, and have a start node and an end node.
Relationships can have properties, just like nodes.

Neo4J Schema

In our database, we will represent each Twitter user and each tweet as a node. We will indicate that user A follows user B by adding an edge from user A to user B. The edge will have the label Follows.

Nodes and node labels

User: a user in the Twitter graph. Nodes with User label should have the following properties:
sampled_at : Takes on a sentinel NULL value to indicate user has not yet been fully sampled, but partial user information was added to the graph at some point. Otherwise a datetime indicate when full information from lookup_users() was added to the graph.
sampled_friends_at: NULL unless we know that all of a users friends are in the graph becaues we explicitly added all information from get_friends(), in which case this should be the datetime when we added this information to the graph.
sampled_followers_at: Same as sampled_friends_at but for followers instead of friends.
user_id
screen_name
protected
followers_count
friends_count
listed_count
statuses_count
favourites_count
account_created_at
verified
profile_url
profile_expanded_url
account_lang
profile_banner_url
profile_background_url
profile_image_url
name
location ,
description
url
Tweet (ignore for now)
sampled_at
status_id
created_at
text
source
display_text_width
is_quote
is_retweet
favorite_count
rtweet_count
quote_count
reply_count
hashtags
symbols
urls_url
urls_t.co
urls_expanded_url
media_url
media_t.co
media_expanded_url
media_type
ext_media_url
ext_media_t.co
ext_media_expanded_url
ext_media_type
lang
status_url

Edges

follows: A relationship strictly between two Users. For example, both @alexpghayes and @karlrohe follow each other.
tweeted: A relationship between a User and a Tweet
reply_to: A relationship between a User and a Tweet
reply_to: A relationship between two Tweets
mentions: A relationship between a User and a Tweet
quotes: A relationship between two Tweets

Psuedocode

Representation of the database

In some preliminary work, we found that the fastest way to send data to the Neo4J database was not over the HTTPs connection, but rather to write to a csv, move the csv into the Docker filesystem, and then use read the csv back into Neo4J. This implies that we should abstract the concept of a cache object that contains all the information necessary to ultimately interact with the underlying Neo4J database. This turns out to consist of two things:

An active HTTPs connection to a running Neo4J database, represented by a neo4r::connexion object
All the information about the docker container hosting the Neo4J database

We will take this information and combine it into a single cache object.

# called for side effects
new_cache <- function(cache_name) {
  
  # use cli to ask to the user what to do if a cache with that name
  # already exists. either overwrite, or stop.
  
  cache  <- list(
    name = cache_name,
    docker_container_name = ...,
    ...
  )
  
  class(cache) <- "neocache"
  
  # create a docker container to host the cache's Neo4J database. be sure to
  # overwrite an existing container if the user has requested it
  
  # should use https://cli.r-lib.org/ to update the user if this is successful
  # or if it fails
  
  # do not start running the docker container
  
  # save the cache object to ~/.cache/neocache/{cache_name}.rds
}

The idea is then that in any code that uses neocache, we want the user to explicitly select the cache they want to save data into. That is, before

Cached API Calls

Here we describe some pseudocode for drop in replacements for rtweet::lookup_users(), rtweet::get_friends() and rtweet::get_followers(). In the following our goal will always be to minimize both queries to the Neo4J data base and the Twitter API.

#' @param users A character vector of user ids (never screen names)
#'
#' @return A tibble where each row corresponds to a User and each column
#'   to one of the User properties. If a user cannot be sampled, should
#'   return nothing for that user. If no users can be sampled, should
#'   return an empty tibble with appropriate columns.
#'
lookup_users <- function(users) {
  user_data <- db_lookup_users(users)
  not_in_graph <- setdiff(users, user_data$user_id)
  new_user_data <- add_new_users(not_in_graph)
  not_sampled <- user_data %>`%
    filter(is.null(sampled_at)) %>`%
    pull(user_id)
  upgraded_user_data <- sample_existing_users(not_sampled)
  bind_rows(user_data, new_user_data, upgraded_user_data)
}

db_lookup_users <- function(users) {
  # return a tibble where each row corresponds to a User and each column
  # to one of the User properties. when a user in not present in the
  # database, should not return a row in the output tibble for that
  # user. if no users are in the db should return an empty tibble with one
  # column for each User property
}

#' @return The tibble of user data, with one row for each (accessible)
#'   user in `users` and one column for each property of `User` nodes
#'   in the graph database.
add_new_users <- function(users) {
  # make sure to set sampled_at to Sys.time() and
  # sampled_friends_at and sampled_followers_at to NULL
  # return data on users
}

sample_existing_users <- function(users) {
  # pretty much the same as add_new_users(), except overwrite
  # existing data on each user rather than creating new user nodes
  # from scratch
}

#' @param users A character vector of user ids (never screen names)
#'
#' @return A tibble where each row corresponds to a User and each column
#'   to one of the User properties. If a user cannot be sampled, should
#'   return nothing for that user. If no users can be sampled, should
#'   return an empty tibble with appropriate columns.
get_friends <- function(users) {
  # here we will need to query twice: once to ask who we actually
  # have *complete* friendship edges for, and then a second time to get
  # those friendship edges
  status <- friend_sampling_status(users)
  # if you can add new edges to the database without
  # creating duplicate edges (using MERGE?) you may be able
  # to combine the logic in the following steps
  new_edges <- add_new_friends(status$not_in_graph)
  upgraded_edges <- sample_existing_friends(status$sampled_friends_at_is_null)
  existing_edges <- db_get_friends(status$sampled_friends_at_not_null)
  # need to be careful about duplicate edges here. ideally
  # we guarantee that edges are unique somehow before this, but if not
  # we can use dplyr::distinct(), although this is an expensive operation
  bind_rows(new_edges, upgraded_edges, existing_edges)
}

add_new_friends <- function(users) {
  # set sampled_friends_at to Sys.time() and
  # sampled_at and sampled_followers_at to NULL
  # return friends of each user
}

sample_existing_friends <- function(users) {
  # set sampled_friends_at to Sys.time() and
  # *leave* sampled_at and sampled_followers_at at their
  # existing values
  # first remove outgoing edges from users to other nodes
  # in the database, as this information may be outdated
  # and we're about to update it anyway
  # return friends of each user
}

friend_sampling_status <- function(users) {
  # generate based on queries of user.sampled_friends_at node property
  list(
    not_in_graph = ...,
    sampled_friends_at_is_null = ...,
    sampled_friends_at_not_null = ...
  )
}

Misc thoughts

A fair amount of information may be missing for any given User node; we need to carefully think about the various states a User node might be in and how this will impact various graph queries
Any user generated text is liable to contain Unicode abominations

Note that rtweet returns several fields containing geographic information that I think we should discard because that shit is creepy (I’m not clear which of the following are User versus ‘Tweet properties either): - place_url - place_name - place_full_name - place_type - country - country_code - geo_coords - coords_coords - bbox_coords

To think about: user states: healthy, banned, suspected, removed, etc, etc

Cache design dump

cache in one of two states: active or inactive. active means docker container is running

library(neocache) 12:27 # 1 user creates a cache (once). this cache object gets saved to ~/.cache/neocache/{neocache_name}.rds 12:28 when the user wants to do stuff, then first tell neocache to gets things running in the background cache <- use_cache(“{neocache_name”) start_cache(cache) – starts docker 12:28 and then all the API calls need to explicitly use the cache 12:29 friends <- get_friends(…, cache = cache) 12:32 each cache gets its own docker container