`neocache` is designed to cache information from the Twitter API in a Neo4J database to facilitate long-term data collection, in particular of the following graph. We use Neo4J because it enables fast graph operations such as neighborhood queries; a previous iteration of `neocache`, called `twittercache`, used a traditional SQL database to store the following graph and ran into scaling issues.

Since Neo4J can be complicated to install, we run it in a local Docker container, so you will need to install Docker to use `neocache`. We use the `stevedore` R package to programmatically interact with Docker.
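For orientation, a rough sketch of how a Neo4J container might be started from R with stevedore follows; the image tag, container name, published ports, and auth settings are illustrative assumptions rather than settled package defaults, and the exact stevedore arguments may differ.

```r
library(stevedore)

# Sketch only: start a throwaway Neo4J container from R.
docker <- docker_client()

neo4j <- docker$container$run(
  "neo4j:3.5",                           # assumed image tag
  name = "neocache-example",
  ports = c("7474:7474", "7687:7687"),   # HTTP and Bolt
  env = c(NEO4J_AUTH = "neo4j/example-password"),
  detach = TRUE
)

# later: stop and remove the container
neo4j$stop()
neo4j$remove()
```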
Neo4J Data Model
Neo4J uses the labeled property graph model. The elements of the labeled property graph model are nodes, relationships, properties, and labels.
- Nodes contain properties. Properties are key-value pairs where the key is a string and the value is a semi-arbitrary value.
- Nodes can have one or more labels. Labels indicate the role of nodes within a dataset.
- Relationships connect nodes. Relationships are directed, and have a start node and an end node.
- Relationships can have properties, just like nodes.
Neo4J Schema
In our database, we will represent each Twitter user and each tweet as a node. We will indicate that user A follows user B by adding an edge from user A to user B. The edge will have the label `Follows`.
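To make this concrete, here is a minimal sketch of what creating two users and the follow edges between them could look like in Cypher, sent from R through a neo4r connection `con`. The connection and the `user_id` values are placeholders, not package code.

```r
library(neo4r)

# `con` is assumed to be an open neo4r connection to the cache's database.
# The user_id values below are placeholders.
create_follows <- "
MERGE (a:User {user_id: 'id_of_alexpghayes', screen_name: 'alexpghayes'})
MERGE (b:User {user_id: 'id_of_karlrohe', screen_name: 'karlrohe'})
MERGE (a)-[:Follows]->(b)
MERGE (b)-[:Follows]->(a)
"

call_neo4j(create_follows, con)
```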
Nodes and node labels
- User: a user in the Twitter graph. Nodes with the User label should have the following properties:
- sampled_at: A sentinel NULL value indicates that the user has not yet been fully sampled, but that partial user information was added to the graph at some point. Otherwise, a datetime indicating when full information from lookup_users() was added to the graph.
- sampled_friends_at: NULL unless we know that all of a user's friends are in the graph because we explicitly added all information from get_friends(), in which case this should be the datetime when we added this information to the graph.
- sampled_followers_at: Same as sampled_friends_at but for followers instead of friends.
- user_id
- screen_name
- protected
- followers_count
- friends_count
- listed_count
- statuses_count
- favourites_count
- account_created_at
- verified
- profile_url
- profile_expanded_url
- account_lang
- profile_banner_url
- profile_background_url
- profile_image_url
- name
- location
- description
- url
- Tweet (ignore for now)
- sampled_at
- status_id
- created_at
- text
- source
- display_text_width
- is_quote
- is_retweet
- favorite_count
- retweet_count
- quote_count
- reply_count
- hashtags
- symbols
- urls_url
- urls_t.co
- urls_expanded_url
- media_url
- media_t.co
- media_expanded_url
- media_type
- ext_media_url
- ext_media_t.co
- ext_media_expanded_url
- ext_media_type
- lang
- status_url
Edges
- follows: A relationship strictly between two Users. For example, both @alexpghayes and @karlrohe follow each other.
- tweeted: A relationship between a User and a Tweet
- reply_to: A relationship between a User and a Tweet
- reply_to: A relationship between two Tweets
- mentions: A relationship between a User and a Tweet
- quotes: A relationship between two Tweets
Pseudocode
Representation of the database
In some preliminary work, we found that the fastest way to send data to the Neo4J database was not over the HTTP connection, but rather to write the data to a CSV, move the CSV into the Docker filesystem, and then read the CSV into Neo4J (a sketch of this path appears at the end of this subsection). This implies that we should abstract the concept of a cache object that contains all the information necessary to interact with the underlying Neo4J database. This turns out to consist of two things:
- An active HTTP connection to a running Neo4J database, represented by a `neo4r::connexion` object
- All the information about the Docker container hosting the Neo4J database
We will take this information and combine it into a single cache object.
# called for side effects
new_cache <- function(cache_name) {

  # use cli to ask the user what to do if a cache with that name
  # already exists: either overwrite, or stop

  cache <- list(
    name = cache_name,
    docker_container_name = ...,
    ...
  )

  class(cache) <- "neocache"

  # create a docker container to host the cache's Neo4J database. be sure to
  # overwrite an existing container if the user has requested it.
  # should use https://cli.r-lib.org/ to update the user on success or failure.
  # do not start running the docker container.
  # save the cache object to ~/.cache/neocache/{cache_name}.rds
}
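The persistence implied by the last comment (and by the design notes at the end of this document) could look roughly like this; `save_cache()` and `use_cache()` are hypothetical helpers, the latter named after the design notes.

```r
# Sketch only: save and re-load cache objects under ~/.cache/neocache/
cache_path <- function(cache_name) {
  file.path("~/.cache/neocache", paste0(cache_name, ".rds"))
}

save_cache <- function(cache) {
  dir.create(dirname(cache_path(cache$name)), recursive = TRUE, showWarnings = FALSE)
  saveRDS(cache, cache_path(cache$name))
}

use_cache <- function(cache_name) {
  readRDS(cache_path(cache_name))
}
```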
The idea is then that in any code that uses `neocache`, we want the user to explicitly select the cache they want to save data into. That is, before making any cached API calls, the user should load the cache they want to use and start it.
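Returning to the CSV bulk-load path mentioned at the start of this subsection, a rough sketch of writing user data into the database is below. It assumes the cache knows a local directory mounted as the Neo4J import directory inside the container and exposes its connection; `import_dir` and `con` are hypothetical cache fields.

```r
# Sketch only: bulk-load a tibble of user data via LOAD CSV.
bulk_add_users <- function(cache, user_data) {
  # write the data where the Neo4J container can see it
  readr::write_csv(user_data, file.path(cache$import_dir, "users.csv"))

  query <- "
  LOAD CSV WITH HEADERS FROM 'file:///users.csv' AS row
  MERGE (u:User {user_id: row.user_id})
  SET u.screen_name = row.screen_name,
      u.sampled_at = row.sampled_at
  "

  # only a couple of properties are set here; the real version would set
  # every User property listed in the schema above
  neo4r::call_neo4j(query, cache$con)
}
```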
Cached API Calls
Here we describe some pseudocode for drop-in replacements for `rtweet::lookup_users()`, `rtweet::get_friends()`, and `rtweet::get_followers()`. In the following, our goal will always be to minimize queries to both the Neo4J database and the Twitter API.
#' @param users A character vector of user ids (never screen names)
#'
#' @return A tibble where each row corresponds to a User and each column
#' to one of the User properties. If a user cannot be sampled, should
#' return nothing for that user. If no users can be sampled, should
#' return an empty tibble with appropriate columns.
#'
lookup_users <- function(users) {
  user_data <- db_lookup_users(users)
  not_in_graph <- setdiff(users, user_data$user_id)
  new_user_data <- add_new_users(not_in_graph)

  # the sentinel NULL sampled_at comes back as NA in the tibble
  not_sampled <- user_data %>%
    filter(is.na(sampled_at)) %>%
    pull(user_id)

  upgraded_user_data <- sample_existing_users(not_sampled)

  # drop stale partial rows so upgraded users appear only once
  bind_rows(
    filter(user_data, !user_id %in% not_sampled),
    new_user_data,
    upgraded_user_data
  )
}
db_lookup_users <- function(users) {
  # return a tibble where each row corresponds to a User and each column
  # to one of the User properties. when a user is not present in the
  # database, do not return a row in the output tibble for that user.
  # if no users are in the db, return an empty tibble with one column
  # for each User property
}
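One possible shape for db_lookup_users() is sketched below. It assumes the active cache exposes a neo4r connection (here a hypothetical `cache$con` field, however the connection ends up being plumbed through), and it leaves out the reshaping of neo4r's result into a one-row-per-user tibble.

```r
# Sketch only, not final code.
db_lookup_users <- function(users, cache) {
  id_list <- paste0("'", users, "'", collapse = ", ")

  query <- paste0(
    "MATCH (u:User) WHERE u.user_id IN [", id_list, "] RETURN u"
  )

  result <- neo4r::call_neo4j(query, cache$con, type = "row")

  # TODO: reshape `result` into a tibble with one row per matched User
  # and one column per User property
  result
}
```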
#' @return The tibble of user data, with one row for each (accessible)
#' user in `users` and one column for each property of `User` nodes
#' in the graph database.
add_new_users <- function(users) {
  # make sure to set sampled_at to Sys.time() and
  # sampled_friends_at and sampled_followers_at to NULL
  # return data on users
}
sample_existing_users <- function(users) {
  # pretty much the same as add_new_users(), except overwrite
  # existing data on each user rather than creating new user nodes
  # from scratch
}
#' @param users A character vector of user ids (never screen names)
#'
#' @return A tibble of friendship edges, with one row for each
#'   (user, friend) pair. If a user's friends cannot be sampled, should
#'   return no edges for that user. If no users can be sampled, should
#'   return an empty tibble with appropriate columns.
get_friends <- function(users) {

  # here we will need to query twice: once to ask who we actually
  # have *complete* friendship edges for, and then a second time to get
  # those friendship edges
  status <- friend_sampling_status(users)

  # if you can add new edges to the database without creating duplicate
  # edges (using MERGE? see the sketch after this function) you may be
  # able to combine the logic in the following steps
  new_edges <- add_new_friends(status$not_in_graph)
  upgraded_edges <- sample_existing_friends(status$sampled_friends_at_is_null)
  existing_edges <- db_get_friends(status$sampled_friends_at_not_null)

  # need to be careful about duplicate edges here. ideally we guarantee
  # that edges are unique somehow before this, but if not we can use
  # dplyr::distinct(), although this is an expensive operation
  bind_rows(new_edges, upgraded_edges, existing_edges)
}
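On the "using MERGE?" question above: MERGE on a relationship pattern only creates the relationship when it does not already exist, so something along these lines (a hypothetical single-edge helper, with `con` the cache's neo4r connection) should keep edges unique in the database without a dplyr::distinct() pass:

```r
# Sketch: add a single Follows edge without creating duplicates.
add_follows_edge <- function(from_id, to_id, con) {
  query <- paste0(
    "MERGE (a:User {user_id: '", from_id, "'}) ",
    "MERGE (b:User {user_id: '", to_id, "'}) ",
    "MERGE (a)-[:Follows]->(b)"
  )
  neo4r::call_neo4j(query, con)
}
```

In practice we would probably batch this (for example via the LOAD CSV path sketched earlier) rather than send one query per edge.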
add_new_friends <- function(users) {
  # set sampled_friends_at to Sys.time() and
  # sampled_at and sampled_followers_at to NULL
  # return friends of each user
}
sample_existing_friends <- function(users) {
  # set sampled_friends_at to Sys.time() and *leave* sampled_at
  # and sampled_followers_at at their existing values
  # first remove outgoing edges from users to other nodes in the database,
  # as this information may be outdated and we're about to update it anyway
  # return friends of each user
}
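The "remove outgoing edges first" step could look something like this (hypothetical helper; `con` is the cache's neo4r connection):

```r
# Sketch: drop all existing Follows edges out of a user before re-adding
# the freshly sampled ones.
delete_outgoing_follows <- function(user_id, con) {
  query <- paste0(
    "MATCH (u:User {user_id: '", user_id, "'})-[r:Follows]->() DELETE r"
  )
  neo4r::call_neo4j(query, con)
}
```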
friend_sampling_status <- function(users) {
  # generate based on queries of the user.sampled_friends_at node property
  list(
    not_in_graph = ...,
    sampled_friends_at_is_null = ...,
    sampled_friends_at_not_null = ...
  )
}
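One way to fill this in is to reuse db_lookup_users() from above, which returns no row for users missing from the graph and (presumably) NA for the sentinel NULL sampled_friends_at:

```r
# Sketch only.
friend_sampling_status <- function(users) {
  known <- db_lookup_users(users)

  list(
    not_in_graph = setdiff(users, known$user_id),
    sampled_friends_at_is_null = known$user_id[is.na(known$sampled_friends_at)],
    sampled_friends_at_not_null = known$user_id[!is.na(known$sampled_friends_at)]
  )
}
```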
Misc thoughts
- A fair amount of information may be missing for any given User node; we need to think carefully about the various states a User node might be in and how this will impact various graph queries.
- Any user-generated text is liable to contain Unicode abominations.
- Note that rtweet returns several fields containing geographic information that I think we should discard because that shit is creepy (I'm not clear which of the following are User versus Tweet properties either): place_url, coords_coords, bbox_coords.
- To think about: user states: healthy, banned, suspended, removed, etc.
Cache design dump
- A cache is in one of two states: active or inactive. Active means the Docker container is running.
- After `library(neocache)`, the user creates a cache (once); the cache object gets saved to `~/.cache/neocache/{neocache_name}.rds`.
- When the user wants to do stuff, they first tell neocache to get things running in the background: `cache <- use_cache("{neocache_name}")`, then `start_cache(cache)`, which starts Docker.
- All the API calls then need to explicitly use the cache: `friends <- get_friends(..., cache = cache)`.
- Each cache gets its own Docker container.
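Put together as code, the intended workflow from these notes looks roughly like this (`use_cache()` and `start_cache()` are names taken from the notes, not implemented functions; `"my-study"` and `some_user_ids` are placeholders):

```r
library(neocache)

# once: create the cache and its Docker container
new_cache("my-study")

# in any later session: load the cache and start its container
cache <- use_cache("my-study")
start_cache(cache)

# every cached API call uses the cache explicitly
some_user_ids <- c("...")  # character vector of Twitter user ids
friends <- get_friends(some_user_ids, cache = cache)
```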