Reddit is an online forum divided into topic-specific subforums called
'subreddits'. We consider three subreddits: keto, okcupid, and
childfree. In these subreddits, we identify users whose username
flair includes a gender label (usually 'M' or 'F'). We collect all top-level
comments from these users in 2018. We use each comment's text and its score,
which is the number of likes minus dislikes from other users.
The dataset includes 90k comments in the selected subreddits.
Usage
data(reddit)
Format
A tidygraph::tbl_graph() bipartite graph object, made up
of a node table and an edge table.
The node table has columns:
type (logical): A logical indicator of whether a node corresponds to a word or a reddit post. Used internally by igraph; see node_type for an easier-to-use alternative.
name (character): Unique node identifier. There is one node in the graph for each reddit post, and one for each word used in a reddit post. Posts are identified by a number. Words are identified by the tokenized word itself. Tokenization of the raw top-level comment text was performed with tidytext::unnest_tokens(..., token = "tweets").
node_type (character): Either "post" or "word".
author_gender: Either "female" or "male", based on user flairs. Only available for post nodes.
score (integer): The number of upvotes minus the number of downvotes received by a given post. Only available for post nodes.
subreddit (character): One of "keto", "okcupid" or "childfree". Only available for post nodes.
author_pseudonym (character): An author identifier that is consistent across posts. Only available for post nodes.
and the edge table has columns:
from (int): Id of post
to (int): Id of word
weight (double): Number of times a word was used in a post.
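As a rough illustration of how the two tables fit together, the base R sketch below builds small stand-in data frames with the documented column names and joins them. All values are made up, author_pseudonym is omitted for brevity, and the sketch assumes that from and to are row positions in the node table (as in a tidygraph edge table) and that type is FALSE for posts. The names nodes, edges, edge_info and counts are hypothetical and not part of the dataset.

```r
# Toy node table with the documented columns (made-up values).
# Assumption: type is FALSE for posts and TRUE for words.
nodes <- data.frame(
  type = c(FALSE, FALSE, TRUE, TRUE, TRUE),
  name = c("1", "2", "bacon", "date", "eggs"),
  node_type = c("post", "post", "word", "word", "word"),
  author_gender = c("female", "male", NA, NA, NA),
  score = c(5L, -1L, NA, NA, NA),
  subreddit = c("keto", "okcupid", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Toy edge table. Assumption: from and to are row positions in `nodes`.
edges <- data.frame(
  from = c(1L, 1L, 2L),
  to = c(3L, 5L, 4L),
  weight = c(2, 1, 3)
)

# Attach the post-level attributes and the word itself to each edge.
edge_info <- data.frame(
  post = nodes$name[edges$from],
  word = nodes$name[edges$to],
  author_gender = nodes$author_gender[edges$from],
  subreddit = nodes$subreddit[edges$from],
  weight = edges$weight,
  stringsAsFactors = FALSE
)

# Post-by-word count table: rows are posts, columns are words.
counts <- xtabs(weight ~ post + word, data = edge_info)

# Total number of word occurrences by author gender.
usage_by_gender <- aggregate(weight ~ author_gender, data = edge_info, FUN = sum)
```

With the real object, the same node and edge tables would first have to be extracted from the tbl_graph before applying steps like these.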
Source
Downloaded from https://archive.org/details/reddit_posts_2018 on June 6, 2022.
Details
See https://github.com/blei-lab/causal-text-embeddings for a replication package for Veitch et al. (2020). See https://github.com/blei-lab/causal-text-embeddings/blob/master/src/reddit/data_cleaning/BigQuery_get_data in particular for additional details on the data.