Reddit is an online forum divided into topic-specific subforums called 'subreddits'. We consider three subreddits: keto, okcupid, and childfree. In these subreddits, we identify users whose username flair includes a gender label (usually 'M' or 'F'), and we collect all top-level comments posted by these users in 2018. For each comment we use its text and its score, defined as the number of likes minus the number of dislikes from other users. The dataset includes 90k comments from the selected subreddits.

Usage

data(reddit)

Format

A tidygraph::tbl_graph() bipartite graph object, made up of a node table and an edge table.

The node table has columns:

  • type (logical): A logical indicator of whether a node corresponds to a word or a Reddit post. Used internally by igraph; see node_type for an easier-to-use alternative.

  • name (character): Unique node identifier. There is one node in the graph for each Reddit post, and one for each word used in a Reddit post. Posts are identified by a number. Words are identified by the token itself. Tokenization of the raw top-level comment text was performed with tidytext::unnest_tokens(..., token = "tweets").

  • node_type (character): Either "post" or "word".

  • author_gender: Either "female" or "male", based on user flairs. Only available for post nodes.

  • score (integer): The number of upvotes minus the number of downvotes received by a given post. Only available for post nodes.

  • subreddit (character): One of "keto", "okcupid" or "childfree". Only available for post nodes.

  • author_pseudonym (character): An author identifier that is consistent across posts. Only available for post nodes.

and the edge table has columns:

  • from (integer): ID of the post.

  • to (integer): ID of the word.

  • weight (double): Number of times a word was used in a post.
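
The node and edge tables described above can be pulled out of the graph and joined with ordinary dplyr verbs. The following is only a minimal sketch: to stay self-contained it builds a tiny toy graph (the object toy and all of its values are hypothetical) that mimics the documented columns rather than loading reddit itself. It omits the type column and assumes that from and to index rows of the node table, which is how tidygraph treats integer endpoints.

```r
library(tidygraph)
library(dplyr)

# Hypothetical stand-in for `reddit`: two posts and three words.
# Post-only columns are NA for word nodes, as in the documented format.
toy_nodes <- tibble::tibble(
  name          = c("1", "2", "keto", "diet", "date"),
  node_type     = c("post", "post", "word", "word", "word"),
  author_gender = c("female", "male", NA, NA, NA),
  score         = c(5L, 2L, NA, NA, NA),
  subreddit     = c("keto", "okcupid", NA, NA, NA)
)

# Integer `from`/`to` refer to row positions in the node table.
toy_edges <- tibble::tibble(
  from   = c(1L, 1L, 2L),
  to     = c(3L, 4L, 5L),
  weight = c(2, 1, 1)
)

toy <- tbl_graph(nodes = toy_nodes, edges = toy_edges, directed = FALSE)

# Extract each table as a plain tibble.
node_tbl <- toy %>% activate(nodes) %>% as_tibble()
edge_tbl <- toy %>% activate(edges) %>% as_tibble()

# Keep only post nodes, then summarise score by author gender.
post_nodes <- node_tbl %>% filter(node_type == "post")

score_by_gender <- post_nodes %>%
  group_by(author_gender) %>%
  summarise(mean_score = mean(score), n_posts = n())

# Translate edge endpoints back to post and word identifiers.
post_words <- edge_tbl %>%
  mutate(post = node_tbl$name[from], word = node_tbl$name[to]) %>%
  select(post, word, weight)
```

Assuming the package that provides the dataset is attached and data(reddit) has been run, replacing toy with reddit should give the same kinds of tables for the real data.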

Source

Downloaded from https://archive.org/details/reddit_posts_2018 on June 6, 2022.

Details

See https://github.com/blei-lab/causal-text-embeddings for a replication package for Veitch et al. (2020). For additional details on the data, see in particular https://github.com/blei-lab/causal-text-embeddings/blob/master/src/reddit/data_cleaning/BigQuery_get_data.

References

Veitch, Victor, Dhanya Sridhar, and David M. Blei. "Adapting Text Embeddings for Causal Inference." In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), 124:10, 2020.