Reddit Post Dataset

Description

This public dataset contains one month of posts made by users on subreddits. We selected the 1,000 most active subreddits as items and the 10,000 most active users, which results in 672,447 interactions. We convert the text of each post into a feature vector representing its LIWC (Linguistic Inquiry and Word Count) categories.
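The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `filter_most_active`, the use of raw post counts as the activity measure, and the choice to count users and subreddits on the unfiltered data are all assumptions, and the toy posts are made up.

```python
from collections import Counter


def filter_most_active(posts, num_users, num_items):
    """Keep posts whose user and subreddit are both among the most active.

    `posts` is a list of (user, subreddit) pairs. Activity is measured as the
    raw number of posts (an assumption; the README does not define it).
    """
    user_counts = Counter(user for user, _ in posts)
    item_counts = Counter(item for _, item in posts)
    top_users = {user for user, _ in user_counts.most_common(num_users)}
    top_items = {item for item, _ in item_counts.most_common(num_items)}
    return [(u, i) for u, i in posts if u in top_users and i in top_items]


# Toy data, invented for illustration only.
posts = [
    ("alice", "python"),
    ("alice", "python"),
    ("bob", "python"),
    ("bob", "golang"),
    ("carol", "rust"),
]
kept = filter_most_active(posts, num_users=2, num_items=1)
```

For the dataset described here, the call would use `num_users=10000` and `num_items=1000`.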

Dataset Statistics

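| Users  | Items | Interactions | Node Labels | Node Features | Edge Labels | Edge Features | Action Repetition (%) |
|--------|-------|--------------|-------------|---------------|-------------|---------------|-----------------------|
| 10,000 | 984   | 672,447      | Exist       | None          | None        | Exist         | 79                    |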
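The counts in this table could be recomputed from the interaction list along the following lines. This is a sketch under a stated assumption: "action repetition" is taken here to mean the share of interactions in which a user interacts with the same item as in that user's previous interaction. The README does not define the term, so treat that reading, and the helper name `dataset_statistics`, as hypothetical.

```python
def dataset_statistics(interactions):
    """Compute summary counts from (user_id, item_id) pairs in time order."""
    users = {user for user, _ in interactions}
    items = {item for _, item in interactions}

    last_item = {}  # most recent item seen for each user
    repeats = 0
    for user, item in interactions:
        # Assumed definition: a repeat is the same item as the user's previous one.
        if last_item.get(user) == item:
            repeats += 1
        last_item[user] = item

    total = len(interactions)
    return {
        "users": len(users),
        "items": len(items),
        "interactions": total,
        "action_repetition_pct": 100.0 * repeats / total if total else 0.0,
    }


# The first four pairs follow the reddit.tsv sample rows; the last two are
# invented so that the toy input contains repeated actions.
sample = [(0, 0), (1, 1), (1, 2), (2, 3), (1, 2), (0, 0)]
stats = dataset_statistics(sample)
```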


References

Srijan Kumar, Xikun Zhang, and Jure Leskovec. 2019. Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19). Association for Computing Machinery, New York, NY, USA, 1269–1278. DOI:https://doi.org/10.1145/3292500.3330895

@inproceedings{kumar2019predicting,
	title={Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks},
	author={Kumar, Srijan and Zhang, Xikun and Leskovec, Jure},
	booktitle={Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
	year={2019},
	organization={ACM}
}

Files

  1. reddit.tsv
row_id	user_id	item_id	timestamp
0	0	0	0
1	1	1	54
2	1	2	306
3	2	3	479
......
  2. reddit_user_mapping.tsv and reddit_item_mapping.tsv
original_id	mapped_id
0	0
1	1
4	2
7	3
......
  3. reddit_edge_features.tsv
row_id	edge_feature_0	edge_feature_1
0	0.5	1.0
1	-0.5	1.0
......
  4. reddit_node_labels.tsv
user_id	timestamp	state_label
0	0.000	0
1	36.000	0
1	77.000	1
......
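The file layouts above can be read with the Python standard library. The sketch below is illustrative only: `read_tsv` and `load_interactions` are hypothetical helper names, and it assumes that `row_id` in reddit_edge_features.tsv refers to the row with the same `row_id` in reddit.tsv, which the README does not state explicitly. The in-memory strings copy the sample rows shown above.

```python
import csv
import io


def read_tsv(stream):
    """Parse a tab-separated stream with a header row into a list of dicts."""
    rows = csv.reader(stream, delimiter="\t")
    header = [name.strip() for name in next(rows)]
    return [dict(zip(header, (cell.strip() for cell in row))) for row in rows if row]


def load_interactions(interaction_stream, feature_stream):
    """Attach edge features to interactions, matching rows on row_id (assumed key)."""
    features = {
        int(row["row_id"]): [
            float(value) for name, value in row.items()
            if name.startswith("edge_feature_")
        ]
        for row in read_tsv(feature_stream)
    }
    interactions = []
    for row in read_tsv(interaction_stream):
        interactions.append({
            "user_id": int(row["user_id"]),
            "item_id": int(row["item_id"]),
            "timestamp": float(row["timestamp"]),
            "edge_features": features.get(int(row["row_id"]), []),
        })
    return interactions


# Sample rows copied from the listings above, held in memory for the demo.
interactions_tsv = io.StringIO(
    "row_id\tuser_id\titem_id\ttimestamp\n0\t0\t0\t0\n1\t1\t1\t54\n"
)
features_tsv = io.StringIO(
    "row_id\tedge_feature_0\tedge_feature_1\n0\t0.5\t1.0\n1\t-0.5\t1.0\n"
)
loaded = load_interactions(interactions_tsv, features_tsv)
```

With the real files, the two streams would be opened with `open("reddit.tsv", newline="")` and `open("reddit_edge_features.tsv", newline="")`. The feature parsing does not depend on the number of `edge_feature_*` columns, since the sample shows only the first two.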

Contacts

Sejoon Oh, soh337@gatech.edu, Georgia Institute of Technology
Srijan Kumar, srijan@gatech.edu, Georgia Institute of Technology