Wikipedia Edit Dataset

Description

This public dataset contains one month of edits made by users on Wikipedia pages. We selected the 1,000 most edited pages as items and editors who made at least 5 edits as users (a total of 8,227 users). This yields 157,474 interactions. Similar to the Reddit dataset, we convert the edit text into a LIWC-feature vector.

Dataset Statistics

Users                   8,227
Items                   1,000
Interactions            157,474
Node Labels             Exist
Node Features           None
Edge Labels             None
Edge Features           Exist
Action Repetition (%)   61


References

Srijan Kumar, Xikun Zhang, and Jure Leskovec. 2019. Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD ’19). Association for Computing Machinery, New York, NY, USA, 1269–1278. DOI:https://doi.org/10.1145/3292500.3330895

 @inproceedings{kumar2019predicting,
	title={Predicting Dynamic Embedding Trajectory in Temporal Interaction Networks},
	author={Kumar, Srijan and Zhang, Xikun and Leskovec, Jure},
	booktitle={Proceedings of the 25th ACM SIGKDD international conference on Knowledge discovery and data mining},
	year={2019},
	organization={ACM}
 }

Files

  1. wikipedia.tsv
    • Tab-separated file. Header: (row_id, user_id, item_id, timestamp)
    • All user ids & item ids are normalized starting from 0.
    • Sorted by timestamp.
    • All NaN values are removed from data.
    • Example
row_id	user_id	item_id	timestamp
0	0	0	0
1	1	1	54
2	1	2	306
3	2	3	479
......
  2. wikipedia_user_mapping.tsv & wikipedia_item_mapping.tsv
original_id	mapped_id
0	0
1	1
4	2
7	3
......
  3. wikipedia_edge_features.tsv
row_id	edge_feature_0	edge_feature_1
0	0.5	1.0
1	-0.5	1.0
......
  4. wikipedia_node_labels.tsv
user_id	timestamp	state_label
0	0.000	0
1	36.000	0
1	77.000	1
......
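
Loading example

The file layouts above can be read with Python's standard csv module. The following is a minimal sketch, not an official loader: it uses only the sample rows shown in this README (held in memory via io.StringIO), treats the sample mapping as the user mapping, and keeps original ids as strings because their type is an assumption. The helper name read_tsv and the variable names are illustrative.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Sample rows copied from the examples in this README. For the real files,
# pass open("wikipedia.tsv", newline="") instead of io.StringIO(...).
interactions_tsv = (
    "row_id\tuser_id\titem_id\ttimestamp\n"
    "0\t0\t0\t0\n"
    "1\t1\t1\t54\n"
    "2\t1\t2\t306\n"
    "3\t2\t3\t479\n"
)
user_mapping_tsv = (
    "original_id\tmapped_id\n"
    "0\t0\n"
    "1\t1\n"
    "4\t2\n"
    "7\t3\n"
)

interactions = read_tsv(io.StringIO(interactions_tsv))
user_mapping = read_tsv(io.StringIO(user_mapping_tsv))

# Invert the mapping: normalized id -> original id (kept as strings,
# since the type of the original ids is an assumption here).
mapped_to_original = {int(r["mapped_id"]): r["original_id"] for r in user_mapping}

# Timestamps are stored as text; parse as float in case they are fractional.
timestamps = [float(r["timestamp"]) for r in interactions]
is_sorted = all(a <= b for a, b in zip(timestamps, timestamps[1:]))

# Recover the original user id of every interaction.
original_users = [mapped_to_original[int(r["user_id"])] for r in interactions]

print(is_sorted)
print(original_users)
```

The same read_tsv helper should work for the edge feature and node label files, provided they are tab-separated with a single header row as shown above.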

Contacts

Sejoon Oh, soh337@gatech.edu, Georgia Institute of Technology
Srijan Kumar, srijan@gatech.edu, Georgia Institute of Technology