Link to our research paper: https://ojs.aaai.org/index.php/ICWSM/article/view/31338
Due to the large size of the raw dataset (~46GB), it is not included in this repository. If you would like access to the dataset, please contact us via email:
- Kevin Leach: [email protected]
- Yu Huang: [email protected]
- Haonan Hou: [email protected]
This repository contains sample data from the study. The dataset is structured as follows:
- `pre_2022/` – Raw Reddit post data in JSON format. Each file follows the Reddit API structure, containing:
  - `title`, `selftext`: The original post content.
  - `num_comments`: Number of comments on the post.
  - `comments`: A list of comment objects, each containing metadata such as `author`, `score`, and `body`.
- `post_2022/` – Preprocessed Reddit data with numerical labels:
  - `99999` indicates the original post (OP's text).
  - Any other number represents the upvote count of the corresponding comment.
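
As a rough illustration of the two layouts described above, the following Python sketch reads one post record and separates the OP text from labeled comments. The sample record is invented for illustration, and the helper names `summarize_post` and `split_labeled` are hypothetical. The sketch also assumes the preprocessed data can be read as `(label, text)` pairs; the actual on-disk layout of `post_2022/` may differ.

```python
import json

OP_LABEL = 99999  # label that marks the original post in the preprocessed data


def summarize_post(post: dict) -> dict:
    """Pull the documented fields out of one pre_2022-style post record."""
    comments = post.get("comments", [])
    # Sort comments by upvote score, highest first; a missing score counts as 0.
    ranked = sorted(comments, key=lambda c: c.get("score", 0), reverse=True)
    text = f'{post.get("title", "")}\n{post.get("selftext", "")}'.strip()
    return {
        "text": text,
        "num_comments": post.get("num_comments", len(comments)),
        "top_comment": ranked[0].get("body") if ranked else None,
    }


def split_labeled(rows):
    """Separate the OP text from (upvotes, comment text) pairs.

    Assumption: rows is an iterable of (label, text) pairs.
    """
    op_text = None
    comments = []
    for label, text in rows:
        if label == OP_LABEL:
            op_text = text
        else:
            comments.append((label, text))
    return op_text, comments


# Hypothetical record, invented for illustration only.
sample = json.loads("""
{
  "title": "Need advice",
  "selftext": "My partner and I keep arguing.",
  "num_comments": 2,
  "comments": [
    {"author": "user_a", "score": 3, "body": "Give it time."},
    {"author": "user_b", "score": 12, "body": "Talk to them directly."}
  ]
}
""")

summary = summarize_post(sample)
print(summary["num_comments"], summary["top_comment"])
```

To read a real file, replace the inline sample with `json.load` on an open file from `pre_2022/`.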
The repository includes scripts for data collection, OpenAI API evaluation, and analysis:
- `reddit_scraper.py` – Scrapes Reddit posts and comments from relevant subreddits.
- `pure_requester.py` – Sends raw relationship posts and advice to OpenAI for ranking. NOTE: this script might need an update.
- `variant_requester.py` – A variation of `pure_requester.py`, incorporating additional contextual topics.
- `comment_length_preference.py` – Analyzes GPT's ranking preferences based on comment length.
- `IRA_analysis.py` – Measures inter-rater agreement between GPT rankings and human rankings.
- `randomness_check.py` – Evaluates the consistency of GPT-generated rankings.
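
To give a sense of what comparing a GPT ranking with a human ranking can look like, here is a minimal sketch that computes Spearman's rank correlation for two tie-free rankings. This README does not state which agreement statistic `IRA_analysis.py` uses, so Spearman's rho is an assumption chosen for illustration, and the function name and comment ids are hypothetical.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same items.

    Each argument lists item ids from best to worst.
    """
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two items")
    if len(set(rank_a)) != n or sorted(rank_a) != sorted(rank_b):
        raise ValueError("rankings must cover the same items, without repeats")
    # Position of each item in the second ranking.
    pos_b = {item: i for i, item in enumerate(rank_b)}
    # Sum of squared position differences across all items.
    d_squared = sum((i - pos_b[item]) ** 2 for i, item in enumerate(rank_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical comment ids: the human order comes from upvotes,
# the GPT order from a model response.
human_order = ["c1", "c2", "c3", "c4"]
gpt_order = ["c2", "c1", "c3", "c4"]
print(spearman_rho(human_order, gpt_order))
```

A value near 1 means the two rankings largely agree, and a value near -1 means they are close to reversed. Rankings with ties would need a tie-aware statistic instead.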
The repository also includes the analysis results for each research question. For more details, please refer to the results and analysis section of our research paper.