Pushshift is a third-party Reddit API that is useful for finding comments and submissions (posts) from the past or that are otherwise archived.
Searching submissions
Searching submissions uses this endpoint:
https://api.pushshift.io/reddit/search/submission/
There are a great number of parameters to refine your search. Here are some important ones:
subreddit
the name of the subreddit you want to search.
score
submissions with a score (upvotes) equal to, greater than, or less than a given value.
domain
submissions for a domain URL (eg youtube.com or imgur.com).
after
submissions after a date and time (unix format).
before
submissions before date and time (unix format).
sort_type
sort posts by value ("score", "num_comments", "created_utc").
sort
ascending or descending.
To search post titles, use title. To query posts from a certain user, use author.
size
controls the number of results returned; the maximum is 100.
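As a rough sketch of how these parameters combine into a request URL, the snippet below assembles a query string with PHP's http_build_query(). The function name build_submission_url() is a made-up helper for illustration, not something Pushshift provides, and the parameter values are only examples.

```php
<?php
// Hypothetical helper (not part of Pushshift): builds a submission search URL.
function build_submission_url(array $params): string
{
    $endpoint = 'https://api.pushshift.io/reddit/search/submission/';
    // http_build_query() URL-encodes each value and joins the pairs with "&".
    return $endpoint . '?' . http_build_query($params, '', '&');
}

// Example values only: top-scoring youtube.com submissions in r/nba.
$url = build_submission_url([
    'subreddit' => 'nba',
    'domain'    => 'youtube.com',
    'sort_type' => 'score',
    'sort'      => 'desc',
    'size'      => 100,
]);

echo $url, PHP_EOL;
```

Keeping the parameters in an array makes it easy to add or drop one (for instance after and before) without hand-editing a long URL string.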
Getting YouTube submissions to r/nba from 2008 to 2015
The goal is to query the top 100 posts for each month (2008 to 2015) at r/nba that were youtube.com submissions. Unlike today's rapid-fire, almost-live highlight clips, video posts were still scarce, particularly before 2014.
Define the months and years in arrays:
$months_array = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12'];
$years_array = ['2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015'];
Loop through each month of every year and verify that it is a valid date:
foreach ($years_array as $y) {
    foreach ($months_array as $m) {
        if (checkdate($m, '01', $y)) {
            // Is valid
        }
    }
}
Set the timezone to UTC (Reddit uses this) and convert the first day of the current month and the first day of the following month into Unix format to get the after and before parameter values.
Doing this narrows each query to a single month, which helps because you can only get a maximum of 100 results per query.
date_default_timezone_set('UTC');

// Time of day appended to each date; midnight UTC is assumed here
$after_time = '00:00:00';
$before_time = '00:00:00';

foreach ($years_array as $y) {
    foreach ($months_array as $m) {
        if (checkdate($m, '01', $y)) {
            // Start of the window: first day of this month
            $after_dt = new DateTime("{$y}-{$m}-01 {$after_time}");
            $after_unix = $after_dt->getTimestamp();

            // End of the window: first day of the next month
            $before_dt_str = date('Y-m-d', strtotime("+1 months", strtotime("{$y}-{$m}-01 {$before_time}")));
            $before_dt = new DateTime($before_dt_str);
            $before_unix = $before_dt->getTimestamp();

            echo "{$after_dt->format('Y-m-d')} $before_dt_str<br>";
        }
    }
}
This will output:
2008-01-01 2008-02-01
2008-02-01 2008-03-01
2008-03-01 2008-04-01
2008-04-01 2008-05-01
2008-05-01 2008-06-01
2008-06-01 2008-07-01
2008-07-01 2008-08-01
2008-08-01 2008-09-01
2008-09-01 2008-10-01
2008-10-01 2008-11-01
2008-11-01 2008-12-01
2008-12-01 2009-01-01
2009-01-01 2009-02-01
2009-02-01 2009-03-01
2009-03-01 2009-04-01
2009-04-01 2009-05-01
2009-05-01 2009-06-01
2009-06-01 2009-07-01
2009-07-01 2009-08-01
2009-08-01 2009-09-01
2009-09-01 2009-10-01
......
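The same month boundaries can also be computed without relying on the global default timezone. The following is a minimal sketch using DateTimeImmutable with an explicit UTC zone; month_window() is a hypothetical helper name chosen for this example.

```php
<?php
// Hypothetical helper: returns [after, before] Unix timestamps for one calendar month in UTC.
function month_window(int $year, int $month): array
{
    $utc = new DateTimeZone('UTC');
    // First day of the requested month at midnight UTC.
    $start = new DateTimeImmutable(sprintf('%04d-%02d-01 00:00:00', $year, $month), $utc);
    // First day of the following month; modify() returns a new object.
    $end = $start->modify('+1 month');
    return [$start->getTimestamp(), $end->getTimestamp()];
}

[$after, $before] = month_window(2008, 1);
echo "$after $before", PHP_EOL; // 1199145600 1201824000
```

Because the start date is always the first of the month, adding one month never overflows into a later month, so December rolls over cleanly into January of the next year.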
GET request
All that’s left now is to add in the GET API call for the built Pushshift URL:
$url = "https://api.pushshift.io/reddit/search/submission/?subreddit=nba&after={$after_unix}&before={$before_unix}&metadata=true&domain=youtube.com&sort_type=score&sort=desc&size=100";
You can use a cURL function or simply file_get_contents(). Be mindful of rate limits: sleeping for 1 second on each pass through the month loop is suitable.
// Time of day appended to each date; midnight UTC is assumed here
$after_time = '00:00:00';
$before_time = '00:00:00';

foreach ($years_array as $y) {
    foreach ($months_array as $m) {
        if (checkdate($m, '01', $y)) {
            $after_dt = new DateTime("{$y}-{$m}-01 {$after_time}");
            $after_unix = $after_dt->getTimestamp();

            $before_dt_str = date('Y-m-d', strtotime("+1 months", strtotime("{$y}-{$m}-01 {$before_time}")));
            $before_dt = new DateTime($before_dt_str);
            $before_unix = $before_dt->getTimestamp();

            $url = "https://api.pushshift.io/reddit/search/submission/?subreddit=nba&after={$after_unix}&before={$before_unix}&metadata=true&domain=youtube.com&sort_type=score&sort=desc&size=100";

            $data = json_decode(file_get_contents($url), true);
            // Do stuff with response

            // Pause between requests to stay within rate limits
            sleep(1);
        }
    }
}
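For the cURL route mentioned above, a sketch along these lines may be a starting point. Both fetch_url() and parse_submissions() are hypothetical helpers written for this example, and the 10-second timeout is an arbitrary assumption. Splitting fetching from decoding means a failed request or malformed body yields an empty list instead of a warning inside the loop.

```php
<?php
// Hypothetical helper: GET a URL with cURL; returns the body, or false on error or non-200 status.
function fetch_url(string $url, int $timeout = 10)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_TIMEOUT        => $timeout, // assumed value, adjust as needed
        CURLOPT_FOLLOWLOCATION => true,
    ]);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return ($body !== false && $status === 200) ? $body : false;
}

// Hypothetical helper: decodes a response body and returns its "data" list, or [] on failure.
function parse_submissions($body): array
{
    if (!is_string($body) || $body === '') {
        return [];
    }
    $decoded = json_decode($body, true);
    if (!is_array($decoded) || !isset($decoded['data']) || !is_array($decoded['data'])) {
        return [];
    }
    return $decoded['data'];
}

// Inside the month loop this would be used as:
// $posts = parse_submissions(fetch_url($url));
```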
Accessing each post from the returned data is done with a foreach loop:
foreach ($data['data'] as $p) {
    $p['id'];     // Post id
    $p['title'];  // Post title
    $p['author']; // Username
    $p['url'];    // Submitted URL
    // etc
}
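To keep only the fields of interest from each post, something like the sketch below could be used. pick_fields() is a hypothetical helper, and the sample array is made-up data shaped like $data['data'], not real API output.

```php
<?php
// Hypothetical helper: keeps only the listed fields from each post, using null when a field is absent.
function pick_fields(array $posts, array $fields): array
{
    $rows = [];
    foreach ($posts as $p) {
        $row = [];
        foreach ($fields as $f) {
            $row[$f] = $p[$f] ?? null;
        }
        $rows[] = $row;
    }
    return $rows;
}

// Made-up sample shaped like $data['data']; real responses carry many more fields.
$sample = [
    ['id' => 'abc123', 'title' => 'Example highlight', 'author' => 'example_user',
     'url' => 'https://www.youtube.com/watch?v=example', 'score' => 10],
    ['id' => 'def456', 'title' => 'Another clip',
     'url' => 'https://www.youtube.com/watch?v=example2'],
];

$rows = pick_fields($sample, ['id', 'title', 'author', 'url']);
print_r($rows);
```

Using the null coalescing operator avoids undefined-index warnings when a post lacks one of the requested fields.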