Getting old Reddit submissions with Pushshift API

Pushshift is a third-party Reddit API that is useful for finding comments and submissions (posts) from the past or that are otherwise archived.

Searching submissions

Searching submissions uses this endpoint:

https://api.pushshift.io/reddit/search/submission/

There are a great number of parameters to better define your search. Here are some important ones:

subreddit the name of the subreddit you want to search.

score submissions with a score (upvotes) equal to, greater than or less than a value.

domain submissions for a domain URL (e.g. youtube.com or imgur.com).

after submissions after a date and time (Unix timestamp).

before submissions before a date and time (Unix timestamp).

sort_type sort posts by value ("score", "num_comments", "created_utc").

sort ascending or descending.

To search post titles, use title. To query posts from a certain user, use author.

size controls the number of results returned; the maximum is 100.
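As a rough sketch of how the parameters above combine into a request URL, the snippet below assembles a query string with PHP's http_build_query(). The helper name build_submission_url is hypothetical, chosen only for illustration, and the parameter values are examples rather than required settings.

```php
<?php
// Hypothetical helper: builds a Pushshift submission search URL from an array of parameters.
function build_submission_url(array $params): string
{
    $base = 'https://api.pushshift.io/reddit/search/submission/';
    // http_build_query() URL-encodes each value and joins the pairs with "&".
    return $base . '?' . http_build_query($params, '', '&');
}

$url = build_submission_url([
    'subreddit' => 'nba',
    'domain'    => 'youtube.com',
    'sort_type' => 'score',
    'sort'      => 'desc',
    'size'      => 100,
]);

echo $url, PHP_EOL;
```

Building the query from an array keeps each parameter on its own line and leaves the URL encoding to PHP.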

Getting YouTube submissions to r/nba from 2008 to 2015

The goal is to query the top 100 youtube.com submissions to r/nba for each month from 2008 to 2015. Unlike today’s rapid-fire, almost-live highlight clips, video posts were still scarce, particularly before 2014.

Define the months and years in arrays:

$months_array = [
    '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12'
];

$years_array = [
    '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015'
];

Loop through each month of every year and verify that it is a valid date:

foreach ($years_array as $y) {
    foreach ($months_array as $m) {
        if (checkdate($m, '01', $y)) {
           //Is valid
        }
    }
}

Set the timezone to UTC (Reddit uses this), then convert the first day of the current month and the first day of the next month into Unix timestamps to get the after and before parameter values.

Querying one month at a time narrows each result set, since a single query returns at most 100 results.

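date_default_timezone_set('UTC');
//Both edges of each monthly window fall at midnight UTC
$after_time = '00:00:00';
$before_time = '00:00:00';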
foreach ($years_array as $y) {
    foreach ($months_array as $m) {
        if (checkdate($m, '01', $y)) {
            $after_dt = new DateTime("{$y}-{$m}-01 {$after_time}");
            $after_unix = $after_dt->getTimestamp();

            $before_dt_str = date('Y-m-d', strtotime("+1 months", strtotime("{$y}-{$m}-01 {$before_time}")));
            $before_dt = new DateTime($before_dt_str);
            $before_unix = $before_dt->getTimestamp();

            echo "{$after_dt->format('Y-m-d')} $before_dt_str<br>";
        }
    }
}

This will output:

2008-01-01 2008-02-01
2008-02-01 2008-03-01
2008-03-01 2008-04-01
2008-04-01 2008-05-01
2008-05-01 2008-06-01
2008-06-01 2008-07-01
2008-07-01 2008-08-01
2008-08-01 2008-09-01
2008-09-01 2008-10-01
2008-10-01 2008-11-01
2008-11-01 2008-12-01
2008-12-01 2009-01-01
2009-01-01 2009-02-01
2009-02-01 2009-03-01
2009-03-01 2009-04-01
2009-04-01 2009-05-01
2009-05-01 2009-06-01
2009-06-01 2009-07-01
2009-07-01 2009-08-01
2009-08-01 2009-09-01
2009-09-01 2009-10-01
......
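The same monthly window can also be computed without relying on the default timezone. The following is a minimal sketch, assuming midnight UTC boundaries; month_window is a hypothetical helper name, not part of the code above.

```php
<?php
// Hypothetical helper: returns [after, before] Unix timestamps for one calendar month in UTC.
function month_window(int $year, int $month): array
{
    $utc = new DateTimeZone('UTC');
    // First day of the month at midnight UTC.
    $start = new DateTimeImmutable(sprintf('%04d-%02d-01 00:00:00', $year, $month), $utc);
    // Adding one month to the 1st always lands on the 1st of the following month.
    $end = $start->modify('+1 month');

    return [$start->getTimestamp(), $end->getTimestamp()];
}

[$after_unix, $before_unix] = month_window(2008, 1);
echo "$after_unix $before_unix", PHP_EOL; // prints "1199145600 1201824000"
```

Passing an explicit DateTimeZone means the result does not change if date_default_timezone_set() is called elsewhere with a different value.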

GET request

All that’s left is to make the GET request to the built Pushshift URL:

$url = "https://api.pushshift.io/reddit/search/submission/?subreddit=nba&after={$after_unix}&before={$before_unix}&metadata=true&domain=youtube.com&sort_type=score&sort=desc&size=100";

You can use a cURL function or simply file_get_contents(). Be mindful of rate limits (sleeping 1 second on each month iteration is suitable):

foreach ($years_array as $y) {
    foreach ($months_array as $m) {
        if (checkdate($m, '01', $y)) {
            $after_dt = new DateTime("{$y}-{$m}-01 {$after_time}");
            $after_unix = $after_dt->getTimestamp();

            $before_dt_str = date('Y-m-d', strtotime("+1 months", strtotime("{$y}-{$m}-01 {$before_time}")));
            $before_dt = new DateTime($before_dt_str);
            $before_unix = $before_dt->getTimestamp();

            $url = "https://api.pushshift.io/reddit/search/submission/?subreddit=nba&after={$after_unix}&before={$before_unix}&metadata=true&domain=youtube.com&sort_type=score&sort=desc&size=100";
            $data = json_decode(file_get_contents($url), true);

            //Do stuff with response

            sleep(1);//Pause between requests to respect rate limits
        }
    }
}
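For the cURL route mentioned earlier, a request could be sketched roughly as follows. The function names fetch_posts and extract_posts are hypothetical, and the 30 second timeout is an arbitrary assumption. Decoding is kept separate from the network call so that a failed request or an unexpected body yields an empty array instead of a warning.

```php
<?php
// Hypothetical helper: pulls the "data" list out of a raw response body.
// Returns an empty array if the request failed or the body is not the expected JSON.
function extract_posts($body): array
{
    if (!is_string($body)) {
        return []; // curl_exec() returns false on failure
    }
    $decoded = json_decode($body, true);
    if (!is_array($decoded) || !isset($decoded['data']) || !is_array($decoded['data'])) {
        return [];
    }
    return $decoded['data'];
}

// Hypothetical helper: performs the GET request with the cURL extension.
function fetch_posts(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FAILONERROR    => true, // treat HTTP status codes >= 400 as failures
        CURLOPT_TIMEOUT        => 30,   // assumed timeout in seconds
    ]);
    $body = curl_exec($ch);

    return extract_posts($body);
}
```

Inside the month loop, $posts = fetch_posts($url); would then stand in for the json_decode(file_get_contents($url), true) line, with $posts already holding the list of submissions.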

Accessing each post from the returned data is done with a foreach loop:

foreach ($data['data'] as $p) {
    $id = $p['id'];//Post id
    $title = $p['title'];//Post title
    $author = $p['author'];//Username
    $url = $p['url'];//Submitted URL
    //etc
}
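If you only need a few fields per post, one option is to reduce each record to a small array. This is a sketch under the assumption that a record might lack one of these keys (the source does not say whether that happens); summarize_post is a hypothetical name.

```php
<?php
// Hypothetical helper: keeps only the fields of interest from one post record.
// The ?? operator substitutes null when a key is absent, avoiding undefined-index warnings.
function summarize_post(array $p): array
{
    return [
        'id'     => $p['id'] ?? null,
        'title'  => $p['title'] ?? null,
        'author' => $p['author'] ?? null,
        'url'    => $p['url'] ?? null,
    ];
}

// Example with a made-up record that has no author or url.
$row = summarize_post(['id' => 'abc', 'title' => 'Example title']);
var_dump($row['author']); // prints NULL
```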