
Getting old Reddit submissions with Pushshift API

Pushshift is a third-party Reddit API that is useful for finding past or otherwise archived comments and submissions (posts).

Searching submissions

Searching submissions uses this endpoint:

https://api.pushshift.io/reddit/search/submission/

There are a great number of parameters to narrow your search. Here are some important ones:

subreddit the name of the subreddit you want to search.

score submissions with a score (upvotes) equal to (=), greater than (>) or less than (<) a given value.

domain submissions linking to a domain (e.g. youtube.com or imgur.com).

after submissions after a date and time (Unix timestamp).

before submissions before a date and time (Unix timestamp).

sort_type sort posts by a value ("score", "num_comments", "created_utc").

sort sort direction, ascending or descending.

To search post titles use title. To query posts from a certain user use author.

size controls the number of results returned. The maximum is 100.
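As a rough sketch of how these parameters combine into a request URL, the query string can be assembled with PHP's http_build_query(). The helper name buildSubmissionSearchUrl() and the example values are assumptions made for illustration; they are not part of Pushshift.

```php
<?php
// Hypothetical helper (the name is an assumption): builds a submission search
// URL from an array of the parameters described above.
function buildSubmissionSearchUrl(array $params): string
{
    $endpoint = 'https://api.pushshift.io/reddit/search/submission/';

    // http_build_query() URL-encodes each value and joins the pairs with "&"
    return $endpoint . '?' . http_build_query($params, '', '&');
}

$url = buildSubmissionSearchUrl([
    'subreddit' => 'nba',
    'domain'    => 'youtube.com',
    'sort_type' => 'score',
    'sort'      => 'desc',
    'size'      => 100,
]);

echo $url, PHP_EOL;
```

Keeping the parameters in an array makes it easy to add after and before later without editing a long string by hand.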

Getting YouTube submissions to r/nba from 2008 to 2015

The goal is to query the top 100 youtube.com submissions to r/nba for each month from 2008 to 2015. Unlike today’s rapid-fire, almost-live highlight clips, video posts were still scarce back then, particularly before 2014.

Define the months and years in arrays:

$months_array = [
    '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12'
];

$years_array = [
    '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015'
];

Loop through each month of every year and verify that it is a valid date:

foreach ($years_array as $y) {
    foreach ($months_array as $m) {
        if (checkdate($m, '01', $y)) {
           //Is valid
        }
    }
}

Set the timezone to UTC (which Reddit uses), then convert the first day of each month and the first day of the following month into Unix timestamps to get the after and before parameter values.

This narrows each query to a single month, which matters because a query returns at most 100 results.

date_default_timezone_set('UTC');

//Both window boundaries fall at midnight UTC
$after_time = '00:00:00';
$before_time = '00:00:00';

foreach ($years_array as $y) {
    foreach ($months_array as $m) {
        if (checkdate($m, '01', $y)) {
            //Start of the month
            $after_dt = new DateTime("{$y}-{$m}-01 {$after_time}");
            $after_unix = $after_dt->getTimestamp();

            //Start of the following month
            $before_dt_str = date('Y-m-d', strtotime("+1 months", strtotime("{$y}-{$m}-01 {$before_time}")));
            $before_dt = new DateTime($before_dt_str);
            $before_unix = $before_dt->getTimestamp();

            echo "{$after_dt->format('Y-m-d')} $before_dt_str<br>";
        }
    }
}

This will output:

2008-01-01 2008-02-01
2008-02-01 2008-03-01
2008-03-01 2008-04-01
2008-04-01 2008-05-01
2008-05-01 2008-06-01
2008-06-01 2008-07-01
2008-07-01 2008-08-01
2008-08-01 2008-09-01
2008-09-01 2008-10-01
2008-10-01 2008-11-01
2008-11-01 2008-12-01
2008-12-01 2009-01-01
2009-01-01 2009-02-01
2009-02-01 2009-03-01
2009-03-01 2009-04-01
2009-04-01 2009-05-01
2009-05-01 2009-06-01
2009-06-01 2009-07-01
2009-07-01 2009-08-01
2009-08-01 2009-09-01
2009-09-01 2009-10-01
......
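The same month boundaries can also be computed in a small function, which makes the date arithmetic easier to check in isolation. This is only a sketch: the name monthWindow() and the shape of its return value are assumptions, not something from the code above.

```php
<?php
// Hypothetical helper (the name is an assumption): returns the Unix timestamps
// that bound one calendar month in UTC, for use as the after and before values.
function monthWindow(int $year, int $month): array
{
    $utc = new DateTimeZone('UTC');

    // Midnight on the first day of the requested month
    $start = new DateTimeImmutable(sprintf('%04d-%02d-01 00:00:00', $year, $month), $utc);

    // Midnight on the first day of the following month; December rolls into the next year
    $end = $start->modify('+1 month');

    return [
        'after'  => $start->getTimestamp(),
        'before' => $end->getTimestamp(),
    ];
}

$window = monthWindow(2008, 12);

// Prints "2008-12-01 2009-01-01"
echo gmdate('Y-m-d', $window['after']), ' ', gmdate('Y-m-d', $window['before']), PHP_EOL;
```

Passing the timezone explicitly means the result does not depend on date_default_timezone_set() having been called first.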

GET request

All that is left now is to make the GET request to the Pushshift URL built from these values:

$url = "https://api.pushshift.io/reddit/search/submission/?subreddit=nba&after={$after_unix}&before={$before_unix}&metadata=true&domain=youtube.com&sort_type=score&sort=desc&size=100";

You can use a cURL function or simply file_get_contents(). Be mindful of rate limits; sleeping for 1 second after each month’s request is suitable:

date_default_timezone_set('UTC');

//Both window boundaries fall at midnight UTC
$after_time = '00:00:00';
$before_time = '00:00:00';

foreach ($years_array as $y) {
    foreach ($months_array as $m) {
        if (checkdate($m, '01', $y)) {
            $after_dt = new DateTime("{$y}-{$m}-01 {$after_time}");
            $after_unix = $after_dt->getTimestamp();

            $before_dt_str = date('Y-m-d', strtotime("+1 months", strtotime("{$y}-{$m}-01 {$before_time}")));
            $before_dt = new DateTime($before_dt_str);
            $before_unix = $before_dt->getTimestamp();

            $url = "https://api.pushshift.io/reddit/search/submission/?subreddit=nba&after={$after_unix}&before={$before_unix}&metadata=true&domain=youtube.com&sort_type=score&sort=desc&size=100";
            $response = file_get_contents($url);

            //file_get_contents() returns false when the request fails
            if ($response !== false) {
                $data = json_decode($response, true);

                //Do stuff with response
            }

            sleep(1);//Pause between requests to respect rate limits
        }
    }
}
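For the cURL route mentioned above, a minimal sketch might look like the following. The helper names fetchBody() and extractPosts() are assumptions, as are the 30 second timeout and the choice to treat any non-200 status as a failure. It also assumes the endpoint is reachable and returns JSON with a top-level data key, as in the loop above.

```php
<?php
// Hypothetical helper (the name is an assumption): fetches a URL with cURL and
// returns the response body, or null when the request fails.
function fetchBody(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // assumed timeout of 30 seconds
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Treat transport errors and non-200 responses the same way
    return ($body !== false && $status === 200) ? $body : null;
}

// Hypothetical helper: decodes a response body and returns the list of posts
// under the "data" key, or an empty array when the body is missing or malformed.
function extractPosts(?string $body): array
{
    if ($body === null) {
        return [];
    }

    $decoded = json_decode($body, true);

    return (is_array($decoded) && isset($decoded['data']) && is_array($decoded['data']))
        ? $decoded['data']
        : [];
}

// Usage inside the month loop:
// $posts = extractPosts(fetchBody($url));
// sleep(1);
```

Separating the fetch from the decoding means a failed request simply yields an empty list for that month rather than a warning on a missing array key.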

Accessing each post from the returned data is done with a foreach loop:

foreach ($data['data'] as $p) {
    $id = $p['id'];//Post id
    $title = $p['title'];//Post title
    $author = $p['author'];//Username
    $url = $p['url'];//Submitted URL
    //etc
}
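If only a few fields are needed, each post can be reduced to those fields before storing it. The sketch below is one way to do that; pickPostFields() is a hypothetical name, and the sample array is made up to mirror the shape of a decoded response rather than taken from real API output.

```php
<?php
// Hypothetical helper (the name is an assumption): keeps only the fields used
// above, falling back to null when a key is absent from the post.
function pickPostFields(array $post): array
{
    return [
        'id'     => $post['id'] ?? null,
        'title'  => $post['title'] ?? null,
        'author' => $post['author'] ?? null,
        'url'    => $post['url'] ?? null,
    ];
}

// Made-up sample in the same shape as a decoded response, not real API output
$data = [
    'data' => [
        [
            'id'     => 'abc123',
            'title'  => 'Example highlight',
            'author' => 'example_user',
            'url'    => 'https://www.youtube.com/watch?v=example',
            'score'  => 10,
        ],
        ['id' => 'def456', 'title' => 'Post with no author key'],
    ],
];

$rows = array_map('pickPostFields', $data['data']);
print_r($rows);
```

The null-coalescing operator keeps the loop from emitting warnings if a post turns out to lack one of the keys.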

