Scraping LivestreamFails Part 1

I do not encourage web scraping, nor will I post full code to do so. This was simply a challenge to myself and is documented as such.

LivestreamFails, the place to view streamers doing stupid/funny/illegal/annoying things, provides a clip which Twitch cannot remove, as well as a points score, an NSFW notice and a link to the Reddit post discussing the “fail”.

Given that these clips are 720p, massively compressed and short, I had thoughts about doing a “backup” of the whole site. This became easier once I realized how the website is laid out.

The first clip is at https://livestreamfails.com/post/1 and the ids run all the way through to the latest post at https://livestreamfails.com/post/28203. The best way to describe this layout is predictable. I know the URL of every LivestreamFails clip thanks to this design.

To scrape every clip I just need to remember the last “id” I did, add 1 to it and repeat the process. Using random strings or the title as the URL would prevent this. I can’t predict https://livestreamfails.com/post/5Hq81B or https://livestreamfails.com/post/Sn1p3r_spills_drink_on_keyboard.
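To make the point concrete, here is a minimal sketch of how predictable the URLs are. It is written in Python rather than the PHP I used, it only builds strings and fetches nothing, and the function names are made up for illustration.

```python
BASE_URL = "https://livestreamfails.com/post/"


def post_url(post_id: int) -> str:
    """Build the page URL for a sequential post id."""
    return f"{BASE_URL}{post_id}"


def next_urls(last_id: int, count: int) -> list:
    """URLs for the `count` ids that follow the last id already done."""
    return [post_url(i) for i in range(last_id + 1, last_id + 1 + count)]


# Knowing only the last id done (28200 here) is enough to list what comes next.
for url in next_urls(28200, 3):
    print(url)
```

With a random-string or title-based URL scheme there is no equivalent of `next_urls`, which is exactly why such a scheme would stop this approach.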

With the looping process done I just needed to fetch data from the page and store it. Using PHP Simple HTML DOM Parser this was easy. The data I needed was:

  • Title
  • Streamer
  • Game
  • Video link (MP4)
  • Thumbnail link
  • Score/upvotes
  • Reddit link
  • NSFW or not?
  • LivestreamFails URL

Once I had the conditions in the DOM parser working correctly, it was just a matter of storing the results in a MySQL database.
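As a rough idea of the storage side, the sketch below maps the fields listed above onto one table. It is not my actual schema: it uses Python's built-in sqlite3 as a stand-in for MySQL so it runs anywhere, and the table name, column names and sample values are all assumptions made up for the example.

```python
import sqlite3

# Hypothetical schema: one column per field in the list above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS clips (
    id            INTEGER PRIMARY KEY,  -- the number at the end of the post URL
    title         TEXT,
    streamer      TEXT,
    game          TEXT,
    video_url     TEXT,
    thumbnail_url TEXT,
    score         INTEGER,
    reddit_url    TEXT,
    nsfw          INTEGER,              -- 0 or 1
    page_url      TEXT
)
"""


def save_clip(conn, clip: dict) -> None:
    """Insert one clip; saving the same id again overwrites the row."""
    conn.execute(
        "INSERT OR REPLACE INTO clips VALUES "
        "(:id, :title, :streamer, :game, :video_url, :thumbnail_url, "
        ":score, :reddit_url, :nsfw, :page_url)",
        clip,
    )
    conn.commit()


def highest_id(conn) -> int:
    """Latest stored id, or 0 when nothing has been stored yet."""
    row = conn.execute("SELECT COALESCE(MAX(id), 0) FROM clips").fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# Placeholder values only, not real scraped data.
save_clip(conn, {
    "id": 1,
    "title": "example title",
    "streamer": "example_streamer",
    "game": "example game",
    "video_url": "https://example.com/clip.mp4",
    "thumbnail_url": "https://example.com/thumb.jpg",
    "score": 0,
    "reddit_url": "https://example.com/reddit-thread",
    "nsfw": 0,
    "page_url": "https://livestreamfails.com/post/1",
})
print(highest_id(conn))
```

Using the post id as the primary key means the "where did I get to?" question is a single `MAX(id)` query, which is what the looping relies on.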

I did get restricted to at most 12 pages per try. To get around this I limited each loop to 10 pages, slept for 5 seconds and then did another 10. This process took around 25 seconds. Staying under 30 seconds was important so that the cron job service got the impression it was a live link and didn’t stop it.

The process

The overall process looked like this, with a cron job running every minute:

  1. Fetch the highest id from the database (the latest scrape identifier)
  2. Add 1 to it and loop from this number through the next 10 ids (10 entries)
  3. Sleep 5 seconds
  4. Repeat steps 1 and 2 once more
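The four steps above can be sketched as a single function that one cron run would call. This is a Python outline rather than my PHP, and it deliberately leaves out the scraping: `scrape_page` is a hypothetical callback, and the sketch tracks the last id in memory instead of querying the database again before the second batch.

```python
import time


def run_once(last_id, scrape_page, batch_size=10, batches=2,
             pause=5.0, sleep=time.sleep):
    """One cron run: handle `batches` groups of `batch_size` consecutive ids,
    pausing between groups. Returns the new highest id."""
    for batch in range(batches):
        for post_id in range(last_id + 1, last_id + 1 + batch_size):
            scrape_page(post_id)
        last_id += batch_size
        # Pause between groups only, not after the final one.
        if batch < batches - 1:
            sleep(pause)
    return last_id


# Dry run: record the ids instead of fetching pages, and skip the real sleep.
seen = []
new_last = run_once(100, seen.append, sleep=lambda seconds: None)
print(seen[0], seen[-1], new_last)  # 101 120 120
```

Passing `sleep` in as a parameter is only there so the dry run finishes instantly. Left at its default, the function really does wait 5 seconds between the two groups.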

This was getting 28,800 results every 24 hours. LivestreamFails has around 28,000 clips, so it’s good enough.
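The 28,800 figure is just the batch sizes multiplied out. A quick check in Python, using only numbers already given in this post (the constant names are mine):

```python
import math

PAGES_PER_BATCH = 10
BATCHES_PER_RUN = 2
RUNS_PER_DAY = 24 * 60        # the cron job fires once a minute

pages_per_run = PAGES_PER_BATCH * BATCHES_PER_RUN
pages_per_day = pages_per_run * RUNS_PER_DAY
print(pages_per_day)          # 28800

LATEST_POST_ID = 28203        # latest post id mentioned earlier
runs_needed = math.ceil(LATEST_POST_ID / pages_per_run)
print(runs_needed)            # 1411 one-minute runs, just under a full day
```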

Once I have all the details for each clip I can move on to the second part, which is downloading, renaming and storing the MP4 files.