I do not encourage web scraping, nor will I post the full code to do it; this was simply a challenge to myself and is documented as such.
LivestreamFails, the place to view streamers doing stupid/funny/illegal/annoying things, provides a copy of each clip that Twitch cannot remove, along with a points score, an NSFW notice and a link to the Reddit post discussing the “fail”.
Given that these clips are 720p, heavily compressed and short in length, I had thoughts about doing a “back up” of the whole site. This vision became much more achievable once you see how the website is laid out.
The first clip is at https://livestreamfails.com/post/1
all the way through to the latest post at https://livestreamfails.com/post/28203
The best way to describe this layout is predictable: thanks to this design, I know the URL of every LivestreamFails clip.
To scrape every clip I just need to remember the last “id” I processed, add 1 to it, and repeat the process. Using random strings or the title as the URL would prevent this: I can’t predict https://livestreamfails.com/post/5Hq81B or “https://livestreamfails.com/post/Sn1p3r_spills_drink_on_keyboard”.
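As a minimal sketch of the idea (assuming nothing beyond the URL pattern above), the whole trick is integer arithmetic:

```php
<?php
// Map a numeric post ID to its URL - the whole "predictable layout"
// trick is that this mapping never changes.
function postUrl(int $id): string {
    return "https://livestreamfails.com/post/$id";
}

// Given the last ID scraped, the next ten targets are just +1 .. +10.
$lastId = 28203; // in practice this comes from the database
for ($i = 1; $i <= 10; $i++) {
    echo postUrl($lastId + $i), PHP_EOL;
}
```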
With the looping process sorted, I just needed to fetch data from each page and store it. Using the PHP Simple HTML DOM Parser this was easily done (a sketch follows the list). The data I needed was:
- Title
- Streamer
- Game
- Video link (MP4)
- Thumbnail link
- Score/upvotes
- Reddit link
- NSFW or not?
- LivestreamFails URL
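Roughly, the per-page scrape looked like the sketch below. file_get_html() and find() are the parser’s real API, but the CSS selectors are placeholders rather than the site’s actual class names, and error handling is left out:

```php
<?php
require_once 'simple_html_dom.php'; // PHP Simple HTML DOM Parser

// Fetch one post page and pull out the fields listed above.
// The selectors are placeholders: the real ones depend on the
// site's markup at the time.
function scrapePost(int $id): ?array {
    $html = file_get_html(postUrl($id));
    if (!$html) {
        return null; // missing post or failed request
    }
    return [
        'title'     => $html->find('.post-title', 0)->plaintext,
        'streamer'  => $html->find('.post-streamer a', 0)->plaintext,
        'game'      => $html->find('.post-game a', 0)->plaintext,
        'video'     => $html->find('video source', 0)->src,  // MP4 link
        'thumbnail' => $html->find('video', 0)->poster,
        'score'     => (int) $html->find('.post-score', 0)->plaintext,
        'reddit'    => $html->find('a.reddit-link', 0)->href,
        'nsfw'      => $html->find('.nsfw-label', 0) !== null,
        'url'       => postUrl($id),
    ];
}
```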
Once I had the conditions with the DOM parser working correctly, it was just a matter of storing the results in a MySQL database.
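A minimal storage sketch with PDO — the clips table and its column names here are assumptions, not the actual schema:

```php
<?php
// Assumed schema: a `clips` table with one column per scraped field,
// keyed on the numeric post ID.
$db = new PDO('mysql:host=localhost;dbname=lsf', 'user', 'password',
              [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

$stmt = $db->prepare(
    'INSERT INTO clips (id, title, streamer, game, video, thumbnail,
                        score, reddit, nsfw, url)
     VALUES (:id, :title, :streamer, :game, :video, :thumbnail,
             :score, :reddit, :nsfw, :url)'
);

function storePost(PDOStatement $stmt, int $id, array $post): void {
    $post['nsfw'] = (int) $post['nsfw']; // store the NSFW flag as 0/1
    $stmt->execute(['id' => $id] + $post);
}
```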
I did get restricted to fetching 12 pages (at most) per attempt. To get around this I limited the loop to 10 pages, slept 5 seconds, and did another 10. The whole run took around 25 seconds; staying under 30 seconds was important so the cron job service got the impression it was a live link and didn’t stop it.
The process
The overall process, run by a cron job every minute, looked like this (sketched below the list):
- Fetch the highest id from the database (the latest scraped post)
- Add 1 to it, then loop from that number through to +10 (10 entries)
- Sleep 5 seconds
- Repeat once more
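Tying the earlier sketches together, the per-minute driver could look something like this — again a sketch, reusing the hypothetical scrapePost() and storePost() from above:

```php
<?php
// One cron run: 10 posts, sleep 5 seconds, 10 more posts.
// Around 25 seconds in total, safely under the 30-second cutoff.

$lastId = (int) $db->query('SELECT MAX(id) FROM clips')->fetchColumn();

for ($batch = 0; $batch < 2; $batch++) {
    for ($i = 0; $i < 10; $i++) {
        $id   = ++$lastId;
        $post = scrapePost($id);          // DOM parser sketch above
        if ($post !== null) {
            storePost($stmt, $id, $post); // storage sketch above
        }
    }
    if ($batch === 0) {
        sleep(5); // stay under the 12-pages-per-attempt restriction
    }
}
```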
That works out to 28,800 results every 24 hours (20 per minute). LivestreamFails has around 28,000 clips, so it’s good enough.
Once I have all the details for each clip, I can move on to the second part: downloading, renaming and storing the MP4 files.