I do not encourage web scraping, nor will I post full code to do so. This was simply a challenge to myself and is documented as such.
LivestreamFails, the place to view streamers doing stupid/funny/illegal/annoying things, provides a clip which Twitch cannot remove, as well as a points parameter, an NSFW notice and a link to the Reddit post discussing the “fail”.
Given these clips are 720p, massively compressed and short in length, I had thoughts about doing a “back up” of the whole site. This vision became easier once I realized how the website is laid out.
The first clip is at https://livestreamfails.com/post/1
all the way through to the latest post at https://livestreamfails.com/post/28203
The best way to describe this layout is predictable. I know the URL of every LivestreamFails clip thanks to this design.
To scrape every clip I just need to remember the last “id” I did, add 1 to it and repeat the process. Using random strings or the title as the URL would prevent this. I can’t predict “https://livestreamfails.com/post/5Hq81B” or “https://livestreamfails.com/post/Sn1p3r_spills_drink_on_keyboard”.
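The “remember the last id and add 1” idea can be sketched in a few lines. This is only an illustration in Python, not the PHP I actually used; the function name `next_post_urls` is made up for the example, and nothing here makes a request, it only builds the URLs.

```python
# URL pattern taken from the post; everything else is illustrative.
BASE_URL = "https://livestreamfails.com/post/"


def next_post_urls(last_id: int, count: int) -> list[str]:
    """Return the URLs of the `count` posts that follow `last_id`."""
    first = last_id + 1
    return [f"{BASE_URL}{post_id}" for post_id in range(first, first + count)]


# Starting from nothing, the first three posts are ids 1, 2 and 3.
for url in next_post_urls(0, 3):
    print(url)
```

With random strings or titles in the URL there would be no `range` to walk, which is exactly why that layout would stop this approach.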
With the looping process done, I just needed to fetch data from the page and store it. Using PHP Simple HTML DOM Parser this was easy. The data I needed was:
Once I had the conditions with the DOM parser working correctly, it was just a matter of storing the results in a MySQL database.
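A rough sketch of the storing step, again in Python and with SQLite standing in for MySQL so it runs without a server. The table and column names (`clips`, `points`, `nsfw`, `reddit_url`) are assumptions based on what the site shows, not my real schema. The useful part is `last_saved_id`, which gives the next run its starting point.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (hypothetical) clips table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS clips (
               id INTEGER PRIMARY KEY,  -- the number from /post/<id>
               points INTEGER,
               nsfw INTEGER,            -- 0 or 1
               reddit_url TEXT
           )"""
    )
    return conn


def save_clip(conn: sqlite3.Connection, post_id: int, points: int,
              nsfw: bool, reddit_url: str) -> None:
    """Insert one clip; re-running the same id overwrites instead of duplicating."""
    conn.execute(
        "INSERT OR REPLACE INTO clips (id, points, nsfw, reddit_url) "
        "VALUES (?, ?, ?, ?)",
        (post_id, points, int(nsfw), reddit_url),
    )
    conn.commit()


def last_saved_id(conn: sqlite3.Connection) -> int:
    """Return the highest stored id, or 0 when the table is empty."""
    return conn.execute("SELECT COALESCE(MAX(id), 0) FROM clips").fetchone()[0]
```

Keying the table on the post id means a run that gets cut off can simply start again from `last_saved_id(conn) + 1`.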
I was restricted to at most 12 pages per try. To get around this I limited each loop to 10 pages, slept for 5 seconds and then did another 10. The whole run took around 25 seconds. Staying under 30 seconds mattered, so that the cron job service got the impression it was a live link and didn’t stop it.
The overall process looked like this with a cron job running every minute:
This was getting 28,800 results every 24 hours (20 pages a minute). LivestreamFails has around 28,000 clips, so it’s good enough.
Once I have all the details for each clip I can move on to the second part, which is downloading, renaming and storing the MP4 files.
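The 10 pages, sleep, 10 pages pattern could look something like the sketch below. It is Python rather than my PHP, and `run_once` and its parameters are names invented for the example. The page fetcher is passed in as `fetch`, so the sketch itself does no scraping.

```python
import time
from typing import Callable


def run_once(last_id: int,
             fetch: Callable[[int], None],
             batch_size: int = 10,
             batches: int = 2,
             pause_seconds: float = 5.0,
             sleep: Callable[[float], None] = time.sleep) -> int:
    """Handle `batches` groups of `batch_size` pages with a pause in between.

    Returns the last id handled, so the next cron run knows where to resume.
    """
    current = last_id
    for batch in range(batches):
        for _ in range(batch_size):
            current += 1
            fetch(current)
        # Pause between groups only, not after the final one.
        if batch < batches - 1:
            sleep(pause_seconds)
    return current
```

Passing `sleep` in as a parameter is just so the loop can be tried out without actually waiting 5 seconds.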
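The 28,800 figure is just the per-run page count multiplied out over a day. A quick check in Python, using the numbers from this post:

```python
PAGES_PER_RUN = 10 + 10      # two groups of 10 pages per cron run
RUNS_PER_DAY = 60 * 24       # one cron run every minute
LATEST_POST_ID = 28203       # latest post id mentioned above

pages_per_day = PAGES_PER_RUN * RUNS_PER_DAY
print(pages_per_day)                    # 28800
print(pages_per_day >= LATEST_POST_ID)  # True
```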