
Reddit is a centralized piece of shit forum and Digg killer. It has become one of the only good mainstream sources of information.
In 2023, Reddit made its API paid, cutting off the unofficial extensions, apps and bots that impaired users relied on to browse the site. Most importantly, it's preventing (You) from scraping the site, so let's discuss some methods.
(Sane) Scraping methods
Despite everything, scraping Reddit is pretty easy. The community has done a good job of scraping their data anyway, although it remains difficult to do at a large scale.
PullPush API
The PullPush API aims to act as a Reddit API replacement, while providing additional features like viewing deleted posts crawled by their service. There's a bunch of web frontends available for you to try here:
- View deleted threads: https://undelete.pullpush.io/
- Search the archive: https://search.pullpush.io/
The PullPush.io website does a really good job of documenting their JSON endpoints, so I won't document it here.
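For orientation only, here is a minimal Python sketch of how a search query against PullPush might be built and fetched. The base URL `https://api.pullpush.io/reddit/search`, the `subreddit` and `size` parameters, and the helper names `build_search_url` and `fetch_json` are assumptions on my part, so check them against the PullPush documentation before relying on them.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumed base URL, not confirmed by this article; verify it in the PullPush docs.
PULLPUSH_BASE = "https://api.pullpush.io/reddit/search"


def build_search_url(kind, **params):
    """Build a search URL. `kind` is 'submission' or 'comment' (assumed names)."""
    if kind not in ("submission", "comment"):
        raise ValueError(f"unknown kind: {kind!r}")
    # Drop unset parameters and sort the rest so the URL is stable.
    pairs = sorted((k, v) for k, v in params.items() if v is not None)
    return f"{PULLPUSH_BASE}/{kind}/?{urlencode(pairs)}"


def fetch_json(url, timeout=30):
    """Fetch a URL and decode the JSON body (hypothetical helper)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-scraper/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Something like `fetch_json(build_search_url("submission", subreddit="selfhosted", size=5))` would then return whatever the service sends back; the shape of that response is not described here, so inspect it before parsing.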
Their API is great, but it doesn't crawl everything. My experience with it has been pretty poor when accessing smaller subreddits, so you'll need to try it out and see for yourself. If you wish to scrape a large amount of data, you might be interested in a full, uncensored download of their database:
Hidden JSON endpoints
You can append .json to any reddit URL to fetch back a JSON representation of the page. For example:
https://www.reddit.com/r/selfhosted/comments/16emfv0/4get_a_proxy_search_engine_that_doesnt_suck/.json
Or to get profile data, you can do this:
https://www.reddit.com/user/Main_Attention_7764/.json
As of right now, these methods let you get literally every single piece of information you'd see on the actual page.
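A rough Python sketch of the .json trick follows. It assumes a post page comes back as a two-element list of listings (the post first, then the comments) and that the post fields are named `title`, `author`, `subreddit` and `selftext`. The helper names `to_json_url`, `extract_post` and `fetch_json` are made up for illustration.

```python
import json
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def to_json_url(url):
    """Append /.json to the path of a reddit URL, unless it is already there."""
    parts = urlsplit(url)
    if parts.path.endswith(".json"):
        return url
    path = parts.path.rstrip("/") + "/.json"
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


def extract_post(payload):
    """Pull a few fields from a post page payload.

    Assumes payload[0] is the listing holding the post itself and
    payload[1] holds the comments.
    """
    post = payload[0]["data"]["children"][0]["data"]
    keys = ("title", "author", "subreddit", "selftext")
    return {key: post.get(key) for key in keys}


def fetch_json(url, timeout=30):
    """Fetch and decode JSON, sending an explicit User-Agent header."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-scraper/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `extract_post(fetch_json(to_json_url(post_url)))` chains the three steps. Profile URLs return a differently shaped payload, so `extract_post` only applies to post pages.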
oEmbed endpoints
If you need something that won't break two weeks from now, you can try the oEmbed endpoint:
https://www.reddit.com/oembed?url=/r/MoreSexyASMRGirls/comments/1isugrt/kamicakes/
You will get access to the following data:
- Subreddit name (you need to parse it, it's inside the html field)
- Author name
- Title
The <blockquote> code loads up this URL, which lets you access all of the images in addition to the list above. Do note that you won't be able to access the post's text here. All of this information is located inside the initial HTML of the embed page. As far as I know, this endpoint doesn't have any ratelimits.
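The oEmbed fields listed above could be pulled out with something like the Python sketch below. It assumes the response uses the standard oEmbed keys `title`, `author_name` and `html`, and that the subreddit shows up inside `html` as a `/r/<name>` link. The regex and the helper names `build_oembed_url` and `parse_oembed` are my own guesses, not anything Reddit documents.

```python
import re
from urllib.parse import quote

OEMBED_BASE = "https://www.reddit.com/oembed?url="

# Assumption: the first /r/<name> occurrence in the html field is the subreddit.
SUBREDDIT_RE = re.compile(r"/r/([A-Za-z0-9_]+)")


def build_oembed_url(post_path):
    """Build the oEmbed URL for a post path such as /r/<sub>/comments/<id>/."""
    return OEMBED_BASE + quote(post_path, safe="/")


def parse_oembed(data):
    """Extract subreddit, author and title from a decoded oEmbed response."""
    match = SUBREDDIT_RE.search(data.get("html", ""))
    return {
        "subreddit": match.group(1) if match else None,
        "author": data.get("author_name"),
        "title": data.get("title"),
    }
```

`parse_oembed` takes the already decoded JSON dict, so it can be fed from whatever HTTP client you prefer. If the markup inside `html` changes, only the regex needs updating.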