
Reddit is a centralized piece of shit forum and Digg killer. It has become one of the only good mainstream sources of information.
In 2023, Reddit made their API paid to use, preventing impaired users from using the unofficial extensions, apps and bots that let them browse the site. Most importantly, it's preventing (You) from scraping the site, so let's discuss some methods.
(Sane) Scraping methods
Despite everything, scraping Reddit is pretty easy. The community has done a good job of scraping their data anyway, although it remains difficult to do at a large scale.
PullPush API
The PullPush API aims to act as a Reddit API replacement, while providing additional features like viewing deleted posts crawled by their service. There's a bunch of web frontends available for you to try here:
- View deleted threads: https://undelete.pullpush.io/
- Search the archive: https://search.pullpush.io/
The PullPush.io website does a really good job of documenting their JSON endpoints, so I won't document it here.
Their API is great, but it doesn't crawl everything. My experience with it has been pretty poor when accessing smaller subreddits. I guess you'll need to try it out and see for yourself. If you wish to scrape a large amount of data, you might be interested in a full, uncensored download of their database:
Hidden JSON endpoints
Although they say the API is paid to use, there are still JSON endpoints available (fucking why?? Wasn't it such a big deal when they removed the API then? Fucking redditors I swear to god). You can append .json to almost any reddit URL to fetch back a JSON representation of the page. The JSON objects returned are different from the API.
Data type | Endpoint |
---|---|
Post data | https://www.reddit.com/r/selfhosted/comments/16emfv0/4get_a_proxy_search_engine_that_doesnt_suck/.json |
Profile data | https://www.reddit.com/user/Main_Attention_7764/.json |
Profile data (alternative endpoint) | https://www.reddit.com/user/Main_Attention_7764/about.json |
New posts in a subreddit | https://www.reddit.com/r/funny/new.json |
Hot posts in a subreddit | https://www.reddit.com/r/funny/hot.json |
Controversial posts in a subreddit | https://www.reddit.com/r/funny/controversial.json |
Post search | https://www.reddit.com/search.json?q=asmr |
Post search with subreddit constraint | https://www.reddit.com/r/asmr/search.json?q=asmr&restrict_sr=on |
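Here's a rough sketch of the .json trick in Python, standard library only. The helper names `to_json_url` and `fetch_json` and the User-Agent string are made up for this example, and I'm assuming (not promising) that sending a descriptive User-Agent gets you blocked less often than the urllib default.

```python
import json
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def to_json_url(url: str) -> str:
    """Append .json to the path of a Reddit URL, keeping any query string."""
    parts = urlsplit(url)
    path = parts.path
    if not path.endswith(".json"):
        # Works for both ".../some_post/" -> ".../some_post/.json"
        # and "/r/funny/new" -> "/r/funny/new.json".
        path += ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def fetch_json(url: str, user_agent: str = "example-scraper/0.1"):
    """Fetch the JSON representation of a Reddit page (hypothetical helper)."""
    request = urllib.request.Request(
        to_json_url(url),
        # Assumption: a custom User-Agent is treated better than the default one.
        headers={"User-Agent": user_agent},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (needs network access, so it is left commented out):
# listing = fetch_json("https://www.reddit.com/r/funny/new")
# print(listing["data"]["after"])
```

Keeping the URL rewrite separate from the network call means you can reuse it with whatever HTTP client or proxy setup you prefer.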
For searching posts, there are additional parameters you can supply:
Parameter | Description | Example |
---|---|---|
q | Search query | your mom |
sort | Sort by | relevance, top, new, comments |
t | Time filter | hour, day, week, month, year, all (Only works with |
restrict_sr | Restrict search to a subreddit. Only useful when using the subreddit search endpoint. | Can only be set to "on", omit parameter to turn off |
sr_detail | Give additional subreddit information. Again, only useful when using the subreddit search endpoint. | true or false |
include_over_18 | Include NSFW results | Set to "on". It's turned off by default |
type | Type of results to return. sr stands for subreddit | link, comment, sr, user |
limit | Limit of results per page | Goes up to 100 |
after / before | Pagination cursor. Uses $json["data"]["after"] | after=t3_1im71ar |
count | Number of results seen so far. Helps with their shitty pagination logic | count=200 |
include_facets | Adds additional metadata to show how relevant a search result is | true or false |
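To show how these parameters fit together, here's a hedged pagination sketch in Python. `build_search_url`, `iter_results` and `http_fetch` are names I made up; the only things taken from above are the parameter names and the `["data"]["after"]` cursor. I'm also assuming results live under `["data"]["children"]`, so check that against a real response before trusting it.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_search_url(query, subreddit=None, **params):
    """Build a search.json URL from the search parameters."""
    base = "https://www.reddit.com"
    if subreddit:
        base += f"/r/{subreddit}"
        # Without restrict_sr=on the subreddit endpoint searches everything.
        params.setdefault("restrict_sr", "on")
    return f"{base}/search.json?{urlencode({'q': query, **params})}"


def http_fetch(url, user_agent="example-scraper/0.1"):
    """Default fetcher (hypothetical helper): GET a URL and parse the JSON body."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_results(query, fetch=http_fetch, max_pages=3, limit=100, **params):
    """Yield results page by page, following the "after" cursor."""
    after = None
    count = 0
    for _ in range(max_pages):
        page_params = dict(params, limit=limit)
        if after:
            # Pass the cursor plus the number of results seen so far.
            page_params.update(after=after, count=count)
        listing = fetch(build_search_url(query, **page_params))
        children = listing["data"]["children"]  # assumed response layout
        yield from children
        count += len(children)
        after = listing["data"]["after"]
        if not after:
            break  # no cursor means there is no next page


# Usage (needs network access, so it is left commented out):
# for result in iter_results("asmr", subreddit="asmr", sort="new"):
#     print(result["data"].get("title"))
```

Passing the fetcher in as an argument lets you swap in a proxy-aware client, or a fake one for testing, without touching the pagination loop.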
oEmbed endpoints
If you need something that won't break two weeks from now, you can try the oEmbed endpoint:
https://www.reddit.com/oembed?url=/r/MoreSexyASMRGirls/comments/1isugrt/kamicakes/
You will get access to the following data:
- Subreddit name (you need to parse it, it's inside the html object)
- Author name
- Title
The <blockquote> code loads up this URL, which lets you access all of the images in addition to the list above. Do note that you won't be able to access the post's text here. All of this information is located inside the initial HTML of the embed page. As far as I know, this endpoint doesn't have any ratelimits.
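A small Python sketch for pulling those three fields out of the oEmbed response. Big assumptions here: that the author and title sit under the standard oEmbed keys `author_name` and `title`, and that the `html` field contains a link with `/r/<subreddit>` in it. The function names are invented for this example, so verify the layout against a real response first.

```python
import json
import re
import urllib.request
from urllib.parse import urlencode

OEMBED_ENDPOINT = "https://www.reddit.com/oembed"


def build_oembed_url(post_path: str) -> str:
    """Build the oEmbed URL for a post path such as /r/<sub>/comments/<id>/<slug>/."""
    # safe="/" keeps the slashes readable, like in the example URL above.
    return f"{OEMBED_ENDPOINT}?{urlencode({'url': post_path}, safe='/')}"


def parse_oembed(payload: dict) -> dict:
    """Extract subreddit, author and title from a decoded oEmbed response."""
    # Assumption: the embed HTML links to the post, so /r/<name> appears in it.
    match = re.search(r"/r/([A-Za-z0-9_]+)", payload.get("html", ""))
    return {
        "subreddit": match.group(1) if match else None,
        "author": payload.get("author_name"),  # assumed key (oEmbed standard)
        "title": payload.get("title"),  # assumed key (oEmbed standard)
    }


def fetch_oembed(post_path: str, user_agent: str = "example-scraper/0.1") -> dict:
    """Fetch and parse the oEmbed data for a post (hypothetical helper)."""
    request = urllib.request.Request(
        build_oembed_url(post_path), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_oembed(json.loads(response.read().decode("utf-8")))


# Usage (needs network access, so it is left commented out):
# print(fetch_oembed("/r/selfhosted/comments/16emfv0/4get_a_proxy_search_engine_that_doesnt_suck/"))
```

If the regex returns None for the subreddit, the embed HTML probably changed shape, so dump the raw `html` field and adjust the pattern.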