From lolcat's wiki
A Redditor spotted consuming shitty media

Reddit

Reddit is a centralized piece of shit forum and Digg killer. It has become one of the few good mainstream sources of information.

In 2023, Reddit made its API paid to use, locking impaired users out of the unofficial extensions, apps and bots that let them browse the site. Most importantly, it's preventing (You) from scraping the site, so let's discuss some methods.

(Sane) Scraping methods

Despite everything, scraping Reddit is pretty easy. The community has done a good job of scraping Reddit's data anyway, although it remains difficult to do at a large scale.

PullPush API

The PullPush API aims to act as a Reddit API replacement, while providing additional features like viewing deleted posts crawled by their service. There's a bunch of web frontends available for you to try here:

The PullPush.io website does a really good job of documenting their JSON endpoints, so I won't document it here.

Their API is great, but it doesn't crawl everything. My experience with it has been pretty poor when accessing smaller subreddits, so you'll need to try it out and see for yourself. If you wish to scrape a large amount of data, you might be interested in a full, uncensored download of their database:

Hidden JSON endpoints

Although they say the API is paid to use, there are still JSON endpoints available (fucking why?? Wasn't it such a big deal when they removed the API then? Fucking redditors I swear to god). You can append .json to almost any reddit URL to fetch back a JSON representation of the page. The JSON objects returned differ from the ones the official API returns.
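As a rough sketch of the ".json" trick, here's one way to do it in Python. The helper names (to_json_url, fetch_json) and the User-Agent string are made up for illustration, and sending a custom User-Agent is my assumption about what keeps Reddit from throttling the default Python one, so adjust as needed.

```python
import json
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def to_json_url(url):
    """Append .json to the path of a reddit URL, leaving the query string alone."""
    parts = urlsplit(url)
    if parts.path.endswith(".json"):
        return url  # already a JSON endpoint, nothing to do
    return urlunsplit(parts._replace(path=parts.path + ".json"))


def fetch_json(url, user_agent="example-scraper/0.1"):
    """Fetch the JSON representation of a reddit page (hypothetical helper)."""
    request = urllib.request.Request(
        to_json_url(url),
        headers={"User-Agent": user_agent},  # assumption: a non-default UA helps
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Calling fetch_json("https://www.reddit.com/r/funny/new") would then request https://www.reddit.com/r/funny/new.json.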

Reddit hidden endpoints

Data type | Endpoint
Post data | https://www.reddit.com/r/selfhosted/comments/16emfv0/4get_a_proxy_search_engine_that_doesnt_suck/.json
Profile data | https://www.reddit.com/user/Main_Attention_7764/.json
Profile data (alternative endpoint) | https://www.reddit.com/user/Main_Attention_7764/about.json
New posts in a subreddit | https://www.reddit.com/r/funny/new.json
Hot posts in a subreddit | https://www.reddit.com/r/funny/hot.json
Controversial posts in a subreddit | https://www.reddit.com/r/funny/controversial.json
Post search | https://www.reddit.com/search.json?q=asmr
Post search with subreddit constraint | https://www.reddit.com/r/asmr/search.json?q=asmr&restrict_sr=on

For searching posts, there are additional parameters you can supply:

(optional: /r/asmr) /search.json parameters

Parameter | Description | Example
q | Search query | your mom
sort | Sort by | relevance, top, new, comments
t | Time filter | hour, day, week, month, year, all (only works with sort=top or sort=comments)
restrict_sr | Restrict search to a subreddit. Only useful when using the subreddit search endpoint | Can only be set to "on", omit parameter to turn off
sr_detail | Give additional subreddit information. Again, only useful when using the subreddit search endpoint | true or false
include_over_18 | Include NSFW results | Set to "on". It's turned off by default
type | Type of results to return. sr stands for subreddit | link, comment, sr, user
limit | Limit of results per page | Goes up to 100
after / before | Pagination cursor. Uses $json["data"]["after"] | after=t3_1im71ar
count | Number of results seen so far. Helps with their shitty pagination logic | count=200
include_facets | Adds additional metadata to show how relevant a search result is | true or false
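To show how these parameters fit together, here's a minimal Python sketch that builds a search URL and walks the pages using the after cursor and the count parameter. The function names are hypothetical, fetch_json stands for whatever callable you use to download and decode a URL, and reading results from $json["data"]["children"] is my assumption about the listing layout (only $json["data"]["after"] is confirmed above).

```python
from urllib.parse import urlencode

BASE = "https://www.reddit.com"


def build_search_url(query, subreddit=None, **params):
    """Build a /search.json URL; extra keyword arguments become query parameters."""
    path = f"/r/{subreddit}/search.json" if subreddit else "/search.json"
    args = {"q": query}
    if subreddit:
        args["restrict_sr"] = "on"  # keep results inside the subreddit
    args.update(params)
    return f"{BASE}{path}?{urlencode(args)}"


def iter_search_pages(fetch_json, query, subreddit=None, max_pages=3, limit=100, **params):
    """Yield one list of results per page, following the "after" cursor."""
    after = None
    count = 0
    for _ in range(max_pages):
        page_params = dict(params, limit=limit)
        if after:
            # Pass the cursor plus the number of results seen so far.
            page_params.update(after=after, count=count)
        listing = fetch_json(build_search_url(query, subreddit, **page_params))
        children = listing["data"]["children"]  # assumption: results live here
        yield children
        count += len(children)
        after = listing["data"]["after"]
        if not after:
            break  # no cursor means there are no more pages
```

For example, build_search_url("asmr", subreddit="asmr") gives back the subreddit-constrained search URL from the endpoint table. The max_pages cap is only there so a sketch like this doesn't hammer the site by accident.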

oEmbed endpoints

If you need something that won't break two weeks from now, you can try the oEmbed endpoint:

https://www.reddit.com/oembed?url=/r/MoreSexyASMRGirls/comments/1isugrt/kamicakes/

You will get access to the following data:

  • Subreddit name (you need to parse it, it's inside the html object)
  • Author name
  • Title

The <blockquote> code loads up this URL, which lets you access all of the images in addition to the list above. Do note that you won't be able to access the post's text here. All of this information is located inside the initial HTML of the embed page. As far as I know, this endpoint doesn't have any rate limits.
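Here's a small Python sketch of working with the oEmbed response. It assumes the JSON uses the standard oEmbed field names (title, author_name, html) and that the subreddit shows up in the html field as a /r/<name> link; I haven't confirmed the exact markup, so treat the regex as a starting point. The function names are made up for illustration.

```python
import re
from urllib.parse import urlencode

# Assumption: the embed HTML links back to the post, so "/r/<name>" appears in it.
SUBREDDIT_RE = re.compile(r"/r/([A-Za-z0-9_]+)")


def build_oembed_url(post_path):
    """Build the oEmbed URL for a post path such as /r/<sub>/comments/<id>/<slug>/."""
    return "https://www.reddit.com/oembed?" + urlencode({"url": post_path}, safe="/")


def summarize_oembed(payload):
    """Pull subreddit, author and title out of a decoded oEmbed JSON object."""
    match = SUBREDDIT_RE.search(payload.get("html", ""))
    return {
        "subreddit": match.group(1) if match else None,
        "author": payload.get("author_name"),
        "title": payload.get("title"),
    }
```

Feed summarize_oembed the decoded JSON from the URL that build_oembed_url returns. It gives None for the subreddit when the html field doesn't contain a /r/ link.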