From lolcat's wiki

Reddit

Reddit is a centralized piece of shit forum and Digg killer. It has become one of the only good mainstream sources of information.

In 2023, Reddit made their API paid to use, preventing impaired users from using the unofficial extensions, apps and bots that let them browse the site. Most importantly, it's preventing (You) from scraping the site, so let's discuss some methods.

(Sane) Scraping methods

Despite everything, scraping Reddit is pretty easy. The community has done a good job of scraping their data anyway, although it remains difficult to do at a large scale.

PullPush API

The PullPush API aims to act as a Reddit API replacement, while providing additional features like viewing deleted posts crawled by their service. There's a bunch of web frontends available for you to try here:

The PullPush.io website does a really good job of documenting their JSON endpoints, so I won't document them here.
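For a rough idea of what a query looks like, here is a minimal Python sketch. The base URL, the `subreddit` and `size` parameters and the `data` key in the response are assumptions based on the Pushshift-style API that PullPush mimics, so check them against the PullPush.io documentation. `pullpush_url` and `pullpush_search` are made-up helper names, and the User-Agent string is a placeholder.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL; verify it against the PullPush.io docs.
PULLPUSH_BASE = "https://api.pullpush.io/reddit/search"


def pullpush_url(kind, **params):
    """Build a search URL. kind is "submission" or "comment"."""
    if kind not in ("submission", "comment"):
        raise ValueError("kind must be 'submission' or 'comment'")
    # Sort the parameters so the same query always yields the same URL.
    query = urllib.parse.urlencode(sorted(params.items()))
    return f"{PULLPUSH_BASE}/{kind}/?{query}"


def pullpush_search(kind, **params):
    """Run the query and return the list of results (network call)."""
    req = urllib.request.Request(
        pullpush_url(kind, **params),
        headers={"User-Agent": "example-scraper/0.1"},  # placeholder
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        # Assumption: results are wrapped in a top-level "data" list.
        return json.load(resp).get("data", [])
```

Calling `pullpush_search("submission", subreddit="selfhosted", size=10)` should then hand back up to 10 posts as dictionaries, if the assumptions above hold.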

Their API is great, but it doesn't crawl everything. My experience with it has been pretty poor when accessing smaller subreddits. I guess you'll need to try it out and see for yourself. If you wish to scrape a large amount of data, you might be interested in a full, uncensored download of their database:

Hidden JSON endpoints

You can append .json to any Reddit URL to fetch back a JSON representation of the page. For example:

https://www.reddit.com/r/selfhosted/comments/16emfv0/4get_a_proxy_search_engine_that_doesnt_suck/

becomes https://www.reddit.com/r/selfhosted/comments/16emfv0/4get_a_proxy_search_engine_that_doesnt_suck/.json

Or to get profile data, you can do this:

https://www.reddit.com/user/Main_Attention_7764/.json

As of right now, these methods let you get literally every single piece of information you'd see on the actual page.
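The URL rewrite above is easy to automate. Below is a minimal Python sketch: `to_json_url`, `fetch_json` and `post_summary` are made-up helper names, and the User-Agent string is a placeholder (Reddit tends to throttle default library user agents, so setting one is a precaution rather than a documented requirement). The field layout in `post_summary` assumes the usual shape of a comment page response: a list of two listings, the post first and the comments second.

```python
import json
import urllib.parse
import urllib.request


def to_json_url(url):
    """Append /.json to the path of a Reddit URL, keeping any query string."""
    parts = urllib.parse.urlsplit(url)
    path = parts.path
    if not path.endswith(".json"):
        path = path.rstrip("/") + "/.json"
    return urllib.parse.urlunsplit(parts._replace(path=path))


def fetch_json(url, user_agent="example-scraper/0.1"):
    """Fetch the JSON representation of a Reddit page (network call)."""
    req = urllib.request.Request(
        to_json_url(url), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def post_summary(page):
    """Pull a few fields out of a comment page response."""
    # page[0] holds the post itself, page[1] holds the comment tree.
    post = page[0]["data"]["children"][0]["data"]
    # "t1" marks a comment; other kinds (e.g. "more") are placeholders.
    comments = [c for c in page[1]["data"]["children"] if c["kind"] == "t1"]
    return {
        "title": post["title"],
        "author": post["author"],
        "selftext": post["selftext"],
        "top_level_comments": len(comments),
    }
```

Something like `post_summary(fetch_json(post_url))` then gives you the basics of a post in one call. Profile pages return a single listing instead of a list of two, so `post_summary` does not apply to them.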

oEmbed endpoints

If you need something that won't break two weeks from now, you can try the oEmbed endpoint:

https://www.reddit.com/oembed?url=/r/MoreSexyASMRGirls/comments/1isugrt/kamicakes/

You will get access to the following data:

  • Subreddit name (you need to parse it, it's inside the html object)
  • Author name
  • Title

The <blockquote> code loads up this URL, which lets you access all of the images in addition to the list above. Do note that you won't be able to access the post's text here. All of this information is located inside the initial HTML of the embed page. As far as I know, this endpoint doesn't have any rate limits.
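Pulling those three fields out of the oEmbed response can be sketched in Python as below. The `author_name`, `title` and `html` keys are the standard oEmbed field names. The regex that digs the subreddit out of the embed HTML is an assumption (it just grabs the first `/r/<name>` it sees), and `oembed_url`, `fetch_oembed` and `parse_oembed` are made-up helper names.

```python
import json
import re
import urllib.parse
import urllib.request

OEMBED_ENDPOINT = "https://www.reddit.com/oembed"

# Assumption: the embed HTML links back to the post, so the first
# "/r/<name>" in it is the subreddit.
SUBREDDIT_RE = re.compile(r"/r/([A-Za-z0-9_]+)")


def oembed_url(post_path):
    """Build the oEmbed URL for a post path such as /r/sub/comments/id/slug/."""
    # safe="/" keeps the slashes readable, as in the example URL above.
    return OEMBED_ENDPOINT + "?" + urllib.parse.urlencode({"url": post_path}, safe="/")


def fetch_oembed(post_path, user_agent="example-scraper/0.1"):
    """Fetch the oEmbed payload for a post (network call)."""
    req = urllib.request.Request(
        oembed_url(post_path), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def parse_oembed(payload):
    """Extract subreddit, author and title from an oEmbed payload."""
    match = SUBREDDIT_RE.search(payload.get("html", ""))
    return {
        "subreddit": match.group(1) if match else None,
        "author": payload.get("author_name"),
        "title": payload.get("title"),
    }
```

`parse_oembed(fetch_oembed(post_path))` returns a small dictionary, with `subreddit` set to `None` if the HTML ever stops matching the regex.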