/r/pushshift


Subreddit for users of the pushshift.io API

Rules:

  • Read the FAQ before posting.
  • Please be kind to each other.
  • Please see this thread before posting "Is Pushshift Down?"
  • Camas, RedditSearch, and similar tools are unlikely to return due to recent API changes made by Reddit. Please do not post asking about those tools.

New to Pushshift? Read the FAQ

Want your data removed? Use this Removal Request Form

/r/pushshift

14,238 Subscribers

0

"User is not an authorized moderator."

I keep getting this message despite 1) being a moderator and 2) having received approval from pushshift.

Does anyone know how to resolve this?

4 Comments
2024/05/14
19:12 UTC

0

Emergency

Postgrad student whose (academic) life is hanging by a thread if she fails to use PRAW or Pushshift to scrape comments from the subreddit r/gameofthrones!!!

https://preview.redd.it/5px05uzt620d1.png?width=1654&format=png&auto=webp&s=3fe52df48862188c4bd37b3d7bba54985be63583

15 Comments
2024/05/12
20:54 UTC

5

Trouble with zst to csv

Been using u/watchful1's dumpfile scripts in Colab with success, but can't seem to get the zst to csv script to work. Been trying to figure it out on my own for days (no cs/dev/coding background), trying different things (listed below), but no luck. Hoping someone can help. Thanks in advance.

Getting the Error:

IndexError                                Traceback (most recent call last)

<ipython-input-22-f24a8b5ea920> in <cell line: 50>()
     52                 input_file_path = sys.argv[1]
     53                 output_file_path = sys.argv[2]
---> 54                 fields = sys.argv[3].split(",")
     55
     56         is_submission = "submission" in input_file_path

IndexError: list index out of range

From what I was able to find, this means I'm not providing enough arguments.

The arguments I provided were:

input_file_path = "/content/drive/MyDrive/output/atb_comments_agerelat_2123.zst"
output_file_path = "/content/drive/MyDrive/output/atb_comments_agerelat_2123"
fields = []

Got the error above, so I tried the following...

  1. Listed specific fields (got same error)

input_file_path = "/content/drive/MyDrive/output/atb_comments_agerelat_2123.zst"
output_file_path = "/content/drive/MyDrive/output/atb_comments_agerelat_2123"
fields = ["author", "title", "score", "created", "id", "permalink"]

  2. Retyped lines 50-54 to ensure correct spacing & indentation, then tried running it with and without specific fields listed (got same error)

  3. Reduced the number of arguments since it was telling me I didn't provide enough (got same error)

    if __name__ == "__main__":
        if len(sys.argv) >= 2:
            input_file_path = sys.argv[1]
            output_file_path = sys.argv[2]
            fields = sys.argv[3].split(",")

    No idea what the issue is. Appreciate any help you might have - thanks!

16 Comments
2024/05/11
22:59 UTC

0

Pushshift api access for research

Tried to sign up but received a message that I am not a mod. Is it possible to get access for academic research?

I’m specifically interested in moderation behavior and its impact on the evolution of conversations, so I am interested in identifying moderated messages and analyzing their content. Would such information be accessible through Pushshift? Are there other means to obtain it?

Thanks

4 Comments
2024/05/10
17:01 UTC

1

Why do I see such a strong surge in submissions and individual users making submissions on July 1st?

In this graph you can see (for all of Reddit between Jan-Nov 2023)

a) the daily number of submissions, stacked by number of comments per submission

b) the daily number of individual users that made at least one submission to all of Reddit in 2023 (excluding December).

I stacked the numbers for submissions with 0,1,2,3,4,5-10, etc comments in order to visually filter out spam/noise by irrelevant submissions (that result in no engagement).

On July 1st, the numbers for all submissions spike significantly. However, when looking at the composition, it becomes clear that the number of submissions with 2 or more comments barely budges. For the DAU numbers this is not true, and we can observe the spike much "deeper".

I would be grateful for any pointers toward why there is such a large spike on July 1st. I suspect it might be due to some moderator tools that stopped working when API monetization started on this date, but I don't know for sure. Why would I see so many more individual users making submissions beginning on July 1st?

6 Comments
2024/05/09
09:35 UTC

2

Scheduled maintenance/downtime - Improvements in Pushshift API (5/8 Midnight)

As part of our ongoing efforts to improve Pushshift and help moderators, we are bringing in updates that will make our data collection systems faster. Some of these updates are scheduled to be deployed tonight (8th May 12:00 am EST) and may lead to temporary downtime in Pushshift. We expect the system to be back to normal within 15 to 30 minutes.

Our apologies for any inconvenience caused. We will update this post with system updates as they come in.

4 Comments
2024/05/07
23:57 UTC

0

Deleted reddit history used against me.

Hello,

A post I made recently on a subreddit was removed due to my comment history from a different subreddit. The two subreddits have nothing to do with each other, so there is no overlap. I deleted those comments myself, and I haven't been able to find them on the popular archive websites. I have several questions:

  1. How was this mod able to see my deleted Comments?
  2. If I make a removal request, will my deleted reddit history still be easily accessible?

I'm aware nothing is ever truly gone, but the fact that this mod was able to use my deleted comment history against me is rather concerning.

7 Comments
2024/05/06
16:45 UTC

0

{"detail":"User is not an authorized moderator."}

Hello everyone,

I'm currently developing a sentiment analysis model and am trying to integrate the Pushshift API to access historical Reddit data. However, I'm encountering an issue with the authorization process. After granting access to my account, I received the following error message:

{"detail":"User is not an authorized moderator."}

It seems like the API is expecting moderator privileges, which I do not have. Has anyone else faced this issue? Any guidance on how to bypass this or any alternative methods to access the data would be greatly appreciated.

Thank you in advance for your help!

2 Comments
2024/05/05
15:05 UTC

19

Dump files for March 2024

Sorry this one is so delayed. I was on vacation the first two weeks of the month and then the compression script which takes like 4 days to run crashed three times part way through. Next month should be faster.

March dump files: https://academictorrents.com/details/deef710de36929e0aa77200fddda73c86142372c

Previous months: https://www.reddit.com/r/pushshift/comments/194k9y4/reddit_dump_files_through_the_end_of_2023/

Mirror of u/RaiderBDev's zst_blocks: https://academictorrents.com/details/ca989aa94cbd0ac5258553500d9b0f3584f6e4f7

3 Comments
2024/04/28
20:34 UTC

2

wallstreetbets_submissions/comments

Hello guys. I have downloaded the .zst files for wallstreetbets_submissions and comments from u/Watchful1's dump. I just want the names of the fields which contain the text and the time it was created. Any suggestions on how to modify the filter_file script? I used glogg as instructed with the .zst file to see the fields, but random symbols come up. Should I extract the .zst using the 7-Zip ZST extractor? submissions is 450 MB and comments is 6.6 GB as .zst files. Any ideas?

https://preview.redd.it/2krcfoi5opwc1.png?width=1778&format=png&auto=webp&s=d2453f057841e6fe4ee501796afb0b0739dd9989

3 Comments
2024/04/25
23:37 UTC

3

Any guides to pushshift use for modding?

The current pushshift.io allows me to search posts/users but I can't actually see the content of what was posted. In the sub I moderate we are having issues with users posting disallowed material and deleting it before mods have a chance to get to it, thus circumventing a ban. I have two questions:

  1. If a post on my sub is popping up as deleted, is there a way for me to see the content of that post and the username of the submitter?

  2. When I do find a suspicious user and search their name on pushshift.io, I can see the titles of posts they made but not the content of said posts. Is there any way to view content?

Past tools allowed me to do this. Is there any way I can use other tools (with an auth token) to use these functions?

1 Comment
2024/04/23
01:12 UTC

5

Confused on How to Use Pushshift

I'm new to pushshift and in general scraping posts with a Reddit API. I'm looking to scrape some Reddit posts for a personal research project and have heard secondhand that pushshift is an easy way to do this. However, I'm a little confused about exactly what pushshift is and how it is used. When I go to https://pushshift.io/ I am given the terms of service which explain that pushshift is only to be used by Reddit moderators for the sake of moderation (see attached screenshot). Furthermore, I cannot authorize my account without being a Reddit mod.

I am confused because I have seen other posts referencing pushshift as a large data storage of reddit posts or a third-party scraper perfect for scraping posts off of Reddit for research (like this one). Am I misunderstanding something, or is a different tool more suited for what I am looking for?

https://preview.redd.it/954101mct4uc1.png?width=1818&format=png&auto=webp&s=54db9a3ef18cb1678af8c36c6e5622ff71ba3a75

5 Comments
2024/04/12
23:18 UTC

5

Subreddit torrent size

I am trying to ingest the subreddit torrent as mentioned here:

Separate dump files for the top 20k subreddits :

The total collection is some 2.64 TB in size, but all files are obviously compressed. For anybody who has uncompressed the whole collection: any idea how much storage space the uncompressed collection will occupy?

11 Comments
2024/04/12
13:34 UTC

3

How do you resolve decoding issues in the dump files using Python?

I'm hopeful some folks in the community have figured out how to address escaped code points in ndjson fields (e.g. body, author_flair_text).

I've been treating the ndjson dumps as utf-8 encoded, and blithely regex'd the code points out to suit my then needs, but that's not really a solution.

One example is a flair_text comprised of the repeated sequence \ud83d\ude28. I assume this to be a string of the same emoji repeated, if I'm to believe a handful of online decoders ("utf-16" decoding), but Python doesn't agree at all.

>>> text = b'\ud83d\ude28'
>>> text.decode('utf-8')
'\\ud83d\\ude28'
>>> text.decode('utf-16')
'畜㡤搳畜敤㠲'
>>> text.decode('unicode-escape')
'\ud83d\ude28'

Pasting the emoji into Python interactively, the encoded results are entirely different.

>>> text = '😨'
>>> text.encode('utf-8')
b'\xf0\x9f\x98\xa8'
>>> text.encode('utf-16')
b'\xff\xfe=\xd8(\xde'
>>> text.encode('unicode-escape')
b'\\U0001f628'

Any nudges or 2x4s to push/shove me in a useful direction are greatly appreciated.
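What the online decoders are hinting at: \ud83d\ude28 is a UTF-16 surrogate pair written as JSON \uXXXX escapes. The bytes-literal experiments fail because b'\ud83d' stores a literal backslash-u sequence, not a character. Parsing the dump line with json.loads handles the escapes correctly, and lone surrogates that have already leaked into a str can be re-paired:

```python
import json

# The ndjson dumps store the emoji as the JSON escape pair \ud83d\ude28.
# json.loads() combines the surrogate pair into the real character, so
# decoding each dump line as UTF-8 and parsing it as JSON is sufficient:
s = json.loads('"\\ud83d\\ude28"')
assert s == '\U0001f628'  # the 😨 emoji

# If 'unicode-escape' decoding has already produced lone surrogates in a
# str, they can be recombined via the surrogatepass error handler:
broken = '\ud83d\ude28'
fixed = broken.encode('utf-16', 'surrogatepass').decode('utf-16')
assert fixed == '\U0001f628'
```

The regex stripping can then be dropped entirely: parse each line with json.loads and every \uXXXX escape, emoji included, comes out as proper text.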

3 Comments
2024/04/08
23:17 UTC

4

In the dump files, if a username is deleted, is there any way to identify their other posts/comments?

I actually know the username and two of their posts. I found the posts in the files, but they show the name as deleted, so I wanted to ask if there's any way to find more of their posts.

2 Comments
2024/04/06
05:56 UTC

2

Need help coding (please)

Hello everyone,

I'm doing my thesis in linguistics on the pragmatic use of emojis in politeness strategies.

I would like to extract as many submissions with emojis as possible, so that I can run statistical analyses on them.

Disclaimer: I'm a noob coder, and I'm working with Anaconda NoteBook.

I downloaded some metadumps, but I'm having a few problems extracting comments.

The main problem is that the zst files are WAY TOO BIG when I unpack them (some 300-500GB each). This makes my PC go crazy and causes failures in the code I'm trying to run.

Therefore, I humbly request the assistance of the kind souls in this subreddit.

How can I extract all comments containing emojis from a given zst file into a json file? I don't need all the attributes, just the comment, ID, and subreddit. This would greatly reduce the size of the file, but I'm honestly clueless as to how to do that.

Please help me.

Feel free to ask for further clarification.

Thank you all in advance, and I hope you're having a great day!

2 Comments
2024/04/02
21:53 UTC

3

Old dump files

Hello, I have a question. With the change of Pushshift servers in December 2022, many names were overwritten with u/[deleted]. Is there any way to look at an old dump like this one https://academictorrents.com/details/0e1813622b3f31570cfe9a6ad3ee8dabffdb8eb6 and see if the data is still there without being overwritten?

1 Comment
2024/04/02
19:50 UTC

3

Passing API key in PMAW?

Hey all - I've got a search that works on the search page, but I need a lot more results than I want to pull manually from that page.

How do I pass my PushShift API key through PMAW? Can't find anything from searching.

0 Comments
2024/03/31
21:20 UTC

1

Analysis project advice. I'm new to this, please respond at a 5th grade reading level lol

What is the best way to access Pushshift for an analysis-type project within a specific subreddit? I came across this subreddit while doing some research, and I think it's pretty cool that this type of resource exists. I'm trying to learn how best to utilize it for a project that aims to analyze sentiments and overall mood, and/or a temporal analysis of patterns of change.

Any and all information would be greatly appreciated.

1 Comment
2024/03/28
11:30 UTC

2

How to automate token retrieval?

I'm a python noob. How do I retrieve the token using a script? It's incredibly tedious having to go through a link, authenticate, then copy paste every day.

2 Comments
2024/03/27
23:12 UTC

0

How do I download the torrents of the Reddit submissions?

I tried using Academic Torrents and Transmission-Qt, but the resulting file didn't let me extract it, and it tried to download all 2 f**cking terabytes even though I specified a particular year. Does anyone have a tutorial, or a less risky way to access the submissions data for a particular year?

17 Comments
2024/03/26
10:24 UTC

3

Is there any way to increase the API limits? Or make Pushshift code from before the change work again?

I am running a very simple RStudio script to get the subreddit name from the ID number that all Reddit links have, but it limits me to 100 with long intervals. Does anyone know a solution, or any way to get data from Reddit links quickly and easily?

And for the second question: is it possible to get access from Reddit and make the Pushshift website work again?

I know this is unlikely after the stupid changes, but I am at my wits' end. I had a perfectly working Pushshift script, but the change made it useless and I am STILL not finding a solution.

4 Comments
2024/03/26
00:05 UTC

5

Exact match in dump files

Using the dumps and code provided by u/Watchful1, if I'm looking for the values 'alpha', 'bravo', 'charlie', and 'delta' with exact match set to 'False', will I get returns for 'Alpha', 'Bravo', 'Charlie', and 'Delta'? What about 'alphabet' or 'bravos'? And 'alpha-', 'bravo-'?
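I can't speak for the script authoritatively, but if exact_match set to False means a lowercased substring test (an assumption worth verifying against the actual filter_file code), the behavior would look like this:

```python
# Sketch of the matching logic ASSUMED above: exact_match=False does a
# case-insensitive substring test, exact_match=True requires equality.
# Verify against the real filter_file script before relying on this.
values = ["alpha", "bravo", "charlie", "delta"]


def matches(field_value, exact_match=False):
    text = field_value.lower()
    if exact_match:
        return text in values          # whole-value equality only
    return any(v in text for v in values)  # substring anywhere


assert matches("Alpha")                      # case-insensitive hit
assert matches("alphabet")                   # substring hit
assert matches("alpha-")                     # substring hit
assert not matches("alphabet", exact_match=True)
```

Under that assumption: 'Alpha'/'Bravo' etc. would match either way (lowercasing), while 'alphabet', 'bravos', 'alpha-', and 'bravo-' would match only with exact match off.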

Thanks in advance!

6 Comments
2024/03/24
02:55 UTC

15

Would you find the ability to download the Reddit data archives in a simple Python package that interfaces with a SQLite database useful?

I downloaded the pushshift archives a while back and have a full copy of the archives, and have used it for various personal research purposes. I've been converting the zst compressed ndjson files into a single SQLite database that uses SQLmodel as an interface, and integrating embedding search across all comments and self posts as I go. I'd probably upload the database to huggingface if I uploaded it somewhere.

My first question is: would people here on this subreddit find this useful? What specific features would you find most useful in a python package serving as an interface for the database?

My second question is: if I published this and made the dataset available, what do y'all think the legal/ethical implications would be?
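For anyone picturing the conversion step being described, here is a minimal stdlib-only sketch of loading ndjson records into SQLite (the column set is illustrative, not the package's actual schema, and the real package uses SQLModel rather than raw sqlite3):

```python
import json
import sqlite3


def load_comments(ndjson_lines, db_path=":memory:"):
    """Load ndjson comment records into a SQLite table.

    Illustrative schema: id, subreddit, created_utc, body. Returns the
    open connection so callers can query it.
    """
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS comments (
                       id TEXT PRIMARY KEY,
                       subreddit TEXT,
                       created_utc INTEGER,
                       body TEXT)""")
    rows = ((o["id"], o["subreddit"], o["created_utc"], o["body"])
            for o in map(json.loads, ndjson_lines))
    # INSERT OR IGNORE makes re-runs over overlapping dumps idempotent.
    con.executemany("INSERT OR IGNORE INTO comments VALUES (?, ?, ?, ?)", rows)
    con.commit()
    return con
```

A single-file SQLite database like this is easy to query and to redistribute, which is part of the appeal of the proposal above.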

7 Comments
2024/03/23
01:13 UTC

0

Do you have to be a moderator to access data via Pushshift?

Do you have to be a subreddit moderator to gain access to Pushshift? This page, where you go if you want to request access, seems to imply that you need to be a moderator to get access to Pushshift. I'm not a moderator; I simply want to search particular subreddit posts and their comments for particular phrases I'm interested in. Thank you.

7 Comments
2024/03/22
19:01 UTC

3

Reddit dumps documentation

Hello, keeper and administrator of the cultural heritage of the internet.

I would like to use Reddit dumps from various subreddits for a university assignment on memes. Is there any documentation explaining what the different properties contained in the dumps mean?

Additional question: is there an explanation of how the dumps are scraped?

I would be very grateful if someone could provide me with further resources :)

3 Comments
2024/03/21
16:59 UTC

1

How do I find old user deleted comments?

I used to use unddit, since reveddit is useless: it doesn't show user deleted comments. I am sick and fucking tired of being right where the solution to my problem is, only to be greeted by fucking [deleted].

I just want to know the solution to this. Reveddit is fucking useless and I know no other working site. Anyone know what I could use?

https://old.reddit.com/r/uBlockOrigin/comments/o53xdk/how_do_i_block_a_file_by_name_on_any_domain/

Edit: I was able to find another post and figure out the rule is browser-detect-$script but the point still stands. I could've found this earlier if there were a website that just lets me view user deleted comments.

5 Comments
2024/03/18
10:47 UTC

2

Getting your API token?

I got approved to use Pushshift, but when I accept the terms it just takes me to a search page and doesn't give me an API token?

2 Comments
2024/03/18
02:05 UTC

3

How can I get data related to depression?

Dear Reddit community,

I am a young researcher and a new user of Reddit. I intend to do research concerning depression using text posts on Reddit. I require data from subreddits such as r/depression, r/depressed and so on. How can I get this data? Thank you for your help.

3 Comments
2024/03/17
12:10 UTC
