/r/datasets

A place to share, find, and discuss Datasets.

Datasets for Data Mining, Analytics and Knowledge Discovery

Rules

  • Try to post original source whenever you can.
  • Low effort posts will be removed.
  • Self-promotion (of a website/domain you work for or own) without disclosure will be removed.
  • Any Paid Dataset or Resource must be marked as such in the title with [PAID].
  • Any Synthetic/Mock data must be marked as such in the title with [Synthetic].
  • All Survey posts are subject to approval. Message the mods before posting.

Unsure about your post?

Feel free to message the mods and discuss it before posting.

197,442 Subscribers

1

Community health for a subreddit for a project - it's not mine

I wanted to do a quick analysis of a subreddit. Can someone please teach me how to use this? https://github.com/pushshift/api

3 Comments
2024/10/31
14:06 UTC

3

Guys, I am currently doing research on an ML model for cancer research.

I was using the GDC cancer portal, but they don't have annotations. Is there any resource for them? Please help me out.

0 Comments
2024/10/31
07:29 UTC

1

BEA archive data availability issues

Greetings! I am currently conducting research on the US. To start the analysis I require data from BEA that dates back to the 1990s (specifically 1997, when NAICS was introduced). I am pretty new to the BEA website, so I may be lost. The data I need is county-level. When I head to the archive for GDP by county and metro level, the only data that's available dates back to 2017. Maybe I am doing something wrong? Where can I find older data for county and metro? I may need other county-level data from other categories on the website. Maybe there is a website like NHGIS but for BEA data?

1 Comment
2024/10/30
23:08 UTC

2

France inflation data (per department, index type, index variation, household, and product type)

Hi!

I struggled a lot to find the inflation data for France from an official source. I either found articles from INSEE (National Institute for Statistics and Economic Studies) on the inflation for each month which had a link for that data, and even that was only a subset of all the data for that month. Or I found auxiliary websites that didn't cite the source for their data.

I also looked for official APIs but didn't find something that directly provided the consumption index (inflation index) or a preprocessing of it (year-over-year variation for example). But I stumbled randomly on this https://www.insee.fr/fr/statistiques/series/102342213 (it's an official source, it's the INSEE) for which the title might be confusing. The title suggests that the data there is grouped by products and detailed products (a special nomenclature named COICOP).
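For readers unfamiliar with the term, a year-over-year variation compares an index value with the value from the same month one year earlier. A minimal sketch, assuming a monthly index held in a dict keyed by (year, month); the function name `yoy_variation` and the values are made up for illustration and are not INSEE figures:

```python
def yoy_variation(index_by_month, year, month):
    """Return the year-over-year change in percent, or None if a value is missing."""
    current = index_by_month.get((year, month))
    previous = index_by_month.get((year - 1, month))
    if current is None or previous is None or previous == 0:
        return None
    return (current / previous - 1.0) * 100.0


# Hypothetical index values (base 100), not real INSEE data.
sample_index = {(2023, 9): 100.0, (2024, 9): 102.0}
print(yoy_variation(sample_index, 2024, 9))
```
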

I preprocessed it here https://github.com/ReinforcedKnowledge/france-inflation-data-cleaned (includes raw data, preprocessing scripts and preprocessed data). The README is in French but it explains the data a bit and explains how I got granular datasets from that big raw data. I found it a bit messy and confusing at the beginning when I started looking at it, but I was able to extract every unique combination of the modalities (region/department, index type, index variation, if product is under the COICOP nomenclature, household type).
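The idea of one granular dataset per unique combination of modalities can be sketched roughly as follows. This is not the code from the repository; the column names (`department`, `index_type`, `household`) and the rows are hypothetical:

```python
from collections import defaultdict


def split_by_modalities(rows, keys):
    """Group row dicts by the tuple of values found under `keys`."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return dict(groups)


# Hypothetical rows; the real raw file has different columns and values.
rows = [
    {"department": "75", "index_type": "CPI", "household": "all", "value": 101.2},
    {"department": "75", "index_type": "CPI", "household": "all", "value": 101.5},
    {"department": "69", "index_type": "CPI", "household": "urban", "value": 100.9},
]
subsets = split_by_modalities(rows, ["department", "index_type", "household"])
for combo, subset in subsets.items():
    print(combo, len(subset))
```

Each value in `subsets` could then be written to its own file.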

I hope it helps anyone who is looking for that data or trying to understand it, because it really took me some time and effort to find it and make sense of it.

1 Comment
2024/10/30
23:56 UTC

2

Need help creating a dataset, I know nothing

I want to make a local dataset. I don't know how or where to start, and I need help.

4 Comments
2024/10/30
19:27 UTC

2

Regression and Classification Datasets

Hello everyone, I am currently in a class that requires me to use a classification dataset and a regression dataset that are not from the UCI ML repository. I want to do my project about something in the social sciences (I have a poli sci background), but I’ve been struggling to find datasets that align with what I’m looking for. Does anyone have good recs for places to look for the kind of datasets I want?

3 Comments
2024/10/30
18:59 UTC

2

Are there any recipe datasets for commercial use?

I'm looking for a dataset/database of good quality (NO AI) food recipes with PICTURES that go alongside the instruction steps, for commercial use. I would like to use it in an app I'm creating.

I don't mind paying for it, preferably a one-time payment rather than a subscription.

I would have to translate the instructions anyway, so what I'm really worried about are the pictures because of the copyright issues.

And NO APIs, I want to store the database locally.

Thank you

0 Comments
2024/10/30
18:53 UTC

1

Spam Messages Dataset for LLM based Telegram bot

Hello everyone, I need a spam messages dataset to train an LLM-based spam message detection bot for Telegram. Any help is appreciated. (Data from Discord would also be enough.)

0 Comments
2024/10/30
17:31 UTC

0

Data Request Function on Opendatabay Platform

Feel free to request datasets on the platform, and take a look to see if there are any datasets you could source or produce.

These are non-free datasets, so you will be paid generously for your work.
With community help, we can connect data suppliers with data consumers.

https://www.opendatabay.com/request-data

0 Comments
2024/10/30
07:31 UTC

1

Are there any open source recipe datasets for commercial use?

I’m looking for a dataset/database of good quality (NO AI) food recipes with PICTURES that go alongside with instruction steps, for commercial use. I would like to use it in an app I’m creating.

I don’t mind paying for it, preferably a one-time payment rather than a subscription type of thing.

I would have to translate the instructions anyway, so what I’m really worried about are the pictures because of the copyright issues.

And NO APIs, I want to store the database locally.

Thank you

1 Comment
2024/10/29
21:20 UTC

0

Can you suggest an (AI) tool that can read a spreadsheet and produce a summary word/pdf document that summarizes the data into formatted text, table, and figures?

I'm trying to figure out how to essentially automate the production of a monthly data report with nice clean visuals and written summaries based on the Excel spreadsheets that are provided. I'm not sure if ChatGPT is best for this, or another AI tool, or some combination of Python code and something else. Any advice would be appreciated!
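One option that needs no AI tool at all is a small script. A minimal sketch, assuming the spreadsheet has been exported to CSV with hypothetical columns `region` and `sales`; it only produces a plain-text summary table, and charts or Word/PDF output would need additional libraries:

```python
import csv
import io
import statistics


def summarize(csv_text, group_col, value_col):
    """Return per-group count, total, and mean for a numeric column."""
    groups = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups.setdefault(row[group_col], []).append(float(row[value_col]))
    return {
        name: {"count": len(vals), "total": sum(vals), "mean": statistics.mean(vals)}
        for name, vals in groups.items()
    }


def render_report(summary, title):
    """Format the summary as a plain-text table."""
    lines = [title, f"{'group':<10}{'count':>6}{'total':>10}{'mean':>10}"]
    for name, s in sorted(summary.items()):
        lines.append(f"{name:<10}{s['count']:>6}{s['total']:>10.2f}{s['mean']:>10.2f}")
    return "\n".join(lines)


# Hypothetical data standing in for one month's spreadsheet export.
sample_csv = "region,sales\nNorth,120\nNorth,80\nSouth,200\n"
summary = summarize(sample_csv, "region", "sales")
print(render_report(summary, "Monthly report"))
```

Running the same script on each month's export would give a consistent report layout from month to month.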

8 Comments
2024/10/29
16:48 UTC

1

Help Needed: Looking for Crime Scene Datasets for a Crime Scene Reconstruction Project 🚔🔍

Hi everyone!

I’m part of a team working on a capstone project focused on crime scene reconstruction and analysis using machine learning and 3D simulations (Blender/Unity).

What We're Doing:

  • 3D Crime Scene Reconstruction: Creating an interactive model that lets investigators explore and "rewind" scenes to see potential sequences of events (e.g., weapon use, bullet trajectories).
  • Simulated Evidence Analysis: Replaying crime scenes based on data to visualize how evidence like blood spatter patterns or object placements might have occurred.

We’re specifically looking for datasets that contain information related to crime scenes, including data on:

  • Crime types (especially homicide)
  • Evidence details (e.g., weapon type, trajectory info, blood spatter)

If anyone has worked on a similar project before or knows where we can find reliable and detailed crime scene datasets, we’d greatly appreciate any guidance! We’re especially curious if there’s any open-source or academic dataset available, or if there are any other resources that might be useful for this type of project.

Any other help with any aspect of this project is also needed and would be appreciated.

Thanks in advance for any help, suggestions, or shared experiences!

1 Comment
2024/10/29
14:28 UTC

0

A Tool to Create Datasets from Research Papers using Augmented LLMs: Would This Be Helpful?

I've developed a program that uses multiple language models that talk to each other to create databases from scientific papers. I'm looking to use it to build custom datasets for medicinal neural networks. I'm considering deploying it as a website to see if it could be useful for others, but I'm looking for input on how to make it more robust and accessible for broader use.

For those with experience in dataset creation, AI applications in medicine, or similar fields, what features or improvements would make this tool more valuable or realistic for researchers and practitioners? Any insights would be greatly appreciated!

4 Comments
2024/10/29
06:11 UTC

2

How to find datasets (Costa Coffee, to be specific)

Any leads on a Costa Coffee dataset? I'm a BBA undergrad and require it for a project. Can someone please help me with how to find datasets?

3 Comments
2024/10/29
05:22 UTC

1

Pitchbook Access Request Help Please

Hello everyone. I'm an undergrad student currently conducting a thesis related to VC-funded firms. I found that Pitchbook may have lots of information (financials) that I need for my paper, but it's really pricey. Wanting to see if there is anyone in the community who can share access with me or pull the data for free 😅 This would really help me kickstart my research. Help this broke student graduate

2 Comments
2024/10/29
05:20 UTC

0

Is using news APIs legal and reliable?

I need a source of information for a data science project (academic research). Specifically, I need to retrieve a historical record of news about a certain topic, so I am thinking of using a news API instead of web scraping because these APIs seem to return the kind of data I am searching for.

I've come across some of them, such as newsdata.io, newsapi.org and newsapi.ai, but I am wondering if using them is legal and reliable. I mean, are they legal themselves? And if so, am I inherently allowed to use them for my personal (academic) purposes?

The Terms & Conditions say this:

"We don't have the right to authorise any user to use the data for their personal and professional purposes. However, the users can use the data for their personal or professional purposes"

I mean, should I have any concern about this? It's not like Twitter's or Reddit's API, where the data belongs to them and they deliberately give it to you. (In fact, I’m asking this because I planned to extract data from these platforms, but I’ve just realized it’s not possible at all, so I am wondering if there’s another alternative I can use to meet my requirement.)

Well... in essence, my questions are: Are these platforms/tools (APIs) legitimate and meant for data science? Or, in other words: is it a common/familiar practice to use these kinds of "news APIs" for data science?

I didn't even know about them. Have you ever tried them before? Should I do web scraping instead, or can you see another alternative you would advise me to use?

I'd appreciate your help.

6 Comments
2024/10/28
19:32 UTC

1

Data on the borders of the HRE states after the treaty of Westphalia?

Hi everyone!

Does anyone know where to get it? I need to link regions belonging to certain former entities within the HRE to current geographical locations within Germany (at the municipality level).

I hope someone can help!

0 Comments
2024/10/28
15:45 UTC

2

Need help extracting images from this dataset.

I tried extracting images from this dataset but couldn't. It is in DICOM format and I guess in a URL, which I haven't worked with before. Can anyone explain how to access these images?

4 Comments
2024/10/28
14:42 UTC

2

Insurance Fraud Dataset Uncleaned and Not Evenly Distributed or Any Fraud Dataset at all

Looks impossible? All the shit I find on Kaggle either has no good columns, or has many that are just var_1, var_2, var_3. Then I search UCI and all the datasets are the most specific things on the planet, like energy consumption of a dog's poop. I am losing my mind.

5 Comments
2024/10/27
16:44 UTC

6

European Cities Population data set.

Hello, I'm making an ML algorithm that uses a city's infrastructure as features and want to predict its population.
With the OSM library I was able to easily extract the infrastructure data, however I am not able to find a data set with enough European cities. So far all the data sets I've encountered only contain data from 50-80 European cities and the rest are Asian cities.

I've tried to use population density and city area to create the population data set myself, but the numbers I got were terribly wrong.
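For reference, the estimate described here is population ≈ density × area, and it only works when both use the same area unit. A small sketch with made-up numbers; whether a unit mismatch is what went wrong in this case is only an assumption:

```python
M2_PER_KM2 = 1_000_000


def estimate_population(density_per_km2, area_m2):
    """Estimate population from density (people/km^2) and area (m^2)."""
    return density_per_km2 * (area_m2 / M2_PER_KM2)


# Made-up city: 2,000 people per km^2 over 50 km^2 (area given in m^2).
area_m2 = 50 * M2_PER_KM2
print(estimate_population(2000, area_m2))  # area converted to km^2 first
print(2000 * area_m2)                      # units mixed: too large by a factor of 1,000,000
```
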

If someone has any idea of how to get this data I would love the help.

6 Comments
2024/10/27
15:54 UTC

2

Requesting/Looking for a dataset related to rheumatoid arthritis.

I am trying to build a CNN model for classification purposes, and I need a data set with X-ray (even MRI is fine) images of patients with RA. Preferably images of hands. At least 100 images.

0 Comments
2024/10/27
12:26 UTC

3

Mortgage loan application data sample for a Scorecard

I'm planning on making an application scorecard for home loans as my bachelor thesis for University.

One of my (and my academic supervisor's) main concerns is having a reliable dataset, or rather a dataset from a reliable source. One of the big questions that I could be challenged on in such a thesis is the dataset's reliability, so it can't be from somewhere like Kaggle, but for example somewhere like Experian/Equifax would be okay. I work at a bank and deal with such models, but unfortunately I can't use any company data (even if it gets anonymized). So far I've seen some promising stuff on FFIEC's website, but I would like some additional sources so I can make a more educated decision.

Roughly I would need the data to contain these fields:

  • Age
  • Job
  • Income
  • Education
  • Marriage Status
  • Information about previous defaults (something like a Y/N if the applicant has defaulted on a loan in the last 5 years, for example)
  • Type of property that would be purchased with the loan
  • Some other fields that I could potentially exclude in further analysis
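The fields listed above could be held in a record type like the following when loading any candidate dataset. A minimal sketch; the class name, field names, and sample values are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class LoanApplication:
    """One applicant record; field names are hypothetical."""
    age: int
    job: str
    income: float
    education: str
    marital_status: str
    defaulted_last_5y: bool  # the Y/N default flag described above
    property_type: str


def parse_flag(value):
    """Map a Y/N string to a bool."""
    return value.strip().upper() == "Y"


# Invented sample applicant.
record = LoanApplication(35, "teacher", 42000.0, "bachelor", "married",
                         parse_flag("n"), "apartment")
print(record)
```
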

0 Comments
2024/10/27
11:55 UTC

1

Dataset for Datathon for college students

Pretty much as title.

Hi All, I am planning to host a Datathon as a competition for college students. The datasets I could find were too small. Please share direct links, websites, or any other way to get some. Thanks.

5 Comments
2024/10/27
07:18 UTC

1

Seeking VO2max Test Data for Research Training

Hello everyone!

I’m a researcher-in-training working on exercise physiology, and I’m currently looking for datasets on VO2max or incremental exercise tests that include VO2 and, ideally, blood lactate measures. My goal is to practice determining ventilatory and lactate thresholds to refine my analytical skills in these areas.
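As one concrete example of a lactate threshold method, the fixed 4 mmol/L criterion (often called OBLA) can be located by linear interpolation between test stages. This is only one of several methods and may not be the one intended here; the function name and the stage data below are invented:

```python
def workload_at_lactate(workloads, lactates, target=4.0):
    """Linearly interpolate the workload at which lactate first reaches `target`."""
    points = list(zip(workloads, lactates))
    for (w0, l0), (w1, l1) in zip(points, points[1:]):
        if l0 < target <= l1:
            return w0 + (target - l0) * (w1 - w0) / (l1 - l0)
    return None  # target never crossed during the test


# Invented incremental test: workload in watts, blood lactate in mmol/L.
watts = [100, 150, 200, 250, 300]
lactate = [1.2, 1.5, 2.4, 3.6, 5.6]
print(workload_at_lactate(watts, lactate))
```
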

If you have access to any anonymized data or know of open-source datasets, I’d be very grateful for any pointers! I’ve checked platforms like OSF and PhysioNet but haven’t found exactly what I need, so any help would be highly appreciated.

Thank you in advance!

1 Comment
2024/10/27
01:54 UTC

0

[Urgent] Seeking HIPAA-Compliant PHI Database with Identifiable Health Data

Hi everyone! I’m urgently looking to source a HIPAA-compliant database that includes identifiable PHI (Protected Health Information), such as names and specific diagnosis histories, for a research project with rigorous data protection standards.

I need a reputable third-party vendor experienced in securely handling identifiable health data, with all necessary patient consent and compliance protocols in place. Does anyone know of reliable sources or vendors for acquiring such data legally and ethically? Any insights or recommendations are greatly appreciated—thanks!

3 Comments
2024/10/26
21:16 UTC

1

Looking for an inventory dataset (retail or production)

Hello all,

I am looking for an inventory dataset, however, I would also need the name of the company where the dataset is coming from (not any government data).

0 Comments
2024/10/26
10:56 UTC

1

Hi :) I'm looking for data on the number of daily e-scooter rides in a city (any city possible) over one year.

Hello,

I am currently researching the correlation between weather patterns and the usage of shared mobility services, specifically focusing on e-scooter rides. I am looking for a dataset containing daily e-scooter ride counts in a city (any city) covering at least one year.
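Once such a dataset is found, the correlation itself is straightforward to compute. A minimal sketch of the Pearson coefficient, assuming daily temperature as the weather variable; the values below are invented:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Invented daily values: temperature in degrees C and ride counts.
temperature = [5, 10, 15, 20, 25]
rides = [120, 180, 260, 310, 400]
print(round(pearson(temperature, rides), 3))
```
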

Details of the request:

  • Data Scope: Daily ride counts over a one-year period
  • Primary Interest: E-scooter usage data, though data on bike-sharing or shared car services would also be very helpful for comparison.

Any help or direction to relevant data sources would be greatly appreciated.

Thank you very much in advance for your assistance!

1 Comment
2024/10/25
22:46 UTC

3

Looking for a dataset on companies that "speak out"

I'm not sure of the terminology. But I'm attempting to do research surrounding event based studies when companies speak out. And I've been banging my head against the wall on this for weeks! 😂
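For context, one simple form of an event study is the market-adjusted one, where the abnormal return on each day is the stock return minus the market return, summed over the event window. A minimal sketch; the function name and the return values are invented, and other event study models exist:

```python
def cumulative_abnormal_return(stock_returns, market_returns):
    """Sum of (stock - market) daily returns over the event window."""
    if len(stock_returns) != len(market_returns):
        raise ValueError("return series must have the same length")
    return sum(s - m for s, m in zip(stock_returns, market_returns))


# Invented daily returns for a 3-day window around a company statement.
stock = [0.010, -0.020, 0.005]
market = [0.002, -0.001, 0.001]
print(cumulative_abnormal_return(stock, market))
```
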

Possibly if it's on social issues, on political issues, if they comment on a humanitarian crisis or on an international conflict, etc. But I'm having trouble finding any datasets or any proxies that would measure or rank the number of times they "speak out" other than perhaps things like trading volume of the underlying stock or trending on social media.

Are there any datasets you could suggest or point me towards which can help serve as a proxy for companies that stand up and speak out on societal issues?

Thank you kindly for any thoughts!

0 Comments
2024/10/26
01:46 UTC
