/r/datasets


A place to share, find, and discuss Datasets.

Datasets for Data Mining, Analytics and Knowledge Discovery

Rules

  • Try to post original source whenever you can.
  • Low effort posts will be removed.
  • Self-promotion (of a website/domain you work for or own) without disclosure will be removed.
  • Any Paid Dataset or Resource must be marked as such in the title with [PAID].
  • Any Synthetic/Mock data must be marked as such in the title with [Synthetic].
  • All Survey posts are subject to approval. Message the mods before posting.

Unsure about your post?

Feel free to message the mods and discuss it before posting.


197,731 Subscribers

1

🌟 Open Investment Datasets: Free and Growing on GitHub/Huggingface

Hey r/datasets community!

I’m thrilled to share an exciting new resource for all you data enthusiasts, researchers, and finance aficionados out there. https://github.com/sovai-research/open-investment-datasets

🔍 What’s New?

Sov.ai has just launched the Open Investment Data Initiative! We’re building the industry’s first open-source investment datasets tailored for rigorous research and innovative projects. Whether you're into AI, ML, quantitative finance, or just love diving deep into financial data, this is for you.

📅 Free Access with a 6-Month Lag

All our 20 datasets will be available for free with a 6-month lag for non-commercial research purposes. This means you can access high-quality, ticker-linked data without breaking the bank. For commercial use, we offer a subscription plan that makes premium data affordable (more on that below).

📈 What We Offer

By the end of 2026, Sov.ai aims to provide 100+ investment datasets, including but not limited to:

  • 📰 News Sentiment: Ticker-matched and theme-matched sentiment analysis from various news sources.
  • 📈 Price Breakout Predictions: Daily updates predicting upward price movements for US equities.
  • 🔍 Insider Flow Prediction: Over 60 insider trading features ideal for machine learning models.
  • 💼 Institutional Trading: In-depth analysis of institutional investment behaviors and strategies.
  • 📢 Lobbying Data: Detailed data on corporate lobbying activities, linked to specific tickers.
  • 💊 Pharma Clinical Trials: Unique dataset tagging clinical trials with predicted success outcomes.
  • ⚠️ Corporate Risks: Bankruptcy predictions (Chapter 7 & 11) for over 13,000 US publicly traded stocks.
  • ...and many more!

🤝 Get Involved!

We’re looking for firms and individuals to join us as co-architects or sponsors on this journey. Your support can help us expand our offerings and maintain the quality of our data. Interested? Reach out to us here or connect via our LinkedIn, GitHub, and Hugging Face profiles.

🧪 Example Use Cases

Here’s how easy it is to get started with our datasets using the Hugging Face datasets library:

from datasets import load_dataset

# Example: Load News Sentiment Dataset

df_news_sentiment = load_dataset("sovai/news_sentiment", split="train").to_pandas()

# Example: Load Price Breakout Dataset

df_price_breakout = load_dataset("sovai/price_breakout", split="train").to_pandas()

# Add more datasets as needed...

1 Comment
2024/11/09
16:02 UTC

3

Does anyone have a dataset of plots (bar, scatter, hist, etc.) paired with each plot's description?

I know this has a very low chance of existing, but I need it. Has anyone seen a dataset like this, with a plot column and a description (insights) of the plot? I have only found datasets with plots but no descriptions.

1 Comment
2024/11/09
11:32 UTC

1

Looking for Thyroid scan image dataset

Hi, I am a master's student, and for my final dissertation I am looking for a thyroid scan image dataset to detect types of hyperthyroidism. Kindly help if you have any clue as to where I can find one. It would be really great if you could share the dataset.

Thank you

0 Comments
2024/11/09
05:11 UTC

1

Need help extracting the NIHSS from the MIMIC-III Dataset

Hey guys, I am currently working on a project about the use of machine learning for stroke rehabilitation, and I want to extract information, like the NIHSS score, from medical datasets. I found an article where someone already did that and even provides the code on GitHub. But my problem is, I don't know where to insert the MIMIC-III dataset (I already have it), which consists of several .csv files, into the code so that it runs correctly. There is no README or any file that explains how to run the code or prepare the dataset. Maybe someone has done this or can help me with it.

Link to the Article: https://physionet.org/content/stroke-scale-mimic-iii/1.0.0/

Link to the Github repo: https://github.com/huangxiaoshuo/NIHSS_IE

(Sorry for the bad language; I am not a native English speaker.)
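I'm not familiar with that particular repo, but pipelines like the one in the article generally start by loading MIMIC-III's NOTEEVENTS.csv (the table holding free-text clinical notes) and searching the note text for NIHSS mentions. A rough, self-contained sketch of that first step, using a made-up two-row sample in place of the real file (the regex is a crude illustration, not the paper's method):

```python
import csv
import io
import re

# Made-up sample standing in for MIMIC-III's NOTEEVENTS.csv
# (the real file has more columns: ROW_ID, SUBJECT_ID, TEXT, ...).
sample = io.StringIO(
    "ROW_ID,TEXT\n"
    '1,"Patient admitted with stroke. NIHSS score of 14 on arrival."\n'
    '2,"Follow-up visit, no acute findings."\n'
)

# Crude pattern for NIHSS mentions; a real extractor needs many more variants.
pattern = re.compile(r"NIHSS\D{0,20}?(\d{1,2})", re.IGNORECASE)

scores = {}
for row in csv.DictReader(sample):
    match = pattern.search(row["TEXT"])
    if match:
        scores[row["ROW_ID"]] = int(match.group(1))
# scores maps note ROW_ID -> extracted NIHSS value
```

To run this against the real dataset, you would point `csv.DictReader` at the actual NOTEEVENTS.csv path instead of the in-memory sample.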

1 Comment
2024/11/08
15:11 UTC

10

Scraped Every Parcel in the United States

Hey everyone, my co-worker and I are software engineers, and we were working on a side project that required parcel data for all of the United States. We quickly saw that it was super expensive to get access to this data, so we naively thought we would scrape it ourselves over the next month. Well, anyways, here we are 10 months later. We created an API so other people could have access to it much cheaper. I would love for you all to check it out: https://www.realie.ai/data-api. There is a free tier, and you can pull 500 records per call, meaning you should still be able to get quite a bit of data to review. If you need a higher limit, message me for a promo code.

Would love any feedback, so we can make it better for people needing this property data. Also happy to transfer it to an S3 bucket for anyone working on projects that require access to the whole dataset.

Our next challenge is making these scripts run automatically every month without breaking the bank. We are thinking Azure Functions? Would love any input if people have other suggestions. Thanks!

6 Comments
2024/11/08
13:46 UTC

53

I scraped every band in metal archives

I've been scraping most of the data on the metal-archives website for the past week. I extracted 180k entries' worth of metal bands, their labels, and soon the discographies of each band. Let me know what you think and if there's anything I can improve.

https://www.kaggle.com/datasets/guimacrlh/every-metal-archives-band-october-2024/data?select=metal_bands_roster.csv

EDIT: updated with a new file including every band's discography

48 Comments
2024/11/08
13:28 UTC

1

California laws and statutes in a downloadable format?

Before I try to figure out how to write a scraper for https://leginfo.legislature.ca.gov/faces/codes.xhtml, I wanted to see if there is any downloadable dataset that includes California statutes (I really only need the Penal and Evidence Codes). I'd prefer PDF, but I'll take anything.

0 Comments
2024/11/08
11:18 UTC

2

Please help me find a lost dataset - DISCO-10M

DISCO-10M was removed by Hugging Face and wiped from the internet. I cannot find any site that still has it other than https://www.atyun.com/datasets/info/DISCOX/DISCO-10M.html?lang=en

which I can't sign up for; I've tried a US number, a UK number, and a Chinese number.

I'm desperate, y'all. Please help, or DM me if you have anything.

0 Comments
2024/11/08
05:11 UTC

2

autolabel tool for labelling your dataset!

Hi guys, I've made this cool thing! Go check it out!

https://github.com/leocalle-swag/autolabel-tool

0 Comments
2024/11/07
22:39 UTC

3

[self-promotion] Giving back to the community! Free web data!

Hey guys,

I've built an AI tool to help people extract data from the web. I need to test my tool and learn more about the different use cases that people have, so I'm willing to extract web data for free for anyone that wants it!

6 Comments
2024/11/07
16:25 UTC

1

Dataset with financial news (articles with headlines and full text incorporated)

As the title says.

0 Comments
2024/11/07
15:02 UTC

5

2024 county-level presidential election results

Anybody aware of public county-level 2024 presidential election results datasets, downloadable as CSV or accessible via free API? I'm specifically looking for total number of votes by county for each party.

2 Comments
2024/11/07
13:39 UTC

4

Looking for a dataset of hormonal imbalance in women

Hi everyone, I am searching for a dataset about hormonal imbalance in women for a project. The dataset may or should contain physical symptoms, age, height, weight, BMI, food habits, hormonal test results, and other clinical features. Thanks in advance.

1 Comment
2024/11/07
13:01 UTC

1

Returns to education across different countries

I am still trying to understand how to find proper datasets; every time I need to look for something, I am lost. Any help is highly appreciated! Thank you in advance.

1 Comment
2024/11/07
10:17 UTC

1

PD-Weighted Cardiac MR or Cardiac MR Phantom Images

I'm working on a small project to demonstrate the effects of T1 and T2 weighting on a PD-weighted image or a phantom image.

For example, I aim to recreate a T1 contrast between tissues on a PD image of the heart following the signal equation for MRI.

I've been searching for example pictures but haven't had much luck. I've tried resources like the Cardiac Atlas Project, open-access papers, raw K-space data, and phantom images.

Does anyone have suggestions on where I might find what I need?

0 Comments
2024/11/07
10:02 UTC

1

AI-Chat Dataset's (Previous Context)

I've been learning how to fine-tune locally and wanted to create a dataset from the conversations I had with LLMs like GPT and Claude. I know that datasets usually have an input-output format with some variation of metadata and instructions along with it, but how does one actually fine-tune on data that requires previous context?

Let's say my chat initially would go something like this:

Input: What is a bird?

Output: A bird is...

Input: Why do they fly?

Output: They fly because...

In this context the AI knows what I am referring to based on my previous input. But how would I represent the previous context in a dataset? The issue is that if I just include "Why do they fly?" as an isolated input, the model wouldn't have the context about birds from the previous exchange, and would have to assume the input "Why do they fly?" is generally about birds (possibly ignoring that the user could be referring to a plane, etc.).

I initially combined the previous output and the current input together, but I feel like that method would only train the model to expect the previous output to be included with the input in order to get the current output. Another method was to nest the conversation across multiple input-output pairs, but that wouldn't be scalable, since some of my conversations span 50 chats.

Is there a more efficient way for me to handle a dataset that uses previous context? The model I would be training for now is Llama 3.1 8B, as it is small enough to train fast and test whether this dataset approach is beneficial.
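One common approach (used by most chat fine-tuning pipelines, and a reasonable starting point here) is to store each conversation as a list of role-tagged messages, then unroll it into one training sample per assistant turn, where the input is the full message history up to that turn. A minimal sketch in Python; the plain `role: content` formatting and field names are placeholders, not the actual Llama 3.1 chat template:

```python
# Sketch: unroll a multi-turn conversation into per-turn training samples,
# each carrying the full prior context. Field names are illustrative.
conversation = [
    {"role": "user", "content": "What is a bird?"},
    {"role": "assistant", "content": "A bird is..."},
    {"role": "user", "content": "Why do they fly?"},
    {"role": "assistant", "content": "They fly because..."},
]

def unroll(messages):
    """Yield one {input, output} sample per assistant turn."""
    samples = []
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant":
            # Everything before this turn becomes the model's input.
            context = "\n".join(
                f"{m['role']}: {m['content']}" for m in messages[:i]
            )
            samples.append({"input": context, "output": msg["content"]})
    return samples

samples = unroll(conversation)
```

With this unrolling, the second sample's input contains the whole bird exchange, so "Why do they fly?" is no longer ambiguous; a 50-turn conversation simply yields 50 progressively longer samples, and the trainer's loss masking (rather than the dataset layout) handles only scoring the final response.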

0 Comments
2024/11/06
22:09 UTC

3

[Self-Promotion] [Open Source] Luxxify: Ulta Makeup Reviews

Luxxify: Ulta Makeup Reviews

Hey everyone,

I recently released an open-source dataset containing Ulta makeup products and their corresponding reviews!

Custom Created Kaggle Dataset via Webscraping: Luxxify: Ulta Makeup Reviews

Feel free to use the dataset I created for your own projects!

Webscraping Process

  • Web Scraping: Product and review data are scraped from Ulta, a popular e-commerce site for cosmetics. This raw data serves as the foundation for a robust recommendation engine, with a custom scraper built using requests, Selenium, and BeautifulSoup4. Selenium was used to perform button-click and scroll interactions on the Ulta site to dynamically load data. I then used requests to access specific URLs found in XHR GET requests. Finally, I used BeautifulSoup4 for scraping static text data.
  • Leveraging PostgreSQL UDFs for Feature Extraction: For data management, I chose PostgreSQL so that I could clean the scraped data from Ulta. This data was originally stored in complex JSON, which needed to be unrolled in Postgres.
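As a rough illustration of the unrolling step described above (the field names here are hypothetical, not the actual Ulta schema), flattening nested review JSON into tabular rows in Python might look like:

```python
import json

# Hypothetical nested JSON resembling scraped product/review data;
# the real Ulta schema differs.
raw = json.loads("""
{
  "product": "Matte Lipstick",
  "reviews": [
    {"rating": 5, "text": "Love it"},
    {"rating": 3, "text": "Dries fast"}
  ]
}
""")

# Unroll: one flat row per review, each carrying the product name,
# mirroring what the PostgreSQL UDFs do at scale.
rows = [
    {"product": raw["product"], "rating": r["rating"], "text": r["text"]}
    for r in raw["reviews"]
]
```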

As an example, I made a recommender model using this dataset which benefited greatly from its richness and diversity.

To use the Luxxify Makeup Recommender click on this link: https://luxxify.streamlit.app/

I'd greatly appreciate any suggestions and feedback :)

Link to GitHub Repo

1 Comment
2024/11/06
17:27 UTC

3

Created 24 Interesting Dataset Challenges for December (SQL Advent Calendar) 🎁

Hey data folks! I've put together an advent calendar of SQL challenges that might interest anyone who enjoys exploring and manipulating datasets with SQL.

Each day features a different Christmas themed dataset with an interesting problem to solve (all the data is synthetic).

The challenges focus on different ways to analyze and transform these datasets using SQL. For example, finding unusual patterns, calculating rolling averages, or discovering hidden relationships in the data.

While the problems use synthetic data, I tried to create interesting scenarios that reflect real-world data analysis situations.
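As a taste of the rolling-average style of problem mentioned above (table and column names invented for illustration, not taken from the calendar), here is a SQL window function run through Python's built-in sqlite3:

```python
import sqlite3

# Hypothetical daily-sales table; SQLite 3.25+ supports window functions.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (day INTEGER, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0)],
)

# 3-day rolling average of sales, ordered by day.
rows = con.execute("""
    SELECT day,
           AVG(amount) OVER (
               ORDER BY day
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS rolling_avg
    FROM sales
    ORDER BY day
""").fetchall()
# rows -> [(1, 10.0), (2, 15.0), (3, 20.0), (4, 30.0)]
```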

Starting December 1st at adventofsql.com - (totally free) and you're welcome to use the included datasets for your own projects.

I'd love to hear what kinds of problems you find most interesting to work on, or if you have suggestions for interesting data scenarios!

1 Comment
2024/11/05
17:55 UTC

4

Looking for jokester Datasets to train my LLMs to be funny

As the title suggests, I'm looking for funny datasets, like one containing only puns.

I'm also interested in character-trait-specific humor, such as a dataset filled with funny and outrageous conspiracy theories or self-deprecating, dark humor.

Any humorous datasets that could turn an LLM into a joke machine are welcome!

0 Comments
2024/11/05
09:48 UTC

3

Looking for DISCO-10M: A Large-Scale Music Dataset

Hi everyone,

I'm looking for DISCO-10M: A Large-Scale Music Dataset. It was previously available through Hugging Face, but it is not there anymore. Does anyone have a copy they can share?

2 Comments
2024/11/05
09:34 UTC

1

[self-promotion] Introducing SymptomCheck Bench: An Open-Source Benchmark for Testing Diagnostic Accuracy of Medical LLM Agents

Hi everyone! I wanted to share a benchmark we developed for testing our LLM-based symptom checker app. We built this because existing static benchmarks (like MedQA, PubMedQA) didn’t fully capture the real-world utility of our app. With no suitable benchmark available, we created our own and are open-sourcing it in the spirit of transparency.

GitHub: https://github.com/medaks/symptomcheck-bench

Quick Summary: 

We call it SymptomCheck Bench because it tests the core functionality of symptom checker apps—extracting symptoms through text-based conversations and generating possible diagnoses. It's designed to evaluate how well an LLM-based agent can perform this task in a simulated setting.

The benchmark has three main components:

  1. Patient Simulator: Responds to agent questions based on clinical vignettes.
  2. Symptom Checker Agent: Gathers information (limited to 12 questions) to form a diagnosis.
  3. Evaluator Agent: Compares the symptom checker's diagnoses against the ground-truth diagnosis.

Key Features:

  • 400 clinical vignettes from a study comparing commercial symptom checkers.
  • Multiple LLM support (GPT series, Mistral, Claude, DeepSeek)
  • Auto-evaluation system validated against human medical experts

We know it's not perfect, but we believe it's a step in the right direction for more realistic medical AI evaluation. Would love to hear your thoughts and suggestions for improvement!

0 Comments
2024/11/05
09:25 UTC

1

[Request] Working on a project for Underwater Human body detection for rescue missions.

Hello everyone,

I’m working on an image segmentation project aimed at aiding rescue missions by detecting human bodies in underwater crash site images. Specifically, the goal is to identify and segment human figures from underwater images, which could be instrumental in emergency response and recovery operations.

I’m reaching out to see if anyone has, or knows of, a dataset that includes underwater human imagery, especially from crash sites or similar scenarios. Ideally, the dataset would contain varied conditions like different lighting, depths, and visibility to better simulate real-world underwater environments.

If such a dataset isn’t readily available, any resources, advice on data collection, or possible collaboration opportunities to create one would be greatly appreciated! I’m open to any suggestions, as I understand this is a unique and challenging request.

Thank you in advance for any help you can provide!

1 Comment
2024/11/04
18:06 UTC

1

Hi all, looking to find some data on not for profit vs for profit hospital performance Any help is greatly appreciated!

This is for a university project. Thus far I've tried GuideStar, the American Hospital Directory, CMS, and more, to no avail. I am really struggling to obtain any data but am passionate about this topic (and unfamiliar with datasets, lol). I'm looking for financials and/or patient outcomes. I would really appreciate anything!

1 Comment
2024/11/04
18:47 UTC

3

Looking for Datasets on Soil Characteristics for Farming and Water Consumption in Agriculture/Industry/Home Use

Hi everyone,

I’m working on a project that requires datasets related to two areas:

1.	Soil characteristics: I need data on soil and whether it is suitable for farming.
2.	Water consumption: Datasets that track water usage, ideally in agriculture, industrial settings, or residential homes. Information on seasonal or regional usage trends would be especially helpful.

If anyone knows where I could find reliable datasets for these, or if you’ve come across anything similar in your own work, I’d really appreciate your guidance. Thanks in advance for any recommendations or resources!

0 Comments
2024/11/04
20:40 UTC

3

[self-promotion] Open synthetic dataset and fine-tuned models from Gretel.ai for PII/PHI detection across diverse data types on Huggingface

Detect PII and PHI with Gretel's latest synthetic dataset and fine-tuned NER models 🚀:
- 50k train / 5k validation / 5k test examples
- 40 PII/PHI types
- Diverse real world industry contexts
- Apache 2.0

Dataset: https://huggingface.co/datasets/gretelai/gretel-pii-masking-en-v1
Fine-tuned GliNER PII/PHI models: https://huggingface.co/gretelai/gretel-gliner-bi-large-v1.0
Blog / docs: https://gretel.ai/blog/gliner-models-for-pii-detection

0 Comments
2024/11/04
18:19 UTC

1

[Dataset] Introducing K2Q: A Diverse Prompt-Response Dataset for Information Extraction from Documents

Hey r/Datasets! We’re excited to announce K2Q, a newly curated dataset collection for anyone working with visually rich documents and large language models (LLMs) in document understanding. If you want to push the boundaries on how models handle complex, natural prompt-response queries, K2Q could be the dataset you've been looking for! The paper can be found here and is accepted to the Empirical Methods in Natural Language Processing (EMNLP) Conference.

What’s K2Q All About?

As LLMs continue to expand into document understanding, the need for prompt-based datasets is growing fast. Most existing datasets rely on basic templates like "What is the value for {key}?", which don’t fully reflect the varied, nuanced questions encountered in real-world use. K2Q steps in to fill this gap by:

  • Converting five Key Information Extraction (KIE) datasets into a diverse, prompt-response format with multi-entity, extractive, and boolean questions.
  • Using bespoke templates that better capture the types of prompts LLMs face in real applications.
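To make the template idea above concrete (a hypothetical sketch, not K2Q's actual templates or schema), converting a KIE key-value annotation into varied prompt-response pairs might look like:

```python
import random

# Hypothetical KIE annotation: extracted key-value pairs from one document.
annotation = {"invoice_number": "INV-1042", "total": "$318.50"}

# A few diverse templates, instead of only "What is the value for {key}?"
TEMPLATES = [
    "What is the value for {key}?",
    "Find the {key} mentioned in this document.",
    "Does the document state a {key}? If so, what is it?",
]

def to_prompt_response(annotation, rng):
    """Emit one (prompt, response) pair per annotated key."""
    pairs = []
    for key, value in annotation.items():
        template = rng.choice(TEMPLATES)
        pairs.append({"prompt": template.format(key=key.replace("_", " ")),
                      "response": value})
    return pairs

pairs = to_prompt_response(annotation, random.Random(0))
```

The same annotation then trains the model on several phrasings of the same extraction task, which is the diversity effect the benchmark results above attribute to K2Q.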

Why Use K2Q?

Our empirical studies on generative models show that K2Q’s diversity significantly boosts model robustness and performance compared to simpler, template-based datasets.

Who Can Benefit from K2Q?

Researchers and practitioners can use K2Q to:

  • Test zero-shot or fine-tuned models with realistic, challenging questions.
  • Improve model performance on KIE tasks through diverse prompt-response training.
  • Contribute to future studies on data quality for generative model training.

📄 Dataset & Paper: K2Q will be presented at the Findings of EMNLP, so feel free to dive into our paper for in-depth analyses and results! We’d love to see K2Q inspire your own projects and findings in Document AI.

2 Comments
2024/11/04
17:29 UTC

2

Looking for a dataset: Time-series (monthly/weekly/daily) sales dataset of at least 3 years with a minimum of 10 different products.

Hi all,

As the title describes, I am looking for a time-series sales dataset covering at least 3 years with a minimum of 10 different products. The dataset should be monthly, weekly, or daily.

Can someone recommend one? I am really struggling to find one on Kaggle.

Hope you guys can help me out!!

1 Comment
2024/11/04
13:16 UTC

2

Gene Dependency scores for 17300 normal tissue samples

0 Comments
2024/11/03
21:51 UTC

2

[Research] Mushroom Observer Dataset

Hi,
Has anyone used the Mushroom Observer dataset for image classification? Unless I'm getting something badly wrong, the files all reference image IDs but do not supply the images.
I think the images can be gathered through the API using the image ID, but they do not want you to scrape them this way.
Does anyone have any experience working with it? It's for an image classification application.

2 Comments
2024/11/03
14:02 UTC

6

[Vanityfair] advertisements published in each issue from 1913 to 2024

Ads data from Vanity Fair magazine issues published from 1913 to November 2024.

Data Format:

    {
      [year]: {
        year: "1913",
        issues: [{
          id: "issue's month",
          ads: [{
            articleKey: "articleKey",
            issueKey: "issueKey",
            title: "Ad title",
            slug: "ad-slug",
            coverDate: "coverDate",
            pageRange: "page number on which the ad was published",
            wordCount: "word count"
          }]
        }]
      }
    }

Link: Google Drive

NOTE: VF was shut down in 1936 and relaunched in 1983, so data for the in-between years isn't available.
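A small sketch of walking this structure in Python (key names taken from the format above; the sample dict and its contents are illustrative, not real values from the file):

```python
# Illustrative sample following the year -> issues -> ads format above.
data = {
    "1913": {
        "year": "1913",
        "issues": [
            {"id": "January", "ads": [
                {"title": "Ad title", "slug": "ad-slug",
                 "pageRange": "12", "wordCount": "85"},
            ]},
        ],
    },
}

# Count ads per year across all issues of that year.
ads_per_year = {
    year: sum(len(issue["ads"]) for issue in entry["issues"])
    for year, entry in data.items()
}
```

For the real file, you would `json.load` the download from the Drive link and run the same comprehension over it.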

0 Comments
2024/11/02
19:51 UTC

Back To Top