/r/datasets

A place to share, find, and discuss Datasets.

Datasets for Data Mining, Analytics and Knowledge Discovery

Rules

  • Try to post original source whenever you can.
  • Low effort posts will be removed.
  • Self-promotion (of a website/domain you work for or own) without disclosure will be removed.
  • Any Paid Dataset or Resource must be marked as such in the title with [PAID].
  • Any Synthetic/Mock data must be marked as such in the title with [Synthetic].
  • All Survey posts are subject to approval. Message the mods before posting.

Unsure about your post?

Feel free to message the mods and discuss it before posting.

/r/datasets

199,106 Subscribers

1

Is there any dataset that records eye movements of Alzheimer's patients?

Hello Guys,

I intend to do a project on Alzheimer's detection based on eye movements. I read some papers on this but all of them used their own recorded data. Is there any publicly available dataset on this? I will be happy to know your suggestions on this project's implementation.

0 Comments
2024/12/18
18:04 UTC

2

Looking for a YOLO/Darknet-compatible dataset that can be used to scan an image/video and identify specific body parts

Hey all,

I'm working on a number of devices where I'd like to use machine learning and live video to identify specific parts of the human body.

This is a sex-positive project, and therefore rather than have a classifier that censors anything it thinks might be nudity, I'm looking for a dataset that will enable me to identify nipples, penises, vaginas, and other potentially erogenous zones on people of all genders, colours, and body shapes.

It feels to me that it should be possible, but I'm new to creating/training models and not sure where to start, so I figure standing on the shoulders of others is probably a good place!

0 Comments
2024/12/18
11:05 UTC

3

Song Dataset with Mood/Vibe Parameters

I have an idea for a personal project and I could use some help finding a dataset.

Project:

I would like to make a playlist generator where I can specify different moods at different points of time in the playlist. So something along the lines of 1h Chill, 1h Pop, 1h Dance. Obviously I would like much more refinement than I showed in the example. My thought was that I could find paths between different song types so that the genre transitions are smooth.
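
A rough sketch of the "paths between song types" idea, assuming each song is reduced to a small numeric feature vector (here a made-up `(energy, danceability)` pair). The song names, values, and the greedy nearest-neighbour strategy are illustrative assumptions, not an existing tool:

```python
import math

def smooth_order(songs, start):
    """Greedy ordering: always move to the closest not-yet-played song in feature space."""
    remaining = dict(songs)          # copy so the caller's dict is untouched
    order = [start]
    features = remaining.pop(start)
    while remaining:
        # Pick the unplayed song whose features are nearest to the current song.
        nearest = min(remaining, key=lambda name: math.dist(features, remaining[name]))
        features = remaining.pop(nearest)
        order.append(nearest)
    return order

# Hypothetical songs with (energy, danceability) scores between 0 and 1.
songs = {
    "chill_a": (0.1, 0.2),
    "dance_a": (0.9, 0.9),
    "pop_a": (0.5, 0.6),
    "chill_b": (0.2, 0.3),
}
print(smooth_order(songs, "chill_a"))
```

A greedy walk like this keeps each individual transition small but does not guarantee the best overall ordering; a shortest-path search over a similarity graph would be the next thing to try.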

Maybe this already exists?

Dataset:

What I am looking for is a long list dataset with, obviously, the main parameters (name, artist, year, etc.) but also things like popularity, danceability, singability, nostalgia factor, high vs low energy, happiness, tempo, and more.

Does a dataset like this exist? I also thought it could be possible to use sentiment analysis on the lyrics to generate some of these parameters.

Let me know if you have any ideas

2 Comments
2024/12/18
09:12 UTC

2

Dataset for US Spending at Federal, State, County Level?

Is there any detailed breakdown of US spending? Ideally I want something that goes very granular. I have no idea how money is managed by the US, which is why I’m asking.

1 Comment
2024/12/18
03:24 UTC

1

Is there a dataset listing death/birth dates?

Is there a dataset that contains both the birth and death dates of real people?

This may be a bit of a morbid topic, but I've been talking to my wife about people dying close to their birthdays, and since I tend to do silly projects as a way to keep my knowledge alive, I figured an analysis of this data might tell us something (preferably that there's no correlation lol).

However, all government databases I found only provide aggregated data, such as death and birth rates, unfortunately. I know this may involve some data security and privacy concerns, but I would really just need these two linked dates to do the analysis, no names or anything.
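
For the analysis itself, a minimal sketch of the measurement, assuming each record is just a `(birth, death)` pair of dates. The function names and the Feb 29 handling are my own assumptions:

```python
from datetime import date

def birthday_in_year(birth, year):
    """Return the birthday as it falls in the given year."""
    try:
        return birth.replace(year=year)
    except ValueError:
        # Born on Feb 29 and the target year is not a leap year (assumed convention).
        return date(year, 2, 28)

def days_from_nearest_birthday(birth, death):
    """Signed day offset: positive means death came after the nearest birthday."""
    candidates = [birthday_in_year(birth, death.year + offset) for offset in (-1, 0, 1)]
    return min(((death - b).days for b in candidates), key=abs)

print(days_from_nearest_birthday(date(1950, 6, 15), date(2020, 6, 20)))
```

A histogram of these offsets over many people would be roughly flat if there were no relationship between birthday and date of death.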

If anyone has access to a structure like this, or perhaps an API that can make this data available, I would be very grateful. I promise to bring this complete study to reddit as soon as I finish it.

3 Comments
2024/12/18
01:14 UTC

1

Need Dataset for personalised learning pathways

I have to make a personalized learning pathways project for my AI/ML course. Please help me find a dataset.

7 Comments
2024/12/17
18:46 UTC

3

NBA Team stats datasets for multiple years

I was looking for a dataset with team stats for every NBA team for each year, covering at least the last decade. I couldn't find one, so I figure the best way is to get the CSV for each year and then combine them. Does anyone know any other ways to get it?
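
A minimal sketch of the "one CSV per year, then combine" approach, assuming every season's file shares the same header. The column names and sample values below are made up for illustration; real files would be read from disk instead of from strings:

```python
import csv
import io

def combine_season_csvs(csv_texts_by_season):
    """Merge per-season CSV texts into one list of rows, tagging each row with its season."""
    combined = []
    for season, text in sorted(csv_texts_by_season.items()):
        for row in csv.DictReader(io.StringIO(text)):
            row["season"] = season   # keep track of which file the row came from
            combined.append(row)
    return combined

# Hypothetical two-season sample with placeholder team codes.
sample = {
    2023: "team,wins,losses\nAAA,50,32\nBBB,41,41\n",
    2022: "team,wins,losses\nAAA,47,35\nBBB,38,44\n",
}
rows = combine_season_csvs(sample)
print(len(rows), rows[0])
```

Adding the season as a column is what makes the combined table usable for year-over-year comparisons.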

1 Comment
2024/12/17
04:08 UTC

1

Dataset with both categorical and numerical variables

Hi, I'm looking for a dataset which has at least three numerical variables and two categorical variables. It should be easy enough to look for, but I'm having trouble finding any which match the requirements. Any suggestions for resources where I can look?

The dataset is for a project. We aren't allowed to use built-in or made-up data, or data from places like Kaggle.

3 Comments
2024/12/16
21:34 UTC

2

[self-promotion] Giving back to the datasets community with some free data!

Hey guys,

I just wanted to share our project called Potarix (https://potarix.com/). It’s an AI-powered web scraping/data extraction tool that can pull data from any website. You can use it at (https://app.potarix.com). 

I wanted to give back to this community, so we’ve given everyone that signs up $5 of credits. Scraping each page takes up $0.10 of your credits. You are not charged for unsuccessful scrapes! That should let you get data from 50 web pages.

So far, we’ve used this project (with some added features) to help clients:

  • Scrape betting data from the NFL, NBA, and NCAA.
  • Scrape all the Google reviews for each business in San Francisco  
  • Scrape business contact information on Google Maps for every single business in the Houston area

Looking ahead, we built some stuff in-house that we’d love to include in the SAAS platform shortly. We’ve built functionality to click, type, scroll, etc. on the page. AI also tends to be wrong sometimes, so we created a tweakable script in the backend, to control the agent's actions. That way, you're in control and can bring the script to 100% accuracy. We’ve also seen people battling to build infrastructure for their large-scale scraping projects. We wanna autonomously let folk set up parallelization and choose the infra for their project so everything is scraped as quickly and succinctly as possible from the SAAS. 

If any of these future features sound interesting, feel free to book some time, and we can discuss how we can help you with these now!

0 Comments
2024/12/16
20:58 UTC

3

Multi-source rich social media dataset - a full month of global chatter!

Hey, data enthusiasts and web scraping aficionados!
We’re thrilled to share a massive new social media dataset that just dropped on Hugging Face! 🚀

Access the Data:

👉Social Media One Month 2024

What’s Inside?

  • Scale: 270 million posts collected over one month (Nov 14 - Dec 13, 2024)
  • Methodology: Total sampling of the web, statistical capture of all topics
  • Sources: 6000+ platforms including Reddit, Twitter, BlueSky, YouTube, Mastodon, Lemmy, and more
  • Rich Annotations: Original text, metadata, emotions, sentiment, top keywords, and themes
  • Multi-language: Covers 122 languages with translated keywords
  • Unique features: English top keywords, allowing super-quick statistics, trends/time series analytics!
  • Source: At Exorde Labs, we are processing ~4 billion posts per year, or 10-12 million every 24 hrs.

Why This Dataset Rocks

This is a goldmine for:

  • Trend analysis across platforms
  • Sentiment/emotion research (algo trading, OSINT, disinfo detection)
  • NLP at scale (language models, embeddings, clustering)
  • Studying information spread & cross-platform discourse
  • Detecting emerging memes/topics
  • Building ML models for text classification

Whether you're a startup, data scientist, ML engineer, or just a curious dev, this dataset has something for everyone. It's perfect for both serious research and fun side projects. Do you have questions or cool ideas for using the data? Drop them below.

We’re processing over 300 million items monthly at Exorde Labs—and we’re excited to support open research with this Xmas gift 🎁. Let us know your ideas or questions below—let’s build something awesome together!

Happy data crunching!

Exorde Labs Team - A unique network of smart nodes collecting data like never before

0 Comments
2024/12/16
11:05 UTC

1

I need help finding a data breaches data set. Where to look?

Hi! I am writing my thesis and I need a data set that contains data on data breaches: how they happened, their scale, and possibly the sensitivity of the leaked data. I don't know where to find it. The only place I know is Kaggle, and it does not seem professional. Any advice?

2 Comments
2024/12/15
22:40 UTC

3

Looking for Fraud Detection Datasets

I am writing a book chapter on fraud detection using machine learning. I found that most of the current research is rather hard to apply for a person actually building models: every paper likes to highlight the lack of good datasets, but no one provides a collection of good datasets that people reading their paper can use.

I think that if I include some good datasets for people to train their models on in my chapter, then that will be a very good contribution from my side.

Do you know any good datasets that are used for this, or where I can look for such datasets?

I am honestly clueless when it comes to collecting and finding good datasets for industry grade applications, and I will be really grateful for any help that I get🙏🙏

1 Comment
2024/12/15
21:52 UTC

2

NFL Data Help for Expected Hypothetical Completion Probability

I'm currently trying to predict the 2025 Super Bowl winner for a college final presentation. I'd like to use Expected Hypothetical Completion Probability (EHCP) from Big Data Bowl 2019 to see which teams best optimize their playbook for EHCP, and whether that correlates with how often they win or complete passes, but I'm having trouble finding a data source.

The EHCP metric requires two main types of data:

1. Play-by-Play Data:

  • Includes high-level information like down, distance, time remaining, score differential, and whether the pass was completed.

2. Player Tracking Data:

  • Tracks the location of players and the ball during each play.

Key elements:

  • Receiver and defender positions.
  • Ball location during the pass.
  • Receiver separation, speed, and direction.
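
To make the tracking-derived elements concrete, here is a small sketch of how separation and speed could be computed from raw positions. It assumes positions are `(x, y)` coordinates in yards and that frames arrive at a fixed interval; the function names and the interval are hypothetical, not the Big Data Bowl schema:

```python
import math

def receiver_separation(receiver_xy, defender_xys):
    """Distance from the receiver to the nearest defender."""
    return min(math.dist(receiver_xy, d) for d in defender_xys)

def speed(prev_xy, curr_xy, seconds_between_frames):
    """Average speed between two consecutive tracking frames (yards per second)."""
    return math.dist(prev_xy, curr_xy) / seconds_between_frames

# Made-up coordinates for one frame.
print(receiver_separation((0.0, 0.0), [(3.0, 4.0), (6.0, 8.0)]))
print(speed((0.0, 0.0), (3.0, 4.0), 0.5))
```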

I was directed to pff.com and https://nextgenstats.nfl.com/ so far but I am having trouble coming up with entire data sets for exactly what I need. Anything helps so please let me know!

0 Comments
2024/12/15
21:35 UTC

8

Looking for a free tool to extract structured data from a website

Hi everyone,
I'm looking for a tool (preferably free) where I can input a website link, and it will return the structured data from the site. Any suggestions? Thanks in advance!

7 Comments
2024/12/15
12:52 UTC

1

Dataset for my research paper please help

Are there any datasets which contain both real images and images generated by models like Stability, Midjourney, and Runway? I also need noise data for both of them.

2 Comments
2024/12/14
22:00 UTC

3

Need to alert on companies that are hiring or firing. Any good APIs?

I need a way to alert like “Company X in your area has 5 new jobs posted”

Are there any free or inexpensive APIs that could help me with this?

2 Comments
2024/12/14
08:30 UTC

4

Institutional Data Initiative plans to release a dataset "5 times that of book3" in early 2025

https://institutionaldatainitiative.org/

https://www.wired.com/story/harvard-ai-training-dataset-openai-microsoft/

Harvard University announced Thursday it’s releasing a high-quality dataset of nearly one million public-domain books that could be used by anyone to train large language models and other AI tools. The dataset was created by Harvard’s newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI. It contains books scanned as part of the Google Books project that are no longer protected by copyright... with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries... In addition to the trove of books, the Institutional Data Initiative is also working with the Boston Public Library to scan millions of articles from different newspapers now in the public domain, and it says it’s open to forming similar collaborations down the line.

1 Comment
2024/12/14
08:08 UTC

1

Looking for additional US National Pollutants & Animal Movement Datasets

Looking to do some analyses on animal movement in relation to pollutants and anthropogenic landscape features. I have a few datasets/sites collected already, but wondering if I'm missing anything. In particular looking for higher resolution lead/cognition-impairing or mutagenic substances and rodenticide.

Datasets below in case they're of use to anyone --

Animal Movement:

Movebank: https://www.movebank.org/cms/movebank-main

Animal Telemetry Network: https://portal.atn.ioos.us/#map

Pollutants:

Enviroatlas: https://enviroatlas.epa.gov/enviroatlas/interactivemap/

Uranium mines: https://andthewest.stanford.edu/2020/uranium-mine-sites-in-the-united-states/

Oil Refineries: https://atlas.eia.gov/datasets/eia::petroleum-refineries-1/explore?location=33.922439%2C-118.375771%2C10.55

Superfund sites: https://www.epa.gov/superfund/search-superfund-sites-where-you-live

PFAS: https://www.ewg.org/interactive-maps/pfas_contamination/map/

Heavy Metals: https://www.sciencedirect.com/science/article/pii/S0048969724011112

ATTAINS water inventory: https://www.epa.gov/waterdata/get-data-access-public-attains-data

NATA/AQS air quality: https://aqs.epa.gov/aqsweb/documents/data_api.html#annual

Toxic release: https://www.epa.gov/toxics-release-inventory-tri-program

1 Comment
2024/12/13
19:30 UTC

2

What data streaming solutions do you use with your workflow?

Whether you're training an LLM or writing APIs that query millions of rows, batch streaming can be a helpful way to work through the data by splitting it into batches and processing them in parallel. What streaming solutions do you use for these purposes in your workflow?
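
For reference, a minimal standard-library sketch of the pattern being asked about: split an iterable into fixed-size batches and hand them to a worker pool. `process_batch` is a placeholder for real work, and the batch size and worker count are arbitrary assumptions:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def process_batch(batch):
    # Placeholder work: a real pipeline might tokenize rows or call an API here.
    return sum(batch)

def process_in_batches(rows, batch_size=1000, workers=4):
    """Process batches in parallel and return results in batch order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_batch, batched(rows, batch_size)))

print(process_in_batches(range(10), batch_size=4, workers=2))
```

Note that `Executor.map` submits every batch up front, so this sketch is only lazy on the reading side; inputs that don't fit in memory would need a bounded queue or a dedicated streaming library.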

0 Comments
2024/12/13
21:16 UTC

2

Can we automate data quality assessment process for small datasets?

Recently, my friend and I have been thinking of working on a side project (for our portfolios) to automate data quality assessment for small tabular datasets of the kind you often find on Kaggle.

We acknowledge that such a tool can't be 100% accurate but it can definitely help nontech people and tech people to get started with working on their datasets. We aim to have a platform where the user will upload a dataset, the system will identify anomalies and give suggestions to the user with different ways to fix that anomaly (e.g. imputation of missing value, fixing an email that doesn't follow the email pattern, etc).

I would love to discuss the project further and get your thoughts on it. We have been researching similar projects and we found Cocoon, which proceeds column by column; for each column, it has a series of anomalies to fix using an LLM. But we want to use statistical methods for numerical columns, and use an LLM only when it's needed. Can anyone help?
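
To illustrate what "statistical methods first, LLM only when needed" could look like, here is a minimal sketch with three LLM-free checks: missing values, IQR-based outliers for a numeric column, and a deliberately loose email pattern. The thresholds, the regex, and the function names are assumptions for illustration, not how Cocoon works:

```python
import re
import statistics

# Loose pattern: something@something.something, with no spaces or extra "@".
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def missing_indexes(values):
    """Positions of empty cells."""
    return [i for i, v in enumerate(values) if v is None or v == ""]

def iqr_outliers(values, k=1.5):
    """Positions of numeric values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

def invalid_emails(values):
    """Positions of strings that do not look like an email address."""
    return [i for i, v in enumerate(values) if not EMAIL_PATTERN.match(v)]

print(iqr_outliers([10, 12, 11, 13, 12, 11, 100]))
print(invalid_emails(["a@example.com", "not-an-email", "b@c"]))
```

Each check returns row positions, so the suggestions step (imputation, correction, or escalation to an LLM) can be layered on top without changing the detectors.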

3 Comments
2024/12/13
11:13 UTC

3

Help to create voice mail prioritising system

How can I find suitable datasets for this? (Focusing on medical reception voicemail assistance.)

0 Comments
2024/12/11
10:46 UTC

2

Don't understand date format in dataset

I need assistance with a dataset on sea level rise that I downloaded from CSIRO. In the "time" column, there is a record labeled "1880.9583." Could you please clarify what the portion after the decimal point, ".9583," represents in this context? Is it a decimal fraction of the year?

http://www.cmar.csiro.au/sealevel/GMSL_SG_2011_up.html
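
One plausible reading, offered as an assumption rather than something confirmed by the dataset's documentation: the value is a decimal year, and the fraction marks the middle of a month, i.e. (month - 0.5) / 12. Under that reading, .9583 ≈ 11.5/12, which would be mid-December 1880. A small sketch of the conversion:

```python
import math

def decimal_year_to_year_month(value):
    """Split a decimal year into (year, month), assuming 12 equal-length months."""
    year = math.floor(value)
    fraction = value - year
    # fraction * 12 lands in [0, 12); the integer part is the zero-based month.
    month = min(int(fraction * 12) + 1, 12)
    return year, month

print(decimal_year_to_year_month(1880.9583))
print(decimal_year_to_year_month(1880.0417))
```

If the file's other rows step by about 0.0833 (1/12), that would support the monthly interpretation.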

8 Comments
2024/12/11
08:44 UTC

1

Words that do not convey the subject of a sentence

Hi all! I'm building an application that automatically quizzes you on textual datasets! So far things are working brilliantly, but I'm running into an issue. I wish to remove words that are "uninteresting" for quizzing. My exact problem is that I don't know how to describe them, so I don't know what to look up. I'll show an example instead.

"The mitochondria is the powerhouse of the cell"

If I had a simple fill-in-the-blanks question, I want to avoid blanking "the", "is", and "of", as that would make for a very boring quiz question. I'm not a linguist, but from my rudimentary knowledge, I don't know of any linguistic term that applies to these words, as they aren't just, in the general case, prepositions, for example.

Best case, someone already knows a dataset of words that I can use, but I would really appreciate any help for even what to look up on this topic.
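
For searching purposes: words like "the", "is", and "of" are usually called stop words in NLP (linguists often say function words), and several NLP libraries ship ready-made lists. Below is a minimal sketch of the filtering step, using a tiny hand-written list that is only an illustrative assumption and far from complete:

```python
import re

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "in", "on", "to", "and", "or"}

def blank_candidates(sentence):
    """Words worth blanking: everything that is not a stop word."""
    words = re.findall(r"[A-Za-z']+", sentence)
    return [w for w in words if w.lower() not in STOP_WORDS]

def make_questions(sentence):
    """Build (question, answer) pairs, blanking one candidate word at a time."""
    questions = []
    for word in blank_candidates(sentence):
        blanked = re.sub(rf"\b{re.escape(word)}\b", "_____", sentence, count=1)
        questions.append((blanked, word))
    return questions

for question, answer in make_questions("The mitochondria is the powerhouse of the cell"):
    print(question, "->", answer)
```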

I hope this is appropriate to ask here, else, forgive me and I'll happily take recommendations for where else to ask!

Many thanks

5 Comments
2024/12/10
23:33 UTC

9

Billion social media posts datasets / sample - discussion

Hey fellow datasets enthusiasts!

We're excited to announce the release of a new, large-scale social media dataset from Exorde Labs. We've developed a robust public data collection engine that's been quietly amassing an impressive dataset via a distributed network.

The Origin Dataset

  • Scale: Over 1 billion data points, with 10 million added daily (3.5-4 billion per year at our current rate)
  • Sources: 6000+ diverse public social media platforms (X, Reddit, BlueSky, YouTube, Mastodon, Lemmy, TradingView, bitcointalk, jeuxvideo dot com, etc.)
  • Collection: Near real-time capture since August 2023, at a growing scale.
  • Rich Annotations: Includes original text, metadata (URL, Author Hash, date), emotions, sentiment, top keywords, and theme

Sample Dataset Now Available

We're releasing a 1-week sample from December 1-7th, 2024, containing 65,542,211 entries.

Access the Dataset: https://huggingface.co/datasets/Exorde/exorde-social-media-december-2024-week1

A larger dataset of ~1 month will be available next week, over the period: November 14th 2024 - December 13th 2024.

Key Features:

  • Multi-source and multi-language (122 languages)
  • High-resolution temporal data (exact posting timestamps)
  • Comprehensive metadata (sentiment, emotions, themes)
  • Privacy-conscious (author names hashed)

Use Cases: Ideal for trend analysis, cross-platform research, sentiment analysis, emotion detection, financial prediction, hate speech analysis, OSINT, and more.

This dataset includes many conversations from around Cyber Monday, the collapse of the Syrian regime, the UnitedHealth CEO killing, and many more topics. The potential seems large.

We hope you appreciate this Xmas Data gift.

Exorde Labs

2 Comments
2024/12/10
15:53 UTC

0

Can someone help with downloading a Statista report please?

Hi, I would be grateful if anyone could provide a report on oncology drugs. The link is below. Thanks in advance.

https://www.statista.com/outlook/hmo/pharmaceuticals/oncology-drugs/worldwide#revenue

1 Comment
2024/12/10
15:16 UTC
