/r/datasets
A place to share, find, and discuss Datasets.
Datasets for Data Mining, Analytics and Knowledge Discovery
Hey, I have to write a paper about applied data analysis, and for that I am searching for an interesting dataset. Interestingly, I can't think of any data myself; I tried random Google searches but haven't found any cool data so far. I think the one prerequisite my professor set (he wants to learn something new from the analysis) made me weirdly judge all datasets as 'unworthy', if you know what I mean.
Are there any cool datasets from which my professor, who has a background in data science, can learn something? (Optionally, it would be nice if they were fun to work with and not a literal pain to normalize, but yeah, just optionally xD)
I'm trying my best to find a company's financial data for my research: financial statements covering Profit and Loss, Cash Flow, and Balance Sheet. I already found one source, but it requires me to pay $100 first. I'm just curious if there's any website you can recommend so I don't have to spend that much (or can maybe get it for free). Thanks...
Hello Guys,
I intend to do a project on Alzheimer's detection based on eye movements. I read some papers on this but all of them used their own recorded data. Is there any publicly available dataset on this? I will be happy to know your suggestions on this project's implementation.
Hey all,
I'm working on a number of devices where I'd like to use machine learning and live video to identify specific parts of the human body.
This is a sex-positive project, and therefore rather than have a classifier that censors anything it thinks might be nudity, I'm looking for a dataset that will enable me to identify nipples, penises, vaginas, and other potentially erogenous zones on people of all genders, colours, and body shapes.
It feels to me that it should be possible, but I'm new to creating/training models and not sure where to start, so figure standing on the shoulders of others is probably a good place!
I have an idea for a personal project and I could use some help finding a dataset.
Project:
I would like to make a playlist generator where I can specify different moods at different points of time in the playlist. So something along the lines of 1h Chill, 1h Pop, 1h Dance. Obviously I would like much more refinement than I showed in the example. My thought was that I could find paths between different song types so that the genre transitions are smooth.
Maybe this already exists?
Dataset:
What I am looking for is a long dataset with the obvious main parameters (name, artist, year, etc.) but also things like popularity, danceability, singability, nostalgia factor, high vs. low energy, happiness, tempo, and more.
Does a dataset like this exist? I also thought it could be possible to use sentiment analysis on the lyrics to generate some of these parameters.
Let me know if you have any ideas
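One rough way to picture the "smooth transitions" idea: treat each song as a point in feature space and always move to the closest song that hasn't been played yet. This is only a greedy heuristic, not real path-finding, and the song names and the (energy, happiness) scores below are made up for illustration.

```python
import math

# Hypothetical songs with made-up (energy, happiness) scores in [0, 1].
SONGS = {
    "chill_a": (0.2, 0.5),
    "chill_b": (0.3, 0.6),
    "pop_a": (0.6, 0.8),
    "dance_a": (0.9, 0.7),
    "dance_b": (0.8, 0.9),
}

def smooth_order(songs, start):
    """Greedy ordering: always jump to the closest not-yet-played song."""
    order = [start]
    remaining = set(songs) - {start}
    while remaining:
        current = songs[order[-1]]
        # sorted() makes tie-breaking deterministic
        nxt = min(sorted(remaining), key=lambda name: math.dist(current, songs[name]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

playlist = smooth_order(SONGS, "chill_a")
print(playlist)
```

With real data the tuple would simply get more dimensions (tempo, danceability, and so on), ideally scaled to a common range first so that no single feature dominates the distance.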
Is there any detailed breakdown of US spending? Ideally I want something that goes very granular. I have no idea how money is managed by the US, which is why I’m asking.
Is there a dataset that contains both the birth and death dates of real people?
This may be a bit of a morbid topic, but I've been talking to my wife about people dying close to their birthdays, and since I tend to do silly projects as a way to keep my knowledge alive, I figured an analysis of this data might tell us something (preferably that there's no correlation lol).
However, all government databases I found only provide aggregated data, such as death and birth rates, unfortunately. I know this may involve some data security and privacy concerns, but I would really just need these two linked dates to do the analysis, no names or anything.
If anyone has access to a structure like this, or perhaps an API that can make this data available, I would be very grateful. I promise to bring this complete study to reddit as soon as I finish it.
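Once linked birth and death dates are available, the core measurement could look something like the sketch below: the signed number of days between the death date and the nearest birthday. The function name and the two example dates are invented for illustration, and treating a Feb 29 birthday as Mar 1 in non-leap years is just one possible convention.

```python
from datetime import date

def days_from_birthday(birth, death):
    """Signed days between death and the nearest birthday (negative = before it)."""
    def birthday_in(year):
        try:
            return birth.replace(year=year)
        except ValueError:  # Feb 29 birthday in a non-leap year
            return date(year, 3, 1)

    # The nearest birthday can fall in the previous, same, or next calendar year.
    candidates = [birthday_in(death.year + offset) for offset in (-1, 0, 1)]
    nearest = min(candidates, key=lambda b: abs((death - b).days))
    return (death - nearest).days

# Made-up example dates, not real people.
after = days_from_birthday(date(1950, 6, 10), date(2020, 6, 15))
before = days_from_birthday(date(1950, 1, 2), date(2020, 12, 30))
print(after, before)
```

A histogram of these values over many people would show whether deaths cluster around zero.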
I have to make a personalized learning pathways project for my AI/ML course. Please help me find a dataset.
I was looking for a dataset with team stats for every NBA team for each year, at least for the last decade. I couldn't find one, so I figure the best way is just to get the CSV for each year and then combine them. Does anyone know any other way to get it?
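For the combine-the-CSVs route, a minimal sketch using only the standard library might look like this. The column names (`team`, `wins`) and the numbers are placeholders, and the in-memory strings stand in for whatever files actually get downloaded.

```python
import csv
import io

def combine_seasons(files_by_year):
    """Merge per-season CSV files into one list of rows, tagging each row with its season."""
    combined = []
    for year, handle in sorted(files_by_year.items()):
        for row in csv.DictReader(handle):
            row["season"] = year  # remember which season the row came from
            combined.append(row)
    return combined

# Tiny in-memory stand-ins for the downloaded files (made-up numbers).
demo = {
    2023: io.StringIO("team,wins\nAAA,50\nBBB,32\n"),
    2022: io.StringIO("team,wins\nAAA,44\nBBB,41\n"),
}
rows = combine_seasons(demo)
print(len(rows), rows[0])
```

With real files, the dictionary values would be handles from `open(path, newline="")` instead of `io.StringIO` objects.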
Hey guys,
I just wanted to share our project called Potarix (https://potarix.com/). It’s an AI-powered web scraping/data extraction tool that can pull data from any website. You can use it at (https://app.potarix.com).
I wanted to give back to this community, so we’re giving everyone who signs up $5 in credits. Scraping each page uses $0.10 of your credits. You are not charged for unsuccessful scrapes! That should let you get data from 50 web pages.
So far, we’ve used this project (with some added features) to help clients:
Looking ahead, we built some stuff in-house that we’d love to include in the SAAS platform shortly. We’ve built functionality to click, type, scroll, etc. on the page. AI also tends to be wrong sometimes, so we created a tweakable script in the backend, to control the agent's actions. That way, you're in control and can bring the script to 100% accuracy. We’ve also seen people battling to build infrastructure for their large-scale scraping projects. We wanna autonomously let folk set up parallelization and choose the infra for their project so everything is scraped as quickly and succinctly as possible from the SAAS.
If any of these future features sound interesting, feel free to book some time, and we can discuss how we can help you with these now!
Hey, data enthusiasts and web scraping aficionados!
We’re thrilled to share a massive new social media dataset that just dropped on Hugging Face! 🚀
This is a goldmine for:
Whether you're a startup, data scientist, ML engineer, or just a curious dev, this dataset has something for everyone. It's perfect for both serious research and fun side projects. Do you have questions or cool ideas for using the data? Drop them below.
We’re processing over 300 million items monthly at Exorde Labs—and we’re excited to support open research with this Xmas gift 🎁. Let us know your ideas or questions below—let’s build something awesome together!
Happy data crunching!
Exorde Labs Team - A unique network of smart nodes collecting data like never before
Hi! I am writing my thesis and I need a dataset that contains data on data breaches: how they happened, their scale, and possibly the sensitivity of the leaked data. I don't know where to find it. The only place I know is Kaggle, and it does not seem professional. Any advice?
I am writing a book chapter on fraud detection using machine learning. I found that most of the current research is rather hard to apply for a person actually building models. Every paper likes to highlight the lack of good datasets, but no one provides a collection of good datasets that readers of the paper can use.
I think that if I include some good datasets for people to train their models on in my chapter, then that will be a very good contribution from my side.
Do you know any good datasets that are used for this, or where I can look for such datasets?
I am honestly clueless when it comes to collecting and finding good datasets for industry grade applications, and I will be really grateful for any help that I get🙏🙏
I'm currently trying to predict the 2025 Super Bowl winner for a college final presentation. I want to use Expected Hypothetical Completion Probability (EHCP) from Big Data Bowl 2019 to see which teams best optimize their playbook for EHCP, and whether that correlates with how often they win or complete passes, but I'm having trouble finding a data source.
The EHCP metric requires two main types of data:
1. Play-by-Play Data:
2. Player Tracking Data:
Key elements:
I was directed to pff.com and https://nextgenstats.nfl.com/ so far but I am having trouble coming up with entire data sets for exactly what I need. Anything helps so please let me know!
Hi everyone,
I'm looking for a tool (preferably free) where I can input a website link, and it will return the structured data from the site. Any suggestions? Thanks in advance!
Are there any datasets that contain both real images and images generated by models like Stability, Midjourney, and Runway? I also need noise data for both.
I need a way to send alerts like “Company X in your area has 5 new jobs posted”.
Are there any free or inexpensive APIs that could help me with this?
https://institutionaldatainitiative.org/
https://www.wired.com/story/harvard-ai-training-dataset-openai-microsoft/
Harvard University announced Thursday it’s releasing a high-quality dataset of nearly one million public-domain books that could be used by anyone to train large language models and other AI tools. The dataset was created by Harvard’s newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI. It contains books scanned as part of the Google Books project that are no longer protected by copyright... with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries... In addition to the trove of books, the Institutional Data Initiative is also working with the Boston Public Library to scan millions of articles from different newspapers now in the public domain, and it says it’s open to forming similar collaborations down the line.
Looking to do some analyses on animal movement in relation to pollutants and anthropogenic landscape features. I have a few datasets/sites collected already, but wondering if I'm missing anything. In particular looking for higher resolution lead/cognition-impairing or mutagenic substances and rodenticide.
Datasets below in case they're of use to anyone:
Animal Movement:
Movebank: https://www.movebank.org/cms/movebank-main
Animal Telemetry Network: https://portal.atn.ioos.us/#map
Pollutants:
Enviroatlas: https://enviroatlas.epa.gov/enviroatlas/interactivemap/
Uranium mines: https://andthewest.stanford.edu/2020/uranium-mine-sites-in-the-united-states/
Oil Refineries: https://atlas.eia.gov/datasets/eia::petroleum-refineries-1/explore?location=33.922439%2C-118.375771%2C10.55
Superfund sites: https://www.epa.gov/superfund/search-superfund-sites-where-you-live
PFAS: https://www.ewg.org/interactive-maps/pfas_contamination/map/
Heavy Metals: https://www.sciencedirect.com/science/article/pii/S0048969724011112
ATTAINS water inventory: https://www.epa.gov/waterdata/get-data-access-public-attains-data
NATA /AQS air quality: https://aqs.epa.gov/aqsweb/documents/data_api.html#annual
Toxic release: https://www.epa.gov/toxics-release-inventory-tri-program
Whether you're training an LLM or writing APIs that query millions of rows, batch streaming can be a helpful way to get through the data: split it into batches and process them in parallel. What streaming solutions do you use for these purposes in your workflow?
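As a point of reference for the discussion, the split-into-batches-and-process-in-parallel pattern can be sketched with just the standard library. The `process_batch` function is a placeholder for the real work, and the batch size and worker count are arbitrary assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def batches(iterable, size):
    """Yield lists of up to `size` items without loading the whole input first."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def process_batch(chunk):
    # Placeholder work: swap in a DB write, API call, tokenization, etc.
    return sum(chunk)

def stream_in_parallel(rows, size=1000, workers=4):
    """Process batches concurrently; results come back in batch order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_batch, batches(rows, size)))

totals = stream_in_parallel(range(10), size=4, workers=2)
print(totals)
```

One caveat: `pool.map` submits every batch up front, so for inputs that don't fit in memory you would want to cap how many batches are in flight. Threads suit I/O-bound work; CPU-bound work usually calls for processes instead.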
Recently, my friend and I have been thinking of working on a side project (for our portfolios) to automate data quality assessment for small tabular datasets that you often find in kaggle.
We acknowledge that such a tool can't be 100% accurate but it can definitely help nontech people and tech people to get started with working on their datasets. We aim to have a platform where the user will upload a dataset, the system will identify anomalies and give suggestions to the user with different ways to fix that anomaly (e.g. imputation of missing value, fixing an email that doesn't follow the email pattern, etc).
I would love to discuss the project further and get your thoughts on it. We have been researching similar projects and we found Cocoon, which proceeds column by column and, for each column, has a series of anomalies to fix using an LLM. But we want to use statistical methods for numerical columns and use an LLM only when it's needed. Can anyone help?
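To make the "statistics first, LLM only when needed" idea concrete, here is a small sketch of two cheap checks: Tukey's IQR fences for a numeric column and a deliberately loose regex for an email column. The function names, the sample values, and the choice of these two checks are illustrative assumptions, not how Cocoon or any existing tool works.

```python
import re
import statistics

# Deliberately loose pattern: something@something.something, no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def numeric_outliers(values, k=1.5):
    """Return indices of values outside the Tukey fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

def invalid_emails(values):
    """Return indices of strings that do not look like an email address."""
    return [i for i, v in enumerate(values) if not EMAIL_RE.match(v)]

# Made-up sample columns.
ages = [23, 25, 24, 26, 22, 27, 25, 240]
emails = ["a@example.com", "not-an-email", "b@example.org"]
print(numeric_outliers(ages))
print(invalid_emails(emails))
```

Rows flagged by checks like these could then be the only ones passed to an LLM for a suggested fix, which keeps cost and latency down.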
How can I find suitable datasets for this? (Focusing on medical reception voicemail assistance.)
I need assistance with a dataset on sea level rise that I downloaded from CSIRO. In the "time" column, there is a record labeled "1880.9583." Could you please clarify what the part after the decimal point, ".9583," represents in this context? Is it a decimal fraction?
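A likely reading, assuming the file uses decimal years with mid-month timestamps: the fraction is the share of the year that has elapsed, and 0.9583 is roughly 11.5/12, i.e. the middle of December 1880. Under that assumption the conversion can be sketched as:

```python
import math

def decimal_year_to_month(t):
    """Split a decimal year such as 1880.9583 into (year, month number 1-12)."""
    year = math.floor(t)
    fraction = t - year
    # Each month covers 1/12 of the year; 0.9583 * 12 = 11.4996, so the 12th month.
    month = int(fraction * 12) + 1
    return year, month

december = decimal_year_to_month(1880.9583)
january = decimal_year_to_month(1880.0417)
print(december, january)
```

If the first record in the file is 1880.0417 (about 0.5/12), that would support the mid-month interpretation.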
Hi all! I'm building an application that automatically quizzes you on textual datasets! So far things are working brilliantly, but I'm running into an issue. I wish to remove words that are "uninteresting" for quizzing. My exact problem is that I don't know how to describe them, so I don't know what to look up. I'll show an example instead.
"The mitochondria is the powerhouse of the cell"
If I had a simple fill-in-the-blanks question, I want to avoid blanking "the", "is", and "of", as that would make for a very boring quiz question. I'm not a linguist, but from my rudimentary knowledge, I don't know of any linguistic term that applies to these words, as they aren't just, in the general case, prepositions, for example.
Best case, someone already knows a dataset of words that I can use, but I would really appreciate any help for even what to look up on this topic.
I hope this is appropriate to ask here, else, forgive me and I'll happily take recommendations for where else to ask!
Many thanks
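The usual search terms for these are "stop words" (in NLP) or "function words" (in linguistics). A minimal sketch of the filtering step, using a tiny hand-picked list purely for illustration (the lists shipped with NLP toolkits are much longer):

```python
import re

# Small hand-picked sample for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "in", "on", "to", "and", "or"}

def blank_candidates(sentence):
    """Return the words worth blanking, i.e. everything that is not a stop word."""
    words = re.findall(r"[A-Za-z']+", sentence)
    return [w for w in words if w.lower() not in STOP_WORDS]

candidates = blank_candidates("The mitochondria is the powerhouse of the cell")
print(candidates)
```

Swapping `STOP_WORDS` for a fuller published stop-word list should be a drop-in change.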
Hey fellow datasets enthusiasts!
We're excited to announce the release of a new, large-scale social media dataset from Exorde Labs. We've developed a robust public data collection engine that's been quietly amassing an impressive dataset via a distributed network.
The Origin Dataset
Sample Dataset Now Available
We're releasing a 1-week sample from December 1-7th, 2024, containing 65,542,211 entries.
Access the Dataset: https://huggingface.co/datasets/Exorde/exorde-social-media-december-2024-week1
A larger dataset of ~1 month will be available next week, over the period: November 14th 2024 - December 13th 2024.
Key Features:
Use Cases: Ideal for trend analysis, cross-platform research, sentiment analysis, emotion detection, financial prediction, hate speech analysis, OSINT, and more.
This dataset includes many conversations around the period of CyberMonday, Syria regime collapse and UnitedHealth CEO killing & many more topics. The potential seems large.
We hope you appreciate this Xmas Data gift.
Exorde Labs