/r/datasets

A place to share, find, and discuss Datasets.

Datasets for Data Mining, Analytics and Knowledge Discovery

Rules

  • Try to post original source whenever you can.
  • Low effort posts will be removed.
  • Self-promotion (of a website/domain you work for or own) without disclosure will be removed.
  • Any Paid Dataset or Resource must be marked as such in the title with [PAID].
  • Any Synthetic/Mock data must be marked as such in the title with [Synthetic].
  • All Survey posts are subject to approval. Message the mods before posting.

Unsure about your post?

Feel free to message the mods and discuss it before posting.

/r/datasets

198,197 Subscribers

1

Built a one-click tool which analyses any CSV file and generates a PowerPoint

Hi all, I've created a data science tool that I hope will be very helpful and interesting to a lot of you!

https://www.csv-ai.com/

It's a one-click tool to generate a PowerPoint/PDF presentation from a CSV file with no prompts or any other input required. Some AI is used alongside manually written logic and functions to create a presentation showing visualisations and insights with machine learning.

It can carry out data transformations, like converting from long to wide, resampling the data and dealing with missing values. The logic is fairly basic for now, but I plan on improving this over time.
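For readers unfamiliar with the transformations mentioned above, here is a minimal standard-library sketch of a long-to-wide pivot followed by a forward fill of missing values. It is not the tool's actual logic; the rows, column names, and helper functions are hypothetical and only illustrate the idea.

```python
from collections import defaultdict

# Hypothetical long-format rows: (date, metric, value); None marks a missing value.
long_rows = [
    ("2024-01-01", "sales", 10.0),
    ("2024-01-01", "visits", 100.0),
    ("2024-01-02", "sales", None),
    ("2024-01-02", "visits", 120.0),
    ("2024-01-03", "sales", 14.0),
]

def long_to_wide(rows):
    """Pivot (key, column, value) rows into {key: {column: value}}."""
    columns = sorted({col for _, col, _ in rows})
    wide = defaultdict(lambda: dict.fromkeys(columns))
    for key, col, value in rows:
        wide[key][col] = value
    # Sort by key so later steps see the dates in chronological order.
    return dict(sorted(wide.items())), columns

def forward_fill(wide, columns):
    """Replace each None with the most recent earlier value in the same column."""
    last_seen = {}
    for key in wide:
        for col in columns:
            if wide[key][col] is None:
                wide[key][col] = last_seen.get(col)
            else:
                last_seen[col] = wide[key][col]
    return wide

wide, columns = long_to_wide(long_rows)
filled = forward_fill(wide, columns)
print(filled)
```

Forward fill is only one way to handle gaps; dropping rows or interpolating may suit other data better.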

My main target users are data users who want to quickly have a look at some data and get a feel for what it contains (a super version of pandas profiling), and quickly create some slides to present. It is also aimed at non-technical users with datasets who want to better understand them and don't have access to a data scientist.

The tool is still under development, so it may have some bugs, and there are lots of features I want to add. But I wanted to get some initial thoughts/feedback. Is it something you would use? What features would you like to see added? Would it be useful for others in your company?

It's free to use for files under 5MB (larger files will be truncated), so please give it a spin and let me know how it goes!

0 Comments
2024/11/22
12:37 UTC

2

Does anyone know where to find an image dataset for vegetables?

All the datasets I find are mainly fruit with vegetables on the side, or they take like 6 types of vegetables and have fewer than 100 images for training.

3 Comments
2024/11/22
07:46 UTC

2

API access to the National Blend of Models - weather forecasts history [self-promotion]

Disclosure first. https://gribstream.com/ is my indie hacking side project.

It has a free tier with a generous daily limit.

The original data is the NOAA National Blend of Models (NBM) https://vlab.noaa.gov/web/mdl/nbm and it is totally free. But if you've worked with grib2 datasets, you know how cumbersome it can be for some use cases, and that is what this API is for.

The API lets you query this dataset to extract timeseries for thousands of coordinates, for months at a time, for many weather parameters in a single HTTP request taking a few seconds, without having to download tens of terabytes of grib2 files.

It supports as-of/time-travel, which is priceless for proper backtesting when using the dataset as features in other prediction models.
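To illustrate why as-of queries matter for backtesting, here is a small generic sketch. It does not use or describe the GribStream API; the record layout and the `forecast_as_of` helper are assumptions made up for this example. The idea is that a backtest should only see the forecast that had actually been issued by the simulated point in time.

```python
from datetime import datetime

# Hypothetical forecast records: (issued_at, valid_at, temperature_c).
forecasts = [
    (datetime(2024, 11, 20, 0), datetime(2024, 11, 22, 12), 8.0),
    (datetime(2024, 11, 21, 0), datetime(2024, 11, 22, 12), 9.5),
    (datetime(2024, 11, 22, 0), datetime(2024, 11, 22, 12), 11.0),
]

def forecast_as_of(records, valid_at, as_of):
    """Return the latest forecast for valid_at that was issued at or before as_of."""
    visible = [r for r in records if r[1] == valid_at and r[0] <= as_of]
    if not visible:
        return None  # nothing had been issued yet at that point in time
    return max(visible, key=lambda r: r[0])[2]

target = datetime(2024, 11, 22, 12)
value_known_on_21st = forecast_as_of(forecasts, target, datetime(2024, 11, 21, 6))
value_latest = forecast_as_of(forecasts, target, datetime(2024, 11, 23, 0))
print(value_known_on_21st, value_latest)
```

Using `value_latest` as a feature for a decision simulated on the 21st would leak information from the future, which is the mistake an as-of lookup avoids.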

I'd really appreciate any feedback :)

Thank you!

1 Comment
2024/11/22
02:48 UTC

1

Need help finding a dataset of scientific or academic papers for summarization

So, I am looking for a dataset which has human-generated summaries of scientific papers and the original PDFs of those papers.

0 Comments
2024/11/21
04:42 UTC

1

Need some help getting data for my school project

Hi guys,

I'm working on my end-of-bootcamp project, and I'm still missing some data! I'm looking for some tips or sources to get everything I need. I have a full dataset of Nasdaq stock data since 1980, identified by their tickers. I now need to add the company name + some basic data to classify each one (sector, some tags about what they do, and business size). I'd like to give each one an "ESGish" score.

Seems like such data isn't free!

If anyone around here has any ideas that could help, I'd be really thankful =)

2 Comments
2024/11/20
19:44 UTC

2

Help me find an Allergy Dataset for a project

Hi, I need an allergy dataset which lists food items and the allergy associated with each. It needs to cover all allergies.

If someone could help me find it, thank you!

4 Comments
2024/11/20
19:07 UTC

1

Number and details data, including addresses and other details

If anyone needs number and details data, I have some. Feel free to message me for it.

1 Comment
2024/11/20
08:22 UTC

1

Looking for up-to-date PGA Tour datasets

Does anyone know where I might be able to find up-to-date PGA Tour data? Or are there any APIs available for this?
Most free datasets I've found online don't provide enough data for the project I'm working on, and/or the data is out of date.
Anyone have any recommendations?
Websites like https://datagolf.com/ or https://rickrungood.com/ cost too much, in my opinion, for the APIs. I just want a one-off dataset.
If anyone has datasets they are willing to share, it would be a great help, or if anyone has done a web scraping project for the PGA Tour, I would love to check it out.

1 Comment
2024/11/19
18:28 UTC

1

Need data for my statistics class final

Hey everyone, for my statistics class I am required to gather some data in order to test my hypothesis. I need 100+ participants and heard that this was a good place to get that done. The link below is a simple one-question survey on who you think would win in a fight, Garfield or Snoopy. Please and thank you.

Link: https://docs.google.com/forms/d/e/1FAIpQLSfTtUr7W14934Uz2JjZTRrWTQtLLofiMeiZWcAqAYDFuF6Haw/viewform?usp=sf_link

0 Comments
2024/11/20
02:12 UTC

3

Where to find water datasets for Peru?

I'm doing a project on ArcGIS Pro about water management in Peru, but I'm struggling to find available data about water and land use in Peru. Does anyone know where I can find data for my project?

Here is a summary of my project:

Lime production is a critical industry in Peru, supporting sectors such as mining, agriculture, and construction. However, lime processing is water-intensive, often located near scarce water resources, potentially impacting local ecosystems and communities. Sustainable management of water resources is essential to balance industrial needs with environmental conservation and community access to water. This project will use GIS analysis to assess the environmental and community impact of water consumption by lime production facilities in Peru.

I will be addressing the following questions: What is the spatial relationship between lime production facilities and local water sources? How does water usage by these facilities affect nearby communities and ecosystems? Which areas are most at risk of water scarcity as a result of high industrial water demand from lime production? By addressing these questions, my project seeks to identify high-risk areas, assess the environmental impact, and offer insights into sustainable water management practices for this critical industry.

2 Comments
2024/11/19
23:08 UTC

2

Seeking dataset on earnings by age (or years experience) and occupation (or occupational cluster)

I'd like to look at how earnings correlate with age (or years of experience), ideally within each occupation, but even within general industries would suffice.

1 Comment
2024/11/19
22:43 UTC

1

Need technical data for multiple ransomware attacks

Hi guys, I am looking to train a machine learning model on the following data fields. Any leads on datasets that might contain these values would be appreciated:

  • Filter_size (bytes): The size of the encrypted file in bytes;
  • File Entropy: The degree to which the encrypted file’s contents are unpredictable or random;
  • Network Traffic (KB): The total quantity of data transferred over the network during the ransomware attack;
  • Number_of_Encrypted_Extensions: How many different types of files the ransomware can encrypt;
  • Time_to_Encrypt (seconds): The number of seconds needed for the ransomware to encrypt the data;
  • Cloud Provider: The name of the cloud storage provider where the secret information is stored;
  • Number_of_Shared_Folders: The total number of infected shared folders;
  • Encryption Strength: How secure the ransomware’s encryption algorithm is;
  • CPU Usage (%): Ransomware CPU use as a percentage;
  • Suspicious_Activity: An attack-related suspiciousness indicator expressed as a binary variable;
  • Ransomware_Type (Output): The ransomware strain (the dependent variable) that was used in the attack.
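The "File Entropy" field in the list above is commonly computed as Shannon entropy over byte frequencies. The sketch below assumes that definition (the post does not say which measure is intended), and the sample inputs are made up. Well-encrypted data tends to score close to 8 bits per byte, while repetitive data scores near 0.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, from 0.0 (constant) to 8.0 (uniform)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical inputs: every byte value once, versus one byte repeated.
uniform = bytes(range(256))
constant = b"A" * 1024

print(shannon_entropy(uniform), shannon_entropy(constant))
```

To compute it for a real file, read the file in binary mode and pass the bytes to `shannon_entropy`.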
1 Comment
2024/11/19
16:08 UTC

1

Looking for multi-class classification datasets in Finance

Most of the datasets I have come across are binary, so I am here looking for some suggestions :) My scope is only tabular data.

2 Comments
2024/11/18
22:30 UTC

2

Seeking US Presidential Election Time-Series Data (any election)

Hello! I am seeking time-series data for any previous US presidential election (or really, any nationwide election). I am looking to use this data to experiment with election visualizations that display the state of the US's voting as the night progresses (like those found on Google or in any major newspaper on election night). If anyone knows how I may find such data, or reconstruct it myself, I would appreciate it greatly.

I specifically am looking for time-series data, not final vote counts alone, as I'm interested in creating a live-updating visualization for the votes as they come in. I thought about just gradually interpolating towards the final vote counts to simulate the votes over time, but this wouldn't communicate the flip-floppy nature that makes watching an updating visualization exciting/stressful. If you linearly interpolate, whoever wins that state will always be ahead in that state, which is typically not the case. The rate at which counties return voting data, the populations and political leanings of those counties, and time zones all vary greatly nationwide.
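The point about linear interpolation can be checked with a toy example. The counties, counts, and reporting order below are entirely made up. Scaling both candidates' final totals by the same fraction never changes who is ahead, whereas letting whole counties report at different times can flip the lead.

```python
# Hypothetical final counts for one state, split across two counties.
counties = [
    {"name": "Rural", "a": 30_000, "b": 60_000, "reports_at": 1},
    {"name": "Urban", "a": 120_000, "b": 70_000, "reports_at": 2},
]
final_a = sum(c["a"] for c in counties)
final_b = sum(c["b"] for c in counties)

def linear_leader(fraction):
    """Leader when both totals are scaled by the same fraction of the final count."""
    return "A" if fraction * final_a > fraction * final_b else "B"

def county_leader(step):
    """Leader when whole counties report at their own times."""
    a = sum(c["a"] for c in counties if c["reports_at"] <= step)
    b = sum(c["b"] for c in counties if c["reports_at"] <= step)
    return "A" if a > b else "B"

linear_leaders = [linear_leader(f) for f in (0.1, 0.5, 1.0)]
county_leaders = [county_leader(s) for s in (1, 2)]
print(linear_leaders, county_leaders)
```

A simulation built from final county-level results plus an assumed reporting order would therefore look more realistic than interpolating statewide totals, though it would still be a reconstruction rather than real time-series data.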

I know this is a long shot - seems like election data is surprisingly hard to come by in the first place - but I appreciate any leads or suggestions!

3 Comments
2024/11/19
04:19 UTC

1

Looking for a master list of "Live on KEXP" performances on YouTube

Has anyone compiled a list of every KEXP "Live on KEXP" performance on YouTube? I'm looking for a master list.

1 Comment
2024/11/18
23:49 UTC

2

Pitchbook access/reports for certain companies needed for Masters

My sister is doing her Master's degree and her uni can't provide her with access to Pitchbook. I was wondering if somebody here could help her out with access for a few minutes or with screenshots of entries.

Any help is much appreciated.

0 Comments
2024/11/18
21:01 UTC

3

Looking for soft or carbonated beverage importer data

I am looking for data on beverage importers. Can anyone help me?

1 Comment
2024/11/18
20:27 UTC

3

Looking for datasets of tweets, Reddit, Discord, or email from December 2014 or before

I’m looking for English text-only datasets from December 2014 or earlier. Specifically, I’m interested in datasets that cover a broad range of topics, and it would be useful if they are free of spam or low-quality content. I'd like them to be from Twitter, Reddit, Discord, or emails.

If anyone knows where I can find those kinds of datasets or has access to them, please let me know. Your help is greatly appreciated!

Thanks in advance!

(I'm making an LLM for my game's dialogue system and the game is set in 2014)

5 Comments
2024/11/18
17:57 UTC

1

Datasets of close-up images of trucks and cars?

Hi guys, I've been trying to train a neural network that recognizes trucks, cars and a specific type of golf cart, of which I already have many pictures. My question is: are there any datasets out there with specifically that type of image, but taken up close? Most of what I've found also includes images from afar, and I only need the network to recognize vehicles from up close, at most 5 to 10 meters from the vehicle.

Also, regarding my "golf cart" dataset: that set sometimes has cars or trucks in the background. Should I label those as well, even though I only want the network to learn about the specific type of vehicle?

Thanks in advance!

1 Comment
2024/11/18
16:41 UTC

3

I’m looking for data (preferably Excel, but any format) on DUIs: per month, per year, by state

Please help!

1 Comment
2024/11/18
12:56 UTC

1

Here is my 2.5 million MIDI file dataset [self-promotion]

I spent about a month collecting and scraping MIDI files: https://huggingface.co/datasets/breadlicker45/toast-midi-dataset

1 Comment
2024/11/17
13:24 UTC

0

[WILLING TO PAY] Need dataset of resumes with applicant gender data

Does anyone happen to know of a specific dataset containing resume information and gender? I'm doing a study on the language men and women use in describing their work and need a dataset containing both. Can be in any format.

4 Comments
2024/11/18
03:11 UTC

2

Seeking Recommendations for Low-Cost Mobility Data Providers for People Density Analysis in Stores and City Areas

Hi everyone,

I'm working on a project to understand people density, both within stores and across different areas of the city, to analyze foot traffic patterns. I know that location data providers like SafeGraph, Cuebiq, and Factori offer these types of mobility datasets, but I’m concerned about the potential cost, which I’ve heard can be quite high.

I’m hoping to find some alternative providers or potentially lower-cost options that could still give me the insights I need without breaking the bank. My ideal dataset would allow me to:

  • See density and movement patterns around specific POIs (like retail stores or malls)
  • Understand general population density fluctuations across city areas

If you have experience working with affordable mobility data providers (like Veraset, Quadrant, etc.), I’d love to hear about your recommendations, especially if you’ve found options that provide flexibility in pricing or smaller, more budget-friendly packages. In general, are there no options available for small pet projects?

Thanks in advance for any tips!

0 Comments
2024/11/17
23:19 UTC

1

Hi, I need a relational dataset (with 5-10 tables) for my database lecture project!!

I searched a lot, but I found very few datasets that meet my requirements :( It needs to have primary and foreign keys and meaningful data.

8 Comments
2024/11/17
18:58 UTC

1

Help with ML Project for Damage Detection

Hey guys,

I am currently working on a project that detects damage/dents on rented construction machinery (excavators, cement mixers, etc.). A machine learning model is used after the machine is returned to the rental company to detect damage and 'penalise the renters' accordingly. We expect to have images of the machines from before the rental, so there is a benchmark to compare against.

What would you all suggest for this? Which models should I train/fine-tune? What data should I collect? Any other suggestions?

If you all have any follow-up questions, please go ahead and ask.

10 Comments
2024/11/17
12:42 UTC

1

Looking for a dataset to train a model for my graduation project

My graduation project is to train a security model for code vulnerabilities.
Does anyone know where I can find data like that? I can't find it on Kaggle or Hugging Face.

1 Comment
2024/11/17
05:15 UTC

1

Request for a dataset for Rasch analysis

Hello, Reddit community!

I am currently working on a project involving the analysis of student performance using the Rasch model. I’m looking for a dataset that includes individual student responses to exam questions, specifically with data indicating whether each response was correct or incorrect.
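For context on the data shape being requested, here is a minimal sketch of the Rasch (one-parameter logistic) model and the kind of binary response matrix it is fitted to. The matrix below is made up for illustration, and the code only evaluates the model's probability function; it does not estimate abilities or difficulties.

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical response matrix: rows are students, columns are items, 1 = correct.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
]

# Share of students answering each item correctly, a rough first look at difficulty.
item_proportion_correct = [sum(col) / len(responses) for col in zip(*responses)]
print(rasch_probability(0.0, 0.0), item_proportion_correct)
```

Any dataset with one row per student and one 0/1 column per exam question can be arranged into this form.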

If anyone knows of any publicly available datasets that fit this description, or if you have recommendations on where I might find such data, I would greatly appreciate your help!

Thank you in advance for your assistance!

1 Comment
2024/11/16
10:43 UTC

1

Interesting or ‘niche’ Film Datasets?

Just out of interest, does anyone have any interesting or niche film datasets? (I’m not talking about the standard IMDb top 250 films, etc.)

Thanks

0 Comments
2024/11/16
20:28 UTC

1

Vertebrae dataset for Cobb angle measurement

Hello guys, is there any dataset of vertebrae with keypoints and bounding boxes available online?

0 Comments
2024/11/16
14:46 UTC
