/r/datasets
A place to share, find, and discuss Datasets.
Datasets for Data Mining, Analytics and Knowledge Discovery
[PAID]
[Synthetic]
Unsure about your post?
Feel free to message the mods and discuss it before posting.
I wanted to do a quick analysis of a subreddit. Can someone please teach me how to use this? https://github.com/pushshift/api
I was using the GDC cancer portal, but they don't have annotations. Is there any other resource for this? Please help me out.
Greetings! I am currently conducting research on the US. To start the analysis I need BEA data going back to the 1990s (specifically 1997, when NAICS was introduced). I am pretty new to the BEA website, so I may be lost. The data I need is county-level. When I go to the archive for GDP by county and metro area, the available data only goes back to 2017. Maybe I am doing something wrong? Where can I find older data for counties and metro areas? I may also need county-level data from other categories on the website. Maybe there is a website like NHGIS but for BEA data?
Hi!
I struggled a lot to find the inflation data for France from an official source. I either found articles from INSEE (National Institute for Statistics and Economic Studies) on the inflation for each month which had a link for that data, and even that was only a subset of all the data for that month. Or I found auxiliary websites that didn't cite the source for their data.
I also looked for official APIs but didn't find something that directly provided the consumption index (inflation index) or a preprocessing of it (year-over-year variation for example). But I stumbled randomly on this https://www.insee.fr/fr/statistiques/series/102342213 (it's an official source, it's the INSEE) for which the title might be confusing. The title suggests that the data there is grouped by products and detailed products (a special nomenclature named COICOP).
I preprocessed it here https://github.com/ReinforcedKnowledge/france-inflation-data-cleaned (includes raw data, preprocessing scripts and preprocessed data). The README is in French but it explains the data a bit and explains how I got granular datasets from that big raw data. I found it a bit messy and confusing at the beginning when I started looking at it, but I was able to extract every unique combination of the modalities (region/department, index type, index variation, if product is under the COICOP nomenclature, household type).
I hope it can help anyone who is looking for that data or trying to understand it, because it really took me some time and effort to find it and make sense of it.
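The split into granular datasets described above can be sketched roughly as follows. This is only an illustration, not the preprocessing script from the repository: the column names (`region`, `index_type`, `household_type`, `period`, `value`) are hypothetical stand-ins for the real INSEE headers, and the sample values are made-up placeholders.

```python
import csv
import io
from collections import defaultdict

# Made-up sample standing in for the raw INSEE export. The real file uses
# different headers and more modality columns; the values are placeholders.
RAW_CSV = """region,index_type,household_type,period,value
France,all_items,all,2000-01,100.0
France,all_items,all,2000-02,101.0
Paris,all_items,all,2000-01,100.0
France,food,all,2000-01,100.0
"""

# Columns whose unique combinations each define one granular dataset (assumed).
MODALITY_COLUMNS = ("region", "index_type", "household_type")


def split_by_modalities(rows, keys=MODALITY_COLUMNS):
    """Group rows so each unique combination of `keys` gets its own series."""
    groups = defaultdict(list)
    for row in rows:
        combo = tuple(row[k] for k in keys)
        groups[combo].append(
            {"period": row["period"], "value": float(row["value"])}
        )
    return dict(groups)


rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
datasets = split_by_modalities(rows)
for combo, series in sorted(datasets.items()):
    print(combo, len(series))
```

Each key of `datasets` is one combination of modalities, so every series could then be written to its own file.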
I want to make a local dataset, but I don't know how or where to start. I need help.
Hello everyone, I am currently in a class that requires me to use a classification dataset and a regression dataset that are not from the UCI ML repository. I want to do my project on something in the social sciences (I have a poli sci background), but I’ve been struggling to find datasets that align with what I’m looking for. Does anyone have good recs for places to look for the kind of datasets I want?
I'm looking for a dataset/database of good quality (NO AI) food recipes with PICTURES that go alongside the instruction steps, for commercial use. I would like to use it in an app I'm creating.
I don't mind paying for it, preferably a one-time payment rather than a subscription.
I would have to translate the instructions anyway, so what I'm really worried about are the pictures because of the copyright issues.
And NO APIs, I want to store the database locally.
Thank you
Hello everyone, I need a spam messages dataset to train an LLM-based spam message detection bot for Telegram. Any help is appreciated. (Data from Discord would also be enough.)
Feel free to request datasets on the platform, and take a look to see if there are any datasets you could source or produce.
These are non-free datasets that will pay generously for your work.
With community help, we can connect data suppliers with data consumers.
I'm trying to figure out how to automate the production of a monthly data report, with nice clean visuals and written summaries, based on the Excel spreadsheets that are provided. I'm not sure if ChatGPT is best for this, or another AI tool, or some combination of Python code and something else. Any advice would be appreciated!
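As a rough idea of what the plain-Python route could look like, here is a minimal sketch of the written-summary part. It assumes the spreadsheet has been exported to CSV; the column names `category` and `amount` and the numbers are hypothetical placeholders, not taken from any real report.

```python
import csv
import io

# Stand-in for one month's spreadsheet exported to CSV. The column names
# and the numbers are hypothetical placeholders.
SHEET_CSV = """category,amount
A,10
B,30
A,5
"""


def summarize(csv_text):
    """Return a short written summary of the per-category totals."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["category"]
        totals[name] = totals.get(name, 0.0) + float(row["amount"])
    top = max(totals, key=totals.get)  # category with the largest total
    grand_total = sum(totals.values())
    share = 100 * totals[top] / grand_total
    return (
        f"Total: {grand_total:.2f} across {len(totals)} categories. "
        f"Largest category: {top} ({totals[top]:.2f}, {share:.0f}% of the total)."
    )


print(summarize(SHEET_CSV))
```

The same idea would extend to reading the Excel files directly (for example with pandas `read_excel`) and drawing charts (for example with matplotlib), with the script rerun each month on the new spreadsheet.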
Hi everyone!
I’m part of a team working on a capstone project focused on crime scene reconstruction and analysis using machine learning and 3D simulations (Blender/Unity).
What We're Doing:
3D Crime Scene Reconstruction: Creating an interactive model that lets investigators explore and "rewind" scenes to see potential sequences of events (e.g., weapon use, bullet trajectories).
Simulated Evidence Analysis: Replaying crime scenes based on data to visualize how evidence like blood spatter patterns or object placements might have occurred.
We’re specifically looking for datasets that contain information related to crime scenes, including data on:
Crime types (especially homicide)
Evidence details (e.g., weapon type, trajectory info, blood spatter)
If anyone has worked on a similar project before or knows where we can find reliable and detailed crime scene datasets, we’d greatly appreciate any guidance! We’re especially curious if there’s any open-source or academic dataset available, or if there are any other resources that might be useful for this type of project.
Any other help with any aspect of this project would also be much appreciated.
Thanks in advance for any help, suggestions, or shared experiences!
I've developed a program that uses multiple language models that talk to each other to create databases from scientific papers. I'm looking to use it to build custom datasets for medicinal neural networks. I'm considering deploying it as a website to see if it could be useful for others, but I'm looking for input on how to make it more robust and accessible for broader use.
For those with experience in dataset creation, AI applications in medicine, or similar fields, what features or improvements would make this tool more valuable or realistic for researchers and practitioners? Any insights would be greatly appreciated!
Any leads on a Costa Coffee dataset? I'm a BBA undergrad and need it for a project. Can someone please help me find datasets?
Hello everyone. I'm an undergrad student currently conducting a thesis related to VC-funded firms. I found that Pitchbook may have lots of information (financials) that I need for my paper, but it's really pricey. Wanting to see if there is anyone in the community who can share access with me or pull the data for free 😅 This would really help me kickstart my research. Help this broke student graduate
I need a source of information for a data science project (academic research). Specifically, I need to retrieve a historical record of news about a certain topic, so I am thinking of using a news API instead of web scraping, because these APIs seem to return the kind of data I am looking for.
I've come across some of them, such as newsdata.io, newsapi.org and newsapi.ai, but I am wondering whether their use is legal and reliable. I mean, are they legal themselves? And if so, am I inherently allowed to use them for my personal (academic) purposes?
The Terms & Conditions say this:
"We don't have the right to authorise any user to use the data for their personal and professional purposes. However, the users can use the data for their personal or professional purposes"
I mean, should I have any concerns about this? It's not like Twitter's or Reddit's API, where the data belongs to them and they deliberately give it to you. (In fact, I’m asking this because I planned to extract data from those platforms, but I’ve just realized it’s not possible at all, so I am wondering if there’s another alternative I can use to meet my requirement.)
Well... in essence, my questions are: Are these platforms/tools (APIs) legitimate and meant for data science? Or, in other words: is it common practice to use these kinds of "news APIs" for data science?
I didn't even know about them. Have you ever tried them before? Should I do web scraping instead, or is there another alternative you would advise me to use?
I'd appreciate your help.
Hi everyone!
Does anyone know where to get it? I need to link regions belonging to certain former entities within the HRE to current geographical locations within Germany (at the municipality level).
I hope someone can help!
I tried extracting images from this dataset but couldn't. It is in DICOM format and I guess in a URL, which I haven't worked with before. Can anyone explain how to access these images?
Looks impossible? All the shit I find on Kaggle either has no good columns, or has many but they're just var_1, var_2, var_3. Then I search UCI and all the datasets are the most specific things on the planet, like consumption of energy on a dog's poop. I am losing my mind.
Hello, I'm making an ML algorithm that uses a city's infrastructure as features to predict its population.
With the OSM library I was able to easily extract the infrastructure data, but I am not able to find a dataset with enough European cities. So far, all the datasets I've encountered only contain data from 50-80 European cities, and the rest are Asian cities.
I've tried to use population density and city area to create the population dataset myself, but the numbers I got were terribly wrong.
If someone has any idea of how to get this data I would love the help.
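For reference, the density-times-area estimate mentioned above can be sketched as below. The function name and the numbers are illustrative assumptions, not data from any source. Estimates like this can go badly wrong when the units are mixed (for example density per km² with area in hectares or m²) or when the density and the area refer to different boundaries (urban core versus whole administrative unit); whether that is what happened here is only a guess.

```python
def estimate_population(density_per_km2, area, area_unit="km2"):
    """Estimate population as density * area, converting the area to km^2 first."""
    # Conversion factors from the given unit to square kilometres.
    to_km2 = {"km2": 1.0, "ha": 0.01, "m2": 1e-6}
    if area_unit not in to_km2:
        raise ValueError(f"unsupported area unit: {area_unit}")
    return density_per_km2 * area * to_km2[area_unit]


# Illustrative numbers: 2,000 people per km^2 over 50 km^2.
print(estimate_population(2000, 50))  # 100000.0

# The same area expressed in hectares gives the same estimate once converted;
# passing 5000 as if it were km^2 would inflate the result a hundredfold.
print(estimate_population(2000, 5000, "ha"))
```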
I am trying to build a CNN model for classification, and I need a dataset of X-ray (even MRI is fine) images of patients with RA, preferably images of hands. At least 100 images.
I'm planning on making an application scorecard for home loans as my bachelor thesis for University.
One of my main concerns (and my academic supervisor's) is having a reliable dataset, or rather a dataset from a reliable source. One of the big questions I'm likely to be challenged on in such a thesis is the dataset's reliability, so it can't be from somewhere like Kaggle, but somewhere like Experian/Equifax, for example, would be okay. I work at a bank and deal with such models, but unfortunately I can't use any company data (even if it gets anonymized). So far I've seen some promising stuff on FFIEC's website, but I would like some additional sources so I can make a more educated decision.
Roughly I would need the data to contain these fields:
Age
Job
Income
Education
Marriage Status
Information about previous defaults (something like a Y/N if the applicant has defaulted on a loan in the last 5 years, for example)
Type of property that would be purchased with the loan
Some other fields that I could potentially exclude in further analysis
Pretty much as title.
Hi All, I am planning to host a Datathon as a competition for college students. The datasets I could find were too small. Please share direct links, websites, or any other way to get some. Thanks.
Hello everyone!
I’m a researcher-in-training working on exercise physiology, and I’m currently looking for datasets on VO2max or incremental exercise tests that include VO2 and, ideally, blood lactate measures. My goal is to practice determining ventilatory and lactate thresholds to refine my analytical skills in these areas.
If you have access to any anonymized data or know of open-source datasets, I’d be very grateful for any pointers! I’ve checked platforms like OSF and PhysioNet but haven’t found exactly what I need, so any help would be highly appreciated.
Thank you in advance!
Hi everyone! I’m urgently looking to source a HIPAA-compliant database that includes identifiable PHI (Protected Health Information), such as names and specific diagnosis histories, for a research project with rigorous data protection standards.
I need a reputable third-party vendor experienced in securely handling identifiable health data, with all necessary patient consent and compliance protocols in place. Does anyone know of reliable sources or vendors for acquiring such data legally and ethically? Any insights or recommendations are greatly appreciated—thanks!
Hello all,
I am looking for an inventory dataset, however, I would also need the name of the company where the dataset is coming from (not any government data).
Hello,
I am currently researching the correlation between weather patterns and the usage of shared mobility services, specifically focusing on e-scooter rides. I am looking for a dataset containing daily e-scooter ride counts in a city (any city) covering at least one year.
Details of the request:
Any help or direction to relevant data sources would be greatly appreciated.
Thank you very much in advance for your assistance!
I'm not sure of the terminology, but I'm attempting to do research involving event-based studies of when companies speak out. And I've been banging my head against the wall on this for weeks! 😂
This could be on social issues, on political issues, or when they comment on a humanitarian crisis or an international conflict, etc. But I'm having trouble finding any datasets or proxies that would measure or rank the number of times they "speak out", other than perhaps things like trading volume of the underlying stock or trending on social media.
Are there any datasets you could suggest or point me towards that could serve as a proxy for companies that stand up and speak out on societal issues?
Thank you kindly for any thoughts!