/r/datasets

A place to share, find, and discuss Datasets.

Datasets for Data Mining, Analytics and Knowledge Discovery

Rules

  • Try to post original source whenever you can.
  • Low effort posts will be removed.
  • Self-promotion (of a website/domain you work for or own) without disclosure will be removed.
  • Any Paid Dataset or Resource must be marked as such in the title with [PAID].
  • Any Synthetic/Mock data must be marked as such in the title with [Synthetic].
  • All Survey posts are subject to approval. Message the mods before posting.

Unsure about your post?

Feel free to message the mods and discuss it before posting.

/r/datasets

191,431 Subscribers

5

AI Books4 Dataset for training LLMs further

What?

  • More than 400,000 fiction and non-fiction book full-texts. Multiple languages, curated, deduplicated.

  • More than 6,000,000 scholarly publications, magazines, and manuals full-texts. Multiple languages, curated, deduplicated.

  • 150,000,000 metadata records

Format

Zstd compressed file, JSON lines, one per book/publication.

  • abstract, content - description and content in markdown format

  • issued_at - time of issuing of the object (not of the record itself)

  • metadata - ISBNs, publishers, series etc

  • id - identifier in external systems, if applicable (e.g. DOI)

Other fields should be self-descriptive.
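
As a rough illustration of the format described above, here is a minimal Python sketch that parses JSON lines which have already been decompressed (for example by piping the output of `zstd -dc` into the script). The field names `id`, `abstract`, `content`, `issued_at`, and `metadata` come from the post. The helper names `iter_records` and `summarize`, the sample records, and the assumption that every line is a single JSON object are illustrative guesses, not a documented schema.

```python
import io
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSON line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(record: dict) -> dict:
    """Pick out the fields named in the post; missing fields become None."""
    content = record.get("content") or ""
    return {
        "id": record.get("id"),
        "issued_at": record.get("issued_at"),
        "content_chars": len(content),
        "has_abstract": bool(record.get("abstract")),
    }


# Made-up sample standing in for decompressed data; not taken from the dataset.
sample = io.StringIO(
    '{"id": "example-1", "abstract": "A short description.", '
    '"content": "# Title\\n\\nBody text.", "issued_at": 0, "metadata": {}}\n'
    "\n"
    '{"id": "example-2", "content": "Only content."}\n'
)

summaries = [summarize(r) for r in iter_records(sample)]
for s in summaries:
    print(s)
```

To run this against the real file, `iter_records(sys.stdin)` could replace the in-memory sample, so the archive is streamed line by line rather than loaded into memory at once.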

Download:

magnet:?xt=urn:btih:a904e660355c49006b2e7d43893d31bf3c2be9cc&dn=libstc2.jsonl.zst&tr=udp://tracker.opentrackr.org:1337/announce&tr=https://tracker1.ctix.cn:443/announce&tr=udp://open.demonii.com:1337/announce

0 Comments
2024/05/19
06:40 UTC

2

I will create a free data pipeline + analytics dashboard for you

I am an experienced data engineer and I have three free days next week.

If you have a dataset for which you would like to create a data pipeline for continuous ingestion, and you would like a dashboard built and/or AI-based Q&A on top of that, I am available to help. I will take on the project if it is interesting enough and if you can benefit from it - for free :).

The dashboard/Q&A would be made available on dataflick.dev's free tier.

Let us see if there are some interesting use cases.

0 Comments
2024/05/18
20:28 UTC

0

Construction Schedule Database !!!!!

Looking for a construction schedule database that shows project location, start date, finish date, and the different activities needed to meet the finish date.

2 Comments
2024/05/18
16:49 UTC

1

Looking for synonym database in sqlite

Hi all,
I'm looking to program a fun CLI tool in Rust that will take a string and then replace all of the words with a random synonym. I plan on implementing this using a sqlite3 package to make queries to an already existing (SQLite) database containing a bunch of synonyms.
The only issue now is that I can't seem to find a page for said database, and writing one by hand sounds like a terribly daunting task 😅

Would somebody be able to help me find this?
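
The lookup-and-replace step described in this post can be sketched roughly as follows. The post targets Rust. Python is used here only for illustration, and the same SQL would apply from any SQLite binding. The table `synonyms(word, synonym)`, its sample rows, and the function names are all hypothetical, since no actual database was identified in the thread.

```python
import random
import re
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical synonyms(word, synonym) table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE synonyms (word TEXT NOT NULL, synonym TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO synonyms (word, synonym) VALUES (?, ?)",
        [("big", "large"), ("big", "huge"), ("fast", "quick")],
    )
    return conn


def replace_with_synonyms(text: str, conn: sqlite3.Connection, rng: random.Random) -> str:
    """Replace each word that has at least one synonym with a randomly chosen one."""

    def swap(match: re.Match) -> str:
        word = match.group(0)
        rows = conn.execute(
            "SELECT synonym FROM synonyms WHERE word = ? ORDER BY synonym",
            (word.lower(),),
        ).fetchall()
        if not rows:
            return word  # no synonym known: keep the original word
        return rng.choice(rows)[0]

    # Words are runs of ASCII letters and apostrophes; punctuation is left untouched.
    return re.sub(r"[A-Za-z']+", swap, text)


conn = build_demo_db()
result = replace_with_synonyms("A big dog runs fast.", conn, random.Random(0))
print(result)
```

Passing a seeded `random.Random` keeps runs reproducible while testing. A real tool would presumably also want to preserve capitalization and handle inflected forms, which this sketch ignores.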

0 Comments
2024/05/17
23:50 UTC

3

Looking for Car Theft Data either City, State, or National

Hi, I’m looking for a dataset that has car theft data. I’m looking for make/model, time of theft, location, recovered (y/n), and details if possible. This is for a school project that I hope becomes a helpful tool to mitigate car thefts.

I reached out to the FBI and local PD, but haven’t received a response. I don’t care much for the location of the dataset but am prioritizing location of thefts.

1 Comment
2024/05/16
22:06 UTC

2

Data set of 50-state rankings of US states by various criteria

This is kind of a weird thing I’m doing, but it’s to test out a theory. I need as many 50-state rankings of US states by any kind of criteria as I can get my hands on. Can be anything from ranked by size to ranked by baseballs per capita or something. Does anyone have any suggestions for places to look short of manual collection?

0 Comments
2024/05/16
17:34 UTC

1

Hello, I'm currently working on my academic thesis and need Refinitiv Workspace access to gather ESG data. My school granted me only limited access, so I'm looking for someone who can help. Thank you

0 Comments
2024/05/16
15:25 UTC

1

Popular streaming services (e.g. Netflix, Amazon Prime, Disney+, etc.) metadata

I'm looking to do a Python-based data analysis and visualisation project. I was looking to focus on the data and metadata of most, if not all, available movies and TV series provided by the most popular streaming services.

I see most online projects use this Kaggle source: https://www.kaggle.com/datasets/shivamb/netflix-shows/data

As nice as it is, it's not as up to date as I would have liked, as it only goes up to 2021.

Is anyone aware of any other public, free dataset similar to the above which could fit my purpose?

I'm aware there are many sites such as https://flickmetrix.com/ and https://flixable.com/ which seem to have a large amount of movie data, but I can't find their source or whether they have shared it publicly.

Thank you

2 Comments
2024/05/16
11:09 UTC

0

Open source data sharing project for research labs / individuals

Hey guys! I have noticed that there is not much in the realm of open source data-sharing services, so I created a Django REST / React app that allows for upload, download, reviewing, etc., of files. Not sure if it would be useful to people. Also, please feel free to add features. This is meant to be an open source project that allows research labs / people to share and review datasets without needing to pay for any online subscriptions. https://github.com/lxaw/DataDock

1 Comment
2024/05/16
11:55 UTC

0

Need a college dataset for an AI I’m making

Hello!

I have spent hours looking for a dataset that includes information on college courses, plus a brief description of each course.

I have had some luck having found thorough datasets explicit to certain colleges. Perhaps I can just use those and call it good; I assume most colleges have roughly the same courses, some differ slightly.

But before I continue my journey I just wanted to see if this community knows of any decent datasets in regards to college information including, but definitely not limited to, the majors and a brief description of the majors?

0 Comments
2024/05/15
23:53 UTC

2

Recommend me a Dataset for hands on project

Hey there, I am learning Apache Spark and AWS. I am planning a project, basically an ETL pipeline using Glue, and I want to perform the transformations with Spark. I haven’t come across a good dataset for it yet. It’s not that there are no datasets, but I want a big one with thousands of rows and fewer than 10 columns. The ones I have found myself, such as UFO and World Bank, are either too big or lack a good source. Have any fellow redditors worked on something similar, or do you just have a good recommendation?

1 Comment
2024/05/15
23:36 UTC

2

[self-promotion] ICYMI: You can now get notified when any new code is released for a given paper or topic!

ICYMI: You can now get notified when any new code is released for a given paper or topic! Just install the code finder extension (Chrome: https://chromewebstore.google.com/detail/ai-code-finder-for-papers/aikkeehnlfpamidigaffhfmgbkdeheil | Firefox: https://addons.mozilla.org/en-US/firefox/addon/code-finder-catalyzex/ | Edge: https://microsoftedge.microsoft.com/addons/detail/get-papers-with-code-ever/mflbgfojghoglejmalekheopgadjmlkm), click on any bell/alert icon you come across while browsing the web and follow the next steps on the screen 🙂 Also, with alerts

  • get the latest developments in your area of interest delivered straight to your inbox.
  • Author's newest work: be the first to know when an author releases new papers.

1 Comment
2024/05/15
19:23 UTC

1

Datasets or just historical data for individual gas stations?

I’m trying to look at the trends in Southern Ontario’s gas prices for each location (going as local as an individual gas station), but I can only find the monthly gas price averages. Any ideas where I can find historical data for individual gas stations? I know that Google and Apple Maps display them, but I’m not sure if they store historical data.

0 Comments
2024/05/15
17:04 UTC

3

How to price image data for data monetization?

I'm currently researching how satellite imagery data (or any other type of image data), especially hyperspectral and multispectral data, is priced by different companies. I'm particularly interested in how these companies determine the cost for various sectors like agriculture, mining, and environmental monitoring.

Here's some context:

Service Tiers: Companies often offer different service tiers (e.g., tasking, archive access, subscription models).

Resolution and Coverage: Pricing seems to vary based on image resolution (e.g., 5-meter vs. sub-meter) and the area covered.

Applications: Different use cases might influence pricing (e.g., crop health monitoring, yield prediction, soil analysis).

Technology: Advances in satellite technology, such as deployable optics, might impact cost.

I've seen companies like Wyvern Space, Planet Labs, and Pixxel offering these services but haven't found detailed public pricing information.

Could anyone share insights or resources on:

- General pricing strategies for satellite imagery (and image data in general), and any approximate numbers?

- How factors like resolution, coverage area, and application affect pricing?

- Any case studies or examples from companies in this field?

Thanks in advance for your help!

3 Comments
2024/05/15
14:01 UTC

1

RGB Ship Satellite Imagery Dataset for Segmentation Needed

Hi!
For a project of mine, I need to find multiple datasets which contain RGB images of ships (multiple ships in a single image) for a segmentation task.
I'd be grateful if you guys can recommend any.

1 Comment
2024/05/14
13:25 UTC

1

DataSet for Training Models for Detecting levels of depression

Hi everyone! I wish to create a dataset with phrases depicting various levels of depression.

I am aware that I could easily scout through Reddit posts and create a dataset, but I wish to create it using a model that could give me an endless supply of “human-like” phrases which mimic actual people describing their depression.

I was thinking of maybe scraping through some medical journals which could give me some symptoms of depression and related issues, and then create a model which takes these symptoms and creates “human-like” phrases related to these symptoms, but am not sure how I could implement this.

Any help would be appreciated. Thanks a lot!

1 Comment
2024/05/14
13:34 UTC

1

I’m having trouble finding economic data about the Democratic People's Republic of Korea (North Korea) - Bachelor Thesis

Hi, I’m Paula

I'm working on my bachelor's thesis and need to find some reliable economic data on North Korea. It's pretty tricky to locate good sources for this, so I thought I'd ask if you have any suggestions on where to look or who to talk to. I'm looking for data spanning from 1960 to 2023, covering the following indicators:

  1. GDP at constant prices

  2. Investment (Gross Fixed Capital Formation, GFCF)

  3. State intervention: public spending as a percentage of GDP

  4. Country openness: the sum of exports plus imports divided by GDP ((X+M)/GDP)

  5. Real exchange rate

  6. Economic structure (GDP by sector)
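
For indicators 3 and 4, the ratios described in the list can be written out as a small Python sketch. The function names are hypothetical and the input figures are placeholders chosen only to make the arithmetic easy to follow. They are not estimates for North Korea or any other country.

```python
def trade_openness(exports: float, imports: float, gdp: float) -> float:
    """Openness = (X + M) / GDP, with all values in the same currency and year."""
    if gdp <= 0:
        raise ValueError("GDP must be positive")
    return (exports + imports) / gdp


def spending_share(public_spending: float, gdp: float) -> float:
    """Public spending as a percentage of GDP."""
    if gdp <= 0:
        raise ValueError("GDP must be positive")
    return 100.0 * public_spending / gdp


# Placeholder numbers for illustration only; they are not real economic data.
openness = trade_openness(exports=20.0, imports=30.0, gdp=200.0)
share = spending_share(public_spending=50.0, gdp=200.0)
print(f"openness={openness:.2f}, public spending={share:.1f}% of GDP")
```

Both ratios are only meaningful when numerator and denominator use the same price basis, so mixing current-price trade figures with constant-price GDP would distort the series.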

Sorry if this is not the right place to post this, but I'm quite lost and don't know where else to look. I already have some of the data, but it's either not for all years or it's incomplete. I've also checked the Bank of Korea and World Bank data, but most of it only covers a few years or isn't very old.

2 Comments
2024/05/14
10:32 UTC

1

Seeking Dataset for Internet Traffic Analysis (Malicious vs. Legitimate)

I'm currently working on my bachelor's thesis, which aims to build a classification model that differentiates between malicious and legitimate internet traffic. I'm trying to gather the data on my own, but I'm unable to get the amount of data needed to train a decent model. I need a dataset containing internet traffic labeled as either malicious or legitimate (binary classification).

The dataset should ideally include features commonly associated with internet traffic analysis, such as IP addresses, timestamps, protocols, packet sizes, etc. Any additional contextual information would be highly beneficial.

If you know of any publicly available datasets or have access to such data, including well-done synthetic datasets, please let me know.

3 Comments
2024/05/13
19:57 UTC

1

Country wise natural resources deposits

I got this data from Wikipedia. My hypothesis was that countries with more natural resources are richer, but the data didn't support it. Here's the data, though.

https://drive.google.com/drive/folders/1JftfuxdMDiqAFVenl7wXWTMpQaAGR8vO?usp=drive_link

2 Comments
2024/05/13
18:27 UTC

6

Article: How To Price A Data Asset; What criteria go into such a calculation.

Large article on data pricing.
Really good overview and information.
https://pivotal.substack.com/p/how-to-price-a-data-asset

0 Comments
2024/05/13
17:20 UTC

2

Search engine and dataset for local government meetings in US and Canada [self-promotion]

I wanted to share a new search engine called CivicSearch. You can type in a keyword like “pickleball” or “affordable housing” and get a list of mentions in government meetings from 600+ US and Canadian cities: civicsearch.org

For an example of what’s possible with this data, we’ve written (and are writing) a series of newsletters that explore specific topics in detail, like Black History Month, school absenteeism, and bus rapid transit. You can subscribe to receive these updates by email, as well as personalized alerts for any location or keyword.

I created this tool, and I hope you find it useful. I’m here if you have any questions or suggestions.

1 Comment
2024/05/11
14:54 UTC

1

What exactly is Clickstream data and where to find it?

Several analytics companies that offer "competitor analysis" can get data on website visits, direct traffic, referral traffic, app downloads, app searches, time on site, bounce rate, etc.

When I contact them to ask where they source the data, they all say "from Clickstream" but refuse to elaborate.

What is Clickstream? Is it a single data provider, or multiple? Where can I find them?

Google search hasn't really revealed much, I guess it is a very niche b2b area where you need connections and good sources...

4 Comments
2024/05/12
00:36 UTC

0

anyone into data science? need some career advice

I'm a 20-year-old statistics student (2nd year) from BHU. Second year is here and I've been feeling the need to get serious about my career. Lately I've been wanting to get into data analytics/data science and AI, but I have absolutely no idea how to go about it. As for skills, I am learning Python these days. Is there anyone already in this field who can help me out? Maybe suggest which courses I can take online, or a rough roadmap. I wish to eventually bag an internship by 3rd year.

0 Comments
2024/05/11
23:25 UTC

0

anyone into data science? need some career advice

I'm a 20-year-old statistics student (2nd year) from BHU. Second year is here and I've been feeling the need to get serious about my career. Lately I've been wanting to get into data analytics/data science and AI, but I have absolutely no idea how to go about it. As for skills, I am learning Python these days. Is there anyone already in this field who can help me out? Maybe suggest which courses I can take online, or a rough roadmap. I wish to eventually bag an internship by 3rd year.

1 Comment
2024/05/11
23:20 UTC
