/r/datamining

News, articles and tools for data mining: the process of extracting useful information from large data sets.

15,211 Subscribers

2

In PCA, what does the borderline eigenvalue function represent, and which two-way matrix does it come from?

My professor told us that the eigenvalue function can never be increasing because it is decreasing by definition. He also said there is a borderline case (one that does not come from a square matrix), but I can't understand it. Thank you in advance.

0 Comments
2024/05/01
20:02 UTC

1

A data mining work in a chess database

Hi everyone,

For my final degree project in statistics, I'm applying data mining techniques to a chess database. I have more than 500,000 chess games with variables for the number of turns, Elo, and how many times each piece has been moved (for example, B_K_moves is how many times Black has moved the king).

The problem is that I'm supposed to build a decision tree and go through all the steps, but the tree is only 3 nodes deep. This is the tree, and I'm supposed to do steps like pruning, but it's very simple and I don't know why the algorithm doesn't use variables like W_Q_moves (how many times White has moved the queen) or B_R_moves (how many times Black has moved a rook).

This is the code I've used with the library caret in R:

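library(caret)
# 10-fold cross-validation
control <- trainControl(method = "cv", number = 10)
# CART via rpart; caret tunes the complexity parameter cp
modelo <- train(Result ~ ., data = dataset, method = "rpart", trControl = control)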
print(modelo)
## CART
##
## 212282 samples
## 15 predictor
## 3 classes: ’Derrota’, ’Empate’, ’Victoria’
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 191054, 191054, 191054, 191054, 191053, 191054, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01444892 0.6166044 0.2417333
## 0.02930692 0.5885474 0.1931878
## 0.13442808 0.5668073 0.1448201
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01444892

And the code to plot the tree:

library(rpart.plot)
## Loading required package: rpart
rpart.plot(modelo$finalModel, yesno = 2, type = 0, extra = 1)

As I said, I don't know why the tree is so shallow or what to change in the code to make it deeper.

0 Comments
2024/04/30
23:52 UTC

3

Clustering Embeddings - Approach

Hey guys. I'm building a project that involves a RAG pipeline. The retrieval part was pretty easy: I just needed to embed the chunks and then call top-k retrieval.

Now I want to add another component that can identify the widest range of subtopics in a big group of text chunks. For example, if I chunk and embed a paper on black holes, it should return the chunks covering the different subtopics in that paper, so I can then get the subtopic of each chunk. (If I'm going about this wrong and there's a much easier way, let me know.)

I'm assuming the correct approach is something like k-means clustering? The thing is, the vector database I'm currently using, Pinecone, is really easy to use but only supports top-k retrieval. What other options are there for something like this? I would appreciate any advice and guidance.
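One way to get a spread of subtopics without relying on the vector database is to pull the embeddings out and pick a diverse subset client-side. The following is only a minimal sketch of greedy farthest-point selection, which is a simpler stand-in for k-means and not the same algorithm. The function names and the toy 3-dimensional vectors are hypothetical; real embeddings would come from whatever model the pipeline already uses.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def pick_diverse(embeddings, k):
    """Greedily pick k indices so each new pick is as far as possible
    from every chunk already picked (farthest-point selection)."""
    if not embeddings or k <= 0:
        return []
    chosen = [0]  # arbitrary starting chunk (an assumption of this sketch)
    # nearest[i] = distance from chunk i to its closest chosen chunk
    nearest = [cosine_distance(e, embeddings[0]) for e in embeddings]
    while len(chosen) < min(k, len(embeddings)):
        candidates = [i for i in range(len(embeddings)) if i not in chosen]
        nxt = max(candidates, key=lambda i: nearest[i])
        chosen.append(nxt)
        for i, e in enumerate(embeddings):
            nearest[i] = min(nearest[i], cosine_distance(e, embeddings[nxt]))
    return chosen

# Toy embeddings: two near-duplicate pairs plus one outlier.
toy = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
]
picked = pick_diverse(toy, 3)
print(picked)
```

On the toy data this skips the near-duplicates and returns one chunk per direction. To label every chunk with a subtopic afterwards, each chunk can be assigned to its nearest picked chunk; a proper clustering library would be the more usual choice once the embeddings are available locally.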

0 Comments
2024/04/30
06:39 UTC

4

Grandma sent photo links

My grandma passed away. She had sent me photos as links, but the links have expired and no longer work. I'm wondering if it's possible to recover the photos using the links.

0 Comments
2024/04/16
20:37 UTC

1

Scoring scale for KDD2024 conference reviews

Does anyone know what the scoring scale for KDD Conference reviews is this year? I only see the numbers proposed by reviewers on OpenReview but cannot find the overall scale anywhere.

0 Comments
2024/04/05
12:36 UTC

4

Historical Stock Market Data

I'm looking to perform some data analysis on stock market data going back about 2 years at 10-second intervals and compare it against real-time data. Are there any good resources that provide OHLC and volume data at that level without having to pay hundreds of dollars?

1 Comment
2024/03/16
01:35 UTC

2

Grey-hat email mining

In light of the decision on Meta v. Bright Data, Instagram data mining is back on the lunch table.

What would be a way to market this - is SaaS a good move? I've done plenty of research on how to defeat Meta and their devious anti-scraping mechanisms...but there's no point to this code if it is not profitable.

There are others in this sphere that are charging way too much, so I am clueless as to how (and if) they are getting any customers.

Sorry if this comes off as elementary or trivial, I'm a hacker and coder - not a businessman.

0 Comments
2024/03/12
03:35 UTC

1

Apriori Algorithm Output based on different min_support thresholds.

After running my algorithm on keywords of tweets I got these results at different min_support values.

These were the top frequent patterns found for a min_support of 5000.

flu shot (23150)

flu got shot (20868)

flu get shot (20865)

flu get (7642)

flu got (7388)

These were the top frequent patterns found for a min_support of 500.

flu shot (23150)

flu got shot (20868)

flu get shot (20865)

flu getting shot (8289)

flu get (7642)

flu got (7388)

Does this make sense? How does raising min_support eliminate the pattern "flu getting shot", even though its support (8289) is over 5000?
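A quick sanity check on the counts listed above, as a minimal sketch. It assumes each pattern is an unordered itemset and that support means the number of tweets containing every word in the set; the helper names are hypothetical. Under that assumption, raising the threshold is just a filter, and a superset can never have higher support than one of its subsets.

```python
# Counts reported for min_support = 500, copied from the post.
counts_at_500 = {
    ("flu", "shot"): 23150,
    ("flu", "got", "shot"): 20868,
    ("flu", "get", "shot"): 20865,
    ("flu", "getting", "shot"): 8289,
    ("flu", "get"): 7642,
    ("flu", "got"): 7388,
}

def frequent(counts, min_support):
    """Keep only the itemsets whose support meets the threshold."""
    return {items: n for items, n in counts.items() if n >= min_support}

def monotonicity_violations(counts):
    """List (subset, superset) pairs where the superset has MORE support,
    which is impossible for true itemset support."""
    bad = []
    for small, n_small in counts.items():
        for big, n_big in counts.items():
            if set(small) < set(big) and n_big > n_small:
                bad.append((small, big))
    return bad

at_5000 = frequent(counts_at_500, 5000)
missing = set(counts_at_500) - set(at_5000)
violations = monotonicity_violations(counts_at_500)

print("dropped at 5000:", missing)
for small, big in violations:
    print("support of", big, "exceeds support of its subset", small)
```

With these numbers nothing should be dropped at 5000, so a missing "flu getting shot" points at the implementation rather than at the threshold. The second check also flags that "flu got shot" is reported with more support than "flu got", which suggests the counting step itself may be worth a look.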

0 Comments
2024/03/02
20:35 UTC

1

Data Mining

I'm diving into big data applications and looking to explore the wide array of data mining tools out there. Can you share your favorite data mining tool that you've used in a big data application?

I'm particularly interested in hearing about tools that shine in specific applications. So, if you've used a tool for something like sentiment analysis, fraud detection, recommendation systems, or any other big data application, I'd love to hear about your experience with it!

0 Comments
2024/03/01
17:49 UTC

1

Any developers here wanting to shape the future of Docker?

0 Comments
2024/03/01
13:24 UTC

4

Mining Twitter using Chrome Extension

I'm looking to mine large amounts of tweets for my bachelor thesis.
I want to do sentiment polarity, topic modeling, and visualization later.

I found TwiBot, a Google Chrome Extension that can export them in a .csv for you. I just need a static dataset with no updates whatsoever, as it's just a thesis. To export large amounts of tweets, I would need a subscription, which is fine for me if it doesn't require me to fiddle around with code (I can code, but it would just save me some time).

Do you think this works? Can I just export... let's say, 200k worth of tweets? I don't want to waste 20 dollars on a subscription if the extension doesn't work as intended.

22 Comments
2024/02/19
20:09 UTC

3

Social media sentiment analysis for total beginners

I'm not sure if this is the right subreddit but I can post it somewhere else if it isn't.

Does anyone know a really good tutorial for beginners to navigate the basics of KNIME and how to use it for SMSA?

I need it for my uni thesis and I also have no experience in KNIME, but I've used basic PowerBI and SPSS Statistics

Thank you in advance :)

3 Comments
2024/02/15
17:29 UTC

2

I need help

There is a guy who has been spamming me with phone calls for the last 3 days.

I need more information about him, and all I have is his phone number.

The police can't do anything about it.

Please help me so I can stop him.

2 Comments
2024/02/09
17:27 UTC

5

Algorithm to find patterns in temporal sequences?

I have a large database with different types of errors in temporal sequence. Example: A, C, F, C, G, D, A, G,...., F, G, D, A... F, S, G, D, H, A... What algorithms can I use to find repeating patterns? (In the example: to discover that when F, G and D occur, A subsequently occurs). Thanksssss :)
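As a starting point before reaching for a dedicated algorithm, one can count which error follows each run of consecutive errors. This is a minimal sketch under the assumption that the pattern is contiguous (no other errors in between); the event string below is made up for illustration and `follower_counts` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def follower_counts(events, window):
    """For every run of `window` consecutive events, count what comes next."""
    table = defaultdict(Counter)
    for i in range(len(events) - window):
        context = tuple(events[i:i + window])
        table[context][events[i + window]] += 1
    return table

# Hypothetical error log, one letter per error type.
events = list("ACFGDABFGDACCFGDH")
table = follower_counts(events, 3)

# Report contexts seen at least twice, with their most common follower.
for context, followers in table.items():
    total = sum(followers.values())
    if total >= 2:
        event, hits = followers.most_common(1)[0]
        print(context, "->", event, f"({hits}/{total})")
```

Here F, G, D is followed by A in two of its three occurrences. Patterns with gaps, such as F, S, G, D, H, A in the question, are not caught by fixed windows; sequential pattern mining algorithms such as GSP or PrefixSpan are designed for that case.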

1 Comment
2023/12/26
19:22 UTC

2

Adding variable to scored data

Hi guys, I made a predictive model in Enterprise Miner, and now I have to score the data set. I just want to ask how to add a binary variable to the scored data set in Enterprise Miner. Thank you

0 Comments
2023/12/20
00:06 UTC

2

HELP - Find the next value based on 100k Results

Hello all,

I'm new to data analysis and mining. I have a list of 100k entries in a CSV file with just a single column.

The values are as follows:
0
1
1
1
0
0
1
1
0
1
1
1
.
..
...
1
1
0
0

Based on this data, can I predict the 100,001st result? Will it be 0 or 1? If so, what is the best method for it? I'm learning Python and have tried gradient boosting, support vector machines (SVM), and basic neural networks, but I haven't been able to get it working.
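Before trying heavier models, it may help to check whether the sequence carries any signal at all. The sketch below is one simple baseline, not a recommended final method: an order-k frequency table that predicts the next value from the previous k values, scored on a held-out tail. The function names and the toy periodic sequence are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def fit_markov(bits, order):
    """Count which value follows each run of `order` previous values."""
    table = defaultdict(Counter)
    for i in range(order, len(bits)):
        table[tuple(bits[i - order:i])][bits[i]] += 1
    return table

def predict(table, context, default):
    """Most frequent follower of `context`, or `default` if unseen."""
    counts = table.get(tuple(context))
    return counts.most_common(1)[0][0] if counts else default

def holdout_accuracy(bits, order, split=0.8):
    """Fit on the first `split` share of the data, score on the rest."""
    cut = int(len(bits) * split)
    train = bits[:cut]
    table = fit_markov(train, order)
    default = Counter(train).most_common(1)[0][0]  # majority value
    hits = 0
    for i in range(cut, len(bits)):
        hits += predict(table, bits[i - order:i], default) == bits[i]
    return hits / (len(bits) - cut)

# Toy stand-in for the CSV column: a strictly periodic sequence.
bits = [0, 1, 1] * 40
accuracy = holdout_accuracy(bits, order=2)
table = fit_markov(bits, 2)
next_value = predict(table, bits[-2:], default=1)
print(accuracy, next_value)
```

On real data, compare the holdout accuracy with the share of the majority value. If the values are independent draws (coin flips, for example), no model can do better than that share, and gradient boosting or neural networks will not help either.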

1 Comment
2023/11/16
09:52 UTC

1

A way to get the whole table load at once or get it to Excel?

Hi, is there a way to load the whole table from:

All Cryptocurrencies | CoinMarketCap

or get it to Excel?

2 Comments
2023/11/12
17:13 UTC

0

FB accounts for mining

Mods if not allowed please delete.

I need one or two established Facebook accounts. I've found multiple places to buy them but they want a credit card, don't have PayPal and that's too shady for my taste. Some take crypto but coinbase gladly accepted my money and put it on hold for going on a week now.

Does anyone have suggestions on how to buy said accounts without giving my credit card directly to the prince of Nigeria?

0 Comments
2023/11/10
05:54 UTC

4

Type 1 diabetes data mining

Hello. I read today that 1 in 10 kids is getting type 1 diabetes (T1D) worldwide. Has anyone data-mined diabetes? Why are so many kids getting it? What event in a kid's life causes this to happen?

I understand the human body is complex, but the solution might be shown in data analysis.

3 Comments
2023/10/16
04:44 UTC

2

Splitting and using Nominal to Binominal in Rapidminer

Hi!

I am using RapidMiner for a project. We have a CSV file with a lot of data about movies. We want to look at the keywords related to the movies to see which keywords are most associated with successful movies. To do this, we want to use association rule mining.

The file has every keyword related to a specific movie in one string, for example: "spain-rome italy-vatican-pope-pig-possession-conspiracy-devil-exorcist-skepticism-catholic priest-1980s-supernatural horror". We have split these keywords and then used Nominal to Binominal.

The problem is that every attribute gets an ID based on where the keyword was in the string, like this: "keywords_1 = spain". In another movie, spain might occur further back in the string, and RapidMiner creates a new attribute, for example: "keywords_7 = spain". We want every unique keyword to be in only one attribute. Is this possible in RapidMiner and, if so, how?
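To make the target layout concrete, here is a minimal Python sketch (not a RapidMiner process) of the position-independent encoding described above: one true/false column per unique keyword, regardless of where it appeared in the string. The function name and the two shortened example rows are hypothetical.

```python
def one_hot_keywords(rows, sep="-"):
    """Return (vocabulary, matrix) with one boolean column per unique keyword."""
    # Split each string into a set of keywords, ignoring position.
    split_rows = [
        {kw.strip() for kw in row.split(sep) if kw.strip()} for row in rows
    ]
    vocabulary = sorted(set().union(*split_rows))
    matrix = [[kw in kws for kw in vocabulary] for kws in split_rows]
    return vocabulary, matrix

rows = [
    "spain-rome italy-vatican-pope",
    "devil-spain-exorcist",
]
vocabulary, matrix = one_hot_keywords(rows)
spain_col = vocabulary.index("spain")

print(vocabulary)
for row in matrix:
    print(row)
```

"spain" ends up in a single column that is true for both movies, which is the shape association rule mining expects. Splitting on "-" assumes no keyword contains a hyphen itself.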

Thanks!

0 Comments
2023/10/13
09:17 UTC

5

I collect Rental Data, need suggestions on what more to add ...

Hello,

As the title suggests I collect rental data for major Canadian cities.

What other statistical metrics should I add apart from the ones I currently process?

The data I collect consists of the location, rent, and date.

Resource that I'm talking about.

Thanks!

2 Comments
2023/10/04
18:14 UTC

4

Split a JSON-string inside a CSV-file

Hi!

I have a CSV file that consists of an id, which identifies a unique movie, and the keywords for this movie. It looks something like this: 15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392, 'name': 'best friend'}, {'id': 179431, 'name': 'duringcreditsstinger'}, {'id': 208510, 'name': 'old men'}]"

I want to split the data so that every movie (the id) is paired with each of its keywords. But when I read the CSV file, I only get one column with the id and one column with all the keywords, including the keyword id and 'name'. Is there a way to get only the keyword names?
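A minimal sketch of one way to do this in Python, assuming the file has no header row and every line looks like the sample above. Note that the keyword column uses single quotes, so it is a Python-style literal rather than valid JSON; `ast.literal_eval` parses it where `json.loads` would fail. The function name `explode_keywords` is hypothetical.

```python
import ast
import csv
import io

# The sample row from the post, standing in for the real file.
raw = '''15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392, 'name': 'best friend'}, {'id': 179431, 'name': 'duringcreditsstinger'}, {'id': 208510, 'name': 'old men'}]"
'''

def explode_keywords(handle):
    """Yield one (movie_id, keyword_name) pair per keyword."""
    pairs = []
    for movie_id, blob in csv.reader(handle):
        # blob is the quoted second column: a list of {'id', 'name'} dicts.
        for entry in ast.literal_eval(blob):
            pairs.append((int(movie_id), entry["name"]))
    return pairs

pairs = explode_keywords(io.StringIO(raw))
for movie_id, name in pairs:
    print(movie_id, name)
```

For a real file, replace `io.StringIO(raw)` with `open(path, newline="")`. The resulting pairs can be written back out with `csv.writer` as a two-column file, one keyword per row.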

5 Comments
2023/10/04
11:48 UTC

4

Tiktok Data Mining?

I have a project, and I have talked to customers in the e-commerce industry who are willing to pay.

I tried many GitHub repos, but none of them work. The project involves really heavy scraping/data mining from TikTok, which I couldn't get done on my own.

Can someone tag somebody whom I can pay or who could partner up with me on this project?

3 Comments
2023/09/23
13:45 UTC
