/r/datamining


News, articles and tools for data mining: the process of extracting useful information from large data sets.





15,477 Subscribers

2

Frequent Pattern Mining question

I'm performing a Frequent Pattern Mining analysis on a dataframe in pandas.

Suppose I want to find the most frequent patterns across columns A, B, and C. I find several patterns; let's pick one: (a, b, c). The problem is that, with high probability, this pattern is frequent just because a is very frequent in column A on its own, and the same for b and c. How can I distinguish patterns that are frequent for this trivial reason from ones that are frequent for interesting reasons? I know there are many metrics for this, like lift, but they are all binary metrics, in the sense that I can only calculate them on two-column patterns, not three or more. Is there a way to do this for a pattern of arbitrary length?

One way would be calculating the lift on all possible subsets of length two:

lift(A, B)

lift((A, B), C)

and so on

but how do I aggregate all the results to make a decision?

Any advice would be really appreciated.
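A common way to extend lift beyond pairs is to divide the observed support of the whole pattern by the product of the individual item supports (the support expected if the columns were independent); values well above 1 then flag interesting co-occurrence regardless of pattern length. A minimal pure-Python sketch of this idea (the function name and toy data are made up for illustration):

```python
from collections import Counter
from math import prod

def multi_item_lift(rows, pattern):
    """Lift generalized to patterns of any length: observed support of the
    full pattern divided by the support expected under independence."""
    n = len(rows)
    joint = sum(all(item in row for item in pattern) for row in rows) / n
    item_counts = Counter(item for row in rows for item in set(row))
    expected = prod(item_counts[item] / n for item in pattern)
    return joint / expected if expected else 0.0

# Toy transactions: 'a' and 'b' co-occur more often than chance predicts.
rows = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'b', 'd'},
        {'c'}, {'d'}, {'a', 'b', 'c'}]
print(multi_item_lift(rows, ('a', 'b')))       # 1.5 -> positively associated
print(multi_item_lift(rows, ('a', 'b', 'c')))  # 1.5
```

A value near 1 would mean the pattern is frequent only because its individual items are frequent, which is exactly the trivial case the question wants to rule out.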

0 Comments
2024/11/09
14:03 UTC

3

What are some books about what companies do with data they collect?

0 Comments
2024/10/06
09:09 UTC

3

setting up the Sentinel-Analysis on Google-Colab - see how it goes..

Scraping data using Twint: I tried to set it up following this Colab notebook:

https://colab.research.google.com/github/vidyap-xgboost/Mini_Projects/blob/master/twitter_data_twint_sweetviz_texthero.ipynb#scrollTo=EEJIIIj1SO9M

Let's collect data from twitter using twint library.

Question 1: Why are we using twint instead of Twitter's Official API?

Ans: Because twint requires no authentication, no API keys, and, importantly, no rate limits.

import twint

# Create a function to scrape a user's account.
def scrape_user():
    print("Fetching Tweets")
    c = twint.Config()
    # choose username (optional)
    c.Username = input('Username: ')  # username changed to protect the user's privacy
    # choose beginning time (narrows results)
    c.Since = input('Date (format: "%Y-%m-%d %H:%M:%S"): ')
    # store results as a properly formatted CSV
    c.Store_csv = True
    # file name to save the results under
    c.Output = input('File name: ')
    twint.run.Search(c)


# run the above function
scrape_user()
print('Scraping Done!')

But at the moment I don't think this runs correctly.

1 Comment
2024/09/30
16:52 UTC

13

Thoughts on API vs proxies for web scraping?

New to scraping. What would you say are the main pros and cons of using traditional proxies vs. APIs for a large data scraping project?

Also, are there any APIs worth checking out? Appreciate any input.

9 Comments
2024/09/16
21:52 UTC

3

Chapter 1,2,3 of Mining of Massive Datasets

As someone with no background in Computer Science, I don't know what the learning outcomes of these book chapters are. They cover an introduction to Hadoop, MapReduce, and finding similar datasets.

0 Comments
2024/09/11
17:03 UTC

4

Processing data feeds according to configurable content filters

I'm developing an RSS++ reader for my own use. I already developed an ETL backend that retrieves the headlines from local news sites, which I can then browse with a local viewer. This viewer puts the headlines in chronological order (instead of an editor-picked one), and I can mark them as seen/read, etc. My motivation is that this saves me a lot of *attention* and therefore time, since I'm not influenced by editorial choices from a news website. I want "reading the news" to be as clear as reading my mail: a task that can be consciously completed. It has been running for a year, and it's been great.

But now my next step is that I want to make my own automated editorial filters on content. For example, I'm not interested in football/soccer whatsoever, so if a news article is saved in the category "Sports - Soccer" I would like to filter it out. That sounds simple enough, right? Just add one if statement, job done. But mined data is horribly inconsistent, because a different editor will come along (perhaps on a different news site) who posts their stuff in "Sports - Football", so I would have to write another if statement.

At some point I would have a billion other subjects/people/artists I couldn't care less about. In addition, I may also want to create exceptions to a rule. E.g. I like F1, but I'm not interested in Lewis Hamilton's side projects (music, etc.). So I cannot simply throw out all articles that contain "Lewis Hamilton", because otherwise I wouldn't see much F1 news anymore. I would need to add an exception whenever the article is recognized to be about Formula 1, e.g. when it is posted in an F1 news feed.

I think you get the point. I don't want to manually write a ton of if-else spaghetti to manage such filters & data feeds. I'm looking for some kind of package/library that can manage this, preferably with some kind of (web) GUI too.

And no, for now I'm not interested in an AI or large language model solution. I think some software that looks for keywords (with synonyms) in an article, with some filtering rules, could work pretty well. Perhaps. I tried to write something generic like this many years ago, but it was in Python (I use C# now) and pretty slow.

I'm just throwing this idea/question out there on the off chance I'm oblivious to some OSS package/library that solves this problem. Does anyone have ideas, suggestions, or inspiration?
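For what it's worth, the keyword-plus-exceptions scheme described above can be prototyped in very few lines; everything below (the `FilterRule` class, rule names, keyword sets) is invented for illustration, not an existing package:

```python
from dataclasses import dataclass, field

@dataclass
class FilterRule:
    """Drop an article when any 'drop' keyword matches,
    unless an 'unless' keyword also matches (the exception)."""
    name: str
    drop: set[str]                                   # synonyms that trigger the rule
    unless: set[str] = field(default_factory=set)    # exceptions that veto it

def is_dropped(article: str, rules: list[FilterRule]) -> bool:
    words = set(article.lower().split())
    for rule in rules:
        if words & rule.drop and not words & rule.unless:
            return True
    return False

rules = [
    FilterRule("no ball games", drop={"soccer", "football"}),
    FilterRule("no hamilton side projects", drop={"hamilton"},
               unless={"f1", "formula"}),
]

print(is_dropped("Soccer match ends in draw", rules))         # True
print(is_dropped("Hamilton releases a music single", rules))  # True
print(is_dropped("Hamilton wins the F1 Grand Prix", rules))   # False
```

A real version would match stemmed phrases rather than whitespace-split words, and could load the rules from a config file edited through a small web GUI.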

0 Comments
2024/09/06
20:51 UTC

1

Seeking guidance

Please direct me elsewhere if you see fit; I'm unsure where to start. I'm looking to extract specific data points from multiple documents within an online database. The database: https://adviserinfo.sec.gov. The data: firm name, city, state, and assets under management. The process: within the database, a search must be done for an (investment advisory) firm, for example Abound Wealth. Click the latest Form ADV filed; a PDF is generated, and on page 5 there is a table called "Type of Client" containing the total assets under management.

Is this possible to do?

0 Comments
2024/09/06
00:44 UTC

0

Exporting Decision Tree Graphics on SPSS Modeler

0 Comments
2024/09/03
21:17 UTC

21

Thoughts on API vs proxies for web scraping?

Can someone give me the ELI5 on the main pros and cons of using traditional proxies vs. APIs for a large data scraping project?

Also, are there any APIs worth checking out? (apologies in advance if this isn't the right place to ask)

8 Comments
2024/08/28
18:43 UTC

1

Getting emails

Hi, Dear Friends!

I publish a scholarly newsletter once a week. Many people in my scholarly community want this info. It is free (in the meantime), but they don't even know it exists.

I have done a lot of research this week about harvesting emails and sending people the link to sign up. I know that technically this is that four-letter word, SP$#M, and is against the law, but I said to all those self-righteous people who were preaching to me about ethics, "Stop cheating on your tax returns and then come back to preach to me."

I have checked many email harvester apps, and none do what I need. They give me too many emails belonging to people who would not be interested in what I have to offer.

But I discovered a way to do this:

  1. Prompt Google with this query: ---> site:Mysite.com "@gmail.com" <--- (where Mysite.com is a website totally dedicated to the subject we are talking about, so it is safe to assume that all those emails WANT my content).

  2. Google can return, say, 300 results of indexed URLs.

  3. Now, there are Chrome add-ons that can grab all the emails on the current page, so if I manually clicked "show more" over and over and ran the add-on, it would do the job, but I cannot do this manually for so many pages.

  4. In the past, you could tell Google to show 100 results per page, but that seems to have been discontinued.

SO... I want to automate going to the next page, scraping, moving on, scraping, and so on until the end; or automate getting the list of all the indexed URLs that the query returns, visiting those pages, extracting the emails, and then progressing to the next page.

This seems simple, but I have not found any way to automate this.

I promise everyone that this newsletter is not about Viagra or Pe$%S enlargement. It is a very serious historical scholarly newsletter that people WANT TO GET.

Thank you all, as always, for superb assistance

Thank you, and have a good day!

Susan Flamingo

1 Comment
2024/08/08
05:23 UTC

16

Oxylabs vs Bright data vs IProyal reviews. Best proxies for data mining?

Data mining pros, what are the best proxy services for data mining? Looking for high-quality residential proxies (not datacenter) that can be used to run large projects without getting burnt too quickly. Tired of wasting money on cheap datacenter stuff that requires constant replacement.

Thoughts on established premium providers like Bright data, Oxylabs, IProyal, etc?

Thanks.

7 Comments
2024/07/25
21:03 UTC

4

What is the best API/Dataset for Maps Data?

Hello everyone,

I am currently building an app that provides information about streets. I need a large dataset with information about every single street in the world (description, length, nearby hotels, etc.).

Is there any API (It’s fine if paid) you recommend for this purpose?

It doesn’t have to be only about streets; just information about places across the whole globe.

And thank you for reading my question! 

0 Comments
2024/06/27
22:26 UTC

5

Data Mining Projects

I want to do a unique, industry-level data mining project for my master's course. I don't want to go with the typical boring and common projects found all over Google.

Please suggest some of the latest industry-level trends in the field of data mining I could work on.

3 Comments
2024/06/26
13:42 UTC

1

Text mining: methods and techniques differences

I'm just learning about text mining, and reading this article https://rpubs.com/vipero7/introduction-to-text-mining-with-r I had some difficulty understanding the difference between methods (TBM, PBM, CBM, and PTM) and techniques (Information Extraction, Information Retrieval, Categorization, Clustering, Visualization, and Summarization). I can't understand how methods and techniques are connected, whether they are alternatives to each other, or whether you first need to choose a method and then carry out the analysis with the techniques using that method. Can someone give me an explanation and an example of when to use methods and when to use techniques? Thanks.

0 Comments
2024/06/04
11:13 UTC

1

Large-scale Wave Energy Farm Dataset question

Sorry if this is not the right place to ask this question, if not then please redirect me.

I'm taking an ML course and have been asked to apply various data mining techniques to THIS dataset. It is about regressing the power output of different configurations (coordinates) of wave energy converters in the cities of Sydney and Perth, with two sets per city: one of 49 converters and one of 100 converters, for a total of four datasets.

My question is: how should I handle this case? Choose the largest dataset and simply work on that? I don't think combining the Sydney and Perth datasets is a good idea (otherwise, why distinguish them in the first place?).

Thank you.

0 Comments
2024/05/21
15:29 UTC

2

In PCA what does the borderline eigenvalues function represent? And which 2-way matrix does it come from?

My professor told us that, of course, it can never be increasing (it is decreasing by definition), but he also told us there is a borderline case (which does not come from a square matrix) that I can't understand. Thank you in advance.

0 Comments
2024/05/01
20:02 UTC

1

A data mining project on a chess database

Hi everyone,

As the final project for my statistics degree, I'm working on data mining techniques with a chess database. I have more than 500,000 chess games, with variables for the number of turns, Elo ratings, and how many times each piece has been moved (for example, B_K_moves is how many times Black has moved the king).

The problem is, I'm supposed to build the decision tree with all the steps, but ... the tree only has 3 levels of depth. This is the tree, and I'm supposed to go through all the steps, but ... it's very simple, and I don't know why the algorithm doesn't use variables like W_Q_moves (how many times White has moved the queen) or B_R_moves (how many times Black has moved a rook).

This is the code I've used with the library caret in R:

control <- trainControl(method = "cv", number = 10)
modelo <- train(Result ~ ., data = dataset, method = "rpart", trControl = control)
print(modelo)
## CART
##
## 212282 samples
## 15 predictor
## 3 classes: ’Derrota’, ’Empate’, ’Victoria’
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 191054, 191054, 191054, 191054, 191053, 191054, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01444892 0.6166044 0.2417333
## 0.02930692 0.5885474 0.1931878
## 0.13442808 0.5668073 0.1448201
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01444892

And the code to plot the tree:

library(rpart.plot)
## Loading required package: rpart
rpart.plot(modelo$finalModel, yesno = 2, type = 0, extra = 1)

As I said, I don't know why the depth is so small, nor what to change in the code to make the tree deeper.

0 Comments
2024/04/30
23:52 UTC

4

Clustering Embeddings - Approach

Hey guys, I'm building a project that involves a RAG pipeline, and the retrieval part was pretty easy: just embed the chunks and then call top-k retrieval. Now I want to incorporate another component that can identify the widest range of 'subtopics' in a big group of text chunks. For example, if I chunk and embed a paper on black holes, it should be able to return the chunks on the different subtopics covered in that paper, so I can then get the subtopic of each chunk. (If I'm going about this wrong and there's a much easier way, let me know.) I'm assuming the correct way to go about this is something like k-means clustering? The thing is, the vector database I'm currently using, Pinecone, is really easy to use but only supports top-k retrieval. What other options are there for something like this? Would appreciate any advice and guidance.
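K-means over the chunk embeddings is indeed a common way to surface subtopics: export the vectors from the database (or embed locally) and cluster them outside of it. A toy pure-Python sketch of the idea, with made-up 2-D "embeddings" standing in for real ones:

```python
def kmeans(points, k, iters=20):
    """Minimal k-means sketch: assign each vector to its nearest centroid,
    then move each centroid to the mean of its members. Real implementations
    use smarter initialization (e.g. k-means++)."""
    centroids = list(points[:k])  # naive init: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        centroids = [tuple(sum(d) / len(cl) for d in zip(*cl)) if cl
                     else centroids[i] for i, cl in enumerate(clusters)]
    return clusters

# Toy 2-D "embeddings": two obvious subtopic groups.
chunks = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),   # subtopic A
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]   # subtopic B
clusters = kmeans(chunks, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

In practice you would run this (or a library implementation) over the real high-dimensional embeddings and then label each cluster, e.g. from its most central chunks.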

0 Comments
2024/04/30
06:39 UTC

1

Scoring scale for KDD2024 conference reviews

Does anyone know what the scoring scale for KDD Conference reviews is this year? I only see numbers proposed by reviewers on OpenReview but cannot find the overall scale anywhere.

0 Comments
2024/04/05
12:36 UTC

4

Historical Stock Market Data

I'm looking to perform some data analysis on stock market data going back about 2 years at 10 second intervals and compare it against real time data. Are there any good resources that provide OHLC and volume data at that level without having to pay hundreds of dollars?

1 Comment
2024/03/16
01:35 UTC

2

Grey-hat email mining

In light of the decision in Meta v. Bright Data, Instagram data mining is back on the table.

What would be a way to market this - is SaaS a good move? I've done plenty of research on how to defeat Meta and their devious anti-scraping mechanisms...but there's no point to this code if it is not profitable.

There are others in this sphere that are charging way too much, so I am clueless as to how (and if) they are getting any customers.

Sorry if this comes off as elementary or trivial, I'm a hacker and coder - not a businessman.

0 Comments
2024/03/12
03:35 UTC

1

Data Mining

I'm diving into big data applications and looking to explore the wide array of data mining tools out there. Can you share your favorite data mining tool that you've used in a big data application?

I'm particularly interested in hearing about tools that shine in specific applications. So, if you've used a tool for something like sentiment analysis, fraud detection, recommendation systems, or any other big data application, I'd love to hear about your experience with it!

0 Comments
2024/03/01
17:49 UTC

1

Any developers here wanting to shape the future of Docker?

0 Comments
2024/03/01
13:24 UTC

6

Mining Twitter using Chrome Extension

I'm looking to mine large amounts of tweets for my bachelor thesis.
I want to do sentiment polarity, topic modeling, and visualization later.

I found TwiBot, a Google Chrome extension that can export tweets to a .csv for you. I just need a static dataset with no updates whatsoever, as it's just a thesis. To export large amounts of tweets, I would need a subscription, which is fine for me as long as it doesn't require me to fiddle around with code (I can code, but it would just save me some time).

Do you think this works? Can I just export, let's say, 200k tweets? I don't want to waste 20 dollars on a subscription if the extension doesn't work as intended.

22 Comments
2024/02/19
20:09 UTC

2

I need help

There is a guy who has been spamming me with phone calls for the last 3 days.

I need more information about him, and all I have is his phone number.

The police can't do anything about it.

Please help me so I can stop him.

2 Comments
2024/02/09
17:27 UTC
