/r/datamining
News, articles and tools for data mining: the process of extracting useful information from large data sets.
Hey there,
After exhaustively searching Google for APIs that would let me get daily keyword search, post, or comment frequency on any platform, I have been unable to find any providers of this type of data. Considering that this is kind of a niche request, I am dropping this inquiry here for the Data Mining Gods of Reddit to assist.
Basically, I'm trying to create an ML model that can predict future increases/decreases in keyword usage (whether that be on Google Search or X posts; doesn't matter) on a daily basis. I've found plenty of providers of monthly average keyword search volume, but I cannot find any way to access more granular, daily search totals for any platform. If you know of any sources for this kind of data, please drop them here... Or just tell me to give up if this is an impossible feat.
I got into an MS AI program at the top CS school in my country by luck. I say luck because my bachelor's degree is not in CS.
In the first semester we studied Mathematical Foundations of AI (Linear Algebra, Calculus, Optimization, Probability) and Foundations of AI (basic Machine Learning and some Deep Learning).
It was a crazy semester for me as I didn't have the prerequisite knowledge for a lot of the material.
I want to choose a comparatively easy course in the next semester that will also help me in the future.
I'm considering taking Data Mining over the Data Engineering and LLM courses because my programming skills are just basic.
My logic is that if I choose easy courses, they will build a foundation for Data Engineering/LLM, which I can take later on.
My goal is to be able to write a thesis in the 3rd semester. Will a Data Mining course be good enough for a thesis, or is data mining kind of outdated?
I belong to a persecuted community in my country. My goal is to leave the country through a PhD route. For that I have to keep my CGPA above 3 and write a good thesis.
Here are the courses I'm going to take. Please tell me if I should skip Data Mining for a specialized course:
1st semester: Mathematical Foundations of AI (core course); Foundations of AI (core course)
2nd semester: Machine Learning (more advanced ML than in the 1st semester, core course); Principles and Techniques of Data Science (elective course, similar to Foundations of AI but more basic); Data Mining (elective)
Are they good enough for a thesis?
I'm performing a Frequent Pattern Mining analysis on a dataframe in pandas.
Suppose I want to find the most frequent patterns for columns A, B and C. I find several patterns, let's pick one: (a, b, c). The problem is that, with high probability, this pattern is frequent just because a is very frequent in column A on its own, and the same goes for b and c. How can I distinguish patterns that are frequent for this trivial reason from others that are frequent for interesting reasons? I know there are many metrics for this, like the lift, but they are all binary metrics, in the sense that I can only calculate them on two-column patterns, not three or more. Is there a way to do this for a pattern of arbitrary length?
One way would be calculating the lift on all possible subsets of length two:
lift(A, B)
lift((A, B), C)
and so on
but how do I aggregate all the results to make a decision?
Any advice would be really appreciated.
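One possible direction, sketched here as an assumption rather than a definitive answer: lift generalizes to patterns of any length if you compare the observed support of the whole pattern with the support you would expect if all columns were independent, i.e. P(a, b, c) / (P(a) · P(b) · P(c)). A value near 1 means the pattern is frequent only because its parts are frequent. The function names and the toy rows below are made up for illustration; with pandas you could get the same row format from something like df[["A", "B", "C"]].to_dict("records").

```python
def support(rows, pattern):
    """Fraction of rows that match every (column, value) pair in `pattern`."""
    hits = sum(
        all(row[col] == val for col, val in pattern.items())
        for row in rows
    )
    return hits / len(rows)


def generalized_lift(rows, pattern):
    """Observed support of the full pattern divided by the support expected
    if every column were independent of the others."""
    expected = 1.0
    for col, val in pattern.items():
        expected *= support(rows, {col: val})
    if expected == 0:
        return float("nan")  # at least one value never occurs
    return support(rows, pattern) / expected


# Hypothetical toy data: one dict per dataframe row.
rows = [
    {"A": "a", "B": "b", "C": "c"},
    {"A": "a", "B": "b", "C": "c"},
    {"A": "a", "B": "x", "C": "y"},
    {"A": "z", "B": "b", "C": "y"},
]

print(generalized_lift(rows, {"A": "a", "B": "b", "C": "c"}))
```

Note that this single number does not tell you which sub-pattern drives the association; a high value for (a, b, c) may come entirely from (a, b), so it can still be worth comparing against the lift of the sub-patterns.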
Scraping data using Twint: I tried to set it up following this Colab notebook.
Let's collect data from twitter using twint library.
Question 1: Why are we using twint instead of Twitter's Official API?
Ans: Because twint requires no authentication, no API, and importantly no limits
import twint

# Create a function to scrape a user's account.
def scrape_user():
    print("Fetching Tweets")
    c = twint.Config()
    # choose username (optional)
    c.Username = input('Username: ')  # I used a different account for this project. Changed the username to protect the user's privacy.
    # choose beginning time (narrow results)
    c.Since = input('Date (format: "%Y-%m-%d %H:%M:%S"): ')
    # no idea, but makes the csv format properly
    c.Store_csv = True
    # file name to be saved as
    c.Output = input('File name: ')
    twint.run.Search(c)

# run the above function
scrape_user()
print('Scraping Done!')
But at the moment I think this does not run properly.
New to scraping. What would you say are the main pros and cons of using traditional proxies vs APIs for a large data scraping project?
Also, are there any APIs worth checking out? Appreciate any input.
As someone with no background in Computer Science, I don't know what the learning outcomes of this book's chapters are. It has an introduction to Hadoop, MapReduce and Finding Similar Datasets.
I'm developing an RSS++ reader for my own use. I already developed an ETL backend that retrieves the headlines from local news sites, which I can then browse with a local viewer. This viewer puts the headlines in chronological order (instead of an editor-picked one), and I can then mark them as seen/read, etc. My motivation is that this saves me a lot of *attention* and therefore time, since I'm not influenced by editorial choices from a news website. I want "reading the news" to be as clear as reading my mail: a task that can be consciously completed. It has been running for a year, and it's been great.
But now my next step is I want to make my own automated editorial filters on content. For example, I'm not interested in football/soccer whatsoever, so if some news article is saved in the category "Sports - Soccer" then I would like to filter them out. That sounds simple enough right? Just add 1 if statement, job done. But mined data is horribly inconsistent, because a different editor will come along (on perhaps a different news site) that will post their stuff in "Sports - Football", so I would have to write another if statement.
At some point I would have a billion other subjects/people/artists I couldn't care less about. In addition I may also want to create exceptions to a rule. E.g. I like F1 but I'm not interested in spare side projects of Lewis Hamilton (like music, etc.). So I cannot simply throw out all articles that contain "Lewis Hamilton", because otherwise I wouldn't see much F1 news anymore. I would need to add an exception whenever the article is recognized to be about Formula 1, e.g. when it is posted in a F1 news feed etc.
I think you get the point.. I don't want to manually write a ton of if-else spaghetti to massage such filters & data feeds. I'm looking for some kind of package/library that can manage this, preferably with some kind of (web) GUI too.
And no, for now I'm not interested in some AI or large language model solution.. I think some software that looks for keywords (with synonyms) in an article with some filtering rules could work pretty well.. perhaps. I tried to write something generic like this many years ago, but it was in Python (I use C# now) and pretty slow.
I'm just throwing this idea/question out there on the off chance I'm oblivious to some OSS package/library that solves this problem. Does anyone have ideas, suggestions or inspiration?
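To make the idea concrete, here is a minimal sketch of the kind of declarative rule set described above: each rule has a set of trigger terms (synonyms) and an optional set of exception terms. It is written in Python only for brevity; the Rule class, the blocking_rule function and the example terms are all hypothetical, not taken from any existing package, and the same shape would port directly to C#.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    name: str
    block_terms: set  # any of these synonyms triggers the rule
    unless_terms: set = field(default_factory=set)  # ...unless one of these also appears


def blocking_rule(article_text, rules):
    """Return the name of the first rule that filters the article, or None."""
    # Plain substring matching on lowercased text; a real version would
    # probably want word-boundary matching so short terms do not over-match.
    text = article_text.lower()
    for rule in rules:
        triggered = any(term in text for term in rule.block_terms)
        excepted = any(term in text for term in rule.unless_terms)
        if triggered and not excepted:
            return rule.name
    return None


# Hypothetical rules mirroring the examples in the post.
rules = [
    Rule("soccer", {"soccer", "football"}),
    Rule("hamilton-side-projects", {"lewis hamilton"},
         {"f1", "formula 1", "grand prix"}),
]

for headline in [
    "Sports - Football: local club wins derby",
    "Lewis Hamilton releases new music single",
    "Lewis Hamilton takes pole at the Grand Prix",
]:
    print(headline, "->", blocking_rule(headline, rules))
```

Because the rules are plain data, they could be loaded from a JSON or YAML file and edited through a simple GUI instead of being hard-coded.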
Please direct me elsewhere if you see fit. Unsure where to start. I’m looking to extract specific data points from multiple documents within an online database. The database: https://adviserinfo.sec.gov The data: firm name, city, state, and assets under management. The location: within the database a search must be done for a firm (investment advisory). Example: Abound Wealth. Click latest form ADV filed. A pdf is generated and there is a table on page 5 called type of client. This table has the total of the assets under management.
Is this possible to do?
Can someone give me the ELI5 on what the main pros and cons are of using traditional proxies vs APIs for a large data scraping project?
Also, are there any APIs worth checking out? (apologies in advance if this isn't the right place to ask)
Hi, Dear Friends!
I publish a scholarly newsletter once a week. Many people in my scholarly community want this info. It is free (in the meantime), but they don't even know it exists.
I have done a lot of research this week about harvesting emails and sending them the link to sign up. I know this is technically that four-letter word, SP$#M, and is against the law, but I said to all those self-righteous people who were preaching to me about ethics, "Stop cheating on your tax returns and then come back to preach to me."
I have checked many email harvester apps, and none do what I need. They give me too many emails that would not be interested in what I have to offer.
But I discovered a way to do this:
Prompt Google with this query: site:Mysite.com "@gmail.com" (where Mysite is a website totally dedicated to the subject we are talking about, so it is safe to assume that all those emails WANT my content).
Google can return, say, 300 results of indexed URLs
Now, there are add-ons to Chrome that can get all the emails on the current page, so if I would manually show more, show more, show more, and run the Chrome addon, it does the job, but I cannot manually do this for so many pages.
In the past, you could tell Google to show 100 results per page, but that seems to have been discontinued.
SO... I want to automate going to the next page, scraping, moving on, scraping, etc., until the end, or automating getting the list of all the index URLs that prompt returns, going to those pages, getting the mails, and then progressing to the next page.
This seems simple, but I have not found any way to automate this.
I promise everyone that this newsletter is not about Viagra or Pe$%S enlargement. It is a very serious historical scholarly newsletter that people WANT TO GET.
Thank you all, as always, for superb assistance
Thank you, and have a good day!
Susan Flamingo
Data mining pros, what are the best proxy services for data mining? Looking for high quality resi (not data center) that could be used to run large projects without getting burnt too quickly. Tired of wasting money with cheapo datacenter stuff that requires constant replacement.
Thoughts on established premium providers like Bright Data, Oxylabs, IPRoyal, etc?
Thanks.
Hello everyone,
I am currently building an app that tells about streets. I need a large dataset that has information about every single street in the world (Description, length, Hotels, etc etc etc)
Is there any API (It’s fine if paid) you recommend for this purpose?
It doesn’t have to be about streets, just information about places across the whole globe.
And thank you for reading my question!
I want to do a unique, industry-level data mining project in my master's course. I don't want to go with the typical, boring and common projects that come up on Google.
Please suggest some of the latest industry-level trends in the field of data mining that I can work on.
I'm just learning about text mining, and while reading this article https://rpubs.com/vipero7/introduction-to-text-mining-with-r I had some difficulty understanding the difference between methods (TBM, PBM, CBM and PTM) and techniques (Information Extraction, Information Retrieval, Categorization, Clustering, Visualization and Summarization). I can't understand how methods and techniques are connected, whether they are alternatives to each other, or whether you first need to choose a method and then carry out the analysis with the techniques using that method. Can someone give me an explanation and an example of when to use methods and when to use techniques? Thanks
Sorry if this is not the right place to ask this question, if not then please redirect me.
I'm taking an ML course and am asked to apply the various data mining techniques on THIS dataset. It is about regressing the power output of different configurations (coordinates) of wave energy converters in the cities of Sydney and Perth, with two sets per city: one of 49 converters, the other of 100 converters, for a total of four datasets.
My question is: how should I handle this case? Choose the largest dataset and simply work on it? I don't think combining the Sydney and Perth datasets is a good idea (otherwise why distinguish them in the first place?)
Thank you.
My professor told us of course that it can never be increasing, it is decreasing by definition, but he told us that there is a borderline case (which does not come from a square matrix), but I can’t understand. Thank you in advance
Hi to everyone
For my final degree project in statistics, I'm applying data mining techniques to a chess database. I have more than 500.000 chess games with variables about the number of turns, Elo and how many times each piece has been moved (for example, B_K_moves is how many times Black has moved the King).
The problem is that I'm supposed to build the decision tree with all the steps, but the tree only has a depth of 3 nodes. This is the tree, and I'm supposed to do steps like pruning, but it's very simple and I don't know why the algorithm doesn't use variables like W_Q_moves (how many times White has moved the queen) or B_R_moves (how many times Black has moved a rook).
This is the code I've used with the library caret in R:
control <- trainControl(method = "cv", number = 10)
modelo <- train(Result ~ ., data = dataset, method = "rpart", trControl = control)
print(modelo)
## CART
##
## 212282 samples
## 15 predictor
## 3 classes: ’Derrota’, ’Empate’, ’Victoria’
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 191054, 191054, 191054, 191054, 191053, 191054, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.01444892 0.6166044 0.2417333
## 0.02930692 0.5885474 0.1931878
## 0.13442808 0.5668073 0.1448201
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01444892
And the code to plot the tree:
library(rpart.plot)
## Loading required package: rpart
rpart.plot(modelo$finalModel, yesno = 2, type = 0, extra = 1)
As I said, I don't know why the depth is so small, and I don't know what to change in the code to make the tree deeper.
Hey Guys. I'm building a project that involves a RAG pipeline and the retrieval part for that was pretty easy - just needed to embed the chunks and then call top-k retrieval. Now I want to incorporate another component that can identify the widest range of like 'subtopics' in a big group of text chunks. So like if I chunk and embed a paper on black holes, it should be able to return the chunks on the different subtopics covered in that paper, so I can then get the sub-topics of each chunk. (If I'm going about this wrong and there's a much easier way let me know) I'm assuming the correct way to go about this is like k-means clustering or smthn? Thing is the vector database I'm currently using - pinecone - is really easy to use but only supports top-k retrieval. What other options are there then for something like this? Would appreciate any advice and guidance.
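For what it's worth, the clustering step does not have to happen inside the vector database: if the chunk embeddings are also available client-side, k-means can run on them directly and the chunk nearest each centroid can stand in for that subtopic. Below is a minimal, dependency-free sketch of that idea. The toy 2-D vectors, the function names and the farthest-point initialisation are all assumptions for illustration; in practice a library implementation such as scikit-learn's KMeans would be the more usual choice.

```python
import math


def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def assign(vectors, centroids):
    """Label each vector with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: dist(v, centroids[j]))
            for v in vectors]


def kmeans(vectors, k, iterations=20):
    # Deterministic farthest-point initialisation (an assumption; random
    # or k-means++ seeding is more common).
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(dist(v, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iterations):
        labels = assign(vectors, centroids)
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign(vectors, centroids), centroids


def representatives(vectors, labels, centroids):
    """Index of the chunk closest to each non-empty cluster's centroid."""
    reps = []
    for j, c in enumerate(centroids):
        idx = [i for i, lab in enumerate(labels) if lab == j]
        if idx:
            reps.append(min(idx, key=lambda i: dist(vectors[i], c)))
    return reps


# Hypothetical 2-D "embeddings" forming three obvious groups.
embeddings = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1],
              [5.1, 5.0], [10.0, 0.0], [10.1, 0.1]]
labels, centroids = kmeans(embeddings, k=3)
print(labels, representatives(embeddings, labels, centroids))
```

The number of clusters k is a guess about how many subtopics a document has, so it usually needs to be tuned per document or chosen with a heuristic.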
Does anyone know what the scoring scale for KDD Conference reviews is this year? I only see numbers proposed by reviewers on OpenReview but can not find the overall scale anywhere.
I'm looking to perform some data analysis on stock market data going back about 2 years at 10 second intervals and compare it against real time data. Are there any good resources that provide OHLC and volume data at that level without having to pay hundreds of dollars?
In light of the decision on Meta v. Bright Data, Instagram data mining is back on the lunch table.
What would be a way to market this - is SaaS a good move? I've done plenty of research on how to defeat Meta and their devious anti-scraping mechanisms...but there's no point to this code if it is not profitable.
There are others in this sphere that are charging way too much, so I am clueless as to how (and if) they are getting any customers.
Sorry if this comes off as elementary or trivial, I'm a hacker and coder - not a businessman.
I'm diving into big data applications and looking to explore the wide array of data mining tools out there. Can you share your favorite data mining tool that you've used in a big data application?
I'm particularly interested in hearing about tools that shine in specific applications. So, if you've used a tool for something like sentiment analysis, fraud detection, recommendation systems, or any other big data application, I'd love to hear about your experience with it!
I'm looking to mine large amounts of tweets for my bachelor thesis.
I want to do sentiment polarity, topic modeling, and visualization later.
I found TwiBot, a Google Chrome Extension that can export them in a .csv for you. I just need a static dataset with no updates whatsoever, as it's just a thesis. To export large amounts of tweets, I would need a subscription, which is fine for me if it doesn't require me to fiddle around with code (I can code, but it would just save me some time).
Do you think this works? Can I just export... let's say, 200k worth of tweets? I don't want to waste 20 dollars on a subscription if the extension doesn't work as intended.