/r/q?req.query.q -- Subreddit Search

87,887 Subscribers

Free Data Analyst Learning Path - Feedback and Contributors Needed

Hi everyone,

I’m the creator of www.DataScienceHive.com, a platform dedicated to providing free and accessible learning paths for anyone interested in data analytics, data science, and related fields. The mission is simple: to help people break into these careers with high-quality, curated resources and a supportive community.

We also have a growing Discord community with over 50 members where we discuss resources, projects, and career advice. You can join us here: https://discord.gg/FYeE6mbH.

I’m excited to announce that I’ve just finished building the “Data Analyst Learning Path”. This is the first version, and I’ve spent a lot of time carefully selecting resources and creating homework for each section to ensure it’s both practical and impactful.

Here’s the link to the learning path: https://www.datasciencehive.com/data_analyst_path

Here’s how the content is organized:

Module 1: Foundations of Data Analysis

•	Section 1.1: What Does a Data Analyst Do?
•	Section 1.2: Introduction to Statistics Foundations
•	Section 1.3: Excel Basics

Module 2: Data Wrangling and Cleaning / Intro to R/Python

•	Section 2.1: Introduction to Data Wrangling and Cleaning
•	Section 2.2: Intro to Python & Data Wrangling with Python
•	Section 2.3: Intro to R & Data Wrangling with R

Module 3: Intro to SQL for Data Analysts

•	Section 3.1: Introduction to SQL and Databases
•	Section 3.2: SQL Essentials for Data Analysis
•	Section 3.3: Aggregations and Joins
•	Section 3.4: Advanced SQL for Data Analysis
•	Section 3.5: Optimizing SQL Queries and Best Practices

Module 4: Data Visualization Across Tools

•	Section 4.1: Foundations of Data Visualization
•	Section 4.2: Data Visualization in Excel
•	Section 4.3: Data Visualization in Python
•	Section 4.4: Data Visualization in R
•	Section 4.5: Data Visualization in Tableau
•	Section 4.6: Data Visualization in Power BI
•	Section 4.7: Comparative Visualization and Data Storytelling

Module 5: Predictive Modeling and Inferential Statistics for Data Analysts

•	Section 5.1: Core Concepts of Inferential Statistics
•	Section 5.2: Chi-Square
•	Section 5.3: T-Tests
•	Section 5.4: ANOVA
•	Section 5.5: Linear Regression
•	Section 5.6: Classification

Module 6: Capstone Project – End-to-End Data Analysis

Each section includes homework to help apply what you learn, along with open-source resources like articles, YouTube videos, and textbook readings. All resources are completely free.

Here’s the link to the learning path: https://www.datasciencehive.com/data_analyst_path

Looking Ahead: Help Needed for Data Scientist and Data Engineer Paths

As a Data Analyst by trade, I’m currently building the “Data Scientist” and “Data Engineer” learning paths. These are exciting but complex areas, and I could really use input from those with strong expertise in these fields. If you’d like to contribute or collaborate, please let me know—I’d greatly appreciate the help!

I’d also love to hear your feedback on the Data Analyst Learning Path and any ideas you have for improvement.

0 Comments

2024/12/04
05:19 UTC

R-Girls-School Network!

Wow, this is inspiring! Two-year project to establish the R-Girls-School (R-GS) network, addressing the underrepresentation of women, particularly from deprived and ethnically diverse backgrounds, in data science

https://r-consortium.org/posts/empowering-girls-in-data-science-the-r-girls-school-network-initiative/

0 Comments

2024/12/03
22:35 UTC

exploring all options in a logistic regression

This set of code is fairly simple and uses some example from a tutorial online

# import and rename dataset
library(kmed)
dat <- heart
library(dplyr)

# rename variables
dat <- dat |>
  rename(
    chest_pain = cp,
    max_heartrate = thalach,
    heart_disease = class
  )

# recode sex
dat$sex <- factor(dat$sex,
                  levels = c(FALSE, TRUE),
                  labels = c("female", "male")
)

# recode chest_pain
dat$chest_pain <- factor(dat$chest_pain,
                         levels = 1:4,
                         labels = c("typical angina", "atypical angina", "non-anginal pain", "asymptomatic")
)

# recode heart_disease into 2 classes
dat$heart_disease <- ifelse(dat$heart_disease == 0,
                            0,
                            1
)

m3 <- glm(heart_disease ~ .,
          data = dat,
          family = "binomial"
)

# print results
summary(m3)

However, what should I use if I want to automatically run all columns of predictors in dat, or automatically seek the highest AIC model?

1 Comment

2024/12/03
21:17 UTC

R-Universe newest Top Level Project under R Consortium

R Consortium has announced their newest top level project, R-Universe.

R-universe is a platform for improving publication and discovery of research software in R, developed by rOpenSci

R projects that need support over a longer time period are evaluated by the Infrastructure Steering Committee (ISC) for long-term status. Being designated Top Level gives a project guaranteed funding for 3 years, along with a voting seat on the ISC.

https://r-consortium.org/posts/r-universe-named-r-consortiums-newest-top-level-project/

0 Comments

2024/12/03
21:12 UTC

{SLmetrics}: New R package

Hi guys,

I have built an R package on Rcpp and RcppEigen, its (almost) entirely base R and built on S3. The package is all about Machine Learning performance evaluation for supervised applications.

Its currently in pre-release state, and I intend to submit it to CRAN around March. Until then I am looking for testers and collaborators. I would appreciate some feedback from you.

The package closely resembles MLmetrics (hence the name), but is an upgrade as it includes much more, and is way faster. Currently, for 20.000 obs, SLmetrics is between 20-70 times faster than the remaining packages.

Give the package a spin, or visit the repository on GitHub to see what its all about: https://github.com/serkor1/SLmetrics

Best,

1 Comment

2024/12/03
19:30 UTC

9 FLAWS of ‘Summary’ Function You DIDN’T Know About and How to Fix Them Short video for details: https://youtu.be/BxfNyDzULmg

9 Comments

2024/12/03
15:19 UTC

Issue with Rtools (?) and packages lme and matrix

Newbie on R here. I have to do some geostatistical plot on R, and for that I need the lme4 and Matrix packages. When I run my code, I get the error message

function 'cholmod_factor_ldetA' not provided by package 'Matrix'

From some googling the issue seems to be that I need to install a binary version of Matrix. However, when I try, I get the warning

WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:

Except, I already have Rtools installed (4.3, my version of R is 4.3.2 and RStudio 2023.12.0). From other answers online it seems to be a path issue but I don't know how to solve it. Also I'm working on a company laptop and I don't have the privileges to install and uninstall software.

Any help is appreciated!

2 Comments

2024/12/03
12:05 UTC

Ecological Temporal Analysis questions

I collected animal samples without replacement over three time periods from the same locality: seven years apart in the deep-sea (eg, no known seasonality). I want to know whether the mean difference in viral infections changes significantly over time. The data is not normal. Would I consider this as independent (sampled without replacement, and although it's over time there are no inherent patterns eg seasonality that would contribute to variation) and therefore do a Kruskal Wallis test? Or would I need to consider the data as dependent because, although animals were sampled without replacement, they came from the same population? If the latter, are we saying that although no animal was sampled once, the removal of an individual could theoretically influence the viral infections of another in the population? And is it ok to treat the temporal aspect of the data as independent? Would a Friedman test be appropriate?

Thanks!

4 Comments

2024/12/02
18:38 UTC

Advent of code 2024

Hi! Is anyone doing the advent of code this year in R? Most of the people I know are doing other languages, would love to discuss the solutions with anyone interested!

5 Comments

2024/12/02
18:00 UTC

Need Guidance

Hello I need some guidance for this project I recently started.

The dataset I am working with contains information on movie scores and age ratings. The problem I am facing is that the age rating feature contains over 40% missing values. Initially, I dropped all the missing values and arrived at some conclusions. But now the more I think about it the more this approach seems wrong to me. The data on ratings is MNAR. What approach can be considered positive?

5 Comments

2024/12/02
14:32 UTC

Stacked bar chart using employment, unemployment, and inactivity.

Hello, I am a newbie at programming and have only been doing it for a month (no one mentioned programming in my degree but I should have seen it coming). I have used different examples online and made a Frankenstein's monster so far, but nothing seems to work. I am trying to make a stacked bar chart with years on the x axis and the percentage of the population on the y axis to show how the population is made up over the years. I would be grateful for any help and will probably name my first born after you. Sincerely, a drowning first year undergrad.

df <- data.frame(Year= rep (c('2000','2001', '2002', '2003', '2004', '2005',

'2006', '2007', '2008','2009', '2010', '2011',

'2012', '2013', '2014', '2015', '2016','2017',

'2018', '2019', '2020', '2021', '2022', '2023'), each = 1),

Inactive = c(23.3, 23.4, 23.4, 23.3, 23.4, 23.3, 23.0, 23.2, 23.0,

23.1, 23.5, 23.3, 22.8, 22.5, 22.2, 22.1, 21.8, 21.6,

21.3, 21, 21.2, 21.7, 21.8, 21.7),

Unemployed = c(5.4, 5.1, 5.2, 5.0, 4.8, 4.8, 5.4, 5.3, 5.7, 7.6, 7.9,

8.1, 8.0, 7.6, 6.2, 5.4, 4.9, 4.4, 4.2, 3.9, 4.7, 4.6,

3.9, 4.0),

Employed = c(72.5, 72.6, 72.7, 72.8, 72.9, 72.9, 72.8, 72.7, 72.6, 70.9,

70.4, 70.3, 71.0, 71.5, 72.8, 73.6, 74.2, 74.8, 75.4, 75.8,

75.0, 74.7, 75.1, 75.1))

population = (C(Inactive, Unemployed, Employed))

ggplot(df, aes(fill=Population, y=population, x=Year)) +

geom_bar(position="stack", stat="identity")

6 Comments

2024/12/01
19:11 UTC

Daily averaged time series comparison -Linking plankton and aerosols emissions?

Hi everyone, so we have this dataset of daily averaged pytoplankton time series over a full year; coccolithophores, chlorophytes, cyanobacteria, diatoms, dinoflagellates, phaecocystis, zooplankton.
Then we have atmospheric measurements on the same time intervals of a few aerosols species; Methanesulphonic acid, carboxylic acids, aliphatics, sulphates, ammonium, nitrates etc...
Our goal is to establish all the possible links between plankton types and aerosols, we want to find out which planktons matter the most for a given aerosols species.

So here is my question; Which mathematical tools would you use to build a model with these (nonlinear) time series? Random Forest, Wavelets, transfer entropy, fractals analysis, chaos theory, Bayesian statistics? The thing that puzzle me most is that we know there is a lag between the plankton bloom and aerosols eventually forming in the atmosphere, it can take weeks for a bloom to trigger aerosols formation, so far many studies have just used lagged Pearson´s correlation, which I am not too happy with as correlation really isn´t reliable, would you know of any advanced methods to find out the optimal lag? What would be the best approach in your opinion?
I would really appreciate any ideas, so please don´t hesitate to write down yours and I´d be happy to debate it, have a nice Sunday, cheers :)

0 Comments

2024/12/01
09:40 UTC

How do I find if the difference between two slopes is statistically significant?

I ran separate regressions for different ethnic groups to calculate the slopes (ex: BMI vs. Sleep Apnea score). I then combined these slopes into one graph to visually compare them across ethnicities.

How can I statistically test if the differences in slopes between the ethnic groups are significant? I'm researching and cant figure out of I should use a T test, Z test, or ANOVA, and if so, what type?

I have the slope, X&Y intercepts, standard deviation, and standard error. Each ethnic group is a sub-sample pulled from a larger sample pool containing multiple ethnic groups.

9 Comments

2024/11/30
10:01 UTC

I cant use my web crawl

library(rvest)

library(tidyverse)

library(tidytext)

# Function to scrape reviews from an Amazon product page

scrape_amazon <- function(url) {

page <- read_html(url)

reviews <- page %>%

html_nodes(".review-text-content") %>%

html_text()

return(reviews)

}

# Define the target URL

url <- "https://www.amazon.com/NVIDIA-GeForce-RTX-4090-Graphics-Card/dp/B0B17X488H/review/R329975Z1190C"

# Scrape reviews

reviews <- scrape_amazon(url)

# Create a data frame

df <- data.frame(review_text = reviews)

# Clean and tokenize the text

df <- df %>%

unnest_tokens(word, review_text) %>%

anti_join(stop_words)

# Perform sentiment analysis using Bing lexicon

df <- df %>%

inner_join(get_sentiments("bing"))

# Calculate sentiment scores

df %>%

group_by(sentiment) %>%

summarise(n = n())

# Create a word cloud for positive and negative reviews

positive_words <- df %>%

filter(sentiment == "positive") %>%

count(word, sort = TRUE)

negative_words <- df %>%

filter(sentiment == "negative") %>%

count(word, sort = TRUE)

wordcloud(words = positive_words$word, freq = positive_words$n, min.freq = 1, max.words = 100, color = "blue")

wordcloud(words = negative_words$word, freq = negative_words$n, min.freq = 1, max.words = 100, color = "red")

4 Comments

2024/11/30
08:39 UTC

Statistical test in R

Hello: I have a rate of infections per year and want to know two things:

is the reduction in this population significant. So for this one I just want to take the rate for the year and compare it to prior years.
is there an impact of a certain product contributing to this rate? So i have the rate of utilization of this product also calculated by year.

for the second question, I would think Linear regression (infection is dependent and utilization is independent variable? I tried to write this in R but I keep getting error messages, so also if anyone can provide a simplified R code using Tidyverse that would be amazing.

Thank you!

example of my data:

year infection rate (per 1000) utilization rate (%)

2021 4.32 14%

2022 2.11 10%

2023 0.00 11%

2 Comments

2024/11/29
17:52 UTC

zip xlsx's and download from shiny

Hi all,

I'm trying to build an app that lets me upload multiple xlsx files to a shiny app, does something with them, and then puts them into a zip and downloads that. My problem is that my code succesfully creates the xlsx files and zips them, but the files in the zip seem to be empty/corrupted (1KB size, cant open them). Weirdly enough, it works perfectly if I use write_csv() instead, and the created xlsx files in the temp folder are perfect too.

Can anyone help me out here? Many thanks in advance.

output$downrate <- downloadHandler(
    filename = function() {"down.zip"},
    content = function(file) {
      # definition of content to download
      namefile <- as.list(input$files$name)
      namefile <- lapply(namefile, function(x) tools::file_path_sans_ext(x))
      to_dl <- list(
        dataname = namefile,
        data = results_RAC()
      )
     
      # temp dir for the csv's
      twd <- setwd(tempdir())
      on.exit(setwd(twd))
      files <- NULL
     
      # loop on data to download and write individual xlsx's
      for (i in 1:length(to_dl[[1]])) {
 
        fileName <- paste0(to_dl[[1]][i], ".xlsx") # xlsx file name
        fileCont <- as.data.frame(to_dl[[2]][i])
 
        write.xlsx(fileCont, fileName) # write xlsx in temp dir
        files <- c(files, fileName) # store written file name
      }
     
      # create archive from written files
      archive_write_files(file, files)
    }
  )

5 Comments

2024/11/29
16:30 UTC

Feedback on an idea

Hi everyone,

I’m a Management student currently working on a project for my Opportunity Recognition and Business Model Development course. Our team has come up with a social entrepreneurship idea: a Skill Sharing Platform where people can exchange skills and knowledge for free!

Here’s the concept:

You offer what you know (e.g., statistics) and, in return, learn something you’re curious about (e.g., programming). It’s all about creating a direct value exchange between users, making learning accessible and empowering for everyone.

Think of it as Tinder for skill sharing—you’ll be matched with people based on the skills you want to learn and those you can teach.

We’d love your feedback and input! Does this idea resonate with you? Would you be interested in participating in such a platform?

Looking forward to your thoughts and ideas.

Thank you!

1 Comment

2024/11/29
15:18 UTC

Mediation Model Comparison

I'm looking for advise on how to run a mediation model. I have cross sectional data where I believe A -> B -> C.
However, a colleague has suggested the possibility of A->C->B and I can see the possibility. Can I just run both models and do a model comparison? It seems like there are some extra considerations I should be thinking about.

6 Comments

2024/11/27
17:55 UTC

Curly brackets

I’m trying to input this code for a bio stats problem set and I keep getting error: unexpected symbol. Am I putting the curly brackets into R wrong? I know not to put in the text associated with the hashtag. Everything with an indent goes on the same line, right? I’ve read online that the end bracket goes on its own line, but how does that work? Do I put in all the code from the first line and then the indented code on the same line, with the opening bracket, and then the ending bracket by itself on its own line after?

6 Comments

2024/11/26
23:12 UTC

Filtering census data by two variables?

I dont know if this is the right place to ask, but I am using tidycensus package to filter ACS census data and really struggle with filtering by two variables.

I want to filter demographics like race and ethnicity so that its just those under 45 years old. I have googled forever and even tried to manually make this table and I am not sure it is possible. does anyone know if i can do this with tidycensus?

16 Comments

2024/11/26
21:21 UTC

How to join multiple excel sheets into 1 dataframe using inner joins?

Hi! So I have this excel file with 7 different sheets (different waves of research) that I loaded into R like this:

wages <- read_csv('C:\\file_directory\\wages.xlsx')
read_excel_allsheets <- function(wages) {
sheets <- readxl::excel_sheets(wages)
x <-    lapply(sheets, function(X) readxl::read_excel(wages, sheet = X))
names(x) <- sheets
x
}
my_excel_list <- read_excel_allsheets("wages.xlsx")
list2env(my_excel_list, .GlobalEnv)

So far so good. But I have a problem with joining all the waves together into 1 dataframe. I tried:

wages <- wages %>%
inner_join(wages, by = "wave")
glimpse(wages)

but it returned an error:

which I don't get because the "wave" column is right there. :(
What am I doing wrong?

Error in `inner_join()`:
! Join columns in `x` must be present in the data.

4 Comments

2024/11/25
00:47 UTC

Fatal errors

Hi all

I cannot run any code without getting the error: "Fatal error: Unexpected exception: bad allocation". THen the session disconnects/aborts.

I have looked everywhere for a solution. I tried verifying xfun and the xfun is 0.43.

Any suggestions?

0 Comments

2024/11/24
19:58 UTC

Applying a negative subset to a list

I have a list of vectors of varying lengths and I want to get the same list but with the two first elements of each vector removed. So basically mylist[[1]][-c(1,2)] but for every vector in the list. Is that possible with lapply or do I need to loop and join? I've tried the lapply "[[" thing but it doesn't seem to support negative subsets nor multiple elements.

2 Comments

2024/11/22
19:22 UTC

Train-test split evaluation

I have an sPLS model where I split my data into a training set (80%) and testing set (20%). My model is trained and cross-validated on Y (continous) and X (continous, n=24 variables).

My assumption is for a linear association between Y and x within the model.

After tuning my model how do I compare the performance of the model? As in, I will use my training model to predict the Y values of the testing set by use of X and then I now have predicted values of Y versus actual values of Y in the testing dataset.

Am I supposed to use a pearson/spearman and see how high the r value is? Use linear models and do a paired t-test? Other?

1 Comment

2024/11/22
16:25 UTC

Question about using dyplr for growth rates?

Hi, I'm sorry if this is a silly question; I'm totally new to R. I have a dataset that looks like this (but much larger and with more dates), and I was trying to use dyplr to calculate the daily growth rate of each plant. I have this code written, but the values that I'm getting appear to be the growth rate on each day including both species. I assumed the group_by function would separate them? How would I go about doing that?

flowersprop<- flowers %>%

group_by(species) %>%

arrange(species, day) %>%

mutate(growthrate=(height-lag(height))/lag(height))

species	Day	plantid	height
c	1	c24	30
c	1	c12	24
s	1	s1	0
s	1	s2	2
c	3	c24	35
c	3	c12	23
s	3	s1	3
s	3	s2	5

3 Comments

2024/11/21
13:46 UTC

Math for programmers 2024 book bundle. Manning

0 Comments

2024/11/21
11:32 UTC

Very slow code execution with 4.4.0

Hi,

I recently upgraded to 4.4.0 version of R and 2024.04.1 build 748 version of RStudio. Since then, the code execution has slowed down considerably.

Any ideas on how it can be fixed.

Regards

1 Comment

2024/11/20
21:45 UTC

Hi,

I recently installed R 4.4.0 version and upgraded RStudio to 2024.04.01 build 748 version. Since then, the code execution has become painfully slow.

Any ideas how it can be made better?

Thanks!

2 Comments

2024/11/20
21:43 UTC

Why are my plm() and felm() results so different?

Hi everyone. I'm taking a data analysis and research class and I am running into a couple issues in learning fixed effects vs random effects for panel data.

The first issue I have is with the code below, specifically model t1m2 and the test model. I understand theoretically what fixed effects are and in this case both models should be doing fixed effects for person ("nr") and years. However, I wanted to test that using both plm() and felm() would produce the same results but they don't. As you can see in 2 and 3 in the output table, the coefficients are completely different. Could anyone explain why I'm getting this difference?

Additionally, if anyone could explain to me exactly how my RE model differs from the others I would also really appreciate that because I'm struggling to understand what it actually does. My current understanding is that it basically takes into account that observations in the data are not independent and may be correlated along unit (nr) and year, then does something weighting and uses that to change the coefficient from the pooled model? And by doing this it creates one intercept, unlike FE, but also provides a better estimate than the pooled model. But also if there are really strong unobserved factors related to to the units or years then fixed effects are still needed? Is any of this accurate? Thanks for the help.

================================================
                     Dependent variable:        
             -----------------------------------
                            lwage               
               OLS     panel     felm    panel  
                       linear            linear 
               (1)      (2)      (3)      (4)   
------------------------------------------------
union        0.169*** 0.070*** 0.083*** 0.096***
             (0.018)  (0.021)  (0.019)  (0.019) 
                                                
married      0.214*** 0.242*** 0.058*** 0.235***
             (0.016)  (0.018)  (0.018)  (0.016) 
                                                
Constant     1.514***                   1.523***
             (0.011)                    (0.018) 
                                                
------------------------------------------------
Observations  4,360    4,360    4,360    4,360  
================================================
Note:                *p<0.1; **p<0.05; ***p<0.01

t1 = list(
  t1m1 = lm(lwage ~ union + married, data=wagepan_data, 
          na.action = na.exclude), #pooled
  t1m2 = plm(lwage ~ union + married, data = wagepan_data, model = "within", 
             index = c("nr", "year"), na.action = na.exclude), #fixed effect
  # could also build FE model with felm() -> 
  test = felm(lwage ~ union + married | nr + year, data = wagepan_data , na.action = na.exclude),
  t1m3 = plm(lwage ~ union + married, data = wagepan_data, model = "random", 
            index = c("nr", "year"),na.action = na.exclude) # random effect
)

stargazer(t1, type = "text", keep.stat = "n")

2 Comments

2024/11/20
19:32 UTC