/r/rstats


A subreddit for all things related to the R Project for Statistical Computing. Questions, news, and comments about R programming, R packages, RStudio, and more.

PLEASE READ THIS BEFORE POSTING

Welcome to /r/rstats - the subreddit for all things R (the programming language)!

For code problems, Stack Overflow is a better platform. For short questions, the Twitter #rstats hashtag is a good place. For longer questions or discussions, RStudio Community is another great resource.

If your account is new, your post may be automatically flagged and removed. If you don't see your post show up, please message the mods and we'll manually approve it.

Rules:

  1. Be polite and good to each other.
  2. Post only R-related content. This also means no "Why is Other Language better than R?" threads
  3. No blatant self-promotion ("subscribe to my channel!"). This includes affiliate links!
  4. No memes (for that, go to /r/rstatsmemes/)

You can also check out our sister sub /r/Rlanguage


82,181 Subscribers

0

Struggling with Marginal Effects

(Using the marginaleffects package)

I am trying to see the marginal effect of various policy objectives on the success of the policy

For example if the code is:

logit <- glm(success ~ factor(objective), data = data, family = binomial(link = "logit"))

Where success is a 0/1 outcome and objective is a factor with 8 categories.

When I try to use the plot_slopes() function I only get interval plots, but I was expecting a fitted line going from 0 to 1. The intervals look about the same from 0 to 1, yet avg_slopes(), to my understanding, shows an effect.

Any help is greatly appreciated!
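A minimal sketch with simulated data (the column names mirror the post; the data are made up): with a single factor predictor, plot_slopes() shows pairwise contrasts as point-range intervals, while the fitted 0-to-1 curve comes from the model's predictions instead.

```r
# Simulated stand-in for the real data (assumption: 8 objective categories)
set.seed(1)
dat <- data.frame(
  objective = factor(sample(1:8, 400, replace = TRUE)),
  success   = rbinom(400, 1, 0.5)
)
logit <- glm(success ~ objective, data = dat, family = binomial(link = "logit"))

# Predicted probability of success for each objective level (base R)
probs <- predict(logit, newdata = data.frame(objective = factor(1:8)),
                 type = "response")
probs

# With marginaleffects, the analogous calls would be:
# marginaleffects::plot_predictions(logit, condition = "objective")  # fitted probabilities
# marginaleffects::avg_slopes(logit)                                 # contrasts vs. the reference level
```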

2 Comments
2024/05/17
06:18 UTC

5

Looking up function documentation as a beginner
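For beginners, base R's built-in help system covers most of this; a quick sketch of the usual entry points:

```r
?mean                  # help page for a function you know by name
help("mean")           # the same, as a regular function call
??median               # search all installed help pages by keyword
help.search("median")  # the long form of ??
example(mean)          # run the examples from a help page
```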

1 Comment
2024/05/17
05:00 UTC

1

What the hell am I doing wrong...

Why is 999 being converted to "NA" for one variable and "<NA>" for another? I'm a relatively new R user... please be nice :)

str(excel_data$Comm_with_ECHQ)

num [1:56] 5 5 999 5 5 999 4 4 3 999 ...

tail(excel_data$Comm_with_ECHQ)

[1] 999 999 999 4 4 5

str(excel_data$Enjoy_Mentor)

num [1:56] 5 5 5 5 4 5 5 5 5 5 ...

tail(excel_data$Enjoy_Mentor)

[1] 5 5 4 999 5 5

excel_data$Comm_with_ECHQ[excel_data$Comm_with_ECHQ == 999] <- NA

excel_data$Enjoy_Mentor[excel_data$Enjoy_Mentor == 999] <- NA

excel_data$Comm_with_ECHQ <- factor(excel_data$Comm_with_ECHQ, levels = c(1:5))

excel_data$Enjoy_Mentor <- factor(excel_data$Enjoy_Mentor, levels = c(1:5))

str(excel_data$Comm_with_ECHQ)

Factor w/ 5 levels "1","2","3","4",..: 5 5 NA 5 5 NA 4 4 3 NA ...

tail(excel_data$Comm_with_ECHQ)

[1] <NA> <NA> <NA> 4 4 5

Levels: 1 2 3 4 5

str(excel_data$Enjoy_Mentor)

Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 4 5 5 5 5 5 ...

tail(excel_data$Enjoy_Mentor)

[1] 5 5 4 <NA> 5 5

Levels: 1 2 3 4 5
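For what it's worth, the recoding above works as intended; the two spellings are just print conventions for numeric vectors versus factors. A minimal sketch:

```r
# Numeric vectors print missing values as NA; factors print them as <NA>
# so a missing value can't be confused with a hypothetical level whose
# label is the string "NA".
x <- c(5, 999, 4)
x[x == 999] <- NA
print(x)                  # numeric: prints NA

f <- factor(x, levels = 1:5)
print(f)                  # factor: prints <NA>, plus Levels: 1 2 3 4 5
```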

6 Comments
2024/05/16
22:23 UTC

0

How to Install Packages to Projects (Not Pollute Global Environment)

In other languages (e.g. Python with Anaconda), a separate project can be created and the libraries are installed per project. This allows for clean code (selecting the specific library/version for a given project), and the project can be shared (GitHub/Docker, etc.)

What's the easiest way to do this with R? I just found out about RStudio projects, but when I go to install a package, it wants to install it globally. I would want to install the specific libraries I use, per project, into that project's folder.

Thanks!
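A common way to get Anaconda-style per-project libraries in R is the renv package (an assumption about what you're after, not something named in the post). A sketch of the workflow:

```r
# From inside an RStudio project:
# install.packages("renv")
renv::init()                 # create a project-local library + renv.lock
install.packages("ggplot2")  # now installs into the project library
renv::snapshot()             # record package versions in renv.lock
# Collaborators (or Docker builds) then run:
renv::restore()              # reinstall the exact recorded versions
```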

7 Comments
2024/05/16
18:33 UTC

6

👩‍💻👨‍💻 Calling all R enthusiasts!

Gergely Daróczi and the Budapest Users of R Network have some exciting updates to share. After facing pandemic challenges, they're back with in-person events on bioinformatics, large language models, and more.

"In a university room, it felt like there were only a dozen R users from academia. However, a lot has changed since then, as we now have almost 2,000 members in the local R User group, which exceeded my original expectations for such a small country like Hungary. It has been an interesting and great experience."

Discover their journey and how they're empowering the R community in Hungary. Whether you're a seasoned R user or just starting, this is a community you don't want to miss. Join the conversation and connect with fellow R users!

Read more: https://www.r-consortium.org/blog/2024/05/16/gergely-daroczis-journey-empowering-r-users-in-hungary 

1 Comment
2024/05/16
17:01 UTC

1

display p value with more than 1 decimal in gtsummary

How can I show a p-value with 3 decimal places in add_p() of gtsummary?
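One way (a sketch, using gtsummary's built-in trial data): add_p() accepts a pvalue_fun argument, and style_pvalue() supports up to 3 digits.

```r
library(gtsummary)

# Format the add_p() column with 3 decimal places
trial %>%
  tbl_summary(by = trt) %>%
  add_p(pvalue_fun = function(x) style_pvalue(x, digits = 3))

# The formatter on its own:
style_pvalue(0.12345, digits = 3)
```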

2 Comments
2024/05/16
15:49 UTC

2

Binomial glm help

Hey I'm a biologist and I did my experiment checking different bacteria and their effect on my live model. So my data is basically 6 replicates of each bacteria and 10 live models in each replicate. I did a basic line graph with standard deviation. But one of my colleagues suggested that I do a binomial glm. I followed some tutorials online and got the values but I don't know how to best present this data in a graph.
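A minimal sketch with simulated data (hypothetical names; 3 bacteria, 6 replicates, 10 animals each): fit the GLM on successes/failures, then plot predicted survival proportions with 95% intervals, e.g. as points with error bars.

```r
set.seed(1)
d <- data.frame(
  bacteria = rep(c("A", "B", "C"), each = 6),
  alive    = rbinom(18, size = 10, prob = rep(c(0.8, 0.5, 0.3), each = 6))
)
d$dead <- 10 - d$alive

# Binomial GLM on (successes, failures) per replicate
fit <- glm(cbind(alive, dead) ~ bacteria, data = d, family = binomial)

# Predicted proportions with CIs, back-transformed from the link scale
nd <- data.frame(bacteria = c("A", "B", "C"))
p  <- predict(fit, nd, type = "link", se.fit = TRUE)
nd$prop  <- plogis(p$fit)
nd$lower <- plogis(p$fit - 1.96 * p$se.fit)
nd$upper <- plogis(p$fit + 1.96 * p$se.fit)
nd
# ggplot2::ggplot(nd, aes(bacteria, prop)) +
#   geom_pointrange(aes(ymin = lower, ymax = upper))
```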

7 Comments
2024/05/16
14:16 UTC

1

Create dataframe with a decreasing sequence per column (for weights)

Hello! May I ask how I can create a data frame like this? As shown, the sequence stops at its respective year row. I wish to use this data as weights.

Thank you very much!
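Without the screenshot it's hard to be sure of the exact layout, but if the idea is that column j counts down to 1 at its own year row and is NA after that, one base R sketch (the size n = 5 is an assumption):

```r
n <- 5  # number of years (a guess; the original table didn't survive)
w <- sapply(seq_len(n), function(j) c(seq(j, 1), rep(NA, n - j)))
colnames(w) <- paste0("year", seq_len(n))
w
```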

4 Comments
2024/05/16
13:28 UTC

1

Degrees of freedom in LSD pairwise comparison is deemed infinite. Why?

Hello all!

I can give you all more information about my model if you would like, but I would like to keep this simple. I ran a zero-inflated negative binomial mixed model (glmmTMB), saved the model, and calculated its estimated marginal means (emmeans). Then I compared those estimated marginal means against each other. Instead of my df being listed as a value, they are listed as "Inf", meaning infinite. I have no idea why. I have done similar tests in SPSS before and I have always received df.

An example of the code I ran was:

contrast(estimated marginal means of ZINB model, method = "pairwise", adjust = "bonferroni")

I received a message "NOTE: Results may be misleading due to involvement in interactions" and the results below:

 contrast              estimate    SE  df z.ratio p.value
 Diploid - Tetraploid     0.733 0.224 Inf   3.270  0.0032
 Diploid - Triploid       0.020 0.226 Inf   0.088  1.0000
 Tetraploid - Triploid   -0.713 0.227 Inf  -3.144  0.0050

Results are averaged over the levels of: P 
Results are given on the log (not the response) scale. 
P value adjustment: bonferroni method for 3 tests 

Again - I am happy to share all my code. Thank you all!

Edit: Ben Bolker, the man himself, has information about this in his GLMM FAQ. Anyway, it seems that df of GLMMs cannot be computed yet (if ever). https://stackoverflow.com/questions/73536308/how-to-get-emmeans-to-print-degrees-of-freedom-for-glmer-class

4 Comments
2024/05/16
00:27 UTC

11

Enhancing R: The Vision and Impact of Jan Vitek's MaintainR Initiative

Join us as we delve into Jan Vitek's MaintainR Initiative, aiming to provide essential maintenance to prolong the usefulness of the R ecosystem.

"Our effort is focused on providing the necessary maintenance to prolong R's usefulness." - Jan Vitek

Read the full article: Enhancing R: The Vision and Impact of Jan Vitek's MaintainR Initiative

0 Comments
2024/05/15
21:10 UTC

2

Package for text classification (R)

Hi all

I work on a project in which I classify units based on their names using a description of the categories used to classify them with. I have tried dictionary approaches, but would like to use a more context based classification approach based on the descriptions.

Which packages do you have the best experience with and can you provide code examples hereof?

Thanks!

3 Comments
2024/05/15
20:22 UTC

2

tensorflow package error in R

Hi. I am currently running deep learning code in R using the reticulate, keras, and tensorflow packages, and I have got an error from the tensorflow package. My Python version is 3.11.4. Would it be possible to help me solve this error? Thanks a lot.

Error: Valid installation of TensorFlow not found.
Python environments searched for 'tensorflow' package:
  C:\Users\Sony\Documents\.virtualenvs\r-reticulate\Scripts\python.exe
Python exception encountered:
  Traceback (most recent call last):
    File "C:\Users\Sony\AppData\Local\R\win-library\4.3\reticulate\python\rpytools\loader.py", line 122, in _find_and_load_hook
      return _run_hook(name, _hook)
    File "C:\Users\Sony\AppData\Local\R\win-library\4.3\reticulate\python\rpytools\loader.py", line 96, in _run_hook
      module = hook()
    File "C:\Users\Sony\AppData\Local\R\win-library\4.3\reticulate\python\rpytools\loader.py", line 120, in _hook
      return _find_and_load(name, import_)
  ModuleNotFoundError: No module named 'tensorflow'
You can install TensorFlow using the install_tensorflow() function.
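As the error message itself suggests, the usual fix is to install TensorFlow into the virtualenv that reticulate is searching; a sketch (the envname below matches the environment named in the error message):

```r
# install.packages("tensorflow")                          # R-side package, if missing
tensorflow::install_tensorflow(envname = "r-reticulate")  # installs the Python package
# Then restart R before loading keras/tensorflow again.
```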

1 Comment
2024/05/15
18:32 UTC

7

Running R project in a shared google drive folder

Hey All,

I am hoping to run an R project in a shared google drive folder with my lab so others can process weekly data. I have had issues with files getting updated and other weirdness when I have attempted this before. I was wondering if anyone has experience with making this functional or some other solution that would be helpful to let non-programming people be able to run my scripts on csv files in the easiest way possible.

17 Comments
2024/05/15
18:08 UTC

2

Trouble conceptualizing how I can fix my 2-way RMANOVA when my current code spits out weird degrees of freedom.

Basically what the title says:

I am trying to conduct a two-way repeated measures ANOVA in rstudio. I have a dataset that's got columns for "Condition", "Intox_score", "Point", "Day", and "ID".

I'd like to look at intox score, over time (Point - broken down into 1-12) by Condition (T, F, M).

My output looks like this:

Error: Within
                 Df Sum Sq Mean Sq F value   Pr(>F)
Point             1   15.4  15.444  14.774 0.000144 ***
Condition         2   36.8  18.410  17.611 5.13e-08 ***
Point:Condition   2    0.4   0.176   0.169 0.844842
Residuals       352  368.0   1.045

I believe the issue is that R is treating every single row as if it were an individual subject, and that is what's creating the problem. That being said, I cannot wrap my mind around how I would need to update things to remedy this. Am I using the right test for this?

Code pasted below. Happy to add detail if it'd be helpful. Any help is much appreciated!

Code:

behint_rm_anova <- aov(Intox_score ~ Point * Condition + Error(ID/Point), data = Behavioral_intox_data_v4_for_R)

summary(behint_rm_anova)
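One likely culprit (a guess from the Df column above, where Point gets 1 df instead of 11): Point is being read as numeric. A sketch with simulated data and an assumed design (Condition between subjects, Point within):

```r
set.seed(1)
d <- expand.grid(subj = 1:10, Point = factor(1:12), Condition = c("T", "F", "M"))
d$ID <- factor(paste(d$Condition, d$subj))  # unique subject label per condition
d$Intox_score <- rnorm(nrow(d))

# With Point as a factor, its Df should now be 11, not 1
fit <- aov(Intox_score ~ Point * Condition + Error(ID / Point), data = d)
summary(fit)
```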

7 Comments
2024/05/15
18:03 UTC

1

Creating a new data frame from values and column names from other data frames/ t-test results

I had a data frame that was like this:

Method   Alex   Joe
A        1.23   2.34
B        3.21   4.32

Then I did a T-test and was able to get the p.values.

However, I am now interested in creating a new data frame like this bottom table but I am struggling. I want to pull the column names from the first data frame to become rows within a column. Then, I want to use the p-value from the T-test to be in one column as well.

Index   People   P-value
1       Alex     0.51
2       Joe      0.47

This is what I have done so far:

Sample_data <- data.frame(Index = numeric(), People = character(), 'P-value' = numeric())
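A sketch of one way to do it (assuming one t-test per person column; the exact test from the post isn't shown, so a one-sample test stands in):

```r
df1 <- data.frame(Method = c("A", "B"), Alex = c(1.23, 3.21), Joe = c(2.34, 4.32))

people <- setdiff(names(df1), "Method")  # column names become rows
pvals  <- sapply(people, function(p) t.test(df1[[p]], mu = 0)$p.value)

result <- data.frame(Index   = seq_along(people),
                     People  = people,
                     P_value = unname(pvals))  # "P-value" isn't a valid R name, hence P_value
result
```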

1 Comment
2024/05/15
17:45 UTC

2

Realtime updating plot in R using echarts4r or other interactive charts

Hi everyone, I was trying to create a Shiny app which generates a live-updating time series trend chart.

I saw this javascript example : https://codesandbox.io/p/sandbox/react-echarts-realtime-56vdc?file=%2Fsrc%2FApp.js%3A4%2C1 and wanted to implement something like this which updates in real time. If anyone could give an example that would be great.

7 Comments
2024/05/15
15:37 UTC

2

Is there an mgcv equivalent for python that can do mixed-effects GAMs?

Asking for a friend

3 Comments
2024/05/15
14:35 UTC

2

📢 Update from the Melbourne R Business User Group!

We're excited to share that the Melbourne R Business User Group, organized by Maria Prokofieva, has evolved to focus on business consultancy. This initiative offers graduate students valuable industry experience and mentorship opportunities. The group is committed to ethical data governance and fostering an inclusive community.

As Maria says, "The backbone of my community comprises my current and former Master's students, who completed a course on business analytics. They are passionate about using R in everyday tasks and already possess some knowledge and experience, which they are happy to share." 🌐📊

Learn more about this amazing journey and the group's evolution here: https://www.r-consortium.org/blog/2024/05/13/the-evolution-of-melbournes-business-analytics-and-r-business-user-group

0 Comments
2024/05/15
14:22 UTC

0

Finding proportion mediated by levels of moderator

Hi everyone,

I’m running a moderated mediation model. I need to find the proportion mediated by different levels of the moderator (binary - yes/no) variable.

Would I simply run the mediation model once with the sample only for those that selected “yes” and then those that selected “no” and calculate proportion mediated for each ?

Or is there another way to do this with conditional indirect effect?

Thank you

0 Comments
2024/05/15
03:40 UTC

1

Predicting with Geographic and Temporal Weighted Regression

Hi,

Wanted to ask if anyone has experience using a GTWR model for prediction. Neither the gtwr nor the GWModel package accepts a trained GTWR model in its predict function.

Wondering if anyone has figured out any workaround.

Cheers

0 Comments
2024/05/14
20:23 UTC

1

Help computing a ratio based on a condition with panel data

Hi everyone, I have panel data in the following format:

ID   X    Date_X       X_lag   Date_X_lag   Ratio
1    19   2020-03-14   45      2020-03-13   0.42
1    46   2020-03-15   19      2020-03-14   2.4
1    40   2020-03-16   46      2020-03-15   0.87
1    45   2020-09-19   40      2020-03-16   1.13

I.e., patients have given blood samples to measure a biomarker X over time, and I computed a ratio between the biomarker X and its prior value (X_lag).

Instead of taking X_lag from the same row as I have done here, I want to take the lowest value of X_lag from the previous rows (if it exists), but only if the difference between those dates is less than 6 days; in that case I want to compute the ratio against that lowest value, otherwise I want to compute the ratio from the same row as before. For example, for the third row I don't want to compute 40/46 but 40/19, because 19 is the lowest value that falls within the 6-day time frame.

I tried the following code, which happens to work with the toy data but not with my actual data, because it always computes the ratio against the lowest value. So I am stuck on how to specify that it should search only prior rows, not future rows:

df <- data.frame(
  ID = c(1, 1, 1, 1),
  X = c(19, 46, 40, 45),
  Date_X = c("14/03/2020", "15/03/2020", "16/03/2020", "19/09/2020"),
  X_lag = c(45, 19, 46, 40),
  Date_X_lag = c("13/03/2020", "14/03/2020", "15/03/2020", "16/03/2020")
)

df$Date_X <- as.Date(df$Date_X, format = "%d/%m/%Y") 
df$Date_X_lag <- as.Date(df$Date_X_lag, format = "%d/%m/%Y")

data <- df %>%
  mutate(diff_date = as.numeric(difftime(Date_X, Date_X_lag, units = "days"))) %>%
  mutate(Ratio = Ratio_function(X, X_lag, Date_X, Date_X_lag, diff_date)) %>%
  group_by(ID) %>%
  mutate(Ratio = ifelse(row_number() == 1, X / X_lag, Ratio))

Ratio_function <- function(X,X_lag, Date_X,Date_X_lag, diff_date) {
min_X_lag <- X_lag[which.min(X_lag)]
min_X_lag_date <- Date_X_lag[which.min(min_X_lag)]

ifelse(diff_date <=7, X/min_X_lag, X/X_lag)
}

If anyone could help me out, I would appreciate it immensely.
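A sketch of one base R approach (using the toy data above): for each row, look only at earlier rows of the same ID, keep lagged values whose date is within 6 days, and divide by the smallest one found; otherwise fall back to the same-row X_lag.

```r
df <- data.frame(
  ID = c(1, 1, 1, 1),
  X = c(19, 46, 40, 45),
  Date_X = as.Date(c("2020-03-14", "2020-03-15", "2020-03-16", "2020-09-19")),
  X_lag = c(45, 19, 46, 40),
  Date_X_lag = as.Date(c("2020-03-13", "2020-03-14", "2020-03-15", "2020-03-16"))
)

ratio_row <- function(i, d) {
  prior <- d[seq_len(i - 1), , drop = FALSE]                      # earlier rows only
  prior <- prior[as.numeric(d$Date_X[i] - prior$Date_X_lag) < 6, , drop = FALSE]
  denom <- if (nrow(prior) > 0) min(prior$X_lag, d$X_lag[i]) else d$X_lag[i]
  d$X[i] / denom
}

df$Ratio <- unlist(lapply(split(df, df$ID), function(d)
  sapply(seq_len(nrow(d)), ratio_row, d = d)))
df$Ratio  # third row is now 40/19 rather than 40/46
```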

4 Comments
2024/05/14
16:40 UTC

1

[Q] Help with simulating non-linear associations with fixed coordinates

Hi, first post here on r/rstats for me.

I'm trying to fit a line between two coordinates (x = 0, y = 50 and x = 30, y = 0).

A simple linear interpolation can be simulated as

data.frame(x = c(1:30), y = seq(from = 50, to = 0, length.out = 30))

Now I wish to simulate a relationship that is non-linear, e.g. one that slightly increases initially while decreasing exponentially, as well as one that decreases exponentially initially and thereafter flattens. Importantly, both lines need to end at (x = 30, y = 0).

Is there any good way of doing this? I thought about simply manually adding the data points and fitting a loess curve, but I would like this to be less manual, preferably using two separate functions. Many thanks in advance!
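One non-manual option (a sketch): exponential curves rescaled so they pass exactly through (0, 50) and (30, 0) for any decay rate k. The concave version drops fast then flattens; the convex one stays flat then drops; a vanishing sine bump adds a slight initial rise without moving the endpoints.

```r
x <- 0:30

# Concave: fast initial decrease, then flattening; hits (0, 50) and (30, 0) exactly
decay_fast <- function(x, k) 50 * (exp(-k * x) - exp(-k * 30)) / (1 - exp(-k * 30))

# Convex: flat at first, steeper later; same fixed endpoints
decay_late <- function(x, k) 50 * (1 - (exp(k * x) - 1) / (exp(k * 30) - 1))

y1 <- decay_fast(x, k = 0.2)
y2 <- decay_late(x, k = 0.2)

# A slight initial rise before the decline, still pinned at both ends
# (sin(pi * x / 30) vanishes at x = 0 and x = 30):
y3 <- y2 + 5 * sin(pi * x / 30)

# plot(x, y1, type = "l"); lines(x, y2, lty = 2); lines(x, y3, lty = 3)
```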

3 Comments
2024/05/14
14:46 UTC

3

Any advice on Multivariate Granger causality test on panel data?

Hi Reddit
My study group and I are trying to do a Granger causality test with multiple x-variables at once. We are using panel data (20 countries over 35 years) with around 7 control variables.
Is this even possible? The plm package's Granger test only seems to allow one x-variable. We have also tried the tseries + vars packages, yet we can't figure out how to control for different countries there.
Thank you for reading through, any help is appreciated

1 Comment
2024/05/14
12:53 UTC

1

Question about weights and building an index

Hi everyone I have a question regarding weighting of data when building an index:

I am attempting to build an index (let's say, an index of living standard for ease of communication purpose) using some large scale survey data from different countries.

The index contains different components which are extracted/calculated from the data. Variables contain responses from opinion surveys and also tests with objective results (e.g. IQ)

Since its such a large sample, the data was collected using stratified sampling. My understanding is, in general analysis where we compare differences or make predictions, we would apply weights to the data so that results is more representative of the actual population.

However since I am building an index here here, I am not sure if I should apply weights.

On one hand it seems to me applying weights would make the results more representative of the population, but on the other hand I do not think it makes sense to apply weights to variables like IQ tests results.

I wonder if you all can give me some answers on the matter. Thanks in advance!

4 Comments
2024/05/14
10:45 UTC

1

Help on McFadden R-squared

Need some help.

Currently, I'm trying to use the modeling approach for a Best-Worst Scaling (BWS) study. Following this guide, I tried to calculate McFadden's R-squared manually for a model without an intercept.

LL0 <- - 90 * 7 * log(12) # the value of log-likelihood at zero  
LLb <- as.numeric(md.out$logLik) # the value of log-likelihood at convergence
1 - (LLb/LL0)  # McFadden's R-squared

Based on the guide given, my best guess is:

90 = number of observations

7 = total number of variables (including the omitted "washfree")

12 = "Frequencies of alternatives: choice"

The issue, however, is that when I tried to perform the calculation on my own study, my McFadden R-squared value came out negative.

Number of observations: 282, number of variables: 13, Frequencies of alternative choice: 4

Where did I go wrong? Perhaps my understanding of the guide is wrong?
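For reference, the structure behind that calculation (a sketch): the null model assigns equal probability 1/J to each of J alternatives in every choice set, so LL0 = -(number of choice sets) * log(J). A negative McFadden R-squared usually means LL0 was computed with the wrong counts, making |LL0| smaller than |LLb|.

```r
# McFadden's pseudo R^2 against an equal-probability null model
mcfadden <- function(LLb, n_sets, n_alts) {
  LL0 <- -n_sets * log(n_alts)  # null log-likelihood: n_sets draws from 1/n_alts
  1 - LLb / LL0
}

# Sanity check: a model exactly as good as the null gives 0
mcfadden(LLb = -100 * log(4), n_sets = 100, n_alts = 4)
```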

4 Comments
2024/05/14
06:23 UTC

4

Hexbin plots in R

I'm having trouble improving on this plot, as it does not look aesthetically pleasing. What are some ways that the plots can be further improved?

The code that displays this plot is:

library(ggplot2)

# Create a hexbin plot with the full dataset and custom fill colors based on count
ggplot(MSD, aes(x = tempo, y = artist_familiarity)) +
  geom_hex(aes(fill = ..count..), color = "black") +           # fill based on count
  scale_fill_gradient(low = "lightblue", high = "darkblue") +  # adjust the gradient color scale
  labs(x = "Tempo", y = "Artist Familiarity") +
  ggtitle("Hexbin Plot: Tempo vs Artist Familiarity") +
  theme_minimal()

https://preview.redd.it/ltsue1fpqa0d1.jpg?width=538&format=pjpg&auto=webp&s=969051ecf1631ada0cbff9e80390485a0fb807b1

9 Comments
2024/05/14
01:41 UTC

0

Wandering Redditor Seeking Guidance on CS datasets

Hello strangers,

I’m currently a full time student that switched my focus from wanting to get into the medical field to computer science. Am I procrastinating on my homework atm ? Yes. I was searching through Kaggle for datasets and I ended up on this sub-reddit. Which brings me to ask:

Is there any particular place I can find a dataset in Computer Science that links to a social problem? Any help is appreciated.

9 Comments
2024/05/13
22:54 UTC

0

Why doesn't my p value give the same in gtsummary()?

I have this df

df
# A tibble: 248 × 2
   asignado     mxsitam
   <chr>        <chr>  
 1 Control      No     
 2 Control      No     
 3 Intervencion No     
 4 Intervencion Si     
 5 Intervencion Si     
 6 Intervencion Si     
 7 Control      No     
 8 Intervencion Si     
 9 Control      Si     
10 Control      Si     
# ℹ 238 more rows

I want to use add_difference() and also calculate the p-value of the result obtained.

This is the code.

aticamama %>%
  select(c("asignado",
           mxsitam)) %>%
  mutate(mxsitam= as.integer(if_else(mxsitam== "No", 0,1))) %>%
  tbl_summary(by= "asignado",
              missing = "always",
              digits = list(all_categorical() ~ c(0,1)),
              statistic = list(all_categorical() ~ "{n} ({p})"),
              missing_text= "Casos perdidos",
              percent= "column") %>% 
  add_overall() %>%
  modify_header(label = "") %>%
  add_difference() 

This is the output

https://preview.redd.it/j89g0gpsk80d1.png?width=601&format=png&auto=webp&s=0b78d1ffe6987bc0c33a53080164f0497c20238e

As you can see, my difference is -6.9% and my p-value is 0.5.

But when I use prop.test() to calculate my CI, it gives me another p-value.

aticamama$variable1 <- factor(aticamama$asignado)
aticamama$variable2 <- factor(aticamama$mxsitam)

tabla_contingencia <- table(aticamama$variable1, aticamama$variable2)
tabla_contingencia
> tabla_contingencia
   
    No Si
  0 92 33
  1 82 41

resultado_prueba <- prop.test(tabla_contingencia)

resultado_prueba
> resultado_prueba

2-sample test for equality of proportions with continuity correction

data:  tabla_contingencia
X-squared = 1,1116, df = 1, p-value = 0,2917
alternative hypothesis: two.sided
95 percent confidence interval:
 -0,05236089  0,19102756
sample estimates:
   prop 1    prop 2 
0,7360000 0,6666667 

Now it shows that my p-value is 0,2917. Why?

Also, why does add_p() not give me a CI?
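One piece of the puzzle can be reproduced from the 2x2 table above (a sketch): prop.test() applies Yates' continuity correction by default, which alone changes the p-value; compare with correct = FALSE. Which test add_difference() actually ran appears in the table footnote, and it can be overridden through its test argument.

```r
tab <- matrix(c(92, 33, 82, 41), nrow = 2, byrow = TRUE,
              dimnames = list(c("Control", "Intervencion"), c("No", "Si")))

prop.test(tab)$p.value                   # ~0.29, with continuity correction (as in the post)
prop.test(tab, correct = FALSE)$p.value  # smaller, without the correction
```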

5 Comments
2024/05/13
18:24 UTC
