/r/rstats

A subreddit for all things related to the R Project for Statistical Computing. Questions, news, and comments about R programming, R packages, RStudio, and more.

PLEASE READ THIS BEFORE POSTING

Welcome to /r/rstats - the subreddit for all things R (the programming language)!

For code problems, Stack Overflow is a better platform. For short questions, Twitter #rstats tag is a good place. For longer questions or discussions, RStudio Community is another great resource.

If your account is new, your post may be automatically flagged and removed. If you don't see your post show up, please message the mods and we'll manually approve it.

Rules:

  1. Be polite and good to each other.
  2. Post only R-related content. This also means no "Why is Other Language better than R?" threads.
  3. No blatant self-promotion ("subscribe to my channel!"). This includes affiliate links!
  4. No memes (for that, go to /r/rstatsmemes/)

You can also check out our sister sub /r/Rlanguage

/r/rstats

81,646 Subscribers

2

Comparing Subsets

I don't use R that often, so I apologize for the novice question. I have a dataset and I need to sort and describe the data by sex and location. I essentially need four summary descriptions: females in Seattle, males in Seattle, females in Denver, and males in Denver.

I can subset "Sex" into "Male" and "Female" and subset "Location" into "Seattle" and "Denver." I don't know how to combine those two subsets to generate the summary statistics for the other 11 factors based on those two subsets...

I'm getting errors if I use something like describe(mydata$Female~mydata$Denver) or describe(mydata, group = Female$Seattle), so I don't know where to go from here...
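For reference, a minimal sketch of the usual cross-classified approach, assuming the psych package (the source of describe()) and columns named Sex and Location:

library(psych)

# describeBy() splits the data by every Sex x Location combination and
# returns describe()-style summary statistics for each subset
describeBy(mydata, group = list(mydata$Sex, mydata$Location))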

1 Comment
2024/04/26
20:56 UTC

5

emmeans multiplicity adjustment for CIs keeps reverting to Bonferroni?

I'm reporting results of a binomial GLM predicting a binary outcome (agreement vs. disagreement with a survey item) from a categorical predictor (primary language). I've used emmeans() to test the effect of each language using method = del.eff (since pairwise would produce too many comparisons).

I'm trying to obtain the 95% CIs for each effect for plotting; however, emmeans keeps applying the Bonferroni correction to the CIs it calculates, even when I specify the Holm correction. This is problematic because some effects that are statistically significant based on the (Holm-adjusted) p-values are not significant based on the (Bonferroni-adjusted) CIs: i.e., the CIs of the relative risk include 1.0. How can I force the confidence intervals calculated by emmeans to use a specific adjustment?

I've included an example of the issue occurring with the `mtcars` data set.

library(tidyverse)
library(emmeans)

# Make number of cylinders a factor
car_dat <- mtcars %>% mutate(cyl = factor(cyl))

# Predict auto vs. manual by number of cylinders
(m1 <- glm(data = car_dat, am ~ cyl, family = binomial))
summary(m1)

# EMMs by cylinder
(emm1 <- emmeans(m1, ~cyl))

# Get effects vs. mean, with Holm adjustment to p-value
(con1 <- emm1 %>% 
    regrid("log") %>% # Regrid to get relative risks (vs odds ratios)
    contrast("del.eff", 
             type = "response", 
             adjust = "holm"))

# Getting confidence intervals for the above reverts to Bonferroni!
(confint(con1, adjust = "holm"))
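For what it's worth, this looks like intended behavior rather than a bug: Holm is a step-down procedure that adjusts p-values only and has no corresponding confidence interval, so emmeans substitutes the closest interval-capable method (Bonferroni). A sketch of one workaround, if a single-step adjustment is acceptable:

# "mvt" adjusts via the multivariate t distribution; it produces
# genuinely adjusted intervals and is usually less conservative
# than Bonferroni
confint(con1, adjust = "mvt")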

4 Comments
2024/04/26
18:28 UTC

1

Question about a time-dependent Cox model

I have a variable for age that does not meet the proportional hazards assumption. I am trying to control for it using a time-dependent variable in the model (even though age is time-independent for this project). My question is, how do I do this in R?

This is the code I currently have for a model that includes 2 other time-dependent variables.

data_tmerge <- tmerge(data, data, id=id, var1_dependent=event(cox_time, var1), 
                          var2_dependent=tdc(time_to_var1), var3=tdc(time_to_var3))

coxph(Surv(tstart, tstop, var1) ~ var2 + var3, data = data_tmerge)

I'd like to include the variable age to this model as a time-dependent variable. Any advice is appreciated!

UPDATE: I tried using tt(age) but the HR for this variable shows as 1.00 with 95% CI: (1.00, 1.00) with p-value=0.2.

Is this the correct way to do it and is it okay that the confidence interval is this narrow?
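For reference, a sketch of how a time-transformed age term is typically specified with the survival package; the log(t) transform below is an assumption (any function of time can be supplied via the tt argument):

library(survival)

# tt() marks age for time-transformation; the tt function defines the
# transform, so the model estimates how age's effect changes over time
coxph(Surv(tstart, tstop, var1) ~ var2 + var3 + age + tt(age),
      data = data_tmerge,
      tt = function(x, t, ...) x * log(t))

On the narrow interval: the tt() coefficient is on the age-by-log(time) scale, which can be tiny, so an HR that rounds to 1.00 with a CI of (1.00, 1.00) is not necessarily wrong; printing more decimal places on the coefficient is a better check than the rounded HR.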

0 Comments
2024/04/26
13:39 UTC

0

SAS sem model with moderator

For my master's thesis (sociology) I'm doing research on dating behavior during the pandemic. I'm doing structural equation modeling in SAS using mainly manifest variables. I want to include gender as a moderator in my model, but I keep getting errors, and it seems to be impossible to find any examples of SAS code/syntax for SEM models with a moderator. Can someone please help?

1 Comment
2024/04/26
13:09 UTC

5

Hide bars with no data

Hello, I've made this bar chart (using geom_col) with ggplot2. The red circles mark sections where there is no data, but R is leaving a gap. Is there any way to remove this gap?
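A sketch of the usual fix, assuming the gaps come from factor combinations with no rows and the bars are dodged (the column names here are placeholders):

library(ggplot2)

# position_dodge2() dodges only the bars that actually exist, so no
# space is reserved for empty factor combinations; preserve = "single"
# keeps all bars the same width
ggplot(df, aes(x = group, y = value, fill = subgroup)) +
  geom_col(position = position_dodge2(preserve = "single"))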

9 Comments
2024/04/26
12:34 UTC

0

weights argument in lm()
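A minimal illustration of the topic, for reference: in lm(), weights supplies case weights for weighted least squares, so each observation's squared residual is scaled by its weight. The weighting variable below is illustrative only (built-in mtcars data).

# observations with larger weights pull the fit harder
fit <- lm(mpg ~ wt, data = mtcars, weights = hp / mean(hp))
summary(fit)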

0 Comments
2024/04/26
03:58 UTC

2

Any good resources or tutorials on really good forest plots in metafor? Base/grid graphics is killing me

Howdy,

I was wondering if anyone had any videos or documentation on how to get the most out of the forest() function? It seems to be a lot of trial and error for me at the moment, and the end product just doesn't look as good as other plots I've seen.
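For what it's worth, a handful of forest() arguments do most of the visual work. A sketch using the package's built-in BCG dataset (argument values are illustrative and usually need per-plot trial and error):

library(metafor)

# canonical example data shipped with the package
dat <- escalc(measure = "RR", ai = tpos, bi = tneg, ci = cpos, di = cneg,
              data = dat.bcg, slab = paste(author, year, sep = ", "))
res <- rma(yi, vi, data = dat)

forest(res,
       header  = TRUE,                  # "Study" and annotation headers
       xlim    = c(-16, 6),             # horizontal room for labels
       atransf = exp,                   # axis on the risk-ratio scale
       cex     = 0.8,                   # shrink text
       mlab    = "Random-effects model")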

0 Comments
2024/04/26
03:04 UTC

0

Rpy2 issues predicting using a previously trained state space model

Not sure whether I should post this here or in a python sub, but I've been having some trouble with running R code through rpy2. Basically I want to use the rucm package to train a model from python, then use that model to make a prediction later on, but this doesn't work out of the box. I think the issue I'm running into is that once the model object comes into python, it's stored as a python dict, so then when passing it back into R, the rucm (or rather the underlying KFAS) package doesn't recognize it as a valid SSModel.

I've tried re-specifying an SSModel explicitly using the fitted python object, but KFAS doesn't seem to really support specifying a model from its parameters rather than training it (or at least I couldn't figure out how to do it).

I've also tried wrapping the training and prediction within a single function defined like robjects.r('''my code here'''), hoping that rpy's R session would hold on to the model structure, however I'm getting a very unhelpful error message of "Error in var(data[, as.character(dep.var)]) : is.atomic(x) is not TRUE", despite the same code snippet running fine in non-rpy2 R.

If anyone has any experience with anything like this, I'd appreciate it. I'm coming close to either just wrapping the prediction step in an API call and ditching rpy2 altogether, or else ditching R altogether and using statsmodels.

0 Comments
2024/04/25
16:16 UTC

1

Vegan package. Bray Curtis distances

Hi everyone,

I have a (probably) very stupid question.

I am trying to analyze bacterial composition in different niches. Two of those niches have the bacterial counts of each species normalized per individual; the other niches do not have that normalization.

Therefore, given that Bray-Curtis takes into account absence/presence and abundance, I assume I cannot compare all the niches unless the data for all niches are somehow normalized (i.e., relative abundances).

My question is whether, when calculating the Bray-Curtis distances with the vegan package (vegdist(df, method="bray")), the relative abundances are automatically generated (because that's how Bray-Curtis distances work, I think), or whether I have to calculate them first (decostand(df, method = "total")).

I did both (calculating the NMDS directly, and generating the relative abundances and then running the NMDS), and the results are completely different.

Thanks!
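For reference, a sketch of the two workflows (df standing in for the community matrix): vegdist() uses the data exactly as given and does not standardize on its own, which is why the two NMDS results differ.

library(vegan)

d_raw <- vegdist(df, method = "bray")     # Bray-Curtis on the raw counts

rel   <- decostand(df, method = "total")  # rows rescaled to sum to 1
d_rel <- vegdist(rel, method = "bray")    # Bray-Curtis on relative abundances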

3 Comments
2024/04/25
13:54 UTC

6

Recoding two variables into one

Hi! R newbie here.

I had one course on R basics in my previous semester at uni, and I'm now writing my thesis using R (a survival analysis). And yes, I tried to search for help on google.

I'm working with NHIS data, and none of their race/ethnicity variables includes Hispanic people; they have a whole separate variable for Hispanic ethnicity.

I now want to create a new variable that includes all given races and ethnicities. I also know that the way I recoded my variables probably isn't the best one, but it's how I learned it.

In the pictures you'll see that I recoded the variable racesr into race, and hispyn into hispanic, plus my attempt at combining the two variables; note that Hispanic isn't in the output of the second table.

I never combined variables before, only recoded them to group the categories differently.

Is it even possible to combine the two variables? I obviously have to keep the number of observations the same throughout my analysis and can't just "add" the Hispanic people on top of the numbers in the other race variable (I hope this makes sense; English is not my first language).

I'm grateful for any help!

https://preview.redd.it/gihsfhdtqmwc1.png?width=596&format=png&auto=webp&s=2f33cb53240c8740c34b29d923d91bf725b0d765
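For reference, a sketch of the standard approach, assuming the recoded variables race and hispanic from the screenshots: ethnicity overrides the race code, so every observation keeps exactly one category and the number of observations is unchanged.

library(dplyr)

mydata <- mydata %>%
  mutate(race_eth = if_else(hispanic == "Hispanic", "Hispanic", race))

table(mydata$race_eth, useNA = "ifany")  # check the new variable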

11 Comments
2024/04/25
13:33 UTC

0

Help with Moderated Mediation

Hi everyone! I am currently trying to learn data analysis with R and need to do a moderated mediation analysis. Here is an example of the model type I am trying to analyze.

https://preview.redd.it/lcqjpv1lhmwc1.png?width=348&format=png&auto=webp&s=face5f8916f51e7e2bfbdea2e8a02f6041eb66ae

Here is the code I have been trying to run, but I realized I am unfortunately too new to R to fully understand the outputs and, thus, what I am missing. Could someone please help me figure out how to get the strengths of the effects in my model, and their significance? Thank you so much in advance!

#2. Load your packages (i.e. mini-programs)
#first time: install.packages(""), later library("")
library("lavaan")
library("foreign")
library("multilevel")
library("mediation")
library("sjPlot")
library("lme4")

#3. Load your data
#Long format data
data <- read.spss("Long NEW.sav", use.value.labels=FALSE,
                  to.data.frame=TRUE, use.missings=TRUE)
hrdata <- as.data.frame(data)
names(hrdata)

#4. Is multilevel necessary?
hrs.mod1 <- aov(ENG_Sum ~ as.factor(Res_ID), data=hrdata)
ICC1(hrs.mod1)
ICC2(hrs.mod1)

#5. Multilevel like a pro
##LMX (mediator) as outcome
model1 <- lme(LMX_Sum ~ EMP_Sum,
              random=~1|Res_ID,
              method="ML", na.action=na.omit, data=hrdata)
summary(model1)

##Engagement (dependent) as outcome
model2 <- lme(ENG_Sum ~ LMX_Sum + EMP_Sum,
              random=~1|Res_ID,
              method="ML", na.action=na.omit, data=hrdata)
summary(model2)

##Make interaction term
hrdata$I_EMP_Sum_Gender_Dissimilarity <- hrdata$EMP_Sum * hrdata$Gender_Dissimilarity

##Moderation
model3 <- lme(LMX_Sum ~ EMP_Sum + Gender_Dissimilarity + I_EMP_Sum_Gender_Dissimilarity,
              random=~1|Res_ID,
              method="ML", na.action=na.omit, data=hrdata)
summary(model3)

#6. Test for mediation and moderated mediation (can take a while!)
#Preparation (other package is used for compatibility,
#please enter centered variables [c_] for the interaction [*])
model4 <- lmer(LMX_Sum ~ EMP_Sum + Gender_Dissimilarity + I_EMP_Sum_Gender_Dissimilarity
               + (1|Res_ID), data = hrdata,
               REML = F, start = NULL,
               verbose = 0L, na.action=na.omit,
               contrasts = NULL, devFunOnly = F)

model5 <- lmer(ENG_Sum ~ LMX_Sum + EMP_Sum + Gender_Dissimilarity
               + (1|Res_ID), data = hrdata,
               REML = F, start = NULL,
               verbose = 0L, na.action=na.omit,
               contrasts = NULL, devFunOnly = F)

mod.med <- mediate(model4, model5, treat = "EMP_Sum",
                   mediator = "LMX_Sum", sims=1000,
                   group.out = "Res_ID", dropobs=F,
                   boot=F, data=hrdata)
summary(mod.med)
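For what it's worth, summary(mod.med) is where the requested quantities live: mediation::mediate() reports the average causal mediation effect (ACME), the average direct effect (ADE), the total effect, and the proportion mediated, each with a confidence interval and p-value, while the strength and significance of the moderation itself come from the interaction term in summary(model3)/summary(model4).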

0 Comments
2024/04/25
12:59 UTC

0

Need to create a function for Specific Growth Rate

I need to create a function that I can use to calculate specific growth rate in a new column in my data frame. The formula is ln(N2/N1)/(t2-t1), where N1 and N2 are the weights on days t1 and t2, respectively. It needs to run similarly to this delta-weight line:

sgr$deltaweight <- ave(sgr$weight, sgr$id, FUN = function(x) c(NA, diff(x)))

This is my dataset
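A sketch in the same ave() style, assuming the measurement day is stored in a column called day (the dataset image isn't reproduced here). Since ln(N2/N1) is the difference of log weights, diff() does the work within each id:

sgr$sgr <- ave(log(sgr$weight), sgr$id, FUN = function(x) c(NA, diff(x))) /
           ave(sgr$day,         sgr$id, FUN = function(x) c(NA, diff(x)))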

4 Comments
2024/04/25
00:33 UTC

4

What R packages can I use to test the statistical power of a zero-inflated negative binomial model?

Hello all!

In R, I have used the glmmTMB package to run a zero-inflated negative binomial model with random and nested factors. I originally did a quick-and-dirty post hoc power analysis using pwr.f2.test from the pwr package. Obviously that analysis was not appropriate, because that function is for general linear models. I was planning on using the simr package to calculate power for my model, since simr can be used for generalized linear mixed models. It was built for lme4 models, but I think it can be used with models from other packages. Can I use a model from glmmTMB with the powerSim function from the simr package? If not, what other statistical power functions would complement my glmmTMB model? Also, a more general question: should I even do a power analysis? The experiment is already done.

Documentation about the SIMR package: https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.12504

Thank you!

3 Comments
2024/04/24
19:14 UTC

1

Filtering a data set

This is a sample data set similar to my problem.

I want to filter my data frame so that, for each author, it keeps the rows with status = consensus, or, if there is no consensus, the rows with status = reviewer_1.

I have tried this code
filtered_df <- filter(original_df, status == "consensus" | status == "reviewer_1" )

But for authors that have consensus, reviewer_1, and reviewer_2 rows, it keeps both the consensus and the reviewer_1 rows.

https://preview.redd.it/ox1chkx90hwc1.png?width=265&format=png&auto=webp&s=8836d85672ce7c3d15a6f34255f365f07171962e
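A sketch of the usual group-then-prefer pattern, assuming a grouping column named author as in the screenshot:

library(dplyr)

filtered_df <- original_df %>%
  group_by(author) %>%
  # keep consensus rows when the author has any; otherwise reviewer_1
  filter(if ("consensus" %in% status) status == "consensus"
         else status == "reviewer_1") %>%
  ungroup()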

4 Comments
2024/04/24
18:33 UTC

1

First SEM model and issues with unique factor covariances among reverse scored items in lavaan

Hi everyone: this is my first time doing SEM and I feel super insecure about my code/results. I don't know colleagues who know SEM and also use R, so I will give it a try here:

I wanted to test a 'simple' model with 1 predictor (a continuous variable) and 1 outcome (a latent bifactor continuous variable with 4 factors). The structure of the outcome was identified from previous literature.

First, I conducted a CFA to check the measurement model. I used the author's code, and I reversed the negatively worded items before conducting the CFA or SEM (the author did not mention any of this). I also include a chunk to estimate unique factor covariances among the reversed scores (even though the reversed scores now follow the same direction as the positively worded items). The code from the original paper is:

model <- 'A =~ stress1 + stress10rev + stress15rev
B =~ stress2N_reversed + stress6N_reversed + stress11rev + stress14rev
C =~ stress4rev + stress7rev + stress8rev + stress9rev + stress17rev
D =~ stress3N_reversed + stress5N_reversed + stress12rev + stress13N_reversed + stress16rev
GENERAL =~ stress3N_reversed + stress5N_reversed + stress12rev + stress13N_reversed +
  stress16rev + stress1 + stress10rev + stress15rev + stress4rev + stress7rev +
  stress8rev + stress9rev + stress17rev + stress2N_reversed + stress6N_reversed +
  stress11rev + stress14rev

#estimate unique factor covariances among reverse scored items (not the original, they are reversed)
stress2N_reversed ~~ stress3N_reversed + stress5N_reversed + stress6N_reversed + stress13N_reversed
stress3N_reversed ~~ stress5N_reversed + stress6N_reversed + stress13N_reversed
stress5N_reversed ~~ stress6N_reversed + stress13N_reversed
stress6N_reversed ~~ stress13N_reversed

#set orthogonal latent factors
GENERAL ~~ 0*A + 0*B + 0*C + 0*D
A ~~ 0*B + 0*C + 0*D
B ~~ 0*C + 0*D
C ~~ 0*D'

Then I conducted the SEM, using the same code as for the measurement model and adding the regressions:

#regressions

A + B + C +D ~ predictor

GENERAL ~ predictor

Does it make sense? Particularly: 1. does it make sense to estimate unique factor covariances among reverse-scored items (which now follow the direction of the positively worded items)? 2. does it make sense to regress each factor as well as the general factor? My hypothesis is focused on the general factor, but I want to include the specific factors in case a reviewer asks for them.

Any insights will be super appreciated!

EDIT: the code runs and the results are in line with previous literature.

0 Comments
2024/04/24
18:00 UTC

1

Using cluster option in mediate function

I am trying to conduct a mediation analysis with multi-level data, i.e., individual observations nested in countries. To control for the clustered structure, I want to use the "cluster" option of the mediation package in R.

However, I get the following error message:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels

In addition: Warning message: In mediate(model.m, model.y, sims = 50, treat = "treatment", mediator = "mediator", : length of cluster vector differs from # of obs for mediator model

There are 112 levels to the grouping variable. I tried to replicate the data, but I cannot get the exact same error message. However, similarly to the actual regression I am running, the code below only works without the cluster option. So I would hope that if I understand what is wrong in the replication example below, I will also figure out the error with my actual data.

df <- data.frame(
  med = sample(1:4, 5000, replace = TRUE),  
  outcome = sample(c(0, 1), 5000, replace = TRUE),  
  predictor = runif(5000, 0, 12),  
  group_var = sample(1:50, 5000, replace = TRUE)  
)

model.m <- lm(med ~  predictor, data=df)

model.y <- glm(outcome ~  predictor + med, data=df, family=binomial(link="logit"))

out.1 <- mediate(model.m, model.y, sims = 50, treat = "predictor", mediator = "med", cluster = "group_var")

Both in the actual data and in the replication data above, I tried transforming the group variable into a factor and generating a vector from the data to use as the grouping variable:

outputVector <- as.numeric(levels(df$group_var)[as.integer(df$group_var)])
0 Comments
2024/04/24
08:50 UTC

1

Moderated Mediation in Lavaan

Hi all,

I am trying to set up a mediated moderation model in lavaan.

It seems trickier than I first anticipated, and the advice available around it appears pretty variable (e.g., in terms of syntax).

I would be hugely appreciative if I could get some feedback on my syntax.

I am interested in plotting simple slopes at several quantiles of Moderator1; Moderator2 is just a binary-coded variable (male/female).

I scaled all of the data prior to running the model (centered, scaled).

I was attempting to follow these examples:

https://groups.google.com/g/lavaan/c/doiIqqVc5TY/m/ndSXbDIuBAAJ

https://gist.github.com/TomTrebs/15b84bb7d09dcf7f4a25201b9ca66629

https://ademos.people.uic.edu/Chapter15.html#5_moderated_mediation_analyses_using_%E2%80%9Clavaan%E2%80%9D_package

https://preview.redd.it/0xdjuh9lj6wc1.jpg?width=1621&format=pjpg&auto=webp&s=9c1c83cfda257d495e4ae54c2ac6e0c57f396916

#Mediated Moderation
Y ~ b1*MEDIATOR + b2*MEDIATOR_X_Moderator1 + b3*MEDIATOR_X_Moderator2 + control1 + Control2 + cprime1*Predictor + Moderator2 + Moderator1 + cprime2*Predictor_X_Moderator1 + cprime3*Predictor_X_Moderator2 + Control3 + Control4 + Control5 + Control6 + Control7

MEDIATOR ~ a1*Predictor + Moderator2 + Moderator1 + a2*Predictor_X_Moderator1 + a3*Predictor_X_Moderator2 + Control3 + Control4 + Control2 + control1 + Control5 + Control6 + Control7

#Allow residuals to correlate between interactions and original vars
MEDIATOR ~~ MEDIATOR_X_Moderator1 + MEDIATOR_X_Moderator2
MEDIATOR_X_Moderator1 ~~ MEDIATOR_X_Moderator2
Predictor ~~ Predictor_X_Moderator1 + Predictor_X_Moderator2
Predictor_X_Moderator1 ~~ Predictor_X_Moderator2
Moderator2 ~~ MEDIATOR_X_Moderator2 + Predictor_X_Moderator2
Moderator1 ~~ MEDIATOR_X_Moderator1 + Predictor_X_Moderator1
MEDIATOR_X_Moderator1 ~~ Predictor_X_Moderator1
MEDIATOR_X_Moderator2 ~~ Predictor_X_Moderator2

#Simple slope of M on X at quantiles of Moderator 1
SS_MxX_Q1 := a1 + a2*(-0.91555474209508)
SS_MxX_Q2 := a1 + a2*(-0.91555474209508)
SS_MxX_Q3 := a1 + a2*(-0.326983836462529)
SS_MxX_Q4 := a1 + a2*(0.653967672925057)
SS_MxX_Q5 := a1 + a2*(1.32101469930862)
SS_MxX_NoMod := a1 + a2*(0)

#Simple slope of M on X at levels of Moderator 2
SS_MxX_Moderator2Male := a1 + a3*(-2.01303598658837)
SS_MxX_Moderator2Female := a1 + a3*(0.494429891442757)
SS_MxX_Moderator2_NoMod := a1 + a3*(0)

#Simple slope of Y on M at quantiles of Moderator 1
SS_YxM_Q1 := b1 + b2*(-0.91555474209508)
SS_YxM_Q2 := b1 + b2*(-0.91555474209508)
SS_YxM_Q3 := b1 + b2*(-0.326983836462529)
SS_YxM_Q4 := b1 + b2*(0.653967672925057)
SS_YxM_Q5 := b1 + b2*(1.32101469930862)
SS_YxM_NoMod := b1 + b2*(0)

#Simple slope of Y on M at levels of Moderator 2
SS_YxM_Moderator2Male := b1 + b3*(-2.01303598658837)
SS_YxM_Moderator2Female := b1 + b3*(0.494429891442757)
SS_YxM_Moderator2_NoMod := b1 + b3*(0)

#Indirect effects conditional on moderators
IE_cond_Q1 := SS_MxX_Q1*SS_YxM_Q1
IE_cond_Q2 := SS_MxX_Q2*SS_YxM_Q2
IE_cond_Q3 := SS_MxX_Q3*SS_YxM_Q3
IE_cond_Q4 := SS_MxX_Q4*SS_YxM_Q4
IE_cond_Q5 := SS_MxX_Q5*SS_YxM_Q5
IE_cond_NoMod := SS_MxX_NoMod*SS_YxM_NoMod
IE_cond_Moderator2Male := SS_MxX_Moderator2Male*SS_YxM_Moderator2Male
IE_cond_Moderator2Female := SS_MxX_Moderator2Female*SS_YxM_Moderator2Female
IE_cond_Moderator2_NoMod := SS_MxX_Moderator2_NoMod*SS_YxM_Moderator2_NoMod

#Direct effects conditional on moderator 1
DE_cond_Q1 := cprime1 + cprime2*(-0.91555474209508)
DE_cond_Q2 := cprime1 + cprime2*(-0.91555474209508)
DE_cond_Q3 := cprime1 + cprime2*(-0.326983836462529)
DE_cond_Q4 := cprime1 + cprime2*(0.653967672925057)
DE_cond_Q5 := cprime1 + cprime2*(1.32101469930862)
DE_cond_NoMod := cprime1 + cprime2*(0)

#Direct effects conditional on moderator 2
DE_cond_Moderator2Male := cprime1 + cprime3*(-2.01303598658837)
DE_cond_Moderator2Female := cprime1 + cprime3*(0.494429891442757)
DE_cond_Moderator2_NoMod := cprime1 + cprime3*(0)

#Total effects conditional on moderator 1
total.Q1 := DE_cond_Q1 + IE_cond_Q1
total.Q2 := DE_cond_Q2 + IE_cond_Q2
total.Q3 := DE_cond_Q3 + IE_cond_Q3
total.Q4 := DE_cond_Q4 + IE_cond_Q4
total.Q5 := DE_cond_Q5 + IE_cond_Q5
total.NoMod := DE_cond_NoMod + IE_cond_NoMod

#Total effects conditional on moderator 2
total.Moderator2Male := DE_cond_Moderator2Male + IE_cond_Moderator2Male
total.Moderator2Female := DE_cond_Moderator2Female + IE_cond_Moderator2Female
total.Moderator2_NoMod := DE_cond_Moderator2_NoMod + IE_cond_Moderator2_NoMod

#Proportion mediated
propmediated.Q1 := IE_cond_Q1 / total.Q1
propmediated.Q2 := IE_cond_Q2 / total.Q2
propmediated.Q3 := IE_cond_Q3 / total.Q3
propmediated.Q4 := IE_cond_Q4 / total.Q4
propmediated.Q5 := IE_cond_Q5 / total.Q5
propmediated.NoMod := IE_cond_NoMod / total.NoMod
propmediated.Moderator2Male := IE_cond_Moderator2Male / total.Moderator2Male
propmediated.Moderator2Female := IE_cond_Moderator2Female / total.Moderator2Female
propmediated.Moderator2_NoMod := IE_cond_Moderator2_NoMod / total.Moderator2_NoMod

0 Comments
2024/04/24
00:42 UTC

0

Rank deficiency in a linear mixed model

Hello,

I have a linear mixed model with 4 fixed factors and 1 random factor: V ~ A + B + C + D + (1|E). I want to compare this model with a model with an additional interaction term.

When I add an interaction term to this model (e.g. V ~ A + B + C + D + C:D + (1|E)), I get a rank deficiency error message in R: "fixed-effect model matrix is rank deficient so dropping 1 column / coefficient".

Why do I get this error message?

If I replace C:D with A:D, I do not get this error message.

Any suggestion is welcome.
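For what it's worth, the usual cause is an empty cell: if some combination of the levels of C and D never occurs in the data, the C:D columns of the design matrix are linearly dependent on the main effects, and one coefficient is dropped. A quick check, assuming the data frame is called mydata:

# an empty cell here would explain why C:D is rank deficient while
# A:D (presumably with all cells filled) is not
table(mydata$C, mydata$D)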

11 Comments
2024/04/23
23:00 UTC


1

Creating summary of a lavaan.mi object (runMI())

Hi. So I am trying to use the runMI() function of the semTools package to fit a multiply imputed ordinal MG-CFA model:

rm(list=ls()) # Clean workspace
options(scipen=10000) # Set scientific notation
library(tidyverse)
library(lavaan) # For CFA
library(semTools) # For measurement of invariance in MGCFA
library(mice) # For imputation: quickpred(), mice(), complete()
set.seed(121012) # For reproducibility

# Create simulated categorical dataset with missing values
popModel <- "
f1 =~ 0.6*y1 + 0.6*y2 + 0.6*y3 + 0.6*y4
y1 ~*~ 1*y1
y2 ~*~ 1*y2
y3 ~*~ 1*y3
y4 ~*~ 1*y4
f1 ~~ 1*f1
y1 | 0.5*t1
y2 | 0.25*t1
y3 | 0*t1
y4 | -0.5*t1
"

dat <- simulateData(popModel, sample.nobs  = 200L)
miss.pat <- matrix(as.logical(rbinom(prod(dim(dat)), 1, 0.2)), nrow(dat), ncol(dat))
dat[miss.pat] <- NA

# Convert columns to factors with specified levels
dat$y1 <- factor(dat$y1, levels = c(1, 2))
dat$y2 <- factor(dat$y2, levels = c(1, 2))
dat$y3 <- factor(dat$y3, levels = c(1, 2))
dat$y4 <- factor(dat$y4, levels = c(1, 2))

# Define categories for a grouping variable school
schools <- c("Marx", "Durkheim", "Weber", "Comte")

# Sample the school categories randomly for each row
dat$school <- sample(schools, size = nrow(dat), replace = TRUE)

# Define Ordinal MG-CFA model
analyzeModel <- "
f1 =~ y1 + y2 + y3 + y4
y1 ~*~ 1*y1
y2 ~*~ 1*y2
y3 ~*~ 1*y3
y4 ~*~ 1*y4
"
# Create predictor matrix
predictormatrix<-quickpred(dat)

# Impute 5 datasets (common practice)
dat_imp <- mice(data = dat, 
                           predictorMatrix = predictormatrix,
                           m = 5, 
                           # Specify imputation method for ordinal data
                           method = "polr")

mice.imp <- NULL
for(i in 1:5) mice.imp[[i]] <- complete(dat_imp, action=i, inc=FALSE)  

# run lavaan with previously imputed data using runMI
# run lavaan with previously imputed data using runMI
out2 <- semTools::runMI(analyzeModel, 
                        data = mice.imp,
                        fun = "cfa",
                        meanstructure = TRUE,
                        group = "school", ordered = TRUE, 
                        std.lv = TRUE)

detach("package:tidyverse", unload = TRUE)
library(semTools)

summary(out2)

However, it seems like R isn't finding the correct summary() method for lavaan.mi-class objects and is instead using the method for lavaanList-class objects, as per this discussion, with the final output looking like this:

> summary(out2)
lavaanList (0.6-17) -- based on 5 datasets (5 converged)


Group 1 []:

Latent Variables:
                    est.ave
  f1 =~                    
    y1                0.774
    y2                0.770
    y3                0.567
    y4                0.674

Intercepts:
                    est.ave
   .y1                0.000
   .y2                0.000
   .y3                0.000
   .y4                0.000
    f1                0.000

Thresholds:
                    est.ave
    y1|t1             0.690
    y2|t1             0.377
    y3|t1             0.037
    y4|t1            -0.570

Variances:
                    est.ave
   .y1                0.387
   .y2                0.397
   .y3                0.669
   .y4                0.510
    f1                1.000

Scales y*:
                    est.ave
    y1                1.000
    y2                1.000
    y3                1.000
    y4                1.000


Group 2 []:
....

Instead of how it should look, which is something like this (based on this other discussion: https://groups.google.com/g/lavaan/c/cfCekznw-M0):

summary(mod, ci = TRUE)
#> lavaan.mi object based on 20 imputed data sets. 
#> See class?lavaan.mi help page for available methods. 
#> 
#> Convergence information:
#> The model converged on 20 imputed data sets 
#> 
#> Rubin's (1987) rules were used to pool point and SE estimates across 20 imputed data sets, and to calculate degrees of freedom for each parameter's t test and CI.
#> 
#> Parameter Estimates:
#> 
#>   Standard errors                             Standard
#>   Information                                 Expected
#>   Information saturated (h1) model          Structured
#> 
#> Regressions:
#>                    Estimate  Std.Err  t-value       df  P(>|t|) ci.lower
#>   y ~                                                                   
#>     x                 0.385    0.125    3.073   84.009    0.003    0.136
#>  ci.upper
#>          
#>     0.635
#> 
#> Variances:
#>                    Estimate  Std.Err  t-value       df  P(>|t|) ci.lower
#>    .y                 0.911    0.167    5.452  209.224    0.000    0.581
#>  ci.upper
#>     1.240

As you can see in the code, I explicitly load library(semTools), as specified in the linked discussion. However, it seems like this is not enough. Any idea what to do? Thanks!

0 Comments
2024/04/23
21:41 UTC

1

What causes this error?

Hi everyone, I'm trying to perform a Lag Sequential Analysis using the "lagsequential" package, and I keep getting an error upon this command:

sequential(seqdata, labels = NULL, lag = 1, adjacent = TRUE, onezero = NULL, tailed = 2, permtest = FALSE, nperms = 10)

Error:

'list' object cannot be coerced to type 'double'

My data frame consists of one column filled with integer numbers, which should be OK for running the analysis as specified in the package documentation. How can I resolve this issue? I am a beginner in R, btw.

Thanks in advance!
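A guess at the cause, for reference: a one-column data frame is still a list, and the error suggests the function wants an atomic vector or matrix. Passing the column itself often resolves it:

# seqdata[[1]] extracts the column as an integer vector instead of
# passing the data frame (a list) itself
sequential(seqdata[[1]], labels = NULL, lag = 1, adjacent = TRUE,
           onezero = NULL, tailed = 2, permtest = FALSE, nperms = 10)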

1 Comment
2024/04/23
19:42 UTC

1

Fatal error: Unexpected exception: Bad allocation

Hi everyone.

So today I (stupidly, I think) decided to update all R packages after having some trouble getting a function right. Since then, whenever I try to run my main code, an error window appears saying "Fatal error: Unexpected exception: Bad allocation". Then an error message in the console says:

Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) : namespace ‘xfun’ 0.40 is being loaded, but >= 0.43 is required

Thereafter, a new error window pops up saying "R Session Disconnected: This browser was disconnected from the R session because another browser connected (only one browser at a time may be connected to an RStudio session). You may reconnect using the button below. [Reconnect]". I click reconnect, but then I just go back to the starting point.

I tried uninstalling R, RStudio, and RTools and installing them again, but that has not changed anything. What should I do? I would like to just reset R to its initial state, where I didn't have this problem, but after uninstalling all the programs and installing their latest versions again, I am unsure what else I should delete to get back to square one. I just want to be careful not to delete something I shouldn't and make the problem even worse.

All aid is extremely appreciated, as I am in the last couple of weeks before submitting my Master's thesis and this problem is quite disruptive. Now my initial trouble does not seem half as bad. Thank you!!
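For reference, the console error itself points at a stale package left behind by the mass update, and the usual first step (a guess, but a cheap one to try) is to reinstall that package from a fresh R session, before any project code loads it:

install.packages("xfun")  # the error demands xfun >= 0.43 but 0.40 is installed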

5 Comments
2024/04/23
14:51 UTC

1

.Rmd: Error in `parse()`: ! attempt to use zero-length variable name

Edit: Problem solved! The issues were as follows:

  • It is not possible to split if-else statements across different chunks in .Rmd files.
  • In the chunk options I should not include a comma between "r" and the name of the chunk; that is, I should write {r Sales, ...} rather than {r, Sales, ...}.

______________________

I am relatively familiar with R, but I just started using .Rmd files and have run into the following error. My "HM.Rmd" file doesn't knit, and gets stuck on a specific line where a code chunk starts. The error in the console says:

Quitting from lines 159-263 [Sales] (HM5.Rmd)
Error in `parse()`:
! attempt to use zero-length variable name
Backtrace:
  1. rmarkdown::render("HM5.Rmd")
  2. knitr::knit(knit_input, knit_output, envir = envir, quiet = quiet)
  3. knitr:::process_file(text, output)
  8. knitr:::process_group.block(group)
  9. knitr:::call_block(x)
 ...
 11. knitr:::eng_r(options)
 14. knitr (local) evaluate(...)
 15. evaluate::evaluate(...)
 17. evaluate:::parse_all.character(...)
 18. base::parse(text = x, srcfile = src)

Please note that the original R code runs correctly on its own, the current .Rmd file is indeed saved as .Rmd, and it does not knit even when triggered by the command rmarkdown::render('HM5.Rmd'). Could someone possibly help me? Thank you in advance!

For completeness, here are the lines of code where it gets stuck (lines 158-185; many functions are my own and defined above):

``` {r, Sales, include=TRUE}
envelope <- function(avg.return, vc, sales = T){ # x and y are single pf, a is the proportion
  if(sales==T){
    rf <- c(0, 0.1)
    for (i in 1:length(rf)) {
      Z <- ginv(vc) %*% (avg.return-rf[i])
      Z_Sum <- sum(Z) # Normalize
      name <- eval(paste("pf",eval(i), sep=""))
      assign(name, as.vector(Z/Z_Sum))
    }
    # Initialize
    a <- seq(-2,6, by=0.025)
    b <- rep(1, length(a)) - a
    z <- matrix(data=NA, nrow=length(a), ncol=2)
    # Compute
    z[,1] <- a*c(avg.return %*% pf1) + b*c(avg.return %*% pf2)
    var1 <- pf.var(pf1,vc)
    var2 <- pf.var(pf2,vc)
    var3 <- pf.cov(pf1,vc,pf2)
    z[,2] <- a^2*c(var1) + b^2*c(var2) + a*b*c(var3)*2
    colnames(z) <- c("ret", "var")
    z <- data.frame(z)
    z <- z[order(z$ret, decreasing=F),]
    return(z)
  }
```
8 Comments
2024/04/23
14:01 UTC

1

Looking for Lavaan equivalent to MPlus setting

Hi all,

Hopefully this should be an easier question. I'm in the process of fitting a random intercepts cross-lagged panel model (RI-CLPM). When looking at the example Mplus information found here, they mention how they set their analysis to MODEL = NOCOV to override the defaults and set covariances between observed and latent outcome variables to zero.

I was wondering if there is an option I'm missing within lavOptions that would do this, or if there is a lavaan equivalent? I've looked around, but I feel like I'm just not finding it.
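For reference, a guess at the closest lavaan knobs: the options below exist in lavOptions(), but whether they reproduce Mplus's MODEL = NOCOV exactly for an RI-CLPM is worth verifying against the fitted parameter table.

fit <- lavaan(model, data = dat,      # model and dat are placeholders
              auto.cov.lv.x = FALSE,  # no default covariances among exogenous latent variables
              auto.cov.y = FALSE)     # no default residual covariances among endogenous variables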

5 Comments
2024/04/23
00:16 UTC

3

List files in a folder containing subfolders

Suppose I have 2 folders (A and B) and each folder has 4 subfolders containing files.

How can I list all the files in folders A and B, and then see which ones are in both folders and which ones are not?
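A sketch, assuming folders A and B sit in the working directory:

files_A <- list.files("A", recursive = TRUE)  # paths relative to A/
files_B <- list.files("B", recursive = TRUE)

# comparing relative paths matches subfolder + name; wrap both sides
# in basename() to compare file names regardless of subfolder
intersect(files_A, files_B)  # in both
setdiff(files_A, files_B)    # only in A
setdiff(files_B, files_A)    # only in B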

10 Comments
2024/04/22
21:15 UTC

0

R Medicine 2024 - Abstract Deadline in ONE WEEK!

Hey everyone! ONE WEEK to deadline! Submit your abstracts by April 29 for the R/Medicine Conference. Discuss R-based tools and approaches for health data. Don’t miss this virtual conference, June 10-14, 2024. Full details at https://rconsortium.github.io/RMedicine_website/ 

0 Comments
2024/04/22
19:06 UTC

0

[Request] Ideas for an Idle Server

Good afternoon R Users!

Due to a long story that is not worth getting into, there is a very powerful server at my university that is always on and nobody uses it. I have VPN access to it but it only allows me to access RStudio.

What projects could I carry out on it, taking advantage of the fact that I have access to a server with high computing power that I cannot access physically, only through RStudio remotely? Also, the electricity bill is paid by the university, so I don't care if it's running all night long.

Also, no one else is using it, nor is there any plan for anyone to use it soon.

The model, in case it is important, is: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz

Cheers! :)

8 Comments
2024/04/22
15:35 UTC

1

Simulating Bus Failures

I have the following problem:

  • Suppose there are 100 buses (bus_1, bus_2,... bus_100).

  • The bus company sends the first 5 buses (bus_1, bus_2... bus_5) on the first day.

  • On the first day, each bus has a 0.5 probability of breaking down (once a bus breaks down, it is out of service permanently).

  • If the bus does not break down on the first day, it's sent out on the second day, but now its probability of breaking down is 0.49. That is, each day a bus doesn't break down, its probability of breaking down is reduced by 0.01.

  • When a bus breaks down, it is replaced with the next bus (e.g., if bus1 through bus5 are out and bus2 and bus4 break down on the same day, the next day we will have bus1, bus6, bus3, bus7, bus5).

  • On any given day, there can only be a maximum of 5 buses out on the road; towards the end, there might be 4 buses, then 3 buses, and so on, until they all break down.

My Question: I am trying to write a simulation of this situation that runs until all buses break down. The final result should be a data frame with 11 columns: day_number (1, 2, 3, ..., n); x1-x5 (the bus number in each road position); and p1-p5 (the probability of each of those buses breaking down on that day).

Here is what I tried so far using basic loops:

bus_count <- 100
bus_prob <- rep(0.5, bus_count)
bus_active <- 1:5
results <- data.frame(matrix(ncol = 11, nrow = 0))
colnames(results) <- c("day_number", "x1", "x2", "x3", "x4", "x5", "p1", "p2", "p3", "p4", "p5")

day_number <- 0
while(length(bus_active) > 0) {
    day_number <- day_number + 1
    breakdown <- runif(length(bus_active)) < bus_prob[bus_active]
    bus_prob[bus_active] <- pmax(bus_prob[bus_active] - 0.01, 0)
    next_bus <- (max(bus_active)+1):(max(bus_active)+sum(breakdown))
    next_bus <- next_bus[next_bus <= bus_count]
    bus_prob[next_bus] <- 0.5
    bus_active <- c(bus_active[!breakdown], next_bus)
    results <- rbind(results, c(day_number, bus_active, bus_prob[bus_active]))
}

# rename bus values to actual names
results[,2:6] <- apply(results[,2:6], 2, function(x) paste0("bus_", x))

print(results)

This is not correct. I am noticing the following errors:

Problem 1: The column names did not come out correctly:

X1 X3 X5 X6 X7 X8 X0.49 X0.49.1 X0.5 X0.5.1 X0.5.2

Problem 2: The bus ordering is incorrect. On the first day, the buses should be bus1, bus2, bus3, bus4, bus5.

       X1      X3       X5       X6       X7       X8  X0.49 X0.49.1   X0.5 X0.5.1 X0.5.2
    1   1   bus_3    bus_5    bus_6    bus_7    bus_8   0.49    0.49   0.50   0.50    0.5
    2   2   bus_3    bus_6    bus_9   bus_10   bus_11   0.48    0.49   0.50   0.50    0.5

Problem 3: Buses are "coming back from the dead". E.g., comparing Day 15 and Day 47, bus_47 comes back:

    15 15  bus_38   bus_45   bus_46   bus_47   bus_48   0.46    0.49   0.50   0.50    0.5
    47 47  bus_47   bus_47   bus_47   bus_47   bus_47  47.00   47.00  47.00  47.00   47.0

Problem 4: The same bus appears multiple times in the same row:

  47 47  bus_47   bus_47   bus_47   bus_47   bus_47  47.00   47.00  47.00  47.00   47.0

Problem 5: The probabilities are not in the correct range (they should only be between 0 and 0.5):

    37 37  bus_98   bus_99   bus_98  bus_0.5  bus_0.5   0.50   37.00  98.00  99.00   98.0
    38 38  bus_99   bus_98  bus_100 bus_0.49 bus_0.49   0.50   38.00  99.00  98.00  100.0

Can someone please help me fix these problems and write this code correctly?
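For reference, a sketch of one restructured loop that addresses all five problems under the stated rules: road positions keep their buses (a length-5 slot vector, NA once no replacement exists), replacements come from a queue, and each day's row is built as a named one-row data frame so rbind() keeps the column names.

set.seed(1)  # for reproducibility
bus_count <- 100
bus_prob  <- rep(0.5, bus_count)  # current breakdown probability per bus
slots     <- 1:5                  # which bus occupies each road position (NA = none left)
next_new  <- 6                    # next unused bus in the garage
results   <- list()
day       <- 0

while (any(!is.na(slots))) {
  day    <- day + 1
  active <- which(!is.na(slots))
  probs  <- bus_prob[slots]       # NA slots index as NA automatically

  # record the fleet BEFORE breakdowns: p_i is the chance of breaking today
  results[[day]] <- data.frame(day_number = day,
                               t(setNames(slots, paste0("x", 1:5))),
                               t(setNames(probs, paste0("p", 1:5))))

  broke    <- active[runif(length(active)) < bus_prob[slots[active]]]
  survived <- setdiff(active, broke)
  bus_prob[slots[survived]] <- pmax(bus_prob[slots[survived]] - 0.01, 0)

  for (pos in broke) {  # replace broken buses in place, preserving positions
    slots[pos] <- if (next_new <= bus_count) next_new else NA
    if (!is.na(slots[pos])) next_new <- next_new + 1
  }
}

results <- do.call(rbind, results)
results[paste0("x", 1:5)] <- lapply(results[paste0("x", 1:5)],
                                    function(x) ifelse(is.na(x), NA, paste0("bus_", x)))
head(results)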

Thanks!

0 Comments
2024/04/22
14:50 UTC

0

How do I answer #2? Produce a scatterplot of these data, showing the best fit line (in green), the prediction intervals (in blue), and the confidence intervals (in orange)


library(dplyr)
library(ggplot2)

#Here are average systolic blood pressure data for children/young adults aged 
#1 through 19. Run the following lines of code:

age <- c(1:19)
SBP <- c(82,105,102,103,112,113,111,117,118,121,118,123,125,129,127,135,131,134,138)
bpstudy <- data.frame(age,SBP)

#1. Fit a regression line relating age to SBP. Interpret the t statistic and 
#p value of the age variable and the R-squared of the overall model in words.

#2. Produce a scatterplot of these data, showing the best fit line (in green), 
#the prediction intervals (in blue), and the confidence intervals (in orange).
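For reference, a sketch of one standard approach: predict() produces both interval types, which are then drawn as extra layers on the scatterplot.

model <- lm(SBP ~ age, data = bpstudy)

# fit/lwr/upr columns for each interval type, renamed so both can
# live in one plotting data frame (predict() warns that prediction
# intervals on the training data refer to future observations;
# that's expected here)
pred <- as.data.frame(predict(model, interval = "prediction"))
conf <- as.data.frame(predict(model, interval = "confidence"))
names(pred) <- c("fit", "pred_lwr", "pred_upr")
names(conf) <- c("fit", "conf_lwr", "conf_upr")
plotdat <- cbind(bpstudy, pred, conf[, -1])

ggplot(plotdat, aes(age, SBP)) +
  geom_point() +
  geom_line(aes(y = fit), color = "green") +                      # best fit line
  geom_line(aes(y = pred_lwr), color = "blue", linetype = 2) +    # prediction interval
  geom_line(aes(y = pred_upr), color = "blue", linetype = 2) +
  geom_line(aes(y = conf_lwr), color = "orange", linetype = 2) +  # confidence interval
  geom_line(aes(y = conf_upr), color = "orange", linetype = 2)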

3 Comments
2024/04/21
21:54 UTC

1

Multiple treatments with different methods of administration?

Hi! I'm an undergrad working on a school project where I have to create a research design.

The design I've thought of involves 2 different messages: one specific message and one generic message. Each treatment message is to be administered in 4 different ways to households:

a) women only

b) men only

c) men and women separately

d) men and women together

Treatment 1 (specific message), treatment 2 (generic message), and control are randomly assigned among 15 communities. Within treatment communities, households are randomly assigned to one of the 4 methods of treatment administration. The outcome variable is registration.

I'm quite confused as to how my regression equation should look. Any help at all would be much appreciated.

Does this sound correct?

Registration = β0 + β1×specific_message + β2×generic_message + β3×WomenOnly + β4×MenOnly + β5×MenWomenSeparate + β6×MenWomenTogether + ControlVariables + ϵ

where all the independent variables are dummy variables.

The results table would look something like:

Intention-to-Treat Effects by Treatment Administration Type

====================================================================
                          Specific Message      Generic Message
--------------------------------------------------------------------
Women Only                β0 + β1 + β3          β0 + β2 + β3
Men Only                  β0 + β1 + β4          β0 + β2 + β4
Men & Women Separate      β0 + β1 + β5          β0 + β2 + β5
Men & Women Together      β0 + β1 + β6          β0 + β2 + β6
--------------------------------------------------------------------

But if I am treating men and women separately (c), do I need an interaction term? Same for (d)?

Please help. Any help would be massively appreciated!

0 Comments
2024/04/21
13:55 UTC
