/r/rstats
A subreddit for all things related to the R Project for Statistical Computing. Questions, news, and comments about R programming, R packages, RStudio, and more.
PLEASE READ THIS BEFORE POSTING
Welcome to /r/rstats - the subreddit for all things R (the programming language)!
For code problems, Stack Overflow is a better platform. For short questions, Twitter #rstats tag is a good place. For longer questions or discussions, RStudio Community is another great resource.
If your account is new, your post may be automatically flagged and removed. If you don't see your post show up, please message the mods and we'll manually approve it.
Rules:
You can also check out our sister sub /r/Rlanguage
I have 25 field sites across the country, with 5 years of data for each site. I would like to standardize these data for comparison: make the highest value from each site equal to 1, then divide every other year by that high year to get a proportion. Is there a way to do this in dplyr?
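A minimal sketch of the grouped-max approach in dplyr, assuming hypothetical columns `site`, `year`, and `value` (the real column names will differ):

```r
library(dplyr)

# Toy stand-in: 2 sites x 3 years (in place of 25 sites x 5 years)
dat <- tibble(
  site  = rep(c("A", "B"), each = 3),
  year  = rep(2019:2021, times = 2),
  value = c(10, 25, 20, 4, 2, 8)
)

# Within each site, divide every year's value by that site's maximum,
# so the highest year equals 1 and other years are proportions of it
dat_std <- dat %>%
  group_by(site) %>%
  mutate(value_std = value / max(value, na.rm = TRUE)) %>%
  ungroup()
```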
I was under the impression that qnorm() could be used to obtain the z score of a proportion when the distribution is normal but one of my professors told me that this is not the case. Can anyone tell me why it can or cannot be used in this case or what function I should be using instead?
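For context on what `qnorm()` actually does: it is the quantile function (inverse CDF) of the normal distribution, so it maps a cumulative proportion to the z score with that much area below it, while `pnorm()` goes the other way. Whether that is the right tool depends on whether your proportion is a cumulative probability or something else (e.g. a sample proportion):

```r
# qnorm() is the inverse of pnorm(): cumulative proportion -> z score
z <- qnorm(0.975)   # z score below which 97.5% of a standard normal falls
p <- pnorm(z)       # and back to the proportion

round(z, 2)  # 1.96
round(p, 3)  # 0.975
```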
Hi all! I've been using R for about 48 hours, so many apologies if this is obvious.
I'm trying to perform a Mantel-Haenszel test on stratified pure count data: say my exposure is occupation, my outcome is owning a car, and my strata are neighbourhoods. I have about 30 strata. I'm trying to calculate odds ratios for each occupation against a reference (say, being a train driver). For a particular occupation I get:
Error in uniroot(function(t) mn2x2xk(1/t) - x, c(.Machine$double.eps, :
f() values at end points not of opposite sign
For some contingency tables in this calculation (i.e. some strata) I have zero entries, but that is also true of other occupations and I do not get this error for them. Overall my counts are pretty large (i.e. tens or hundreds of thousands). There are no NA values.
Any help appreciated! Thanks in advance.
Hey! Does anyone here know how to do a Rao-Scott test in R? I've seen a few options, but I'm not sure which is the right one.
For context, my goal is to test whether two nominal variables are associated. I would have used Pearson's chi-square test, but there was stratification in my sampling design, hence Rao-Scott.
Any help is greatly appreciated. Thanks!
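For reference, the survey package implements a Rao-Scott corrected chi-square via `svychisq()`. A minimal sketch with made-up data (`var1`, `var2`, and `stratum` are hypothetical names; real designs usually also supply sampling weights):

```r
library(survey)

# Hypothetical stratified sample
set.seed(1)
d <- data.frame(
  stratum = rep(1:4, each = 50),
  var1 = sample(c("yes", "no"), 200, replace = TRUE),
  var2 = sample(c("low", "high"), 200, replace = TRUE)
)

# Declare the stratified design (equal weights assumed here)
des <- svydesign(ids = ~1, strata = ~stratum, data = d)

# Rao-Scott corrected test of association between the two nominal variables
res <- svychisq(~var1 + var2, design = des)
res
```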
I’m an undergraduate currently working towards a publication and am seeking help with using generalized linear models for ecological count data. My research mentors are not experts in statistics, and I’ve been struggling to find reliable help/advice for finalizing my project results. My research involves analyzing correlations between the abundance of an endemic insect and the abundance of predator and prey species (grouped into two variables, “pred” and “prey”) using ~10 years of annual arthropod monitoring data. The data have a ton of zeros, are overdispersed, and have some bias in sampling methods that may be producing more structural zeros. I’ve settled on two models to analyze the data: a zero-inflated negative binomial model with fixed effects, and a negative binomial model with mixed effects (nested random effects). Both models seek to minimize some of the sampling bias. Is there anyone familiar with similar models/methods who would be able to answer a few questions? I’d greatly appreciate your help!
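In case it helps, the glmmTMB package can fit both candidate models described above. A sketch with simulated stand-in data (all variable names are hypothetical, not the actual monitoring data):

```r
library(glmmTMB)

# Simulated stand-in for the monitoring data
set.seed(42)
d <- data.frame(
  count = rnbinom(300, mu = 2, size = 0.5),
  pred  = rnorm(300),
  prey  = rnorm(300),
  site  = factor(rep(1:10, each = 30)),
  year  = factor(rep(1:10, times = 30))
)

# Candidate 1: zero-inflated negative binomial, fixed effects only
m_zinb <- glmmTMB(count ~ pred + prey, ziformula = ~1,
                  family = nbinom2, data = d)

# Candidate 2: negative binomial with nested random effects (year within site)
m_nb <- glmmTMB(count ~ pred + prey + (1 | site/year),
                family = nbinom2, data = d)

AIC(m_zinb, m_nb)  # one way to compare the two candidates
```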
Basically, I have a GAM with only one non-linear term, and the rest are linear, and I think I need clustered SEs. Can I just use vcovCL from sandwich like normal? I actually did this, but my SEs are much smaller, and that just seems suspicious. Thanks for any insight!
I know you might want more details about the data etc, but I mostly just want to know if it is possible/correct to use vcovCL with mgcv GAMs.
Dr. Nathakhun Wiroonsri and the RxTH User Group (Thailand) are making R more accessible and appealing across industries, especially among the younger generation. Details here:
https://r-consortium.org/posts/new-r-user-group-in-thailand-is-building-awareness-of-r/
Hi! I work in marine fisheries, and we have an SQL database we reference for all of our data.
I don’t have access to JMP or SAS or anything, so I’ve been using R to try to extract… anything, really. I’m familiar with R but not SQL, so I’m just trying to learn.
We have a folder of SQL scripts for extracting different types of data (e.g., every species caught in a bag seine during a specific time frame, listing lengths as well). The only thing is, I run this code and nothing happens. I see tables imported into the Connections tab, so I assume it’s working? But there are so many frickin’ tables and so many variables that I don’t even know what to print. And when I select what I think are variables from the code, they return errors when I try to plot. I’ve watched my bosses use JMP to generate tables from data, and I’d like to do the same, but their software just lets them click and select variables. I have to figure out how to do it via code.
I’m gonna be honest, I’m incredibly clueless here, and nobody in my office (or higher up) uses R for SQL. I’m just trying to do anything, and I don’t know what I don’t know. I obviously can’t post the code and ask for help which makes everything harder, and when I go onto basic SQL in R tutorials, they seem to be working with much smaller databases. For me, dbListTables doesn’t even generate anything.
Is it possible the database is too big? Is there something else I should be doing? I’ve already removed all the comments from the SQL code since I saw somewhere else that comments could cause errors. Any help is appreciated, but I know I’ve given hardly anything to work off of. Thank you so much.
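For what it's worth, the usual R pattern here is the DBI package: open a connection, then pull data with SQL text via `dbGetQuery()`, which returns an ordinary data frame you can plot, rather than relying on the Connections tab. A self-contained sketch using an in-memory SQLite database as a stand-in (your real connection will use a different driver, host, and credentials):

```r
library(DBI)

# In-memory SQLite stands in for the real fisheries database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "catch", data.frame(
  species   = c("red drum", "spotted seatrout", "red drum"),
  length_mm = c(410, 385, 520)
))

# List the tables the connection can see
dbListTables(con)

# Run SQL (e.g. the contents of one of your .sql files, read in with
# readLines() and pasted together) and get back a plain data frame
res <- dbGetQuery(con, "SELECT species, length_mm FROM catch WHERE length_mm > 400")
res

dbDisconnect(con)
```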
I am using an R script with Julia functions to run the code. It works perfectly on my computer, but when I try to set it up in the apptainer, it gives me an error. I've created a container (ubuntu 22.04) with R and Julia installed inside with all the packages required, and upon testing it worked great. However, once I run a specific code, which calls Julia to interact with R, it gives me this error:
ERROR: LoadError: InitError: could not load library "/home/v_vl/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib/libhdf5.so"
/usr/lib/x86_64-linux-gnu/libcurl.so: version `CURL_4' not found (required by /home/v_vl/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib/libhdf5.so)
I've looked online, and it seems the main problem is that the script is using the system's lib* files instead of the ones bundled with Julia, which causes this error.
So I am trying to modify the last .def file to fix the problem, so far this is what I've added to it:
Bootstrap: localimage
From: ubuntu_R_ResistanceGA.sif
%post
# Install system dependencies for Julia
apt-get update && \
apt-get install -y wget tar gnupg lsb-release \
software-properties-common libhdf5-dev libnetcdf-dev \
libcurl4-openssl-dev=7.68.0-1ubuntu2.25 \
libgconf-2-4 \
libssl-dev
# Run ldconfig to update the linker cache
ldconfig
# Set environment variable to include the directory where the artifacts are stored
echo "export LD_LIBRARY_PATH=/home/v_vl/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib:\$LD_LIBRARY_PATH" >> /etc/profile
# Clean up the package cache to reduce container size
apt-get clean
# Install Julia 1.9.3
wget https://julialang-s3.julialang.org/bin/linux/x64/1.9/julia-1.9.3-linux-x86_64.tar.gz
tar -xvzf julia-1.9.3-linux-x86_64.tar.gz
mv julia-1.9.3 /usr/local/julia
ln -s /usr/local/julia/bin/julia /usr/local/bin/julia
# Install Circuitscape
julia -e 'using Pkg; Pkg.add("Circuitscape")'
julia -e 'using Pkg; Pkg.build("NetCDF_jll")'
%environment
export LD_LIBRARY_PATH=/home/v_vl/.julia/artifacts/2829a1f6a9ca59e5b9b53f52fa6519da9c9fd7d3/lib:$LD_LIBRARY_PATH
PS I need to run it in an apptainer because my goal is to use it on a supercomputer (ComputeCanada).
So far, I am trying to use LD_LIBRARY_PATH as a way to fix the problem, but it doesn't seem to work at all
plot_multi <- ggplot(multi_data,
                     aes(x = factor(years), y = avg, color = parameter, group = parameter)) +
  geom_line(na.rm = TRUE) +
  geom_point(na.rm = TRUE) +
  labs(title = "COD, BOD, TP, AN, NN Over Time", x = "Years", y = "Concentration (mg/L)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + # Rotate x-axis labels for readability
  scale_color_manual(values = custom_colors) + # Apply custom colors
  scale_y_break(c(5, 15), space = 0.1)
When I try to use scale_y_break() (from the ggbreak package), I get the error Error in theme[[element]] : attempt to select more than one element in vectorIndex. The scale_y_break() line is what breaks the code. Any suggestions on how to fix it? Thank you!
is there a way to remove the empty space on a coord_flip so the Name value is flush up against the columns?
library(tidyverse)

# Generate a dataset with random names and numbers
set.seed(123)  # For reproducibility
datatest <- tibble(
  Name = sample(c("Alice", "Bob", "Charlie", "David", "Eve",
                  "Frank", "Grace", "Hannah", "Ivy", "Jack"), 10),
  Value = sample(1:100, 10, replace = TRUE)
)

datatest |>
  ggplot(aes(Name, Value)) +
  geom_col() +
  coord_flip()
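If the goal is to make the names sit flush against the bars, one common fix (a sketch, not the only option) is to drop the default expansion at the zero end of the continuous scale:

```r
library(ggplot2)

set.seed(123)
datatest <- data.frame(
  Name = sample(c("Alice", "Bob", "Charlie", "David", "Eve",
                  "Frank", "Grace", "Hannah", "Ivy", "Jack"), 10),
  Value = sample(1:100, 10, replace = TRUE)
)

# expansion(mult = c(0, 0.05)) removes the padding at the zero end of the
# Value axis, so after coord_flip() the bars sit flush against the Name labels
p <- ggplot(datatest, aes(Name, Value)) +
  geom_col() +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
  coord_flip()
```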
This tutorial explains how to run DeepSeek in R. We will use the DeepSeek API, which can be used to run the latest DeepSeek model from R.
https://www.listendata.com/2025/01/how-to-use-deepseek-in-r.html
So, after 20 discussions with my promotor (supervisor), I'm starting to doubt my statistics, so I want to know which test you would use. I have blood samples from 10 patients before and after treatment, plus 26 controls. On this blood, I ran an experiment with measurements every minute for 6 minutes.
How can I look into the differences between PRE, POST, and Control? Is a linear mixed model appropriate? The fact that pre and post come from the same patients is tripping me up, as are the 6 timed measurements for each patient.
Time also influences the measurement, so I need to include it in the model/testing.
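A linear mixed model is one defensible option for this design. A sketch with lme4, assuming a long-format data frame with hypothetical columns value, group (PRE/POST/Control), time, and a patient ID; the random intercept handles both the repeated time points and the fact that PRE and POST come from the same individuals:

```r
library(lme4)

# Simulated stand-in: 10 patients measured PRE and POST, 26 controls,
# each with 6 one-minute measurements
set.seed(7)
d <- expand.grid(time = 1:6, id = 1:46)
d$group <- ifelse(d$id <= 10, "PRE",
                  ifelse(d$id <= 20, "POST", "Control"))
d$patient <- ifelse(d$group == "POST", d$id - 10, d$id)  # PRE/POST share patients
d$value <- rnorm(nrow(d), mean = 10 + 0.5 * d$time)

# Random intercept per patient; fixed effects for group, time,
# and their interaction (does the time course differ by group?)
m <- lmer(value ~ group * time + (1 | patient), data = d)
summary(m)
```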
Hi everyone, I hope you're doing well!
I’m a recent medical school graduate and I’m interested in learning R in a short period of time. I’m not aiming to become an expert, but I want to learn enough to work on simple research papers.
I’ve completed a few online courses and feel that I have a good foundational knowledge to start with. However, I’m struggling to apply what I’ve learned to a full project—how to handle a dataset from A to Z.
I’m looking for someone who can tutor me and perhaps help me with one or two projects to build my confidence and ensure I’m getting the right results. Ideally, I’d prefer someone from the medical field who understands the concepts we’d be working with. [Please, I need someone in the medical field]
Thank you in advance!
Hello, I am a 4th-year Industrial Engineering student currently working on a thesis. We will be using PLS-SEM to analyze our data and have come up with a model, but I'm having doubts about whether our model is feasible for PLS-SEM, specifically in SmartPLS. Our model has 3 dependent variables, with each dependent variable having 5 independent variables. The independent variables will be measured by 5 reflective questions. The model will be like this: DV1 -> DV2 -> DV3, with DV2 being a moderating variable. I've been anxious about the model since I have little knowledge of PLS-SEM; we were required to use the software by our university. Any help or input would be highly appreciated. Thank you so much!
Can someone please help me? I'm using the R package AeRobiology to make a violin plot, but the package just won't let me change the colour scheme. I'm so confused; it's always yellow.
pollen_calendar(data, method = "violinplot", n.types = 15,
                start.month = 1, y.start = NULL, y.end = NULL, perc1 = 80,
                perc2 = 99, th.pollen = 1, average.method = "avg_before",
                period = "daily", method.classes = "exponential", n.classes = 5,
                classes = c(25, 50, 100, 300), color = "green",
                interpolation = TRUE, int.method = "lineal", na.remove = TRUE,
                result = "plot", export.plot = FALSE, export.format = "pdf",
                legendname = "Pollen grains / m3")
Friends, I need some help. I’m writing my MBA thesis in Data Science and Analytics, and I’ve chosen to work with a golf dataset that includes several variables and the players’ placement (FINISH) at The Open, from 2008 to 2023.
My goal was to evaluate which variable(s) are the most important in predicting placement. For example, whether the average number of birdies contributes the most to a higher placement.
I started with multiple linear regression using ordinary least squares, but the assumptions weren’t met. I then moved to mixed models with an ordinal variable since FINISH is ordinal, but I didn’t get good results either. Finally, I switched to Random Forest, which is new to me, but I’m still not seeing satisfactory results based on the OOB error rate and accuracy.
I don’t really expect the model to be perfect. I believe golf performance is much more complex, with significant influence from variables not included in the dataset (individual and environmental factors). Still, I want to make sure I’ve done everything possible with my model before concluding that.
Does anyone have experience with this topic? Any suggestions? I can share what I’ve done so far, although it’s not much.
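One thing worth checking before giving up on the random forest: permutation variable importance, which directly answers the "which variable matters most" question even when overall accuracy is modest. A sketch with the randomForest package and simulated stand-in data (the column names are hypothetical, not the real dataset's):

```r
library(randomForest)

# Simulated stand-in for the golf data
set.seed(1)
d <- data.frame(
  finish  = sample(1:70, 200, replace = TRUE),
  birdies = rnorm(200, 3.5),
  driving = rnorm(200, 290, 10),
  putts   = rnorm(200, 29, 1)
)

# importance = TRUE stores permutation importance (%IncMSE for regression)
rf <- randomForest(finish ~ birdies + driving + putts,
                   data = d, importance = TRUE, ntree = 500)

importance(rf)   # per-variable importance scores
varImpPlot(rf)   # quick visual ranking
```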
Does anyone know when and why it became impossible to declare a paired t test from a formula? I'm certain it worked at this time last year. A very silly change IMO.
Does anyone use R outside of scientific research? I’ve been using it for years now for analysing pricing movements and product pricing erosion over extended periods of time, but I feel very much like an outsider. I don’t think I’ve seen any posts here (or anywhere else) outside of the scientific arena.
I’d be interested to know whether I’m alone, or if I’m just missing everything.
Has anyone else here had issues with Dr Greg Martin's R course? I paid for the course but it's impossible to access the example files.
Hey everyone,
If I wanted to create a figure like my drawing below, how would I go about grouping the x-axis so that nutrient treatment is on the x-axis, but within each group the H or L elevation in a nutrient tank is shown? This is where it gets especially tricky: I want this to be a stacked barplot where aboveground and belowground biomass are stacked on top of each other. Any help would be much appreciated, especially if you know how to add standard error bars for each type of biomass (both aboveground and belowground).
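Since ggplot2's position adjustments can't dodge and stack at the same time, a common workaround (sketched below with entirely hypothetical variable names) is to stack the two biomass types within each elevation bar and use facets for the nutrient treatments; error bars need the cumulative height of each stacked segment:

```r
library(ggplot2)
library(dplyr)

# Hypothetical summarised data: mean biomass and SE per combination
d <- expand.grid(
  nutrient  = c("N1", "N2", "N3", "N4"),
  elevation = c("H", "L"),
  layer     = c("aboveground", "belowground")
)
set.seed(3)
d$biomass <- runif(nrow(d), 5, 20)
d$se <- runif(nrow(d), 0.5, 2)

# Cumulative top edge of each stacked segment, matching ggplot's
# default stacking order (last factor level at the bottom)
d <- d %>%
  group_by(nutrient, elevation) %>%
  arrange(desc(layer)) %>%
  mutate(top = cumsum(biomass)) %>%
  ungroup()

p <- ggplot(d, aes(elevation, biomass, fill = layer)) +
  geom_col() +
  geom_errorbar(aes(ymin = top - se, ymax = top + se), width = 0.2) +
  facet_wrap(~nutrient, nrow = 1, strip.position = "bottom") +
  theme(strip.placement = "outside")
```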
R en Buenos Aires (Argentina) User Group organizer Andrea Gomez Vargas believes "...it is essential to reengage in activities to invite new generations to participate, explore new tools and opportunities, and collaborate in a space that welcomes all levels of experience and diverse professional backgrounds."
Exceptional!
I am hoping that someone here can provide some help for me as I have completely struck out looking at other sources. I am currently writing script to process and compute case break odds for Topps Baseball cards. This involves using Bernoulli distributions but I couldn't get the RLab functions to work for me so I wrote a custom function to handle what I needed. The function basically computes the chance of a particular number of outcomes happening in a given number of trials with a constant rate of odds. It then sums the amounts to return the chance of hitting a single card in a case. I have tested the function outside of mutate and it works without issue.
```{r helper_functions}
caseBreakOdds <- function(trials, odds){
  mat2 <- numeric(trials + 1)
  for (i in 0:trials) {
    mat2[i + 1] <- (factorial(trials) / (factorial(i) * factorial(trials - i))) * (odds^i) * ((1 - odds)^(trials - i))
  }
  hit1 <- sum(mat2[2:(trials + 1)])
  return(hit1)
}
```
Now when I run the chunk meant to compute the odds of pulling a card for a single box, I run into issues. Here is the code:
```{r hobby_odds}
packPerHobby <- 20
boxPerCase <- 12

hobbyOdds <- cleanOdds %>%
  select(Card, hobby) %>%
  separate_wider_delim(cols = hobby,
                       delim = ":",
                       too_few = "align_start",
                       too_many = "merge",
                       names = c("Odds1", "Odds2")) %>%
  mutate(Odds2 = as.numeric(gsub(",", "", Odds2))) %>%
  mutate(packOdds = ifelse(Odds2 >= (packPerHobby - 1), 1/Odds2, packPerHobby/Odds2)) %>%
  mutate(boxOdds = ifelse(Odds1 == "-", "", caseBreakOdds(packPerHobby, packOdds)))
```
This chunk is meant to take the column of pack odds and run them through the caseBreakOdds function. Yet when I do, it computes the odds for the first row of my data frame and then just copies that value down the boxOdds column.
I am at a loss here. I have been spending the last couple hours trying to figure this out when I expect it's a relatively easy fix. Any help would be appreciated. Thanks.
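The likely culprit: caseBreakOdds() isn't vectorized. mutate() passes the entire packOdds column in one call; inside the loop, assigning a whole vector into mat2[i+1] keeps only the first element (with a warning), so the result for the first row gets recycled down the column. Two possible fixes, sketched below: apply the function row-wise (mapply(), Vectorize(), or purrr::map2_dbl()), or note that summing the probabilities of 1..trials hits collapses algebraically to 1 - (1 - odds)^trials, which is vectorized by construction:

```r
caseBreakOdds <- function(trials, odds) {
  # P(at least one hit) = 1 - P(zero hits); equivalent to summing the
  # binomial probabilities over 1..trials, and vectorized over odds
  1 - (1 - odds)^trials
}

# Inside mutate(), the vectorized form above works directly; alternatively
# the original scalar function could be applied row by row:
# mutate(boxOdds = mapply(caseBreakOdds, packPerHobby, packOdds))

# Sanity check against the original loop-based version for scalar input
orig <- function(trials, odds) {
  mat2 <- numeric(trials + 1)
  for (i in 0:trials) {
    mat2[i + 1] <- choose(trials, i) * odds^i * (1 - odds)^(trials - i)
  }
  sum(mat2[2:(trials + 1)])
}
c(orig(20, 0.01), caseBreakOdds(20, 0.01))
```

One more thing to watch: ifelse(Odds1 == "-", "", ...) coerces boxOdds to character; using NA_real_ instead of "" keeps the column numeric.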
Hi. I'd like to make a standalone shiny app, i.e. one which is easy to run locally and does not need to be hosted. Potential users have a fairly low technical base (otherwise I would just ask them to run the R code in the R terminal). I know that it's not really possible to do this as R is not a compiled language, and workarounds involving Electron / Docker look forbiddingly complex, and probably not feasible.

A possible workaround I was thinking of is (a) ask users to install R on their laptops, which is fairly straightforward, and (b) create an application (exe on Windows, app on Mac) which will launch the R code without the worry of compiling dependencies, because R is pre-installed. Python could be used for this purpose, as I understand it can be compiled.

Just checking if anyone had any thoughts on the feasibility of this before I spend hours trying to ascertain whether this is possible. (NB the shiny app is obviously dependent on a host of libraries. These would be downloaded and installed programmatically in the R script itself. Not ideal, but again, relatively frictionless for the user.) Cheers.
Hey all,
Does anyone have resources/code for creating a stacked bar plot where there are 4 treatment categories on the x axis, but within each group there is a high elevation and low elevation treatment? And the stacked elements would be "live" and "dead". I want something that looks like this plot from GeeksforGeeks but with the stacked element. Thanks in advance!
I created a data frame which includes some rows where there is a quote:
testcsv <- data.frame(x = c("a","a,b","\"quote\"","\"frontquote"))
The output looks like this:
| x |
|---|
| a |
| a,b |
| "quote" |
| "frontquote |
I exported it to a file using fwrite():
fwrite(testcsv,"testcsv.csv",quote = T)
When I imported it back into R using this:
fread("testcsv.csv")
there are now extra quotes for each quote I originally used:
| x |
|---|
| a |
| a,b |
| ""quote"" |
| ""frontquote |
Is there a way to fix this either when writing or reading the file using data.table? Adding the argument quote = "\"" does not seem to help. The problem does not appear when using read.csv() or arrow::read_csv_arrow().
My project focuses on exploring the comorbidity patterns of disease A using electronic medical records data. In a previous project, we identified around 30 comorbidities based on diagnosis/lab test/medication information. In this project, we aim to analyze how these comorbidities cluster with each other using exploratory factor analysis (via the psych package) and examine the mediation effect of disease B in disease A development (using the lavaan package). I currently have the following major questions:
Thank you so much for your help!
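For reference, the two-step workflow described above (EFA via psych, mediation via lavaan) might be sketched like this; the six comorbidity indicators and the X/M/Y variables are simulated placeholders, not the actual EMR variables:

```r
library(psych)
library(lavaan)

# Simulated stand-in for binary comorbidity indicators
set.seed(9)
d <- as.data.frame(matrix(rbinom(500 * 6, 1, 0.3), ncol = 6,
                          dimnames = list(NULL, paste0("com", 1:6))))

# Exploratory factor analysis of how comorbidities cluster
# (for binary indicators, tetrachoric correlations via cor = "tet"
# are often worth considering)
efa <- fa(d, nfactors = 2, rotate = "oblimin")

# Mediation: disease B (M) on the path from exposure (X) to disease A (Y)
d$X <- rnorm(500)
d$M <- 0.4 * d$X + rnorm(500)
d$Y <- 0.3 * d$M + 0.2 * d$X + rnorm(500)

model <- '
  M ~ a * X
  Y ~ b * M + c * X
  indirect := a * b
  total    := c + a * b
'
fit <- sem(model, data = d)
parameterEstimates(fit)
```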