/r/rstats
A subreddit for all things related to the R Project for Statistical Computing. Questions, news, and comments about R programming, R packages, RStudio, and more.
PLEASE READ THIS BEFORE POSTING
Welcome to /r/rstats - the subreddit for all things R (the programming language)!
For code problems, Stack Overflow is a better platform. For short questions, the Twitter #rstats tag is a good place. For longer questions or discussions, RStudio Community is another great resource.
If your account is new, your post may be automatically flagged and removed. If you don't see your post show up, please message the mods and we'll manually approve it.
You can also check out our sister sub /r/Rlanguage
Anyone know of an open-source (free) API to access historical polling data?
I've cobbled together a function that changes a date falling on a weekend day to the next Monday. It seems to work, but I'm sure there is a better way; the use of sapply() bugs me a little bit. Any suggestions?
Input: date, a vector of dates
Output: a vector of dates, where all dates falling on a Saturday/Sunday are adjusted to the next Monday.
adjust_to_weekday <- function(date) {
  # Days to add: 2 for Saturday, 1 for Sunday, 0 for weekdays
  adj <- sapply(weekdays(date), \(d) {
    dplyr::case_when(
      d == "Saturday" ~ 2,
      d == "Sunday"   ~ 1,
      TRUE            ~ 0
    )
  })
  date + lubridate::days(adj)
}
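(For comparison, a hedged vectorized sketch that avoids sapply() entirely: lubridate::wday() with week_start = 1 returns the day of week as a number, Monday = 1 through Sunday = 7, so case_when() can run on the whole vector at once.)

adjust_to_weekday_vec <- function(date) {
  # wday() with week_start = 1: Saturday is 6, Sunday is 7
  wd  <- lubridate::wday(date, week_start = 1)
  adj <- dplyr::case_when(wd == 6 ~ 2, wd == 7 ~ 1, TRUE ~ 0)
  date + lubridate::days(adj)
}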
Hi all,
Do people still use the FedData R package even though it was removed from CRAN?
I appreciated the access to NLCD rasters and developed a few workflows with it that I thought were pretty good!
Should I spend time looking for a workaround to the FedData package, or is the archived version robust enough to keep using?
Thanks
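(In case it helps: packages archived from CRAN can usually still be installed from the CRAN archive or from a development repository. A hedged sketch with the remotes package - the version number and GitHub path below are assumptions, so check the CRAN archive and the package's actual home first:)

install.packages("remotes")
# Version is a placeholder; see cran.r-project.org/src/contrib/Archive/FedData/
remotes::install_version("FedData", version = "3.0.4")
# Or, if development continues on GitHub (repository path is an assumption):
remotes::install_github("ropensci/FedData")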
Hey guys, I am new to R and have a question about whether this is possible. I compare two medications, a and b, and want to show in a forest plot which one is better. The problem is that I have studies that compare a and b directly, and some that compare a or b with a placebo/sham. So I guess a network meta-analysis is the right thing to do. Do you have a script that would do this? Thanks so much!
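(A hedged starting-point sketch with the netmeta package; the column names are assumptions about a data layout with one row per pairwise comparison:)

library(netmeta)
# dat is assumed to have: TE (effect estimate), seTE (its standard error),
# treat1/treat2 (the two arms compared), and studlab (study label)
nm <- netmeta(TE = TE, seTE = seTE, treat1 = treat1, treat2 = treat2,
              studlab = studlab, data = dat, sm = "OR",
              reference.group = "Placebo")
forest(nm)  # forest plot of each treatment against the reference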
ggplot() +
  geom_polygon(data = states, aes(x = long, y = lat, group = group),
               fill = "white", color = "black") +
  filter(flights3, dest != "ANC", dest != "HNL") +
  geom_point(data = flights3, aes(x = lon, y = lat, color = avg_delay_mean)) +
  coord_map()
This code keeps giving me the error:
" Warning: Incompatible methods ("+.gg", "Ops.data.frame") for "+"
Error in ggplot() + geom_polygon(data = states, aes(x = long, y = lat, :
non-numeric argument to binary operator"
I'm not sure what I am doing wrong :(
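(For anyone hitting the same error: the filter() call is being added into the ggplot chain with +, and + on a ggplot only accepts layers, not data frames - hence "Ops.data.frame". A sketch of one way to restructure it, assuming the column names above:)

library(dplyr)
library(ggplot2)

# Filter the data first instead of adding filter() as a "layer"
flights3_filtered <- filter(flights3, !dest %in% c("ANC", "HNL"))

ggplot() +
  geom_polygon(data = states, aes(x = long, y = lat, group = group),
               fill = "white", color = "black") +
  geom_point(data = flights3_filtered,
             aes(x = lon, y = lat, color = avg_delay_mean)) +
  coord_map()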
Hi there, I am working on a research project and I need to calculate the distance between the geographic location of a town's city center and its MLB stadium. I have lat/longs for every ballpark and city center that I need, but I don't know a good package to use. It would also be great if I didn't have to enter them individually, as I am calculating the distance for dozens of observations.
Does anyone know an efficient way to do this?
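(A hedged sketch with the geosphere package; distHaversine() is vectorized over rows, so paired coordinates can be done in one call. The column names are assumptions about your data:)

library(geosphere)

# df is assumed to have one row per team, with lon/lat for both points
df$dist_m <- distHaversine(
  p1 = cbind(df$city_lon, df$city_lat),        # note: longitude first
  p2 = cbind(df$stadium_lon, df$stadium_lat)
)
df$dist_km <- df$dist_m / 1000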
TL;DR - I am trying to create a nested tryCatch, but the error I intentionally catch in the inner tryCatch is also being caught by the outer tryCatch unintentionally. Somewhat curiously, this seems to depend on the kind of error. Are there different kinds of errors and how do I treat them correctly to make a nested tryCatch work?
I have a situation where I want to nest two tryCatch() blocks without letting an error condition of the inner tryCatch() affect the execution of the outer one.
Some context: In my organization we have an R script that periodically runs a dummy query against a list of views in our data warehouse. We want to detect the views that have a problem (e.g., they reference tables that have been deleted since the view's creation). The script looks something like this:
con_prd <- DBI::dbConnect(...)
vectorOfViews <- c("db1.sampleview1", "db2.sampleview2", "db3.sampleview3")
checkViewErrorStatus <- function(view, connection) {
  tryCatch({
    DBI::dbGetQuery(
      conn = connection,
      paste("EXPLAIN SELECT TOP 1 1 FROM", view))
    return("No error")
  },
  error = function(e) {
    return(e)
  })
}
vectorOfErrors <- purrr::map_chr(vectorOfViews, checkViewErrorStatus, connection = con_prd)
results <- data.frame(viewName = vectorOfViews, errorStatus = vectorOfErrors)
DBI::dbWriteTable(
  con_prd,
  DBI::SQL("mydb.table_with_view_errors"),
  results,
  append = TRUE,
  overwrite = FALSE)
Instead of running this script directly, we run it from a wrapper Rmd file on our server. The purpose of the wrapper Rmd file, which is used for all of our R scripts, is to create error logs when a script doesn't run properly.
tryCatch({
  source("checkViewsScript.R")
},
error = function(e) {
  createErrorLog()
})
When checkViewErrorStatus() inside checkViewsScript.R catches an error, that is intended - that's why I am using a tryCatch() in that function. However, when something else goes wrong, for example when DBI::dbConnect() fails for some reason, then that's a proper error that the outer tryCatch() should catch. Unfortunately, any error inside checkViewsScript.R bubbles up and gets caught by the outer tryCatch(), even if that error was already caught by another tryCatch() inside a function.
Here is the weird thing though: When I try to create a nested tryCatch() using stop() it works without any issues:
tryCatch(
  {
    message("The inner tryCatch will start")
    tryCatch({
      stop("An inner error has occurred.")
    }, error = function(e) {
      message(paste("Inner error msg:", e))
    })
    message("The inner tryCatch has finished.")
    message("The outer error will be thrown.")
    stop("An outer error has occurred.")
    message("The script has finished.")
  },
  error = function(ee) { message(paste("Outer error msg:", ee)) }
)
The inner tryCatch will start
Inner error msg: Error in doTryCatch(return(expr), name, parentenv, handler): An inner error has occurred.
The inner tryCatch has finished.
The outer error will be thrown.
Outer error msg: Error in doTryCatch(return(expr), name, parentenv, handler): An outer error has occurred.
When I look at the error thrown by DBI::dbGetQuery() I see the following:
List of 3
$ message : chr "nanodbc/nanodbc.cpp:1526: 42S02: [Teradata][ODBC Teradata Driver][Teradata Database](-3807)Object 'XXXESTV_LAB_"| __truncated__
$ call : NULL
$ cppstack: NULL
- attr(*, "class")= chr [1:4] "nanodbc::database_error" "C++Error" "error" "condition"
By contrast, an error created through stop() looks like this:
> stop("this is an error") %>% rlang::catch_cnd() %>% str
List of 2
$ message: chr "this is an error"
$ call : language force(expr)
- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
So here is my question: Are there different types of errors? Is it possible that some errors bubble up when using a nested tryCatch whereas others don't?
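(A minimal sketch of how condition classes interact with handler dispatch - the class vector below mimics the nanodbc error above. tryCatch() matches handlers by the condition's class vector, so a handler can target a specific error class while a plain `error` handler catches everything inheriting from "error":)

cnd <- structure(
  class = c("nanodbc::database_error", "C++Error", "error", "condition"),
  list(message = "db error", call = NULL)
)
tryCatch(
  stop(cnd),
  `nanodbc::database_error` = function(e) message("caught db error: ", conditionMessage(e)),
  error = function(e) message("caught generic error: ", conditionMessage(e))
)
#> caught db error: db error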
Hi rstats,
I'm running different 'repeated events' cox models on some data, and I need some help with interpretation.
Using coxph() from the survival package, I can fairly easily obtain 95% confidence intervals, and I can run cox.zph() and/or plot residuals to see if and how badly I am violating the proportional hazards assumption. I am using coxph() to run several 'flavours' of repeated-events models (I have a reason to do all of these: I favour a stratified(?) frailty model to answer my research question, someone else would like to use the PWP gap-time model to answer a different question, etc.).
However, according to the survival package documentation, to run frailty aka random-effects models I should use coxme() instead, apparently for computational reasons. And I believe it - my machine didn't like it much!
So using coxme() is fine, and I get back the coefficients, hazard ratios, standard errors, etc. But firstly, is there a way to extract confidence intervals from coxme(), or is that a really dumb thing to ask? Secondly, I guess I can plot residuals to visually check whether I'm violating assumptions? And is there a way I should be interpreting ranef()? A giant printout of the matrix with [level of random effect] and [value] doesn't mean anything to me.
Many thanks in advance for helping out a physiologist who is trying their best :)
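(In case it's useful: a hedged sketch of Wald-type 95% CIs from a coxme fit, assuming `fit` is the fitted coxme object; coxme provides fixef() and vcov() methods for the fixed effects:)

library(coxme)
beta <- fixef(fit)
se   <- sqrt(diag(vcov(fit)))
cbind(HR    = exp(beta),
      lower = exp(beta - 1.96 * se),
      upper = exp(beta + 1.96 * se))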
This is a beta release of R in Obsidian, a plugin to run R inside Obsidian. It's very much in development and unstable - it has only been validated on my MacBook, and on Windows 11 running on my MacBook. Installation instructions are in the GitHub readme. Looking forward to all your crash/issue reports so I can make this better.
I am struggling to download R and RStudio on my Mac with the Sequoia update. I have tried numerous versions, and it keeps saying they can't be installed on this computer. Any help would be welcome. Thanks!
So I remember that Nvidia's CUDA has been the standard for PyTorch for a few years now, but I'm just wondering what your experience has been recently with reticulate-related R coding. Beyond that, what are some options without using reticulate?
I am using the lavaan package to do an SEM, or really a path analysis (I think that is the better term, given I am not working with any latent factors, but I am open to using CFA/latent factors). N = 189. My data is as follows:
All effects of diet attenuated in my most conservative model with all 5 covariates. I am wondering if I should consider a different approach to how I model/introduce them. I also have a range of other data points around health and demographics. I'm kind of lacking in specialised support at the moment and feeling like I need something outside of the box!
model_adjusted2 <- '
  # Mediator model
  prefx_mca_avg ~ plant.diet + meat.diet + western.diet + age + sex + educationtotal + totalmvpa + kJwithDF

  # Outcome models: mediator + direct diet paths, plus the same covariates
  LongTermMem  ~ prefx_mca_avg + plant.diet + meat.diet + western.diet + age + sex + educationtotal + totalmvpa + kJwithDF
  ProcSpeed    ~ prefx_mca_avg + plant.diet + meat.diet + western.diet + age + sex + educationtotal + totalmvpa + kJwithDF
  ExecFunc     ~ prefx_mca_avg + plant.diet + meat.diet + western.diet + age + sex + educationtotal + totalmvpa + kJwithDF
  ShortTermMem ~ prefx_mca_avg + plant.diet + meat.diet + western.diet + age + sex + educationtotal + totalmvpa + kJwithDF

  # Covariances among diet variables
  plant.diet ~~ meat.diet + western.diet
  meat.diet  ~~ western.diet
'
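(A hedged sketch of fitting and inspecting the model above, in case the estimation setup matters for suggestions; the data-frame name and estimator/missing-data choices are assumptions:)

library(lavaan)
fit_adj2 <- sem(model_adjusted2, data = df, estimator = "MLR", missing = "fiml")
summary(fit_adj2, fit.measures = TRUE, standardized = TRUE, rsquare = TRUE)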
Having trouble getting this to work properly, as I'm not getting the correct response. Has anyone worked with the WillyWeather API in R before?
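(A generic request sketch with httr2, in case it helps isolate the problem; the endpoint path and query parameters below are assumptions - check WillyWeather's API docs for the real ones:)

library(httr2)

resp <- request("https://api.willyweather.com.au/v2") |>
  req_url_path_append("YOUR-API-KEY", "locations", "1234", "weather.json") |>  # placeholders
  req_url_query(forecasts = "weather", days = 1) |>                            # assumed params
  req_perform()

str(resp_body_json(resp), max.level = 2)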
Has anyone managed to run a PCA on multiple imputed datasets using MICE?
library(mice)
mice.dat <- mice(dat[-1], m = 50, seed = 27)  # 50 imputed datasets
prcomp(complete(mice.dat))                    # PCA on the first completed dataset
This code works, but mice.dat includes more variables than the ones I want to use in the PCA - a lot of variables were included just as auxiliaries for the MI. Does anyone know how to make this work?
I also want to extract participant-level scores afterwards, so I'm not just concerned with averaging the loadings.
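(A hedged sketch: restrict each completed dataset to the analysis variables before prcomp(); the names in pca_vars are placeholders for your columns:)

library(mice)
pca_vars  <- c("var1", "var2", "var3")            # placeholders: your PCA variables
completed <- complete(mice.dat, action = "all")   # list of all m completed datasets
pca_fits  <- lapply(completed, \(d) prcomp(d[pca_vars], scale. = TRUE))
head(pca_fits[[1]]$x)                             # participant-level scores, 1st imputation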
Hi, I am trying to learn machine learning and have been reading on it, but I always found I learn better by doing projects. Is anyone else learning and wanting to collaborate? I would love to do a collaborative project with someone. Would love to have fun with it. Please DM me if you are interested. Doing it purely for fun and trying to learn ML. Let me know!
Hi there,
I have a dataset that consists of 120 survey points, each of which was surveyed 4 times in a year (so 480 total surveys). We recorded the presence or absence of species in every 3-minute interval of a 15-minute survey. I am interested in determining the accumulated species total after each 3-minute interval. For example, if in Survey 1 I found American Crow, Robin, and Goldfinch, my accumulated species richness would be 3. If in Survey 2 I found an American Crow, Goldfinch, and Chickadee, my accumulated richness would be 4, etc. I just need a function that gives me the accumulated total of species detected after each subsequent 3-minute interval. Any help would be greatly appreciated.
(I have tried using the vegan package, and that seems to only give extrapolations; I want the actual count from my data.)
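(A hedged sketch with dplyr; the column names are assumptions about a long-format table with one row per detection. The trick is that cumsum(!duplicated(species)) counts each species only the first time it appears:)

library(dplyr)

detections |>
  arrange(point_id, survey, interval) |>
  group_by(point_id) |>
  mutate(cum_richness = cumsum(!duplicated(species))) |>  # running count of new species
  ungroup()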
Hello all, I am a grad student in psychology learning structural equation modeling in R right now. I like learning with other people, since comprehension is so much better when you are discussing and explaining things. It is also quite helpful to keep each other accountable and motivated. So I am looking for a study buddy. I have done something similar before and it worked out fantastically.
Here is a rough idea of how we could go about doing this (it is just a first idea, and we can make adjustments as you like):
Write a message if you are interested in working together. :)
I'm studying structural breaks, and I have identified a break in my series. I'd like to find the slope of the linear equation (y ~ x) before and after the break such that the fitted values are joined at the break. I tried fitting linear regressions independently for each segment, and using interactions (with and without main effects), but that just gives me a different "starting point" for each segment, which I want to avoid. I'm thinking of something similar to the Joinpoint software, but for the case where I already know where the breaks are.
Any ideas or suggestions would be greatly appreciated!!
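(A hedged sketch: with a known break at x = b, a continuous piecewise-linear fit can be written as y ~ x + pmax(x - b, 0). The segments are forced to join at b, and the pmax term's coefficient is the change in slope at the break. Object names are placeholders:)

b   <- 2010                                    # placeholder: the known break point
fit <- lm(y ~ x + pmax(x - b, 0), data = dat)
coef(fit)
# slope before the break: coef(fit)["x"]
# slope after the break : coef(fit)["x"] plus the pmax(...) coefficient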
To make things more clear, my use case is this:
My partner's job requires her to write a lot of reports for a lot of people that are based on a variety of metrics specific to them.
These reports are pretty predictable. For example, say metric A has a range of values from 1 to 9; everyone scoring between 1 and 3 gets the same paragraph written about it in their report, likewise for people scoring between 4 and 6, and for people scoring between 7 and 9.
So, if you know their score, you know what paragraph needs to be inserted into the report.
I'd like to be able to tell RMarkdown, or whatever other program, "if Metric A is in this range, display this text", for each of the metrics (though the logic will actually be a bit more complex than that), and generate a complete, professional report.
(Before you start to worry, she would, of course, review all of them and make any necessary modifications to the reports before ever sending them out; this is just meant to make writing the report easier, not to replace doing it entirely)
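(A hedged sketch of one way to do this with R Markdown parameters; the thresholds, file names, and column names are all placeholders:)

# In the .Rmd template, declare `params: score: 5` in the YAML header and
# insert text with an inline chunk: `r paragraph_for(params$score)`
paragraph_for <- function(score) {
  if (score <= 3) {
    "Lower-range paragraph goes here."
  } else if (score <= 6) {
    "Middle-range paragraph goes here."
  } else {
    "Upper-range paragraph goes here."
  }
}

# Then render one report per person:
for (i in seq_len(nrow(people))) {
  rmarkdown::render("report_template.Rmd",
                    params      = list(score = people$score[i]),
                    output_file = paste0("report_", people$name[i], ".docx"))
}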
Hello everybody, I am a freshman in college and currently a business major. I find the classes in the business department boring, and I have been attending a sports analytics club and have spoken with the professor of the program; it sounds like something I'd love to work in.
However, I have absolutely no idea how to code or anything like that. As a complete newbie I would love to learn R so I can get a step ahead (if I transfer into the sports analytics program) and do my own projects for fun, mainly focusing on hockey.
I would love to hear any advice on how to learn R, or whether I should go about learning another coding language for sports analytics.
I would also love to hear any advice on how I could make 'projects' in my free time just for fun. Thank you for all of your help in advance!!
Hi,
PhD student with some mixed stats knowledge here. I’ll keep things simple for conciseness, but can provide more details if needed.
I am analysing a small dataset of microbiome samples (n = 18). Six of the samples are from one individual; the rest are from unique individuals. No individuals are related.
Each sample contains abundance counts for 704 species of bacteria. It also contains a factor for sampling method - there were 3 ways of collecting the sample.
We are interested in how sampling method may bias the results.
I added an "individual" tag as a factor, such that the 6 samples from the one individual share the same tag and everyone else has a unique tag.
The purpose of the PERMANOVA was to see what proportion of the variance in microbial structure was explained by sampling method, individual, or both. The results were as follows:
Sampling method alone explained 22% of the variation
Individual identity alone explained 95% of the variation
Together, both explained 99% of the variation
Clearly, individual variation is driving the microbial community structure here, more than sampling method.
My question is about the overlap in the variance explained. When combined, the total explained is less than the sum of the parts, which would be over 100%.
I want to know what I can say about the overlap, if anything, and which further tests I could do to quantify it.
It’s worth noting there is some correlation between sampling method and individual identity. Of the three sampling methods, one method was exclusively applied to the individual who was sampled 6 times.
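(A hedged sketch of how the shared fraction can be probed with vegan; object and column names are assumptions. Comparing the marginal fits with the two sequential orderings shows how much explained variance the two factors share:)

library(vegan)
d <- vegdist(abund, method = "bray")   # abund = samples x species count matrix

adonis2(d ~ samp_method + individual, data = meta, by = "margin")  # unique contributions
adonis2(d ~ samp_method + individual, data = meta, by = "terms")   # sequential: method first
adonis2(d ~ individual + samp_method, data = meta, by = "terms")   # sequential: individual first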
I'm fairly new to using R, and curious whether there is a package available to analyze weight-adjusted population data using factor analysis (1,000 bootstrap weights). My experience is primarily with Stata, where factor analysis isn't available via the survey commands.
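(A hedged sketch of a design-weighted factor analysis via a weighted correlation matrix; variable and weight names are placeholders, and getting bootstrap-replicate standard errors would still mean looping over the 1,000 weight sets:)

library(psych)

vars <- c("x1", "x2", "x3", "x4")                           # placeholder variables
wc   <- cov.wt(dat[vars], wt = dat$main_weight, cor = TRUE) # weighted correlation matrix
fit  <- fa(r = wc$cor, nfactors = 2, n.obs = nrow(dat))
fit$loadings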
Hi there. I have a research project due in a few weeks and I'm finding it really hard to work out R. I know how to do what I need with the data extraction and cleaning in Excel, but they want us to use R for significance calculations and possible predictive modelling.
I have a huge worldwide data set, but I only need information about 2 regions (Region A and Region B) which are named in column 1. In column 3 is the total number of deceased people per incident (row). Each incident also has a recorded date of when it happened. There's over 35k rows of data.
I need the total number of deaths per region per year. Can anyone help me code this? If I were doing this in Excel I'd use a COUNTIFS, but I don't know how to translate that to R.
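(A hedged sketch with dplyr - the R analogue of a SUMIFS/COUNTIFS across region and year; the column names are assumptions about your data:)

library(dplyr)
library(lubridate)

deaths_by_region_year <- dat |>
  filter(region %in% c("Region A", "Region B")) |>
  mutate(year = year(as.Date(incident_date))) |>
  group_by(region, year) |>
  summarise(total_deaths = sum(deaths, na.rm = TRUE), .groups = "drop")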
I'm using RStudio Cloud (now named Posit Cloud), and I have a project where I used up all the RAM, so I upgraded my account. My allocated memory went from 1 to 3 GB, and now I have up to 4 CPUs??? Can anyone give me a basic/intermediate explanation of what exactly this means? Is it just memory? Thank you.
I have a project due in three weeks, and I’m a beginner in R and RStan, so I am unable to complete the assignment. If someone with a strong background in these can help, please let me know. I am willing to pay a reasonable price for their help.