/r/Rlanguage
We are interested in implementing the R programming language for statistics and data science.
R resources:
Comprehensive R Archive Network
Swirl: Learning R with interactive lessons within the R console
I'm trying to emit a SELECT from_unixtime(ts)
like so:
tbl(db, "table") |> select(ts=dbplyr::sql_expr(from_unixtime(ts)))
but I get this error:
`dbplyr::sql_expr(from_unixtime(ts), db)` must be numeric or character, not a <sql/character> object.
The dbplyr docs don't really explain how to use sql_expr() in connection with other dplyr functions.
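For what it's worth, dbplyr leaves functions it doesn't recognize untranslated, so one common pattern (a sketch, assuming a backend that understands from_unixtime) is to call the SQL function directly inside mutate() and let it pass through verbatim:

library(dplyr)

# from_unixtime() is unknown to dbplyr, so it is emitted as-is in the SQL
tbl(db, "table") |>
  mutate(ts = from_unixtime(ts)) |>
  select(ts)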
Hi all,
Real-life example: a bunch of "tools" is executing a bunch of "jobs". Each job is either a production or a maintenance job. I need to flag each production job that was followed in time by a maintenance job. This sample does what I want:
library(tidyverse)
library(plyr)  # ddply() below comes from plyr, which the tidyverse does not attach
jobs <- as_tibble(read.table(textConnection("
tool time is_maintenance
1 1 0
1 2 0
1 3 1
1 4 0
2 1 0
2 2 0
2 3 0
2 4 0
"), header=T))
jobs.1 <- ddply(jobs, "tool", function(x) {
# sort by time so we can know what the "next" job on a particular
# tool is
x <- x[order(x$time),]
# "next_maintenance" is "is_maintenance" shifted one up
x$next_maintenance <- c(x$is_maintenance[2:nrow(x)], NA)
x
})
print(jobs.1)
jobs.1
is a data frame with an additional column next_maintenance
that flags whether the next job is a maintenance job. (Of course, due to the stupidity of R's inclusive subscripting of 1-indexed sequences, this will break if some tool ran fewer than 2 jobs, but I'll let that slide for the moment.)
This works well enough but doesn't seem to be the preferred method in 2024. I've found nothing in the tidyverse documentation that resembles this workflow:
1. Chop the data frame into groups
2. Do some arbitrary stuff with each group, yielding new data (tibbles) with possibly additional or fewer rows and/or columns than the original
3. Join the group results row-wise
It's the "arbitrary" part of 2) that I'm having trouble finding information on because tidyverse seems to be focused on summarizing groups rather then creating new, row-wise data.
Hey r/Rlanguage ,
I want to share a project I've been working on: a platform to develop and share Shiny apps. I'd greatly appreciate it if you gave it a try and shared your feedback!
Features
Limitations
Feedback
Let me know if you have any suggestions, feature requests, or any issues; I'll be happy to help!
I have a dataset of zip codes and want to highlight all zips that are adjacent to those listed in the dataset. I actually want to do this one more time, so that there is a collar two zip codes thick around all listed zips. How would I do this? I'm having trouble getting started.
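A possible starting point, assuming ZCTA polygons from the tigris package and sf for the adjacency test; my_zips is a placeholder for the vector of listed zip codes, and the column name depends on the ZCTA vintage:

library(sf)
library(tigris)

zips <- zctas(year = 2020)          # all US ZCTA polygons (large download)
listed <- zips[zips$ZCTA5CE20 %in% my_zips, ]

# first collar: every ZCTA touching a listed one
ring1 <- zips[lengths(st_touches(zips, listed)) > 0, ]

# second collar: repeat against listed + first collar
ring2 <- zips[lengths(st_touches(zips, rbind(listed, ring1))) > 0, ]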
Hello everyone,
I've started working with the demography package in R and I have some questions. I want to apply certain models provided by this library, but I'm not sure what type of data these models require. As I understand it, I need to have the data in a demogdata object. The only thing I found was how to create a demogdata object by importing data from text files. However, I had to make several data transformations, so my fully prepared data is now in a data frame (I have several columns: age group, year, population, and fertility rate).
My question is: how can I convert my data frame to a demogdata object to use with these forecasting methods?
Thank you in advance.
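A hedged sketch of one way to do this with demography::demogdata(), which wants matrices of rates and populations (ages in rows, years in columns); the column names age_group, year, population, and fert_rate are placeholders for your data frame, and ages must be numeric:

library(demography)
library(tidyr)
library(tibble)

rate_mat <- df |>
  pivot_wider(id_cols = age_group, names_from = year, values_from = fert_rate) |>
  column_to_rownames("age_group") |>
  as.matrix()

pop_mat <- df |>
  pivot_wider(id_cols = age_group, names_from = year, values_from = population) |>
  column_to_rownames("age_group") |>
  as.matrix()

fert <- demogdata(
  data  = rate_mat,                    # rates: ages x years
  pop   = pop_mat,                     # populations: ages x years
  ages  = sort(unique(df$age_group)),
  years = sort(unique(df$year)),
  type  = "fertility",
  label = "My population",
  name  = "female"
)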
I'm on Windows 10.
If I try to install ggplot2 with install.packages("ggplot2"), I get several errors about dependencies. If I write library(ggplot2), I get "Error in library(ggplot2) : there is no package called ‘ggplot2’". My R version is 3.6.1 and I'm using RStudio through Anaconda.
Error on installation:
Warning in install.packages :
unable to access index for repository https://cran.rstudio.com/bin/windows/contrib/3.6:
cannot open URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/PACKAGES'
If I go to C:\Users\myusername\AppData\Local\Temp\RtmpyofeSw\downloaded_packages, I can see ggplot2_3.5.1.tar.gz.
It's a fresh installation, so I don't know what is happening.
Edit: Yeah, it was the R and/or RStudio version. I was using whatever Anaconda had installed, but I uninstalled that one, installed R and RStudio myself, and now it works. Thanks to everyone!
Hi there, so my tasks with R concern primarily importing data and forming graphs (I have a MacBook). It's mainly statistics for public administration. I'm very much an amateur, and so is everyone in my class. We have calculated assignments, but I think I'm kind of losing it somewhere and falling behind. A midterm is approaching, so I would really appreciate someone knowledgeable and willing to help/guide me through this. Thank you in advance :)
So I finished my bachelors in sociology this year and now looking for jobs in data analysis. I’ve been using R throughout college for various research projects and have always relied on using chatgpt or googling how to do stuff because I’ve always had trouble memorising the exact syntax for what I’m trying to do. I am quite familiar with the statistical concepts behind what I’m doing and can analyse and interpret the results but whenever it comes to actual coding I still heavily rely on looking up the syntax or telling chatgpt what I need to do. I tried memorising the syntax but I always forget a special character here or a comma there and my output results in errors.
So my question is do other people have this issue or do people really memorise all the syntax including all special characters?
I’m sorry if this is kind of a dumb question but I have an interview coming up and I’ve been practicing using R but I keep running into the same problem.
Any advice or opinions are appreciated.
Hey y'all! First-time poster on Rlang. I'm working with a friend on a mapping project. Neither of us is a professional in the industry, but I have some experience in JavaScript (TS) and Rust, and my partner in this is pretty proficient in R and GIS (he went to school for it).
I'm hoping we can put our R scripts into a serverless function to avoid heavy, custom environments on the static server. I came across this git repo (https://github.com/UI-Research/lambda-r-demo) that embeds R in a Python function, using a version of rpy2 and handlr to create a Python-based handler and deploy it to Lambda. I'm finding that, even though the repo was posted last year, a few of the dependencies are unavailable (EPEL v7). I dug pretty deep into the Dockerfile trying to get this particular implementation to work, but 8 hours later I'm at the bar posting on Reddit about it. I'm not attached to this particular implementation, but it seemed to make sense. Our project is a JavaScript-based mountain bike trail mapping app for our area, and I'd rather just send all our geospatial data to a serverless function than deploy a whole environment with Rust, Python, R, and JavaScript. Does anyone have any insight into this?
https://blog.stephenturner.us/p/expand-your-bluesky-network-with-r
I wrote this post demonstrating how to find people followed by the people you follow, but who you don't follow, using R and the atrrr package.
Hey there people,
Got a bit of a pickle with RStudio.
TL;DR: I want to replace the NA values of each column with values drawn from the same numeric distribution as the non-NA values (see the green example). How do I do that in R?
See the upper data frame: I have phenotypic numeric values for different species of Squamata. Lots of NAs, which mess up statistical analyses. I want to replace those NAs with numeric values.
What I've done currently: for each column, I calculated the mean of the non-NA values and replaced the NAs with that mean.
Optional question: how do I do that in R? Resources online didn't work, and doing it "by hand" in Excel was painful.
What I want: replace the NA values of each column by mimicking the distribution of the other numeric values in the same column. Basically what I did manually in green as an example: the min value is 15, the max is 38, and most values are around 22. Thus the NAs are replaced to mimic that.
Actual question: is there any commonly used script in scientific research which does something similar to what I want? No need for anything too complex; it's for a school project.
If not, I'd like to calculate the extent (range) of one column and divide it by the number of NA values, then increment while replacing NAs. Example: for the green column, min is 15 and max is 38, so the extent is 38 - 15 = 23. Let's say there are 23 NA values: 23/23 = 1. Replace the 1st NA with the min value, 15; the 2nd with 15 + 1 = 16; the 3rd with 16 + 1 = 17; etc.
I can do that manually in Excel, but is it possible to do so in R?
Many thanks for any help!
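For what it's worth, the commonly used tools for this in research are imputation packages such as mice or missForest. The evenly spaced fill described above is also only a few lines of base R (a minimal sketch; df is a placeholder for your data frame):

# fill a column's NAs with evenly spaced values spanning its observed range
fill_na_spread <- function(x) {
  n_na <- sum(is.na(x))
  if (n_na == 0) return(x)
  rng <- range(x, na.rm = TRUE)
  x[is.na(x)] <- seq(rng[1], rng[2], length.out = n_na)
  x
}

# apply to every numeric column, leaving the others untouched
df[] <- lapply(df, function(col) if (is.numeric(col)) fill_na_spread(col) else col)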
Hi, so I have an assignment where my prof. wants two different quantile-quantile plots for the following data. I have tried to figure it out myself with the help of websites. But as someone who has very, very little knowledge of this software I don't understand what any of it means. I pretty much need to code two separate quantile-quantile plots, one for the "Yes" category of lactating and another for the "No". I have tried to copy and paste this data into two separate spreadsheets but R gives me an error so this is my last hope 😭. Please help a suffering uni student in her time of need 🙏
Code:

options(
digits = 2,
scipen = 999,
warn = -1
)
rm(
list = ls()
)
library(magrittr)
library(readr)
Predicting_Demand_2 <- read_csv(
  "~/Predicting Demand 2.csv",
  col_types = "cnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnccn",
  name_repair = janitor::make_clean_names
)
library(dplyr)
Predicting_Demand_2$train_test <- "Train"
train_data<- Predicting_Demand_2 %>%
dplyr::filter(train_test == "Train") %>%
dplyr::mutate(
train_test = ifelse(runif(n()) > 0.5, "Validation", train_test)
)
# NOTE: this assigns a literal string; lm() below needs box_cox_quantity
# to be a numeric (transformed) response
Predicting_Demand_2$box_cox_quantity <- "id + lat + long + pop + shop + quantity + price"
#box_cox_quantity <- "city + lat + long + pop + shop + brand + container + capacity + price"
#dplyr::select(-shop)
lm_house <- lm(
  formula = box_cox_quantity ~ id + lat + long + pop + shop + quantity + price,
  data = Predicting_Demand_2 %>%
    dplyr::filter(
      train_test == "Train"
    )
)
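As for the actual question (two quantile-quantile plots, one per group), a base R sketch; df, value, and lactating are placeholder names for your data frame and columns:

par(mfrow = c(1, 2))                            # two plots side by side
for (grp in c("Yes", "No")) {
  x <- df$value[df$lactating == grp]
  qqnorm(x, main = paste("Lactating:", grp))    # normal Q-Q plot for this group
  qqline(x)                                     # reference line
}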
I am doing data analysis for my PhD on plastic pollution. I used to group data based on a unique ID column with dplyr, using a combination of group_by(id) and summarize(...) to sum up data and so on. Now this has stopped working and tells me to use reframe() instead of summarize(). However, grouping does not work anymore, and neither do the summarizing functions (e.g. tot_litter_grams = sum(litter_grams)). The dplyr documentation does not help me as of now, and neither did ChatGPT. Does anyone know how to get this working again?
Edit: Solved. I changed one function inside the summarize() function, which threw an error about summarize() not being supported anymore. Changing back the function inside summarize() solved the issue.
The code below is what I'm using. If I do 10 rows, fine, it works. The problem is my data frame is 7.3m rows. I'm testing it with a 1m subset, and it's been running for 3 hours, so that's obviously not going to be very feasible. Any suggestions?
library(sparklyr)
# Connect to databricks
sc<-spark_connect(method="databricks")
# subset it to a smaller number of rows for testing speed
icMX <- icM[1:1000000, ]
# Convert it to a Spark Dataframe
spark_df<-sdf_copy_to(sc,icMX,overwrite=TRUE)
# Save it
spark_write_table(spark_df, "edlprod.lead_ranking.intent_wide", mode="overwrite")
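One hedged suggestion: sdf_copy_to() serializes the data row by row unless Apache Arrow is available, and sparklyr switches to Arrow-based transfer when the arrow package is attached, which usually speeds up large copies dramatically:

library(arrow)       # attaching arrow enables Arrow serialization in sparklyr
library(sparklyr)

sc <- spark_connect(method = "databricks")
spark_df <- sdf_copy_to(sc, icMX, overwrite = TRUE)   # now Arrow-backed
spark_write_table(spark_df, "edlprod.lead_ranking.intent_wide", mode = "overwrite")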
Not sure if this is the right channel or if there's a better one, but since I didn't find one for RMarkdown, here we go.
I'm doing some writing using RMarkdown and a VS Code plugin called FOAM (Logseq-like). I'm writing the documents in a .md file and building everything from a single .rmd file. The thing is, FOAM uses the characters [[ and ]] to create links between the files, which is pretty useful for a wiki-like writing structure. The main problem is that these characters appear in the output PDF. I want to get rid of them when I build, but I'm not experienced enough with R to do so, and I cannot find a proper solution by myself. The closest solution I found is the following post (not the main answer, but the other one), but I don't know how to adapt it for my purposes.
The .rmd file looks like this:
---
title : Some Title
subtitle : Some Subtitle
author: |
| My Name
wordcount : "X"
documentclass : article
floatsintext : no
figurelist : no
tablelist : no
footnotelist : no
linenumbers : no
mask : no
draft : no
tables : no
output:
bookdown::pdf_book:
toc: false
header-includes:
- \usepackage[spanish]{babel}
- \usepackage{booktabs}
- \usepackage{placeins}
- \usepackage{titling}
---
```{r, include = FALSE}
library(knitr)
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = '.')
```
```{r, child=c('MyMarkdownDocument.md')}
```
Any advice on getting rid of those characters? I want to avoid manually removing the symbols every time I build, if I can.
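A minimal sketch of one approach, assuming the wikilinks always look like [[target]]: preprocess the child .md file with a regex before including it, keeping the link text and dropping the brackets (the _clean file name is a placeholder; the original file name is taken from the post):
```{r preprocess, include=FALSE}
raw <- readLines("MyMarkdownDocument.md")
# strip FOAM's [[...]] wikilink brackets, keeping the text between them
clean <- gsub("\\[\\[(.*?)\\]\\]", "\\1", raw, perl = TRUE)
writeLines(clean, "MyMarkdownDocument_clean.md")
```
```{r, child=c('MyMarkdownDocument_clean.md')}
```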
I recently needed to use a latent class analysis (LCA) function. However, when I tried to install it in R 4.4.0, it said the lcca package was built for a previous version of R (before 4.x.x). Does anyone know how to install this package and use it smoothly on the most up-to-date R? Thank you!
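If lcca is only available in the CRAN archive, one hedged route is installing the archived source tarball with remotes; the <version> part of the URL below is a placeholder, so check the actual file name in the archive listing first:

install.packages("remotes")

# installs from source; the file name must match the CRAN archive listing
remotes::install_url(
  "https://cran.r-project.org/src/contrib/Archive/lcca/lcca_<version>.tar.gz"
)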
I am running two R scripts at once, each on a different virtual desktop (the Windows option to have another screen).
Will R run slower if there are multiple scripts going at once? Would it be wiser to run them one at a time?
I made an R package that needs some initialization code run upon loading of the package with library(). Is there a way to do this?
I'm trying to write a convenience package that facilitates access to a database I use all the time. Here's a minimal example of the single R file involved:
.pdb = DBI::dbConnect(odbc::odbc(), driver="SQL Server",
<more connection args>)
#' @export
Anlage <- dplyr::tbl(.pdb, 'Anlage')
Yes, there's a DB connection hard-coded into a package. Never mind. This is only for my local use, not distribution.
Enter a Windows shell in the package source directory and load the package in the development environment:
PS > R.exe
R version 4.4.1 (2024-06-14 ucrt) -- "Race for Your Life"
> library(devtools)
Loading required package: usethis
> load_all()
ℹ Loading ProdDB
> class(Anlage)
[1] "tbl_Microsoft SQL Server" "tbl_dbi"
[3] "tbl_sql" "tbl_lazy"
[5] "tbl"
> Anlage
# Source: table<"Anlage"> [?? x 43]
# Database: Microsoft SQL Server 13.00.6300[ProdDB]
anlagentyp anlagennummer cre_dat end_dat
<chr> <chr> <dttm> <dttm>
1 " EXT" "1 " 1992-12-23 09:40:22 5512-05-04 21:13:51
2 "01LI" "409 " 2012-03-20 13:57:54 5512-05-04 21:13:51
So that works fine. Let's build and install it (no errors, output from commands omitted):
> build()
> install()
* DONE (ProdDB)
Exit and re-enter R:
> q()
Save workspace image? [y/n/c]: n
PS > R.exe
R version 4.4.1 (2024-06-14 ucrt) -- "Race for Your Life"
Load and test installed package:
> library(ProdDB)
> class(Anlage)
[1] "tbl_Microsoft SQL Server" "tbl_dbi"
[3] "tbl_sql" "tbl_lazy"
[5] "tbl"
This looks like before. Let's get some data:
> Anlage
$src
$con
Loading required package: odbc
Error: external pointer is not valid
Now that's where I am. The top of traceback() looks like this:
> traceback()
10: stop(structure(list(message = "external pointer is not valid",
call = NULL, cppstack = NULL), class = c("Rcpp::exception",
"C++Error", "error", "condition")))
9: connection_info(dbObj@ptr)
8: dbGetInfo(object)
7: dbGetInfo(object)
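A hedged diagnosis: top-level objects like .pdb and Anlage are evaluated once at build time and serialized into the package's lazy-load database, and the ODBC connection's external pointer cannot survive that round trip, which is why load_all() works but the installed package fails. The usual pattern is to open the connection at load time in an .onLoad() hook (conventionally in R/zzz.R) and expose tables through functions, sketched here:

# package-private environment holding run-time state
.pkg_env <- new.env(parent = emptyenv())

.onLoad <- function(libname, pkgname) {
  # connection args as in the original; opened fresh on every load
  .pkg_env$pdb <- DBI::dbConnect(odbc::odbc(), driver = "SQL Server")
}

#' @export
Anlage <- function() dplyr::tbl(.pkg_env$pdb, "Anlage")

The tradeoff is that Anlage becomes a function, so callers write Anlage() instead of Anlage.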
Hi, I recently came across a paper that performed sentiment analysis on H.P. Lovecraft's texts, and I found it fascinating.
However, I was unable to find additional studies or examples of computational text analysis applied to his work. I suspect this might be due to the challenges involved in finding, downloading, and processing texts from the archive.
To support future research on Lovecraft and provide accessible examples for text analysis, I developed an R package (https://github.com/SergejRuff/lovecraftr). This package includes Lovecraft's work internally, but it also allows users to easily download his texts directly into R for straightforward analysis.
Hi everyone,
I’m looking to take on a side project of building an R package and releasing it to the public. However, I’m struggling with deciding what the package should include. The R community is incredibly active and has already built so many tools to make developing in R easier, which makes it tricky to identify gaps.
My question to you: What’s something useful and fairly basic that you find yourself scripting on your own because it’s not included in any existing R packages?
I’d love to hear your thoughts or ideas. My goal is to compile these small but helpful functionalities into a package that could benefit others in the community.
Thanks in advance for sharing your suggestions!
I want to create a FastAPI-based web site, and much of its functionality will be provided by R and Quarto. (I am part of a community that wrangles data and creates reports using both R and Quarto. Also, I have known and used Python since the 90s, so I know it provides these abilities as well; however, this community doesn't.) I have been looking for a web hosting service that would allow me to call R (via rpy2) and Quarto on the server; however, I have been unsuccessful.
Any help would be appreciated.
This is really frustrating. I'm trying to make a complex join of half a dozen tables, and some of them have a column called flags. To differentiate them, R names them flags.x, flags.y, ... in the order they appear in the join. Yes, I know I can specify a suffix argument to the inner_join() function, but that only gets appended if that column is actually used in the query. (And merge() calls its argument suffixes, for extra inconsistency.)
I know better than to complain about FOSS. I just can't understand why these, in my eyes, counterproductive decisions were made. I'm a strong proponent of "explicit is better than implicit", which is why I wouldn't mind if any multi-table query would by default prepend the table name to all variables so there is never any ambiguity.
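A hedged workaround: rename the ambiguous columns per table before joining, so every output column is explicit regardless of whether a suffix kicks in (tbl1, tbl2, and key are placeholder names):

library(dplyr)

t1 <- tbl1 |> rename_with(\(n) paste0("tbl1_", n), .cols = -key)
t2 <- tbl2 |> rename_with(\(n) paste0("tbl2_", n), .cols = -key)

joined <- inner_join(t1, t2, by = "key")   # no .x/.y ambiguity possible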
I have two data frames: one contains all the values, and the second is missing a column of values, but I need to maintain the order of the second data frame. I'm having the hardest time doing this after two years of not using R. I'm not even sure of the best function to use. Any help would be appreciated.
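A hedged sketch: dplyr::left_join() preserves the row order of its left (first) data frame, so joining the missing column onto the frame whose order matters keeps that order; id and extra_col are placeholder names:

library(dplyr)

# df_partial keeps its row order; extra_col is pulled in from df_full by id
result <- df_partial |>
  left_join(select(df_full, id, extra_col), by = "id")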