/r/statistics
/r/Statistics is going dark from June 12-14th as an act of protest against Reddit's treatment of 3rd party app developers.
This community will not grant access requests during the protest. Please do not message asking to be added to the subreddit.
Guidelines:
All Posts Require One of the Following Tags in the Post Title! If you do not tag your post, AutoModerator will delete it:
Tag | Abbreviation
---|---
[Research] | [R]
[Software] | [S]
[Question] | [Q]
[Discussion] | [D]
[Education] | [E]
[Career] | [C]
[Meta] | [M]
This is not a subreddit for homework questions. They will be swiftly removed, so don't waste your time! Please kindly post those over at: r/homeworkhelp. Thank you.
Please try to keep submissions on topic and of high quality.
Just because it has a statistic in it doesn't make it statistics.
Memes and image macros are not acceptable forms of content.
Self posts with throwaway accounts will be deleted by AutoModerator.
Useful resources for learning R:
r-bloggers - blog aggregator with statistics articles generally done with R software.
Quick-R - great R reference site.
I know that there is Wikipedia as a source, but nothing beats a well-written book from experts in the field. Every day I come across new statistical terminology and subfields, and I would love to know what's going on there.
For context, I'm currently pursuing an MSc in Statistics. I usually hear statisticians on the job saying things like "people usually come up to me for stats help" or "I can't believe people at my work do X and Y, it goes to show how little people know about statistics". Even though I'm a master's student, I don't feel like I have a solid grasp of statistics in a practical sense. I'm killer with all the math-y stuff; I got an A+ in my math stats class. It may have been because I skipped the Regression Analysis course in undergrad, where one would work on more practical problems. I'm currently an ML research intern and my stats knowledge is not proving to be helpful at all; I don't even know where to apply what I'm learning.
I'm going to try to go through the book "Regression and Other Stories" by Gelman to get a better sense of regression, which should cover my foundation for applied problems. Are there any other resources or tips you have for becoming a well-rounded statistician who could be useful in a variety of different fields?
There is an old French TV game show that just restarted after a long time off the air. During the final, the contestant had so far won a pack of cards and was given 4 screens to choose from. The host explained: one of the screens was « keep your pack of cards », one was « a crappy thing », one was « a decent thing », and one was the car.
So at that point I got strong Monty Hall vibes watching this. The contestant initially leaned towards screen 1, but asked his friends in the audience to join him and discuss, and after that he hesitated between screens 1 and 4. The contestant then asked the host if he could start by ditching screens 2 and 3, and the host said sure, why not. It happened, and the two eliminated screens were the pack of cards and the crappy thing. That left the « decent » thing and the car. The contestant then followed his friends' advice, picked screen 4, and got the car.
I'm wondering how applicable Monty Hall logic is to this one.
• the contestant was never officially offered a choice to switch; he was just hesitating because of his friends
• the contestant, not the host, chose which screens to eliminate, and they could have contained the car, but they did not, so technically it was the « two goat reveal » of the Monty Hall problem
• at this point, does Monty Hall logic apply, and did he have a better chance by choosing screen 4 like he did? It feels to me like yes: because the crap got eliminated, we are back to a Monty Hall situation, and he had a 1/4 chance of picking the correct screen at the beginning, so switching is better. But can someone who knows more about probability confirm it? I don't know if any of these events change the probability distribution compared to a standard Monty Hall.
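A quick way to check this is to simulate it. The R sketch below assumes the car is equally likely to be behind any of the 4 screens and that the contestant, knowing nothing about where the car is, eliminates screens 2 and 3; we then keep only the runs where those two screens happened to be duds, which is the situation described above.

```r
# Simulation sketch; assumes a uniformly placed car and an uninformed elimination
set.seed(1)
n_sim <- 1e6
car  <- sample(1:4, n_sim, replace = TRUE)   # which screen hides the car
keep <- car != 2 & car != 3                  # runs where screens 2 and 3 turned out to be duds
mean(car[keep] == 1)                         # P(car behind screen 1 | 2 and 3 were duds) ~ 0.5
mean(car[keep] == 4)                         # P(car behind screen 4 | 2 and 3 were duds) ~ 0.5
# Contrast with the standard Monty Hall setup, where an informed host never
# reveals the car: there the initial pick keeps its 1/4 and switching wins 3/4.
```

In other words, because the reveal was lucky rather than guaranteed by an informed host, the usual "always switch" advantage disappears here and the two remaining screens end up roughly equally likely.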
Hello,
I am currently working on a paper. I have already done a multiple mediation analysis with 3 mediators.
I decided to add sex as a moderator, since my descriptive stats showed a significant difference in scores by sex.
The index of moderated mediation is non-significant, so I know that sex does not moderate the X > Med > Y relationship. Would I report the normal a/b pathways as I would in a multiple mediation analysis, OR would I report the interaction pathways as I would in a moderated mediation?
Please note that using the usual pathways keeps my mediation effect significant (as it was before adding a moderator); if I use the interaction pathways it will no longer be significant... So I assume we would not use the interaction pathways, as the moderated mediation is not significant?
Please let me know!!!!
[Q] 2010-2020 unemployment rate for Phoenix, AZ is given under the attached
and they provide me with the longest list of unemployment rates. How am I supposed to find those three with sooo many numbers? Help please.
I was looking at, for example, this set of 25 analogies (PDF warning), but frankly I find many of them extremely lacking. For example:
The 5% p-value has been consolidated in many environments as a boundary for whether or not to reject the null hypothesis with its sole merit of being a round number. If each of our hands had six fingers, or four, these would perhaps be the boundary values between the usual and unusual.
This, to me, not only reads as nonsensical but also fails to get at any underlying statistical idea, and it certainly bears no relation to the origin or initial purpose of the figure.
What (better) analogies or mini-examples have you used successfully in the past?
Current standard in my field is to use a model like this
Y = b0 + b1x1 + b2x2 + e
In this model x1 and x2 are used to predict Y but there’s a third predictor x3 that isn’t used simply because it’s hard to obtain.
Some people have seen some success predicting x3 from x1
x3 = a*x1^b + e (I’m assuming the error is additive here but not sure)
Now I’m trying to see if I can add this second model into the first:
Y = b0 + b1x1 + b2x2 + a*x1^b + e
So here now, I’d need to estimate b0, b1, b2, a and b.
What would be your concerns with this approach? What are some things I should be careful of when doing this? How would you advise I handle my error terms?
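If it helps, here is a minimal sketch of fitting that combined model in R with nls(); the data frame name, column names, and starting values are placeholders, not anything from your setup:

```r
# Nonlinear least squares fit of Y = b0 + b1*x1 + b2*x2 + a*x1^b + e
# (single additive error term; start values are rough guesses and may need tuning)
fit <- nls(Y ~ b0 + b1 * x1 + b2 * x2 + a * x1^b,
           data  = dat,
           start = list(b0 = 0, b1 = 1, b2 = 1, a = 1, b = 0.5))
summary(fit)
# One thing to watch: when b is close to 1, a*x1^b is nearly collinear with b1*x1,
# so b1 and a can trade off against each other and the fit can become unstable.
```

Also note that by substituting the second model into the first you are left with a single error term, so the noise in the x3 ~ x1 relationship gets absorbed into the residual of Y rather than being modeled separately; an errors-in-variables style model would treat that noise explicitly.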
Hi Community,
I worked on an interactive tutorial on the ROC curve, AUC score and the confusion matrix.
https://maitbayev.github.io/posts/roc-auc/
Any feedback appreciated!
Thank you!
TLDR: any resources or suggestions on how to decompose time series data logged at the millisecond level?
I am trying to apply practical methods of time series decomposition to server memory telemetry (e.g. % of memory used over time). The data is captured at the millisecond level, spans ~1.5 days, and is stationary. However, I am having a very hard time understanding the best approach to decomposing it using something like STL. From the plotted data I can see there is certainly seasonality (and/or perhaps more irregular cyclicality) in the data which I would like to remove, but determining the correct periodicity to use is hindering my work. Due to the granularity of the data, it's nearly impossible to eyeball a rough guess at what the periodicity may be.
In yearly, monthly, or weekly time series you have a sense of the periodicity to work from, but given the scale of this data I don't really have a sense of what would make sense in this case. I've done some basic ACF / PACF to look at lagged values; the plots show steady drop-offs in correlation over time before stabilizing. I've also done some very elementary frequency testing to try to establish the ideal period to use, but I need to be more methodical. Most of the resources I've found online don't cover advanced cases of time series decomposition, and certainly not in the context of very granular time intervals.
My ultimate goal in decomposition is to detrend the data and analyze the residuals so that I can compare multivariate data across memory usage, swap usage, and other telemetry time series.
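One workable starting point is to let the periodogram suggest a candidate period before running STL. A rough R sketch, assuming `mem` is the % memory-used series sampled at a fixed interval (the name is a placeholder):

```r
# Estimate a dominant period from the raw periodogram, then decompose with STL
spec   <- spec.pgram(mem, taper = 0, plot = FALSE)
period <- 1 / spec$freq[which.max(spec$spec)]     # dominant period, in number of samples
# STL needs a ts with an integer frequency > 1 and at least two full periods
decomp    <- stl(ts(mem, frequency = round(period)), s.window = "periodic")
remainder <- decomp$time.series[, "remainder"]    # detrended, deseasonalized residuals
```

With millisecond data spanning ~1.5 days you may also want to downsample first (e.g., average into 1-second or 1-minute bins), and if several cycles overlap, multiple-seasonality tools such as mstl in the forecast package may be worth a look.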
Excuse me for asking this.
So, before the question, here's the scenario: there's an experiment with a nested design.
2 groups, say A and B. Under A there are 2 subgroups, W and X, each of which has 2 samples. Under B there are 2 subgroups, Y and Z, but here Y has 3 samples under it and Z has 1.
The experiment is carried out in 3 replications. Each replication has 15 tests from each sample.
Now, when applying nested ANOVA after collecting the raw data, we calculate means.
First, calculate the mean for each sample across the 3 replications. For example: if 11, 10 and 9 out of 15 tests were positive for W1, the mean for W1 is MW1 = 10. We calculate similarly for all 8 samples.
Now, we calculate the mean of each subgroup. Say MW1 = 10 and MW2 = 12,
so the mean of subgroup W = 11, and so on for the other subgroups X, Y and Z.
Now, when we get to the mean of a group, there's some confusion.
For example, take the mean of group B from subgroups Y and Z. Say mean of Y = avg(MY1, MY2, MY3) and mean of Z = avg(MZ1),
so mean of B = avg(mean of Y, mean of Z). But if we were to calculate the mean of B using the individual sample values instead, i.e. mean of B = avg(MY1, MY2, MY3, MZ1),
we would get a different value.
That is obviously because of the different number of samples under each subgroup.
But the question is: which one is more appropriate to use in the nested ANOVA calculation?
The same thing happens when calculating the overall mean from the group means: overall mean = avg(mean of A, mean of B) {following the same order of calculation},
or
overall mean = avg(MW1, MW2, MX1, MX2, MY1, MY2, MY3, MZ1) {calculating with the individual values}.
The overall mean will be used in calculating the sums of squares, so it's confusing which way is the correct one.
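To make the two versions concrete, here is a small R sketch with made-up sample-level means (only the structure matches the design above):

```r
# Hypothetical sample-level means for the 8 samples
samp <- data.frame(
  group    = c("A","A","A","A","B","B","B","B"),
  subgroup = c("W","W","X","X","Y","Y","Y","Z"),
  m        = c(10, 12, 11, 13, 9, 10, 11, 14)
)
b <- samp[samp$group == "B", ]
# "Mean of subgroup means" (unweighted): average Y and Z means, then average those
unweighted_B <- mean(tapply(b$m, b$subgroup, mean))
# "Mean of individual sample values" (weighted by how many samples each subgroup has)
pooled_B <- mean(b$m)
c(unweighted = unweighted_B, pooled = pooled_B)  # they differ because Y has 3 samples, Z has 1
# For the test itself, letting software handle the unbalance is usually safer than
# hand-computed sums of squares, e.g. aov(y ~ group / subgroup, data = raw_data)
# or a mixed model such as lme4::lmer(y ~ group + (1 | group:subgroup), data = raw_data).
```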
The NSF is hosting a workshop on using mathematical and statistical foundations to advance AI! This event will explore how cutting-edge math and stats can drive innovation in AI, from theory to applications.
📅 When: February 22–23, 2025
📍 Where: Virtual
The focus will be on:
Strengthening AI’s theoretical underpinnings
Addressing challenges in explainability, fairness, and robustness
Bridging the gap between pure math/stats and practical AI tools
Researchers, educators, and industry pros are encouraged to attend. Registration is free, but spots are limited!
Details & registration: NSF Event Page
Hello, I’m a second year undergraduate student majoring in neuroscience and minoring in mathematics. I’m doing a neuropsychology research internship at a hospital and I expressed a lot of interest in learning how to do anything mathematical/statistical and my supervisor said that I could potentially be involved in that. However, I don’t have much knowledge in statistics and I’ve never taken a statistics class.
What are some good resources to efficiently self-learn statistics, especially statistics relating to biomedical research? I have money to buy textbooks but of course I prefer resources that are free. Also, I’ve taken math up to Calculus II and I’m currently taking Linear Algebra if that helps.
By my calculation, 23.5% of Americans are on Medicaid (79 million out of 330 million). I believe births in the US as a percentage of population are about 1.1% (3.6 million out of 330 million). So, would RFK's math mean the U.S. has 11.6 billion people?
Essentially, (30 million babies / .011 babies per person in the U.S. population) / .235 (Medicaid population as a share of total population).
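For what it's worth, here's the same back-of-envelope calculation written out in R, using the rounded figures from the post:

```r
births_claimed <- 30e6     # babies claimed to be born on Medicaid per year
birth_rate     <- 0.011    # births per person per year (3.6M / 330M, ~1.1%)
medicaid_share <- 0.235    # share of the population on Medicaid (79M / 330M)
births_claimed / birth_rate / medicaid_share   # implied total population, ~11.6 billion
```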
Prep for Qualifying Exams
I was accepted into a decent stats PhD program. While it’s not my top choice due to funding concerns, department size, and research fit, it’s the only acceptance I have and I am very grateful. I would like to prepare myself to pass a stats PhD program quals.
I am reasonably confident in my mathematical analysis training. I am taking measure theory at a grad level in my final semester of undergrad, which goes over Stein and Shakarchi. I also took some other grad math classes (I was a math major and I focused more heavily on machine learning and applied math than traditional parametric statistics).
However, I fear that because I have not extensively practiced statistics and probability since I took the courses, I’m a little rusty on distributions and whatnot. I’ve been only taking math classes based on proofs for the last 1-2 years, and apart from basic integrals and derivatives, I’ve done few computations with actual numbers.
Here and there I did some questions on derivations of moments for transformations of Gaussian random variables, but I honestly forgot a lot of the formulas.
Should I end up at this program, I will find an easier summer job so I can grind Casella and Berger this summer. I'm mainly fearful because a nontrivial number of the domestic students admitted fail the quals.
Please, guys, do you have any recommendations / advice?
During the 2014 World Cup, Uruguayan soccer player Luis Suarez bit an opposing player, his third such biting incident. Later, some news sources (reputable and non-reputable) cited a statistical estimate that one has a higher likelihood of being bitten by Suarez, at 1 in 2,000, than of being bitten by a shark (at the time 1 in 3.7 million).
How the hell does one estimate this? Seems like an odd thought experiment
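One plausible way a figure like that gets built is bites divided by "exposures" (opposing players who have shared a pitch with him). The numbers below are purely illustrative placeholders, not whatever inputs the news sources actually used:

```r
bites               <- 3     # known biting incidents
matches_played      <- 400   # hypothetical number of career matches at the time
opponents_per_match <- 14    # hypothetical opposing players per match (starters + subs)
exposures <- matches_played * opponents_per_match
bites / exposures            # ~1 in 1,900, i.e. in the ballpark of the 1-in-2,000 headline
```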
I've been planning on going back to school and getting my masters, and I've been strongly considering applied statistics/biostatistics. I have my bachelor’s in history, and I've been unsatisfied with my career prospects (currently working in retail). I took an epidemiology course as part of a minor I took during undergrad (which sparked my interest in stats in the first place) and an introductory stats course at my local community college after graduation. I'm currently enrolled in a calculus course, since I will have to satisfy a few prerequisites. I'm also currently working on the Google Data Analytics course from Coursera, which includes learning R, and I have a couple projects lined up down the road upon completion of the course.
Is it feasible to apply for these programs? I know that I've made it a little more difficult on myself by trying to jump into a completely different field, but I'm willing to put in the work. Or am I better off looking elsewhere?
Hello everyone,
I have a question regarding the interpretation of the beta coefficient of a mediating variable taken from a regression table. The model consists of a categorical independent variable (sex; coded 1 = male, 2 = female), a numeric dependent variable (income), and a numeric mediating variable (work hours). The table reports a coefficient of 35.67 €/hour. I am wondering if this is the average increase in income per additional work hour for BOTH males AND females, or ONLY for males? Does this depend on the coding of sex? I struggle with the phrase "...keeping sex constant".
Help me understand this problem more generally. I want to get better at interpreting mediator-coefficients.
Thank you!
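If it helps to see it as code, here's a minimal sketch of that regression in R (the data frame and column names are placeholders):

```r
# Mediator model: income regressed on sex and work hours, no interaction
fit <- lm(income ~ factor(sex) + workhours, data = d)
summary(fit)
# Without a sex:workhours interaction, the workhours coefficient (35.67 in the table)
# is a single slope assumed to be the same for males and females: expected income
# rises by 35.67 per additional work hour, holding sex fixed. The coding of sex only
# shifts the sex coefficient/intercept, not this slope.
# If you suspect the slope differs by sex, fit the interaction explicitly:
fit_int <- lm(income ~ factor(sex) * workhours, data = d)
```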
Let's say a coin flip comes up heads with probability p. Then after N flips I can expect, with 95% confidence, that the proportion of heads will be in the interval (p - 2*sqrt(p*(1-p)/N), p + 2*sqrt(p*(1-p)/N)), right?
Now suppose I have a number M much larger than N, on the order of 10 times as large, and an unknown p.
I can estimate p by counting the number of successes in N trials, but how do I account for the uncertainty in p over a new set of flips at the 95% level? As I understand it, in the formula (p - 2*sqrt(p*(1-p)/N), p + 2*sqrt(p*(1-p)/N)) the value of p is known and certain; if I have to estimate p, how would I account for this uncertainty in the interval?
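One common way to handle this (a sketch, not the only option) is to widen the interval so it covers both the estimation error in p-hat and the sampling noise of the new batch of flips. In R, with made-up numbers:

```r
N <- 100           # flips used to estimate p
M <- 1000          # size of the new batch of flips
heads_obs <- 47    # hypothetical count of heads in the first N flips
p_hat <- heads_obs / N
# Variance of (future proportion - p_hat) is roughly p(1-p) * (1/N + 1/M)
se_pred <- sqrt(p_hat * (1 - p_hat) * (1 / N + 1 / M))
c(lower = p_hat - 2 * se_pred, upper = p_hat + 2 * se_pred)  # ~95% prediction interval
```

If p were known exactly, only the new batch's own sampling term would remain, which is essentially the interval you wrote; a Beta-Binomial (Bayesian) interval or a parametric bootstrap are other ways to fold the uncertainty in p into the prediction.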
I've been tracking some of my own data on a daily basis over the last two years. It's mostly habits and biometric data such as step count for a total of 18 variables. I'd like to be able to make some inferences from the data but want to do so in a way that's not just looking at graphs.
I've looked into intensive longitudinal designs and DSEM, but those only track a very small number of parameters and focus on within-person and between-person effects, and neither really fits my application.
On the other hand, I do have some ideas and a path model I would like to investigate, but my main issue is that my data violate the independence assumption. This is a characteristic of the tools I used to record the data. Basically, the data outputs for these habits (besides step count) are either booleans for each day (these should be fine to use), or a "trend" type of score whose value changes with sustained recurrence of the daily habit, with a decay function built in.
Does anyone here know what I could look into to analyse the data?
I don't know if I fully agree with the overall premise that R² is useless, or worse than useless, but I do agree it's often misused and misinterpreted, and the article was thought-provoking and a useful reference.
https://getrecast.com/r-squared/
Here are a couple of academics making the same point:
http://library.virginia.edu/data/articles/is-r-squared-useless
https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/10/lecture-10.pdf
Scenario: Analyzing preference data with 3 values (For example: Which do you prefer: Football, Baseball, or no preference (i.e., neutral)?)
Primary Research Questions:
Strangely I'm not seeing many strong recommendations regarding this scenario unlike for continuous data.
My question is
Additional caveats:
Options I've considered:
Option A1 - Neutral expected = observed
First, eyeball (or confidence interval) the differences between Neutral and Football/Baseball.
Second, set the Neutral expected value EQUAL to the observed. Split the remaining expected values across Football and Baseball (50/50 split) to "remove" Neutral, but maintain sample size. (See image for example)
 | Observed | Expected
---|---|---
Football | 36 | (58/2) = 29
Neutral | 42 | 42
Baseball | 22 | (58/2) = 29
One problem seems to be the statistic itself, because it's really wonky to try to interpret. It's like: "after removing the effect of neutral responses, participants' preferences differed (or did not differ) between Football and Baseball."
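For what it's worth, Option A1 as described is straightforward to run as a goodness-of-fit test in R, using the counts from the table above:

```r
observed <- c(Football = 36, Neutral = 42, Baseball = 22)
n <- sum(observed)                       # 100 respondents
expected <- c(Football = (n - 42) / 2,   # split the non-neutral mass evenly
              Neutral  = 42,             # neutral expected fixed at its observed count
              Baseball = (n - 42) / 2)
chisq.test(x = observed, p = expected / n)
# An arguably simpler framing: drop the neutrals and test Football vs Baseball directly
binom.test(36, 36 + 22, p = 0.5)
```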
Option A2. Neutral vs. others along with Neutral expected = observed
Instead of the first step above, either (A2a) take the larger of Football and Baseball, (A2b) add Football and Baseball together and see if, combined, they differ from Neutral, or (A2c) take the average of Football and Baseball and see if that average differs from Neutral.
One problem is the interpretability of A2a, A2b, and A2c: they are hard to interpret and/or take a lot of language to explain.
Then use the second step above. So the same interpretability problem as A1.
Option B1 - Confidence intervals' overlap across expected values
[incomplete solution]
Calculate confidence intervals and compare to EXPECTED values. Same problem as above: how do you calculate expected values that are meaningful across the 3 options (33/33/33 is not, in my opinion)? So what expected values?
Option B2 - Confidence Intervals' overlap across the 3 observed values
Similar to using confidence intervals to eyeball differences between continuous data
Option C. Your suggestions!
Thoughts, opinions, suggestions? Thank you!
I am running a mediation analysis for my thesis project and have a couple of questions. My research consisted of administering a questionnaire via Qualtrics where everything is likert data. I tried running a CFA in JASP and R and came across the issue of having R treating my data as continuous, while JASP was treating it as ordinal. I believe the SEM class I took only handled continuous data, which was something I did not realize at the time. Now I am trying to figure out if I should continue treating my data as ordinal or continuous? For example, depressive symptoms were assessed using the DASS-21 subscale, where the final score is calculated by summing the responses to the relevant items, so in my head I feel this can technically be continuous if I use the total score. Luckily, I can manipulate JASP to treat my items as continuous so I can run my analysis with the ML estimator, but I am wondering if this is compromising my model fit in any way and if I should be treating my data as ordinal from beginning to end.
I am clearly very new at this and just need some guidance outside of what my advisor and committee is offering me
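In case it helps, here's a minimal lavaan sketch of the two treatments; the item names and the one-factor model are placeholders, not your actual measurement model:

```r
library(lavaan)
model <- 'depression =~ dep1 + dep2 + dep3 + dep4 + dep5 + dep6 + dep7'
# Ordinal treatment: declare the Likert items as ordered; lavaan then switches
# to a polychoric/WLSMV-type approach
fit_ord <- cfa(model, data = dat,
               ordered = c("dep1","dep2","dep3","dep4","dep5","dep6","dep7"))
# Continuous treatment: maximum likelihood (MLR adds robust corrections)
fit_ml  <- cfa(model, data = dat, estimator = "MLR")
summary(fit_ord, fit.measures = TRUE)
summary(fit_ml,  fit.measures = TRUE)
```

Fit indices from the two estimators are not on the same footing, so comparing the coefficient estimates and substantive conclusions between the two fits is usually more informative than comparing CFI/RMSEA across them.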
Hi,
I am currently an MS student in statistics. I just started this semester and graduated with my BS in biology last semester. I'm honestly not sure what to do with my life. I would ideally want to break into biostatistics, but that field isn't looking too hot for entry-level people. I just feel completely lost. I have applied to so many internships and have just gotten straight rejections. I don't have research experience, and it seems impossible to get at my uni because all the profs only want PhD students to work under them. I just don't know what to do.
Hi, I am a third-year undergrad studying data science.
I am planning to apply to thesis master's programs in statistics this upcoming fall, and eventually work towards a PhD in statistics. In the first few semesters of university I did not really care about my grades in my math courses, since I didn't really know what I wanted to do at that point, so my math grades from the beginning of university are rough. Since those first few semesters I have taken and performed well in many upper-division math/stats, CS, and DS courses, averaging mostly A's and some B+'s.
I have also been involved in research over the past almost 11 months. I have been working in an astrophysics lab and in an applied math lab working on numerical analysis and linear algebra. I will also most likely have a publication from the applied math lab by the end of the spring.
When I look at the programs I want to apply to, a good portion of them say they only look at the last 60 credit hours of my undergrad, so that gives me some hope, but I'm not sure what more I can do to make my profile stronger. My current GPA is hovering at 3.5; I hope to have it between 3.6 and 3.7 by the time I graduate in spring '26.
The courses I have taken and am currently taking are: Pre-calc, Calc 1-3, Linear Algebra, Discrete Math, Mathematical Structures, Calc-based Probability, Intro to Stats, Numerical Methods, Statistical Modeling and Inference, Regression, Intro to ML, Predictive Analytics, and Intro to R and Python.
I plan to take over the next year: real analysis, stochastic processes, mathematical statistics, combinatorics, optimization, numerical analysis, bayesian stats. I hope to average mostly A's and maybe a couple B's in these classes.
I also have 3-4 professors I am sure that I can get good letters of recommendation from as well.
Some of the schools I plan on applying to are: UCSB, UMass Amherst, Boston University, Wake Forest University, University of Maryland, Tufts, Purdue, UIUC, Iowa State University, and UNC Chapel Hill.
What else can I do to help my chances of getting into one of these schools? I am very paranoid about getting rejected from every school I apply to. I hope that my upward trajectory in grades and my research experience can help overcome a rough start.
Hello!
I'm a senior undergraduate majoring in math. Down the line, I'm interested in graduate study in statistics. I'm further interested in careers in applied statistics, data science, and machine learning. I'm currently enrolled in an Advanced Real Analysis class.
The class description is the following: "Measure theory and integration with applications to probability and mathematical finance. Topics include Lebesgue measure/ integral, measurable functions, random variables, convergence theorems, analysis of random processes including random walks and Brownian motion, and the Ito integral."
For my academic and professional interests post-graduation, is it worth taking this class? It seems extremely relevant to my interests. However, the workload and stress from the class feel nearly unmanageable. What advice do you all have for me?
I am currently a student in my department's MS in Statistics program.
I applied for the PhD in Statistics program for the Fall 25 cycle in my department. I spoke to a person in the department, and though I was not rejected per se, they said that they had already sent out the offers.
I am working on a project (a potential publication) under a professor who is young and new to the department, and this professor doesn't have any PhD students right now. I have expressed my interest in working under him, and he also has funding for a student. Since I started talking to the professor after I applied to the program, the fact that I am working with him is not included in my statement or resume, so the admissions committee is unaware of this situation.
I will also apply to the next cycle, but is there something I can do about this in this cycle?
If you were me, how would you best navigate through this situation?
I want a descriptive statistics book where most of the content is about proving identities/inequalities related to statistics. Thank you in advance!