/r/statistics
/r/Statistics is going dark from June 12-14th as an act of protest against Reddit's treatment of 3rd party app developers.
This community will not grant access requests during the protest. Please do not message asking to be added to the subreddit.
Guidelines:
All Posts Require One of the Following Tags in the Post Title! If you do not tag your post, AutoModerator will delete it:
Tag | Abbreviation |
---|---|
[Research] | [R] |
[Software] | [S] |
[Question] | [Q] |
[Discussion] | [D] |
[Education] | [E] |
[Career] | [C] |
[Meta] | [M] |
This is not a subreddit for homework questions. They will be swiftly removed, so don't waste your time! Please kindly post those over at: r/homeworkhelp. Thank you.
Please try to keep submissions on topic and of high quality.
Just because it has a statistic in it doesn't make it statistics.
Memes and image macros are not acceptable forms of content.
Self posts with throwaway accounts will be deleted by AutoModerator
Useful resources for learning R:
r-bloggers - blog aggregator with statistics articles generally done with R software.
Quick-R - great R reference site.
I am running NMDS plots on metabarcoding data, which is often represented as relative abundance. Can I log-transform the relative abundances as well, or should I use only one of the two transformations? I know it is common to transform data in some form before NMDS.
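For what it's worth, a minimal sketch in R with the vegan package of what "log-transform, then NMDS on Bray-Curtis" would look like; the community matrix name here is hypothetical:

```r
library(vegan)

# 'comm' is a hypothetical samples-x-taxa matrix of relative abundances
comm_log <- log1p(comm)                           # log(x + 1) keeps zeros at zero
ord_raw  <- metaMDS(comm, distance = "bray")      # NMDS on relative abundances
ord_log  <- metaMDS(comm_log, distance = "bray")  # NMDS on log-transformed data

ord_raw$stress; ord_log$stress                    # compare the stress of the two fits
```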
Hi, I have gathered particulate matter data from three sensors, inside and outside, for a full week. I co-located them (placed them together) before and after the real data gathering. I want to test whether the sensors agree, i.e., whether they report the same values when compared with a statistical test. What would I use for this?
I want to compare inside to outside data, so I have kept track of any inside activities that may cause peaks in the data. Any suggestions of what I can do statistically when comparing outside to inside? The sensors display a measurement every ten seconds, so I have a lot of data. It has also been done twice, at two different houses.
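A rough sketch of one way to use the co-location periods (when the sensors were physically together and should have read the same air); the column names are hypothetical and the test ignores the autocorrelation in 10-second data:

```r
# 'coloc' is a hypothetical data frame of readings from the co-location periods,
# with one column per sensor and rows aligned by timestamp.
diff_ab <- coloc$sensor_a - coloc$sensor_b

plot(coloc$sensor_a, coloc$sensor_b); abline(0, 1)  # visual check of agreement
summary(diff_ab)                    # systematic bias between the two sensors
wilcox.test(diff_ab, mu = 0)        # paired test of zero median difference
                                    # (caution: 10-second readings are autocorrelated)
```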
Hi all,
Full disclosure: I don't have a statistics background, but have been experimenting with copulas recently in the context of simulating data.
I've been told by a friend that copulas struggle when you introduce variables that have only limited dependence on the other variables in the mix. In my (admittedly limited) experience, once variables with limited dependence on the majority of the set are removed, the expected correlations do seem more robust (this is when using a function that automatically constructs the vine/copula structure rather than constructing it myself).
What would be options for this type of situation? Or is the problem inherent to the use of an automated construction, and is a simpler structure constructed directly generally preferable?
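Not an answer, but a minimal Gaussian-copula sketch (all correlations hypothetical) that makes it easy to see what happens when one variable has only weak dependence on the others:

```r
library(MASS)
set.seed(1)

# Correlation matrix: var1 and var2 strongly dependent, var3 nearly independent
Sigma <- matrix(c(1.00, 0.80, 0.05,
                  0.80, 1.00, 0.05,
                  0.05, 0.05, 1.00), nrow = 3)

z <- mvrnorm(n = 10000, mu = rep(0, 3), Sigma = Sigma)  # latent Gaussian draws
u <- pnorm(z)                                           # copula (uniform) scale

# Apply arbitrary margins and check which dependencies survive
x <- cbind(qexp(u[, 1]), qgamma(u[, 2], shape = 2), qnorm(u[, 3]))
cor(x, method = "spearman")
```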
Thank you!
Hi!
I'm trying to test the relation between altimetry and cave entrance locations. I generated one table of cave entrance coordinates and another of altitude values across the landscape where caves were searched for.
I want to know if the distribution of caves on the landscape is random (i.e., distribution should be equal to the altimetry distribution) or if they are clustered around a certain altitude range.
For this, I thought the chi-squared test was the best option since the altitude distribution on the landscape is non-normal. I generated two tables: one of the relative frequency of altitudes on the landscape and another of the relative frequency of cave entrances at any given altitude range. I ran the chi-squared test using the landscape frequencies as expected values and the cave entrance frequencies as observed values. It returned incredibly low p-values (9e-41).
Is my procedure correct? Can I use the distribution of altitudes on the landscape as expected values?
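For reference, a minimal sketch of the procedure described above in R; note that chisq.test() wants observed counts (not relative frequencies) for the caves and expected proportions that sum to 1 for the landscape (the vectors here are hypothetical):

```r
# Hypothetical altitude bins; replace with the real binned data
cave_counts    <- c(3, 25, 40, 12, 2)              # observed cave entrances per bin
landscape_prop <- c(0.30, 0.25, 0.20, 0.15, 0.10)  # share of the landscape in each bin

chisq.test(x = cave_counts, p = landscape_prop)    # goodness of fit to the landscape
```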
Would be grateful for any help :)
Hello!
I'm hoping to get some clarity on what statistical significance means exactly and how it relates to t-tests.
Is it that a "statistically significant" result or effect in a sample is accurately representative of a trend in the population? Or, assuming the null hypothesis that there is no difference is true, something is "statistically significant" when the observed effect is more likely due to a legitimate trend than chance?
Watching videos (specifically this one), I'm struggling to wrap my head around the first example (@2:40). What does it mean for the observed mean life expectancy to be "statistically significantly different" from the presumed population mean?
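In case a concrete toy example helps, here is a one-sample t-test in R with made-up numbers, which is the kind of calculation behind "the observed mean is statistically significantly different from the presumed population mean":

```r
set.seed(1)
lifespan <- rnorm(40, mean = 79, sd = 5)   # hypothetical sample of 40 observations

# Null hypothesis: the population mean is 77. The p-value answers: if the true
# mean really were 77, how often would a sample mean at least this far from 77 occur?
t.test(lifespan, mu = 77)
```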
Any help would be super appreciated, as my mind is tying itself in knots trying to digest it all right now. 🙃 Thanks!
Hello! I am looking to have 2 users produce values using a piece of software. These numbers each correspond to a subject. I want to compare the similarity between the numbers each user gets for each subject. I am also looking to compare this data to the qualitative data that is already known. I was wondering what statistical tests I could perform and what data presentation would be best. In essence, I want to see if I can quantify qualitative data.
This is an example: User 1 gets values 1, 2, 3, 4 and 5 for subjects 1-5; User 2 gets values 2, 4, 5, 4 and 2 for subjects 1-5.
It is already known that subjects 1, 2 and 3 are positive and 4 and 5 are negative. How can I show that the values correlate with whether the subject is positive or negative?
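Using the toy numbers from the post, a sketch of two checks in R: agreement between the users, and whether the values separate the known positive and negative subjects (with only 5 subjects this is purely illustrative):

```r
user1  <- c(1, 2, 3, 4, 5)
user2  <- c(2, 4, 5, 4, 2)
status <- factor(c("pos", "pos", "pos", "neg", "neg"))  # known qualitative labels

cor.test(user1, user2, method = "spearman")   # agreement between the two users

values <- (user1 + user2) / 2                 # e.g. average the two users' scores
wilcox.test(values ~ status)                  # do values differ by known status?
```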
I was wondering if there are implementations of these scales (nominal, ordinal, ratio, interval) on top of the normal data types, e.g. in data frames in R or Python pandas?
I believe it would enable more automation for EDA and visualization, because each scale comes with very specific conditions and requirements. E.g., say I have a data frame with an integer column: if it were somehow marked as "ordinal", I'd expect that I couldn't calculate a mean, but would instead get an error saying that it isn't possible. I would, however, be able to get the median, which I can't get from nominal data!
On the other hand it could also enable other packages to utilize this meta information and show specific visualizations or do certain summary stats out of the box.
Anyways, is there something like this that goes beyond "categorical" and "numeric" in Python and/or R?
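On the R side, ordered factors come close to this out of the box; a quick sketch of the behaviour described above (pandas has an analogous ordered Categorical dtype):

```r
# An "ordinal" column as a base R ordered factor
x <- factor(c("never", "sometimes", "always", "sometimes"),
            levels = c("never", "sometimes", "always"),
            ordered = TRUE)

mean(x)       # NA with a warning: a mean is not defined for factors
x[1] < x[2]   # TRUE: order comparisons are allowed because the factor is ordered
table(x)      # counts work at any level of measurement
```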
So I have a cohort of around 60 medical images, each scored on a scale from 0-5 for subjective image quality (from non-diagnostic to excellent), as well as a score for presence of image noise (0-4; minimal to severe) and a BMI value for each patient (continuous data from 19 to 39).
What statistical test can I use to see whether BMI is correlated with image quality (score 0-5) and with noise level (0-4)?
And do I need to perform a different test to see whether image noise and image quality are correlated?
Just FYI: there are only a limited number of patients with impaired image quality and high amounts of noise; the majority are scored as 'good'. BMI seems to be normally distributed. I've tried Spearman correlations and Kruskal-Wallis tests, but I'm not sure which one (or neither) is correct.
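For what it's worth, the Spearman route mentioned above would look roughly like this in R (the data frame and column names are hypothetical):

```r
# 'img' is a hypothetical data frame with columns bmi, quality (0-5), noise (0-4)
cor.test(img$bmi, img$quality,   method = "spearman")  # BMI vs image quality
cor.test(img$bmi, img$noise,     method = "spearman")  # BMI vs noise level
cor.test(img$quality, img$noise, method = "spearman")  # quality vs noise
```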
Thank you in advance!
Assuming it's not review bombed, and it's not a divisive film (so no two-peaked, love it or hate it scenario), what is the best distribution to represent how the ratings would spread out? I would assume it's something like a normal or gamma distribution, but bounded on both sides. A beta distribution is the one I found that intuitively feels the most appropriate, but is that actually correct?
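If it helps to see it, here is a quick sketch of a beta distribution rescaled to a bounded rating range; the shape parameters are picked arbitrarily just to get a single-peaked, bounded shape:

```r
set.seed(1)
ratings <- 1 + 9 * rbeta(10000, shape1 = 6, shape2 = 3)  # beta rescaled to [1, 10]

hist(ratings, breaks = 40, main = "Beta-shaped ratings on a 1-10 scale")
mean(ratings); quantile(ratings, c(0.1, 0.9))
```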
Hey everyone! I am really so so so confused about what statistical procedure I'm supposed to use, and any help would be greatly appreciated! 😭
Basically, what I'm dealing with is: we're testing our participants 5 times a day over the course of 7 days. I'm going to calculate the daily means and continue with those.
I have three items, which are on a scale of 0-10.
My first two items are on the same questionnaire, and they're about avoidance of thoughts and avoidance of situations.
The third item is separate, and it measures the intensity of pain.
I want to know if there's a difference between the avoidance items in how they influence pain.
My initial thought was a multiple linear regression where the avoidance items would predict the pain outcome, but I'm very unsure whether that would be a good procedure, since those two items are dependent, coming from the same person.
What other procedures could I use?
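As a sketch only (variable names hypothetical, data assumed in long format with one row per participant-day of daily means): the regression described above, plus a random intercept per participant to address the within-person dependence mentioned:

```r
library(lme4)

# 'daily' is a hypothetical long-format data frame with columns:
#   id, day, avoid_thoughts, avoid_situations, pain (all daily means)
fit_lm  <- lm(pain ~ avoid_thoughts + avoid_situations, data = daily)  # the initial idea
fit_mlm <- lmer(pain ~ avoid_thoughts + avoid_situations + (1 | id),
                data = daily)          # accounts for repeated days within each person

summary(fit_mlm)
```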
so grateful for any help!!!
Its still my first semester of grad school but I can already tell taking Topology in undergrad would be far more beneficial than taking more analysis classes (I say “more” because Topology itself usually requires a semester of analysis as a prerequisite. But rather than taking multiple semesters of analysis, I believe taking a class on Topology would be more useful).
The reason is that, aside from proof-writing, you really don't use a lot of ideas from undergrad-level analysis in grad-level probability and statistics classes, except for some facts about series and the topology of R. But topology is used everywhere. I would argue it's on par with how heavily linear algebra is used at this level. It's surprising that more people don't recommend taking it before starting grad school.
So to anyone aspiring to go to grad school for statistics, especially to do a PhD, I’d highly recommend taking Topology. The only exception to the aforementioned would be if you can take graduate level analysis classes (like real or functional analysis), but those in turn also require topology.
Just my opinion!
See this data table from an article on HCQ as a treatment for COVID:
I asked ChatGPT to do a significance test on this data of survival of intubated patients, i.e., those put on ventilators. First, I had to separate the categories into disjoint subsets. That is to say, the “Intubated” line contains the total of both those who survived and those who died. So presenting the data to ChatGPT I had to break that into separate “Survive” and “Death” lines. Then this is how the question was put to ChatGPT:
Do the chi-square significance test on this data:
|         | HCQ | No HCQ | Total |
|---------|-----|--------|-------|
| Survive | 105 | 9      | 114   |
| Death   | 49  | 17     | 66    |
| Total   | 154 | 26     | 180   |
And this was the ChatGPT response:
Here are the results of the chi-square test:
Since the p-value (0.0022) is less than the significance level of 0.05, we reject the null hypothesis. This suggests a statistically significant association between survival (Survive/Death) and HCQ usage (HCQ/No HCQ) in this data.
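For anyone who wants to reproduce this without ChatGPT, the same test in R, with the counts taken from the table above:

```r
tab <- matrix(c(105, 49,    # HCQ column: survived, died
                  9, 17),   # No-HCQ column: survived, died
              nrow = 2,
              dimnames = list(c("Survive", "Death"), c("HCQ", "No HCQ")))

chisq.test(tab)   # default applies Yates' continuity correction for 2x2 tables;
                  # p comes out around 0.002, matching the value quoted above
```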
Posts like this get my blood boiling. People come up with flashy new ideas and think everything that’s been around for decades is “obsolete”. This guy makes the most absurd takes and just gasses up this new uncertainty quantification method known as “conformal prediction”. Can someone explain this to me before I just start putting him on blast via LinkedIn?
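For context on what is actually being hyped: split conformal prediction is mechanically very simple. A minimal sketch on simulated data (not anyone's production method), just to show the calibrated-residual-quantile idea:

```r
set.seed(1)
n <- 200
x <- runif(n)
y <- 2 * x + rnorm(n, sd = 0.3)
d <- data.frame(x, y)

train <- sample(n, n / 2)                 # fit the model on one half
calib <- setdiff(seq_len(n), train)       # calibrate on the other half

fit    <- lm(y ~ x, data = d[train, ])
scores <- abs(d$y[calib] - predict(fit, d[calib, ]))   # calibration residuals

alpha <- 0.1
m <- length(calib)
q <- sort(scores)[ceiling((m + 1) * (1 - alpha))]      # conformal quantile

new_x <- data.frame(x = c(0.2, 0.8))
pred  <- predict(fit, new_x)
cbind(lower = pred - q, upper = pred + q)              # ~90% prediction intervals
```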
In a war scenario, please calculate the total number of civilian victims based on demographics.
Premise
Assumptions
Question
Issue 1:
I’m analyzing my data using one-way ANOVA to examine differences in professional development (PD) method frequencies across educator demographic groups (e.g., attendance at workshops by age, years of experience, etc.). To check for homogeneity of variances, I’ve been using Levene’s test. When variances are equal, I proceed with standard ANOVA and use Tukey’s HSD when results are significant.
So far, everything has been straightforward.
However, I’ve been advised that when Levene’s test shows unequal variances, I should switch to Welch’s ANOVA and then use the Games-Howell post-hoc test if needed.
***
Issue 2:
Most of my Likert scales range from 1 to 5 (e.g., never to always). However, for questions about the effectiveness of PD strategies (e.g., Reflective discussions are 1 = No help to 5 = Very helpful), I’ve included a 0 = No exposure option, making it a 0-5 scale.
Using SPSS, I tried the 'Select Cases' function to exclude responses marked '0,' but it removes all responses for that respondent, even those with valid answers for other items. For instance, take the variable “Teaching observation” (labeled C2_2) as an example:
Ideally, I’d want to keep:
Problem: My current approach ends up analyzing:
It’s excluding all of Respondent A's responses, which reduces my sample unnecessarily.
This is how I have been excluding responses in SPSS 25
In calculus you have point slope or the derivative, but statistics doesn't to my knowledge have this. Let's say you have two pretty solid clusters of data that spread apart as x gets larger. You could use linear regression to find a line of best fit through the space between the clusters, and intuitively you would know that as x gets larger variance increases. But best you can do to calculate variance for a specific point on your line of best fit is to take some sort of range of values above and below the line centered at x=c and find some sort of estimated variance. What could we do to make a more accurate estimate of point variance? Is there some concept I'm missing?
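One way to make the "variance at a point" idea concrete: model the squared residuals from the fit as a function of x. A sketch on simulated data where the spread grows with x:

```r
set.seed(1)
x <- runif(500, 0, 10)
y <- 2 * x + rnorm(500, sd = 0.5 + 0.4 * x)   # spread increases with x

fit  <- lm(y ~ x)                              # line of best fit
r2   <- resid(fit)^2                           # squared residuals
vfit <- lm(r2 ~ x)                             # simple model of variance vs x
                                               # (a smoother like loess() also works)

grid <- data.frame(x = c(2, 5, 8))
predict(vfit, grid)        # estimated variance around the line at x = 2, 5, 8
sqrt(predict(vfit, grid))  # estimated "point" standard deviation at those x
```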
Does anyone have any recommendations for statistics programs on the TI-84 Plus calculator?
I’m trying to get into a decent stats program and I’m wondering how I could help my chances. Ive taken the SOA probably exam and passed it as well as calc 1-3, linear algebra, 1 undergrad and 1 grad stats course. I’m currently living in Illinois so I’m thinking my cheapest options would be to go to Urbana Champain. I’m also a citizen of Canada and EU, but I’d probably only want to study in Canada so I’m looking at UBC, McGill, Toronto but Ive noticed that they have more requirements and I may not be able to get in if I don’t have an undergrad in stats
Can anyone explain what the uppercase T means/does in the QDA classifier formula? I am not able to find an explanation across the internet or textbooks. The formula can be found on this page: https://www.geeksforgeeks.org/quadratic-discriminant-analysis/
I'm just trying to understand how the formula works, and I appear to be missing some basic notation knowledge. Any help would be greatly appreciated!
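I can't see exactly how that page typesets it, but in the standard presentation the superscript T is the matrix transpose, which turns the column vector (x - μ_k) into a row vector so that the quadratic form evaluates to a scalar. The usual QDA discriminant is:

```latex
\delta_k(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert
              -\tfrac{1}{2}\,(x-\mu_k)^{T}\,\Sigma_k^{-1}\,(x-\mu_k)
              +\log\pi_k
```

Here Σ_k, μ_k and π_k are the class covariance, mean and prior, and an observation x is assigned to the class with the largest δ_k(x).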
I just completed my undergrad programme majoring in statistics. I've been doing a lot of research into masters programmes I may be interested in and how they would help with future career options (right now, I'm leaning towards data analytics). I struggled (kind of still struggling, tbh) to choose between a pure statistics and an applied statistics degree. I'm thinking an applied statistics degree may better prepare me for industry, as I don't want to go into academia. But since MAS degrees focus on teaching students how to apply statistical knowledge in the real world, they tend to be more coding-focused. I'm concerned my basic programming skills may not be enough to get accepted into any programme. I'm not completely clueless when it comes to coding; I'm at a beginner level in Python and still learning. Is that enough, or would I need at least intermediate skills before I'd be considered, or would I be better off just applying to pure statistics programmes?
How many components should be extracted?
Scree plot: https://postimg.cc/4nzVxNW9
This is the output for my PCA:
Principal Components Analysis
Call: psych::principal(r = res.imp$completeObs, nfactors = 2, rotate = "oblimin")
Standardized loadings (pattern matrix) based upon correlation matrix
TC1 TC2 h2 u2 com
QTH_VARIABLE_QUAL 0.65 0.45 0.55 1.0
QUW_VARIABLE_QUAL 0.79 0.61 0.39 1.0
QIW_VARIABLE_DJJ 0.41 0.30 0.70 1.6
QOW_VARIABLE_PTT 0.77 0.55 0.45 1.0
QQJ_INTEREST 0.41 0.51 0.61 0.39 1.9
WESCHLER_2020 0.78 0.63 0.37 1.0
SDQ_HYPERACTIVITY 0.84 0.64 0.36 1.0
VOCABULARY_TEXT 0.91 0.87 0.13 1.0
TC1 TC2
SS loadings 2.47 2.18
Proportion Var 0.31 0.27
Cumulative Var 0.31 0.58
Proportion Explained 0.53 0.47
Cumulative Proportion 0.53 1.00
With component correlations of
TC1 TC2
TC1 1.00 0.43
TC2 0.43 1.00
Mean item complexity = 1.2
Test of the hypothesis that 2 components are sufficient.
The root mean square of the residuals (RMSR) is 0.11
with the empirical chi square 98.01 with prob < 0.000000000000004
Fit based upon off diagonal values = 0.92
res.pca.imp$eig
eigenvalue percentage of variance cumulative percentage of variance
comp 1 3.5232032 44.040041 44.04004
comp 2 1.1239440 14.049300 58.08934
comp 3 0.9686372 12.107965 70.19731
comp 4 0.7531068 9.413835 79.61114
comp 5 0.5761703 7.202128 86.81327
comp 6 0.5024870 6.281087 93.09436
comp 7 0.3649657 4.562071 97.65643
comp 8 0.1874858 2.343573 100.00000
Thank you so much for your help!
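Not a verdict on the number of components, but one common cross-check alongside the scree plot and the Kaiser rule is parallel analysis; a sketch assuming the same imputed matrix (res.imp$completeObs) used in the call above:

```r
library(psych)

# Parallel analysis: compare observed eigenvalues to eigenvalues from random data
fa.parallel(res.imp$completeObs, fa = "pc")
```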
I’m talking on the level of tibshriani, Friedman, hastie, Gelman, like that level of cracked. I mean for one, I think part of it is natural ability, but otherwise, what does it truly take to be a top researcher in your area or statistics. What separates them from the other researchers? Why do they get praised so much? Is it just the amount of contributions to the field that gets you clout?
Hi, these are generated outputs from my algorithm (I have 10 runs for each parameter; the table below only displays the first 4). The algorithm takes a long time to run, and I do not know the underlying distribution of its output. I want to perform a statistical test of whether the differences between my parameters are significant. To do that, I first need to perform a power analysis to determine whether my sample size is appropriate (n = 10 could be too small). How should I approach conducting the power analysis here? ChatGPT suggested using Monte Carlo simulation to try out the Friedman test; however, my question is, if you use MC simulation, don't you already have an underlying assumption about the distribution? Thanks for your help!
|             | run1 | run2 | run3 | run4 |
|-------------|------|------|------|------|
| parameter A | 4519 | 4518 | 4520 | 4517 |
| parameter B | 4518 | 4517 | 4521 | 4519 |
| parameter C | 4522 | 4521 | 4527 | 4515 |
update: I've sampled the algo 1000 times (using 1 of the parameters), the distribution looks skewed, like a beta distribution: https://ibb.co/5nrF2v9
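On the "doesn't MC simulation already assume a distribution?" point: yes, and that is the point of doing it, since the assumption becomes explicit and can be varied. A rough sketch of a simulation-based power estimate for a Friedman test, where the skewed output distribution and the size of the shift are assumptions to be chosen (numbers loosely inspired by the table above):

```r
set.seed(1)

# Power of a Friedman test to detect an assumed shift in one parameter,
# given a skewed (beta-like) baseline distribution for the algorithm output.
power_sim <- function(n_runs, shift, n_sim = 2000, alpha = 0.05) {
  mean(replicate(n_sim, {
    base <- 4500 + 30 * rbeta(n_runs, 2, 5)          # assumed skewed outputs
    out  <- cbind(A = base + rnorm(n_runs, sd = 2),  # runs are treated as blocks
                  B = base + rnorm(n_runs, sd = 2),
                  C = base + shift + rnorm(n_runs, sd = 2))
    friedman.test(out)$p.value < alpha
  }))
}

power_sim(n_runs = 10, shift = 3)   # estimated power with the current 10 runs
power_sim(n_runs = 30, shift = 3)   # how much more runs would buy
```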
To understand how statistics formulas work, I have found it very helpful to recreate them in base R.
It allows me to see how the formula works mechanically—from my dataset to the output value(s).
And to test if I have done things correctly, I can always test my output against the packaged statistical tools in R.
With ChatGPT, now it is much easier to generate and trouble-shoot my own attempts at statistical formulas in Base R.
Anyways, I just thought I would share this for other learners, like me. I found it gives me a much better feel for how a formula actually works.
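A tiny example of the workflow described: compute the Welch two-sample t statistic by hand in base R, then check it against t.test():

```r
set.seed(1)
x <- rnorm(20, mean = 5)
y <- rnorm(25, mean = 6)

# Welch t statistic, written out from the formula
t_manual <- (mean(x) - mean(y)) / sqrt(var(x) / length(x) + var(y) / length(y))
t_manual

t.test(x, y)$statistic   # the packaged version should give the same value
```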
I'm working on a personal data bank as a hobby project. My goal is to gather and analyze interesting data, with a focus on psychological and social insights. At first, I'll be capturing people's opinions on social interactions, their reasoning, and perceptions of others. While this is currently a small project for personal or small-group use, I'm open to sharing parts of it publicly or even selling it if it attracts interest from companies.
I'm looking for someone (or a few people) to collaborate with on building this data bank.
Here’s the plan and structure I've developed so far:
A user ID field (user_id) will link data across tables, allowing smooth and effective cross-category analysis.
I was just wondering what time of the week and hours of the day Americans on average have the most free time. Is there any variation in this by age group, or are there any studies anyone knows of which address this? Any information would be greatly appreciated.
Now that Trump won, clearly some (if not most) of the poll results were way off. I want to understand why, and how polls work, especially the models they use. Any books/papers recommended for that topic, for a non-math major person? (I do have STEM background but not majoring in math)
Some quick googling gave me the following 3 books. Any of them you would recommend?
Thanks!
I'm currently tasked with a disease-treatment project.
I've been asked to find a way to take disease-specific scores, convert them into a decision tree based on treatment paths, and give outcome probabilities plus scores at each branch. On the face of it, this is very easy. It's a straightforward sensitivity branching analysis, and I can do a follow-up $/change-in-score at each branch. This uses published population pooled averages (i.e., a quick and dirty pooled average of changes after treatment in the published literature) on disease-specific scales, converts those to EQ-5D or similar, and then to QALYs. I've found a paper that published an R algorithm to do this with the most common disease-specific instrument (SNOT-22), but only on an individual basis. How would I go about doing this with group averages only?
I was using NumPy's default np.random.random in Python, and I was wondering what exactly it does to generate "random" values. I hope people here can enlighten me.