/r/AskStatistics
Ask a question about statistics (other than homework).
Posts must be questions about statistics. The sub is not for homework or assessment help (try /r/HomeworkHelp). No solicitation of academic misconduct. Don't ask people to contact you externally to the subreddit. Use informative titles.
See the rules.
If your question is "what statistical test should I use for this data/hypothesis?", then start by reading this and ask follow-ups as necessary. Beware: it's an imperfect tool.
If you answer questions, you can assign your own flair to briefly describe your educational or professional background in statistics.
Hi guys, do y'all happen to know the model for fixed/random effects Poisson regression? And is it the same for fixed/random effects negative binomial regression?
I'm reading a lot of publications, but I get confused because they differ from each other.
Please help me out.
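For reference, the way these are usually written (a sketch of the textbook panel-count setup; the papers you're reading may parameterize things differently):

y_{it} \mid x_{it}, \alpha_i \sim \mathrm{Poisson}(\mu_{it}), \qquad \log \mu_{it} = x_{it}^\top \beta + \alpha_i

- Fixed effects: the \alpha_i are treated as unit-specific parameters (estimated directly, or conditioned out of the likelihood).
- Random effects: the \alpha_i are draws from a distribution, e.g. \alpha_i \sim N(0, \sigma_\alpha^2) or a multiplicative gamma effect, and only the distribution's parameters are estimated.

The negative binomial counterparts keep the same linear predictor but replace the Poisson with a negative binomial so the conditional variance can exceed the mean, so the fixed/random effects logic carries over, but they are not literally the same model.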
Hello, I would like to find a textbook or book that covers these terms comprehensively, rather than reading about each term separately. Thank you!
When I perform a generalized linear model in JMP, I don't see how to find the means and standard errors for each level of the independent variable. Could someone point me to this in the output?
Apologies in advance for the long post.
Digging into this problem at work where a product generates some value, and consecutive readings need to be consistent to within a pretty tight limit, say 1%, for calibration. There's a separate test for this product with a much wider specification, which checks the post-calibration output, let's say 5%.
I'm trying to justify expanding the specification for calibration by showing that devices which calibrate only marginally outside that 1% tolerance can consistently pass the post calibration output test.
I've collected some data, n=30 for each calibration and output test point, and found the following.
Based on those findings I thought, alright, instead of fixating on the calibration spread, let's just change the test to sample that final, post-calibration output. One thought I had was to have it collect a sample of, say, 10 during the final test and evaluate the sample mean and standard deviation (with confidence intervals) to see whether we could expect the device to consistently produce outputs within the 5% limit. I say 10 as an example because this is a product acceptance test, and we don't necessarily want to increase calibration time by 2 hours to collect n=100 samples.
Chebyshev's Inequality appears to be a general method of doing this, but all the literature I found is pretty explicit about using population parameters. My questions are then:
Is there any way to use Chebyshev's Inequality in this context where we have sample data only? Or does the theorem not even apply in a case like this where the underlying population distribution isn't evident?
Are there any other tricks or techniques that would be helpful in this case? Would attempting transforms on data which fail normality testing make sense? Would normality testing at smaller sample sizes even have enough power to be useful?
Are there any interesting conclusions I can reach after observing a consistent range of data across repeated samples? Perhaps repeat the experiment at different n values to see how big a sample would need to be in order to capture the full, expected range and use that to define a sample size for acceptance testing?
Thanks!
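On the Chebyshev question above: one way to make the idea concrete with sample statistics is sketched below, but it is only a heuristic, because Chebyshev's inequality is stated for the true (population) mean and variance; plugging in sample estimates gives an informal bound, not a guaranteed one. The readings, variable names, and the 5% limit are placeholders from the post, and a normal or nonparametric tolerance interval would be the more standard tool for "what fraction of future units will be inside the spec".

import math
import statistics

# Hypothetical post-calibration readings from one device (placeholder data), in % deviation
readings = [0.2, -0.8, 1.1, 0.5, -0.3, 0.9, -1.2, 0.4, 0.1, -0.6]

m = statistics.mean(readings)
s = statistics.stdev(readings)   # sample standard deviation (n-1 denominator)

limit = 5.0                      # post-calibration spec, in %
k = (limit - abs(m)) / s         # how many (sample) SDs fit inside the spec

# Chebyshev: P(|X - mu| >= k*sigma) <= 1/k^2 for the *population* mu and sigma.
if k > 1:
    print(f"mean={m:.2f}, sd={s:.2f}, k={k:.1f}, Chebyshev-style bound on exceedance <= {1/k**2:.3f}")
else:
    print("Spec is within one sample SD of the mean; Chebyshev gives no useful bound.")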
I'm doing a meta-analysis comparing the effectiveness of the Agrobacterium-mediated method and the biolistic method for making transgenic plants. Between the two groups, I'm comparing the rate of successful gene insertion, the average yield, and the % relative range for the yield. I analysed them using the Mann-Whitney U test, since I had a small sample size and the data were not normally distributed.
When I compare the means between these two groups, there's a massive difference. However, when I use the Mann-Whitney U test, it says that the difference isn't significant enough to reject H0. I checked the data distributions, and they're all wildly different shapes. What do I do?
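For what it's worth, here is what the test is actually computed on (a minimal sketch with made-up numbers, not your data). Mann-Whitney compares rank distributions rather than means, which is one reason a large mean difference can still come out non-significant at small n, especially when the group shapes differ.

from scipy import stats

# Hypothetical insertion rates (%) for the two methods -- placeholder values only
agro = [12.0, 15.5, 40.2, 8.1, 22.3]
biolistic = [3.2, 55.0, 4.1, 60.5, 5.0]

u_stat, p_value = stats.mannwhitneyu(agro, biolistic, alternative="two-sided")
print(u_stat, p_value)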
This problem is driving me nuts, and apologies in advance for my statistical illiteracy: imagine we have a machine process and want to determine the failure rate of widget production with high confidence. We can't sample every widget, so how should we structure this problem from a statistics point of view? The main issue for me is that this is an ongoing process without a known end date.
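One common way to frame it (offered as a sketch, not the only option) is to treat a random sample of widgets as binomial trials and put an exact (Clopper-Pearson) confidence interval on the failure rate; for the ongoing-process aspect, people often repeat this periodically or run a control chart on the rate. The counts below are placeholders.

from scipy.stats import beta

n = 500       # widgets sampled (placeholder)
k = 3         # failures observed (placeholder)
alpha = 0.05  # for a 95% confidence interval

# Exact (Clopper-Pearson) interval for the failure rate
lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
print(f"observed rate {k/n:.4f}, 95% CI [{lower:.4f}, {upper:.4f}]")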
Hi!
I don't know whether it's possible, but I would like to compare the baseline characteristics of different populations from different studies based on aggregate data. For example, age from three different studies, having, for each study, the sample size, mean, and standard deviation of the population. Is there any statistical test allowing that?
Furthermore, is it possible to compare different Kaplan-Meier curves having only an aggregate number of events at discrete time points (e.g. 1 month, 6 months, 1 year, and so on)?
Many thanks
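For the first part, pairwise comparisons can be run directly from the summary statistics; scipy's ttest_ind_from_stats takes exactly the mean, SD, and n of each group (for all three studies at once, a one-way ANOVA computed from the same summaries is the analogous step). A sketch with placeholder numbers:

from scipy.stats import ttest_ind_from_stats

# Placeholder summary data: mean age, SD, and n for two of the studies
t_stat, p_value = ttest_ind_from_stats(mean1=62.4, std1=9.1, nobs1=150,
                                        mean2=65.0, std2=8.7, nobs2=210,
                                        equal_var=False)  # Welch's t-test
print(t_stat, p_value)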
Hey everyone, I am an undergraduate student in Food Science. I developed a fruit coating to prevent postharvest loss. Now I want to check the coating's impact on the fruit's antioxidant capacity and physicochemical properties (three properties to check). I plan to take measurements four times (day 1, day 4, day 8, and day 12). I have four study groups, including the control group. I put the parameters as shown in the picture.
I have no idea about the effect size or the number of measurements. Can someone please explain those two factors to me, and whether my sample size is correct or not?
Thank you!
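On effect size: for four groups compared by ANOVA at a single time point, the effect size is usually Cohen's f (roughly, the between-group spread relative to the within-group SD). Below is a sketch of how effect size, power, and sample size trade off, using statsmodels; the 0.25 "medium" effect is just a conventional placeholder, not something derived from your data, and repeated measurements over the four days would call for a repeated-measures/mixed-model power calculation instead, so treat this only as an illustration of the moving parts.

from statsmodels.stats.power import FTestAnovaPower

# Total n needed to detect a "medium" effect (Cohen's f = 0.25) across 4 groups
analysis = FTestAnovaPower()
n_total = analysis.solve_power(effect_size=0.25, alpha=0.05, power=0.80, k_groups=4)
print(f"total sample size across the 4 groups: about {n_total:.0f}")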
Hi all,
some time ago I stumbled upon a problem of calculating standard error on a coefficient in a regression analysis, where this coefficient is a mean value taken from multiple regressions. It has been bugging me ever since, maybe someone has an idea how to deal with this.
To make the issue clear, the actual application is as follows:
My camera is observing a uniform scene, and the sensor is Nx by Ny pixels. I am changing the illumination on the sensor over time, so in essence each pixel should have the same signal-over-time function, except that each pixel has a slightly different response function. For each pixel I perform a non-linear regression and fit some coefficients to the model. Let's say it's something like:
y(x) = x0 + x1*exp(-x2/x)
x0 and x1 are responsible for the sensitivity of the pixels and I don't care about those; they are supposed to be different for every pixel. x2 is responsible for the illumination function and should be the same for all pixels (in principle), so I am calculating the mean value of x2 over all pixels. How should I calculate the standard error (or confidence interval) of the averaged x2 value?
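A sketch of the two usual options, assuming each per-pixel fit also returns a standard error for x2 (all numbers below are simulated placeholders): (a) treat the pixels as replicate estimates and use the empirical SE of the mean, which absorbs both fit noise and any real pixel-to-pixel variation in x2; (b) take an inverse-variance weighted mean, which is appropriate only if x2 really is identical for all pixels. One caveat: if all pixels see the same exposure sequence, the per-pixel estimates are not independent, so either SE can be optimistic.

import numpy as np

rng = np.random.default_rng(0)
# Placeholder per-pixel fit results: estimates and their reported standard errors
x2_hat = rng.normal(3.0, 0.05, size=10000)  # stand-in for the fitted x2 values
x2_se = np.full(10000, 0.04)                # stand-in for each fit's reported SE

# (a) empirical SE of the mean across pixels
mean_a = x2_hat.mean()
se_a = x2_hat.std(ddof=1) / np.sqrt(x2_hat.size)

# (b) inverse-variance weighted mean (assumes a single true x2 for all pixels)
w = 1.0 / x2_se**2
mean_b = np.sum(w * x2_hat) / np.sum(w)
se_b = np.sqrt(1.0 / np.sum(w))

print(mean_a, se_a)
print(mean_b, se_b)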
I'm having a bit of a crisis right now, really. The only things that I've learned in my undergrad program that I'm attached to are numerical methods, and loads of linear algebra lol. These days, I do wish to pursue grad school and earn my PhD in numerical analysis...but damn, does this feel like a waste of an undergrad experience.
Every day, we hear the same things. "Medical researchers find these cures using machine learning", or "materials scientists discover x number of new materials using AI". That's awesome. So how many of these innovations could've been done without AI, and without the obvious negative externalities that AI brings to humanity?
Hi all. I recently submitted a portion of my results chapter to my supervisor. I included both correlation and multiple linear regression. She said it's basically duplicating the results and I should just stick to one of the tests. She asked why I did the regression; my thinking was to control for the other predictors in the model. So now she wants me to only use the correlation, but I feel as though it doesn't provide enough insight. She also said that if we are doing an SEM afterwards, the regression just duplicates the SEM as well. So her thinking is to do a correlation and thereafter an SEM. Does this make sense? I am in the social sciences.
Hi,
I just received the reviews of my manuscript. My experimental setting was as follows: wild-type and knockout mice that were fed a normal diet (chow) or a high-fat diet (HFD) for 8 and 16 weeks. I used a two-way ANOVA to check the interaction between diet and genotype, and then Tukey's post hoc test. I was looking at one dependent variable in each case. The reviewer suggested that I should use MANOVA here.
Is he right?
Does anyone know how to perform Win-Ratio analysis using JASP or JAMOVI?
I don't really get when to use a one-tailed versus a two-tailed test. The description given to us, along with the level of significance, was to find the coefficient of correlation and determine at the 0.10 significance level whether the correlation in the population is greater than zero.
Please help.
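If it helps to see the distinction concretely: "greater than zero" is a directional alternative, which is what a one-tailed test encodes, whereas "different from zero" would be two-tailed. A small sketch with made-up data (the alternative argument needs SciPy 1.9 or newer):

from scipy.stats import pearsonr

x = [2, 4, 5, 7, 9, 11, 13]   # placeholder data
y = [1, 3, 4, 8, 8, 12, 14]

r, p_one_sided = pearsonr(x, y, alternative="greater")    # H1: rho > 0
r, p_two_sided = pearsonr(x, y, alternative="two-sided")  # H1: rho != 0
print(r, p_one_sided, p_two_sided)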
On a recent statistics test I got a question wrong, but I'm pretty sure I should've gotten it right, or maybe there was no right answer. I don't have the test to quote it exactly, but it went something like: "An online store is interested in the experience of their customers and sends an email to everyone who bought an item in the last [some amount of time, idk] asking them to complete a survey. Which type of bias is guaranteed to occur in this situation?"

I answered non-response bias because it's an optional survey. When I asked my teacher about the question, she said that's not *guaranteed* because in theory everyone could answer. (Even then, I feel like that's still non-response bias in a way, because everyone just happened to respond and you're not representing the viewpoint of someone who wouldn't respond.) But the intended answer was supposedly undercoverage bias, the reasoning being that emails are undercoverage on principle because not everyone necessarily has an email.

Now, I don't remember the exact wording, but I'm positive the word "every" (or "everyone") was in the question, which means they had everyone's email, which means an email was required to buy from them, which means the entire population they're interested in has emails. But it seems weird that a statistics professor would make such a simple mistake and then stand by it when I pressed them a bit (I had to get to another class so I didn't make much of an argument). Is there a principle here I'm not understanding?
Hello!
I'm confused about the way prediction models are validated and, in general, the way they operate. I mean:
Let's say I want to train a classification model based on historical data and I'm interested in getting a spatio-temporal prediction; that is to say, my classification task will have to determine:
whether a certain location in space will be the scenario or not of a certain kind of event in the future.
That's the classification task!
Now, let's picture that when I say "the future" I'm talking about today or this week.
So, my questions are: how old can my data be and still meet this requirement? I mean, if I have data whose last update was in January, is it useful for today's prediction, and for how long is the model valid considering the training and testing periods? I can't really picture the output I'll get once the model is trained; I know it will be whether a place will be a hotspot or not, there's no doubt about that, but what about the time? How can the classification output be timely if the timestamp of my last testing entry (real-world data) can only be, at most, from January?
To be clear, I have two datasets: one that is official/governmental, i.e. historical events that truly happened, and another dataset that is 'vox populi'. That's why the data from January until now is unknown: the government doesn't often update its datasets. On the other hand, the vox populi dataset is just from the last week. So I have "old" governmental data (2016-January 2024), then a limbo (a period from January 2024 to October 2024 with no data from any source), and then newly created data from last week. Are these datasets useful for what I'm trying to do? If not, how could I manage this to get a prediction for the time we're living in? Is it possible at all?
I would really appreciate your help guys. Any advice you could give to me would be very helpful.
The explanation I see for this is that if you sample with replacement, the odds of getting a value are the same each time (independent). But it seems to me you wouldn't want the probability to be independent if you want a representative sample of the population.
For example, let's say we have 50 red balls and 50 yellow balls, but we don't know what percent are red. If we've taken more red balls, we want a greater chance of picking a yellow ball, so that our sample mean will be closer to the population mean. The population doesn't count the same item twice, so our sample shouldn't either.
Just to prove this empirically, I wrote a Python script that samples 50 random numbers out of 100 and estimates the average. I take a lot of samples with and without replacement and loop to see which method tends to estimate the population average better.
import random
import statistics

# Population: 100 random integers between 1 and 30
random_numbers = [random.randint(1, 30) for _ in range(100)]

# 1000 sample means of size 50, drawn WITHOUT replacement
sample_means = [statistics.mean(random.sample(random_numbers, 50)) for _ in range(1000)]
standard_deviation = statistics.stdev(sample_means)

# 1000 sample means of size 50, drawn WITH replacement
sample_means_with_replacement = [statistics.mean(random.choices(random_numbers, k=50)) for _ in range(1000)]
standard_deviation_with_replacement = statistics.stdev(sample_means_with_replacement)

print(standard_deviation)
print(standard_deviation_with_replacement)
If you run this yourself, you'll see that the standard deviation of the sample averages without replacement (around 0.8) is almost always much lower than the standard deviation of the sample averages with replacement (around 1.2), meaning that we get results with a smaller spread by not replacing (even if our sample size is more than 10% of the population).
I am taking a course in ML and the term posterior distribution comes up a lot and I don't have much of a background in statistics. What makes something a posterior distribution and why is it called that?
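For what it's worth, the standard statement (ordinary Bayes' theorem, nothing ML-specific):

p(\theta \mid \text{data}) = \frac{p(\text{data} \mid \theta)\, p(\theta)}{p(\text{data})} \propto \text{likelihood} \times \text{prior}

The prior p(\theta) is the distribution over the parameters before seeing the data; the posterior p(\theta \mid \text{data}) is that same distribution after conditioning on the data, hence the name (posterior = "coming after").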
Hey guys!
I plotted a violin plot of survival probabilities (estimated marginal means, EMMs) back-transformed from log-odds, modelled from a GLMM analysis of binomial data (alive/dead) from different treatments.
Initially I used a basic dot plot, but my supervisor later suggested I could use a violin plot to spice things up - so I tried it, and since I'm presenting model-derived, back-transformed probabilities (continuous) rather than raw binary data, it makes sense at this point. Later I was thinking that, in a larger context, since survival is binary (alive or dead), plotting probability distributions doesn't directly convey information about individual successes or failures, but rather about the model's estimated probabilities. So the distribution displayed by a violin plot may give a misleading impression of variation when working with probabilities that represent expected survival rates rather than an observed spread in data points.
Are there other alternatives for plotting these survival probability EMMs besides the basic plots (bar/dot)?
Thank you!
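One common alternative when the quantities are model-derived estimates rather than raw data is to plot each treatment's EMM as a point with its confidence interval, rather than a distribution shape. A minimal matplotlib sketch (treatment names and numbers are placeholders):

import matplotlib.pyplot as plt

treatments = ["control", "A", "B", "C"]   # placeholder labels
emm = [0.72, 0.85, 0.80, 0.65]            # back-transformed survival probabilities
ci_low = [0.60, 0.76, 0.70, 0.52]
ci_high = [0.82, 0.91, 0.88, 0.76]

# errorbar wants distances from the point, not the interval endpoints
yerr = [[m - lo for m, lo in zip(emm, ci_low)],
        [hi - m for m, hi in zip(emm, ci_high)]]

positions = list(range(len(treatments)))
plt.errorbar(positions, emm, yerr=yerr, fmt="o", capsize=4)
plt.xticks(positions, treatments)
plt.ylabel("Estimated survival probability (EMM)")
plt.ylim(0, 1)
plt.show()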
Hey everyone,
I just want to make sure I understand why normality of the error is important in OLS. I can wrap my head around the other assumptions because they are a little more concrete, but this one isn't coming quite as intuitively to me...
I know I'm very much boiling this down, so hang with me while I talk through my process. Essentially, OLS estimates the slope of a line through the data, trying to minimize the SSE. I get that. Then, once we have our beta coefficient, we test it for significance with a t-test: we take the beta estimate, find its standard error, and refer the ratio to the sampling distribution, where we can see if the estimate is significantly different from zero.
I guess my question about the error distribution is this: does a normally distributed error help us with valid hypothesis testing because it implies the sampling distribution of the beta is normal? In other words, the beta estimates form around the "true" beta with a normal distribution. Some betas will be noticeably bigger and some smaller because, across the infinite samples we could take, some will have higher error and some lower, and this creates those bigger and smaller betas. Is this why the normal distribution of the error is theoretically important?
Let me know if I'm like... completely off base here. Like I said, I'm just trying to rationalize this assumption in my head. Appreciate the responses (and critiques) in advance!
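For what it's worth, the standard result the post above seems to be reasoning toward, written out (classical linear model assumptions, a sketch): if \varepsilon \sim N(0, \sigma^2 I), then \hat\beta \sim N\big(\beta, \sigma^2 (X^\top X)^{-1}\big) exactly, and with \sigma^2 replaced by its estimate, (\hat\beta_j - \beta_j)/\widehat{\mathrm{SE}}(\hat\beta_j) \sim t_{n-p}, which is what justifies the t-test in small samples. Without normal errors, the same t-based inference is still justified approximately in large samples via the central limit theorem.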
I've been trying to find a sufficient answer online and am not entirely convinced by the information I've found so far.
My concern is: why is applying a transformation considered a valid way of handling skewed data? Surely compressing large values, such as with a log function, removes some information?
Doesn't the fact that you have to transform the data to meet the assumptions for analysis mean that any conclusions drawn are invalid?
For example, say I have moderately skewed data. I perform a log transformation, the data now visually fits a normal distribution, and the relevant normality test indicates the transformed data is normal. If I identify outliers using the 3-sigma rule, are they really outliers in the original data, considering the values I used for my analysis have undergone a transformation?
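A small sketch of the point in that last paragraph: limits computed on the log scale can be mapped back with exp(), so you can at least see what they correspond to in the original units (all data below is simulated):

import numpy as np

rng = np.random.default_rng(1)
x = rng.lognormal(mean=2.0, sigma=0.6, size=500)   # skewed placeholder data

log_x = np.log(x)
mu, sd = log_x.mean(), log_x.std(ddof=1)

# 3-sigma limits on the log scale...
low_log, high_log = mu - 3 * sd, mu + 3 * sd
# ...and what they correspond to back on the original scale
low, high = np.exp(low_log), np.exp(high_log)

outliers = x[(x < low) | (x > high)]
print(f"limits on the original scale: [{low:.2f}, {high:.2f}], {outliers.size} points flagged")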
I am so completely lost with my dissertation project's analysis steps, and would really appreciate any insight/recommendations on how to proceed.
I am examining rehospitalizations (count data) during the first year after receiving a kidney transplant in 3 US states. In my negative binomial regression, I include: age (categorical), sex (female/male), race/ethnicity (categorical), length of stay (of the initial transplant procedure), Elixhauser comorbidity score, and hospital/transplant center. I realize that the hospital/transplant center variable does not just go willy-nilly into the negative binomial regression as a covariate, and that I do need to adjust for hospital. There are 51 hospitals in my dataset. How do I go about including/adjusting for transplant center/hospital clustering in my NB model? I am working in Stata if that is helpful to know. Thank you so much (from a somewhat defeated-feeling PhD student who badly wants to finish).
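Not Stata, but here is a sketch of one common option in Python/statsmodels, just to show the structure: keep hospital out of the mean model and use cluster-robust standard errors grouped by hospital (the other main route is a random intercept for hospital in a mixed/multilevel NB model). All variable names and data below are placeholders standing in for the real dataset. If I remember the Stata side right, the rough analogues are vce(cluster hospital) on the NB command or a mixed-effects negative binomial, but check the documentation for your version.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Placeholder data standing in for the real dataset (names are assumptions)
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "readmits": rng.poisson(1.2, n),
    "age_cat": rng.choice(["18-39", "40-59", "60+"], n),
    "sex": rng.choice(["F", "M"], n),
    "los": rng.integers(3, 20, n),
    "elixhauser": rng.integers(0, 10, n),
    "hospital_id": rng.integers(1, 52, n),
})

# Negative binomial GLM with cluster-robust (by hospital) standard errors
model = smf.glm(
    "readmits ~ C(age_cat) + C(sex) + los + elixhauser",
    data=df,
    family=sm.families.NegativeBinomial(),
)
result = model.fit(cov_type="cluster", cov_kwds={"groups": df["hospital_id"]})
print(result.summary())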
Hey, so I've been reading "Stats: Data and Models" by Richard De Veaux, Paul Velleman, and David Bock.
My question is: when and where do I use each model? Binomial is pretty easy, since that's either a success or a failure, but the others have me stumped.
Could you give some examples and explain why each model was used?
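Not a full answer, but small worked examples sometimes make the boundaries clearer. A sketch of the kind of question each of the usual models answers, assuming the chapter covers the usual suspects (binomial, geometric, Poisson, normal); the numbers are made up:

from scipy import stats

# Binomial: number of successes in a fixed number of independent trials
#   e.g. P(exactly 3 defective items in a batch of 20, with a 10% defect rate)
print(stats.binom.pmf(3, n=20, p=0.10))

# Geometric: number of trials until the first success
#   e.g. P(the first sale happens on the 5th call, with a 30% chance per call)
print(stats.geom.pmf(5, p=0.30))

# Poisson: counts of events in a fixed interval with a known average rate
#   e.g. P(exactly 2 customer arrivals in an hour, averaging 4 per hour)
print(stats.poisson.pmf(2, mu=4))

# Normal: a continuous measurement clustered around a mean
#   e.g. P(a height is below 180 cm, with mean 175 and SD 7)
print(stats.norm.cdf(180, loc=175, scale=7))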
Frustrated history student here, I'm not very good at maths, so here it goes:
I want to calculate deaths per 1000 in 19th-century Amsterdam (the Netherlands), only including ages 4-20. I've got the number of deaths divided into age groups, and the total population, which is not divided by age. If I divide the total deaths by the total population, multiply by a thousand, and then multiply the answer by the proportion of deaths within the 4-20 age group (1% = 0.01), would this give me the death rate for the 4-20 group only? An estimate would also be sufficient.
I'm sorry if this is a stupid question; I'm not used to doing stats or maths at all.
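If I follow the proposed calculation correctly, written out it is

\text{rate} = \frac{D_{\text{total}}}{P_{\text{total}}} \times 1000 \times \frac{D_{4\text{-}20}}{D_{\text{total}}} = \frac{D_{4\text{-}20}}{P_{\text{total}}} \times 1000,

i.e. deaths of 4-to-20-year-olds per 1000 of the *total* population, not per 1000 people aged 4-20. A true age-specific death rate would need the population aged 4-20 in the denominator instead, which you would have to estimate if the population isn't broken down by age.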
I am new to statistics, so this may seem dumb, but help would be appreciated.
My model is a regression with a varying number of features (2-5). I am working with time series data, and the idea is to retrain daily on a rolling window and test on the next day's outcome.
Say that I have 1250 data points in total and I use 1000 for my train set.
The rolling window can range from 30 to 120 days and can be thought of as a hyperparameter, and so can the exact number of features.
The idea is to create the most robust testing framework. Walk-forward validation is often cited as a good option for time series data, but I'm getting tripped up by the fact that I train daily and want to compare my prediction with reality the next day (usually in walk-forward you train on, say, 250 days, then keep your model fixed for the next 250 days to test on), whereas I train on 90 days, predict tomorrow, and repeat.
So how do I do this? Can I just split my training data into 4 folds, check which number of parameters and rolling window does best on the first 250 days, then test on the next 250, and repeat? Is doing this tuning process right away kosher?
Thanks!
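A minimal sketch of daily-retrain walk-forward for the setup above, just to show the loop structure. Everything here is a placeholder (the simulated data, the candidate windows, plain least squares standing in for your regression): fit on the trailing window, predict one step ahead, move forward one day, and do the hyperparameter search only on the first 1000 points before touching the final 250.

import numpy as np

rng = np.random.default_rng(0)
n, n_features = 1250, 3
X = rng.normal(size=(n, n_features))                         # placeholder features
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(0, 0.5, n)   # placeholder target

def walk_forward_errors(X, y, start, end, window):
    """Refit on the trailing `window` days each day, predict the next day."""
    errors = []
    for t in range(start, end):
        X_train, y_train = X[t - window:t], y[t - window:t]
        coef, *_ = np.linalg.lstsq(np.c_[np.ones(window), X_train], y_train, rcond=None)
        pred = np.r_[1.0, X[t]] @ coef
        errors.append(y[t] - pred)
    return np.array(errors)

# Tune the window on the training region only (first 1000 points)...
candidate_windows = [30, 60, 90, 120]
scores = {w: np.mean(walk_forward_errors(X, y, 200, 1000, w) ** 2) for w in candidate_windows}
best_w = min(scores, key=scores.get)

# ...then report performance on the held-out final 250 days with that fixed choice.
test_mse = np.mean(walk_forward_errors(X, y, 1000, 1250, best_w) ** 2)
print(best_w, test_mse)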
I'm currently doing my undergraduate thesis proposal and trying to reduce the sample size I need. One of my local RRLs used this to determine their sample size. Can someone simply explain how the percentage prevalence was used in the formula?
Excerpt from the article:
"The study sample size was calculated using the Cochran formula to estimate the prevalence. The prevalence of menopausal symptoms was set to 51.4% based on the study of Chim et al.[23] that 51.4% experience low back pain. The resulting sample size was calculated at 196 with a margin of error set to 7% and a level of confidence at 95%. Adjusting for nonresponders at 15%, the minimum required sample size was 225, which was fulfilled by the study."
Hey guys...
A friend had an insight he wanted to test: common pain scales ask patients to rate their pain from 0 to 10, with 0 being no pain and 10 the worst imaginable.
Classic literature claims mild pain covers values from 1-3, moderate pain from 4-6, and severe pain from 7-10.
My friend's hypothesis is that these cutoffs are not correct: that actual mild pain ranges up to nearly 5, and only pains above 8, or maybe 9, are considered severe.
So he collected data. He interviewed a lot of patients and asked them for both their numeric pain score and a subjective (mild, moderate, severe) rating.
With the data in hand, I got the challenge of how to analyze it.
My initial idea was to transform "mild, moderate, severe" into arbitrary numerical values; I used 2, 4, 8... then I ran a Pearson correlation and took note.
Then I built another column, recoding values from 0-5 as "new_mild" (hence 2), 6-8 as "new_mod" (hence 4), and 9-10 as "new_severe" (hence 8). Again I ran a Pearson correlation with the new values and compared it to the original scale... that, and some value wrangling later, we found the best fit...
Later on, I thought about using AUROC, or more accurately the diagnostic odds ratio, to try to find the best fit - it matched my initial Pearson coefficient attempt precisely.
All in all, it seems OK... but I don't think this is the correct approach to this problem; rather, it feels like a layman's attempt to use simple tools to tackle a complex problem. Can you guys advise me on how I should have conducted this better? Thanks in advance. Cheers!
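One alternative framing, offered tentatively: treat each candidate pair of cutoffs as a way of collapsing the 0-10 score into three categories, and score how well the collapsed scale agrees with the patients' own mild/moderate/severe labels (e.g. with a weighted kappa), rather than forcing arbitrary numeric codes into a Pearson correlation. Ordinal regression would be the more principled route. A sketch with simulated placeholder data:

import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Placeholder data standing in for the interviews
nrs = rng.integers(0, 11, size=200)                           # 0-10 numeric rating scale
labels = np.digitize(nrs + rng.normal(0, 1.5, 200), [4, 8])   # 0=mild, 1=moderate, 2=severe

def collapse(scores, low_cut, high_cut):
    """Map 0-10 scores to 0/1/2 given the upper bound of 'mild' and of 'moderate'."""
    return np.digitize(scores, [low_cut + 1, high_cut + 1])

# Search all cutoff pairs for the best agreement with the subjective labels
best = max(
    ((lo, hi, cohen_kappa_score(labels, collapse(nrs, lo, hi), weights="quadratic"))
     for lo in range(1, 9) for hi in range(lo + 1, 10)),
    key=lambda t: t[2],
)
print("best cutoffs (mild <= %d, moderate <= %d), weighted kappa = %.3f" % best)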
I'm having a little argument with a professor because I've used a Mann-Whitney test to compare two groups of very different sample sizes (n=17 and n=122). I did that because I believe the first group is too small for a standard t-test.
She argues that I can't use this test because of the different sample sizes and asks me to take a random sample from the big group and use that as a comparison. This doesn't really make much sense to me, because why would I use a smaller group if I have data from a bigger group? I've tried to search this online and found a lot of people saying that these differences are ok for the test, but no article or book references.
I'm not completely sure my approach is right. What do you guys think? Which one makes more sense? Do you have any references I can send to her to talk about how the different sample sizes aren't a problem for this test in particular?
Thanks!
Hey all!
I'm wondering if anyone can share any good books, articles, or websites that walk you through the steps of designing a quantitative research study.
I'm in an Ed.D. program with a dissertation requirement, but all of our stats classes have been incredibly theoretical. I'm looking for some resources that highlight the practical process I need to follow to design a good study. My aim is to go mixed methods. I have some familiarity with R and have taken regression and multivariate analysis.
Thanks in advance for any recs!