/r/statistics
/r/Statistics is going dark from June 12-14th as an act of protest against Reddit's treatment of 3rd party app developers.
This community will not grant access requests during the protest. Please do not message asking to be added to the subreddit.
Guidelines:
All Posts Require One of the Following Tags in the Post Title! If you do not flag your post, automoderator will delete it:
Tag | Abbreviation |
---|---|
[Research] | [R] |
[Software] | [S] |
[Question] | [Q] |
[Discussion] | [D] |
[Education] | [E] |
[Career] | [C] |
[Meta] | [M] |
This is not a subreddit for homework questions. They will be swiftly removed, so don't waste your time! Please kindly post those over at: r/homeworkhelp. Thank you.
Please try to keep submissions on topic and of high quality.
Just because it has a statistic in it doesn't make it statistics.
Memes and image macros are not acceptable forms of content.
Self posts with throwaway accounts will be deleted by AutoModerator
Related subreddits:
Data:
Useful resources for learning R:
r-bloggers - blog aggregator with statistics articles generally done with R software.
Quick-R - great R reference site.
Related Software Links:
Advice for applying to grad school:
Advice for undergrads:
Jobs and Internships
For grads:
For undergrads:
Hi everyone,
I have a question about a survey where respondents could choose up to 4 answers. These are categorical variables (e.g. financial pressure, family pressure, etc.).
I also have a score for three scales of anxiety, stress, and depression.
I want to do a correlation between each of those scales and the categorical variables: for example, if you score higher on anxiety, are you more likely to choose "financial pressure" as one of your 4 choices?
Any suggestions on how I could do this?
Or should I do regression and dummy code the categorical question?
Thanks!
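One way to frame this (a sketch, not the only option): turn each answer option into its own 0/1 indicator, then either correlate the scale score with that indicator (a point-biserial correlation) or run a logistic regression of the indicator on the scale score. The variable names and simulated data below are purely illustrative.

```
# Hypothetical data: 'anxiety' is a scale score, 'chose_financial' is 1 if the
# respondent picked "financial pressure" among their (up to) 4 choices, else 0.
set.seed(1)
n <- 200
anxiety <- rnorm(n, mean = 10, sd = 3)
chose_financial <- rbinom(n, 1, plogis(-3 + 0.25 * anxiety))

# Point-biserial correlation (numerically a Pearson correlation with a 0/1 variable)
cor.test(anxiety, chose_financial)

# Logistic regression: do higher anxiety scores raise the odds of choosing
# "financial pressure"? Other scales or dummy-coded options can be added.
summary(glm(chose_financial ~ anxiety, family = binomial))
```

The regression version generalizes more easily if you want anxiety, stress, and depression in the same model.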
Hey guys,
I am new to statistics and have a problem I don't know how best to solve. I am analyzing multiple studies of two medications, x and y, to see which is more effective. The outcome is whether event z happens, so I chose to compute risk ratios with the program RevMan 5.

Now to my problem: not all studies compare both medications. Some compare only x with placebo and some compare only y with placebo, but all record whether event z happens.

I want to know how I can leave one side blank. I can only insert 0s, but that ruins the data.

My approach was to compute three risk ratios: one for medication x vs placebo, one for medication y vs placebo, and then a third risk ratio with the pooled data.
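For what it's worth, the usual way to handle trials that only share a placebo arm is an indirect (Bucher-type) comparison: estimate x vs placebo and y vs placebo separately, then contrast them on the log risk-ratio scale, rather than entering zeros or simply adding single arms together (pooling arms across trials breaks the randomization within each trial). A rough sketch of the arithmetic in R, with made-up counts:

```
# Hypothetical 2x2 counts (made up purely to show the arithmetic).
# Trial set 1: medication x vs placebo; trial set 2: medication y vs placebo.
rr_x <- (20/100) / (30/100)   # risk ratio, x vs placebo
rr_y <- (15/100) / (28/100)   # risk ratio, y vs placebo

# log risk ratios and their standard errors (1/events - 1/n for each arm)
log_rr_x <- log(rr_x); se_x <- sqrt(1/20 - 1/100 + 1/30 - 1/100)
log_rr_y <- log(rr_y); se_y <- sqrt(1/15 - 1/100 + 1/28 - 1/100)

# Bucher indirect comparison of x vs y through the common placebo comparator
log_rr_xy <- log_rr_x - log_rr_y
se_xy     <- sqrt(se_x^2 + se_y^2)
exp(log_rr_xy + c(-1.96, 0, 1.96) * se_xy)   # lower CI, point estimate, upper CI
```

In practice you would pool each set of trials with a meta-analysis first (e.g. with the metafor package) and feed those pooled estimates into the indirect comparison.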
Hey guys, I'm a tabletop player and I was just wondering: what is the probability of getting an 8 or higher with 2 dice, given the opportunity afterwards to reroll either or both of the results (say you roll a 6 and a 1, you can choose to keep the 6 and reroll the 1)? I know it is initially 15/36, but the option to reroll the results makes the math a lot more complicated. Was wondering if anyone knew what the real probability would be?
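A brute-force way to see the answer is to enumerate all 36 starting rolls and assume you reroll optimally: keep the pair if it already totals 8+, keep the higher die and reroll the other when that beats rerolling both (i.e. when the higher die is 4, 5, or 6), and reroll both otherwise. A sketch in R:

```
# Exact enumeration under an optimal keep/reroll rule.
p_two_dice <- 15/36                       # P(2d6 >= 8) on a fresh roll of both dice
win <- 0
for (a in 1:6) for (b in 1:6) {
  if (a + b >= 8) {
    win <- win + 1                        # already a success, keep both
  } else {
    hi <- max(a, b)
    p_keep_hi <- max(0, hi - 1) / 6       # P(the rerolled die reaches 8 - hi or more)
    win <- win + max(p_keep_hi, p_two_dice)
  }
}
win / 36
```

Under that optimal-play assumption this works out to 939/1296, roughly 0.72.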
I'm currently in my penultimate year at uni studying comp sci and maths. The market for computer scientists is very saturated at the moment, and I wasn't able to secure an internship this year. And while I don't mind self studying topics for an interview, I think the bar has been set pretty high for being able to solve coding questions and it felt like I was doing an extra course this year purely off of interview prep.
I did computer science because I wanted a job, high earning potential, and stability. Seeing as those are probably off the table for me, I think I'd rather pursue something I enjoy. I love maths and stats, but I'm not entirely sure if I should make the switch this late. If I do switch, I should still be able to graduate on time, though maybe missing out on a couple of stats courses that I'd want to take. I'd love to hear a statistician's opinion on switching majors.
I’m currently modeling financial log-returns with a zero-mean GARCH(1,1) model. I have obtained the following output: https://imgur.com/a/QIO8qaT

The other parameters were significant, which implies that past squared returns (a) and past variances (b) do have an effect, but the intercept (I'm assuming it is the volatility when both the a and b terms are zero) is insignificant.
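For what it's worth, the intercept in a GARCH(1,1) is the constant term in the variance recursion sigma2_t = omega + a * r2_{t-1} + b * sigma2_{t-1}; with daily log-returns it is typically a tiny number, which is one reason it often comes out "insignificant". Together with a and b it sets the long-run (unconditional) variance omega / (1 - a - b). The values below are assumptions standing in for the linked output, not the actual estimates:

```
# Hypothetical GARCH(1,1) parameters (assumptions, not the OP's estimates)
omega <- 1e-6; a <- 0.08; b <- 0.90

# Long-run (unconditional) variance, defined when a + b < 1
omega / (1 - a - b)          # 5e-05
sqrt(omega / (1 - a - b))    # implied long-run volatility per period, about 0.7%
```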
A case study you could use in statistics classes. Dog adoptions at an animal shelter, and the effects of COVID restrictions.
When COVID hit, animal shelters barred visitors from the kennels and switched to an appointment system. Most shelters re-opened post-COVID, but OC Animal Care (California) kept restrictions in place through 2023. In late 2023 they did a pilot program of allowing visitor access to kennels.
This paper shows that dog adoptions rise significantly when visitors get to see dogs in their kennels, and that viewable dogs have a much higher chance of adoption.
https://doi.org/10.56771/jsmcah.v3.85
The paper is open access. The statistics are not too complicated. Good example for classes.
Badly needed
I'm trying to obtain some sort of measure for testing independence between the scores constructed by different classifiers. For example, Naive Bayes assumes independence between the classes and scores accordingly, while other classifiers like decision trees or deep neural networks take other classes into account in their scoring. I'm trying to figure out a way to construct an estimate that quantifies this independence based on the scores for each class for some number of samples.
The best I’ve come across have been correlation and mutual information. I don’t know which is more relevant (and whether there are other approaches, please let me know if so), but I assume mutual information is more appropriate. In any case, both of these construct measures between every pair of classes. Is there a way (or alternate approach) to get a single estimate for testing how ‘independently’ some classifier constructs the scores between classes? Can I just average the mutual information between every pair of classes? Is this a sound approach? Any suggestions on what else I can do?
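One pragmatic option (a sketch, not a standard named estimator): discretize the per-class score columns, compute mutual information for every pair of classes, and average the pairwise values into a single summary. The infotheo package is assumed here:

```
# Average pairwise mutual information between per-class score columns.
# 'scores' is an n x K matrix (one column of scores per class).
library(infotheo)

avg_pairwise_mi <- function(scores, bins = 10) {
  disc <- discretize(as.data.frame(scores), nbins = bins)  # bin the continuous scores
  K <- ncol(disc)
  mis <- c()
  for (i in 1:(K - 1)) for (j in (i + 1):K) {
    mis <- c(mis, mutinformation(disc[[i]], disc[[j]]))
  }
  mean(mis)   # single summary; 0 would indicate pairwise independence
}
```

Averaging pairwise MI only captures pairwise dependence; the joint generalization is total correlation (multi-information), which you could estimate instead if higher-order structure between classes matters.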
For example:
Disease 1 has a mortality rate of 35%
Disease 2 has a mortality rate of 56%
Disease 3 has a mortality rate of 60%
How would one calculate the likelihood of death, based on the sample statistics above, if an individual has all 3 diseases?
Do I just find the mean of each percentage? I am trying to calculate something for personal reasons.
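Averaging the three rates is not what you want. If you are willing to assume the three mortality risks act independently (a strong assumption that is unlikely to hold for real comorbid conditions), the chance of dying from at least one disease is one minus the product of the survival probabilities:

```
p <- c(0.35, 0.56, 0.60)   # mortality rates of diseases 1-3
1 - prod(1 - p)            # 1 - 0.65 * 0.44 * 0.40, roughly 0.886
```

In reality, comorbidity usually changes each disease's risk, so treat this as a rough back-of-the-envelope figure rather than a real estimate.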
Hey everyone, not sure if this is the right place to ask this question, but here goes.
I am currently a registered nurse in the intensive care unit. I got into nursing because I like science, I like working with people, and I’m pretty analytical so icu was a good specialty. Also, thought it would give me a more flexible schedule, but I’ve just found that working nights, weekends, holidays, no set schedule, etc and just everything about it has caused me burnout. It is just not for me anymore. I feel that the times I get to actually use my brain are few and far between, which is why I got into it in the first place, because nursing is overshadowed by so many other issues. I still enjoy the analytical aspect of nursing with looking at the patient but not everything else anymore.
So, I’m looking to switch up careers. As background about me, I’ve always excelled academically, graduated nursing school with 4.0, icu job straight out of school (competitive), have always loved math and science. So thinking of this, I was researching and came across the health analytics/ statistics field. There’s a uni near me that offers a masters in health analytics/ biostatistics. They require only that I have taken an undergrad stats class, which I have. But I’m worried because I really haven’t done stats or math in a while, and have zero knowledge or experience with computer science and programming. I’m willing to put in the work, and I think I have a good personality for it. But I’m just wondering if it’s worth the switch, and how much of a learning curve it will be going into this field with really no experience. Also, is there anything that would help me prepare a little or get a head start? Anything to introduce me to stats again since it’s been a while, or even learn basic programming?
Thanks, I appreciate any help or advice.
Disclaimer: I asked something similar before elsewhere
https://www.reddit.com/r/AskStatistics/comments/16sj21j/exposure_to_investigation_of_data/
but I am asking for something specific here, even though it is somewhat similar in nature.
I have been studying Hogg and McKean's "Introduction to Mathematical Statistics"; I have completed the first seven chapters and managed to solve pretty much all the exercise problems. But as someone mentioned elsewhere, it is an "all theta, no data" situation. I have always wondered whether there are books that specifically describe the kind of data that statisticians deal with in practice, and the kind of statistical tools that are useful in those situations.

I've been stuck in this situation where I can study any amount of theory but have no idea how to get hands-on experience using those tools on real data in the way they are supposed to be used. I am left with the impression that "Data Science/ML" does not involve rigorous statistics (simply because a bazillion people manage to learn data science/ML without ever having studied mathematical-statistics concepts/results like sufficiency, completeness, Basu's theorem, etc.).
If you can suggest a specific book which would be a nice follow up to Hogg & McKean's book keeping in mind the above requirement of mine, I would greatly appreciate it.
Is there a particular model that one would need to use to conduct a CFA with longitudinal data? Ideally with Mplus, but any program works. I have been reading and I just don't have enough base knowledge to understand some of the jargon I am reading lol (such as in Geiser's book). Does anyone have any good starting points with this? I can provide more details about the nature of my data as needed.
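If R is acceptable, a minimal longitudinal CFA sketch in lavaan (the same structure translates directly to Mplus): the same factor is specified at each wave, with residual covariances between repeated indicators. Variable and data names below are placeholders:

```
library(lavaan)

model <- '
  # same factor measured at two waves
  f_t1 =~ y1_t1 + y2_t1 + y3_t1
  f_t2 =~ y1_t2 + y2_t2 + y3_t2

  # residual covariances between repeated indicators
  y1_t1 ~~ y1_t2
  y2_t1 ~~ y2_t2
  y3_t1 ~~ y3_t2
'
fit <- cfa(model, data = mydata)   # mydata is a placeholder data frame
summary(fit, fit.measures = TRUE)
```

Constraining the loadings of matching indicators to be equal across waves then gives a test of longitudinal (metric) measurement invariance, which is usually the next step.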
Hoping someone can help me think this through. I'm running a classic attentional cueing paradigm: participants' attention is cued to a location in space and then a target appears, either where the cue said it would (valid trials, 80% of the time) or on the opposite side (invalid trials, 20% of the time). Participants are instructed to only respond to valid trials (when the target appears where the cue said it should).
I have four categories of responses:
This seems to fit what I need to calculate d'. However, I'm worried because d' is usually calculated where some response is given in all categories - I have zero responses in categories 2 and 4. Can I still calculate d'??? Any thoughts or sources would be very much appreciated.
Thanks!
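You can still get a d' if you apply a correction for the empty cells; a common choice is the log-linear correction (add 0.5 to every count and 1 to every denominator) so hit and false-alarm rates never equal exactly 0 or 1. A sketch with made-up counts; the mapping of counts to response categories below is an assumption about the design:

```
# Hypothetical counts: responses on valid trials are hits, missed valid trials
# are misses, responses on invalid trials are false alarms, withheld responses
# on invalid trials are correct rejections.
hits <- 38; misses <- 2; false_alarms <- 0; correct_rejections <- 10

hit_rate <- (hits + 0.5) / (hits + misses + 1)
fa_rate  <- (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)

d_prime <- qnorm(hit_rate) - qnorm(fa_rate)
d_prime
```

Hautus (1995) discusses this correction and its bias properties if you need a citation.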
[Question] Basically, is it worth applying for a job which uses Stata exclusively? If it is as simple as SPSS, I am confident I can grasp it before the interview date. I have no experience coding apart from at GCSE.
Correlation or regression or both
Hi all. I recently submitted a portion of my results chapter to my supervisor. I included both correlation and multiple linear regression. She said it's basically duplicating the results and I should just stick to one of the tests. She asked why I did the regression; my thinking was to control for the predictors in the model. So now she wants me to only use the correlation, but I feel as though it doesn't provide enough insight. But she said that if we are doing an SEM thereafter, the regression just duplicates the SEM as well. So her thinking is to do a correlation and thereafter an SEM. Does this make sense? I am in the social sciences.
When providing an economic impact study for a particular project, say an architectural project for a building, there are certain "multipliers" applied to direct expenditure to get numbers for other categories of economic impact. A blog/online article casually showed how this works here: https://higheredstrategy.com/institutional-economic-impact-statements-the-basics/

and gave this example of reported data:
| | $ Millions |
|---|---|
| Provincial GDP Growth over last five years (1971-2010) | 20 000 |
| Total Factor Productivity (x 20%) | 4 000 |
| Domestic as share of total (x 70%) | 2 800 |
| Share of Domestic by Universities (x 40%) | 1 120 |
| University of X's Share of Provincial Research (x 35%) | 392 |
| Research Impact | 392 |
Of course there are more nuances to this, and it depends on the agencies' and consultants' methods. But my questions are: (1) how reliable are these multipliers, (2) how do you explain them to people who really want to know the actual numbers, and (3) if you want things to be more accurate, what methods are there?
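To make the arithmetic explicit, the table is just a chain of percentages multiplied into the headline GDP-growth figure:

```
gdp_growth <- 20000                      # $ millions
gdp_growth * 0.20                        # total factor productivity  -> 4 000
gdp_growth * 0.20 * 0.70                 # domestic share             -> 2 800
gdp_growth * 0.20 * 0.70 * 0.40          # universities' share        -> 1 120
gdp_growth * 0.20 * 0.70 * 0.40 * 0.35   # University of X's share    ->   392
```

Because the steps compound multiplicatively, an error in any one multiplier carries straight through to the final figure, which is worth keeping in mind when someone asks how reliable the 392 really is.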
So I know I should probably use MLM, but I'm not. I'm also using SPSS rather than R (please don't kill me). I wanted to just do the omnibus test for the interaction between time (4 time points) and a between-subjects variable (even a couple; I actually have a three-way interaction), so in SPSS I did a repeated-measures ANOVA, added the continuous variable as a covariate, and then manually defined the model with the covariates as factors (and their interaction).

The within-subjects effects in the results looked good. But am I doing something wrong here?
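If you ever do want the MLM counterpart, the same omnibus interaction test can be run as a mixed model on long-format data; a sketch in R with placeholder variable names (the model itself also translates to SPSS MIXED):

```
library(lme4)
library(lmerTest)   # adds F tests / p-values for lmer fits

# outcome measured at 4 time points, nested within participants (id);
# x1 and x2 are the continuous between-subject variables
fit <- lmer(outcome ~ time * x1 * x2 + (1 | id), data = long_data)
anova(fit)          # omnibus tests, including the time-by-covariate interactions
```

Note that a mixed model and a repeated-measures ANOVA handle the error structure differently, so the F values will not match exactly.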
I've been interested in data science/analytics for a few years now. I always thought that it would be a good major due to how interdisciplinary it is, since I'm interested in a lot of topics. Only problem is, I've never really had an interest in computer science. Now that I'm a DA major, I can say that while coding is kinda fun, I'm not really sure if it's what I want to do for the rest of my life. I'm only a freshman, so maybe it's too early to say that, but the DA major leaves me with almost no room for electives or anything else. There's even a non-zero chance I don't graduate in 4 years, unless I want to spend $10,000+ on summer semesters.
I really want to study other subjects as well, particularly geography and demography, but with a DA major, this almost isn't possible. If I switched to statistics, I'd be able to add a second major in geography.
I am a bit worried, however, that as the tech sector grows, the outlook for pure statistics will go down, and anyone without a CS background will be unable to find a job. I would be doing GIS with my geography degree though, which requires basic coding skills. So it's not like I'd be totally inept at coding. Sorry if this isn't the right subreddit for this question. Just wondering if a statistics degree will hurt my chances at finding a job compared to data analytics.
(And I really like math, so I'm definitely not only doing DA for the money)
I am trying to calculate reliability for a new product my company is developing. I have previously used a binomial equation with a Weibull distribution to determine that we needed to test X samples for Y hours each to reach Z reliability for the product running over its intended lifetime. However, because the samples are largely field tests, they will not all start at once, meaning each sample will reach Y hours at a different time. I'm trying to determine whether I can modify the equation to account for that.
For example, if I need to test 10 samples for 20 hours to reach a satisfactory reliability, could I instead get the same result with 3 samples at 50 hours, 5 samples at 20 hours, and 2 samples at 10 hours? It seems plausible to me, but I don't know how to mathematically account for that.
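If the plan is the usual zero-failure (success-run) demonstration with a Weibull time transformation, a common adjustment is to count each unit tested for t hours as (t / t_mission)^beta mission-equivalent units, so different test durations can be mixed. A sketch under that assumption; beta and the targets below are placeholders, not your actual values:

```
beta <- 2.0        # assumed Weibull shape parameter (from prior knowledge)
t_m  <- 20         # mission / intended lifetime, in hours
R    <- 0.90       # reliability target at t_m

n_eq <- function(hours) sum((hours / t_m)^beta)   # mission-equivalent units

# Original plan: 10 units each tested for 20 hours, zero failures allowed
1 - R^n_eq(rep(20, 10))

# Alternative: 3 units at 50 h, 5 at 20 h, 2 at 10 h, zero failures allowed
1 - R^n_eq(c(rep(50, 3), rep(20, 5), rep(10, 2)))
```

The two plans are comparable only through the demonstrated confidence they yield, and the answer hinges entirely on the assumed shape parameter, so it is worth checking how sensitive the comparison is to beta.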
In my problem, each of my samples is represented by a weighted sample mean (in my case, the weights sum to 1). I want to understand the degree to which each sample deviates from the population of weighted sample means.
When my sample means are unweighted, I understand I can generate a z-score from a one-sample z-test. However, I'm not sure how to go about modifying my problem when my samples are represented by a weighted average, and I wasn't able to find a straight answer online.
Is it sufficient to recalculate all my sample means with an effective sample size (of one?) and generate a z-score from that? Are there other statistical tests that make more sense?
Any help is very much appreciated
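If the observations within each sample are independent with a common standard deviation and the weights sum to 1, the weighted mean has variance sigma^2 * sum(w_i^2), so the one-sample z just replaces sigma/sqrt(n) with sigma * sqrt(sum(w_i^2)). Equivalently, sum(w_i^2) = 1/n_eff, where n_eff is the Kish effective sample size. A sketch under those assumptions:

```
# z-score for one weighted sample mean against a reference mean mu0,
# assuming independent observations with common sd 'sigma' and weights summing to 1
weighted_z <- function(x, w, mu0, sigma) {
  xbar_w <- sum(w * x)
  se_w   <- sigma * sqrt(sum(w^2))   # Var(sum w_i x_i) = sigma^2 * sum(w_i^2)
  (xbar_w - mu0) / se_w
}

# Kish effective sample size: the unweighted formula with n replaced by n_eff
n_eff <- function(w) 1 / sum(w^2)
```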
As I progress further into my statistics major, I have realized how important regression, ANOVA, and logistic regression are in the world of statistics. Maybe it's just because my department places heavy emphasis on these, but is there ever an application for hypothesis testing that isn't covered by these three methods?
I am working on a project where I need to compare percent success rates between two independent groups, A and B. I'm counting the number of successes and failures in groups A and B. I want to test whether the difference in the proportion of successes between A and B is significant.
For example if both A and B have a sample size of 10 but A has 6 successes and B has 4, I want to compare the 60% and 40% proportions.
Which statistical test do I use for this? T-test, Chi-square, or ANOVA? I think I should be using Chi-square but just wanted to verify.
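Chi-square is the right family here: with two independent groups and a binary outcome, the two-proportion test and the 2x2 chi-square test are equivalent, and with only 10 per group an exact test is safer. For example:

```
# Two-sample test of proportions (equivalent to a 2x2 chi-square test)
prop.test(x = c(6, 4), n = c(10, 10), correct = FALSE)

# Exact alternative for small samples
fisher.test(matrix(c(6, 4, 4, 6), nrow = 2))   # rows = groups, cols = success/failure
```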
It's late and I sort of hit the end of my analysis, and I'm postponing the writing part. So I'm tinkering a bit while being distracted and suddenly found myself evaluating the importance of predictors based on the loss in AUC score.

I have a logit model: log(p/(1-p)) ~ X1 + X2 + X3 + X4 ... X30. N is in the millions, so all Xs are significant and model fit is debatable (this is why I am not looking forward to the writing part). If I use the full model I get an AUC of 0.78. If I then remove an X I get a lower AUC; the drop in AUC should be large if the predictor is important, or at least has a relatively large impact on the predictive success of the model. For example, removing X1 gives AUC = 0.70 and removing X2 gives AUC = 0.68. The negative impact of removing X2 is greater than that of removing X1, therefore X2 has more predictive power than X1.

Would you agree? Is this a valid way to rank predictors on their relevance? Any articles on this? Or should I go to bed? ;)
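This is essentially a drop-one (leave-one-covariate-out) importance measure, and it is a reasonable way to rank predictors by their incremental contribution to discrimination, with two caveats: you need to refit the model after removing each predictor (correlated predictors will partly absorb the dropped one's contribution), and an in-sample AUC will flatter the full model, so a held-out split is more defensible. A sketch assuming the pROC package, with placeholder data and variable names:

```
library(pROC)

full    <- glm(y ~ ., data = dat, family = binomial)        # all 30 predictors
reduced <- glm(y ~ . - X2, data = dat, family = binomial)   # drop one predictor

auc(roc(dat$y, fitted(full),    quiet = TRUE))
auc(roc(dat$y, fitted(reduced), quiet = TRUE))              # compare the drop
```

Permutation importance (shuffle one column, recompute the AUC) is a closely related alternative that avoids refitting and is well covered in the variable-importance literature.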
I apologize in advance for my English as I am not a native speaker.

This is the scenario: there are three classrooms of students in different grades. I want to analyze which of these variables influenced the likelihood of scoring 0 on a test. The factors are grade (8, 9, or 10) and stress level on a scale from 0 to 3. I want to compare those factors statistically for the students who scored 0. How can I do this? I have Excel and Jamovi (an R-based program).

I want to make at most two graphs and, most importantly, find out whether there is a significant difference in students scoring 0 between stress levels and grades, i.e. within the 0-scoring subset. This is an analogy I came up with for a project I am working on but am not allowed to share, but it is the same statistical issue I am having.

My columns look like this: test score (continuous decimal variable): a list of scores. Stress level (ordinal variable): 0, 1, 2, or 3. Grade (ordinal): 8, 9, or 10.
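If the data include the students who did not score 0 as well, one way to frame this is a logistic regression of the 0-vs-not outcome on stress level and grade; if you truly only have the 0 scorers, all you can do is describe and compare their stress/grade distribution. A sketch of the first framing in R (Jamovi can fit the same model through its menus); column names are placeholders:

```
dat$scored_zero <- as.integer(dat$test_score == 0)

fit <- glm(scored_zero ~ factor(stress) + factor(grade),
           data = dat, family = binomial)
summary(fit)

# Coarser check: association between scoring 0 and stress level
chisq.test(table(dat$scored_zero, dat$stress))
```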
I am comparing the sex ratios and abundance at my site to the environmental measures present: temperature, salinity, and slope. I want to see how these factors affect the study organism's sex ratio and abundance. What would be the best way to compare these? My current idea is three ANOVA tests, but I don't think this is right: one comparing the male % to the three factors, one comparing the female % to the three factors, and one comparing total abundance to the three factors. I think ANOVA suits abundance, but I'm uncertain about sex ratio, since male % and female % are linked dependent variables. Any help appreciated!
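Since temperature, salinity, and slope are continuous, a regression (GLM) framing may suit this better than ANOVA, and modelling the sex ratio directly as counts of males vs females avoids treating male % and female % as two separate, linked outcomes. A sketch with placeholder column names:

```
# Sex ratio: binomial GLM on male/female counts per site or sample
fit_ratio <- glm(cbind(males, females) ~ temperature + salinity + slope,
                 data = sites, family = binomial)

# Abundance: Poisson (or negative binomial, if overdispersed) GLM on counts
fit_abund <- glm(abundance ~ temperature + salinity + slope,
                 data = sites, family = poisson)

summary(fit_ratio)
summary(fit_abund)
```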
Hello, first time posting here. I'm traveling to Florida for a week in April and I'm trying to calculate the probability of seeing at least one rocket launch from Cape Canaveral. I'm only considering SpaceX launches for this exercise, and I've downloaded the list of their 2024 launches from Cape Canaveral. I've calculated the n-1 times between each of the n launches and put them in a table. Assuming that each instance of 'time between launches' T (including repetitions) is equally likely, the probability of me seeing a launch is min(1, D/T): if my stay is longer than the time between launches, the probability is clearly one; if my stay is shorter, then it's D/T, i.e. if the time between launches is 10 days and I'm staying for 5 days, the probability is 50%. Repeating this for all the n-1 'times between launches' and averaging them gives me my overall probability.
How am I doing?
Edit: I think I also need to weight the longer time windows more, because I effectively have more chances of landing in one of those.
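That weighting can be folded in directly: a uniformly random arrival date lands in a gap of length T with probability proportional to T, and then sees a launch within the stay with probability min(1, D/T), so the overall probability is sum(min(T, D)) / sum(T). A sketch with made-up gap lengths (not the actual 2024 SpaceX cadence):

```
gaps <- c(3, 7, 2, 12, 5, 9, 4)   # days between consecutive launches (made up)
D    <- 7                         # length of stay, in days

sum(pmin(gaps, D)) / sum(gaps)    # length-weighted probability of seeing a launch
```

The unweighted average answers a slightly different question ("pick a random gap"), which is why the instinct in the edit is the right one.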
Hello. I am almost at my wit's end and would like to ask for help. I am working with panel data, with regions as the panels and some gaps in the series. The gaps come from a region splitting into three: the whole region has data for 1980-2000, but after the split, the data for the three new regions only picks up where the original, bigger region's data ended.

I am planning on doing multiple imputation (MI) for the three regions' data for the years 1980-2000. In the past, I would just fill in missing values with the median; this time, though, the number of missing years is too large. I was hoping it could be imputed from the original series (1980-2000), since after all they were once a single region. Could you please help me determine whether this is MNAR or MAR? TIA!
If I have a significance level of alpha = 0.05, do I use 0.05 as the alpha when choosing the critical value, or alpha/2, since the test is two-sided? Most sources I look at say that I should use 0.05 since the F-distribution is one-sided, but how can that be? Isn't the fact that the test is two-sided or one-sided completely neglected in that case?
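It depends on which F test this is. For the F test in ANOVA or regression, the whole alpha goes in the upper tail because the F statistic is already a squared, direction-free quantity, so the two-sidedness of the alternative is built in; with one numerator degree of freedom it is exactly the square of the two-sided t critical value. For a variance-ratio F test against a two-sided alternative, you put alpha/2 in each tail (which is what var.test does). A quick check of both statements, with hypothetical degrees of freedom:

```
alpha <- 0.05; df <- 20

# ANOVA/regression-style F test with 1 numerator df: full alpha in the upper tail
qf(1 - alpha, 1, df)          # about 4.35
qt(1 - alpha / 2, df)^2       # same value: F = t^2, two-sidedness is built in

# Two-sided variance-ratio test: alpha/2 in each tail of the F distribution
df1 <- 12; df2 <- 15
qf(alpha / 2, df1, df2)       # lower critical value
qf(1 - alpha / 2, df1, df2)   # upper critical value
```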