/r/statistics


/r/Statistics is going dark from June 12-14th as an act of protest against Reddit's treatment of 3rd party app developers.

This community will not grant access requests during the protest. Please do not message asking to be added to the subreddit.


Guidelines:

  1. All Posts Require One of the Following Tags in the Post Title! If you do not flag your post, automoderator will delete it:

Tag            Abbreviation
[Research]     [R]
[Software]     [S]
[Question]     [Q]
[Discussion]   [D]
[Education]    [E]
[Career]       [C]
[Meta]         [M]
  • This is not a subreddit for homework questions. They will be swiftly removed, so don't waste your time! Please kindly post those over at: r/homeworkhelp. Thank you.

  • Please try to keep submissions on topic and of high quality.

  • Just because it has a statistic in it doesn't make it statistics.

  • Memes and image macros are not acceptable forms of content.

  • Self posts with throwaway accounts will be deleted by AutoModerator


    Useful resources for learning R:

  • r-bloggers - blog aggregator with statistics articles generally done with R software.

  • Quick-R - great R reference site.


  • Related Software Links:

  • R

  • R Studio

  • SAS

  • Stata

  • EViews

  • JMP

  • SPSS

  • Minitab


  • Advice for applying to grad school:

  • Submission 1


  • Advice for undergrads:

  • Submission 1



    564,266 Subscribers

    1

    [Q] Best way to measure comparability between 20 different measurements.

    Hello all, we have analyzers that measure blood hemoglobin. We have 24 of them. Each year we have to do a study to ensure that the values from each instrument are close to each other. For this we measure the same 10 samples on each analyzer 10 times. What is the best way to show that these values are statistically similar? R? R2? SDI? Thank you

    0 Comments
    2024/05/03
    14:13 UTC

    3

    Theory of Point Estimation after Casella and Berger [Q]

    Hello, I’m looking for follow-up books to statistical theory coursework based on Casella and Berger's Statistical Inference. I took a two-semester sequence in probability and statistical inference using Casella and Berger, and I’m looking for a follow-up book to learn more of the theory. The theory in Casella and Berger is good, but I think it was very surface level.

    What would you recommend next? Asymptotic Statistics by van der Vaart or Theory of Point Estimation?

    I’m considering either of those two. My math background is real analysis, linear algebra, and calculus 1-3. I have not taken measure theory.

    0 Comments
    2024/05/03
    13:50 UTC

    4

    [C] Recruiters prefer undergrads with engineering degree over those with stats degree for DA roles.

    I noticed this is the case (at least in my country). I am majoring in Statistics at a low-ranking university. It seems like even getting an internship is impossible. What advice can you give me to stick out from the rest?

    5 Comments
    2024/05/03
    12:11 UTC

    3

    [Q] How to select confounding variables

    I’m doing an analysis on the impacts of bullying on student achievement using the PISA 2022 data. As so many variables impact student learning outcomes I’m really struggling to figure out how to choose appropriate controls for my analysis. Any advice would be greatly appreciated!

    4 Comments
    2024/05/03
    10:10 UTC

    1

    [Q] What is the difference between cumulative and compound returns?

    Hello,

    I am analysing the annual returns of three portfolios and used the formula (1 + r1) × (1 + r2) − 1 to calculate the compound annual returns. However, I have seen many papers refer to these returns as cumulative returns.

    What is the difference between the two? What is the correct terminology?

    Aren't cumulative returns just a simple sum (r1 + r2) of the returns?

    Thank you very much for your help.
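
    The distinction asked about above can be sketched in a few lines. This assumes the common convention that "cumulative return" means the compounded total over the whole period and "annualized" (compound annual) return means the constant yearly rate that reproduces it. Usage varies between fields, so treat the function names and the sample returns as illustrative assumptions rather than settled terminology.

```python
from math import prod

def cumulative_return(returns):
    """Compounded total growth over all periods: (1+r1)(1+r2)...(1+rn) - 1."""
    return prod(1 + r for r in returns) - 1

def annualized_return(returns):
    """Constant per-period rate that reproduces the cumulative return."""
    n = len(returns)
    return (1 + cumulative_return(returns)) ** (1 / n) - 1

annual = [0.10, 0.20]  # hypothetical annual returns, not from the post

print(cumulative_return(annual))  # compounded total over the two years
print(sum(annual))                # simple sum, which ignores compounding
print(annualized_return(annual))  # geometric average rate per year
```

    Under this convention the simple sum r1 + r2 only approximates the cumulative return when the returns are small, or when they are log returns.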

    0 Comments
    2024/05/03
    09:44 UTC

    0

    [Q] How to calculate Cronbach's alpha in Sheets???

    I've been watching numerous videos on how to calculate Cronbach's alpha, but I keep getting an answer greater than 1??? Here is the video I watched:

    https://www.youtube.com/watch?v=Hgf22LMcOHc&t=367s

    formula = n/(n-1) * (var.s of the sum of data - var.s of the sum of all individual data variance)/var.s of the sum of data

    what am I doing wrong?
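
    For comparison, here is a minimal sketch of the standard Cronbach's alpha formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score), where k is the number of items rather than the number of respondents. The data and the function name are hypothetical. With one variance formula used consistently throughout, this quantity cannot exceed 1, so a result above 1 most likely means a different quantity ended up in the numerator.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of scores per questionnaire item (columns)."""
    k = len(items)                                   # number of items
    item_vars = [variance(col) for col in items]     # sample variance of each item
    totals = [sum(row) for row in zip(*items)]       # total score per respondent
    total_var = variance(totals)                     # sample variance of the totals
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical data: 3 items answered by 5 respondents.
items = [
    [3, 4, 5, 2, 4],
    [2, 4, 5, 3, 4],
    [3, 5, 4, 2, 5],
]
print(cronbach_alpha(items))
```

    In a spreadsheet the same steps would be: one VAR.S per item column, a row-sum column, one VAR.S over that row-sum column, then the formula above.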

    2 Comments
    2024/05/03
    07:14 UTC

    2

    [Q] Prerequisites for Probability and Random Processes by Grimmett?

    Hello, I really want to read Probability and Random Processes because it seems to be one of the best books for understanding diffusion processes. So far I have studied

    Linear Algebra (abstract, proof based version -> Linear Algebra Done Right and Linear Algebra by Friedberg et al)

    Calculus

    Introductory Probability (Harvard's introduction to probability and statistics by prof. Blitzstein)

    But do I need an understanding of measure theory, real analysis, complex analysis, etc. to understand this book?

    Also, could you recommend good books on diffusion processes? Thank you!

    1 Comment
    2024/05/03
    03:56 UTC

    1

    [E] More Theory vs. Applications subjects

    Would you rather choose a degree program (Bachelor or Master) that is more advanced theory-focused (more subjects in the theory than practical stuff) then learn the practical stuff on your own (or on the job) or a degree program that is more applied then learn the theory on your own? Which is harder to do? Which of the options is more beneficial?

    For example, your goal is to become a data scientist and work in industry. Choosing the more applied route seems to be the obvious choice. And the theory-focused if your goal is to do a Ph.D and become a researcher. However, isn't choosing the theory-focused option also beneficial in the long run regardless of what you plan to do after graduation? Since having a very solid grasp of the theory (say, mathematics of machine learning, statistical theory of deep learning, optimization for deep learning) will help you to advance your career faster, not to mention if you eventually opt to pursue advanced degrees in the future either for promotion at work or to enter academia?

    In a more personal context, I'm trying to decide whether to apply for the Master in Mathematics and Economic Decision or Master in Data Science for the Social Sciences (both in the Toulouse School of Economics). Yes, both are indeed under Applied Mathematics but the first one is more theory-focused (and optimal for those who intend to do research in applied mathematics) and the second has more subjects that will train you on the more practical stuff (e.g. data mining, statistical consulting, risk analysis).

    Of course, in neither option would you do exclusively theory subjects or exclusively practical subjects. It's more a question of which I should prioritize in formally studying. Any thoughts?

    0 Comments
    2024/05/02
    23:40 UTC

    4

    [E] Any statistical model for decision making book?

    As the title says, i want to learn more about that.

    3 Comments
    2024/05/02
    23:27 UTC

    2

    [Q] Time Series Analysis

    Hello everyone,

    I have a dataset consisting of social media comments on a platform from 2001 to 2024.

    The comments were annotated into 5 thematic categories. I want to test whether, for example, the proportional increase of category 4 over time is significantly higher than that of category 2. Or perhaps I can compare the trends of each category. In such a context, what statistical test would you suggest?

    Thank you!

    5 Comments
    2024/05/02
    23:24 UTC

    2

    [Question] Significance and Increases in Means Between Pre/Post Scores of Different Groups

    Hi Math People! I am trying to finish up my masters thesis and cannot figure out how to calculate whether a change is statistically significant or not. I have a pre score value and a post score value for every participant. Additionally, every participant is either part of group 1 or 2. I know that group one had a 15% increase between their mean pre score and their mean post score as a group. Group two had a 20% increase between their group’s pre and post mean scores. How do I know whether their average score increase is statistically significant between the two groups? Many thanks!!

    1 Comment
    2024/05/02
    21:48 UTC

    2

    [Question] Correlation and statistical outliers

    Hello Math-Wizards!

    I am working on my Bachelor in Psychology and I am analysing my collected data right now. Two weeks ago I released a survey with a short intelligence test (HMT-scores), a creativity measure (CAQ-scores) and a question about the regularity of creative activities (RAC-scores). I recoded the variables in JASP and I also added up all the outcomes of the different subsections of the tests (for example if someone got 3 correct answers on the intelligence test their score is 3, if they got 4 answers correct their score is 4 and so on). Now I have three variables - the HMT-sumscore, the CAQ-sumscore and the RAC-sumscore.

    I started to analyse the data (n=627) with pearson correlations and I found a small but positive correlation of 0.12 between CAQ and HMT - this was expected because of the already available theory on this topic.

    But my problem is the RAC-HMT correlation. It was a lot lower than expected with an r of only 0.06. I looked at the descriptive statistics of the RAC score and I found 1 extreme outlier.

    If I remove this outlier, my correlation is up to r=0.085 and JASP flags it as a significant correlation- which was not the case before.

    Now my question: After all I have learned in my course at university I feel extremely uncomfortable just removing a data point. It would feel like I only removed it because I get a better outcome for my research this way. But I read up a little bit on correlations and apparently they are quite susceptible to outliers. So I also don't want to report a statistic that actually has an effect as a null finding.

    How do I go about this the right way? Do I report the full dataset, mention the outlier and remove it (is there a test I can do for that or a paper I can cite?) and then continue to analyse my data without the outlier? Or is there another way?

    I struggle a lot with statistics so I feel quite unsure about this situation.

    If someone could help me out that would save me from the mental breakdown I am having right now sitting over this dataset xD

    EDIT: Here are the values I found!

    RAC-Sumscore (Valid n=626, Missing n=1)

    Mean 17.075

    Std.Deviation 21.351

    Min. 0

    Max 145

    And some points from the frequency table:

    0 points -> frequency of 79 (Cumulative Percent 12.620)

    1 point -> frequency of 56 (Cumulative Percent 21.565)

    ...

    51 points -> frequency of 1 (Cumulative Percent 90.895)

    ...

    100 points -> frequency of 1 (Cumulative Percent 99.521)

    ...

    145 points -> frequency of 1 (Cumulative Percent 100)

    10 Comments
    2024/05/02
    18:52 UTC

    3

    [D] Do better teams win more often in a best of 7 than a single game?

    In any given sport, people seem to think longer series “remove luck” and usually lead to the “better team” winning more often than a single-elimination style. Assuming Team A has a 60% chance of winning any given game against Team B, what is the likelihood they would beat team B in a best of 7 series? (In other words, if they were to play 7 games, what is the likelihood that team A wins 4 or more?) If it doesn’t increase odds, is it, then, safe to say that longer series aren’t any less “random” than a single elimination game?
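
    The question above is a binomial calculation and can be sketched directly. This assumes games are independent with a fixed per-game win probability, as the post states, and uses the fact that winning a best-of-7 has the same probability as winning at least 4 of 7 games if all 7 were played. The function name is illustrative.

```python
from math import comb

def series_win_probability(p, games=7):
    """P(win a majority of `games` independent games, each won with probability p)."""
    need = games // 2 + 1
    return sum(
        comb(games, k) * p**k * (1 - p) ** (games - k)
        for k in range(need, games + 1)
    )

print(series_win_probability(0.6, games=1))  # single game
print(series_win_probability(0.6, games=7))  # best of 7, roughly 0.71
```

    Under these assumptions the better team's chance rises from 60% in a single game to about 71% in a best of 7, so a longer series does reduce the role of luck, though far from eliminating it.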

    18 Comments
    2024/05/02
    17:57 UTC

    0

    [Question] Say a test was taken by people from various countries. What’s the proper ratio to get “US fail rate”?

    Say a test was taken by people from various countries. If I wanted to get the “US fail rate”, what would I calculate?

    (Amount of US fails) / (Amount of US exam takers)

    or (Amount of US fails) / (Amount of total exam takers)

    or (Amount of US fails) / (Amount of total fails)

    3 Comments
    2024/05/02
    17:32 UTC

    2

    [Q] Math Electives

    I am a Mathematical Statistics undergraduate. I get exposed to a lot of math courses but the majority of them are Statistics courses. I have a number of elective options for mathematics.

    The courses I have already taken are: Calculus 1, Calculus 2, Calculus 3, Linear Algebra 1, Discrete Mathematics, Probability

    I am considering taking: Combinatorics 1, Numerical Analysis 1, Real Analysis 2 [I will be taking Real Analysis 1 as it is part of my program]

    Do any of you recommend any math courses that will benefit me once I go to graduate school or my overall statistics knowledge? Would linear algebra 2 be worth while? Optimization? Complex Analysis? Differential Equations?

    Things like stochastic processes and inference are considered Stats courses at my university and I will be taking these regardless.

    Let me know!

    3 Comments
    2024/05/02
    14:57 UTC

    2

    [Question] on bootstrapping with replacement

    Hello folks; I’m trying to understand bootstrap sampling with replacement. An example which I understand: given a draw with 5 observations from 5 categories, ABCDE, the sample statistic is first calculated. Next, for the subsequent draw of 5 observations, repetition of categories is allowed, e.g. ABBDE, and the sample statistic is calculated again. My question is this: is there a limit on the amount of repetition per unique category in each draw? E.g. is AAAAA permitted? I would assume that such draws would severely distort the distribution. Or is the replacement limited so that only one additional copy (e.g. AA) is allowed?
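
    A minimal sketch of the ordinary nonparametric bootstrap may help here. In that procedure every resample is drawn independently with replacement from the observed data, with no cap on repeats, so a draw like AAAAA is allowed but rare. The data, seed, and function name below are hypothetical.

```python
import random

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Mean of each resample drawn with replacement from `data`."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=n)  # with replacement, repeats unrestricted
        means.append(sum(resample) / n)
    return means

data = [2.0, 4.0, 5.0, 7.0, 9.0]  # hypothetical sample of 5 observations
means = bootstrap_means(data)

# Chance that a single resample consists of one observation repeated 5 times.
p_all_same = len(data) * (1 / len(data)) ** len(data)
print(p_all_same)
```

    Such extreme resamples occur with small probability (here 5 × (1/5)^5), so they contribute little to the bootstrap distribution instead of distorting it.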

    5 Comments
    2024/05/02
    14:35 UTC

    4

    [Q] Can I apply the Wilcoxon-Mann-Whitney test even when the two sample sizes are equal?

    I know it is primarily used for unequal sample sizes, but can we use it on equal sizes too? Is it applicable in that case? Please cite a source regarding this if you have one.

    3 Comments
    2024/05/02
    14:29 UTC

    8

    [Question] Is continuous data continuous if it is measured to an arbitrary decimal place?

    Continuous data is described as having an infinite number of possible values; I got examples like height and mass from my course. However, if, for example, height can only be measured with something like a tape measure (in m) which is only capable of measuring to the nearest 3 d.p., doesn't that mean the data is discrete, since it has to be a value with 3 d.p.?

    19 Comments
    2024/05/02
    11:43 UTC

    2

    [Q] Is there a way to see if a sampled dataset is representative of the total population, without actually having the complete population-level data?

    I'm conducting my master's dissertation on if it's possible to assess the welfare of a wild animal population, using welfare data from just a sample of observed individuals within that population. I've tried a few things but have made little headway since a lot of stats tests require at least some data to 'fill in the gaps', which runs counter to my intention for this model. Does anyone have any suggestions for this? Thank you so much for your time.

    6 Comments
    2024/05/02
    11:28 UTC

    2

    [Q] Comparison between pmf and pdf on a plot

    I ran into a basic problem with pdfs and pmfs.

    I performed a grid approximation on a Bayesian problem.

    After normalizing the vector, I got a discretized pmf.

    Then I wanted to draw the pmf and the pdf on one plot to check whether they are similar distributions.

    However, the pmf doesn’t resemble the pdf at all.

    The instructions told me that I need to sample from my pmf and then draw a histogram with density for comparison. It really works, but

    why can’t I directly compare them?
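
    The scale mismatch described above can be sketched as follows. A pmf on a grid sums to 1, so its heights shrink as the grid gets finer, while a pdf integrates to 1. Dividing each pmf value by the grid spacing puts both on the same scale. The Beta(3, 5) target is only a hypothetical stand-in for a posterior, and all names are illustrative.

```python
from math import gamma

a, b = 3, 5            # hypothetical Beta(3, 5) target standing in for a posterior
n_grid = 200
dx = 1 / n_grid
grid = [(i + 0.5) * dx for i in range(n_grid)]   # midpoints of the grid cells

unnormalized = [t ** (a - 1) * (1 - t) ** (b - 1) for t in grid]
total = sum(unnormalized)
pmf = [u / total for u in unnormalized]          # sums to 1; heights depend on n_grid

density = [p / dx for p in pmf]                  # probability per unit length

def beta_pdf(t):
    """Exact Beta(a, b) density for comparison."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * t ** (a - 1) * (1 - t) ** (b - 1)

max_gap = max(abs(d - beta_pdf(t)) for d, t in zip(density, grid))
print(max(pmf), max(density), max_gap)
```

    A density histogram of samples from the pmf performs the same rescaling implicitly, which would explain why that comparison works.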

    7 Comments
    2024/05/02
    09:49 UTC

    1

    [Q] Python logistic regression question

    Hi. I’ve just started learning about logistic regression and I came across grid search.

    From what I understand, based on Python’s sklearn.linear_model.LogisticRegression(), the C parameter refers to the inverse of regularization strength. This means a greater C value gives less regularization, which increases accuracy on the training data and can decrease accuracy on the test data.

    Regularization is used to calibrate the graph in a way that it doesn’t fit the train data to a great extent so it can fit the test data equally (prevent overfitting). I know that there are two types of regularization methods: Ridge and Lasso and the calculations for both are different.

    Hence, my question is what type of regularization does the parameter C refer to? I apologize if I’ve made some theoretical errors as I’ve just started exploring this topic. Thanks in advance.

    Edit: Realized that there’s a penalty parameter within the LogisticRegression().fit() that states l2 regularization… 😓. Thanks for the help!

    2 Comments
    2024/05/02
    08:27 UTC

    12

    [Q] What are the odds of 1 person winning 3 of 5 bingo games out of 80 cards per game?

    Suspected cheating / scam at a game tonight. Almost everyone left angry and suspicious. Just curious of the odds
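
    A rough binomial sketch of these odds, under assumptions the post does not state: each of the 80 cards belongs to a different player, every card is equally likely to win, each game has exactly one winner, and games are independent. The function and variable names are illustrative.

```python
from math import comb

def prob_at_least(k_min, n_games, p_win):
    """P(at least k_min wins in n_games independent games)."""
    return sum(
        comb(n_games, k) * p_win**k * (1 - p_win) ** (n_games - k)
        for k in range(k_min, n_games + 1)
    )

p_single = 1 / 80                       # assumed chance a given card wins one game
p_named = prob_at_least(3, 5, p_single)
print(p_named)                          # a specific, pre-chosen player wins 3 or more

# With one winner per game, at most one player can win 3 or more of 5 games,
# so the events for different players are mutually exclusive and simply add up.
p_someone = 80 * p_named
print(p_someone)
```

    The two numbers answer different questions: the first is for a player picked in advance, the second for anyone in the room doing it, which is usually the more relevant one when judging suspicion after the fact.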

    37 Comments
    2024/05/02
    03:46 UTC

    14

    [Question] How to find evidence that a p-value is off

    I recently read a paper that just gives off really strong vibes of fabricated/falsified data. One of the red flags was the number of p-values of <0.00001 (yes, that many zeroes) for correlations of around 0.6 to 0.8 in a sample of n=150. All correlations that would be expected have p-values that low, and then there are more realistic-looking p-values for correlations that would not be expected or where there would be no strong a priori hypothesis. I'm not sure I care enough to ask the author for the original data and examine it, but I'm trying to think through conceptually whether there's something in the reported numbers alone.
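
    One way to sanity-check reported numbers like these is to compute the p-value implied by r and n. The sketch below uses the Fisher z approximation (a normal approximation, not the exact t-based test) so that it needs only the standard library. The function name is illustrative.

```python
from math import atanh, erfc, sqrt

def correlation_p_value(r, n):
    """Approximate two-sided p-value for H0: rho = 0, using Fisher's z transform."""
    z = atanh(r) * sqrt(n - 3)       # approximately standard normal under H0
    return erfc(abs(z) / sqrt(2))    # two-sided tail probability

for r in (0.2, 0.6, 0.8):
    print(r, correlation_p_value(r, 150))
```

    Under this approximation, correlations of 0.6 to 0.8 with n = 150 imply p-values many orders of magnitude below 0.00001, so very small p-values for those correlations are expected and are not evidence of fabrication on their own.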

    19 Comments
    2024/05/02
    03:09 UTC

    0

    [Q] AP Stats project idea?

    Hello all, I have a project in AP Stats in which I have to answer a question. I am finding it difficult to land on an idea; can anyone help me out?

    I was thinking that maybe I could conduct a survey on Reddit on whether the proportions of males and females who support the Democrats are equal. Do you have any ideas on how I can execute it? For example, how can I find related studies and provide raw data? I would really appreciate it if you could help me out.

    I am attaching what i need to include in the project:

    1. Introduction 10%
       a. Statement of the question you are answering
       b. Population
       c. Parameter of interest
       d. Hypotheses
       e. Background information including related studies

    2. Data Collection 20%
       a. Type of sampling survey or type of experiment
       b. Discussion of possible biases and corrective methodology used.
       c. Details on how survey was administered including randomization steps.
       d. Data collection if an observational study including randomization steps.
       e. Resources used
       f. References of previously done work that is similar to your own.

    3. Data Analysis 20%
       a. Provide raw data in tabular form
       b. Restatement of hypotheses in statements and symbol form
       c. Assumptions and Conditions for your significance test
       d. Significance test calculations and or confidence interval calculations.
       e. Appropriate graphical displays

    4. Conclusion 20%
       a. Results in terms of the hypotheses
       b. Limitations of the study
       c. Ways to improve your project

    4 Comments
    2024/05/01
    23:12 UTC

    2

    [Q] Standard Deviation Increases as Values Get Larger

    Hi,

    I was coding today and noticed that the std for some of my groups was much larger than for others. After checking the formula for std, this is explained by the fact that the std is calculated with a difference in the numerator, not a percentage difference. This means that as the numbers get larger, the std increases even though the distribution isn't really more "spread out". Is there a way of measuring how "spread out" a distribution is? Perhaps skew? Or dividing the std by the median?

    Edit: seems like kurtosis would be a good start
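
    One common scale-free measure of spread is the coefficient of variation, the standard deviation divided by the mean. It is only meaningful for positive, ratio-scale data. Kurtosis, by contrast, describes tail weight rather than relative spread. The sketch below uses hypothetical data and an illustrative function name.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation relative to the mean (requires a positive mean)."""
    return stdev(values) / mean(values)

small = [9.0, 10.0, 11.0]         # hypothetical group with small values
large = [900.0, 1000.0, 1100.0]   # same shape, 100 times larger

print(stdev(small), stdev(large))  # std grows with the scale
print(coefficient_of_variation(small), coefficient_of_variation(large))  # relative spread is equal
```

    Taking logs of the data before computing the std is another option with a similar effect when all values are positive.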

    11 Comments
    2024/05/01
    19:16 UTC

    1

    [Q] Where can I find good stats on digital or app overload?

    I received a bunch of stats from speaking with my LLM that turned out to be false. I've checked Consensus and the broader internet but can't seem to get a good signal on how large a problem digital overload, information overload, or app overload is in the US.

    I'm hoping to find something that says X out of Y Americans suffer from too much digital information or stimulation, but I'm not too well versed in this.

    Does anyone have ideas on where else to look for data or stats relevant to this issue?

    0 Comments
    2024/05/01
    17:59 UTC

    2

    [Q] Is it correct to use ANOVA subgroup analyses and then compare them with Student's t-tests?

    Hi, I'm doing a study plan for an epidemiology class. This is a pre-intervention/post-intervention study in which we will be comparing one outcome (sleep quality) between 3 periods (baseline, post-intervention and follow-up). The plan for the analysis is to do an ANOVA test and then do the post-hoc test (Tukey) if the results are significant. Is it correct to, after that, do multiple ANOVAs for each subgroup that we would be studying (age, sex) and then do Student's t-tests with the results of each subgroup to see if they are statistically different?

    I'm not really good at statistics and would really appreciate all the help.

    2 Comments
    2024/05/01
    17:16 UTC

    9

    [Q] What are the best online resources and courses to learn inferential and descriptive statistics from scratch to college level?

    7 Comments
    2024/05/01
    16:29 UTC

    12

    [E] How do I get started in the field of statistics?

    I'm in my first year of college and I've become interested in becoming a statistician, but I'm not sure where to start from since there's not a statistics major in my local community college. I'm particularly interested in majoring in biostatistics but I've still got a long way before then.

    I'm quite unsure which undergraduate degree to go through with. Should I choose a general math degree or a computer science one? Or should I take a math major with a bio minor?

    27 Comments
    2024/05/01
    12:09 UTC

    1

    [Question] Investigating Correlation and Regression in Repeated-Measures Design

    I'm planning an experiment in which about 70 participants will rate the same 30 images based on two interval-scaled variables. Would it be possible to examine if there's a correlation or regression relationship between these two variables? Can this be applied to repeated-measures designs, and are 30 images a sufficient sample size for this type of analysis?

    0 Comments
    2024/05/01
    11:32 UTC
