/r/statistics


/r/Statistics is going dark from June 12-14th as an act of protest against Reddit's treatment of 3rd party app developers.

This community will not grant access requests during the protest. Please do not message asking to be added to the subreddit.


Guidelines:

  1. All Posts Require One of the Following Tags in the Post Title! If you do not flag your post, automoderator will delete it:

Tag            Abbreviation
[Research]     [R]
[Software]     [S]
[Question]     [Q]
[Discussion]   [D]
[Education]    [E]
[Career]       [C]
[Meta]         [M]
  • This is not a subreddit for homework questions. They will be swiftly removed, so don't waste your time! Please kindly post those over at: r/homeworkhelp. Thank you.

  • Please try to keep submissions on topic and of high quality.

  • Just because it has a statistic in it doesn't make it statistics.

  • Memes and image macros are not acceptable forms of content.

  • Self posts with throwaway accounts will be deleted by AutoModerator


  • Related subreddits:


    Data:


    Useful resources for learning R:

  • r-bloggers - blog aggregator with statistics articles generally done with R software.

  • Quick-R - great R reference site.


  • Related Software Links:

  • R

  • RStudio

  • SAS

  • Stata

  • EViews

  • JMP

  • SPSS

  • Minitab


  • Advice for applying to grad school:

  • Submission 1


  • Advice for undergrads:

  • Submission 1


  • Jobs and Internships

    For grads:

    For undergrads:

    /r/statistics

    587,235 Subscribers

    2

    [Q] I need help creating a ranking system using aggregate scores?

    Hi! I need some help regarding statistics--something I am not very good at. (TLDR: Combining number values with percent values.)

    Context: I have a Google Sheets file that acts as a tracker for my daily creative goals. Each week, I list a number of options that I'd like to get done. Each option has a corresponding checkbox. Whenever I spend time on one of those options, I give said option a checkmark using the checkbox.

    I tally the total number of checkmarks each option receives, and at the end of the year, I like to create a ranking of which option I check off the most times.

    Example of what the tracker looks like

    Ranking System 1
    When I started ranking everything for the first time, I realized that just ranking everything based on how many times they were checked off is a bit unfair. One of the rules of the tracker is that once I finish an option, (ex: finish a show, beat a game, etc.), that option can no longer be checked off, as the option has been finished.

    Because of this, it's not exactly fair if something like this happens:

    1. Option A: 26 Checkmarks (Took 80 days to finish)
    2. Option B: 5 Checkmarks (Took 6 days to finish)

    While Option A had more checks, Option B was not only finished faster, but also received checkmarks with a better consistency. As far as this tracker is concerned, I was more productive in finishing Option B, and yet it ranked lower than Option A.

    My desire to correct this oversight led to the creation of a second ranking system.

    Ranking System 2
    Unlike the previous ranking system, this one is based on the percent rate at which an option was checkmarked, essentially providing the percent-chance that an option could have been selected on any given day. The formula for doing so is handled like this:

    x / y

    x= Number of times option was checkmarked
    y= The number of days in which the options were available to be checkmarked

    If we were to compare it to the previous example, it would look like this:

    Option A: 32.50%
    Option B: 83.33%

    In this case, Option A may have had a higher number of checkmarks, but the rate at which it received them was far lower than Option B.

    On paper, this seemed like a much fairer way to rank things. However, I quickly noticed that something else would happen that ALSO felt unfair.

    Here's an example.

    Option C: 100.00% (Received 4 checkmarks in 4 Days)
    Option D: 85.19% (Received 23 checkmarks in 27 Days)

    Option C received a perfect 100% after earning 4 checkmarks in 4 days. I spent time on Option C for every single one of those days, hence the 100%. It was quick and easy to complete.

    However, Option D is a different story. It earned 23 checkmarks in 27 days. A solid result, but because of the longer amount of time it took to complete it, there were more chances for me to miss a day. Sure enough, I missed 4 days, thus resulting in an 85.19%, (23/27), selection rate.

    You see the problem here? Despite Option C only earning a measly 4 checkmarks, it ranked higher than Option D, a beloved option that earned a remarkable 23 checkmarks. This is ANOTHER oversight, and one I'd like to correct.

    Ranking System 1 rewards Quantity

    Ranking System 2 rewards Quality

    I need a system that rewards BOTH.

    What I need help with:

    Thank you for reading through all this.

    I need help figuring out a Ranking System that combines the previous two into a system that is built on some kind of aggregate score.

    I imagine it's something like multiplying the number and percent values together, but I doubt it's either that simple or that effective.

    If someone could help me find the solution I'm looking for, it would be incredibly appreciated!!
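    (One possible aggregate score, sketched rather than prescribed: shrink each option's rate toward the overall average rate, so a perfect rate earned over only a few days cannot outrank a long, consistent streak. The same formula works as a single cell per row in Google Sheets; below it is in R, using the post's numbers for A and B plus C and D from the later example. The prior weight m is an arbitrary tuning knob.)

        options <- data.frame(
          name   = c("A", "B", "C", "D"),
          checks = c(26, 5, 4, 23),   # times the option was checked off
          days   = c(80, 6, 4, 27)    # days the option was available
        )

        p0 <- sum(options$checks) / sum(options$days)  # overall check rate across all options
        m  <- 10                                       # "phantom days" of average behaviour added to every option

        options$raw_rate <- options$checks / options$days
        options$score    <- (options$checks + m * p0) / (options$days + m)

        options[order(-options$score), ]               # ranking by the blended score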

    0 Comments
    2025/01/18
    20:47 UTC

    1

    [Q] Help request: longitudinal program assessment

    Hi, I’m looking for some advice and (ideally) resources on conducting longitudinal program assessment with rolling treatments and outcomes.

    My project is intended to assess the effectiveness of an educational support program on various outcomes (GPA, number of failing grades, etc.). I had planned to do this with propensity score matching. I have a solid understanding of implementing this as a cross-sectional project.

    However, the program has been offered for several semesters, and I’d like to use all that data in the assessment. In this longitudinal data set, both the treatment (program involvement) and outcomes are time-varying, and I’m struggling to understand how to appropriately set up the data file, apply propensity score matching, and complete the analysis. (Not to mention that students are naturally censored due to graduation, dropping out, etc.)

    I’ve considered creating multiple datasets (one for each semester) and running the propensity analysis by semester, but this seems like the brute-force approach. It also feels like I might be losing statistical power in some way (this is just a feeling, not knowledge), and it increases the chances of errors.

    My asks:

    • Does anyone have recommendations for ways to approach this type of longitudinal program assessment with propensity scores?
    • Are there resources you’re aware of that would be useful (tutorials, guides, exercises, etc.)?
      • I’m doing this work in Stata, but if resources use some analogous program, I might be able to translate.

    Thanks for any help!

    P.S. - If other subreddits are more appropriate for this kind of question/request, I'd appreciate a redirect.
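    (Not an answer to the resource request, just a sketch of the per-semester route in R, since translating from Stata was mentioned. The long person-semester layout and all column names are assumptions.)

        library(MatchIt)

        # One row per student per semester; in_program is the time-varying treatment.
        matched_list <- lapply(split(panel, panel$semester), function(d) {
          m <- matchit(in_program ~ prior_gpa + credits_attempted + pell_eligible,
                       data = d, method = "nearest")
          match.data(m)                       # matched rows plus matching weights for this semester
        })
        matched <- do.call(rbind, matched_list)

        # Pooled outcome model on the matched person-semesters; clustering on student
        # and handling of censoring (graduation, dropout) still need to be added.
        fit <- lm(gpa ~ in_program + factor(semester), data = matched, weights = weights)
        summary(fit)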

    0 Comments
    2025/01/18
    20:12 UTC

    2

    [Q] What's the fairest way to gauge overall performance in a science Olympiad, where teams choose 4/11 possible modules (of varying difficulty)

    Sorry for the verbose title; I couldn't figure out how to explain it any better. I'm part of the managing team of a science contest with 11 different modules. Each participating team chooses 4 modules to participate in. Modules are graded independently with completely different criteria (e.g. the mean score in one module could be 10/60, in another it could be 80/100).

    Ultimately we want a metric for the "best team", regardless of modules. What would be the fairest way to account for the varying "difficulty" and theoretical top scores of all participants?

    As a side note, many (but not all) teams are affiliated with an "institute". Some institutes have more teams than others. We also have an award for the best institute by considering the average performance of all affiliated teams.

    What would be the 'best' way to calculate that, without skewing results based on module difficulty and the number of teams in a given institute? (Would it simply be averaging the above scores for each team?)

    Thank you for any help in advance, if any clarification is needed please let me know in the comments and I'll edit the post accordingly.
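    (One simple, commonly used option, sketched with made-up scores: standardise scores within each module so every module contributes on the same scale, then average each team's z-scores; the institute award can average those team-level scores so institutes aren't rewarded just for fielding more teams. This assumes every module has at least two participating teams.)

        scores <- data.frame(
          team   = c("T1", "T1", "T2", "T2", "T3", "T3"),
          module = c("M1", "M2", "M1", "M3", "M2", "M3"),
          score  = c(12, 85, 9, 40, 70, 55)
        )

        # z-score within each module: removes differences in module difficulty and scale
        scores$z <- ave(scores$score, scores$module,
                        FUN = function(x) (x - mean(x)) / sd(x))

        team_rank <- aggregate(z ~ team, data = scores, FUN = mean)
        team_rank[order(-team_rank$z), ]    # "best team" by average standardised performance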

    2 Comments
    2025/01/18
    08:54 UTC

    2

    [Q] What other courses should I take?

    1. Stat 625: Regression Modeling
    2. Stat 607-608: Probability and Mathematical Statistics I, II
    3. Stat 535: Statistical Computing

    These are the musts for my program. I can also take five courses in other areas of stats, econometrics, biostats, and also machine learning and data science. I kinda feel like I should take data science-type courses to get more coding experience, but I worry I will be lacking in stats knowledge, which is kinda what would differentiate me from a CS degree. What do you all think? Any advice is super appreciated!! Thanks in advance.

    7 Comments
    2025/01/18
    00:52 UTC

    1

    [Q] I wanna get into finance, perhaps quant research. Didn’t do internships as I taught during my masters. Thinking of PhD because I really wanna do it. Two birds, one stone. Thoughts?

    I know for quant trading you need a master's and interview prep, but I wanna get into research.

    Anyone take this path? I’ve talked to some quants and they said it’s a good idea if I wanna do research rather than trading.

    11 Comments
    2025/01/18
    00:29 UTC

    0

    How does one get a job at Posit? [Q]

    Never see them hiring ever. But would seem fun to just work in R all day writing software packages!

    4 Comments
    2025/01/17
    16:30 UTC

    0

    [S] Looking for free/FOSS software to help design experiments that test multiple factors simultaneously - for hobbyist/layman

    Hello all!

    I'm working on making some conductive paint so that I can electroplate little sculptures and stuff I make - just as a hobby/creative outlet. There are recipes out there but I want to play around with creating my own.

    I'm looking for some free software that can help me design experiments that can test the effects of changing multiple ingredients at the same time and also analyze/plot the results. Because this is something I'm just doing for fun I'm looking for something free and also something that doesn't have a huge learning curve because it doesn't make sense to spend so much time learning to use a tool I'll rarely use (so R to me looks like it would be out of the question).

    I know I could use excel and do the experimental design myself, but I figured perhaps people more knowledgeable about this sort of thing might be able to point me towards something better.

    Thanks in advance!
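    (For illustration only, since R was ruled out: the grid below can just as easily be built by hand in a spreadsheet. A 2^3 full factorial over three made-up paint ingredients looks like this; each row is one batch to mix and measure.)

        design <- expand.grid(
          graphite_g = c(5, 10),    # low / high level of each factor (made-up values)
          binder_ml  = c(20, 40),
          thinner_ml = c(5, 15)
        )
        design <- design[sample(nrow(design)), ]   # randomise the run order
        design$conductivity <- NA                  # fill in measured results after each run
        design

        # After the runs, main effects can be eyeballed by comparing the mean response
        # at each factor's high vs low level, e.g.:
        # aggregate(conductivity ~ graphite_g, data = design, FUN = mean)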

    1 Comment
    2025/01/17
    12:58 UTC

    2

    [Q][E] I have a statistics final exam next Tuesday and I want to get the full mark, any tips?

    I have just never got the full mark in statistics and I feel scared. My course is about parametric and non-parametric tests, and during the test I feel confused and like my brain gets stuck. Any tips that helped you in exams?

    3 Comments
    2025/01/17
    09:57 UTC

    1

    [Q] test if a measured value significantly differs from expected norms without a control group?

    Hi all,

    I have a group of patients with specific characteristics, and I’ve observed that a value I measured (let’s say heart rate) seems to be lower than expected for most of the subjects. I’d like to determine if this difference is statistically significant. The challenge is that I don’t have a direct control group. However, I do have two potential comparison options:

    1. Predicted values for each patient: For each patient, I have a predicted "norm" heart rate. My measured heart rates are around 80-90% of these predictions for most patients. Is there a statistical method I can use to test if my group differs significantly from the predicted norm (100%)?
    2. Percentile charts: I also have access to percentile charts for heart rate by age. These include values for the 2nd, 9th, 25th, 50th, 75th, 91st, and 98th percentiles, as well as the distribution parameters (Mu and Sigma). Can I use these to test if my group statistically differs from the expected population distribution?

    Any guidance on appropriate statistical tests or approaches for either of these scenarios would be greatly appreciated! For info: the group is relatively small.
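    (A sketch of both options in R, with made-up numbers. Option 1 treats each patient's measured/predicted ratio as one observation and tests whether the group mean differs from 1, i.e. 100% of predicted; option 2 converts values to z-scores using the chart's Mu and Sigma and tests against 0, which assumes the chart is a plain normal model rather than an LMS-type chart with a skewness parameter.)

        # Option 1: one-sample test of measured/predicted against 1 (values made up)
        ratio <- c(0.84, 0.88, 0.81, 0.90, 0.86, 0.79, 0.92, 0.85)
        t.test(ratio, mu = 1)          # or wilcox.test(ratio, mu = 1) for a small, non-normal group

        # Option 2: z-scores from the published age-specific Mu and Sigma, tested against 0
        # z <- (measured - mu_age) / sigma_age
        # t.test(z, mu = 0)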

    1 Comment
    2025/01/17
    09:47 UTC

    0

    [Q] Need help!

    Hello, I'm doing an undergraduate thesis and my study is about the gendered impact of a typhoon on women in certain areas it affected (municipal level). I was told that my data analysis should be chi-square, is that right? I'm sorry but I am really bad at statistics and it'll be a great help if you can share your thoughts. Thank you!

    Note: my questionnaire is a structured questionnaire, but my thesis is mixed-method (thematic analysis for KIIs). I haven't gathered data yet since the submission is only Chapters 1-3 (3 = methodology), but the instrument I've made is a structured questionnaire, with questions about their demographic profile (socioeconomic status, condition, etc.), their roles and responsibilities, and the impact of the typhoon during and after the calamity.
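    (Whether chi-square fits depends on the exact question: it tests association between two categorical variables, nothing more. A minimal sketch with made-up labels and counts:)

        # e.g. "impact severity" vs "head of household" from the structured questionnaire
        tab <- matrix(c(30, 10,
                        15, 25),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(impact = c("High", "Low"),
                                      head_of_household = c("Female", "Male")))
        chisq.test(tab)
        # With small expected counts, fisher.test(tab) is the usual fallback.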

    2 Comments
    2025/01/17
    03:01 UTC

    1

    [C] Low Stat Applicant Seeking Advice on MS Statistics Programs

    Hi everyone,

    I am a domestic, non-traditional, low-stat applicant. I majored in cs at a no-name university, have no research experience, and hold a 3.1 GPA. Over the past year, I retook Calculus I–III, Linear Algebra, and Intro to Statistics at a community college to refresh and strengthen my math foundation (postbacc gpa 4.00) while working full-time. I have been out of school and working in an unrelated field for about two years.

    I am looking to gain research experience in a master's program and then aim for a PhD. I am in search of schools with rigorous math/statistics departments that offer ample research opportunities.

    I have curated a list of schools to apply to, but I am unsure if it is appropriately balanced given my stats. Should I aim higher or lower? Any recommendations or insights?

    • UChicago
    • UMich
    • UMN
    • SBU
    • UIUC
    • NCSU
    • TAMU
    • UCI
    • UIC
    • UGA
    3 Comments
    2025/01/17
    01:23 UTC

    243

    What is hot in statistics research nowadays [Research]

    I recently attended a conference and got to see a talk by Daniela Witten (UW) and another talk from Bin Yu (Berkeley). I missed another talk by Rebecca Willett (U of C) on scientific machine learning. This leads me to wonder,

    What's hot in the field of stats research?

    AI / machine learning is hot for obvious reasons, and it gets lots of funding (according to a rather eccentric theoretical CS professor, 'quantum' and 'machine learning' are the hot topics for grant funding).

    I think that more traditional statistics departments that don't embrace AI / machine learning are going to be at a disadvantage, relatively speaking, if they don't adapt.

    Some topics I thought of off the top of my head are: selective inference, machine learning UQ (relatively few pure stats departments seem to be doing this, largely these are stats departments at schools with very strong CS departments like Berkeley and CMU), fair AI, and AI for science. (AI for science / SciML has more of an applied math flavor rather than stats, but profs like Willett and Lu Lu (Yale) are technically stats faculty).

    Here's the report on hot topics that ChatGPT gave me, but keep in mind that the training data stops at 2023.

    1. Causal Inference and Causal Machine Learning

    • Why it's hot: Traditional statistical models focus on associations, but many real-world questions require understanding causality (e.g., "What happens if we intervene?"). Machine learning methods, like causal forests and double machine learning, are being developed to handle high-dimensional and complex causal inference problems.
    • Key ideas:
      • Causal discovery from observational data.
      • Robustness of causal estimates under unmeasured confounding.
      • Applications in personalized medicine and policy evaluation.
    • Emerging tools:
      • DoWhy, EconML (Microsoft’s library for causal machine learning).
      • Structural causal models (SCMs) for modeling complex causal systems.

    2. Uncertainty Quantification (UQ) in Machine Learning

    • Why it's hot: Machine learning models are powerful but often lack reliable uncertainty estimates. Statistics is stepping in to provide rigorous uncertainty measures for these models.
    • Key ideas:
      • Bayesian deep learning for uncertainty.
      • Conformal prediction for distribution-free prediction intervals (a toy sketch follows this list).
      • Out-of-distribution detection and calibration of predictive models.
    • Applications: Autonomous systems, medical diagnostics, and risk-sensitive decision-making.

    3. High-Dimensional Statistics

    • Why it's hot: In modern data problems, the number of parameters often exceeds the number of observations (e.g., genomics, neuroimaging). High-dimensional methods enable effective inference and prediction in such settings.
    • Key ideas:
      • Sparse regression (e.g., LASSO, Elastic Net).
      • Low-rank matrix estimation and tensor decomposition.
      • High-dimensional hypothesis testing and variable selection.
    • Emerging directions: Handling non-convex objectives, incorporating deep learning priors.

    4. Statistical Learning Theory

    • Why it's hot: As machine learning continues to dominate, there’s a need to understand its theoretical underpinnings. Statistical learning theory bridges the gap between ML practice and mathematical guarantees.
    • Key ideas:
      • Generalization bounds for deep learning models.
      • PAC-Bayes theory and information-theoretic approaches.
      • Optimization landscapes in over-parameterized models (e.g., neural networks).
    • Hot debates: Why do deep networks generalize despite being over-parameterized?

    5. Robust and Distribution-Free Inference

    • Why it's hot: Classical statistical methods often rely on strong assumptions (e.g., Gaussian errors, exchangeability). New methods relax these assumptions to handle real-world, messy data.
    • Key ideas:
      • Conformal inference for prediction intervals under minimal assumptions.
      • Robust statistics for heavy-tailed and contaminated data.
      • Nonparametric inference under weaker assumptions.
    • Emerging directions: Intersection with adversarial robustness in machine learning.

    6. Foundations of Bayesian Computation

    • Why it's hot: Bayesian methods are powerful but computationally expensive for large-scale data. Research focuses on making them more scalable and reliable.
    • Key ideas:
      • Scalable Markov Chain Monte Carlo (MCMC) algorithms.
      • Variational inference and its theoretical guarantees.
      • Bayesian neural networks and approximate posterior inference.
    • Emerging directions: Integrating physics-informed priors with Bayesian computation for scientific modeling.

    7. Statistical Challenges in Deep Learning

    • Why it's hot: Deep learning models are incredibly complex, and their statistical properties are poorly understood. Researchers are exploring:
      • Generalization in over-parameterized models.
      • Statistical interpretations of training dynamics.
      • Compression, pruning, and distillation of models.
    • Key ideas:
      • Implicit regularization in gradient descent.
      • Role of model architecture in statistical performance.
      • Probabilistic embeddings and generative models.

    8. Federated and Privacy-Preserving Learning

    • Why it's hot: The growing focus on data privacy and decentralized data motivates statistical advances in federated learning and differential privacy.
    • Key ideas:
      • Differentially private statistical estimation.
      • Communication-efficient federated learning.
      • Privacy-utility trade-offs in statistical models.
    • Applications: Healthcare data sharing, collaborative AI, and secure financial analytics.

    9. Spatial and Spatiotemporal Statistics

    • Why it's hot: The explosion of spatial data from satellites, sensors, and mobile devices has led to advancements in spatiotemporal modeling.
    • Key ideas:
      • Gaussian processes for spatial modeling.
      • Nonstationary and multiresolution models.
      • Scalable methods for massive spatiotemporal datasets.
    • Applications: Climate modeling, epidemiology (COVID-19 modeling), urban planning.

    10. Statistics for Complex Data Structures

    • Why it's hot: Modern data is often non-Euclidean (e.g., networks, manifolds, point clouds). New statistical methods are being developed to handle these structures.
    • Key ideas:
      • Graphical models and network statistics.
      • Statistical inference on manifolds.
      • Topological data analysis (TDA) for extracting features from high-dimensional data.
    • Applications: Social networks, neuroscience (brain connectomes), and shape analysis.

    11. Fairness and Bias in Machine Learning

    • Why it's hot: As ML systems are deployed widely, there’s an urgent need to ensure fairness and mitigate bias.
    • Key ideas:
      • Statistical frameworks for fairness (e.g., equalized odds, demographic parity).
      • Testing and correcting algorithmic bias.
      • Trade-offs between fairness, accuracy, and interpretability.
    • Applications: Hiring algorithms, lending, criminal justice, and medical AI.

    12. Reinforcement Learning and Sequential Decision Making

    • Why it's hot: RL is critical for applications like robotics and personalized interventions, but statistical aspects are underexplored.
    • Key ideas:
      • Exploration-exploitation trade-offs in high-dimensional settings.
      • Offline RL (learning from logged data).
      • Bayesian RL and uncertainty-aware policies.
    • Applications: Healthcare (adaptive treatment strategies), finance, and game AI.

    13. Statistical Methods for Large-Scale Data

    • Why it's hot: Big data challenges computational efficiency and interpretability of classical methods.
    • Key ideas:
      • Scalable algorithms for massive datasets (e.g., distributed optimization).
      • Approximate inference techniques for high-dimensional data.
      • Subsampling and sketching for faster computations.
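    (A toy illustration of the split-conformal idea referenced in topics 2 and 5, sketched in base R on simulated data; any regression model could replace the lm.)

        set.seed(1)
        n <- 500
        x <- runif(n, 0, 10)
        y <- 2 + 0.5 * x + rnorm(n)

        train <- sample(n, n / 2)             # fit on one half
        calib <- setdiff(seq_len(n), train)   # calibrate on the other half

        fit   <- lm(y ~ x, data = data.frame(x = x[train], y = y[train]))
        resid <- abs(y[calib] - predict(fit, data.frame(x = x[calib])))

        alpha <- 0.1
        k     <- ceiling((length(calib) + 1) * (1 - alpha))
        q     <- sort(resid)[k]               # conformal quantile of calibration residuals

        # 90% prediction interval for a new point, valid without distributional assumptions
        predict(fit, data.frame(x = 4)) + c(-1, 1) * q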
    50 Comments
    2025/01/17
    01:19 UTC

    192

    [Q] Why do researchers commonly violate the "cardinal sins" of statistics and get away with it?

    As a psychology major, we don't have water always boiling at 100 C/212 F like in biology and chemistry. Our confounds and variables are more complex, harder to predict, and a fucking pain to control for.

    Yet when I read accredited journals, I see studies using parametric tests on a sample of 17. I thought CLT was absolute and it had to be 30? Why preach that if you ignore it due to convenience sampling?

    Why don't authors stick to a single alpha value for their hypothesis tests? It seems odd to report one result at p < .001 but then get a p-value of 0.038 on another measure and call it significant because p < .05. Had they used their original alpha value, they'd have been forced to reject their hypothesis. Why shift the goalposts?

    Why hide demographic or other descriptive statistics in a "Supplementary Table/Graph" you have to dig for online? Why is there publication bias? Why do studies give little to no care to external validity because they aren't solving a real problem? Why perform "placebo washouts" where clinical trials exclude any participant who experiences a placebo effect? Why exclude outliers when they are no less proper data points than the rest of the sample?

    Why do journals downplay negative or null results presented to their own audience rather than the truth?

    I was told these and many more things in statistics are "cardinal sins" you are to never do. Yet professional journals, scientists and statisticians, do them all the time. Worse yet, they get rewarded for it. Journals and editors are no less guilty.

    202 Comments
    2025/01/16
    21:36 UTC

    1

    [Q] Confidence of StdDev measurements

    I am working on a system where I consume data over a period of time and I'd like to be able to find a reasonable "min" and "max" values for this metric so that I can be alerted when data points are outside the range.

    I'd like to set the min and max values at plus/minus 3 standard deviations from the mean. However the part I'm struggling with is how to determine when I've gathered enough data to have confidence in my measured mean and standard deviations. I wouldn't want to enable alerts for the range until I have confidence that the mean and stddev I've measured are accurately representing the underlying distribution. So is there a way to quantify and calculate this "confidence" measure? I'd imagine that such a concept exists already but I am a statistics noob. Thanks!
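    (One way to quantify this, sketched under the assumption of roughly independent, approximately normal data points: compute confidence intervals for the mean and the standard deviation, and only enable alerting once they are acceptably narrow. The "+/-10% of the point estimate" stopping rule below is an arbitrary choice, not a standard.)

        ci_mean_sd <- function(x, conf = 0.95) {
          n <- length(x); a <- 1 - conf
          m <- mean(x); s <- sd(x)
          mean_ci <- m + c(-1, 1) * qt(1 - a / 2, df = n - 1) * s / sqrt(n)
          sd_ci   <- sqrt((n - 1) * s^2 / qchisq(c(1 - a / 2, a / 2), df = n - 1))
          list(mean = m, mean_ci = mean_ci, sd = s, sd_ci = sd_ci)
        }

        x <- rnorm(200, mean = 50, sd = 5)   # stand-in for the collected metric
        ci_mean_sd(x)
        # Possible rule: start alerting once the SD's CI is within about +/-10% of the SD itself.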

    4 Comments
    2025/01/16
    19:07 UTC

    0

    [R] PLS-SEM with bad model fit. What should I do?

    Hi, I'm analysing an extended Theory of Planned Behavior, and I'm conducting a PLS-SEM analysis in SmartPLS. My measurement model analysis has given good results (outer loadings, Cronbach's alpha, HTMT, VIF). On the structural model analysis, my R-square and Q-square values are good, and I get weak f-square results. The problem occurs in the model fit section: no matter how I change the constructs and their indicators, the NFI lies at around 0.7 and the SRMR at 0.82, even for the saturated model. Is there anything I can do to improve this? Where should I check for possible anomalies or errors?

    Thank you for the attention.

    0 Comments
    2025/01/16
    12:59 UTC

    0

    [Q] Logistic regression in PSPP

    Hi All,

    Background - Having collected some data for some initial research I have two variables:

    1 - Area of tumour on a slide preparation in mm2 - continuous

    2 - Did the specimen process successfully for genetic testing - binary (it can partially succeed, but I have classed partial success as failure for now)

    My understanding is that I should be able to identify a value for variable 1 where we can say there is a greater than 50% likelihood of succeeding (or indeed greater than say 80%?)

    My statistics background is relatively basic unfortunately but google tells me that this may be solvable using logistic regression?

    I have put the data into PSPP and set up a logistic regression analysis, and I do get a result, but I am now at a bit of a loss as to what the results mean or how I use them to get the information I want.

    Below is the output it gave. Any guidance would be much appreciated

    TIA

    Case Processing Summary

        Unweighted Cases          N    Percent
        Included in Analysis     58     100.0%
        Missing Cases             0        .0%
        Total                    58     100.0%

    Model Summary

        Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
        1                  61.20                    .14                   .20

    Classification Table

                                           Predicted VAR002
        Observed                            0     1    Percentage Correct
        Step 1   VAR002        0            0    17                   .0%
                               1            0    41                 100.0%
                 Overall Percentage                                  70.7%

    Variables in the Equation

                              B    S.E.   Wald   df   Sig.   Exp(B)
        Step 1   VAR001     .87     .40   4.69    1   .030     2.38
                 Constant  -.04     .44    .01    1   .930      .96
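    (A sketch of how to read that output back into the quantity asked about: with logit(p) = B0 + B1*area, the area at which the predicted probability of success reaches a target p is (qlogis(p) - B0) / B1. The coefficients below are the rounded values from the printed table; note the classification table shows the model currently predicts "success" for every case, so treat any cutoff as rough.)

        b0 <- -0.04   # Constant (log-odds of success at area = 0)
        b1 <-  0.87   # VAR001: change in log-odds per mm2 of tumour area

        area_at <- function(p) (qlogis(p) - b0) / b1
        area_at(0.5)   # area at which success becomes more likely than not
        area_at(0.8)   # area giving an estimated 80% chance of success

        # Exp(B) = 2.38: each extra mm2 multiplies the odds of success by about 2.4;
        # Sig. = .030 for VAR001 says the slope is distinguishable from zero at the 5% level.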

    5 Comments
    2025/01/16
    11:52 UTC

    0

    [Q] Best way of describing variance?

    Hi all.

    I have two columns of numbers:

    column 1    column 2
           7           6
          23          27
          15          13
          55          54

    I want to compare the "closeness" of values in column 1 to column 2, in a given row. What is the best way of numerically comparing the values? I can calculate their difference (delta). Is this the variance? How best to describe this in a sentence; aka,

    delta
        1
        4
        2
        1

    A comparison of "column 1" with "column 2" shows an excellent match, with highest variance of 4%.

    Thank you :)
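    (A small sketch of the usual summaries: the delta is an absolute difference, not a variance, and a relative version divides it by the value itself.)

        col1 <- c(7, 23, 15, 55)
        col2 <- c(6, 27, 13, 54)

        delta     <- col1 - col2              # signed difference per row
        abs_delta <- abs(delta)               # absolute difference
        pct_diff  <- 100 * abs_delta / col1   # difference relative to column 1, in %

        data.frame(col1, col2, delta, pct_diff)
        mean(abs_delta)                       # mean absolute difference
        max(pct_diff)                         # worst-case relative difference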

    3 Comments
    2025/01/16
    11:26 UTC

    39

    [Q] Curiosity question: Is there a name for a value that you get if you subtract median from mean, and is it any useful?

    I hope this is okay to post.

    So, my friend and I were discussing salaries in my home country. I brought up the median salary and the mean salary, and had a thought - what I asked in the title: if you subtract the median from the mean, does the resulting value have a name, and is it useful for anything at all? It looks like it would show how much the dataset is skewed towards higher or lower values? Or would it be a bad indicator for that?

    Sorry for a dumb question, last time I had to deal with statistics was in university ten years ago, I only remember basics. Googling for it only gave the results for "what's the difference between median and mean" articles
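    (Not a dumb question: mean minus median is sometimes reported directly, and divided by the standard deviation it is often called the nonparametric skew; three times that ratio is Pearson's second skewness coefficient. A quick illustration on simulated right-skewed, salary-like data:)

        set.seed(42)
        salaries <- rlnorm(10000, meanlog = 10, sdlog = 0.6)   # lognormal, roughly salary-shaped

        mean(salaries) - median(salaries)                       # raw gap, in currency units
        (mean(salaries) - median(salaries)) / sd(salaries)      # "nonparametric skew", unitless
        3 * (mean(salaries) - median(salaries)) / sd(salaries)  # Pearson's second skewness coefficient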

    18 Comments
    2025/01/16
    11:07 UTC

    2

    [Q] Chebyshev's inequality with known skewness

    Is there an extension of Chebyshev's inequality for distributions with a known skewness?

    Writing mu, sigma, and gamma for the mean, standard deviation, and skewness, I'd like to obtain the two one-sided inequalities

    P(X > mu + k * sigma) < f1(sigma, gamma, k)

    P(X < mu - k * sigma) < f2(sigma, gamma, k)

    It intuitively makes sense that knowing the skewness, we should obtain better estimates of both tails but I wasn't able to find any actual result on it.
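    (For reference, the skewness-free one-sided baseline is Cantelli's inequality, shown below. Published refinements that incorporate skewness or kurtosis do exist, but their exact forms are worth checking in the literature rather than quoting from memory.)

        P\left(X \ge \mu + k\sigma\right) \le \frac{1}{1 + k^{2}},
        \qquad
        P\left(X \le \mu - k\sigma\right) \le \frac{1}{1 + k^{2}},
        \qquad k > 0.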

    3 Comments
    2025/01/16
    10:32 UTC

    2

    [C] Any stats jobs overlap with political science?

    Currently I’m pursuing a Statistics B.S. at UC Davis. There is an option to pursue an Applied Statistics track, where you can choose a certain outside subjects to take quantitative courses in. I decided to do political science, mostly because I just wanted an excuse to take those courses.

    I’m wondering though if there are any jobs that fall within this overlap. I feel like I would need a graduate degree to do anything. If anyone has any insight, I would greatly appreciate it.

    4 Comments
    2025/01/16
    07:28 UTC

    2

    [Q] Combination lock probability query

    I only know really basic stats/probability, so I was wondering if I could get help on a debate with my dorm mates here at uni. We have combination locks on our room doors with buttons numbered one through five. Each of us has a code with 3 integers. An integer can be either one digit (ex. 1, 2, etc.) or two digits (ex. pressing 1 and 2 at the same time, which could be either 12 or 21). However, this means integers like 11, 22, 33, etc. are not possible integers in the code. Also, once a button has been pushed, it cannot be pushed again, so a code could not be 2-53-24 because the 2 would be used twice.

    A few examples of acceptable combinations:

    • 12-3-45
    • 51-42-3
    • 41-53-2
    • 1-2-3

    I'm aware there are a ton of stipulations that come along with solving this problem, but I was just curious if someone could help us out in finding the number of possible combinations. To be precise, we are looking not for the number of possible codes, but the number of possible ways to push the buttons--so for our purposes, the codes 12-3-4 and 21-3-4 are identical, as the buttons would be pushed the same way either way.
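    (A brute-force count is easy to script and avoids arguing over the combinatorics: treat each press as a set of one or two distinct buttons, require the three presses to use disjoint buttons, and count ordered sequences of presses. Sketch in R; the final line prints the total.)

        presses <- c(lapply(1:5, function(i) i),          # single buttons
                     combn(1:5, 2, simplify = FALSE))     # unordered pairs of buttons

        codes <- 0
        for (i in seq_along(presses))
          for (j in seq_along(presses))
            for (k in seq_along(presses)) {
              used <- c(presses[[i]], presses[[j]], presses[[k]])
              if (!anyDuplicated(used)) codes <- codes + 1   # count only disjoint presses
            }
        codes   # total number of distinct push-sequences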

    5 Comments
    2025/01/16
    04:47 UTC

    10

    [Q] What salary range should I expect as a fresh college grad with a BS in Statistics?

    For context, I’m a student at UCLA, and am applying to jobs within California. But I’m interested in people’s past jobs fresh out of college, where in the country, and what the salary was.

    Tentatively, I’m expecting a salary of anywhere between $70k and $80k, but I’ve been told I should be expecting closer to $100k, which just seems ludicrous.

    27 Comments
    2025/01/16
    04:21 UTC

    19

    [C] Is it unrealistic to get a job doing statistical analysis?

    I ask this as someone with a good foundation in statistics who just finished a 6.5-hour YouTube biostatistics course. I like research, and while I am not the best at math, I enjoy statistics. Alas, I don't have tons of coursework in the area. I want to crunch the numbers and help with study design, but since I do not have a strong formal statistics background, my question is whether I can realistically expect this as a potential career avenue.

    Thoughts?

    84 Comments
    2025/01/16
    02:18 UTC

    0

    [Q] Advanced Imputation Techniques for Correlated Time Series: Insights and Experiences?

    Hi everyone,

    I’m looking to spark a discussion about advanced imputation techniques for datasets with multiple distinct but correlated time series. Imagine a dataset like energy consumption or sales data, where hundreds of stores or buildings are measured separately. The granularity might be hourly or daily, with varying levels of data completeness across the time series.

    Here’s the challenge:

    1. Some buildings/stores have complete or nearly complete data with only a few missing values. These are straightforward to impute using standard techniques.
    2. Others have partial data, with gaps ranging from days to months.
    3. Finally, there are buildings with 100% missing values for the target variable across the entire time frame, leaving us reliant on correlated data and features.

    The time series show clear seasonal patterns (weekly, annual) and dependencies on external factors like weather, customer counts, or building size. While these features are available for all buildings—including those with no target data—the features alone are insufficient to accurately predict the target. Correlations between the time series range from moderate (~0.3) to very high (~0.9), making the data situation highly heterogeneous.

    My Current Approach (a rough code sketch follows this list):

    For stores/buildings with few or no data points, I’m considering an approach that involves:

    1. Using Correlated Stores: Identify stores with high correlations based on available data (e.g., monthly aggregates). These could serve as a foundation for imputing the missing time series.
    2. Reconciling to Monthly Totals: If we know the monthly sums of the target for stores with missing hourly/daily data, we could constrain the imputed time series to match these totals. For example, adjust the imputed hourly/daily values so that their sum equals the known monthly figure.
    3. Incorporating Known Features: For stores with missing target data, use additional features (e.g., holidays, temperature, building size, or operational hours) to refine the imputed time series. For example, if a store was closed on a Monday due to repairs or a holiday, the imputation should respect this and redistribute values accordingly.
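    (A rough sketch of steps 1-2 in R, with hypothetical column and object names: regress the target store on a highly correlated donor store where both are observed, predict the gaps, then rescale the imputed days so each month hits its known total.)

        fit <- lm(target_kwh ~ donor_kwh + is_holiday + temperature,
                  data = subset(df, !is.na(target_kwh)))

        gap <- is.na(df$target_kwh)
        df$target_imp <- df$target_kwh
        df$target_imp[gap] <- predict(fit, newdata = df[gap, ])

        # Reconcile to known monthly totals: scale only the imputed days in each month
        # so the month's sum matches the reported figure in `monthly_totals`.
        for (m in unique(df$month)) {
          idx <- df$month == m & gap
          if (!any(idx)) next
          observed <- sum(df$target_kwh[df$month == m], na.rm = TRUE)
          needed   <- monthly_totals$total[monthly_totals$month == m] - observed
          df$target_imp[idx] <- df$target_imp[idx] * needed / sum(df$target_imp[idx])
        }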

    Why Just Using Correlated Stores Isn’t Enough:

    While using highly correlated stores for imputation seems like a natural approach, it has limitations. For instance:

    • A store might be closed on certain days (e.g., repairs or holidays), resulting in zero or drastically reduced energy consumption. Simply copying or scaling values from correlated stores won’t account for this.
    • The known features for the missing store (e.g., building size, operational hours, or customer counts) might differ significantly from those of the correlated stores, leading to biased imputations.
    • Seasonal patterns (e.g., weekends vs. weekdays) may vary slightly between stores due to operational differences.

    Open Questions:

    • Feature Integration: How can we better incorporate the available features of stores with 100% missing values into the imputation process while respecting known totals (e.g., monthly sums)?
    • Handling Correlation-Based Imputation: Are there specific techniques or algorithms that work well for leveraging correlations between time series for imputation?
    • Practical Adjustments: When reconciling imputed values to match known totals, what methods work best for redistributing values while preserving the overall seasonal and temporal patterns?

    From my perspective, this approach seems sensible, but I’m curious about others' experiences with similar problems or opinions on why this might—or might not—work in practice. If you’ve dealt with imputation in datasets with heterogeneous time series and varying levels of completeness, I’d love to hear your insights!

    Thanks in advance for your thoughts and ideas!

    3 Comments
    2025/01/15
    22:12 UTC

    1

    [Q] Calculating statistical significance with "overlapping" datasets

    Hi all. I have two weighted datasets of survey responses covering overlapping periods, and I want to calculate if the difference between estimates taken from each dataset are statistically significant.

    So, for example, Dataset1 covers the responses from July to September, from which we've estimated the number of adults with a university degree as 300,000. Whereas from Dataset2, which covers August to October, that would be estimated at 275,000. Is that statistically significant or not?

    My gut instinct is that it's not something I should even be trying to calculate, as the overlapping nature of the data would render any standard statistical test invalid (roughly two-thirds of the datasets are the same records, albeit the weighting is calculated separately for each dataset).

    If it is possible to do this, what statistical test should I be using?

    Thanks!

    (And apologies if that's all a bit nonsensical, my stats knowledge is many years old now... If there's anything extra I need to explain, please ask.)
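    (It is calculable in principle: the variance of the difference needs a covariance term for the shared records, Var(est1 - est2) = SE1^2 + SE2^2 - 2*Cov. The sketch below uses made-up standard errors and an assumed correlation; in practice the covariance would come from replicate weights or a jackknife over the common records.)

        est1 <- 300000; se1 <- 12000   # July-September estimate and its SE (hypothetical)
        est2 <- 275000; se2 <- 12500   # August-October estimate and its SE (hypothetical)
        rho  <- 0.6                    # assumed correlation induced by the ~2/3 overlap

        se_diff <- sqrt(se1^2 + se2^2 - 2 * rho * se1 * se2)
        z       <- (est1 - est2) / se_diff
        p_value <- 2 * pnorm(-abs(z))
        c(z = z, p = p_value)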

    4 Comments
    2025/01/15
    19:47 UTC

    0

    [Question] How to choose multipliers for outlier detection using MAD method?

    I'm using the median absolute deviation method for outlier detection in biological data. I can't find anything online about how to choose my multipliers. I'm finding that I need to adjust the multiplier depending on the median and spread of the data, but I want to find a statistically sound way to do that.

    Any help or resources on this topic would be amazing!
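    (There is no single statistically "correct" multiplier; a common convention is 3 scaled MADs from the median, tightened to about 2.5 or loosened to 3.5 depending on how costly false flags are. R's mad() already applies the 1.4826 consistency constant so that, for normal data, it estimates the standard deviation. Sketch with simulated data:)

        x <- c(rnorm(100, mean = 10, sd = 2), 25, 30)   # simulated data with two planted outliers

        robust_z <- (x - median(x)) / mad(x)            # "robust z-scores"
        which(abs(robust_z) > 3)                        # indices flagged as outliers

        # Lowering the cutoff to 2.5 flags more points; raising it to 3.5 flags fewer.
        # The cutoff is a judgment about tolerable false positives, not something the data alone decides.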

    5 Comments
    2025/01/15
    19:02 UTC

    0

    [Q] Doubt about linear mixed model with categorical data

    I am fitting a random intercept model with only categorical predictors (that is the job that was given to me), so the model has a different intercept for each group. But when I plot the predictions, the result is not a straight line: whenever a categorical variable changes level, the fitted values jump at those points and then continue as a constant line (which is expected when all the categorical variables take the same value).

    My doubt is: is this a mistake? I was expecting a straight line, but with categorical data I do not know how that would be possible. Can someone give a little bit of enlightenment here, please?
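    (It is probably not a mistake: with only categorical predictors the model can produce only one fitted value per combination of factor levels, so plotted against the row index the predictions look like flat segments with jumps, not a sloped line. A small sketch with lme4 and made-up data:)

        library(lme4)

        set.seed(1)
        d <- data.frame(
          group     = factor(rep(1:10, each = 20)),
          treatment = factor(sample(c("A", "B", "C"), 200, replace = TRUE))
        )
        d$y <- 2 + (d$treatment == "B") * 1.5 + (d$treatment == "C") * 3 +
               rnorm(10)[d$group] + rnorm(200, sd = 0.5)

        m <- lmer(y ~ treatment + (1 | group), data = d)   # random intercept per group
        head(cbind(d, fitted = fitted(m)))                 # one fitted value per factor combination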

    3 Comments
    2025/01/15
    17:20 UTC

    13

    TidyLLM?? LLMs in R! [Q]

    The LLMverse is here! Here are some R packages I saw, written by Hadley Wickham et al., that are centered around interacting with LLMs in R.

    https://luisdva.github.io/rstats/LLMsR/

    3 Comments
    2025/01/15
    13:46 UTC

    5

    [Q] Inferential statistics on population data?

    Hi all,

    I have a situation at work and I feel like I’m going a little crazy. I’m hoping someone here could help shed some light on it.

    I have a middling grasp of statistics. Right now my supervisor is having me look at the data of the clients we have served and wants me to determine if we have been declining in the dichotomous variable RHR over the past few years. Easy enough, that’s just descriptive data right?

    Well they want me to determine if the changes over time are “statistically significant.” And this is where I feel like I’m going crazy. Wouldn’t “statistically significant” imply inferential stats? And what’s the point of inferential stats if we already have the population data (i.e., the entire dataset of all the clients we serve).

    I’ve googled the question and everything seems to suggest that this would be an exercise in nonsense, but they were pretty insistent that they wanted statistical testing, and they have a higher degree and a lot more experience.

    So am I missing something? Is there a situation where it would make sense to run inferential stats on population data?

    11 Comments
    2025/01/15
    02:45 UTC

    0

    [Q] logistic regression with categorical treatment and control variables and binary outcome.

    Hi everyone, I’m really struggling with my research as I do not understand where I’m standing. I am trying to evaluate the effect of group affiliation (5 categories) on mobilization outcomes (successful/not successful). I have other independent variables to control for, such as ‘area’ (3 possible categories), duration (number of days the mobilization lasted), and motive (4 possible motives). I have been using GPT-4 to set up my model, but I am more confused and can’t find proper academic sources to understand why certain things need to be done in my model.

    I understand that for a binary outcome I need to use a logistic regression, but I need to establish my categorical variables as factors; therefore my control variables have a reference category (I’m using R). However when running my model do I need to interpret all my control variables against the reference category? Since I have coefficients not only for my treatment variable but also for my control variables.

    If anyone is able to guide me I’ll be eternally grateful.
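    (A sketch in R with hypothetical column and level names: reference levels are set with relevel(), and every coefficient for a categorical variable, control or treatment, is a contrast against that variable's own reference level.)

        df$group  <- relevel(factor(df$group),  ref = "no_affiliation")
        df$area   <- relevel(factor(df$area),   ref = "urban")
        df$motive <- relevel(factor(df$motive), ref = "economic")

        fit <- glm(success ~ group + area + duration + motive,
                   data = df, family = binomial)
        summary(fit)
        exp(coef(fit))   # odds ratios, each versus that variable's reference level

        # You don't have to interpret every control coefficient substantively, but each one
        # is still read "versus its reference level"; changing the reference only changes
        # how coefficients are expressed, not the model's fit.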

    4 Comments
    2025/01/15
    02:10 UTC
