/r/statistics


/r/Statistics is going dark from June 12-14th as an act of protest against Reddit's treatment of 3rd party app developers.

This community will not grant access requests during the protest. Please do not message asking to be added to the subreddit.


Guidelines:

  1. All Posts Require One of the Following Tags in the Post Title! If you do not tag your post, AutoModerator will delete it:

Tag | Abbreviation
[Research] | [R]
[Software] | [S]
[Question] | [Q]
[Discussion] | [D]
[Education] | [E]
[Career] | [C]
[Meta] | [M]
  • This is not a subreddit for homework questions. They will be swiftly removed, so don't waste your time! Please kindly post those over at: r/homeworkhelp. Thank you.

  • Please try to keep submissions on topic and of high quality.

  • Just because it has a statistic in it doesn't make it statistics.

  • Memes and image macros are not acceptable forms of content.

  • Self posts with throwaway accounts will be deleted by AutoModerator


  • Related subreddits:


    Data:


    Useful resources for learning R:

  • r-bloggers - blog aggregator with statistics articles generally done with R software.

  • Quick-R - great R reference site.


  • Related Software Links:

  • R

  • R Studio

  • SAS

  • Stata

  • EViews

  • JMP

  • SPSS

  • Minitab


  • Advice for applying to grad school:

  • Submission 1


  • Advice for undergrads:

  • Submission 1


  • Jobs and Internships

    For grads:

    For undergrads:

    /r/statistics

    582,842 Subscribers

    0

    [Q] Statistical significance when proportion is bigger than 1

    Hey folks, I work with data, and I frequently have to check whether something is statistically significant at a specific confidence level, but I don't really know statistics that well. Usually I just open Evan Miller's chi-squared website and input the numbers, but right now I have a proportion bigger than 100% (more conversions than exposures), so this test does not work. How can I check whether one group is statistically better than the other in this case?

    If needed, I have the data disaggregated (total conversions for each exposed customer, and the group each customer belongs to).
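
    A minimal sketch (one option among several, not a definitive recipe) of how the disaggregated data could be compared once proportions exceed 100%: treat conversions per exposed customer as counts and compare the two rates. The totals below are hypothetical.

    # Two-sample Poisson rate comparison: conversions per exposed customer
    conv_a <- 530; n_a <- 400   # hypothetical totals for group A (conversions, exposed customers)
    conv_b <- 480; n_b <- 410   # hypothetical totals for group B
    poisson.test(c(conv_a, conv_b), T = c(n_a, n_b))   # rate-ratio test for two Poisson rates

    # Or, on the customer-level data (one row per exposed customer):
    # glm(conversions ~ group, family = poisson(), data = customer_data)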

    4 Comments
    2024/12/01
    22:10 UTC

    0

    [Q] How to determine weights for submeasures?

    Hello everyone. I have a question regarding #weights - how do you assess them in measures? Besides arbitrarily, of course :) I'd be very thankful for proper sources, tutorials, or papers that used some method, or step-by-step guidance.

    Let's say I know theoretically that one item of my scale is more important, more differentiating, or more informative than the other item; how do I decide how much more important it is? How do I know exactly what weight to give it?
    Suppose I have the data. What analyses should I do? (I was thinking IRT, but I am still not sure what to do with the obtained discrimination and information values to determine the weights.)

    Also, is the method different for weighting items in a scale versus weighting subscales in a general score?
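
    A hedged sketch of the IRT route mentioned above, assuming the mirt package and a hypothetical item matrix item_data; normalized discriminations are just one illustrative basis for weights, not a recommendation.

    # Fit a unidimensional graded-response IRT model and use item discriminations
    # as one candidate basis for weights (illustration only)
    library(mirt)
    mod  <- mirt(item_data, model = 1, itemtype = "graded")   # item_data is a hypothetical item matrix
    disc <- coef(mod, simplify = TRUE, IRTpars = TRUE)$items[, "a"]
    weights <- disc / sum(disc)   # normalize so the weights sum to 1
    weights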

    1 Comment
    2024/12/01
    21:24 UTC

    1

    [Q] Bi-factorial CFA with WLSMV estimator (R). Excellent fit, very bad loadings. How to interpret and improve?

    Hey there!

    I created a cognitive test with 40 true/false items and 10 timed performance items, and I'm stuck on the CFA. I am using lavaan with WLSMV estimation and need help!
    I used a bifactor model where 30 items load on VZ, 10 on GH, and 10 timed items on UT. I have nearly perfect fit scores, but many items load negatively or non-significantly on the sub-factors. Here is my final model and the results. What can I do to improve the model? Are those loadings OK?

    # Bi Factor Model on Gv
    model3u <- '
    #general factor
    Gv =~ D1 + D4 + D5 + D6 + D7 + D8 + D9 + D10 +
      GT1 + GT2 + GT3 + GT5 + GT6 + GT8 + GT9 +
      UP1 + UP2 + UP3 + UP4 + UP5 + UP9 + UP10 +
      GH1 + GH2 + GH4 + GH5 + GH6 + GH8 + GH9 +
      UT1 + UT2 + UT4 + UT5 + UT6 + UT8 + UT9

    #group specific factors
    VZ =~ D1 + D4 + D5 + D6 + D7 + D8 + D9 + D10 +
    GT1 + GT2 + GT3 + GT5 + GT6 + GT8 + GT9 +
    UP1 + UP2 + UP3 + UP4 + UP5 + UP9 + UP10
    GH =~ GH1 + GH2 + GH4 + GH5 + GH6 + GH8 + GH9
    UT =~ UT1 + UT2 + UT4 + UT5 + UT6 + UT8 + UT9
    # Orthogonal constraints
    Gv ~~ 0*VZ
    Gv ~~ 0*GH
    Gv ~~ 0*UT
    VZ ~~ 0*GH
    VZ ~~ 0*UT
    GH ~~ 0*UT
    '
    fit3u <- cfa(model3u, data = hcvpt, estimator = "WLSMV", ordered = ordinal_items)
    summary(fit3u, fit.measures = TRUE, standardized=TRUE)

    lavaan 0.6-18 ended normally after 360 iterations

    Estimator DWLS
    Optimization method NLMINB
    Number of model parameters 115

    Number of observations 251

    Model Test User Model:
    Standard Scaled
    Test Statistic 462.108 568.981
    Degrees of freedom 558 558
    P-value (Chi-square) 0.999 0.364
    Scaling correction factor 1.553
    Shift parameter 271.351
    simple second-order correction

    Model Test Baseline Model:

    Test statistic 2422.928 1438.410
    Degrees of freedom 630 630
    P-value 0.000 0.000
    Scaling correction factor 2.218

    User Model versus Baseline Model:

    Comparative Fit Index (CFI) 1.000 0.986
    Tucker-Lewis Index (TLI) 1.060 0.985

    Robust Comparative Fit Index (CFI) NA
    Robust Tucker-Lewis Index (TLI) NA

    Root Mean Square Error of Approximation:

    RMSEA 0.000 0.009
    90 Percent confidence interval - lower 0.000 0.000
    90 Percent confidence interval - upper 0.000 0.023
    P-value H_0: RMSEA <= 0.050 1.000 1.000
    P-value H_0: RMSEA >= 0.080 0.000 0.000

    Robust RMSEA NA
    90 Percent confidence interval - lower NA
    90 Percent confidence interval - upper NA
    P-value H_0: Robust RMSEA <= 0.050 NA
    P-value H_0: Robust RMSEA >= 0.080 NA

    Standardized Root Mean Square Residual:

    SRMR 0.086 0.086

    Parameter Estimates:

    Parameterization Delta
    Standard errors Robust.sem
    Information Expected
    Information saturated (h1) model Unstructured

    Latent Variables:
    Estimate Std.Err z-value P(>|z|) Std.lv Std.all
    Gv =~
    D1 1.000 0.618 0.618
    D4 0.502 0.238 2.103 0.035 0.310 0.310
    D5 0.570 0.200 2.854 0.004 0.352 0.352
    D6 0.287 0.214 1.345 0.179 0.178 0.178
    D7 0.647 0.185 3.491 0.000 0.400 0.400
    D8 0.323 0.164 1.967 0.049 0.200 0.200
    D9 0.245 0.173 1.414 0.157 0.151 0.151
    D10 0.256 0.173 1.480 0.139 0.158 0.158
    GT1 0.492 0.190 2.589 0.010 0.304 0.304
    GT2 0.185 0.180 1.032 0.302 0.114 0.114
    GT3 0.292 0.190 1.537 0.124 0.181 0.181
    GT5 0.609 0.191 3.190 0.001 0.376 0.376
    GT6 0.348 0.176 1.969 0.049 0.215 0.215
    GT8 0.498 0.183 2.724 0.006 0.308 0.308
    GT9 0.419 0.201 2.088 0.037 0.259 0.259
    UP1 0.329 0.173 1.906 0.057 0.203 0.203
    UP2 0.674 0.183 3.689 0.000 0.417 0.417
    UP3 0.599 0.200 2.997 0.003 0.370 0.370
    UP4 0.431 0.172 2.500 0.012 0.266 0.266
    UP5 0.510 0.182 2.798 0.005 0.315 0.315
    UP9 0.933 0.231 4.038 0.000 0.576 0.576
    UP10 0.611 0.195 3.137 0.002 0.378 0.378
    GH1 0.624 0.203 3.077 0.002 0.385 0.385
    GH2 0.209 0.172 1.219 0.223 0.129 0.129
    GH4 0.272 0.161 1.686 0.092 0.168 0.168
    GH5 0.308 0.174 1.768 0.077 0.190 0.190
    GH6 0.673 0.221 3.039 0.002 0.416 0.416
    GH8 0.342 0.180 1.899 0.058 0.211 0.211
    GH9 0.851 0.251 3.395 0.001 0.526 0.526
    UT1 0.063 0.015 4.074 0.000 0.039 0.412
    UT2 0.037 0.011 3.402 0.001 0.023 0.336
    UT4 0.034 0.009 3.629 0.000 0.021 0.363
    UT5 0.028 0.008 3.566 0.000 0.017 0.381
    UT6 0.039 0.010 3.780 0.000 0.024 0.411
    UT8 0.037 0.009 3.994 0.000 0.023 0.462
    UT9 0.016 0.008 2.039 0.041 0.010 0.182
    VZ =~
    D1 1.000 0.281 0.281
    D4 1.878 0.968 1.941 0.052 0.527 0.527
    D5 1.706 0.829 2.058 0.040 0.479 0.479
    D6 1.236 0.736 1.679 0.093 0.347 0.347
    D7 1.297 0.689 1.882 0.060 0.364 0.364
    D8 1.463 0.781 1.873 0.061 0.411 0.411
    D9 1.572 0.823 1.910 0.056 0.441 0.441
    D10 1.502 0.804 1.869 0.062 0.422 0.422
    GT1 -0.162 0.467 -0.346 0.729 -0.045 -0.045
    GT2 0.355 0.480 0.739 0.460 0.100 0.100
    GT3 -0.009 0.462 -0.020 0.984 -0.003 -0.003
    GT5 -0.232 0.432 -0.536 0.592 -0.065 -0.065
    GT6 0.753 0.586 1.285 0.199 0.211 0.211
    GT8 0.850 0.678 1.254 0.210 0.239 0.239
    GT9 0.215 0.501 0.430 0.667 0.060 0.060
    UP1 0.342 0.458 0.747 0.455 0.096 0.096
    UP2 0.469 0.426 1.102 0.270 0.132 0.132
    UP3 0.532 0.474 1.121 0.262 0.149 0.149
    UP4 -0.092 0.449 -0.204 0.839 -0.026 -0.026
    UP5 -0.288 0.440 -0.655 0.513 -0.081 -0.081
    UP9 0.573 0.434 1.321 0.186 0.161 0.161
    UP10 -0.152 0.440 -0.346 0.730 -0.043 -0.043
    GH =~
    GH1 1.000 0.123 0.123
    GH2 -0.287 1.089 -0.264 0.792 -0.035 -0.035
    GH4 0.118 1.093 0.108 0.914 0.015 0.015
    GH5 -4.896 6.222 -0.787 0.431 -0.603 -0.603
    GH6 0.372 1.236 0.301 0.764 0.046 0.046
    GH8 -3.374 4.220 -0.800 0.424 -0.416 -0.416
    GH9 4.041 5.022 0.805 0.421 0.498 0.498
    UT =~
    UT1 1.000 0.036 0.384
    UT2 1.021 0.165 6.173 0.000 0.037 0.539
    UT4 1.017 0.159 6.377 0.000 0.037 0.630
    UT5 0.624 0.110 5.653 0.000 0.023 0.507
    UT6 0.666 0.150 4.432 0.000 0.024 0.416
    UT8 0.660 0.109 6.066 0.000 0.024 0.481
    UT9 0.734 0.158 4.649 0.000 0.027 0.479

    Covariances:
    Estimate Std.Err z-value P(>|z|) Std.lv Std.all
    Gv ~~
    VZ 0.000 0.000 0.000
    GH 0.000 0.000 0.000
    UT 0.000 0.000 0.000
    VZ ~~
    GH 0.000 0.000 0.000
    UT 0.000 0.000 0.000
    GH ~~
    UT 0.000 0.000 0.000

    Intercepts:
    Estimate Std.Err z-value P(>|z|) Std.lv Std.all
    .UT1 0.140 0.007 19.714 0.000 0.140 1.478
    .UT2 0.152 0.004 34.706 0.000 0.152 2.207
    .UT4 0.096 0.004 24.509 0.000 0.096 1.634
    .UT5 0.069 0.003 23.861 0.000 0.069 1.545
    .UT6 0.092 0.004 24.624 0.000 0.092 1.579
    .UT8 0.075 0.003 23.116 0.000 0.075 1.508
    .UT9 0.089 0.004 24.960 0.000 0.089 1.599

    Thresholds:
    Estimate Std.Err z-value P(>|z|) Std.lv Std.all
    D1|t1 -1.240 0.106 -11.705 0.000 -1.240 -1.240
    D4|t1 -1.198 0.104 -11.536 0.000 -1.198 -1.198
    D5|t1 -0.748 0.088 -8.516 0.000 -0.748 -0.748
    D6|t1 -0.933 0.093 -10.017 0.000 -0.933 -0.933
    D7|t1 0.015 0.079 0.189 0.850 0.015 0.015
    D8|t1 -0.278 0.080 -3.460 0.001 -0.278 -0.278
    D9|t1 0.005 0.079 0.063 0.950 0.005 0.005
    D10|t1 -0.206 0.080 -2.581 0.010 -0.206 -0.206
    GT1|t1 0.299 0.081 3.711 0.000 0.299 0.299
    GT2|t1 0.563 0.084 6.698 0.000 0.563 0.563
    GT3|t1 0.438 0.082 5.336 0.000 0.438 0.438
    GT5|t1 0.362 0.081 4.463 0.000 0.362 0.362
    GT6|t1 0.610 0.085 7.188 0.000 0.610 0.610
    GT8|t1 1.101 0.099 11.070 0.000 1.101 1.101
    GT9|t1 0.888 0.092 9.680 0.000 0.888 0.888
    UP1|t1 -0.516 0.083 -6.204 0.000 -0.516 -0.516
    UP2|t1 0.065 0.079 0.819 0.413 0.065 0.065
    UP3|t1 -0.035 0.079 -0.441 0.659 -0.035 -0.035
    UP4|t1 0.145 0.080 1.826 0.068 0.145 0.145
    UP5|t1 0.166 0.080 2.078 0.038 0.166 0.166
    UP9|t1 0.166 0.080 2.078 0.038 0.166 0.166
    UP10|t1 0.247 0.080 3.084 0.002 0.247 0.247
    GH1|t1 0.341 0.081 4.212 0.000 0.341 0.341
    GH2|t1 -0.085 0.079 -1.071 0.284 -0.085 -0.085
    GH4|t1 -0.352 0.081 -4.337 0.000 -0.352 -0.352
    GH5|t1 -0.309 0.081 -3.837 0.000 -0.309 -0.309
    GH6|t1 -0.671 0.086 -7.796 0.000 -0.671 -0.671
    GH8|t1 -0.125 0.079 -1.574 0.115 -0.125 -0.125
    GH9|t1 -1.240 0.106 -11.705 0.000 -1.240 -1.240

    Variances:
    Estimate Std.Err z-value P(>|z|) Std.lv Std.all
    .D1 0.540 0.540 0.540
    .D4 0.626 0.626 0.626
    .D5 0.647 0.647 0.647
    .D6 0.848 0.848 0.848
    .D7 0.708 0.708 0.708
    .D8 0.792 0.792 0.792
    .D9 0.783 0.783 0.783
    .D10 0.797 0.797 0.797
    .GT1 0.906 0.906 0.906
    .GT2 0.977 0.977 0.977
    .GT3 0.967 0.967 0.967
    .GT5 0.854 0.854 0.854
    .GT6 0.909 0.909 0.909
    .GT8 0.848 0.848 0.848
    .GT9 0.929 0.929 0.929
    .UP1 0.949 0.949 0.949
    .UP2 0.809 0.809 0.809
    .UP3 0.841 0.841 0.841
    .UP4 0.928 0.928 0.928
    .UP5 0.894 0.894 0.894
    .UP9 0.642 0.642 0.642
    .UP10 0.855 0.855 0.855
    .GH1 0.836 0.836 0.836
    .GH2 0.982 0.982 0.982
    .GH4 0.972 0.972 0.972
    .GH5 0.600 0.600 0.600
    .GH6 0.825 0.825 0.825
    .GH8 0.782 0.782 0.782
    .GH9 0.475 0.475 0.475
    .UT1 0.006 0.000 17.955 0.000 0.006 0.683
    .UT2 0.003 0.000 11.267 0.000 0.003 0.597
    .UT4 0.002 0.000 9.526 0.000 0.002 0.472
    .UT5 0.001 0.000 12.123 0.000 0.001 0.597
    .UT6 0.002 0.000 10.679 0.000 0.002 0.658
    .UT8 0.001 0.000 12.588 0.000 0.001 0.555
    .UT9 0.002 0.000 10.268 0.000 0.002 0.737
    Gv 0.382 0.148 2.571 0.010 1.000 1.000
    VZ 0.079 0.074 1.069 0.285 1.000 1.000
    GH 0.015 0.036 0.425 0.671 1.000 1.000
    UT 0.001 0.000 3.270 0.001 1.000 1.000
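
    Not part of the original post, but a hedged follow-up sketch: when a bifactor model fits almost perfectly yet the group-factor loadings collapse, one common check is whether a plain one-factor model does just as well. This reuses the hcvpt data and ordinal_items vector from the call above.

    # Hedged sketch: general-factor-only model for comparison with the bifactor fit3u
    model_g <- '
    Gv =~ D1 + D4 + D5 + D6 + D7 + D8 + D9 + D10 +
      GT1 + GT2 + GT3 + GT5 + GT6 + GT8 + GT9 +
      UP1 + UP2 + UP3 + UP4 + UP5 + UP9 + UP10 +
      GH1 + GH2 + GH4 + GH5 + GH6 + GH8 + GH9 +
      UT1 + UT2 + UT4 + UT5 + UT6 + UT8 + UT9
    '
    fit_g <- cfa(model_g, data = hcvpt, estimator = "WLSMV", ordered = ordinal_items)

    lavTestLRT(fit_g, fit3u)                                     # scaled chi-square difference test
    fitMeasures(fit_g, c("cfi.scaled", "rmsea.scaled", "srmr"))  # compare against the bifactor fit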

    0 Comments
    2024/12/01
    21:24 UTC

    1

    [Q] Help needed for interpreting test results


    I'm reading a medical research paper, and they have a table with results of the tests used to evaluate associations between some clinical variables and an outcome. Depending on the variable type, they used the chi-squared / independent t / Fisher's exact / Mann-Whitney test. Let's say there are about 100 patients, 20 of whom have diabetes. The outcome of interest is heart attack.

    The table has this format:
    Variable name: Diabetes
    Outcome (positive): diabetic patients who had a heart attack - 15
    Outcome (absent): diabetic patients who didn't have a heart attack - 5
    P value: <0.05

    AFAIK these tests only evaluate whether diabetes and heart attacks are associated. These tests can't establish causation. Neither can they determine the direction or size of association.

    In this paper, they interpreted it in these ways (there are several variables; I'm only using one here):

    1. Patients with diabetes were more likely to suffer from a heart attack.

    2. Diabetes increased the risk of heart attack.

    3. Having diabetes has been associated with increased risk of heart attack.

    4. Having diabetes is strongly related to occurrence of heart attack.

    I think any interpretation other than "diabetes is associated with occurrence of heart attack" is incorrect.

    I'm not a math or stats expert so I'd appreciate simple answers. 😅 Also, if someone could please suggest a few reliable and free online resources to understand this better...
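
    As a hedged illustration of where effect size and direction can come from: the 2x2 table itself yields an odds ratio. The diabetic row (15/5) is from the post; the non-diabetic row below is made up purely for illustration.

    # Hypothetical 2x2 table: rows = diabetes yes/no, columns = heart attack yes/no
    tab <- matrix(c(15,  5,     # diabetic patients (counts from the post)
                    25, 55),    # non-diabetic patients (made-up counts)
                  nrow = 2, byrow = TRUE,
                  dimnames = list(diabetes = c("yes", "no"),
                                  heart_attack = c("yes", "no")))

    fisher.test(tab)   # p-value plus an odds ratio with a 95% CI (size and direction)
    chisq.test(tab)    # tests association only; gives no effect size by itself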

    2 Comments
    2024/12/01
    19:33 UTC

    1

    [Q] If one were to display two 50% chances that don’t add up to a guarantee, how would you display that in a more accurate manner?

    So like, event happens > two independent 50% chances > not guaranteed that one of them will actually happen. What’s a more accurate percentage? Is there one? Thanks!
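
    A quick sanity check, assuming the two 50% chances are independent and "it happens" means at least one of them fires:

    1 - (1 - 0.5) * (1 - 0.5)   # P(at least one of two independent 50% events) = 0.75, i.e. 75%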

    1 Comment
    2024/12/01
    18:55 UTC

    1

    [Question] Need some help with complex dice probabilities

    I apparently don't know enough about advanced statistics to even find what function(s) are needed to do this, so I was hoping I could get some assistance here. Maybe it's too complex. I want to calculate the probability of all the outcomes for rolling a set of dice for a wargame (Star Wars Legion).

    The dice have four possible outcomes and there are three different types of dice that have different probabilities of each outcome. After hours of scouring the web and various mathematics sites, the best I could come up with was a polynomial to describe each die:

    DIE 1 = 0.625a + 0.125b + 0.125c + 0.125d

    DIE 2 = 0.375a + 0.375b + 0.125c + 0.125d

    DIE 3 = 0.125a + 0.625b + 0.125c + 0.125d

    And multiplying those polynomials:

    (.625a + .125b + .125c + .125d)^(2) * (.375a + .375b + .125c + .125d)^(2)

    Because expanding the formula will give me the probability of each of the 35 possible combinations:

    0.0531879 a^4 + 0.127997 a^3 b + 0.0570797 a^3 c + 0.0570797 a^3 d + 0.0986273 a^2 b^2 + 0.0975094 a^2 b c + 0.0975094 a^2 b d + 0.0225211 a^2 c^2 + 0.0450422 a^2 c d + 0.0225211 a^2 d^2 + 0.0260156 a b^3 + 0.0462891 a b^2 c + 0.0462891 a b^2 d + 0.0241406 a b c^2 + 0.0482813 a b c d + 0.0241406 a b d^2 + 0.00386719 a c^3 + 0.0116016 a c^2 d + 0.0116016 a c d^2 + 0.00386719 a d^3 + 0.00219727 b^4 + 0.00585938 b^3 c + 0.00585938 b^3 d + 0.00537109 b^2 c^2 + 0.0107422 b^2 c d + 0.00537109 b^2 d^2 + 0.00195313 b c^3 + 0.00585938 b c^2 d + 0.00585938 b c d^2 + 0.00195313 b d^3 + 0.000244141 c^4 + 0.000976563 c^3 d + 0.00146484 c^2 d^2 + 0.000976563 c d^3 + 0.000244141 d^4

    I can then manually manipulate this data and put the results in a spreadsheet or something to get a full table of probabilities. The problem is that making that table is very tedious, and doing all combinations would become very large. Are there better functions I could use to calculate the probabilities of each combination of roll results, and/or does anyone know a good way to dynamically generate a table with the results?
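
    A hedged sketch of one way to automate what the polynomial expansion is doing: enumerate every combination of faces for the pool, multiply the per-die probabilities, and collapse outcomes that contain the same multiset of symbols. The die probabilities are the ones given above; the pool matches the two-plus-two example.

    faces <- c("a", "b", "c", "d")
    die1  <- c(a = 0.625, b = 0.125, c = 0.125, d = 0.125)
    die2  <- c(a = 0.375, b = 0.375, c = 0.125, d = 0.125)
    die3  <- c(a = 0.125, b = 0.625, c = 0.125, d = 0.125)   # defined for completeness, unused in this pool

    pool <- list(die1, die1, die2, die2)   # two of DIE 1 and two of DIE 2, as in the example

    # Every ordered outcome of the pool, with its probability
    grid <- expand.grid(rep(list(faces), length(pool)), stringsAsFactors = FALSE)
    prob <- apply(grid, 1, function(row) prod(mapply(function(d, f) d[[f]], pool, row)))

    # Collapse outcomes with the same counts of a/b/c/d (order of the dice doesn't matter)
    key <- apply(grid, 1, function(row) paste(sort(row), collapse = ""))
    out <- aggregate(prob, by = list(result = key), FUN = sum)
    names(out)[2] <- "probability"
    out[order(-out$probability), ]         # 35 rows for this pool; probabilities sum to 1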

    2 Comments
    2024/12/01
    18:27 UTC

    1

    [Q] Is fisher t distribution a thing?

    It's mentioned in my university book and there are a few articles online, but there's no Wikipedia page for it, even though there is one titled Student's t distribution. What really confused me was that when I asked ChatGPT about Fisher's t, it told me about the F distribution and said there's no such thing as Fisher's t.

    9 Comments
    2024/12/01
    18:16 UTC

    5

    [Q] Any source for statistics talks?

    Hi,

    Does anyone know a good place to find videos of statistics talks/discussions/seminars? I just found that the Simons Institute is a good one, but I wonder if there are other similar places? It would be better if it covers more recent research topics.

    Thanks

    0 Comments
    2024/12/01
    17:49 UTC

    3

    [Q] Is this a solid statistical background for research?

    I've been putting a lot of thought into my undergraduate degree because I won't be pursuing a master's or PhD degree in statistics. My ultimate goal is to try for med school, so I'm doing a statistics bachelor's and a minor in biology for pre-med (i.e., I will have my associate degree in chemistry soon). I'm passionate about research and economics but am having trouble choosing well-rounded electives. This is what I have so far. Would you change anything? Should I swap econometrics for game theory or optimization?

    Calculus Series

    Introduction to Probability

    Statistical Theory

    Fundamentals of Computing

    Experimental Design

    Applied regression

    Real Analysis 1 (I heard that, while not directly applied, it helps with evaluating/validating statistical methods/models)

    Linear Algebra

    Computational Bayesian Statistics

    Time Series

    Statistical Methods and Applications 1; R-language:
    discrete and continuous probability distributions, expectation, laws of large numbers, central limit theorem, statistical parameter estimation, hypothesis testing, and regression analysis

    Statistical Methods and Application 2; R-language:
    modern regression analysis, analysis of variance (ANOVA), experimental design, nonparametric methods, and an introduction to Bayesian data analysis

    Object Oriented Programming

    Data Structures and Program Design

    Econometrics 2

    Edit:

    So far, these are the ones I'm considering swapping:

    OOP - > Introduction to Optimization

    Data Structures and Program Design - > Applied Graph Theory

    Real Analysis 1 - > Discrete Structures

    3 Comments
    2024/12/01
    16:29 UTC

    74

    [D] I am the one who got the statistics world to change the interpretation of kurtosis from "peakedness" to "tailedness." AMA.

    As the title says.

    38 Comments
    2024/12/01
    15:19 UTC

    1

    [Q] Statistical analysis of experiment results

    Assume I am running an experiment with three independent trials (entire experiment is done from start to finish three times). In this experiment, we create a solution of yeast in water in a beaker. We prepare two separate test tubes with YPD (a growth medium) and transfer some of the yeast solution to both test tubes. In one of the test tubes (the experimental tube), we add a small amount of anabaena, with the other test tube serving as a control. We record the cell count (of the yeast) for both tubes. We incubate both tubes for 48 hours and do another cell count.

    As mentioned earlier, this experiment is repeated from start to finish three separate times. I’m not a statistician and I’m having a hard time figuring out how to go about statistical analysis: should I test for normality? How do I test for normality? Which tests should I perform? Why? Etc. The general question being: does the presence of anabaena have a statistically significant effect on the population/growth of yeast. Our data consists of three separate trials, each trial having a before and after (incubation) cell count of both the experimental and control test tubes.

    I was told by my professor that I'd most likely want to perform a Mann-Whitney U test, but the more I looked into why, the more confused I got about the various types of tests and the circumstances in which they are used. Is my data normal? Dependent or independent? (I would say the separate trials are independent, but are we considering the data between trials or within each trial?) Is the data paired? (I'm not even sure what that means or applies to, if I'm being honest.) Etc.
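
    Purely as a sketch of what the suggested Mann-Whitney U test looks like in R, using hypothetical per-trial growth values (after-count divided by before-count) and making no claim about whether it is the right choice with only three trials:

    growth_anabaena <- c(2.1, 1.8, 2.4)   # hypothetical growth ratios, experimental tubes
    growth_control  <- c(3.0, 2.7, 3.3)   # hypothetical growth ratios, control tubes

    wilcox.test(growth_anabaena, growth_control)                 # unpaired Mann-Whitney U
    wilcox.test(growth_anabaena, growth_control, paired = TRUE)  # treating each trial as a pair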

    8 Comments
    2024/12/01
    14:56 UTC

    2

    [Q] Daily averaged time series comparison -Linking plankton and aerosols emissions?

    Hi everyone, so we have this dataset of daily averaged phytoplankton time series over a full year: coccolithophores, chlorophytes, cyanobacteria, diatoms, dinoflagellates, Phaeocystis, zooplankton.
    Then we have atmospheric measurements on the same time intervals for a few aerosol species: methanesulphonic acid, carboxylic acids, aliphatics, sulphates, ammonium, nitrates, etc.
    Our goal is to establish all the possible links between plankton types and aerosols; we want to find out which plankton types matter the most for a given aerosol species.

    So here is my question: which mathematical tools would you use to build a model with these (nonlinear) time series? Random forests, cross-wavelets, transfer entropy, fractal analysis, chaos theory, Bayesian statistics? The thing that puzzles me most is that we know there is a lag between the plankton bloom and aerosols eventually forming in the atmosphere; it can take weeks for a bloom to trigger aerosol formation. So far many studies have just used lagged Pearson's correlation, which I am not too happy with, as correlation really isn't reliable. Would you know of any advanced methods to find the optimal lag? What would be the best approach in your opinion?
    I would really appreciate any ideas, so please don't hesitate to write down yours and I'd be happy to debate it. Have a nice Sunday, cheers :)
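
    One hedged baseline for the lag question (it is still linear correlation, so only a starting point before the nonlinear methods listed above): scan a whole range of lags with a cross-correlation function. diatoms and msa below stand in for any one plankton series and one aerosol series.

    cc <- ccf(diatoms, msa, lag.max = 60, na.action = na.pass)   # correlations at lags -60..60 days
    best_lag <- cc$lag[which.max(abs(cc$acf))]                   # lag with the strongest correlation
    best_lag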

    1 Comment
    2024/12/01
    10:14 UTC

    11

    [Q] Forgot much about calculus-based probability. Review or move onto measure-theoretic probability?

    I am looking to do research in theoretical statistics and machine learning, so I am trying to build solid quantitative foundations to start reading papers. One problem is, I don't remember much from the calculus-based probability course I took almost four years ago besides basic concepts at a high level. At this point, should I review undergraduate-level probability (using textbooks like Ross or Blitzstein) or move on to measure-theoretic probability (Billingsley or Durrett), given that I have already taken real analysis and got a brief taste of measure theory? Also, what do you think is the point/value of calculus-based probability for researchers if measure-theoretic probability is a more general treatment?

    4 Comments
    2024/12/01
    07:20 UTC

    20

    [Q] Any advice for an engineering student who is completely lost in Statistics?

    Not sure if this is appropriate for this sub, this is not a homework question, but rather me asking for advice. I am completely lost in my statistics course and my final is soon! It’s calculus based, and I’m specifically having trouble with random variables, parameter estimation, and hypothesis testing.

    Edit: I'm in a second-year stats course, and I'm having trouble with things like distributions other than the normal, as well as method of moments, maximum likelihood, and least squares error.

    10 Comments
    2024/12/01
    02:44 UTC

    2

    [Q] Have I Calculated This Correctly? Unusual Results in Ultimate Texas Hold'em

    As someone who has played hundreds of thousands of poker hands, primarily in MTT and cash games, I have a solid understanding of both the game, its variants, and probability calculations. Recently, while playing Ultimate Texas Hold'em on Betsafe, I experienced something highly unusual during a session of 278 hands.

    To add to this, I have photos of every single hand I played during this session, so I can back up everything I am describing here with concrete evidence.

    Before diving into the details, here’s a quick recap of the game’s rules for context:

    ## Rules Recap for Ultimate Texas Hold'em

    • Both the dealer and the player receive two cards, and the best five-card hand is made using five community cards.
    • The dealer must qualify with at least a pair for the Ante bet to remain in play. If the dealer does not qualify, the Ante is returned to the player.
    • Players can bet up to 4x their Ante pre-flop, 2x after the flop, or 1x after all community cards are revealed.

    ## What Happened

    Here’s what I experienced during my session of 278 hands:

    • The dealer failed to qualify in only 36 hands.
    • The dealer hit a straight or better in 39 hands.

    These results felt highly improbable to me, so I decided to calculate the likelihood of each outcome and their combined probability.

    ## Probability and Odds

    Dealer fails to qualify in 36 or fewer hands:
    Statistically, the dealer should fail to qualify in around 55–58 hands (20% chance per hand).
    Actual result: 36 hands.
    Probability: 0.137% (odds of 1 in 729).
    Fun fact: This is just as likely as dealing yourself a pocket pair of aces 3 times in a row in standard Texas Hold’em.

    Dealer hits a straight or better in 39 or more hands:
    Statistically, the dealer should hit a straight or better in about 23–24 hands (8.5% chance per hand).
    Actual result: 39 hands.
    Probability: 0.143% (odds of 1 in 698).
    Fun fact: This is about as likely as rolling a perfect Yahtzee (five of a kind with dice) two times in a row.

    Both events occurring in the same session:
    Assuming the events are independent, I calculated their combined probability.
    Combined probability: 0.000196% (odds of 1 in 5,092,688).
    Fun fact: This is roughly as likely as being dealt a royal flush twice in a row in standard Texas Hold'em.

    ## My Questions

    I’m sharing this here because I’d like to confirm whether I’ve thought about and calculated these probabilities correctly.

    Are there better methods or approaches I could use to calculate these kinds of probabilities?
    Does the assumption of independence between these two events hold, or could there be some interaction I’m not considering?
    Are there any nuances I might have missed that could improve the accuracy of these calculations?
    I’m not accusing anyone of wrongdoing; my goal is purely to understand whether my statistical approach makes sense and to reflect on how extreme variance can sometimes appear in these games.

    Thanks in advance for your thoughts and feedback!
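
    A hedged way to reproduce (or check) the two tail probabilities above with plain binomial calculations, using the per-hand rates quoted and treating the 278 hands as independent trials:

    n <- 278
    pbinom(36, size = n, prob = 0.20)                       # P(dealer fails to qualify in 36 or fewer hands)
    pbinom(38, size = n, prob = 0.085, lower.tail = FALSE)  # P(dealer hits a straight or better in 39+ hands)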

    3 Comments
    2024/12/01
    00:11 UTC

    1

    [Q] How do i analyse questionnaire results?

    Hey,

    While working on my thesis, I released a questionnaire to the public. It has 30 questions, with single-choice or open answers. 140 people responded, and I was not expecting that many replies.

    I've exported the results into Excel, which left me with quite messy sheets. The first row is the question, the second is the possible answers, then all the respondents. Every possible answer is a separate column, and an answer is marked by a 1 in a cell, leaving all the others empty.

    My mentor said that I should use basic descriptive analysis, 95% confidence levels (CL95%), and chi-square with df. And that's where I ran into issues.

    So when simplifying the answers, I get, for example: Question 3, 17 chose A, 33 chose B, 89 chose C, 1 did not respond. I'm trying to use Excel's data analysis functions, but I keep getting errors. I tried looking on YouTube for help, but in every video they use those tools for 10+ numbers, not just 2-5 like in my case.

    What am I doing wrong? Did I misunderstand my mentor and need to do a different kind of analysis? I know for sure they mentioned CL and chi-square.

    I also tried using SPSS and R, but I couldn't even import the data properly, lol.

    Any tips will be greatly appreciated!!
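
    A minimal sketch, assuming the mentor meant a goodness-of-fit chi-square on the observed counts (df = number of answer options minus 1) plus a 95% confidence interval per answer proportion; the counts are the Question 3 example above.

    counts <- c(A = 17, B = 33, C = 89)    # Question 3 from the post (non-responses excluded)
    chisq.test(counts)                     # goodness-of-fit chi-square with df = 2
    prop.test(counts["C"], sum(counts))    # 95% CI for the proportion choosing C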

    4 Comments
    2024/11/30
    19:06 UTC

    2

    [Q] How do I calculate the Bayes Factor for this?

    I have this model in R:

    Update ~ Valence*Accuracy + EstErr + Vividness + Prior + (1|subn)

    And I'm interested in the Bayes Factor for the interaction term. Claude gave me this solution but I don't know if it's legit:

    Run brms on a model with the interaction term, and another on a model without it

    Then calculate marginal likelihoods with the bridge sampler for both of these, and compute the Bayes factor from those.

    The result makes sense, but I want to make sure, and I need a reference for it (Claude gave this https://www.jstatsoft.org/article/view/v092i10)
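
    A hedged sketch of the two-model procedure described above, using brms with bridge sampling; dat is a stand-in for the real data frame, and the argument names should be checked against the brms documentation before relying on this.

    library(brms)

    full    <- brm(Update ~ Valence * Accuracy + EstErr + Vividness + Prior + (1 | subn),
                   data = dat, save_pars = save_pars(all = TRUE))
    reduced <- brm(Update ~ Valence + Accuracy + EstErr + Vividness + Prior + (1 | subn),
                   data = dat, save_pars = save_pars(all = TRUE))

    bayes_factor(full, reduced)   # wraps bridgesampling::bridge_sampler() for both fits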

    4 Comments
    2024/11/30
    12:30 UTC

    6

    [Q] more bias in regression coefficients after Entropy Balancing?

    I have survey respondents, which constitute a sample of the population from which I'd like to draw inferences. I used entropy balancing (EB) to adjust for non-response using the WeightIt package in R.

    I've read that EB qualifies as an "equal percent bias reducing" matching procedure, i.e., it guarantees that bias will be lower after matching. However, when I check the regression coefficients from a model fitted to the sample with the weights obtained from EB, they are less similar to the population coefficients. Is this possible, or is it likely that I've made a huge error in my specification of the EB?

    5 Comments
    2024/11/30
    08:45 UTC

    0

    [Education] Why should I learn statistics for DS/ML

    I started learning mathematics for DS/ML two months ago and found myself tangled up in it.

    I decided to unlearn it and start fresh. Please recommend YouTube playlists/notes for me.

    Thank you for reading. I'd be glad if you respond 🫶

    2 Comments
    2024/11/30
    08:30 UTC

    0

    [R] Sex differences in the water level task on college students

    I took 3 hours one Friday on my campus asking college students to take the water level task, where the goal is for the subject to understand that water is always parallel to the earth. Results are below. The null hypothesis was that the population proportions were the same; the alternative was that men outperform women.

    | | True/Pass | False/Fail | Total |
    |---|---|---|---|
    | Male | 27 | 15 | 42 |
    | Female | 23 | 17 | 40 |
    | Total | 50 | 33 | 82 |

    p-hat 1 = 64% | p-hat 2 = 58% | Alpha/significance level= .05

    p-pooled = 61%

    z=.63

    p-value=.27

    p=.27>.05

    At the 5% significance level, we fail to reject the null hypothesis. This data set does not suggest that men significantly outperform women on this task.

    This was on a liberal arts campus, if anyone thinks that's relevant.
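
    As a hedged cross-check of the hand calculation, the same one-sided two-proportion test can be run in R; without continuity correction, prop.test's chi-square statistic is the square of the z above.

    # 27/42 males vs 23/40 females passing; one-sided alternative that males do better
    prop.test(x = c(27, 23), n = c(42, 40), alternative = "greater", correct = FALSE)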

    17 Comments
    2024/11/30
    05:25 UTC

    2

    [Q] how do I calculate the odds?

    Here’s basically what happened: I was given a list of 30 items. I had to guess which 2 items would be randomly selected based on a random number generator. I guessed both correctly.

    We ran the experiment again and again I guessed the 2 items correctly.

    What are the odds here and how do I calculate them?
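
    One hedged way to put a number on it, assuming each run independently selects 2 of the same 30 items and an unordered guess has to match both:

    p_once  <- 1 / choose(30, 2)   # one run: 1 in 435
    p_twice <- p_once^2            # both runs: 1 in 189,225
    c(p_once, p_twice)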

    10 Comments
    2024/11/30
    00:25 UTC

    2

    [Q] Visualizing Win Ratio

    I’m analyzing a clinical trial using Win Ratio as the primary outcome.

    More about Win Ratio https://pubmed.ncbi.nlm.nih.gov/21900289/#:~:text=The%20win%20ratio%20is%20the,win%20ratio%20are%20readily%20obtained.

    Is there a visual way to show the data? It is normally reported as a line in the text or a data table. Would be nice to have a pretty figure for a presentation/poster.

    Thank you

    0 Comments
    2024/11/30
    00:16 UTC

    2

    [Q] Static variable and time series variable tables in RFM

    I am creating a prediction model using random forest, but I don't understand how the model and script would handle both tables loaded in as data frames.

    What's the best way to use multiple tables with a Random Forest model when one table has static attributes (like food characteristics) and the other has dynamic factors (like daily health habits)?

    Example: I want to predict stomach aches based on both the food I eat (unchanging) and daily factors (sleep, water intake).

    Tables:

    • Static: Food name, calories, meat (yes/no)
    • Dynamic: Day number, good sleep (yes/no), drank water (yes/no)

    How to combine these tables in a Random Forest model? Should they be merged on a unique identifier like "Day number"?
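
    A hedged sketch of the merge-then-fit idea, assuming the dynamic table also records which food was eaten each day (a hypothetical food column shared by both tables) so the two tables have a common key; all column names below are made up.

    library(randomForest)

    merged <- merge(dynamic_tbl, static_tbl, by = "food")   # one row per day, static + dynamic columns

    fit <- randomForest(factor(stomach_ache) ~ calories + meat + good_sleep + drank_water,
                        data = merged)
    fit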

    0 Comments
    2024/11/29
    19:06 UTC

    3

    [Q] Help choosing the appropriate statistical test for ranking data

    Hello all, I have a dataset containing rankings made by construction professionals. They were asked to rank different materials (e.g., wood, concrete, reinforced concrete) in order of preference for a specific task. I would now like to analyze the ranking of those materials against other variables, such as age class. As the materials have a strong effect on the ranking, what should I use to analyze the other variables? I am working in R and was heading toward a Kruskal-Wallis test since I have ordinal data, run like the following:

    kruskal.test(RANKING ~ interaction(AGE_CLASS, MATERIAL))   # note: "AGE CLASS" needs a syntactically valid name, e.g. AGE_CLASS

    Sorry if the question seems dumb, I haven't practiced in a long while. I can provide sample data if necessary.

    3 Comments
    2024/11/29
    16:50 UTC

    0

    [Q] Need help with complicated statistics

    Hi all, I was hoping to get some advice on a statistical problem I've been struggling with. The data is honestly a mess, and I would personally prefer to limit myself to descriptive statistics. But as some of you might know, most journals require a significance value. So, here is a description of my data:

    I have multiple groups of animals that received different treatments. The groups are not equally sized (the control group has 2 animals, the others have 4). For each animal, I have 10 measurements of counts. These measurements have a large variation, which is expected. The main problem is that of these 10 measurements, some are not usable due to a variety of issues. This leaves me with both unequal groups and unequal numbers of measurements within each group/animal.

    I honestly have no clue how to do a proper statistical test on this data. As a first attempt I fit a generalised linear model with a Poisson distribution, but I'm not sure if this properly accounts for the missing data. Any type of data selection would have huge bias, because the measurements are quite far apart. ChatGPT+Wolfram suggests a mixed model with the following code (I use R): model_poisson <- glmer(Count ~ Group + (1 | Group/Animal), data = data, family = poisson(link = "log")). Would this be correct? Is there any way I can improve on this method or check assumptions?

    Any help is greatly appreciated and I would be happy to provide more context (although the data itself is unfortunately still confidential)
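
    On the assumption-checking part of the question, one hedged follow-up (it says nothing about whether the random-effects structure is right) is an overdispersion check on the fitted Poisson GLMM:

    # Assumes model_poisson has been fitted with lme4::glmer as quoted above
    library(performance)                  # provides check_overdispersion()
    check_overdispersion(model_poisson)   # dispersion clearly above 1 suggests the Poisson is too restrictive
    # If overdispersed, lme4::glmer.nb() fits the same formula with a negative binomial family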

    7 Comments
    2024/11/29
    13:08 UTC

    31

    [E] Poisson Distribution - Explained

    Hi there,

    I've created a video here where I talk about the Poisson distribution and how it arises as a limiting case of the Binomial distribution when the probability of success tends to 0 and the number of trials tends to infinity.

    I hope it may be of use to some of you out there. Feedback is more than welcomed! :)
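
    A quick numerical illustration of the limit described above: for large n and small p = lambda / n, the Binomial(n, p) probabilities approach Poisson(lambda).

    lambda <- 3
    n <- 10000
    k <- 0:10
    round(cbind(binomial = dbinom(k, n, lambda / n),
                poisson  = dpois(k, lambda)), 5)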

    3 Comments
    2024/11/29
    10:43 UTC

    8

    [Career] Unsure about career trajectory…any advice would be greatly appreciated

    Hi, I’ve been having a sort of quarter life crisis and would appreciate some guidance!

    I’m a MPH grad (epidemiology/biostatistics emphasis) and have hopped around a few jobs. Started off as an epidemiologist, then went into a statistician role at a university, and now working as a biostatistician at a medical device company.

    In both my statistician and biostatistician roles, I don’t find myself doing tons of actual statistical analyses. I thought I’d be learning how to do complex modeling or running a bunch of statistical tests day in and day out, but that’s not really the case.

    In my current role (biostatistician), I find a lot of it is producing listings and tables (summary statistics, like means or frequencies) for reports, and then working with a larger team to produce said reports for studies. I have dabbled in some hypothesis testing (t-tests, ANOVA) and simulations, but nothing extreme.

    I don’t necessarily hate that I’m not working with more complex statistical work - I actually enjoy using SAS and developing programs. But I’m worried about how this will set me up for success long-term.

    I’m thinking about career progression - while I’m not looking to leave my role anytime soon, does it make sense to continue looking for biostatistician roles in the future? Is this pretty common in this field, or is my current job more of an anomaly? If this isn’t common, are there other job titles that may be better suited for the type of work I’m doing (simpler statistics, developing programs in SAS, producing listings and tables for reports, etc.)?

    Thanks in advance!

    5 Comments
    2024/11/29
    05:52 UTC

    1

    [Q] Standard Deviation in Combined Data

    I am working on a stats project and have a question surrounding standard deviation. My project is as follows:

    People in a competition receive marks from multiple judges, let's say 5. These scores are used to make the final placements. I take the difference between each judge's individual score and the final placement. Now I have 5 sets of numbers with differences. I then take the standard deviation of each to get an understanding of how varied a judge's results were, on the whole, from the real results. Here is where my question is.

    The above process gives me an understanding of how varied an individual judge's scores are. I'm trying to capture the general "variance" across all the datasets. Would it be better to take the average of the five standard deviations, or is there a way I can take the union of the 5 sets and take the standard deviation of that?

    A few more pieces: the average of each judge's differences is always 0 (yes, I verified this mathematically). Also, the size of each of the 5 data sets is always the same. Any other questions I'm happy to answer.
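
    A small sketch of the two options, assuming diffs is a list of five numeric vectors (one per judge, each holding that judge's score-minus-placement differences):

    per_judge_sd <- sapply(diffs, sd)
    mean(per_judge_sd)      # average of the five standard deviations
    sd(unlist(diffs))       # standard deviation of the pooled (union) differences
    # With zero means and equal set sizes, the pooled SD is close to sqrt(mean(per_judge_sd^2)),
    # i.e. a root-mean-square of the SDs rather than their plain average.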

    1 Comment
    2024/11/29
    02:47 UTC

    1

    [Q] multiple imputation

    3 Comments
    2024/11/28
    20:08 UTC

    1

    [Q] Item response theory on a new cognitive test with both multiple-choice (dichotomous) and performance (continuous) items

    In a new cognitive test that I am developing, I was (and still am) planning to use a CFA model with WLSMV estimation. But I am intrigued by the potential benefits of IRT. Is it viable to use an IRT model in my situation?

    1 Comment
    2024/11/28
    15:11 UTC
