/r/statistics
/r/Statistics is going dark from June 12-14th as an act of protest against Reddit's treatment of 3rd party app developers.
This community will not grant access requests during the protest. Please do not message asking to be added to the subreddit.
Guidelines:
All Posts Require One of the Following Tags in the Post Title! If you do not tag your post, AutoModerator will delete it:
Tag | Abbreviation |
---|---|
[Research] | [R] |
[Software] | [S] |
[Question] | [Q] |
[Discussion] | [D] |
[Education] | [E] |
[Career] | [C] |
[Meta] | [M] |
This is not a subreddit for homework questions. They will be swiftly removed, so don't waste your time! Please kindly post those over at: r/homeworkhelp. Thank you.
Please try to keep submissions on topic and of high quality.
Just because it has a statistic in it doesn't make it statistics.
Memes and image macros are not acceptable forms of content.
Self posts with throwaway accounts will be deleted by AutoModerator
Useful resources for learning R:
r-bloggers - blog aggregator with statistics articles generally done with R software.
Quick-R - great R reference site.
Hi everyone, is there a non-parametric alternative to the Wilcoxon rank-sum test? My data tend to violate the assumption of equal variances, and my data are paired.
When you have a historical time series and you are trying to measure volatility in that time series, should you use the sample or the population SD? For context, an application might be using the volatility of a stock's historical price data to predict future prices.
I've been tossing it around in my head.
One view is that the historical data is just a sample of the theoretical "total population", which consists of all data in the past AND the future, even though we haven't seen the future data yet.
But another view is that the historical data is all the data that currently exists, and the future data doesn't exist yet, so what exists is the population, and therefore use the population SD.
Is there a correct answer here? For example, Morningstar switched from the population SD to the sample SD when calculating the Sharpe ratio in 2005, but I don't know whether that is because the sample SD is more correct or whether it's just a matter of interpretation.
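Whichever convention you pick, the numerical difference is usually tiny for long series, since the two estimators differ only in an n versus n − 1 divisor. A minimal R sketch of both (the `returns` vector is just a made-up illustration):

```r
# A minimal sketch; `returns` is a hypothetical vector of historical returns.
returns <- c(0.01, -0.02, 0.005, 0.013, -0.007)

n <- length(returns)
sd_sample     <- sd(returns)                               # divides by n - 1
sd_population <- sqrt(mean((returns - mean(returns))^2))   # divides by n

c(sample = sd_sample, population = sd_population,
  ratio = sd_population / sd_sample)                       # ratio is sqrt((n - 1) / n)
```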
Hi there,
I've created a video here where I talk about the L1 and L2 regularization, two techniques that help us in preventing overfitting, and explore the differences between them.
I hope it may be of use to some of you out there. Feedback is more than welcomed! :)
So, I have both model and ADCP time-series ocean current data at a specific point. I applied a 6-day moving median to the U and V components and then computed their correlation coefficients separately using the nancorrcoef function in MATLAB. The result was an unacceptable correlation coefficient for both U and V (R < 0.5).
My thesis adviser told me to do a 30-day moving median instead, and so I did. To my surprise, the R value of the U component improved (R > 0.5) but the V component decreased further (still R < 0.4, but lower). I reported this to my thesis adviser and she told me that the U and V R values should increase or decrease together when applying a moving median.
I want to ask you whether what she said is correct, or whether results like mine are possible. For example, could the U component improve because it is more attuned to lower-frequency variability (monthly oscillations), while V worsens because it is dominated by higher-frequency variability such as weekly oscillations?
Thank you very much and I hope you can help me!
P.S.: I already triple checked my code and it's not the problem.
Hey folks, I work with data and frequently have to check whether something is statistically significant at a specific confidence level, but I don't really know statistics that well. Usually I just open Evan Miller's chi-squared website and input the numbers, but right now I have a proportion bigger than 100% (more conversions than exposures), so that test does not work. How can I check whether one group is statistically better than the other in this case?
If it is needed, I have the data disaggregated (total conversions for each exposed customer, and the group that the customer belongs to).
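Because conversions can exceed exposures, the outcome is really a count per customer rather than a yes/no, so the disaggregated data are the natural thing to use. A minimal sketch, assuming a hypothetical data frame `d` with one row per exposed customer, a `conversions` count, and a `group` label; either a plain comparison of means or a count model works as a first pass:

```r
# A minimal sketch, assuming a hypothetical data frame `d` with one row per
# exposed customer: `conversions` (count, can exceed 1) and `group` (A/B).

# Simple comparison of mean conversions per customer between the two groups
t.test(conversions ~ group, data = d)

# Count-model alternative that respects the count nature of the outcome
summary(glm(conversions ~ group, data = d, family = poisson()))
```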
Hello everyone. I have a question regarding #weights - how do you assess them in measures, other than arbitrarily, of course? :) I'd be very thankful for proper sources, tutorials, papers that used some method, or step-by-step guidance.
Let's say I know theoretically that one item of my scale is more important, more differentiating, or more informative than another item. How do I decide how much more important it is? How do I know exactly what weight to give it?
Suppose I have the data. What analyses should I do? (I was thinking IRT, but still I am not sure what to do with obtained discrimination and information values to determine the weights).
Also, is the method different for weighting items in a scale, and subscales in general score?
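For the item-level case, one possible route (sketched below with the `mirt` package; the data frame `items` and the choice of a 2PL model are assumptions, not the only option) is to fit an IRT model and use the estimated discrimination parameters, or the item information, as empirical weights:

```r
# A minimal sketch, assuming a data frame `items` of scored item responses.
library(mirt)

fit <- mirt(items, model = 1, itemtype = "2PL")   # unidimensional 2PL model

# Discrimination (a1) parameters; one option is to normalize these and use them as weights
pars <- coef(fit, simplify = TRUE)$items
w <- pars[, "a1"] / sum(pars[, "a1"])
w
```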
Hey there!
I created a cognitive test with 40 true/false and 10 timed performance items, and I'm stuck on the CFA. I am using lavaan with WLSMV estimation and need help!
I used a bifactor model where 30 items load on VZ, 10 on GH, and 10 timed items on UT. I have nearly perfect fit indices, but many items load negatively or non-significantly on the sub-factors. Here are my final model and the results. What can I do to improve the model? Are those loadings OK?
# Bi Factor Model on Gv
model3u <- '
#general factor
Gv =~ D1 + D4 + D5 + D6 + D7 + D8 + D9 + D10 +
GT1 + GT2 + GT3 + GT5 + GT6 + GT8 + GT9 +
UP1 + UP2 + UP3 + UP4 + UP5 + UP9 + UP10 +
GH1 + GH2 + GH4 + GH5 + GH6 + GH8 + GH9 +
UT1 + UT2 + UT4 + UT5 + UT6 + UT8 + UT9
#group specific factors
VZ =~ D1 + D4 + D5 + D6 + D7 + D8 + D9 + D10 +
GT1 + GT2 + GT3 + GT5 + GT6 + GT8 + GT9 +
UP1 + UP2 + UP3 + UP4 + UP5 + UP9 + UP10
GH =~ GH1 + GH2 + GH4 + GH5 + GH6 + GH8 + GH9
UT =~ UT1 + UT2 + UT4 + UT5 + UT6 + UT8 + UT9
# Orthogonal constraints
Gv ~~ 0*VZ
Gv ~~ 0*GH
Gv ~~ 0*UT
VZ ~~ 0*GH
VZ ~~ 0*UT
GH ~~ 0*UT
'
fit3u <- cfa(model3u, data = hcvpt, estimator = "WLSMV", ordered = ordinal_items)
summary(fit3u, fit.measures = TRUE, standardized=TRUE)
lavaan 0.6-18 ended normally after 360 iterations
Estimator DWLS
Optimization method NLMINB
Number of model parameters 115
Number of observations 251
Model Test User Model:
Standard Scaled
Test Statistic 462.108 568.981
Degrees of freedom 558 558
P-value (Chi-square) 0.999 0.364
Scaling correction factor 1.553
Shift parameter 271.351
simple second-order correction
Model Test Baseline Model:
Test statistic 2422.928 1438.410
Degrees of freedom 630 630
P-value 0.000 0.000
Scaling correction factor 2.218
User Model versus Baseline Model:
Comparative Fit Index (CFI) 1.000 0.986
Tucker-Lewis Index (TLI) 1.060 0.985
Robust Comparative Fit Index (CFI) NA
Robust Tucker-Lewis Index (TLI) NA
Root Mean Square Error of Approximation:
RMSEA 0.000 0.009
90 Percent confidence interval - lower 0.000 0.000
90 Percent confidence interval - upper 0.000 0.023
P-value H_0: RMSEA <= 0.050 1.000 1.000
P-value H_0: RMSEA >= 0.080 0.000 0.000
Robust RMSEA NA
90 Percent confidence interval - lower NA
90 Percent confidence interval - upper NA
P-value H_0: Robust RMSEA <= 0.050 NA
P-value H_0: Robust RMSEA >= 0.080 NA
Standardized Root Mean Square Residual:
SRMR 0.086 0.086
Parameter Estimates:
Parameterization Delta
Standard errors Robust.sem
Information Expected
Information saturated (h1) model Unstructured
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
Gv =~
D1 1.000 0.618 0.618
D4 0.502 0.238 2.103 0.035 0.310 0.310
D5 0.570 0.200 2.854 0.004 0.352 0.352
D6 0.287 0.214 1.345 0.179 0.178 0.178
D7 0.647 0.185 3.491 0.000 0.400 0.400
D8 0.323 0.164 1.967 0.049 0.200 0.200
D9 0.245 0.173 1.414 0.157 0.151 0.151
D10 0.256 0.173 1.480 0.139 0.158 0.158
GT1 0.492 0.190 2.589 0.010 0.304 0.304
GT2 0.185 0.180 1.032 0.302 0.114 0.114
GT3 0.292 0.190 1.537 0.124 0.181 0.181
GT5 0.609 0.191 3.190 0.001 0.376 0.376
GT6 0.348 0.176 1.969 0.049 0.215 0.215
GT8 0.498 0.183 2.724 0.006 0.308 0.308
GT9 0.419 0.201 2.088 0.037 0.259 0.259
UP1 0.329 0.173 1.906 0.057 0.203 0.203
UP2 0.674 0.183 3.689 0.000 0.417 0.417
UP3 0.599 0.200 2.997 0.003 0.370 0.370
UP4 0.431 0.172 2.500 0.012 0.266 0.266
UP5 0.510 0.182 2.798 0.005 0.315 0.315
UP9 0.933 0.231 4.038 0.000 0.576 0.576
UP10 0.611 0.195 3.137 0.002 0.378 0.378
GH1 0.624 0.203 3.077 0.002 0.385 0.385
GH2 0.209 0.172 1.219 0.223 0.129 0.129
GH4 0.272 0.161 1.686 0.092 0.168 0.168
GH5 0.308 0.174 1.768 0.077 0.190 0.190
GH6 0.673 0.221 3.039 0.002 0.416 0.416
GH8 0.342 0.180 1.899 0.058 0.211 0.211
GH9 0.851 0.251 3.395 0.001 0.526 0.526
UT1 0.063 0.015 4.074 0.000 0.039 0.412
UT2 0.037 0.011 3.402 0.001 0.023 0.336
UT4 0.034 0.009 3.629 0.000 0.021 0.363
UT5 0.028 0.008 3.566 0.000 0.017 0.381
UT6 0.039 0.010 3.780 0.000 0.024 0.411
UT8 0.037 0.009 3.994 0.000 0.023 0.462
UT9 0.016 0.008 2.039 0.041 0.010 0.182
VZ =~
D1 1.000 0.281 0.281
D4 1.878 0.968 1.941 0.052 0.527 0.527
D5 1.706 0.829 2.058 0.040 0.479 0.479
D6 1.236 0.736 1.679 0.093 0.347 0.347
D7 1.297 0.689 1.882 0.060 0.364 0.364
D8 1.463 0.781 1.873 0.061 0.411 0.411
D9 1.572 0.823 1.910 0.056 0.441 0.441
D10 1.502 0.804 1.869 0.062 0.422 0.422
GT1 -0.162 0.467 -0.346 0.729 -0.045 -0.045
GT2 0.355 0.480 0.739 0.460 0.100 0.100
GT3 -0.009 0.462 -0.020 0.984 -0.003 -0.003
GT5 -0.232 0.432 -0.536 0.592 -0.065 -0.065
GT6 0.753 0.586 1.285 0.199 0.211 0.211
GT8 0.850 0.678 1.254 0.210 0.239 0.239
GT9 0.215 0.501 0.430 0.667 0.060 0.060
UP1 0.342 0.458 0.747 0.455 0.096 0.096
UP2 0.469 0.426 1.102 0.270 0.132 0.132
UP3 0.532 0.474 1.121 0.262 0.149 0.149
UP4 -0.092 0.449 -0.204 0.839 -0.026 -0.026
UP5 -0.288 0.440 -0.655 0.513 -0.081 -0.081
UP9 0.573 0.434 1.321 0.186 0.161 0.161
UP10 -0.152 0.440 -0.346 0.730 -0.043 -0.043
GH =~
GH1 1.000 0.123 0.123
GH2 -0.287 1.089 -0.264 0.792 -0.035 -0.035
GH4 0.118 1.093 0.108 0.914 0.015 0.015
GH5 -4.896 6.222 -0.787 0.431 -0.603 -0.603
GH6 0.372 1.236 0.301 0.764 0.046 0.046
GH8 -3.374 4.220 -0.800 0.424 -0.416 -0.416
GH9 4.041 5.022 0.805 0.421 0.498 0.498
UT =~
UT1 1.000 0.036 0.384
UT2 1.021 0.165 6.173 0.000 0.037 0.539
UT4 1.017 0.159 6.377 0.000 0.037 0.630
UT5 0.624 0.110 5.653 0.000 0.023 0.507
UT6 0.666 0.150 4.432 0.000 0.024 0.416
UT8 0.660 0.109 6.066 0.000 0.024 0.481
UT9 0.734 0.158 4.649 0.000 0.027 0.479
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
Gv ~~
VZ 0.000 0.000 0.000
GH 0.000 0.000 0.000
UT 0.000 0.000 0.000
VZ ~~
GH 0.000 0.000 0.000
UT 0.000 0.000 0.000
GH ~~
UT 0.000 0.000 0.000
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.UT1 0.140 0.007 19.714 0.000 0.140 1.478
.UT2 0.152 0.004 34.706 0.000 0.152 2.207
.UT4 0.096 0.004 24.509 0.000 0.096 1.634
.UT5 0.069 0.003 23.861 0.000 0.069 1.545
.UT6 0.092 0.004 24.624 0.000 0.092 1.579
.UT8 0.075 0.003 23.116 0.000 0.075 1.508
.UT9 0.089 0.004 24.960 0.000 0.089 1.599
Thresholds:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
D1|t1 -1.240 0.106 -11.705 0.000 -1.240 -1.240
D4|t1 -1.198 0.104 -11.536 0.000 -1.198 -1.198
D5|t1 -0.748 0.088 -8.516 0.000 -0.748 -0.748
D6|t1 -0.933 0.093 -10.017 0.000 -0.933 -0.933
D7|t1 0.015 0.079 0.189 0.850 0.015 0.015
D8|t1 -0.278 0.080 -3.460 0.001 -0.278 -0.278
D9|t1 0.005 0.079 0.063 0.950 0.005 0.005
D10|t1 -0.206 0.080 -2.581 0.010 -0.206 -0.206
GT1|t1 0.299 0.081 3.711 0.000 0.299 0.299
GT2|t1 0.563 0.084 6.698 0.000 0.563 0.563
GT3|t1 0.438 0.082 5.336 0.000 0.438 0.438
GT5|t1 0.362 0.081 4.463 0.000 0.362 0.362
GT6|t1 0.610 0.085 7.188 0.000 0.610 0.610
GT8|t1 1.101 0.099 11.070 0.000 1.101 1.101
GT9|t1 0.888 0.092 9.680 0.000 0.888 0.888
UP1|t1 -0.516 0.083 -6.204 0.000 -0.516 -0.516
UP2|t1 0.065 0.079 0.819 0.413 0.065 0.065
UP3|t1 -0.035 0.079 -0.441 0.659 -0.035 -0.035
UP4|t1 0.145 0.080 1.826 0.068 0.145 0.145
UP5|t1 0.166 0.080 2.078 0.038 0.166 0.166
UP9|t1 0.166 0.080 2.078 0.038 0.166 0.166
UP10|t1 0.247 0.080 3.084 0.002 0.247 0.247
GH1|t1 0.341 0.081 4.212 0.000 0.341 0.341
GH2|t1 -0.085 0.079 -1.071 0.284 -0.085 -0.085
GH4|t1 -0.352 0.081 -4.337 0.000 -0.352 -0.352
GH5|t1 -0.309 0.081 -3.837 0.000 -0.309 -0.309
GH6|t1 -0.671 0.086 -7.796 0.000 -0.671 -0.671
GH8|t1 -0.125 0.079 -1.574 0.115 -0.125 -0.125
GH9|t1 -1.240 0.106 -11.705 0.000 -1.240 -1.240
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.D1 0.540 0.540 0.540
.D4 0.626 0.626 0.626
.D5 0.647 0.647 0.647
.D6 0.848 0.848 0.848
.D7 0.708 0.708 0.708
.D8 0.792 0.792 0.792
.D9 0.783 0.783 0.783
.D10 0.797 0.797 0.797
.GT1 0.906 0.906 0.906
.GT2 0.977 0.977 0.977
.GT3 0.967 0.967 0.967
.GT5 0.854 0.854 0.854
.GT6 0.909 0.909 0.909
.GT8 0.848 0.848 0.848
.GT9 0.929 0.929 0.929
.UP1 0.949 0.949 0.949
.UP2 0.809 0.809 0.809
.UP3 0.841 0.841 0.841
.UP4 0.928 0.928 0.928
.UP5 0.894 0.894 0.894
.UP9 0.642 0.642 0.642
.UP10 0.855 0.855 0.855
.GH1 0.836 0.836 0.836
.GH2 0.982 0.982 0.982
.GH4 0.972 0.972 0.972
.GH5 0.600 0.600 0.600
.GH6 0.825 0.825 0.825
.GH8 0.782 0.782 0.782
.GH9 0.475 0.475 0.475
.UT1 0.006 0.000 17.955 0.000 0.006 0.683
.UT2 0.003 0.000 11.267 0.000 0.003 0.597
.UT4 0.002 0.000 9.526 0.000 0.002 0.472
.UT5 0.001 0.000 12.123 0.000 0.001 0.597
.UT6 0.002 0.000 10.679 0.000 0.002 0.658
.UT8 0.001 0.000 12.588 0.000 0.001 0.555
.UT9 0.002 0.000 10.268 0.000 0.002 0.737
Gv 0.382 0.148 2.571 0.010 1.000 1.000
VZ 0.079 0.074 1.069 0.285 1.000 1.000
GH 0.015 0.036 0.425 0.671 1.000 1.000
UT 0.001 0.000 3.270 0.001 1.000 1.000
I'm reading a medical research paper with a table of results from the tests used to evaluate associations between some clinical variables and the outcome. Depending on the variable type they used the chi-squared, independent t, Fisher's exact, or Mann-Whitney test. Let's say there are about 100 patients, 20 of whom have diabetes. The outcome of interest is heart attack.
The table has this format:

Variable | Outcome present | Outcome absent | P value
---|---|---|---
Diabetes | 15 (diabetic patients who had a heart attack) | 5 (diabetic patients who didn't) | <0.05
AFAIK these tests only evaluate whether diabetes and heart attacks are associated. These tests can't establish causation. Neither can they determine the direction or size of association.
In this paper, they interpreted it in these ways (there are several variables; I'm only using one here):
1. Patients with diabetes were more likely to suffer a heart attack.
2. Diabetes increased the risk of heart attack.
3. Having diabetes has been associated with an increased risk of heart attack.
4. Having diabetes is strongly related to the occurrence of heart attack.
I think any interpretation other than "diabetes is associated with the occurrence of heart attack" is incorrect.
I'm not a math or stats expert so I'd appreciate simple answers. 😅 Also, if someone could please suggest a few reliable and free online resources to understand this better...
So like, an event happens, which triggers two independent 50% chances, but it's not guaranteed that one of them will actually happen. What's a more accurate percentage? Is there one? Thanks!
I apparently don't know enough about advanced statistics to even find what function(s) are needed to do this, so I was hoping I could get some assistance here. Maybe it's too complex. I want to calculate the probability of all the outcomes for rolling a set of dice for a wargame (Star Wars Legion).
The dice have four possible outcomes and there are three different types of dice that have different probabilities of each outcome. After hours of scouring the web and various mathematics sites, the best I could come up with was a polynomial to describe each die:
DIE 1 = 0.625a + 0.125b + 0.125c + 0.125d
DIE 2 = 0.375a + 0.375b + 0.125c + 0.125d
DIE 3 = 0.125a + 0.625b + 0.125c + 0.125d
And multiplying those polynomials:
(.625a + .125b + .125c + .125d)^(2) * (.375a + .375b + .125c + .125d)^(2)
Because expanding the formula will give me the probability of each of the 35 possible combinations:
0.0531879 a^4 + 0.127997 a^3 b + 0.0570797 a^3 c + 0.0570797 a^3 d + 0.0986273 a^2 b^2 + 0.0975094 a^2 b c + 0.0975094 a^2 b d + 0.0225211 a^2 c^2 + 0.0450422 a^2 c d + 0.0225211 a^2 d^2 + 0.0260156 a b^3 + 0.0462891 a b^2 c + 0.0462891 a b^2 d + 0.0241406 a b c^2 + 0.0482813 a b c d + 0.0241406 a b d^2 + 0.00386719 a c^3 + 0.0116016 a c^2 d + 0.0116016 a c d^2 + 0.00386719 a d^3 + 0.00219727 b^4 + 0.00585938 b^3 c + 0.00585938 b^3 d + 0.00537109 b^2 c^2 + 0.0107422 b^2 c d + 0.00537109 b^2 d^2 + 0.00195313 b c^3 + 0.00585938 b c^2 d + 0.00585938 b c d^2 + 0.00195313 b d^3 + 0.000244141 c^4 + 0.000976563 c^3 d + 0.00146484 c^2 d^2 + 0.000976563 c d^3 + 0.000244141 d^4
I can then manually manipulate this data and put the results in a spreadsheet or something to get a full table of probabilities. The problem is that making that table is very tedious, and doing all combinations would become very large. Are there better functions I could use to calculate the probability of each combination of roll results, and/or does anyone know a good way to dynamically generate a table with the results?
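If you just want the full table without expanding polynomials by hand, brute-force enumeration is quick for a handful of dice. A minimal R sketch, assuming the pool in the product above (two of DIE 1 and two of DIE 2); each row of the output is one of the 35 symbol combinations with its total probability:

```r
# Per-die probabilities of the four symbols a/b/c/d (from the post)
die1 <- c(a = 0.625, b = 0.125, c = 0.125, d = 0.125)
die2 <- c(a = 0.375, b = 0.375, c = 0.125, d = 0.125)
die3 <- c(a = 0.125, b = 0.625, c = 0.125, d = 0.125)   # defined for completeness

pool <- list(die1, die1, die2, die2)   # the pool from the product above; edit freely

# Every face combination across the pool, with its probability
grid <- expand.grid(lapply(pool, names), stringsAsFactors = FALSE)
prob <- apply(grid, 1, function(row) prod(mapply(function(d, f) d[[f]], pool, row)))

# Collapse combinations with the same counts of a/b/c/d (order of the dice is irrelevant)
key <- apply(grid, 1, function(row) paste(sort(row), collapse = ""))
tab <- aggregate(prob, by = list(outcome = key), FUN = sum)
tab[order(-tab$x), ]   # e.g. "aaab" should match the a^3 b coefficient above
```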
It's mentioned in my university textbook and there are a few articles online, but there's no Wikipedia page for it, even though there is one titled Student's t-distribution. What really confused me was when I asked ChatGPT about Fisher's t: it told me about the F distribution and said there's no such thing as Fisher's t.
Hi,
Does anyone know a good place to find videos of statistics talks/discussions/seminars? I just found that the Simons Institute is a good one, but I wonder if there are other similar places? It would be better if they cover more recent research topics.
Thanks
I've been putting in a lot of thought into my undergraduate degree because I won't be pursuing a master's or PhD degree in statistics. My ultimate goal is to try for med school, so I'm doing a statistics bachelors and a minor in biology for pre-med (i.e. I will have my associate degree in chemistry soon). I'm passionate about research and economics but am having trouble choosing well-rounded electives. This is what I have so far. Would you change anything? Should I switch out econometrics with game theory or optimization?
Calculus Series
Introduction to Probability
Statistical Theory
Fundamentals of Computing
Experimental Design
Applied regression
Real Analysis 1 (I heard while not applied directly it helps with evaluating/validating statistical methods/models)
Linear Algebra
Computational Bayesian Statistics
Time Series
Statistical Methods and Applications 1; R-language:
discrete and continuous probability distributions, expectation, laws of large numbers, central limit theorem, statistical parameter estimation, hypothesis testing, and regression analysis
Statistical Methods and Application 2; R-language:
modern regression analysis, analysis of variance (ANOVA), experimental design, nonparametric methods, and an introduction to Bayesian data analysis
Object Oriented Programming
Data Structures and Program Design
Econometrics 2
Edit:
So far, these are the ones I'm considering swapping:
OOP - > Introduction to Optimization
Data Structures and Program Design - > Applied Graph Theory
Real Analysis 1 - > Discrete Structures
As the title says.
Assume I am running an experiment with three independent trials (entire experiment is done from start to finish three times). In this experiment, we create a solution of yeast in water in a beaker. We prepare two separate test tubes with YPD (a growth medium) and transfer some of the yeast solution to both test tubes. In one of the test tubes (the experimental tube), we add a small amount of anabaena, with the other test tube serving as a control. We record the cell count (of the yeast) for both tubes. We incubate both tubes for 48 hours and do another cell count.
As mentioned earlier, this experiment is repeated from start to finish three separate times. I’m not a statistician and I’m having a hard time figuring out how to go about statistical analysis: should I test for normality? How do I test for normality? Which tests should I perform? Why? Etc. The general question being: does the presence of anabaena have a statistically significant effect on the population/growth of yeast. Our data consists of three separate trials, each trial having a before and after (incubation) cell count of both the experimental and control test tubes.
I was told by my professor that I’d most likely want to perform a Mann-Whitney-U test, but the more I looked into why, the more confused I got about the varying types of tests and the circumstances in which they are used. Is my data normal? Dependent/independent? (I would say that the separate trials are independent, but are we considering the data between trials or within each trial?), is the data paired? (I’m not even sure what this means/is applied to if I’m being honest), etc.
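Not a recommendation on which test is right, just a minimal sketch (with made-up numbers) of what "paired" means here: within each trial the anabaena tube and the control tube share the same starting culture, so the natural unit is the within-trial difference in growth.

```r
# A minimal sketch of one possible "paired" layout (hypothetical numbers):
# per trial, the 48 h fold-growth (final count / initial count) for each tube.
growth <- data.frame(
  trial    = 1:3,
  control  = c(2.10, 1.95, 2.30),   # hypothetical fold-growth values
  anabaena = c(1.60, 1.70, 1.85)
)

# Paired comparison on the log scale (only 3 pairs, so power is very limited)
t.test(log(growth$anabaena), log(growth$control), paired = TRUE)
```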
Hi everyone, so we have a dataset of daily averaged phytoplankton time series over a full year: coccolithophores, chlorophytes, cyanobacteria, diatoms, dinoflagellates, Phaeocystis, zooplankton.
Then we have atmospheric measurements on the same time intervals of a few aerosols species; Methanesulphonic acid, carboxylic acids, aliphatics, sulphates, ammonium, nitrates etc...
Our goal is to establish all the possible links between plankton types and aerosols; we want to find out which plankton types matter most for a given aerosol species.
So here is my question: which mathematical tools would you use to build a model with these (nonlinear) time series? Random forests, cross-wavelets, transfer entropy, fractal analysis, chaos theory, Bayesian statistics? The thing that puzzles me most is that we know there is a lag between a plankton bloom and aerosols eventually forming in the atmosphere; it can take weeks for a bloom to trigger aerosol formation. So far many studies have just used lagged Pearson's correlation, which I am not too happy with, as correlation really isn't that reliable here. Would you know of any advanced methods to find the optimal lag? What would be the best approach in your opinion?
I would really appreciate any ideas, so please don't hesitate to write down yours and I'd be happy to debate it. Have a nice Sunday, cheers :)
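As a simple baseline before anything fancier, a cross-correlation scan over candidate lags is a small step up from a single lagged Pearson correlation. A minimal R sketch with hypothetical column names (`chl` for a chlorophyte series, `msa` for methanesulphonic acid), assuming daily, aligned series in a data frame `df`; it only captures linear pairwise structure, so treat it as a starting point rather than a substitute for the nonlinear methods listed above:

```r
# Cross-correlation between one plankton series and one aerosol series,
# scanning lags of up to 60 days in either direction.
cc <- ccf(df$chl, df$msa, lag.max = 60, na.action = na.pass, plot = FALSE)

best_lag <- cc$lag[which.max(abs(cc$acf))]
best_lag        # lag (in days) with the strongest absolute cross-correlation
plot(cc)        # inspect the whole lag profile, not just the maximum
```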
Not sure if this is appropriate for this sub, this is not a homework question, but rather me asking for advice. I am completely lost in my statistics course and my final is soon! It’s calculus based, and I’m specifically having trouble with random variables, parameter estimation, and hypothesis testing.
Edit: I'm in a second-year stats course, and I'm having trouble with distributions other than the normal, and with things like the method of moments, maximum likelihood, and least-squares error.
As someone who has played hundreds of thousands of poker hands, primarily in MTTs and cash games, I have a solid understanding of the game, its variants, and probability calculations. Recently, while playing Ultimate Texas Hold'em on Betsafe, I experienced something highly unusual during a session of 278 hands.
To add to this, I have photos of every single hand I played during this session, so I can back up everything I am describing here with concrete evidence.
Before diving into the details, here’s a quick recap of the game’s rules for context:
##Rules Recap for Ultimate Texas Hold'em.
##What Happened.
Here’s what I experienced during my session of 278 hands:
These results felt highly improbable to me, so I decided to calculate the likelihood of each outcome and their combined probability.
##Probability and Odds.
Dealer fails to qualify in 36 or fewer hands:
Statistically, the dealer should fail to qualify in around 55–58 hands (20% chance per hand).
Actual result: 36 hands.
Probability: 0.137% (odds of 1 in 729).
Fun fact: This is just as likely as dealing yourself a pocket pair of aces 3 times in a row in standard Texas Hold’em.
Dealer hits a straight or better in 39 or more hands:
Statistically, the dealer should hit a straight or better in about 23–24 hands (8.5% chance per hand).
Actual result: 39 hands.
Probability: 0.143% (odds of 1 in 698).
Fun fact: This is about as likely as rolling a perfect Yahtzee (five of a kind with dice) two times in a row.
Both events occurring in the same session:
Assuming the events are independent, I calculated their combined probability.
Combined probability: 0.000196% (odds of 1 in 5,092,688).
Fun fact: This is roughly as likely as being dealt a royal flush twice in a row in standard Texas Hold'em.
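For what it's worth, these tail probabilities can be computed directly from the binomial distribution. A minimal R sketch using the per-hand probabilities quoted above (20% and 8.5%) and n = 278 is below; the exact binomial values may not match the quoted figures exactly, and the product in the last line leans on the same independence assumption raised in the questions below.

```r
n <- 278

# P(dealer fails to qualify in 36 or fewer hands), assuming 20% per hand
p_qualify_low <- pbinom(36, size = n, prob = 0.20)

# P(dealer hits a straight or better in 39 or more hands), assuming 8.5% per hand
p_straight_high <- 1 - pbinom(38, size = n, prob = 0.085)

c(p_qualify_low, p_straight_high, p_qualify_low * p_straight_high)
```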
##My Questions
I’m sharing this here because I’d like to confirm whether I’ve thought about and calculated these probabilities correctly.
Are there better methods or approaches I could use to calculate these kinds of probabilities?
Does the assumption of independence between these two events hold, or could there be some interaction I’m not considering?
Are there any nuances I might have missed that could improve the accuracy of these calculations?
I’m not accusing anyone of wrongdoing; my goal is purely to understand whether my statistical approach makes sense and to reflect on how extreme variance can sometimes appear in these games.
Thanks in advance for your thoughts and feedback!
Hey,
While working on my thesis, I released a questionnaire to the public. It has 30 questions, with single-choice or open answers. 140 people responded, and I was not expecting that many replies.
I've exported the results into Excel, which left me with quite messy sheets. The first row is the question, the second is the possible answers, then all the respondents. Every possible answer is a separate column, and an answer is marked by a 1 in a cell, leaving all other cells empty.
My mentor said that I should use basic descriptive analysis, CL 95%, and chi-square with df. And that's where I ran into issues.
So when simplifying the answers, I get, for example: Question 3, 17 chose A, 33 chose B, 89 chose C, 1 did not respond. I'm trying to use Excel's data analysis functions, but I keep getting errors. I tried looking on YouTube for help, but in every video they use those tools for 10+ numbers, not just 2-5 like in my case.
What am I doing wrong? Did I misunderstand my mentor, and do I need to do a different kind of analysis? I know for sure they mentioned CL and chi-square.
I also tried using SPSS and R, but I couldn't even import the data properly, lol.
Any tips will be greatly appreciated!!
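In case it helps, here is a minimal R sketch of what the mentor may have meant (that reading is an assumption), using the Question 3 counts from the post (17/33/89, with the single non-response dropped): a chi-square goodness-of-fit test, which reports df and a p-value, plus a 95% confidence interval for one of the proportions.

```r
# Counts for one question (non-response dropped)
counts <- c(A = 17, B = 33, C = 89)

# Chi-square goodness-of-fit test against equal proportions (reports df and p-value)
chisq.test(counts)

# 95% confidence interval for the proportion choosing C
prop.test(counts["C"], sum(counts), conf.level = 0.95)
```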
I have this model in R:
Update ~ Valence*Accuracy + EstErr + Vividness + Prior + (1|subn)
And I'm interested in the Bayes Factor for the interaction term. Claude gave me this solution but I don't know if it's legit:
1. Run brms on a model with the interaction term, and on another model without it.
2. Then calculate the marginal likelihoods for both with bridge sampling, and compute the Bayes factor from those.
The result makes sense, but I want to make sure, and I need a reference for it (Claude gave this https://www.jstatsoft.org/article/view/v092i10)
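That two-model approach is a standard way to get a Bayes factor for a single term, and the reference appears to be the bridgesampling paper in the Journal of Statistical Software. A minimal sketch (hypothetical data frame `dat`; priors and iteration counts omitted for brevity, though for Bayes factors you would normally set priors deliberately and use many posterior draws):

```r
library(brms)

fit_full <- brm(
  Update ~ Valence * Accuracy + EstErr + Vividness + Prior + (1 | subn),
  data = dat, save_pars = save_pars(all = TRUE)
)
fit_reduced <- brm(
  Update ~ Valence + Accuracy + EstErr + Vividness + Prior + (1 | subn),
  data = dat, save_pars = save_pars(all = TRUE)
)

# bayes_factor() bridge-samples both models' marginal likelihoods internally
bayes_factor(fit_full, fit_reduced)
```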
I have survey respondents, which constitute a sample of the population from which I'd like to draw inferences. I used entropy balancing (EB) to adjust for non-response using the WeightIt package in R.
I've read that EB qualifies as an "equal percent bias reducing" matching procedure, i.e., it guarantees that bias will be lower after matching. However, when I check the regression coefficients from a model fit to the sample using the weights obtained from EB, they become less similar to the population coefficients. Is this possible, or is it likely that I've made a huge error in my specification of the EB?
I started learning mathematics for DS/ML two months ago and found myself tangled up in it.
I decided to unlearn and start fresh. Please recommend YouTube playlists or notes.
Thank you for reading. Glad if you respond🫶
I took 3 hours one Friday on my campus asking college students to take the water-level task, where the goal is for the subject to understand that water is always parallel to the earth. Results are below. The null hypothesis was that the population proportions are the same; the alternative was that men outperform women.
| | True/Pass | False/Fail | Total |
|---|---|---|---|
| Male | 27 | 15 | 42 |
| Female | 23 | 17 | 40 |
| Total | 50 | 33 | 82 |
p-hat 1 = 64% | p-hat 2 = 58% | Alpha/significance level= .05
p-pooled = 61%
z=.63
p-value=.27
p=.27>.05
At the 5% significance level we fail to reject the null hypothesis. This data set does not suggest that men significantly outperform women on this task.
This was on a liberal arts campus, if anyone thinks that's relevant.
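For anyone wanting to reproduce this, a minimal R sketch of the same one-sided two-proportion z-test (no continuity correction, matching the hand calculation above):

```r
pass <- c(male = 27, female = 23)
n    <- c(male = 42, female = 40)

p_hat  <- pass / n
p_pool <- sum(pass) / sum(n)
z <- (p_hat["male"] - p_hat["female"]) /
  sqrt(p_pool * (1 - p_pool) * (1 / n["male"] + 1 / n["female"]))
p_value <- pnorm(z, lower.tail = FALSE)
c(z = unname(z), p = unname(p_value))   # roughly z = 0.63, p = 0.26

# Equivalent built-in test (one-sided, without continuity correction)
prop.test(pass, n, alternative = "greater", correct = FALSE)
```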
Here’s basically what happened: I was given a list of 30 items. I had to guess which 2 items would be randomly selected based on a random number generator. I guessed both correctly.
We ran the experiment again and again I guessed the 2 items correctly.
What are the odds here and how do I calculate them?
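A minimal sketch under two assumptions: you named one unordered pair of items per run, and the two runs were independent draws from the full list of 30.

```r
# Probability of naming the exact pair that gets drawn, once and then twice in a row
p_one_run <- 1 / choose(30, 2)       # 1 / 435
p_both    <- p_one_run^2             # about 1 in 189,225
c(p_one_run = p_one_run, p_both = p_both)
```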
I’m analyzing a clinical trial using Win Ratio as the primary outcome.
More about Win Ratio https://pubmed.ncbi.nlm.nih.gov/21900289/#:~:text=The%20win%20ratio%20is%20the,win%20ratio%20are%20readily%20obtained.
Is there a visual way to show the data? It is normally reported as a line in the text or a data table. Would be nice to have a pretty figure for a presentation/poster.
Thank you
I am creating a prediction model using random forest, but I don't understand how the model and script would handle both tables loaded in as data frames.
What's the best way to use multiple tables with a Random Forest model when one table has static attributes (like food characteristics) and the other has dynamic factors (like daily health habits)?
Example: I want to predict stomach aches based on both the food I eat (unchanging) and daily factors (sleep, water intake).
Tables:
How do I combine these tables in a Random Forest model? Should they be merged on a unique identifier like "Day number"?
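A minimal sketch with hypothetical table and column names (`foods` keyed by `food_id` holding the static attributes, `days` holding the daily log with sleep, water, the `food_id` eaten, and the outcome `stomach_ache`): the usual approach is indeed to merge the static attributes onto each daily row so every observation carries both kinds of features, then fit one forest.

```r
library(randomForest)

# Merge the static food attributes onto the daily log (one row per day, all features)
train <- merge(days, foods, by = "food_id")
train$stomach_ache <- factor(train$stomach_ache)   # classification target

# Drop identifier columns that shouldn't be used as predictors
train <- train[, setdiff(names(train), c("day", "food_id"))]

rf <- randomForest(stomach_ache ~ ., data = train, ntree = 500)
print(rf)
```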
Hello all, I have a dataset of rankings made by construction professionals. They were asked to rank different materials (e.g., wood, concrete, reinforced concrete) in order of preference for a specific task. I would now like to analyze the ranking of those materials against other variables, such as age class. As the materials have a strong effect on the ranking, what should I use to analyze the other variables? I am working in R and was heading toward a Kruskal-Wallis test since I have ordinal data, run like the following:
kruskal.test(RANKING ~ interaction(AGE_CLASS, MATERIAL))
Sorry if the question seems dumb, I haven't practiced in a long while. I can provide sample data if necessary.
Hi all, I was hoping to get some advice on a statistical problem I've been struggling with. The data are honestly a mess, and I would personally prefer to limit myself to descriptive statistics. But as some of you might know, most journals require a significance value. So, here is a description of my data:
I have multiple groups of animals that received different treatments. The groups are not equally sized (the control group has 2 animals, the others have 4). For each animal, I have 10 measurements of counts. These measurements have a large variation, which is expected. The main problem is that some of these 10 measurements are not usable due to a variety of issues. This leaves me with both unequal groups and unequal numbers of measurements within each group/animal.
I honestly have no clue how to do a proper statistical test on this data. I fitted a generalised linear model with a Poisson distribution as a first attempt, but I'm not sure whether it properly accounts for the missing data. Any type of data selection would introduce huge bias, because the measurements are quite far apart. ChatGPT+Wolfram suggests a GLMM with the following code (I use R): model_poisson <- glmer(Count ~ Group + (1 | Group/Animal), data = data, family = poisson(link = "log")). Would this be correct? Is there any way I can improve on this method or check assumptions?
Any help is greatly appreciated and I would be happy to provide more context (although the data itself is unfortunately still confidential)
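Not a definitive answer, but a minimal sketch of one reasonable variant, assuming long-format data with columns Count, Group, and Animal and one row per usable measurement. Here the random intercept is per animal only, since Group is already a fixed effect (the (1 | Group/Animal) version adds a group-level random intercept that is hard to estimate with so few groups). The overdispersion check and negative binomial fallback are common additions, not requirements:

```r
library(lme4)

# Poisson GLMM: treatment as fixed effect, animal as random intercept;
# unusable measurements are simply absent rows, which the model tolerates.
m_pois <- glmer(Count ~ Group + (1 | Animal),
                data = dat, family = poisson(link = "log"))

# Quick check for overdispersion; if clearly > 1, a negative binomial GLMM
# is a common alternative.
sum(residuals(m_pois, type = "pearson")^2) / df.residual(m_pois)

m_nb <- glmer.nb(Count ~ Group + (1 | Animal), data = dat)
```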
Hi there,
I've created a video here where I talk about the Poisson distribution and how it arises as a limiting case of the binomial distribution when the probability of success tends to 0 and the number of trials tends to infinity, with their product n·p held fixed.
I hope it may be of use to some of you out there. Feedback is more than welcomed! :)
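For anyone who wants to see the limit numerically, a quick R check (λ = 3 and n = 10000 are arbitrary choices): the Binomial(n, λ/n) pmf is already very close to the Poisson(λ) pmf.

```r
lambda <- 3
n <- 10000
p <- lambda / n

# Largest pointwise difference between the two pmfs over the first 16 counts
max(abs(dbinom(0:15, size = n, prob = p) - dpois(0:15, lambda = lambda)))
```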
Hi, I’ve been having a sort of quarter life crisis and would appreciate some guidance!
I’m a MPH grad (epidemiology/biostatistics emphasis) and have hopped around a few jobs. Started off as an epidemiologist, then went into a statistician role at a university, and now working as a biostatistician at a medical device company.
In both my statistician and biostatistician roles, I don’t find myself doing tons of actual statistical analyses. I thought I’d be learning how to do complex modeling or running a bunch of statistical tests day in and day out, but that’s not really the case.
In my current role (biostatistician), I find a lot of it is producing listings and tables (summary statistics, like means or frequencies) for reports, and then working with a larger team to produce said reports for studies. I have dabbled in some hypothesis testing (t-tests, ANOVA) and simulations, but nothing extreme.
I don’t necessarily hate that I’m not working with more complex statistical work - I actually enjoy using SAS and developing programs. But I’m worried about how this will set me up for success long-term.
I’m thinking about career progression - while I’m not looking to leave my role anytime soon, does it make sense to continue looking for biostatistician roles in the future? Is this pretty common in this field, or is my current job more of an anomaly? If this isn’t common, are there other job titles that may be better suited for the type of work I’m doing (simpler statistics, developing programs in SAS, producing listings and tables for reports, etc.)?
Thanks in advance!