/r/statistics

/r/Statistics is going dark from June 12-14th as an act of protest against Reddit's treatment of 3rd party app developers.

This community will not grant access requests during the protest. Please do not message asking to be added to the subreddit.


Guidelines:

  1. All Posts Require One of the Following Tags in the Post Title! If you do not tag your post, AutoModerator will delete it:

Tag            Abbreviation
[Research]     [R]
[Software]     [S]
[Question]     [Q]
[Discussion]   [D]
[Education]    [E]
[Career]       [C]
[Meta]         [M]
  • This is not a subreddit for homework questions. They will be swiftly removed, so don't waste your time! Please post those over at r/homeworkhelp. Thank you.

  • Please try to keep submissions on topic and of high quality.

  • Just because it has a statistic in it doesn't make it statistics.

  • Memes and image macros are not acceptable forms of content.

  • Self-posts from throwaway accounts will be deleted by AutoModerator.


  • Related subreddits:


    Data:


    Useful resources for learning R:

  • r-bloggers - a blog aggregator of statistics articles, generally done with R.

  • Quick-R - great R reference site.


  • Related Software Links:

  • R

  • RStudio

  • SAS

  • Stata

  • EViews

  • JMP

  • SPSS

  • Minitab


  • Advice for applying to grad school:

  • Submission 1


  • Advice for undergrads:

  • Submission 1


  • Jobs and Internships

    For grads:

    For undergrads:

    /r/statistics

    561,110 Subscribers

    1

    [Q] Check if one value from a set is different from other values

    Let's say that I have a set of related values:
    Set1 (Website Visits): 5, 4, 3, 7, 2
    Set2 (Website Sales): 1, 7, 8, 2, 0
    Set3 (Conversion Rates): Set2/Set1 (1/5), etc.
    How can I check whether any value from Set3 (conversion rate) is statistically significantly worse than the other values from the same set?
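
    A minimal sketch in R of one common approach: compare one page's conversion rate against the other pages pooled, using a two-sample proportion test. The counts below are illustrative (in the example above, sales exceed visits in places, so those numbers can't be used directly), and testing every page this way calls for a multiple-comparison correction, e.g. pairwise.prop.test().

        # Hypothetical counts: sales can never exceed visits
        sales  <- c(10, 14, 16, 4, 1)
        visits <- c(50, 40, 30, 70, 20)

        i <- 5                                   # the page suspected of being worse
        prop.test(c(sales[i], sum(sales[-i])),   # successes: page i vs the rest pooled
                  c(visits[i], sum(visits[-i])),
                  alternative = "less")          # one-sided: is page i's rate lower?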

    3 Comments
    2024/03/28
    20:40 UTC

    2

    [Q] What statistical test do I need for my repeated measures study?

    Hey

    My research question was to see whether 2 groups (intervention group vs control group) would differ in their cold showering behavior after 8 weeks. The intervention group received an online "training" to help them get used to taking cold showers.

    So, cold showering behavior is the dependent variable, which was measured online 9 times (week 0 baseline, then weekly up to week 8). The dependent variable is (probably) ordinal, as the answer options participants could choose from were:

    How many times did you take a cold shower the past week?

    • 0 times
    • 1-3 times
    • more than 3 times

    My main issue in finding a statistical test is that the variable was measured 9 times, and I'm not sure which model suits such an analysis. Also, since the same people were measured repeatedly in their cold showering behavior, I think the residuals should not be independent.
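
    One hedged suggestion, sketched in R: a cumulative link mixed model (the ordinal package) handles an ordinal outcome with repeated measures via a random intercept per participant. The data frame and column names below are illustrative, and the data are assumed to be in long format, one row per participant-week.

        library(ordinal)

        # dat: columns id, group (intervention/control), week (0-8),
        # and showers (the ordinal answer category)
        dat$showers <- factor(dat$showers, levels = c("0", "1-3", ">3"),
                              ordered = TRUE)

        # The random intercept absorbs the within-person correlation;
        # the group:week term tests whether the groups diverge over time.
        m <- clmm(showers ~ group * week + (1 | id), data = dat)
        summary(m)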

    7 Comments
    2024/03/28
    15:39 UTC

    0

    [Question] What is the difference between the F p.r. of the Regression and the t p.r. value of the Constant in GenStat's Simple Linear Regression function?

    I want to see if there is a relationship between weather conditions and the number of species I have caught within a moth trap, so I found a weather database that gives me past data for temperature, humidity, precipitation, windspeed, cloud cover, and moon phase, and I am comparing that with the number of species that I have recorded in a moth trap.

    I have used the Simple Linear Regression function in GenStat to compare each of the weather variables (separately; I couldn't figure out how to get the multiple regression one to work, and that's probably way over my head anyway) with the number of species, and I got some trends that I more or less expected. I'm just confused about a couple of things.

    What is the difference between the F p.r. value of the regression and the t p.r. value of the Constant? For example, with temperature the F p.r. value is <.001, and the t p.r. is 0.005 (and looking at the produced scatter graph, there seems to be a clear and strong correlation between the two, something which I expected). However, for total precipitation, the F p.r. value is 0.449 (so not significant), and the t p.r. is <.001. From my experience, the more it rains the fewer moths there are (which the scatter graph appears to support), but the F p.r. value is confusing me.
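
    For what it's worth, the two numbers answer different questions: the F p.r. tests whether the slope (the weather effect) is zero, while the t p.r. of the Constant (the intercept, as far as GenStat's output goes) tests whether the intercept is zero, which is rarely of interest. A base-R sketch with simulated data makes the distinction visible; in simple one-predictor regression the overall F p-value coincides with the slope's t p-value, not the Constant's.

        set.seed(1)
        temp    <- runif(40, 5, 25)
        species <- rpois(40, lambda = 2 + 0.3 * temp)   # made-up moth counts

        m <- lm(species ~ temp)
        summary(m)   # slope row: same p-value as the overall F-test below
        anova(m)     # the F-test, as GenStat's "F p.r." reports it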

    2 Comments
    2024/03/28
    14:45 UTC

    0

    [Q] Anomaly or normal?

    I have probably guessed people's birthdays fewer than 25 times so far in my 18 years of living; of those times, I've been right on my first try 5/6 times and a few days off another 5 times.

    1. I have never met or known about the actual birthday of the people i've guessed for before
    2. there are 366 possible days these people could be born

    Is this a normal fraction of first-try hits, or is it an anomaly? I was with my new classmates today discussing birthdays and we were all really confused as to why I managed to pull this off, and I was wondering if somebody that's interested could explain the likelihood of me achieving this.
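
    The arithmetic, sketched in R: with 25 independent uniform guesses over 366 days, even one first-try hit is unlikely, and five or more is essentially impossible, which suggests the guesses weren't really uniform (birthdays cluster, and hints leak) or the count is misremembered.

        p <- 1 / 366
        n <- 25
        n * p                                  # expected first-try hits: ~0.068
        sum(dbinom(5:n, size = n, prob = p))   # P(5 or more hits): ~1e-8 territory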

    6 Comments
    2024/03/28
    09:12 UTC

    1

    [Q] How to learn MAXQDA Analytics Pro. Any resources/guidance?

    Not sure if this is the right sub, but I'm trying to learn how to use this program for a clinical psych project. I'm pretty sure my professor wants me to self-learn, but I'm not sure where to look or how to start.

    0 Comments
    2024/03/28
    08:40 UTC

    3

    [Question] Best approach for modeling signal

    I'm currently working on a project where I have a time series for a signal that is stationary, fluctuating continuously between -10 and 10 with a mean of 0. I have data every 1 minute for 2 years, and I have 50 different signals, but I believe each is computed in the same way.

    The goal is to figure out what this signal is, or to be able to recreate it from other features. My first thought on how to approach this is to generate lots of features that are also stationary from price and volume data: various moving-average differentials divided by rolling volatility, offsets from various moving averages, 2nd and 3rd derivatives of various moving averages, etc.

    My guess is that this signal is based on some linear combination of features created from another non-stationary time series.

    My main 3 questions are below:

    1. What model/approach is best? I was thinking lasso or ridge regression, since I suspect the signal is linear and will have many correlated features (see the sketch below).
    2. Should I reduce the frequency from 1-minute to 1-hour intervals? I'm not sure whether the autocorrelation of the series will cause problems.
    3. Should I be differencing the signal and features even though they are already stationary?

    Thanks, and any advice is greatly appreciated.
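
    A minimal lasso sketch in R with glmnet, assuming a numeric feature matrix X (the engineered stationary features) and the target signal y, aligned on timestamps; both names are placeholders.

        library(glmnet)

        fit <- cv.glmnet(X, y, alpha = 1)   # alpha = 1 is the lasso
        plot(fit)                           # CV curve over lambda
        coef(fit, s = "lambda.min")         # which features survive

        # Caveat: cv.glmnet's random CV folds ignore time order; with
        # autocorrelated data, blocked or rolling-origin splits are safer.
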
    3 Comments
    2024/03/28
    03:49 UTC

    3

    [Q] Distribution of double pendulum angles

    The angles (and the X/Y coordinates of the tip) of a double pendulum exhibit chaotic behavior, so it seems like it would be interesting to look at their cumulative distribution functions.

    I googled a bit but I can't find anything like that. I see plenty of pretty random-walk graphs of angles over time, but not distributions. Any pointers where I could find that, or do I need to simulate it myself?

    Should I expect different distributions for different initial conditions? Or is the distribution dependent on the size and mass parameters, but not on the initial angles and velocities?
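
    A simulation sketch in R with deSolve, using the standard double-pendulum equations of motion (the form given on myphysicslab); all masses, lengths, and initial conditions here are arbitrary choices. Running it from a few different initial states and comparing the resulting empirical CDFs would answer the second question directly.

        library(deSolve)

        deriv <- function(t, s, p) {
          with(as.list(c(s, p)), {
            d   <- th1 - th2
            den <- 2 * m1 + m2 - m2 * cos(2 * d)
            a1  <- (-g * (2 * m1 + m2) * sin(th1) - m2 * g * sin(th1 - 2 * th2) -
                      2 * sin(d) * m2 * (w2^2 * L2 + w1^2 * L1 * cos(d))) / (L1 * den)
            a2  <- (2 * sin(d) * (w1^2 * L1 * (m1 + m2) + g * (m1 + m2) * cos(th1) +
                      w2^2 * L2 * m2 * cos(d))) / (L2 * den)
            list(c(w1, w2, a1, a2))   # derivatives of th1, th2, w1, w2
          })
        }

        pars  <- c(m1 = 1, m2 = 1, L1 = 1, L2 = 1, g = 9.81)
        state <- c(th1 = 2.0, th2 = 2.1, w1 = 0, w2 = 0)
        out   <- ode(y = state, times = seq(0, 500, by = 0.01),
                     func = deriv, parms = pars)

        th2w <- atan2(sin(out[, "th2"]), cos(out[, "th2"]))   # wrap to (-pi, pi]
        plot(ecdf(th2w), main = "Empirical CDF of theta2")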

    7 Comments
    2024/03/27
    21:30 UTC

    3

    [Q] Ordinal logistic regression or chi-square test: which is the more appropriate test for one's study?

    Hey! So I'm building a design for my study and have decided on one of these two methods based on my data. Which one do you believe is better suited as my test of choice if the following is true:

    I have 279 observations with a categorical nominal independent variable, and the dependent variable is the one on the ordinal scale. What the study wants to do is see whether we can trace any tendency or correlation showing that one's heritage plays a role in the conflicts we're more interested and/or engaged in. Therefore I plan to compare 5 different groups and their self-estimation of how much they care about an interest, to see if there are any significant differences in how interested they are in a foreign conflict.
    Hopefully I haven't forgotten to mention something very important...
    Thank you for reading, and I'm interested in what you guys think. :)
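
    A sketch of both options in R, assuming a data frame with a 5-level heritage factor and an ordered-factor interest rating (all names illustrative). The ordinal regression uses the ordering of the outcome; the chi-square test throws that information away, which is the usual argument for the former here.

        library(MASS)

        # dat$interest must be an ordered factor, dat$heritage a factor
        m0 <- polr(interest ~ 1, data = dat)
        m1 <- polr(interest ~ heritage, data = dat)
        anova(m0, m1)   # likelihood-ratio test for an overall heritage effect

        # The chi-square alternative, treating interest as purely nominal:
        chisq.test(table(dat$heritage, dat$interest))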

    2 Comments
    2024/03/27
    20:40 UTC

    2

    [Question] Comparing means of 2 groups: n1 and n2 known, variance/SEs unknown (individual data not provided)

    Hello!

    I am using a database that has presented me with this issue.

    I have a series of sample means, but not the individual data that was used to generate these means. To my understanding, the raw data is not accessible. I have the number of individuals used to generate each sample mean. Is there any way of comparing the means statistically when I have no way of assessing the variance within each group?

    5 Comments
    2024/03/27
    19:54 UTC

    3

    [R] Need some help with spatial statistics. Evaluating values of a PPP at specific coordinates.

    I have a dataset. It has data on two types of electric poles (blue and red). I'm trying to find out if the density and size of blue electric poles have an effect on the size of red electric poles.

    My data set looks something like this:

    x     y     type  size
    85    32.2  blue  12
    84.3  32.1  red   11.1
    85.2  32.5  blue
    ...   ...   ...   ...

    So I have the x and y coordinates of all poles, the type, and the size. I have separated the file into two for the red and blue poles. I created a PPP out of the blue data and used density.ppp() to get the kernel density estimate of the PPP. Now I'm confused about how to apply that density to the red poles data.

    What I'm specifically looking for is: around each red pole, what is the blue pole density, and what is the average size of the blue poles around it (using, say, a 10m buffer zone)? So my red pole data should end up looking like this:

    x     y     type  size  bluePoleDen  avgBluePoleSize
    85    32.2  red   12    0.034        10.2
    84.3  32.1  red   11.1  0.0012       13.8
    ...   ...   ...   ...   ...          ...

    Following that, I intend to run a regression on this red dataset.

    So far, I have done the following:

    • separated the data into red and blue poles
    • made a PPP out of the blue poles
    • used density.ppp to generate a kernel density estimate for the blue poles PPP
    • used the density.ppp result as a function to generate density estimates at each (x, y) position of the red poles, like so:

        library(spatstat)                  # ppp objects, density.ppp(), as.function.im()
        den <- density.ppp(blue)           # kernel density estimate (an "im" object)
        f <- as.function(den)              # interpolates the image at arbitrary points
        red$bluePoleDen <- f(red$x, red$y)

    Now I am stuck here, on what packages are available to go further with this in R. I would appreciate any pointers, and also corrections if I have done anything wrong so far.
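
    One hedged way to finish the buffer step, staying inside spatstat: crosspairs() lists every red-blue pair within a radius, which gives both the neighbour counts and the mean neighbour size. This assumes redppp and blueppp are the two point patterns and that the blue sizes were attached as marks.

        cp  <- crosspairs(redppp, blueppp, rmax = 10)   # red-blue pairs within 10 units
        idx <- factor(cp$i, levels = seq_len(npoints(redppp)))

        red$avgBluePoleSize <- as.numeric(tapply(marks(blueppp)[cp$j], idx, mean))
        red$nBlueWithin10   <- as.numeric(table(idx))

        # Red poles with no blue neighbour within 10 units come out NA;
        # decide explicitly how the regression should treat those rows.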

    1 Comment
    2024/03/27
    19:49 UTC

    3

    [Q] Help reporting failed MANOVA

    Hello, I'm currently doing my final year project for university. I was planning on doing a MANOVA but failed the assumption of homogeneity, so I resorted to using a one-way ANOVA for each dependent variable instead of a MANOVA, specifically a Welch's ANOVA. I'm just wondering how to report my results. Do I state every ANOVA separately, and how would I report Welch's ANOVA, or would I even need to, as I'm not doing a post hoc test due to the difference between groups not being significant? This is what I have so far (I also have another paragraph before this explaining the violation of the assumptions of the MANOVA):

    Results of the one-way ANOVA revealed there was a statistically significant difference between the groups for GASS personal (F(1, 87) = 17.243, p < .001). However, the remaining dependent variables showed no statistically significant differences between the group means as determined by a one-way ANOVA for social anxiety (F(1, 87) = .979, p > .001), GHSQ Personal/emotional (F(1, 87) = .002, p > .001), GHSQ Suicidal thoughts (F(1, 87) = .143, p > .001), GHSQ total (F(1, 87) = .048, p > .001) and GASS Perceived stigma (F(1, 87) = .146, p > .001). A large difference in mean scores was seen between the groups for GASS personal, whereas the remaining dependent variables displayed a small difference between the means of the groups. A statistically significant difference in the Welch ANOVA was also only demonstrated by GASS personal, F(1, 72.17) = 18.36, p < .001. Due to the non-significant difference found between the groups, a post-hoc test will not be run and instead a bivariate analysis will be conducted.

    4 Comments
    2024/03/27
    15:31 UTC

    2

    [Q] BayesTraits for continuous data

    I'm reading a paper that uses the BayesTraits random walk (Model A) for continuous data, and a question arises. Reading through the manual, there is no discussion of what the likelihood actually is, or of what assumptions are made about the continuous data. The paper in question has values ranging from 0-1, but my best guess is that BayesTraits assumes normally distributed data. I tried reading the source code, but it is uncommented and I can't find what I'm looking for. Does anyone have any idea? Thanks!

    0 Comments
    2024/03/27
    12:17 UTC

    5

    [Q] How do I "prove" that a formula explains the results

    I have recently gone back to university to do a graduate diploma after over half a decade working in hospitality. I had a science double-major background as well as a strong first-year math/stats record, but I can't seem to bloody remember what to do. I've just started, so I'm only on first and second year level papers.

    Writing a lab report for the first time in a long time is a bit of a whiplash. It's only worth 5%, and I'm probably overthinking this and it's not even necessary, but:

    Let's say you did an experiment. You have the control, which is A, and the experiment, which is B. There is an obvious difference, so you do a simple t-test to reject the null (which it does). But this being an earlier course, it's on a topic that is widely studied and has a formula that predicts the outcome. How do you PROVE that the formula explains the difference with statistical significance? I thought to do a t-test with the formula applied to A vs B, but it obviously just shows a p-value of >0.05, which in hindsight was obvious: since a t-test can only reject a null, it can't confirm an alternative. So now I'm stumped, looking through previous lab reports/notes and looking up random "buzzwords" like ANOVA, but to no avail.

    Is there a statistical analysis to "confirm" that my data is explained by a researched formula, or is the best I can do "the results appear to be consistent with research done by Z"?
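
    A non-significant t-test can't "confirm" the formula; equivalence testing (e.g., TOST) is the formal tool for that. A simpler first check, sketched in base R under the assumption that you can compute the formula's prediction for each measured condition (observed and predicted are placeholder vectors): regress observed on predicted and see whether the line is close to slope 1, intercept 0.

        calib <- lm(observed ~ predicted)
        summary(calib)             # slope near 1, intercept near 0 supports the formula
        confint(calib)             # do the CIs cover 1 (slope) and 0 (intercept)?
        summary(calib)$r.squared   # how much of the variance the formula explains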

    4 Comments
    2024/03/27
    10:47 UTC

    1

    [Q] Help confirming logic of combining results from two subgroups

    Hi - I made a second post because I realized the previous one was wrong.

    So, just checking if I'm right. Let's say I have this:

    • Population X: option A obtained 20%, and I know population X is 100.
    • Population Y: option A obtained 15%, and I know population Y is 200.
    • So for X, 0.20*100 = 20, and for Y, 0.15*200 = 30.
    • People who chose A for X + Y = 50, and the combined population is 100+200 = 300. That means option A: 50/300 ≈ 16.7%.

    This is NOT homework - I know it looks like a simple textbook question, but it is not. What I'm interested in is: how can I translate this into Excel if I have a database for populations/samples X and Y?
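
    The combination is just a weighted mean of the subgroup shares, so in Excel it's one formula, e.g. =SUMPRODUCT(B2:B3,C2:C3)/SUM(C2:C3) with the shares in column B and the group sizes in column C (a hypothetical layout). The same check in R:

        p <- c(X = 0.20, Y = 0.15)
        n <- c(X = 100,  Y = 200)
        weighted.mean(p, n)   # 0.1667, identical to 50/300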

    1 Comment
    2024/03/27
    04:02 UTC

    0

    [Q] Help confirming logic of combining results from two subgroups

    Hi!
    So, just checking if I'm right. Let's say I have this:

    Population X: option A obtained 20%, and I know population X is 100.

    Population Y: option A obtained 15%, and I know population Y is 200.

    So for X, 0.20*100 = 20, and for Y, 0.15*200 = 30.

    People who chose A for X + Y = 50, and the combined population is 100+200 = 300. That means option A: 50/300 ≈ 16.7%.

    This is NOT homework - I know it looks like a simple textbook question, but it is not.

    What I'm interested in is: how can I translate this into Excel if I have a database for populations/samples X and Y?

    9 Comments
    2024/03/27
    02:35 UTC

    1

    [Q] Community Comparisons with Small Sample Sizes

    Hello all, I am preparing a master's thesis and need some assistance with a statistical analysis approach.

    My project involves culturing communities of microbes from three specific areas under various conditions. At the end, once I have identified the members of each community, my goal was to explore the differences between them. This would be fairly simple if I had a decent sample size, but I know my total number of samples will be quite low so I am not sure how to proceed while still maintaining statistical integrity. My professor has specifically requested that I decide on an approach to interpret the differences between the sites, so he clearly expects me to be able to achieve at least some analysis with my data.

    I currently have 19 sequences, representing only 5 species. I have another 13 sequences which have not yet been identified, so potentially up to 18 species at the absolute maximum but likely far fewer.

    Similar analyses of community comparisons use ACE, Chao, or the Shannon diversity index, but according to my understanding (which is limited; statistics is not my strong suit), these all seem ill-suited to the data that I have.

    Is there any approach that would be useful or even feasible in this case?

    1 Comment
    2024/03/26
    22:58 UTC

    102

    [Q] I was told that classic statistical methods are a waste of time in data preparation, is this true?

    So I sent a report analyzing a dataset; I used the z-score method for outlier detection, regression for imputing missing values, ANOVA/chi-squared for feature selection, etc. Generally these are the techniques I use for preprocessing.

    Well, the guy I report to told me that all this stuff is pretty much dead, and gave me some links for isolation forests, multiple imputation and other ML stuff.

    Is this true? I'm not the kind of guy to go and search for advanced techniques on my own (analytics isn't the main task of my job in the first place), but I don't like using outdated stuff either.
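
    For reference, the "classic" flag is a one-liner, sketched below on a hypothetical numeric data frame; the model-based alternatives the reviewer is pointing at live in well-established packages (e.g., isotree for isolation forests, mice for multiple imputation), so the real question is which assumptions fit the data, not old vs. new.

        z <- scale(df_numeric)        # column-wise z-scores
        flags <- abs(z) > 3           # TRUE where the z-method calls an outlier
        colSums(flags, na.rm = TRUE)  # outlier counts per column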

    63 Comments
    2024/03/26
    20:43 UTC

    7

    I'm having some difficulties with Bayesian statistics [Q]

    I don't mean the math of it. I mean the intuition: how is it used in actual real-world problems?

    For example, let's say you have three dice in a box: one is six-sided, the second is eight-sided, and the third is twelve-sided. You pick one at random and roll it, and it comes up 1. What's the probability that the selected die is the six-sided one?

    From here the math is simple, and getting the prior distribution and the posterior one is also simple: we start by treating each die as a hypothesis under a uniform prior, where each die has an equal chance of being selected. But what does UPDATING THE POSTERIOR DISTRIBUTION mean? How is that used in anything? It makes no sense to me, to be honest.
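
    The dice example worked numerically in R; the point of "updating" is just that yesterday's posterior becomes today's prior when the same die is rolled again.

        prior      <- c(d6 = 1, d8 = 1, d12 = 1) / 3
        likelihood <- c(d6 = 1/6, d8 = 1/8, d12 = 1/12)   # P(roll a 1 | die)

        posterior <- prior * likelihood / sum(prior * likelihood)
        posterior   # d6: 4/9, d8: 3/9, d12: 2/9

        # Roll the same die again and see another 1? Update once more:
        posterior2 <- posterior * likelihood / sum(posterior * likelihood)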

    If you know a good resource for this please hit us with it in the comments

    12 Comments
    2024/03/26
    20:17 UTC

    0

    [Q] Low r and high p - I don't know how to interpret

    Hi all! Noob in statistics here. I am confused about how to interpret my data. My sample size is small (n = 14) and I am getting a high p-value, but my r = 0.03. Can I say that there is no correlation? Or can I not say that, because the null hypothesis cannot be rejected?
    I am a geologist; we rarely get strong correlations, as nature is basically unpredictable. Because lab work is very time-consuming and expensive, I can't increase the sample size.
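
    One way to see why neither claim is safe at n = 14: the confidence interval for r is enormous, so the data are compatible with anything from a strong negative to a strong positive correlation; you can only say there is no evidence of one. A quick simulated illustration in base R:

        set.seed(1)
        x <- rnorm(14)
        y <- rnorm(14)
        cor.test(x, y)   # the 95% CI typically spans much of (-1, 1)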

    10 Comments
    2024/03/26
    19:53 UTC

    1

    [Q] Identifying if one group has a better numerical response to intervention than the other

    Hi, I've got a dataset of, say, 100 patients with measured haemoglobin (Hb). We've given them an intervention (iron) and measured Hb again at 6 months. The dataset as a whole shows an increase in Hb, which is clearly demonstrated in a box-and-whisker plot.

    What I want to do is compare sub-groups within the dataset: men vs women, different age groups, or whatever. I'm struggling to find a way to do this. I've tried doing box-and-whisker plots of the different groups, but they are hard to interpret (although they appear to show heterogeneity between the groups, which is an interesting finding!). Is there a numerical way of modelling or describing this? My worry is that I don't have enough data for this to be statistically significant and I'm just reading noise.
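
    A minimal sketch, assuming a data frame with baseline and 6-month values (d, hb0, hb6, sex, age_group are all hypothetical names): compare the change scores across subgroups, which is the numerical version of comparing those box plots.

        d$change <- d$hb6 - d$hb0

        t.test(change ~ sex, data = d)        # two-subgroup comparison
        summary(lm(change ~ sex, data = d))   # same idea; add covariates as needed

        # With more than two subgroups (age bands, etc.):
        summary(aov(change ~ age_group, data = d))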

    5 Comments
    2024/03/26
    19:40 UTC

    43

    [D] To-do list for R programming

    Making a list of intermediate-level R programming skills that are in demand (borrowing from a Principal R Programmer job description posted for Cytel):
    - Tidyverse: competent with the following packages: readr, dplyr, tidyr, stringr, purrr, forcats, lubridate, and ggplot2.
    - Create advanced graphics using the ggplot() and plot_ly() functions.
    - Understand the family of purrr functions to avoid unnecessary loops and write cleaner code (see the sketch below).
    - Proficient with the Shiny package.
    - Validate sections of code using testthat.
    - Create documents using the R Markdown package.
    - Coding R packages (more advanced than intermediate?).
    Am I missing anything?
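
    On the purrr bullet, a small flavour of what "avoid unnecessary loops" means in practice (mtcars is just the built-in demo data):

        library(purrr)

        means <- map_dbl(mtcars, mean)        # one summary per column, no for-loop
        slopes <- mtcars |>
          split(mtcars$cyl) |>
          map(~ lm(mpg ~ wt, data = .x)) |>   # one model per cylinder group
          map_dbl(~ coef(.x)["wt"])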

    33 Comments
    2024/03/26
    19:18 UTC

    1

    [Q] Question about Bayes formula usage

    I know Bayes' formula isn't anything crazy, but I'm struggling to understand how my textbook explains using it. I've mostly got down how the formula works, but in this example I don't understand why there is a need to differentiate between accident-prone and non-accident-prone drivers. Why is the probability not .6? Is it because the different drivers don't accurately reflect the entire population?

    2 Comments
    2024/03/26
    18:24 UTC

    1

    [Q] Causal Inference for sets of Time Series data

    I have multiple measurements, all of which are time series. I am interested in understanding whether Signal Quality (SQ) affects the latency between two devices. I have 5 samples of both SQ and latency under high SQ, 5 samples of SQ and latency under low SQ, 1 sample under increasing SQ, and 1 sample under decreasing SQ.

    I know that I can use Vector Autoregression to understand whether fluctuations in SQ impact latency, within the same test. However, I am also interested in finding out whether latency is impacted in some way when the SQ is high vs. low (this is across different tests, not within the same test).

    Technically, I can do a t-test where I take the mean/stddev of latency across the 5 samples and test for statistical significance under high and low SQ. However, I want to preserve the time series properties of both of the metrics. I'd also like to use the increasing and decreasing samples to help prove my hypothesis, since I have them. Does anyone have ideas about what statistical tools I can use to accomplish this?
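
    For the within-test part, a sketch of the VAR/Granger workflow in R's vars package, assuming a data frame with SQ and latency columns sampled at a constant rate (names illustrative). For the high-vs-low comparison across tests, one hedged option is a mixed model with the test-level condition as a fixed effect and test as a random effect, which keeps more structure than collapsing each series to a mean.

        library(vars)

        y   <- ts(cbind(SQ = dat$SQ, latency = dat$latency))
        sel <- VARselect(y, lag.max = 20, type = "const")    # choose a lag order
        fit <- VAR(y, p = sel$selection["AIC(n)"], type = "const")

        causality(fit, cause = "SQ")   # Granger test: does SQ help predict latency?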

    3 Comments
    2024/03/26
    17:34 UTC

    44

    It feels difficult to get a grasp on Bayesian inference without actually "doing" Bayesian inference [Q]

    I'm an MS stats student who's taken Bayesian inference in undergrad, and I will now be taking it in my MS. While I like the course, I find that these courses have been more on the theoretical side, which is interesting, but I haven't even been able to do a full Bayesian analysis myself. If someone told me to derive the posterior for various conjugate models, I could do it. If someone told me to implement said models using rstan, I could do it. But I have yet to be able to take a big unstructured dataset, calibrate priors, calibrate a likelihood function, and make some hierarchical mixture model or other more "sophisticated" Bayesian models. I feel as though I don't get a lot of experience doing Bayesian analysis. I've been reading BDA3, roughly halfway through it now, and while it's good, I've had to force myself to go through the Stan manual to learn how to do this stuff practically.

    I’ve thought about maybe trying to download some kaggle datasets and practice on here. But I also kinda realized that it’s hard to do this without lots of data to calibrate priors, or prior experiments.

    Does anyone have suggestions on how they got to practice formally coding and doing Bayesian analysis?

    23 Comments
    2024/03/26
    17:24 UTC

    1

    [Q] Would a statistics undergrad be beneficial for an undecided masters?

    For context, I've been majoring in CS because I wanted to see if I would enjoy it, but I've found that I really hate coding. I don't really know what I want to major in now, but I have thought about switching to statistics.

    My goal is to return to the army as a commissioned officer, so my reasoning is that a BS in Statistics would be more beneficial if I were to pursue a different Masters later on in my career once I figure out what I want to study, if that makes sense.

    For those of you who got an undergrad in statistics, would you say this is a good idea? I'm at a crossroads here as I don't really know what I want to study, but statistics may be a solid choice.

    15 Comments
    2024/03/26
    15:18 UTC

    1

    [Q] Yates continuity

    Hello, so I have a question about the Yates continuity correction. The only things we use it for are chi-square tests of goodness of fit, independence, and homogeneity, and McNemar's test.

    So I was wondering: what is the correction amount for homogeneity? Some people in my class say it's 0.5 and some say it's 1.

    That was my question.
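
    For what it's worth, the Yates correction subtracts 0.5 in the 2x2 chi-square applications (independence and homogeneity are the same computation); the "1" your classmates remember is likely McNemar's continuity correction, which uses |b - c| - 1. Both are visible in base R:

        tab <- matrix(c(12, 5, 7, 15), nrow = 2)   # made-up 2x2 counts

        chisq.test(tab, correct = TRUE)    # Yates: sum((|O - E| - 0.5)^2 / E)
        chisq.test(tab, correct = FALSE)   # uncorrected, for comparison
        mcnemar.test(tab)                  # corrected: (|b - c| - 1)^2 / (b + c)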

    3 Comments
    2024/03/26
    09:51 UTC

    3

    [Q] What is the base for this log transformation?

    Hi all,

    I am trying to extract some data from Guillermo 2017 (Perceiver- and Stimulus-Driven Effects on Preferential Attention to Racial Outgroup Faces) and have been slamming my face against this paper for hours.

    The paper says that it log-transformed the mean reaction time values for its analysis, but it doesn't specify the base. Using base 10 or e gives me a number that seems too small (I am expecting a number from 100-1000 ms).

    Here is an example:

    "Next, we analyzed our primary predictions. First, to assess whether the magnitude of attention differed based on Race, we tested the Race X Validity effect. The Race X Validity interaction was not significant, F(1, 159) = 0.00, p<0.981, η2p = 0.000, offering no evidence that attention to Black faces (M = 0.0573, SD = 0.0877) was greater than attention to White faces (M = 0.0570 SD = 0.0862)."

    What am I doing wrong?
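
    A scale check in R suggests one possible reading: the reported means look far too small to be mean log RTs in any base, but they are plausibly differences of log RTs (the validity effect), in which case back-transforming yields a ratio rather than milliseconds. This is a guess about the paper, not something it states.

        log10(c(100, 1000))   # mean log10 RTs should land around 2-3
        log(c(100, 1000))     # natural-log RTs around 4.6-6.9

        10^0.0573             # ~1.14: a ~14% RT difference, if base 10 was used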

    4 Comments
    2024/03/26
    05:52 UTC

    6

    [C] Looking for Feedback on the Hiring Manager. Is this a standard interaction or am I being pulled around?

    Hey everyone,

    I'm still a little new to the corporate field. I'm still in my first job, as a baby data analyst. Coming up on ~2 years in this position, I'm ready to move on. The hiring process turnaround was fast-ish compared to what I'm working through now. I breezed through my interviews for my current position, but I'm having trouble getting through the texting phase in current interviews.

    My most recent interaction with a hiring manager rubbed me a little wrong. I feel like my time may not have been respected. I'm looking to see if anyone else has had a similar experience lately. I've copy/pasted my email chain, minus identifying information:

    Received 2024-03-24 8:21am

    Greetings! I hope you're doing well. I came across your information from the job posting for the remote job position of DATA ANALYST on [COMPANY NAME] on [Some Job aggregator idk]. I am delighted to inform you that our team has thoroughly reviewed your resume and we are highly impressed with your qualifications. Kindly inform me of your availability for a virtual interview. I eagerly await your response.

    Warm regards,
    Hiring Manager [henceforth HM]
    Sent from my iPhone

    Sent 2024-03-24 9:18 pm

    Hi HM,

    Thanks for finding my resume in the pile. I'd appreciate the opportunity to interview for this position. I'm freest Tuesday afternoon; anything after lunch would work (I'm based in [my timezone] or [my timezone but UTC offset]). Otherwise, I've got Wednesday before 11:00, Thursday afternoons, and Friday afternoons. Let me know if something in those blocks works for you.
    Thanks,
    AntiLoquacious

    Received 2024-03-24 10:14 pm

    Monday 12pm to 1pm is very okay by me. I'll be looking forward to your text at the scheduled time please be punctual. Have a wonderful day!

    Sent from my iPhone

    Sent 2024-03-25 09:17 am

    Sorry, HM. Monday isn't a day that I had listed in my previous email. Did you mean to pick a different day, or is Monday the only time you had available?

    Also, I don't think I have your phone number to text. I would definitely text you if I receive your number, but, lacking that, my number is [My personal cell].
    Thanks,
    AntiLoquacious

    Sent 2024-03-25 11:56 am

    Hi HM,

    As the time you've provided is in 5 minutes, would you have a phone number to provide that I could text?
    Thanks,
    AntiLoquacious

    Received 2024-03-25 12:48 pm

    Hello 👋AntiLoquacious are you ready complete your application

    Sent from my iPhone

    Man, that emoji gets me. And a response 45 min late, to a time I didn't agree to. My Mondays aren't free because I have meetings with my manager at the start of the week. I just got lucky that my manager called in sick this morning. The emails go on after this. It looks like the next step is a text interview (not some application?).

    Does anyone think this could be indicative of company culture? Maybe a bit of a sloppy hiring manager?

    12 Comments
    2024/03/26
    01:35 UTC

    7

    [Q] Running the same (and different) tests in different programs, getting different results. Why and which do I trust?

    Chemist here (statistics scary)... I have been working up some complex data from a large experiment. I got some advice here on how to test it. I was using Excel but felt it was running out of power for what I wanted to do, so I switched to R. I re-ran some of my earlier basic tests from Excel in R, along with some new ones, and the results are different. Now I don't know what to think. I'll lay it out as simply as I can (and I'll put * by significant p-values)...

    Basic setup: I'm comparing results from 4 groups; 3 different treatments (1, 2, 3) and a Control (CON).

    Starting with basic ANOVAs, I get exactly the same results in Excel and R. I did two different ANOVAs: all 4 groups (1, 2, 3, CON), p=0.0707, and one with just the treatments (1, 2, 3), p=0.0422*.

    Originally in Excel, I then did two-sample t-tests with equal variances (I checked that variances were equal with an F-test first). I did this in Excel by running 6 different t-tests to compare all 4 groups with each other. I got insignificant results for all but two of them: (CON,1), (CON,2), (CON,3), and (1,3) all have p-values greater than 0.05. The two significant ones: (1,2) p=0.02903* and (2,3) p=0.03953*.

    But if I do this in R with t.test, I get slightly different p-values for the two significant results: (1,2) p=0.0301* and (2,3) p=0.03978*.

    I know that is just a slight difference. But is there a significant reason? And should I trust one more than the other?

    Further: I learned that I should actually be doing a Tukey test for this. So I did that in R using HSD.test from agricolae. I again did it two times, the same as the ANOVAs above: (1, 2, 3) and (1, 2, 3, CON). But in both cases I get no significant difference between any groups... just "a"s for all groups.

    Now I recognize that the Tukey test works slightly differently from the normal t-test. But what I don't get is how, in R, the ANOVA can be significant, (1, 2, 3) p=0.04224*, but the Tukey test for that same group is not.

    So that's where I stand: confused and not sure which results to use. I would very much like to say that some are different... it would be great to use the t-test results for (1,2) and (2,3)... but I am not sure if I justifiably can. And if it matters: the means for 1 and 3 are both lower than 2, which makes sense and is expected given the experiment. I just don't know if I can say so statistically.

    Thank you very much. Confused Chemist out.
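
    Two grounded observations, with a base-R sketch below (the data are simulated stand-ins). First, R's t.test() defaults to the Welch unequal-variance test (var.equal = FALSE), unlike Excel's equal-variance option, which is the usual source of exactly this kind of small p-value discrepancy. Second, Tukey's HSD adjusts for all six pairwise comparisons at once, so unadjusted p-values of ~0.03-0.04 can quite legitimately become non-significant even when the overall ANOVA is significant.

        set.seed(42)   # hypothetical stand-in data
        dat <- data.frame(
          y   = c(rnorm(6, 10), rnorm(6, 12), rnorm(6, 10.5), rnorm(6, 11)),
          grp = factor(rep(c("1", "2", "3", "CON"), each = 6))
        )

        fit <- aov(y ~ grp, data = dat)
        summary(fit)      # the ANOVA
        TukeyHSD(fit)     # adjusted pairwise comparisons

        # Closest to the Excel workflow (equal-variance, no adjustment):
        pairwise.t.test(dat$y, dat$grp, p.adjust.method = "none",
                        pool.sd = FALSE, var.equal = TRUE)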

    12 Comments
    2024/03/25
    06:17 UTC

    0

    [Q] How to transform this chess database?

    The database has entries like this, with each one being a full chess game:

    ['e2e4', 'g8f6', 'd2d4', 'g7g6', 'c2c4', 'f8g7', 'b1c3', 'e8g8', 'e2e4', 'd7d6', 'f1e2', 'e7e5', 'e1g1', 'b8c6', 'd4d5', 'c6e7', 'c1g5', 'h7h6', 'g5f6', 'g7f6', 'b2b4', 'f6g7', 'c4c5', 'f7f5', 'f3d2', 'g6g5', 'a1c1', 'a7a6', 'd2c4', 'e7g6', 'a2a4', 'g6f4', 'a4a5', 'd6c5', 'b4c5', 'f5e4', 'c4e3', 'c7c6', 'd5d6', 'c8e6', 'c3e4', 'd8a5', 'e2g4', 'e6d5', 'd1c2', 'a5b4', 'e4g3', 'e5e4', 'c1b1', 'b4d4', 'b1b7', 'a6a5', 'g3f5', 'f8f5', 'e3f5']

    e2e4 means the piece on e2 (the pawn) moved to e4. The problem is, I have no way of knowing which piece is moving where. For example, "g7h8" means the piece on g7 moved to h8, but unless I run all the previous moves I have no way of knowing which piece that is.

    How can I transform this into a more understandable dataset?

    I'm not sure this is the right sub to ask this; if it isn't, I'd appreciate it if you could tell me where to ask.

    PS: I've checked the chess library in Python but I haven't found anything.

    12 Comments
    2024/03/25
    01:48 UTC
