/r/AskStatistics


Ask a question about statistics (other than homework).

Posts must be questions about statistics. The sub is not for homework or assessment help (try /r/HomeworkHelp). No solicitation of academic misconduct. Don't ask people to contact you externally to the subreddit. Use informative titles.

See the rules.

If your question is "what statistical test should I use for this data/hypothesis?", then start by reading this and ask follow-ups as necessary. Beware: it's an imperfect tool.

If you answer questions, you can assign your own flair to briefly describe your educational or professional background in statistics.

105,373 Subscribers

1

SPSS Moderation Analysis

My study goes like this:
Independent variable: Level of Knowledge, composed of three parameters measured on a 4-point Likert scale; the three parameters were combined (mean) using the Compute Variable function in SPSS.
Dependent variable: Level of Compliance, with five questions measured on a 4-point Likert scale; the results were likewise combined (mean) using the Compute Variable function in SPSS.
Moderating variable: Completion of a Training Course, measured as whether or not the respondent has completed a certain training course.

My study wants to know the correlation between the IV and the DV, and whether the MV strengthens or weakens that correlation. Can someone advise me on how to conduct my moderation analysis in SPSS?
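
A minimal sketch of the model behind a moderation analysis (SPSS's PROCESS Model 1 fits the same thing): regress the DV on the mean-centered IV, the binary moderator, and their product, then read the interaction coefficient. Shown in Python for concreteness; all column names are hypothetical.

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("survey.csv")  # columns: knowledge, compliance, trained (0/1)
    df["knowledge_c"] = df["knowledge"] - df["knowledge"].mean()  # center the IV

    # the knowledge_c:trained coefficient tests whether training moderates the effect
    model = smf.ols("compliance ~ knowledge_c * trained", data=df).fit()
    print(model.summary())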

0 Comments
2024/12/01
06:18 UTC

2

Is it possible to estimate how a specific group voted based on data of less specific groups?

For example, in a US election, is it possible to estimate the Dem-Rep vote ratio of the group "White Catholic Males" when I have the Dem-Rep ratio of each of the following groups, and also the size of each group? (One possible approach is sketched after the list.)

  • White
  • Catholic
  • Male
  • White Catholic
  • White Male
  • Catholic Male
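
The three-way cell is not identified from these margins alone, but under a maximum-entropy (no-higher-order-interaction) assumption, iterative proportional fitting gives a standard estimate: scale a seed table until it matches every known group size and Dem share, then read off the cell of interest. A rough Python sketch; every number below is invented.

    import numpy as np

    # axes: white (0/1), catholic (0/1), male (0/1), dem_vote (0/1)
    SHAPE = (2, 2, 2, 2)
    coords = np.indices(SHAPE)
    table = np.ones(SHAPE)
    TOTAL = 10_000  # hypothetical electorate size

    # (white?, catholic?, male?), group size, Dem share -- all numbers invented
    constraints = [
        ((1, None, None), 6000, 0.45),  # White
        ((None, 1, None), 2500, 0.48),  # Catholic
        ((None, None, 1), 4800, 0.44),  # Male
        ((1, 1, None),    1800, 0.42),  # White Catholic
        ((1, None, 1),    2900, 0.40),  # White Male
        ((None, 1, 1),    1200, 0.43),  # Catholic Male
    ]

    def cells(mask, vote):
        idx = coords[3] == vote
        for axis, want in enumerate(mask):
            if want is not None:
                idx &= coords[axis] == want
        return idx

    for _ in range(500):  # IPF sweeps until the margins (roughly) agree
        table *= TOTAL / table.sum()
        for mask, size, dem in constraints:
            for vote, target in ((1, size * dem), (0, size * (1 - dem))):
                c = cells(mask, vote)
                table[c] *= target / table[c].sum()

    wcm = table[1, 1, 1, :]  # White Catholic Male cells
    print("estimated Dem share:", wcm[1] / wcm.sum())
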
0 Comments
2024/12/01
03:39 UTC

2

Manufacturing Data Reality. What do these datasets typically look like?

Hey Guys,
So I have an interview coming up with a food manufacturing company, and they are going to give me a case study in Excel to work on. The job description focuses on:
recognizing trends and patterns, utilizing large live and historical data,
forecasting, and drawing hypotheses, e.g. investigating sugar levels in a candy.

Does anyone here work in manufacturing (or better, food manufacturing) and can help give me an idea of what a typical dataset looks like?

I would love to start practising on some fake datasets. I asked ChatGPT, but it isn't giving the most realistic datasets.

Any help is much appreciated!!
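
In the meantime, a sketch of a self-generated stand-in to practise on: timestamped batch records with process settings and a quality measurement, which is the general shape such data tends to take. Every column name and number here is invented.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    n = 2000
    df = pd.DataFrame({
        "timestamp":   pd.date_range("2024-01-01", periods=n, freq="30min"),
        "line":        rng.choice(["A", "B", "C"], n),
        "batch_id":    np.arange(n) // 10,
        "cook_temp_c": rng.normal(118, 2.5, n),
        "mix_time_s":  rng.normal(300, 15, n),
        "shift":       rng.choice(["day", "night"], n),
    })
    # make the quality outcome depend on the process, plus noise and a slow drift
    df["sugar_pct"] = (
        62 + 0.15 * (df["cook_temp_c"] - 118)
           - 0.01 * (df["mix_time_s"] - 300)
           + np.linspace(0, 1.5, n)          # equipment/seasonal drift
           + rng.normal(0, 0.8, n)
    )
    df.to_csv("fake_candy_line.csv", index=False)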

1 Comment
2024/12/01
03:06 UTC

1

Can you use data from a hyperbolic function in a correlation equation?

Hello, I have to write a statistics paper for my undergrad class and something I'm considering doing is correlating Delay Discounting with some other trait obtained through a Likert Scale.

My problem with Delay Discounting is that individual values obtained through its standard protocol follow a hyperbolic curve. I'm a bit wary about whether this would do something funky with the correlation, especially since I think the other data set will follow a very different pattern. Is it a-okay to use something like this? Am I misapplying something? Or is there some property to correlation equations that would prohibit me from using that kind of data?

I'm using the formulas from this paper and then calculating the geometric mean at the point of indifference:

https://www.cambridge.org/core/journals/psychological-medicine/article/temporal-discounting-in-major-depressive-disorder/6E097CAFD29115C260827A88A89A7F81
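
One hedged idea, assuming the plan is a per-participant discount rate: the hyperbolic shape lives inside each participant's own curve, V = A / (1 + kD); what enters the correlation is one k (or log k) per person, and a rank-based (Spearman) correlation is insensitive to the skewed scale k values typically have. A sketch with invented numbers:

    import numpy as np
    from scipy.optimize import curve_fit
    from scipy.stats import spearmanr

    def hyperbolic(delay, k, amount=100.0):
        return amount / (1.0 + k * delay)

    delays = np.array([1, 7, 30, 90, 365], dtype=float)   # days
    indiff = np.array([95, 88, 70, 52, 25], dtype=float)  # one participant, hypothetical

    k_hat, _ = curve_fit(hyperbolic, delays, indiff, p0=[0.01])
    print("estimated k:", k_hat[0])

    # across participants: correlate log(k) with the Likert trait score
    log_k = np.log([0.004, 0.02, 0.11, 0.007, 0.05])  # hypothetical
    likert = [12, 18, 25, 14, 21]                     # hypothetical
    rho, p = spearmanr(log_k, likert)
    print(f"Spearman rho={rho:.2f}, p={p:.3f}")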

1 Comment
2024/11/30
22:43 UTC

1

Calculating confidence intervals

In an instrumental variable study, I have the point estimate of the outcome (a health outcome) and the corresponding confidence interval. I also have the mean of the instrument and the range (number of health worker visits in a year in a given population). I have to calculate how the confidence interval would change given a change in the instrument. That is, how the health outcome would change if the number of visits by the health worker changed. Can someone please guide me on how I can calculate this?

0 Comments
2024/11/30
20:25 UTC

1

Help regarding research analysis.

I’m using SPSS and I’m unsure of how to proceed with my analysis.

I’m using a national inpatient hospital database (NIS) to look at how a specific procedure's volume changed pre vs. post COVID. I’ve already combined the years I’m looking at (2018-2021), filtered the data for only the procedure code I’m interested in, introduced a time-period variable (2018/2019 = 1, 2020/2021 = 2), and weighted my cases by the “discharge weight” variable to represent population estimates. At this point, each row is basically a count for the procedure.

Now I’m stuck and don’t know what kind of statistical analysis I should be doing and what variables to use. I’ve played around with an independent t-test using time period × discharge weights, thinking that each row × discharge weight = estimate of procedures, but I’m not really sure if that’s right.

I’d appreciate it if someone could please advise me on this.
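
A hedged sketch of the aggregation logic: with discharge weights, the national volume estimate is the sum of weights, not a mean across rows, so a t-test over rows isn't the natural fit; one simple option is to compare the estimated period counts as rates (the fully rigorous route would use the NIS survey design, i.e. strata and PSU, for the variance). Column names are hypothetical.

    import pandas as pd
    from statsmodels.stats.rates import test_poisson_2indep

    df = pd.read_csv("nis_procedure_rows.csv")   # one row per sampled discharge
    yearly = df.groupby("year")["discwt"].sum()  # weighted national estimate per year
    print(yearly)

    pre = yearly.loc[[2018, 2019]].sum()
    post = yearly.loc[[2020, 2021]].sum()
    # rate comparison: estimated counts over equal 2-year exposures
    res = test_poisson_2indep(int(pre), 2, int(post), 2)
    print(res)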

6 Comments
2024/11/30
20:08 UTC

1

How do I analyse questionnaire results?

Hey,

While working on my thesis, I released a questionnaire to the public. It has 30 questions, with single-choice or open answers. 140 people responded, and I was not expecting that many replies.

I've exported the results into Excel, which left me with quite messy sheets. The first row is the question, the second is the possible answers, then all the respondents. Every possible answer is a separate column, and an answer is marked by a 1 in a cell, leaving all others empty.

My mentor said that I should use basic descriptive analysis, 95% confidence levels, and chi-square with df. And that's where I ran into issues.

So when simplifying the answers, I get, for example: Question 3, 17 chose A, 33 chose B, 89 chose C, 1 did not respond. I'm trying to use Excel's data analysis functions, but I keep getting errors. I tried looking on YouTube for help, but in every video they are using those tools on 10+ numbers, not just 2-5 like in my case.

What am I doing wrong? Did I misunderstand my mentor and need to do a different kind of analysis? I know for sure they mentioned CIs and chi-square.

I also tried using SPSS and R, but I couldn't even import the data properly, lol.

Any tips will be greatly appreciated!!
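
For what it's worth, with counts like the Question 3 example, the chi-square and the confidence interval the mentor mentioned take only a few lines outside Excel; a sketch in Python:

    from scipy.stats import chisquare
    from statsmodels.stats.proportion import proportion_confint

    observed = [17, 33, 89]        # A, B, C (the one non-response left out)
    chi2, p = chisquare(observed)  # default: all answers equally likely, df = k - 1
    print(f"chi2={chi2:.2f}, p={p:.4f}")

    # 95% confidence interval for the share choosing C
    lo, hi = proportion_confint(89, 17 + 33 + 89, alpha=0.05)
    print(f"P(C) 95% CI: [{lo:.3f}, {hi:.3f}]")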

6 Comments
2024/11/30
19:01 UTC

1

Preprocessing two different kinds of datasets for a machine learning problem

I am working on two health-related datasets. And I use Python.

  • One tabular dataset (called A) contains patient-level information (by id) and a bunch of other features which I have already transformed and cleaned. This dataset has around 3000 rows. The dataset contains labels (y) for a classification problem.
  • The other data is a collection of dataframes. Each dataframe represents time-series data on a particular patient (by id also). There are around 1000 dataframes (only 1000 patients have available information on this time-series data).

My methods so far:

  • For the collection of dataframes, for each dataframe/patient id, I selected only the mean, median, max, and min for each column, then transformed each dataframe into a single row of data, for example: "patient_id", "min_X", "max_X", "median_X", "mean_X", instead of the lengthy timestep-level dataframe. Do you think this is a good way to preserve key information about the time-series data? Otherwise, I'm thinking of a machine learning model to select the time-series features, but I'm not sure how to do so.
  • Now I have this single dataframe (called B) of patient-level time-series features and want to join it with the first cleaned dataframe (A), but the rows are mismatched: A has 3000 rows but B has only 1000. The patient ids of B are a subset of the patient ids of A. I don't know how to deal with this. I'm thinking of just using the 1000 rows of B and left-joining A, but would that be a lot of data loss? (A sketch of both steps follows the list.)
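
A sketch of both steps in pandas, with hypothetical stand-ins for the real data: the summary-statistic compression is a common, reasonable baseline, and a left join from A keeps all 3000 labeled rows, with a flag marking who has time-series data, rather than discarding 2000 rows.

    import pandas as pd

    # ts: long dataframe stacking every patient's time series; A: tabular dataset
    ts = pd.DataFrame({"patient_id": [1, 1, 2], "X": [0.5, 0.9, 1.2]})
    A = pd.DataFrame({"patient_id": [1, 2, 3], "label": [0, 1, 0]})

    summary = (
        ts.groupby("patient_id")
          .agg(min_X=("X", "min"), max_X=("X", "max"),
               median_X=("X", "median"), mean_X=("X", "mean"))
          .reset_index()
    )

    merged = A.merge(summary, on="patient_id", how="left")   # keeps all rows of A
    merged["has_ts"] = merged["mean_X"].notna().astype(int)  # missingness flag
    # impute the NaN summary features (or use a model that tolerates them)
    # instead of dropping the 2000 labeled patients without time-series data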

Any advice/thoughts are appreciated.

2 Comments
2024/11/30
17:45 UTC

0

Choosing the right statistical test for my case

I have a survey dataset where each participant answered 20 unique questions. I want to analyze the data statistically but am unsure which test to use: ANOVA, repeated measures ANOVA, or mixed-effects model?

The data I want to use for the statistical test are the question type and the response time.
https://docs.google.com/spreadsheets/d/16cwLFGaF4KqLvwYNjHIcCyHaWup8vSJpL7gOEqk_XPA/edit?usp=sharing
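
A sketch of the mixed-effects option, which is the most flexible of the three listed when every participant answers multiple questions: response time modelled by question type, with a random intercept per participant absorbing the repeated measures. Column names are guesses at the sheet's layout.

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("survey_long.csv")  # columns: participant, question_type, response_time

    m = smf.mixedlm("response_time ~ question_type", df, groups="participant").fit()
    print(m.summary())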

3 Comments
2024/11/30
13:43 UTC

3

Is there a formula for number of occurrences of n equally likely events when the same outcome can't occur twice in a row?

I understand binomial, negative binomial, hypergeometric, COMBIN in Excel, and what have you. Say I have 5 colored marbles I can pull out of a jar, all equally likely; (edit) I don't put the pulled marble back until after drawing the next one, so after the first draw there are only 4 marbles in the jar from then on. The same color twice in a row (blue-blue, red-red, etc.) is impossible.

Is there some formula for n colors and t trials that tells me the chance of exactly k successes? Like, what would be the odds of pulling blue exactly twice in 10 pulls?

I can work this through on a spreadsheet that grows with the number of trials, but I don't think that's necessary. I realize that at high numbers of colors and/or trials the probability converges to the negative binomial. Also, odd vs. even trial counts matter for small n and t, but I'm not sure how to derive a closed-form expression, since I'm verging into permutations.
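
A quick Monte Carlo check to test any closed form against: 5 colors, the marble just drawn stays out until the next draw, count how often "blue" appears exactly k times in t pulls.

    import random

    def pulls(t, colors=("blue", "red", "green", "yellow", "white")):
        seq, prev = [], None
        for _ in range(t):
            choices = [c for c in colors if c != prev]  # previous color unavailable
            prev = random.choice(choices)
            seq.append(prev)
        return seq

    trials, t, k = 200_000, 10, 2
    hits = sum(pulls(t).count("blue") == k for _ in range(trials))
    print(f"P(exactly {k} blue in {t}) ≈ {hits / trials:.4f}")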

5 Comments
2024/11/30
13:30 UTC

7

Rigorous book for statistical proofs

Hey, I’m a student and my exam requires me to know all statistical proofs from square one. I have Wackerly, Mendenhall & Scheaffer's mathematical statistics book. It’s a good book, but it doesn’t have the rigour expected from us. Is there maybe a book/document with only full mathematical-statistics proofs listed in a coherent manner? I guess it’s more like an encyclopedia type of book: derivation of test statistics using the GLR, the Neyman-Pearson lemma proof, all of Bayesian statistics, distributions and proofs of their relationships with one another, etc.

4 Comments
2024/11/30
12:01 UTC

2

Help establishing multiple linear regression for a knowledge, attitudes and practices study

I must apologise for my statistical naivety; I understand that to a lot of you these questions will seem haphazard and possibly quite stupid.

Background: I am aiming to write a knowledge, attitudes and practices (KAP) study (example 1: https://bmcpublichealth.biomedcentral.com/articles/10.1186/s12889-021-10353-3, example 2: https://pmc.ncbi.nlm.nih.gov/articles/PMC9684283/ ). This kind of study assesses and scores questionnaires on knowledge, attitudes and practices, and in most cases multiple linear regression models were used to identify the variables that significantly influenced knowledge, attitudes and practices.

My data: The data I have gathered asks a series of questions for three categories. Answers for knowledge and attitudes were scored in a binary fashion: correct or positive = 1, incorrect or negative = 0. This means that each person has an additive score of binary values for attitudes and a score for knowledge. Questions about practices were similarly dichotomous; however, they were not added up, and can be used to represent demographics of people with certain behaviours, e.g. uneducated vs. educated people, or people who have previously been tested for COVID vs. untested, and these populations can be used to assess the likelihood of having higher or lower knowledge or attitudes.

Problem: This is where it all breaks down from my understanding. I don’t understand how these previous studies have done their multiple linear regressions. I have read up on multiple linear regression, and from what I understand, one dependent variable and multiple independent variables are used to create a multidimensional analysis of a population. My thinking was that my dependent variable would be the total score of knowledge or attitudes, on the X-axis if imagining a graph. But what is my Y variable? There is no continuous variable. It is simply a one-dimensional analysis of populations (score of a population that tested, score of an educated vs. uneducated population?). But then how can I create a multiple linear regression if I can’t plot my variable meaningfully on a scatter plot anyway? And how did the other studies do it, if they followed a multiple linear regression?

What can you help with? I cannot make sense of what the two previous studies have done and how they did a multiple linear regression, and I would like to replicate what they have done for my own study. I would greatly appreciate an answer on how to compare my populations' practices to their attitude scores, to see what makes a difference and what does not, using multiple linear regression.

Many thanks
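
For what it's worth, a sketch of the setup those KAP papers most plausibly used, with hypothetical names: the additive knowledge (or attitude) score is the dependent variable, and the dichotomous practice/demographic items are the independent variables; nothing needs to be plottable on a scatter plot, since each coefficient is just an adjusted difference in mean score between groups.

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("kap_survey.csv")  # hypothetical file and columns
    # knowledge_score: additive 0..n score; the predictors are 0/1 indicators
    m = smf.ols("knowledge_score ~ educated + tested_before + practice_item_1",
                data=df).fit()
    print(m.summary())  # each coefficient: adjusted difference in mean score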

1 Comment
2024/11/30
11:04 UTC

8

How do I find if the difference between two slopes is statistically significant?

I ran separate regressions for different ethnic groups to calculate the slopes (ex: BMI vs. Sleep Apnea score). I then combined these slopes into one graph to visually compare them across ethnicities.

How can I statistically test whether the differences in slopes between the ethnic groups are significant? I'm researching and can't figure out if I should use a t-test, z-test, or ANOVA, and if so, what type?

I have the slope, X&Y intercepts, standard deviation, and standard error. Each ethnic group is a sub-sample pulled from a larger sample pool containing multiple ethnic groups.
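
A sketch of the standard large-sample test for two independent slopes, z = (b1 - b2) / sqrt(se1² + se2²); the numbers are placeholders. For several groups at once, an alternative is to pool the data and test a group × BMI interaction term.

    from math import sqrt
    from scipy.stats import norm

    b1, se1 = 0.42, 0.08  # slope and standard error, group 1 (hypothetical)
    b2, se2 = 0.21, 0.06  # slope and standard error, group 2 (hypothetical)

    z = (b1 - b2) / sqrt(se1**2 + se2**2)
    p = 2 * norm.sf(abs(z))
    print(f"z = {z:.2f}, two-sided p = {p:.4f}")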

25 Comments
2024/11/30
10:01 UTC

2

Choosing appropriate Statistical model in Stata

Hello Community.
We recently conducted a prospective study on retention and viral load suppression among children and adolescents in HIV care. The study was conducted on children and adolescents receiving ART at two study sites. Our one-year study at the two sites finally came to an end. Our interest currently is to see if the interventions we implemented helped to improve our two major binary outcomes [Retention: 1 = "Retained", 0 = "Not retained"; and Viral load suppression: 1 = "Suppressed", 0 = "Not suppressed"]. We also collected data on some independent variables like ARV days dispensed, adherence scores, OVC enrollment status, tuberculosis status, and ARV regimen line, among others, both before and after the 12-month study.

Our challenge now is to choose an appropriate statistical model/test to help us realize whether our interventions had a significant improvement in Retention and Viral load suppression.

Also, note that we measured these two outcomes at the start of the study (baseline data).

Kindly suggest an appropriate model we can adopt and probably the implementation of that model.

Thank you all.
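
A hedged sketch of one defensible model: each child contributes a baseline and an endline measurement, so a logistic GEE with time (pre/post) as the exposure and an exchangeable within-subject correlation tests whether the odds of the outcome improved. The posters work in Stata (xtgee or melogit would be the analogues); the Python below uses hypothetical names.

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.read_csv("art_cohort_long.csv")  # one row per child per time point
    m = smf.gee(
        "suppressed ~ post + site + adherence_score",  # post: 0 baseline, 1 endline
        groups="child_id",
        data=df,
        family=sm.families.Binomial(),
        cov_struct=sm.cov_struct.Exchangeable(),
    ).fit()
    print(m.summary())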

0 Comments
2024/11/30
06:46 UTC

3

How to visualize Win Ratio analysis

I am analyzing a clinical trial using Win Ratio as the primary outcome.

It is normally reported as a line in the results section of the manuscript or part of a results table.

Is there a nice way to visually display the data? A catchy figure would be amazing at a conference.

More info about the win ratio: https://pubmed.ncbi.nlm.nih.gov/21900289/

Thank you!

0 Comments
2024/11/30
01:11 UTC

1

If I have 5 independent and 3 dependent variables, do I need to form hypotheses for all the possibilities (like 5×3 = 15 total hypotheses)? And do I need to analyze them all individually?

5 Comments
2024/11/29
22:50 UTC

1

Should I re-do my approach?

I'm self-studying statistics, and I picked up Introduction to Probability by Blitzstein and Hwang based on some recommendations I found a while ago. I'm working through the first chapter, and it's unsurprisingly heavy on combinatorics, which I am finding challenging. I definitely don't want to get stuck here, so now I am wondering if I am barking up the wrong tree and working through something unnecessary. I was expecting to look at mean, median, mode and stuff like t-tests, normality, RMSE, etc.

3 Comments
2024/11/29
19:44 UTC

7

Probability of 10 cards in a row being the same suit?

10 cards are dealt from a well-shuffled deck. What is the probability all 10 will be the same suit?
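
The exact answer, two equivalent ways: choose the suit and then 10 of its 13 cards, or multiply the conditional probabilities draw by draw.

    from math import comb

    # counting: pick the suit (4 ways), then 10 of its 13 cards
    p = 4 * comb(13, 10) / comb(52, 10)

    # sequentially: the first card sets the suit, the next nine must follow it
    q = 1.0
    for i in range(1, 10):
        q *= (13 - i) / (52 - i)

    print(p, q)  # both ≈ 7.23e-08, i.e. about 1 in 13.8 million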

12 Comments
2024/11/29
19:30 UTC

1

Assumptions of normality

Hello, I'm struggling to understand the difference between the assumption of a normally distributed DV (which I believe means the data points for the DV should look normally distributed within each IV group) and the assumption of normality (which I'm not quite sure I understand, but I believe it has something to do with residuals, which is another concept I'm still trying to figure out...).

Are those two assumptions related? And could someone help me understand the normality assumption better?

Thanks so much!
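
A sketch of the connection, as far as it goes: the normality assumption in regression/ANOVA is about the residuals, and with a single categorical IV the residuals are just each observation's deviation from its group mean, so "normal residuals" and "normal DV within each group" coincide there. Checking it usually means fitting first, then inspecting; the data below are made up.

    import numpy as np
    import pandas as pd
    import scipy.stats as stats
    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)  # made-up example data
    df = pd.DataFrame({"group": np.repeat(["a", "b", "c"], 30),
                       "dv": rng.normal(0, 1, 90) + np.repeat([0, 1, 2], 30)})

    m = smf.ols("dv ~ C(group)", data=df).fit()
    stats.probplot(m.resid, dist="norm", plot=plt)  # QQ plot of residuals
    plt.show()
    print(stats.shapiro(m.resid))  # formal normality test on the residuals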

5 Comments
2024/11/29
19:01 UTC

2

Basic question on standard deviation of a prediction

If I have a model that predicts a certain outcome, let’s say how many people will visit a restaurant in a given day (and I feed the model a few details like the date, the weather, etc.), and the model has an average error rate and a standard deviation on that error rate, how do I know the standard deviation of the predicted outcome?

As in, if the model predicts 100 visitors, the average error is 5%, and the standard deviation of the error rate is 7%, what can I say is the standard deviation of my prediction of 100? My instinct is to add them up and say 100 ± 12.

What’s the actual right answer?
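
A hedged sketch of the usual combination, assuming the 5% is a systematic bias and the 7% the spread of the error around it: the two combine in quadrature (root mean square), not by addition, so ±12 overstates the typical miss.

    from math import sqrt

    pred, bias, sd = 100, 0.05, 0.07
    rmse = sqrt(bias**2 + sd**2)  # ≈ 0.086
    print(f"typical total miss ≈ ±{pred * rmse:.1f} visitors")  # ≈ ±8.6

    # an approximate 95% interval around the bias-corrected prediction
    # (the sign of the correction depends on over- vs. under-prediction)
    lo, hi = pred * (1 - bias - 1.96 * sd), pred * (1 - bias + 1.96 * sd)
    print(f"≈ [{lo:.0f}, {hi:.0f}]")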

8 Comments
2024/11/29
18:40 UTC

2

Correlations within participants

I have a large number of participants, and for each of them I'm looking at the relationship between an ordinal independent variable (6 levels, repeated measures) and a dichotomous outcome (0 or 1 at each level, for each participant), so some kind of logistic regression. How do I assess the overall model, i.e., whether across all participants an increase in the IV is associated with an increased likelihood of a positive outcome (DV = 1)? Thank you!
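
A sketch of the pooled version: a logistic GEE (or a mixed-effects logit) treats each participant as a cluster, so a single slope answers "across participants, does a higher IV level raise the odds of outcome = 1?". Column names are hypothetical.

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.read_csv("long_format.csv")  # one row per participant × IV level
    m = smf.gee("outcome ~ iv_level", groups="participant", data=df,
                family=sm.families.Binomial(),
                cov_struct=sm.cov_struct.Exchangeable()).fit()
    print(m.summary())  # the iv_level coefficient is the overall trend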

1 Comment
2024/11/29
15:40 UTC

1

Test for two independent average of averages

Hello,

Would appreciate any insight into suggesting an approach.

I have two independent datasets, A and B, representing data from different facilities.

The sample size for each dataset is 24, one value for each hour in a day. Within each hour, count data were observed and recorded, then averaged. The average for each hour was then used to calculate an average of averages, giving the overall average over 24 hours.

I would like to test if the 24-hour average for A is not significantly different from B.

To my understanding, I cannot use a 2-sample t-test because my data are discrete, in addition to the sample size being quite small. So I was looking at non-parametric tests, but I gather I cannot use the Mann-Whitney, as that assumes the data are on an ordinal scale.

At this point I’m quite unsure which statistical test can be used on this dataset and would appreciate any insight.
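
One assumption-light option is a permutation test on the 24 hourly averages: shuffle the facility labels, recompute the difference in 24-hour means, and see how extreme the observed difference is (and since the same 24 hours appear at both facilities, a paired variant that swaps labels within each hour is arguably even better). A sketch with stand-in data:

    import numpy as np

    rng = np.random.default_rng(1)
    # stand-ins for the real 24 hourly averages at each facility
    a = rng.poisson(12, 24).astype(float)
    b = rng.poisson(14, 24).astype(float)

    obs = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(100_000):
        rng.shuffle(pooled)  # random relabelling of facility membership
        count += abs(pooled[:24].mean() - pooled[24:].mean()) >= abs(obs)
    print("two-sided p ≈", count / 100_000)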

3 Comments
2024/11/29
14:16 UTC

2

Analysis for KPI

I am being asked to set a target as a KPI for task finishing time. The workflow mainly consists of service requests and within each service request, we have a set of tasks.

Initially, I thought to make a KPI like average task time; however, there are many SRs and many tasks within each one, so I thought it'd be better to construct something like an ETA for each task and, based on a 1-or-0 indicator per task, build a service-level metric (sum of 1s / total tasks).

I've been working as a BI specialist for a long time, but it's the first time I am trying to learn stats and implement it. The issue here is that I am stuck with the data: the company is a start-up, and I have at most about 7 SRs of some types while for others I have 1; therefore, I also have few tasks associated with each SR.

Taking into consideration that the dataset is small, with high variance due to some values that can be neither corrected nor dropped, is it correct to use means/medians for a while until we gather clean data, or is there actually something to build off the current view?

P.S.: the task finish time has some features that might affect it, like the SR type, the operator who handles the task, and geo.

0 Comments
2024/11/29
14:04 UTC

6

What is appropriate to wear to a statistics/math/DS career fair?

I hope this question is OK for this sub. This career fair is for graduates in the field, and I've recently graduated with a diploma in statistics. I've never attended a career fair before, and I don't really know what would be overdressing for these industries.

I’m a woman. Should I wear a suit? Is that too much? Is business casual ok? No dress code was mentioned.

Thank you!

9 Comments
2024/11/29
12:26 UTC

3

Test for data with some repeated measures, some not?

I’m interested in comparing the mean values of a biomarker between a group of randomly sampled athletes vs. non-athletes. Normally I’d use an independent 2-sample t-test. However, the biomarker is measured 4 times per person, at different locations on the body (arm, leg, back, chest). If I take the mean of these 4 measurements, the assumption of independence between my groups is fulfilled. However, I’d like to keep the measurements separate. If n = 5 people per group, I end up with 20 measurements.

Is there a single test to compare the mean of the 20 measurements between the two groups? I know a paired ttest isn’t appropriate because there are different subjects in each group. I’d like to avoid averaging the locations down to a single 5 v 5 test, and I’d also like to avoid running multiple 5 v 5 tests; one per body location.

The best single test I can think of is a two-way ANOVA with athlete status and location as the two independent variables, but I’m not sure. Do you have any advice on how to approach my situation?
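
A sketch of the mixed-model route: group (athlete vs. not) and body location as fixed effects, with a random intercept per subject so the four measurements from one person aren't treated as independent. Column names are hypothetical.

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("biomarker_long.csv")  # subject, group, location, value
    m = smf.mixedlm("value ~ group * location", df, groups="subject").fit()
    print(m.summary())  # the group effect is the athlete vs. non-athlete test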

5 Comments
2024/11/29
10:43 UTC

1

Can anyone prove that a permutation of normally distributed values is also normally distributed?

I'm looking for a very short and clean proof of the following claim: given a finite sequence of random variables drawn from a normal distribution, any permutation (or preferably any derangement) of it is also normally distributed. I know it seems trivial, but if anyone could give their approach I would really appreciate it.
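
A sketch of the standard argument (reading the claim as being about an i.i.d. random vector, since permuting fixed realized numbers changes nothing): i.i.d. variables are exchangeable, so any permutation, derangement or not, leaves the joint law unchanged.

    Let $X_1,\dots,X_n \stackrel{\text{iid}}{\sim} N(\mu,\sigma^2)$ and let $\pi$ be any
    permutation of $\{1,\dots,n\}$. The joint density factorizes as
    $f(x_1,\dots,x_n) = \prod_{i=1}^n \varphi_{\mu,\sigma}(x_i)$, a symmetric function
    of its arguments, so $f(x_{\pi(1)},\dots,x_{\pi(n)}) = f(x_1,\dots,x_n)$.
    Hence $(X_{\pi(1)},\dots,X_{\pi(n)}) \stackrel{d}{=} (X_1,\dots,X_n)$, and in
    particular each coordinate $X_{\pi(i)} \sim N(\mu,\sigma^2)$.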

6 Comments
2024/11/29
08:39 UTC

4

LMM vs PANEL REGRESSION

I'm getting very, very confused about the difference between fixed and random effects, because the definitions are not the same in the panel data and longitudinal data contexts.

For starters, panel data is essentially longitudinal data right? Observing individuals over time.

For panel data and panel data regression, I have read several papers saying that fixed-effects models are the ones with varying intercepts, while random-effects models have one general intercept. Even in Stata and R, this seems to be the case in terms of the coefficients. And the test used to identify which is more appropriate is the Hausman test.

However, for longitudinal data, when the linear mixed model is considered, the random-effects model is the one with varying intercepts, and the fixed effects are the ones with constant estimates. And what I was told to use to determine whether fixed or random effects are appropriate is the LRT.

I am really confused. So can anyone help me?

3 Comments
2024/11/29
07:04 UTC

0

How many people can actually implement an LLM or image generation AI themselves from scratch? (Question cross-post 👇)

Sorry if this isn't the right place to ask this question (I originally asked in r/AskProgramming), but I'm curious. For example, I recently saw this book on Amazon:

Build a Large Language Model (From Scratch)

I'm curious how many people could sit down at a computer and, with just the C++ and/or Python standard library and at most a matrix library like NumPy (plus some AWS credit for things like data storage and human AI trainers/labelers), implement an LLM or image-generation AI themselves (from scratch).

Like, estimate a number of people. Also, what educational background would these people have? I have a Computer Science bachelor's degree from 2015, and Machine Learning/AI wasn't even part of my curriculum.

14 Comments
2024/11/29
01:21 UTC
