/r/AskStatistics


Ask a question about statistics (other than homework).

Posts must be questions about statistics. The sub is not for homework or assessment help (try /r/HomeworkHelp). No solicitation of academic misconduct. Don't ask people to contact you externally to the subreddit. Use informative titles.

See the rules.

If your question is "what statistical test should I use for this data/hypothesis?", then start by reading this and ask follow-ups as necessary. Beware: it's an imperfect tool.

If you answer questions, you can assign your own flair to briefly describe your educational or professional background in statistics.

105,373 Subscribers

1

SPSS Moderation Analysis

My study goes like this:
Independent variable: Level of Knowledge, composed of three parameters measured on a 4-point Likert scale; the three parameters were combined (mean) using the Compute Variable function in SPSS.
Dependent variable: Level of Compliance, with five questions measured on a 4-point Likert scale; the results were likewise combined (mean) using the Compute Variable function in SPSS.
Moderating variable: Completion of a Training Course, measured as whether or not the respondent has completed a certain training course.

My study wants to know the correlation between the IV and the DV, and whether the MV strengthens or weakens that correlation. Can someone advise me on how to conduct my moderation analysis in SPSS?
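
A minimal sketch of the model behind a moderation analysis (SPSS's PROCESS Model 1 fits the same thing): regress the DV on the mean-centered IV, the binary moderator, and their product, then read the interaction coefficient. Shown in Python for concreteness; all column names are hypothetical.

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("survey.csv")  # columns: knowledge, compliance, trained (0/1)
    df["knowledge_c"] = df["knowledge"] - df["knowledge"].mean()  # center the IV

    # the knowledge_c:trained coefficient tests whether training moderates the effect
    model = smf.ols("compliance ~ knowledge_c * trained", data=df).fit()
    print(model.summary())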

0 Comments
2024/12/01
06:18 UTC

2

Is it possible to estimate how a specific group voted based on data of less specific groups?

For example, in a US election, is it possible to estimate the Dem-Rep vote ratio of the group "White Catholic Males" when I have the Dem-Rep ratio of each of the following groups, and also the size of each group? (One possible approach is sketched after the list.)

  • White
  • Catholic
  • Male
  • White Catholic
  • White Male
  • Catholic Male
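
The three-way cell is not identified from these margins alone, but under a maximum-entropy (no-higher-order-interaction) assumption, iterative proportional fitting gives a standard estimate: scale a seed table until it matches every known group size and Dem share, then read off the cell of interest. A rough Python sketch; every number below is invented.

    import numpy as np

    # axes: white (0/1), catholic (0/1), male (0/1), dem_vote (0/1)
    SHAPE = (2, 2, 2, 2)
    coords = np.indices(SHAPE)
    table = np.ones(SHAPE)
    TOTAL = 10_000  # hypothetical electorate size

    # (white?, catholic?, male?), group size, Dem share -- all numbers invented
    constraints = [
        ((1, None, None), 6000, 0.45),  # White
        ((None, 1, None), 2500, 0.48),  # Catholic
        ((None, None, 1), 4800, 0.44),  # Male
        ((1, 1, None),    1800, 0.42),  # White Catholic
        ((1, None, 1),    2900, 0.40),  # White Male
        ((None, 1, 1),    1200, 0.43),  # Catholic Male
    ]

    def cells(mask, vote):
        idx = coords[3] == vote
        for axis, want in enumerate(mask):
            if want is not None:
                idx &= coords[axis] == want
        return idx

    for _ in range(500):  # IPF sweeps until the margins (roughly) agree
        table *= TOTAL / table.sum()
        for mask, size, dem in constraints:
            for vote, target in ((1, size * dem), (0, size * (1 - dem))):
                c = cells(mask, vote)
                table[c] *= target / table[c].sum()

    wcm = table[1, 1, 1, :]  # White Catholic Male cells
    print("estimated Dem share:", wcm[1] / wcm.sum())
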
0 Comments
2024/12/01
03:39 UTC

2

Manufacturing Data Reality. What do these datasets typically look like?

Hey Guys,
So I have an interview coming up with a food manufacturing company, and they are going to give me a case study in Excel to work on. The job description focuses on:
recognizing trends and patterns, utilizing large live and historical data,
forecasting, and drawing hypotheses, e.g. investigating sugar levels in a candy.

Does anyone here work in manufacturing (or better, food manufacturing) and can help give me an idea of what a typical dataset looks like?

I would love to start practising on some fake datasets. I asked ChatGPT, but it isn't giving the most realistic datasets.

Any help is much appreciated!!
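
In the meantime, a sketch of a self-generated stand-in to practise on: timestamped batch records with process settings and a quality measurement, which is the general shape such data tends to take. Every column name and number here is invented.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    n = 2000
    df = pd.DataFrame({
        "timestamp":   pd.date_range("2024-01-01", periods=n, freq="30min"),
        "line":        rng.choice(["A", "B", "C"], n),
        "batch_id":    np.arange(n) // 10,
        "cook_temp_c": rng.normal(118, 2.5, n),
        "mix_time_s":  rng.normal(300, 15, n),
        "shift":       rng.choice(["day", "night"], n),
    })
    # make the quality outcome depend on the process, plus noise and a slow drift
    df["sugar_pct"] = (
        62 + 0.15 * (df["cook_temp_c"] - 118)
           - 0.01 * (df["mix_time_s"] - 300)
           + np.linspace(0, 1.5, n)          # equipment/seasonal drift
           + rng.normal(0, 0.8, n)
    )
    df.to_csv("fake_candy_line.csv", index=False)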

1 Comment
2024/12/01
03:06 UTC

1

Can you use data from a hyperbolic function in a correlation equation?

Hello, I have to write a statistics paper for my undergrad class and something I'm considering doing is correlating Delay Discounting with some other trait obtained through a Likert Scale.

My problem with Delay Discounting is that individual values obtained through its standard protocol follow a hyperbolic curve. I'm a bit wary about whether this would do something funky with the correlation, especially since I think the other data set will follow a very different pattern. Is it a-okay to use something like this? Am I misapplying something? Or is there some property to correlation equations that would prohibit me from using that kind of data?

I'm using the formulas from this paper and then calculating the geometric mean at the point of indifference:

https://www.cambridge.org/core/journals/psychological-medicine/article/temporal-discounting-in-major-depressive-disorder/6E097CAFD29115C260827A88A89A7F81
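
One hedged idea, assuming the plan is a per-participant discount rate: the hyperbolic shape lives inside each participant's own curve, V = A / (1 + kD); what enters the correlation is one k (or log k) per person, and a rank-based (Spearman) correlation is insensitive to the skewed scale k values typically have. A sketch with invented numbers:

    import numpy as np
    from scipy.optimize import curve_fit
    from scipy.stats import spearmanr

    def hyperbolic(delay, k, amount=100.0):
        return amount / (1.0 + k * delay)

    delays = np.array([1, 7, 30, 90, 365], dtype=float)   # days
    indiff = np.array([95, 88, 70, 52, 25], dtype=float)  # one participant, hypothetical

    k_hat, _ = curve_fit(hyperbolic, delays, indiff, p0=[0.01])
    print("estimated k:", k_hat[0])

    # across participants: correlate log(k) with the Likert trait score
    log_k = np.log([0.004, 0.02, 0.11, 0.007, 0.05])  # hypothetical
    likert = [12, 18, 25, 14, 21]                     # hypothetical
    rho, p = spearmanr(log_k, likert)
    print(f"Spearman rho={rho:.2f}, p={p:.3f}")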

1 Comment
2024/11/30
22:43 UTC

1

Calculating confidence intervals

In an instrumental variable study, I have the point estimate of the outcome (a health outcome) and the corresponding confidence interval. I also have the mean of the instrument and the range (number of health worker visits in a year in a given population). I have to calculate how the confidence interval would change given a change in the instrument. That is, how the health outcome would change if the number of visits by the health worker changed. Can someone please guide me on how I can calculate this?

0 Comments
2024/11/30
20:25 UTC

1

Help regarding research analysis.

I’m using SPSS and I’m unsure of how to proceed with my analysis.

I’m using a national inpatient hospital database (NIS) to look at how a specific procedure's volume changed pre vs. post COVID. I’ve already combined the years I’m looking at (2018-2021), filtered the data for only the procedure code I’m interested in, introduced a time-period variable (2018/2019 = 1, 2020/2021 = 2), and weighted my cases by the “discharge weight” variable to represent population estimates. At this point, each row is basically a count for the procedure.

Now I’m stuck and don’t know what kind of statistical analysis I should be doing and what variables to use. I’ve played around with an independent t-test using time period × discharge weights, thinking that each row × discharge weight = estimate of procedures, but I’m not really sure if that’s right.

I’d appreciate it if someone could please advise me on this.
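
A hedged sketch of the aggregation logic: with discharge weights, the national volume estimate is the sum of weights, not a mean across rows, so a t-test over rows isn't the natural fit; one simple option is to compare the estimated period counts as rates (the fully rigorous route would use the NIS survey design, i.e. strata and PSU, for the variance). Column names are hypothetical.

    import pandas as pd
    from statsmodels.stats.rates import test_poisson_2indep

    df = pd.read_csv("nis_procedure_rows.csv")   # one row per sampled discharge
    yearly = df.groupby("year")["discwt"].sum()  # weighted national estimate per year
    print(yearly)

    pre = yearly.loc[[2018, 2019]].sum()
    post = yearly.loc[[2020, 2021]].sum()
    # rate comparison: estimated counts over equal 2-year exposures
    res = test_poisson_2indep(int(pre), 2, int(post), 2)
    print(res)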

6 Comments
2024/11/30
20:08 UTC

1

How do I analyse questionnaire results?

Hey,

While working on my thesis, I released a questionnaire to the public. It has 30 questions, with single-choice or open answers. 140 people responded, and I was not expecting that many replies.

I've exported the results into Excel, which left me with quite messy sheets. The first row is the question, the second is the possible answers, then all the respondents. Every possible answer is a separate column, and an answer is marked by a 1 in a cell, leaving all others empty.

My mentor said that I should use basic descriptive analysis, 95% confidence levels, and chi-square with df. And that's where I ran into issues.

So when simplifying the answers, I get, for example: Question 3, 17 chose A, 33 chose B, 89 chose C, 1 did not respond. I'm trying to use Excel's data analysis functions, but I keep getting errors. I tried looking on YouTube for help, but in every video they are using those tools on 10+ numbers, not just 2-5 like in my case.

What am I doing wrong? Did I misunderstand my mentor and need to do a different kind of analysis? I know for sure they mentioned CIs and chi-square.

I also tried using SPSS and R, but I couldn't even import the data properly, lol.

Any tips will be greatly appreciated!!
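
For what it's worth, with counts like the Question 3 example, the chi-square and the confidence interval the mentor mentioned take only a few lines outside Excel; a sketch in Python:

    from scipy.stats import chisquare
    from statsmodels.stats.proportion import proportion_confint

    observed = [17, 33, 89]        # A, B, C (the one non-response left out)
    chi2, p = chisquare(observed)  # default: all answers equally likely, df = k - 1
    print(f"chi2={chi2:.2f}, p={p:.4f}")

    # 95% confidence interval for the share choosing C
    lo, hi = proportion_confint(89, 17 + 33 + 89, alpha=0.05)
    print(f"P(C) 95% CI: [{lo:.3f}, {hi:.3f}]")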

6 Comments
2024/11/30
19:01 UTC

1

Preprocessing two different kinds of datasets for a machine learning problem

I am working on two health-related datasets. And I use Python.

  • One tabular dataset (called A) contains patient-level information (by id) and a bunch of other features which I have already transformed and cleaned. This dataset has around 3000 rows. The dataset contains labels (y) for a classification problem.
  • The other data is a collection of dataframes. Each dataframe represents time-series data on a particular patient (by id also). There are around 1000 dataframes (only 1000 patients have available information on this time-series data).

My methods so far:

  • For the collection of dataframes, for each dataframe/patient id, I selected only the mean, median, max, and min for each column, then transformed each dataframe into a single row of data, for example: "patient_id", "min_X", "max_X", "median_X", "mean_X", instead of the lengthy timestep-level dataframe. Do you think this is a good way to preserve key information about the time-series data? Otherwise, I'm thinking of a machine learning model to select the time-series features, but I'm not sure how to do so.
  • Now I have this single dataframe (called B) of patient-level time-series features and want to join it with the first cleaned dataframe (A), but the rows are mismatched: A has 3000 rows but B has only 1000. The patient ids of B are a subset of the patient ids of A. I don't know how to deal with this. I'm thinking of just using the 1000 rows of B and left-joining A, but would that be a lot of data loss? (A sketch of both steps follows the list.)
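
A sketch of both steps in pandas, with hypothetical stand-ins for the real data: the summary-statistic compression is a common, reasonable baseline, and a left join from A keeps all 3000 labeled rows, with a flag marking who has time-series data, rather than discarding 2000 rows.

    import pandas as pd

    # ts: long dataframe stacking every patient's time series; A: tabular dataset
    ts = pd.DataFrame({"patient_id": [1, 1, 2], "X": [0.5, 0.9, 1.2]})
    A = pd.DataFrame({"patient_id": [1, 2, 3], "label": [0, 1, 0]})

    summary = (
        ts.groupby("patient_id")
          .agg(min_X=("X", "min"), max_X=("X", "max"),
               median_X=("X", "median"), mean_X=("X", "mean"))
          .reset_index()
    )

    merged = A.merge(summary, on="patient_id", how="left")   # keeps all rows of A
    merged["has_ts"] = merged["mean_X"].notna().astype(int)  # missingness flag
    # impute the NaN summary features (or use a model that tolerates them)
    # instead of dropping the 2000 labeled patients without time-series data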

Any advice/thoughts are appreciated.

2 Comments
2024/11/30
17:45 UTC

0

Choosing the right statistical test for my case

I have a survey dataset where each participant answered 20 unique questions. I want to analyze the data statistically but am unsure which test to use: ANOVA, repeated measures ANOVA, or mixed-effects model?

The data I want to use for the statistical test are the question type and the response time.
https://docs.google.com/spreadsheets/d/16cwLFGaF4KqLvwYNjHIcCyHaWup8vSJpL7gOEqk_XPA/edit?usp=sharing
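
A sketch of the mixed-effects option, which is the most flexible of the three listed when every participant answers multiple questions: response time modelled by question type, with a random intercept per participant absorbing the repeated measures. Column names are guesses at the sheet's layout.

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("survey_long.csv")  # columns: participant, question_type, response_time

    m = smf.mixedlm("response_time ~ question_type", df, groups="participant").fit()
    print(m.summary())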

3 Comments
2024/11/30
13:43 UTC

3

Is there a formula for number of occurrences of n equally likely events when the same outcome can't occur twice in a row?

I understand binomial, negative binomial, hypergeometric, COMBIN in Excel, and what have you. Say I have 5 colored marbles I can pull out of a jar, all equally likely; (edit) I don't put the pulled marble back until after drawing the next one, so after the first draw there are only 4 marbles in the jar from then on. The same color twice in a row (blue-blue, red-red, etc.) is impossible.

Is there some formula for n colors and t trials that tells me the chance of exactly k successes? Like, what would be the odds of pulling blue exactly twice in 10 pulls?

I can work this through on a spreadsheet that grows with the number of trials, but I don't think that's necessary. I realize that at high numbers of colors and/or trials the probability converges to the negative binomial. Also, odd vs. even trial counts matter for small n and t, but I'm not sure how to derive a closed-form expression, since I'm verging into permutations.
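
A quick Monte Carlo check to test any closed form against: 5 colors, the marble just drawn stays out until the next draw, count how often "blue" appears exactly k times in t pulls.

    import random

    def pulls(t, colors=("blue", "red", "green", "yellow", "white")):
        seq, prev = [], None
        for _ in range(t):
            choices = [c for c in colors if c != prev]  # previous color unavailable
            prev = random.choice(choices)
            seq.append(prev)
        return seq

    trials, t, k = 200_000, 10, 2
    hits = sum(pulls(t).count("blue") == k for _ in range(trials))
    print(f"P(exactly {k} blue in {t}) ≈ {hits / trials:.4f}")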

5 Comments
2024/11/30
13:30 UTC

7

Rigorous book for statistical proofs

Hey, I’m a student and my exam requires me to know all statistical proofs from square one. I have Wackerly, Mendenhall & Scheaffer's mathematical statistics book. It’s a good book, but it doesn’t have the rigour expected from us. Is there maybe a book/document with only full mathematical-statistics proofs listed in a coherent manner? I guess it’s more like an encyclopedia type of book: derivation of test statistics using the GLR, the Neyman-Pearson lemma proof, all of Bayesian statistics, distributions and proofs of their relationships with one another, etc.

4 Comments
2024/11/30
12:01 UTC

2

Help establishing multiple linear regression for a knowledge, attitudes and practices study

I must apologise for my statistical naivety; I understand that to a lot of you these questions will seem haphazard and possibly quite stupid.

Background: I am aiming to write a knowledge, attitudes and practices (KAP) study (example 1: https://bmcpublichealth.biomedcentral.com/articles/10.1186/s12889-021-10353-3, example 2: https://pmc.ncbi.nlm.nih.gov/articles/PMC9684283/ ). This kind of study assesses and scores questionnaires on knowledge, attitudes and practices, and in most cases multiple linear regression models were used to identify the variables that significantly influenced knowledge, attitudes and practices.

My data: The data I have gathered asks a series of questions for three categories. Answers for knowledge and attitudes were scored in a binary fashion: correct or positive = 1, incorrect or negative = 0. This means that each person has an additive score of binary values for attitudes and a score for knowledge. Questions about practices were similarly dichotomous; however, they were not added up, and can be used to represent demographics of people with certain behaviours, e.g. uneducated vs. educated people, or people who have previously been tested for COVID vs. untested, and these populations can be used to assess the likelihood of having higher or lower knowledge or attitudes.

Problem: This is where it all breaks down from my understanding. I don’t understand how these previous studies have done their multiple linear regressions. I have read up on multiple linear regression, and from what I understand, one dependent variable and multiple independent variables are used to create a multidimensional analysis of a population. My thinking was that my dependent variable would be the total score of knowledge or attitudes, on the X-axis if imagining a graph. But what is my Y variable? There is no continuous variable. It is simply a one-dimensional analysis of populations (score of a population that tested, score of an educated vs. uneducated population?). But then how can I create a multiple linear regression if I can’t plot my variable meaningfully on a scatter plot anyway? And how did the other studies do it, if they followed a multiple linear regression?

What can you help with? I cannot make sense of what the two previous studies have done and how they did a multiple linear regression, and I would like to replicate what they have done for my own study. I would greatly appreciate an answer on how to compare my populations' practices to their attitude scores, to see what makes a difference and what does not, using multiple linear regression.

Many thanks
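
For what it's worth, a sketch of the setup those KAP papers most plausibly used, with hypothetical names: the additive knowledge (or attitude) score is the dependent variable, and the dichotomous practice/demographic items are the independent variables; nothing needs to be plottable on a scatter plot, since each coefficient is just an adjusted difference in mean score between groups.

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("kap_survey.csv")  # hypothetical file and columns
    # knowledge_score: additive 0..n score; the predictors are 0/1 indicators
    m = smf.ols("knowledge_score ~ educated + tested_before + practice_item_1",
                data=df).fit()
    print(m.summary())  # each coefficient: adjusted difference in mean score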

1 Comment
2024/11/30
11:04 UTC

8

How do I find if the difference between two slopes is statistically significant?

I ran separate regressions for different ethnic groups to calculate the slopes (ex: BMI vs. Sleep Apnea score). I then combined these slopes into one graph to visually compare them across ethnicities.

How can I statistically test whether the differences in slopes between the ethnic groups are significant? I'm researching and can't figure out if I should use a t-test, z-test, or ANOVA, and if so, what type?

I have the slope, X&Y intercepts, standard deviation, and standard error. Each ethnic group is a sub-sample pulled from a larger sample pool containing multiple ethnic groups.
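
A sketch of the standard large-sample test for two independent slopes, z = (b1 - b2) / sqrt(se1² + se2²); the numbers are placeholders. For several groups at once, an alternative is to pool the data and test a group × BMI interaction term.

    from math import sqrt
    from scipy.stats import norm

    b1, se1 = 0.42, 0.08  # slope and standard error, group 1 (hypothetical)
    b2, se2 = 0.21, 0.06  # slope and standard error, group 2 (hypothetical)

    z = (b1 - b2) / sqrt(se1**2 + se2**2)
    p = 2 * norm.sf(abs(z))
    print(f"z = {z:.2f}, two-sided p = {p:.4f}")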

25 Comments
2024/11/30
10:01 UTC

2

Choosing appropriate Statistical model in Stata

Hello Community.
We recently conducted a prospective study on retention and viral load suppression among children and adolescents in HIV care. The study was conducted on children and adolescents receiving ART at two study sites. Our one-year study at the two sites finally came to an end. Our interest currently is to see if the interventions we implemented helped to improve our two major binary outcomes [Retention: 1 = "Retained", 0 = "Not retained"; and Viral load suppression: 1 = "Suppressed", 0 = "Not suppressed"]. We also collected data on some independent variables like ARV days dispensed, adherence scores, OVC enrollment status, tuberculosis status, and ARV regimen line, among others, both before and after the 12-month study.

Our challenge now is to choose an appropriate statistical model/test to help us realize whether our interventions had a significant improvement in Retention and Viral load suppression.

Also, note that we measured these two outcomes at the start of the study (baseline data).

Kindly suggest an appropriate model we can adopt and probably the implementation of that model.

Thank you all.
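
A hedged sketch of one defensible model: each child contributes a baseline and an endline measurement, so a logistic GEE with time (pre/post) as the exposure and an exchangeable within-subject correlation tests whether the odds of the outcome improved. The posters work in Stata (xtgee or melogit would be the analogues); the Python below uses hypothetical names.

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.read_csv("art_cohort_long.csv")  # one row per child per time point
    m = smf.gee(
        "suppressed ~ post + site + adherence_score",  # post: 0 baseline, 1 endline
        groups="child_id",
        data=df,
        family=sm.families.Binomial(),
        cov_struct=sm.cov_struct.Exchangeable(),
    ).fit()
    print(m.summary())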

0 Comments
2024/11/30
06:46 UTC

3

How to visualize Win Ratio analysis

I am analyzing a clinical trial using Win Ratio as the primary outcome.

It is normally reported as a line in the results section of the manuscript or part of a results table.

Is there a nice way to visually display the data? A catchy figure would be amazing at a conference.

More info about the win ratio: https://pubmed.ncbi.nlm.nih.gov/21900289/

Thank you!

0 Comments
2024/11/30
01:11 UTC

1

If I have 5 independent and 3 dependent variables, do I need to form hypotheses for all the possibilities (like 5×3 = 15 total hypotheses)? And do I need to analyze them all individually?

5 Comments
2024/11/29
22:50 UTC

1

Should I re-do my approach?

I'm self-studying statistics, and I picked up Introduction to Probability by Blitzstein and Hwang based on some recommendations I found a while ago. I'm working through the first chapter, and it's unsurprisingly heavy on combinatorics, which I am finding challenging. I definitely don't want to get stuck here, so now I am wondering if I am barking up the wrong tree and working through something unnecessary. I was expecting to look at mean, median, mode and stuff like t-tests, normality, RMSE, etc.

3 Comments
2024/11/29
19:44 UTC

7

Probability of 10 cards in a row being the same suit?

10 cards are dealt from a well-shuffled deck. What is the probability all 10 will be the same suit?
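
The exact answer, two equivalent ways: choose the suit and then 10 of its 13 cards, or multiply the conditional probabilities draw by draw.

    from math import comb

    # counting: pick the suit (4 ways), then 10 of its 13 cards
    p = 4 * comb(13, 10) / comb(52, 10)

    # sequentially: the first card sets the suit, the next nine must follow it
    q = 1.0
    for i in range(1, 10):
        q *= (13 - i) / (52 - i)

    print(p, q)  # both ≈ 7.23e-08, i.e. about 1 in 13.8 million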

12 Comments
2024/11/29
19:30 UTC

1

Assumptions of normality

Hello, I'm struggling to understand the difference between the assumption of a normally distributed DV (which I believe means the data points for the DV should look normally distributed within each IV group) and the assumption of normality (which I'm not quite sure I understand, but I believe it has something to do with residuals, which is another concept I'm still trying to figure out...).

Are those two assumptions related? And could someone help me understand the normality assumption better?

Thanks so much!
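
A sketch of the connection, as far as it goes: the normality assumption in regression/ANOVA is about the residuals, and with a single categorical IV the residuals are just each observation's deviation from its group mean, so "normal residuals" and "normal DV within each group" coincide there. Checking it usually means fitting first, then inspecting; the data below are made up.

    import numpy as np
    import pandas as pd
    import scipy.stats as stats
    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)  # made-up example data
    df = pd.DataFrame({"group": np.repeat(["a", "b", "c"], 30),
                       "dv": rng.normal(0, 1, 90) + np.repeat([0, 1, 2], 30)})

    m = smf.ols("dv ~ C(group)", data=df).fit()
    stats.probplot(m.resid, dist="norm", plot=plt)  # QQ plot of residuals
    plt.show()
    print(stats.shapiro(m.resid))  # formal normality test on the residuals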

5 Comments
2024/11/29
19:01 UTC

2

Basic question on standard deviation of a prediction

If I have a model that predicts a certain outcome, let’s say how many people will visit a restaurant in a given day (and I feed the model a few details like the date, the weather, etc.), and the model has an average error rate and a standard deviation on that error rate, how do I know the standard deviation of the predicted outcome?

As in, if the model predicts 100 visitors, the average error is 5%, and the standard deviation of the error rate is 7%, what can I say is the standard deviation of my prediction of 100? My instinct is to add them up and say 100 ± 12.

What’s the actual right answer?
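
A hedged sketch of the usual combination, assuming the 5% is a systematic bias and the 7% the spread of the error around it: the two combine in quadrature (root mean square), not by addition, so ±12 overstates the typical miss.

    from math import sqrt

    pred, bias, sd = 100, 0.05, 0.07
    rmse = sqrt(bias**2 + sd**2)  # ≈ 0.086
    print(f"typical total miss ≈ ±{pred * rmse:.1f} visitors")  # ≈ ±8.6

    # an approximate 95% interval around the bias-corrected prediction
    # (the sign of the correction depends on over- vs. under-prediction)
    lo, hi = pred * (1 - bias - 1.96 * sd), pred * (1 - bias + 1.96 * sd)
    print(f"≈ [{lo:.0f}, {hi:.0f}]")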

8 Comments
2024/11/29
18:40 UTC

2

Correlations within participants

I have a large number of participants, and for each of them I'm looking at the relationship between an ordinal independent variable (6 levels, repeated measures) and a dichotomous outcome (0 or 1 at each level, for each participant), so some kind of logistic regression. How do I assess the overall model, i.e., whether across all participants an increase in the IV is associated with an increased likelihood of a positive outcome (DV = 1)? Thank you!
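
A sketch of the pooled version: a logistic GEE (or a mixed-effects logit) treats each participant as a cluster, so a single slope answers "across participants, does a higher IV level raise the odds of outcome = 1?". Column names are hypothetical.

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.read_csv("long_format.csv")  # one row per participant × IV level
    m = smf.gee("outcome ~ iv_level", groups="participant", data=df,
                family=sm.families.Binomial(),
                cov_struct=sm.cov_struct.Exchangeable()).fit()
    print(m.summary())  # the iv_level coefficient is the overall trend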

1 Comment
2024/11/29
15:40 UTC

1

Test for two independent average of averages

Hello,

Would appreciate any insight into suggesting an approach.

I have two independent datasets, A and B, representing data from different facilities.

The sample size for each dataset is 24, one value for each hour in a day. Within each hour, count data were observed and recorded, then averaged. The average for each hour was then used to calculate an average of averages, giving the overall average over 24 hours.

I would like to test if the 24-hour average for A is not significantly different from B.

To my understanding, I cannot use a 2-sample t-test because my data are discrete, in addition to the sample size being quite small. So I was looking at non-parametric tests, but I gather I cannot use the Mann-Whitney, as that assumes the data are on an ordinal scale.

At this point I’m quite unsure which statistical test can be used on this dataset and would appreciate any insight.
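
One assumption-light option is a permutation test on the 24 hourly averages: shuffle the facility labels, recompute the difference in 24-hour means, and see how extreme the observed difference is (and since the same 24 hours appear at both facilities, a paired variant that swaps labels within each hour is arguably even better). A sketch with stand-in data:

    import numpy as np

    rng = np.random.default_rng(1)
    # stand-ins for the real 24 hourly averages at each facility
    a = rng.poisson(12, 24).astype(float)
    b = rng.poisson(14, 24).astype(float)

    obs = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(100_000):
        rng.shuffle(pooled)  # random relabelling of facility membership
        count += abs(pooled[:24].mean() - pooled[24:].mean()) >= abs(obs)
    print("two-sided p ≈", count / 100_000)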

3 Comments
2024/11/29
14:16 UTC

2

Analysis for KPI

I am being asked to set a target as a KPI for task finishing time. The workflow mainly consists of service requests and within each service request, we have a set of tasks.

Initially, I thought to make a KPI like average task time; however, there are many SRs and many tasks within each one, so I thought it'd be better to construct something like an ETA for each task and, based on a 1-or-0 indicator per task, build a service-level metric (sum of 1s / total tasks).

I've been working as a BI specialist for a long time, but it's the first time I am trying to learn stats and implement it. The issue here is that I am stuck with the data: the company is a start-up, and I have at most about 7 SRs of some types while for others I have 1; therefore, I also have few tasks associated with each SR.

Taking into consideration that the dataset is small, with high variance due to some values that can be neither corrected nor dropped, is it correct to use means/medians for a while until we gather clean data, or is there actually something to build off the current view?

P.S.: the task finish time has some features that might affect it, like the SR type, the operator who handles the task, and geo.

0 Comments
2024/11/29
14:04 UTC

6

What is appropriate to wear to a statistics/math/DS career fair?

I hope this question is OK for this sub. This career fair is for graduates in the field, and I've recently graduated with a diploma in statistics. I've never attended a career fair before, and I don't really know what would be overdressing for these industries.

I’m a woman. Should I wear a suit? Is that too much? Is business casual ok? No dress code was mentioned.

Thank you!

9 Comments
2024/11/29
12:26 UTC

3

Test for data with some repeated measures, some not?

I’m interested in comparing the mean values of a biomarker between a group of randomly sampled athletes vs. non-athletes. Normally I’d use an independent 2-sample t-test. However, the biomarker is measured 4 times per person, at different locations on the body (arm, leg, back, chest). If I take the mean of these 4 measurements, the assumption of independence between my groups is fulfilled. However, I’d like to keep the measurements separate. If n = 5 people per group, I end up with 20 measurements.

Is there a single test to compare the mean of the 20 measurements between the two groups? I know a paired ttest isn’t appropriate because there are different subjects in each group. I’d like to avoid averaging the locations down to a single 5 v 5 test, and I’d also like to avoid running multiple 5 v 5 tests; one per body location.

The best single test I can think of is a two-way ANOVA with athlete status and location as the two independent variables, but I’m not sure. Do you have any advice on how to approach my situation?
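
A sketch of the mixed-model route: group (athlete vs. not) and body location as fixed effects, with a random intercept per subject so the four measurements from one person aren't treated as independent. Column names are hypothetical.

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("biomarker_long.csv")  # subject, group, location, value
    m = smf.mixedlm("value ~ group * location", df, groups="subject").fit()
    print(m.summary())  # the group effect is the athlete vs. non-athlete test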

5 Comments
2024/11/29
10:43 UTC

1

Can anyone prove that a permutation of normally distributed values is also normally distributed?

I'm looking for a very short and clean proof of the following claim: given a finite sequence of random variables drawn from a normal distribution, any permutation (or preferably any derangement) of it is also normally distributed. I know it seems trivial, but if anyone could give their approach I would really appreciate it.
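
A sketch of the standard argument (reading the claim as being about an i.i.d. random vector, since permuting fixed realized numbers changes nothing): i.i.d. variables are exchangeable, so any permutation, derangement or not, leaves the joint law unchanged.

    Let $X_1,\dots,X_n \stackrel{\text{iid}}{\sim} N(\mu,\sigma^2)$ and let $\pi$ be any
    permutation of $\{1,\dots,n\}$. The joint density factorizes as
    $f(x_1,\dots,x_n) = \prod_{i=1}^n \varphi_{\mu,\sigma}(x_i)$, a symmetric function
    of its arguments, so $f(x_{\pi(1)},\dots,x_{\pi(n)}) = f(x_1,\dots,x_n)$.
    Hence $(X_{\pi(1)},\dots,X_{\pi(n)}) \stackrel{d}{=} (X_1,\dots,X_n)$, and in
    particular each coordinate $X_{\pi(i)} \sim N(\mu,\sigma^2)$.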

6 Comments
2024/11/29
08:39 UTC

4

LMM vs PANEL REGRESSION

I'm getting very, very confused about the difference between fixed and random effects, because the definitions are not the same in the panel data and longitudinal data contexts.

For starters, panel data is essentially longitudinal data right? Observing individuals over time.

For panel data and panel data regression, I have read several papers saying that fixed-effects models are the ones with varying intercepts, while random-effects models have one general intercept. Even in Stata and R, this seems to be the case in terms of the coefficients. And the test used to identify which is more appropriate is the Hausman test.

However, for longitudinal data, when the linear mixed model is considered, the random-effects model is the one with varying intercepts, and the fixed effects are the ones with constant estimates. And what I was told to use to determine whether fixed or random effects are appropriate is the LRT.

I am really confused. So can anyone help me?

3 Comments
2024/11/29
07:04 UTC

0

How many people can actually implement an LLM or image generation AI themselves from scratch? (Question cross-post 👇)

Sorry if this isn't the right place to ask this question (I originally asked in r/AskProgramming), but I'm curious. For example, I recently saw this book on Amazon:

Build a Large Language Model (From Scratch)

I'm curious how many people could sit down at a computer and, with just the C++ and/or Python standard library and at most a matrix library like NumPy (plus some AWS credit for things like data storage and human AI trainers/labelers), implement an LLM or image-generation AI themselves (from scratch).

Like, estimate a number of people. Also, what educational background would these people have? I have a Computer Science bachelor's degree from 2015, and Machine Learning/AI wasn't even part of my curriculum.

14 Comments
2024/11/29
01:21 UTC
