/r/datascience
A space for data science professionals to engage in discussions and debates on the subject of data science.
Every time I wanted to use LLMs in my existing pipelines, the integration was bloated, complex, and too slow. That's why I created a lightweight library that works just like scikit-learn: the flow generally follows a pipeline-like structure where you “fit” (learn) a skill from sample data or an instruction set, then “predict” (apply the skill) to new data, returning structured results.
High-Level Concept Flow
Your Data --> Load Skill / Learn Skill --> Create Tasks --> Run Tasks --> Structured Results --> Downstream Steps
Installation:
pip install flashlearn
Learning a New “Skill” from Sample Data
Following the fit/predict pattern from scikit-learn, you can quickly “learn” a custom skill from minimal (or no!) data. Below, we'll create a skill that evaluates how likely a user is to buy a product based on their comments on social media posts, returning a score (1–100) and a short reason. We'll use a small dataset of comments and instruct the LLM to transform each comment according to our custom specification.
from flashlearn.skills.learn_skill import LearnSkill
from flashlearn.client import OpenAI
# Instantiate your pipeline “estimator” or “transformer”, similar to a scikit-learn model
learner = LearnSkill(model_name="gpt-4o-mini", client=OpenAI())
data = [
    {"comment_text": "I love this product, it's everything I wanted!"},
    {"comment_text": "Not impressed... wouldn't consider buying this."},
    # ...
]
# Provide instructions and sample data for the new skill
skill = learner.learn_skill(
    data,
    task=(
        "Evaluate how likely the user is to buy my product based on the sentiment in their comment, "
        "return an integer 1-100 on key 'likely_to_buy', "
        "and a short explanation on key 'reason'."
    ),
)
# Save skill to use in pipelines
skill.save("evaluate_buy_comments_skill.json")
Input Is a List of Dictionaries
Whether the data comes from an API, a spreadsheet, or user-submitted forms, you can simply wrap each record into a dictionary—much like feature dictionaries in typical ML workflows. Here’s an example:
user_inputs = [
    {"comment_text": "I love this product, it's everything I wanted!"},
    {"comment_text": "Not impressed... wouldn't consider buying this."},
    # ...
]
Run in 3 Lines of Code - Concurrency built-in up to 1000 calls/min
Once you’ve defined or learned a skill (similar to creating a specialized transformer in a standard ML pipeline), you can load it and apply it to your data in just a few lines:
# Suppose we previously saved a learned skill to "evaluate_buy_comments_skill.json".
from flashlearn.skills import GeneralSkill  # import path assumed; check the library's docs

skill = GeneralSkill.load_skill("evaluate_buy_comments_skill.json")
tasks = skill.create_tasks(user_inputs)
results = skill.run_tasks_in_parallel(tasks)
print(results)
Get Structured Results
The library returns structured outputs for each of your records. The keys in the results dictionary map to the indexes of your original list. For example:
{
    "0": {
        "likely_to_buy": 90,
        "reason": "Comment shows strong enthusiasm and positive sentiment."
    },
    "1": {
        "likely_to_buy": 25,
        "reason": "Expressed disappointment and reluctance to purchase."
    }
}
Pass on to the Next Steps
Each record's output can then be used in downstream tasks, for instance stored in a database or passed to the next step in your pipeline. Below is a small example showing how you might parse the dictionary and feed it into a separate function:
# Suppose 'flash_results' is the dictionary with structured LLM outputs
for idx, result in flash_results.items():
    desired_score = result["likely_to_buy"]
    reason_text = result["reason"]
    # Now do something with the score and reason, e.g., store in DB or pass to next step
    print(f"Comment #{idx} => Score: {desired_score}, Reason: {reason_text}")
Comparison
FlashLearn is a lightweight library for people who do not need the high-complexity flows of LangChain.
If you like it, give us a star: Github link
Asking as I am new to the space and wondering what the best practices are for:
Assessing balance (see the SMD sketch at the end of this post)
Choosing confounders
Examples of a rigorous observational study to learn from
Any tools currently available to help speed up the process
Many thanks
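On the balance point from the list above: the check most write-ups standardize on is the standardized mean difference (SMD), with |SMD| < 0.1 as the usual rule of thumb. A minimal sketch (column names are hypothetical):

import numpy as np

def standardized_mean_diff(df, covariate, treatment_col="treated"):
    # SMD = difference in group means, scaled by the pooled standard deviation
    treated = df.loc[df[treatment_col] == 1, covariate]
    control = df.loc[df[treatment_col] == 0, covariate]
    pooled_sd = np.sqrt((treated.var() + control.var()) / 2)
    return (treated.mean() - control.mean()) / pooled_sd

# Check every candidate confounder before and after matching, e.g.:
# smds = {c: standardized_mean_diff(matched_df, c) for c in ["age", "income"]}
# imbalanced = {c: v for c, v in smds.items() if abs(v) > 0.1}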
This is a follow-up to a previous post.
Long story short, I got a raise in my current role before I even told them about the new job offer. To my knowledge our boss is very generous with raises, typically around 7%, but in my case it was 20%. Now my current role pays more.
I communicated this to the recruiter and they were stressed, but it is hard for me to make a choice now. They said they can't afford me: they see me as a high intermediate, their budget maxes out at 120, and they were offering 117. I told them that my total comp is now 125. I then explained why I am making so much more: my current employer genuinely believes that I drive a lot of impact.
Edit: my current employer does not know that I have a job offer yet.
The title is self-explanatory. How are people landing jobs in the data space right now?
I'm working on a tool that is collaborative in nature and has real-time sync (think multiplayer mode in a video game). If anyone has any guidance on designing a statistical test for this kind of feature, or on whether the juice is worth the squeeze, I'd really appreciate it!
I'm rebuilding a model in Python that I previously built in R.
In R, I used the "changepoint" package for changepoint identification, which I've been trying to replicate in Python using the "ruptures" package -- but holy hell is there ever a difference.
R's package gave me exactly what I expected every time without configuration, but Ruptures is spotty at best.
Is anyone aware of a better changepoint detection package?
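One likely source of the difference: R's changepoint package defaults to PELT with an MBIC-style penalty, while ruptures makes you pick the cost model and the penalty yourself. A rough sketch of pushing ruptures toward R's mean-shift defaults (the penalty heuristic below is an assumption to tune, not an exact equivalent of MBIC):

import numpy as np
import ruptures as rpt

rng = np.random.default_rng(42)
# Synthetic signal with two mean shifts
signal = np.concatenate([
    rng.normal(0, 1, 100),
    rng.normal(5, 1, 100),
    rng.normal(2, 1, 100),
])

# "l2" detects mean shifts (closest to changepoint::cpt.mean);
# estimate the noise variance from first differences so the
# BIC-style penalty is on the right scale
sigma2 = np.diff(signal).var() / 2
algo = rpt.Pelt(model="l2", min_size=5).fit(signal)
breakpoints = algo.predict(pen=2 * np.log(signal.size) * sigma2)
print(breakpoints)  # segment end indexes, e.g. [100, 200, 300]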
Hey everyone,
I'm wondering, for those of you working on observational studies and using methods like PSM, TMLE, matching, etc.:
How long does such a project take you end to end (getting the data through to the final evaluation result)? And have you found any ways to speed up your process?
Looking to see if there are any ways I could be speeding up the whole process, as it normally takes forever (2-3 months).
So DeepSeek-R1 has just landed on OpenRouter and you can now use the API for free. Check how to get the API key and the code here: https://youtu.be/jOSn-1HO5kY?si=i6n22dBWeAino0-5
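For reference, OpenRouter exposes an OpenAI-compatible endpoint, so the call itself is short; a minimal sketch (the model slug and env var name are assumptions, check OpenRouter's docs):

import os
from openai import OpenAI

# OpenRouter speaks the OpenAI protocol, just with a different base URL
client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key=os.environ["OPENROUTER_API_KEY"])

resp = client.chat.completions.create(
    model="deepseek/deepseek-r1:free",  # slug assumed; check the model list
    messages=[{"role": "user", "content": "Explain PELT in one paragraph."}],
)
print(resp.choices[0].message.content)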
I'm a head-of at a large-ish ecommerce company so I do not code much these days, but I created said assistant to help me with programming tasks and it has been massively helpful. Just sharing, and wondering what anyone else would use. The "do all charts in the style of The Economist" instruction is massively helpful (though it works better in R than Python, which is what we primarily use at work, but c'est la vie).
- when I prompt you initially for a code-related task, make sure that you first understand the business objectives of the work that we are doing. Ask me clarifying questions if you have to.
- when you are not clear on a task, ask clarifying questions; feel free to give me a list of queries that we can run to help you understand the task better
- for any charting requests, always do it in the style of The Economist or McKinsey / Harvard Business Review (and following the principles of Edward Tufte outlined below)
- try to give all responses integrated into the one code block that we were discussing
- always run debugging code within larger code blocks (over 100 lines), and add code to explicitly state where new files have been created. Debugging code should partition the larger query into small chunks and identify where any failures may be occurring
- if I want to break away from the current train of thought without starting a new chat, I will preface my prompt with # please retain memory but be aware that we may be switching context
- when we create a data frame or source data to perform analysis on or create charts from, assign it a number; we will use that number when writing prompts but the table / data frame will remain the same in the code that we use (we are just assigning a number to allow for shorthand when communicating by prompt), i.e. sales_table may just be 1, so a prompt to extract total sales from 1 should return the code select sum(sales) from sales_table
- when I use the word innovation or any of its derivatives, feel free to suggest out-of-the-box ideas or procedural improvements to the topic we are discussing
- use Python unless I specify otherwise; R would be the next most likely language to be used
- when printing out charts, also print out summary statistics if you feel it necessary. Keep the tabular format clean and tidy (do not use base R / Python to achieve this)
- for any charting, abide by the principles of visualisation pioneer Edward Tufte, which are comprehensively summarised here:
Graphical Excellence: Show complex ideas communicated with clarity, precision, and efficiency. Tufte argues that graphics should reveal data, avoid distorting what the data has to say, encourage the eye to compare different pieces of data, and make large datasets coherent.
Data-Ink Ratio: Maximize the ratio of data-ink to total ink used in a graphic. Tufte advocates for removing all non-essential elements ("chartjunk") – decorative elements, heavy gridlines, unnecessary borders, and redundant information that don't contribute to understanding.
Data Density: Present as much data as possible in the smallest possible space while maintaining clarity. High-density graphics can be both elegant and precise.
Small Multiples: Use repeated small charts with the same scale and design to show changing data across multiple dimensions or time periods. This allows for easy comparison and pattern recognition. (This one is important: use small multiples wherever possible.)
Integration of Text and Graphics: Words, numbers, and graphics should be integrated rather than separated. Labels should be placed directly on the graphic rather than in legends when possible.
Truthful Proportions: The representation of numbers should be directly proportional to the numerical quantities represented. This means avoiding things like truncated axes that can mislead viewers.
Causality and Time Series: When showing cause and effect or temporal sequences, graphics should read from left to right and clearly show the relationship between variables.
Aesthetics and Beauty: While prioritizing function, Tufte argues that the best statistical graphics are also beautiful, combining complexity, detail, and clarity in an elegant way.
See prompt above.
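For illustration, here is a rough matplotlib sketch of what the Tufte principles above can translate to in code (my own interpretation; the data is made up):

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(2015, 2025)
series = {"Online": 1.8 * x - 3600, "Retail": -0.5 * x + 1050}

fig, ax = plt.subplots(figsize=(6, 3.5))
for name, y in series.items():
    ax.plot(x, y, linewidth=1.5)
    # Direct labels instead of a legend (integration of text and graphics)
    ax.annotate(name, (x[-1], y[-1]), xytext=(4, 0),
                textcoords="offset points", va="center")

# Maximize data-ink ratio: strip non-essential spines
for spine in ("top", "right"):
    ax.spines[spine].set_visible(False)
ax.set_title("Sales index by channel", loc="left")
plt.tight_layout()
plt.show()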
Received an offer for an Applied Scientist II (L5) role at AWS Kumo (Bellevue) and wondering if it's on the lower side?
Base : $165K
Year 1 Sign-On: $165K
Year 2 Sign-On : $125K
RSUs: 1,600 shares (5%, 15%, 20% every 6 months in years 3 & 4)
Estimated Year 1 TC: ~$350K
Does this seem competitive for an Applied Scientist II position? I was told the typical range for AS II is about 318K-419K, and base can go up to 193K.
C3 AI (just joined this week)
Senior Data Scientist, GenAI
TC: 245K
YoE: 3 (~0 full-time in the US)
Does it seem like a lowball offer?
Is it pointless to use data science techniques in businesses that don't collect a huge amount of data (for example, a dental office or a small retail chain)? Would using these predictive techniques really move the needle for these types of businesses? Or is it more of a nice-to-have?
If not, how much data generation is required for businesses to begin thinking of leveraging a data scientist?
Hey DS community,
Mid level data scientist here.
I'm currently involved in a project where I'm expected to deliver an appropriate AI strategy for my firm… I'd like to benefit from the hive's experience.
I'm interested in looking at the ideas and philosophies behind the AI strategies of the companies you work for.
What products do you use? For your staff, clients? Did you use in-house solutions or buy a product? How did you manage security and Data governance issues? Were there open source solutions? Why did you/did you not go for them?
I’d appreciate if you could also share resources that aided you in defining a strategy for your team/firm.
Cheers.
I recently interviewed for a data scientist role, and the format of the interview turned out to be quite different from what the recruiter had initially described.
Specifically, I was told that the interview would focus on a live coding test for SQL and Python, but during the actual interview, it included a case study. While I was able to navigate the interview, the difference caught me off guard.
Has anyone else experienced a similar situation? How common is it for interview formats to deviate from what was communicated beforehand? Also, is it appropriate to follow up with the recruiter for clarification or feedback regarding this mismatch?
Would love to hear your thoughts and experiences!
Hey guys, this is literally my first time attending a professional interview in my entire life. I don't know how this process works, but I just got an email for a HireVue as my first round, and it's a virtual interview, which I was not expecting. Any inputs you can give will potentially help me!!
TIA
I've been browsing jobs recently (since my current role doesn't pay well). I usually search for jobs in the data field in general rather than a particular title, since titles have so much variance. But one thing I've noticed is that there are way more data engineering roles than either data scientists or ML engineers on the job boards. When I say data engineering jobs, I mean the roles where you are building ETL pipelines, scalable/distributed data infrastructure and storage in the cloud, building data ingestion pipelines, DataOps, etc.
But why is this? I thought that, given all the hype over AI these days, there would be more LLM/ML jobs. And there are certainly a number of those, don't get me wrong, but they pale in comparison to the number of data engineering openings. Did I make a mistake in choosing data science and ML? Is data engineering in more demand and more secure? If so, why? Should I fully transition to data engineering?
Hey everyone,
I'm constantly hearing news of layoffs and was wondering which areas you think are more secure, and how secure you think your own job is.
How worried are you all about layoffs? Are you always looking for jobs just in case?
Hi everyone,
I created a job board and decided to share it here, as I think it can be useful. The job board consists of job offers from FAANG companies (Google, Meta, Apple, Amazon, Nvidia, Netflix, Uber, Microsoft, etc.) and allows you to filter job offers by location, years of experience, seniority level, category, etc.
You can check out the "Data Science" positions here:
https://faang.watch/?categories=Data+Science
Let me know what you think - feel free to ask questions and request features :)
A while ago I thought about using LLMs for classic machine learning tasks, which is stupid, I know, but I tried it anyway.
Never use it if:
Do use it if:
Key findings:
| Dataset | IMDB 50k Dataset | Cats and dogs |
|---|---|---|
| Data | Text data - positive/negative sentiment | Picture data - predict what is on the picture |
| Accuracy | 96% (SOTA: 98%+) | 97% (SOTA: 99%+) |
| Model | gpt-4o-mini | gpt-4o-mini |
As you can see, LLMs perform worse than SOTA specialized models, but for use cases with minimal data they can still be very useful.
It took some time to code this in a way that others can also use; here is a minimal example of how to use it when applicable.
You can install FlashLearn using pip:
pip install flashlearn
Below is a sample code snippet demonstrating how to classify text using FlashLearn in just 10 lines of code:
from openai import OpenAI
from flashlearn.skills.classification import ClassificationSkill

# You can use OpenAI, DeepSeek, or any OpenAI-compatible endpoint
deep_seek = OpenAI(api_key='YOUR DEEPSEEK API KEY', base_url="https://api.deepseek.com")

data = [{"message": "Where is my refund?"}, {"message": "My product was damaged!"}]

skill = ClassificationSkill(
    model_name="gpt-4o-mini",
    client=OpenAI(),  # pass client=deep_seek here to use DeepSeek instead
    categories=["billing", "product issue"],
    system_prompt="Classify the request."
)

tasks = skill.create_tasks(data)
results = skill.run_tasks_in_parallel(tasks)
print(results)
Feel free to experiment and figure out if it's useful for your workflow. Here are just some tips:
You can ask anything in the comments below!
P.S: Full code ready to be abused available at https://github.com/Pravko-Solutions/FlashLearn
NVIDIA has announced free access (for a limited time) to its premium courses, each typically valued between $30 and $90, covering advanced topics in Generative AI and related areas.
The major courses made free for now are:
Note: There are redemption limits on these courses; a user can enroll in only one specific course.
Platform Link: NVIDIA TRAININGS
I have a set of ML algorithms to be fit to the same data in a df. Some of them take days to run, while others usually take minutes. What I'd like to do is set up a max model-fitting timer, so once the fitting/training of an algorithm exceeds it, the loop will skip that algo and move on to the next one. Is there a way to terminate model.fit() after it is initiated, based on a prespecified time? Here are my code excerpts.
from datetime import datetime
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from xgboost import XGBRegressor

# (random_state, df, list_of_predictors, and logger are defined elsewhere in the class)
ml_model_param_for_price_model_simple = {
    'Linear Regression': {
        'model': LinearRegression(),
        'params': {
            'fit_intercept': [True, False],
            'copy_X': [True, False],
            'n_jobs': [None, -1]
        }
    },
    'XGBoost Regressor': {
        'model': XGBRegressor(objective='reg:squarederror', random_state=random_state),
        'params': {
            'n_estimators': [100, 200, 300],
            'learning_rate': [0.01, 0.1, 0.2],
            'max_depth': [3, 5, 7],
            'subsample': [0.7, 0.8, 1.0],
            'colsample_bytree': [0.7, 0.8, 1.0]
        }
    },
    'Lasso Regression': {
        'model': Lasso(random_state=random_state),
        'params': {
            'alpha': [0.01, 0.1, 1.0, 10.0],  # Lasso regularization strength
            'fit_intercept': [True, False],
            'max_iter': [1000, 2000]  # Maximum number of iterations
        }
    },
}
The looping and fitting of data below:
X = df[list_of_predictors]
y = df['outcome_var']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=self.random_state)

# Hyperparameter tuning and model training
tuned_models = {}

for model_name, current_param in self.param_grids.items():
    model = current_param['model']
    params = current_param['params']

    if params:  # Check if there are parameters to tune
        if model_name == 'XGBoost Regressor':
            model = RandomizedSearchCV(
                model, params, n_iter=10, cv=5, scoring='r2', random_state=self.random_state
            )
        else:
            model = GridSearchCV(model, params, cv=5, scoring='r2')

        start_time = datetime.now()  # Start timing
        model.fit(X_train, y_train)  # NOTE: I want this to break out when a timer is done!!
        end_time = datetime.now()  # End timing

        tuned_models[model_name] = model.best_estimator_  # Store the best fitted model
        logger.info(f"\n{model_name} best estimator: {model.best_estimator_}")
        logger.info(f"{model_name} fitting time: {end_time - start_time}")  # Print the fitting time
    else:
        start_time = datetime.now()  # Start timing
        model.fit(X_train, y_train)  # Fit model directly if no params to tune
        end_time = datetime.now()  # End timing

        tuned_models[model_name] = model  # Save the trained model
        logger.info(f"{model_name} fitting time: {end_time - start_time}")  # Print the fitting time
I recently received a job offer from a mid-to-large tech company in the gig economy space. The role comes with a competitive salary, offering a 15-20k increase over my current compensation. While the pay bump is nice, the job itself will be challenging as it focuses on logistics and pricing. However, I do have experience in pricing and have demonstrated my ability to handle optimization work. This role would also provide greater exposure to areas like causal inference, optimization, and real-time analytics, which are areas I’d like to grow in.
That said, I’m concerned about my career trajectory. I’ve moved around frequently in the past—for example, I spent 1.5 years at a big bank in my first role but left due to a toxic team. While I’m currently happy and comfortable in my role, I haven’t been here for a full year yet.
My current total compensation is $102k. While the work-life balance is great, my team is lacking in technical skills, and I've essentially been responsible for upskilling the entire practice. Another area of concern is that, technically, we are not able to keep up with bigger companies, and the work is highly regulated, so innovation isn't as easy.
Given how frequently I've moved, what would you do in my shoes? Take it and try to improve my career opportunities in big tech?
When to stop on the developer track?
How much do I need to master to be a good MLE?
Demo of live updating chart on samplesizecalc.com
It's been a while since I've worked on my sample size calculator tool (last post here). But I had a lot of fun adding an interactive chart to visualize required sample size, and thought you all would appreciate it! Made with d3.js
Check it out here: https://www.samplesizecalc.com/calculator?metricType=proportion
What I love about this is that it helps me understand the relationship between each of the variables, statistical power and sample size. Hope it's a nice explainer for you all too.
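For anyone curious, the closed-form relationship the chart is visualizing for proportions is the standard two-proportion formula; a sketch (I'm assuming the calculator uses something equivalent):

from scipy.stats import norm

def sample_size_per_group(p1, p2, alpha=0.05, power=0.8):
    # Required n per group for a two-sided two-proportion z-test
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value for significance level
    z_beta = norm.ppf(power)           # critical value for desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    effect = abs(p2 - p1)
    return (z_alpha + z_beta) ** 2 * variance / effect ** 2

# e.g. detecting a lift from 10% to 12% conversion:
print(round(sample_size_per_group(0.10, 0.12)))  # ~3838 per group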
I also have plans to add a line chart to show how statistical power increases over time (i.e. the longer the experiment runs, the more samples you collect and the greater the power!)
As always, let me know if you run into any bugs.
If your potential employer requires you to sign an NDA for a take home assignment, they’re exploiting you for free work.
In particular, if the work they want you to do is remarkably specific, definitely do not do it.