/r/bioinformatics
A subreddit dedicated to bioinformatics, computational genomics and systems biology.
- If you have a specific bioinformatics related question, there is also the question and answer site BioStar and the next generation sequencing community SEQanswers
- If you want to read more about genetics or personalized medicine, please visit /r/genomics
- Information about curated, biologically relevant databases can be found in /r/BioDatasets
- Multicore, cluster, and cloud computing news, articles and tools can be found over at /r/HPC.
Hi, I’m trying to install FoldX into YASARA, and I have tried the methods shown in both the FoldX manual and the YASARA manual. But for some reason, I don’t get the clickable FoldX option under Analyze. Am I doing something wrong? By the way, I have a MacBook Air M2.
For me it was today when I found out about the PyMOL plugin PyMod.
✅ Beautiful UI ✅ Integration of a lot of tools I use (PSI-BLAST, Clustal Omega, HMMER, MUSCLE, CAMPO, PSIPRED, and MODELLER) ✅ Open source
I’m very new to this but we have scATAC and scRNA data, and we are looking to see if there is acetylation or methylation in certain conditions or some histones, mainly H3K27ac and H3K4me1, and if there are changes we would have trained immunity.
When I look into how to do the analysis, it says we need ChIP-seq data. But my postdoc says it can be done with scATAC as well, as seen in the publications below:
https://pubmed.ncbi.nlm.nih.gov/25258085/
https://www.sciencedirect.com/science/article/pii/S0092867422003932
https://www.sciencedirect.com/science/article/pii/S0092867417315118
I’d appreciate any help! I’m not sure how to do this at all.
What can you do after an MS in Agro Biotechnology: a PhD or a full-time job?
What are the options for a PhD, and what specific job titles can I apply for after my Masters?
(I did my Masters in Germany, so I would prefer to explore options in Germany.)
Thanks.
Hey folks, does anyone know of reference VCFs for somatic variant calling in mouse genomes? I'm thinking along the lines of gnomAD, the Illumina panel of normals, etc., for use with Mutect2 without needing to try or test liftover from the human versions of these files (or does anyone know whether that approach would work? Surely someone here has tried).
My plan is probably just to throw Mutect2 at it without the benefit of any of these resources, but obviously making Mutect's job easier makes the data better.
I have two single-cell datasets: one from mouse and one external human dataset. I want to integrate these two datasets using the SCTransform workflow. I am also planning to try other integration methods, but I chose SCTransform because it works well with my mouse samples.
To align the genes between mouse and human, I am using an orthologs table to match the genes. However, I wanted to confirm if this approach is appropriate or if there is a better method for integrating mouse and human data.
I came across a paper (https://www.nature.com/articles/s41467-023-41855-w) that benchmarks different integration methods across species. However, this study did not test the SCTransform workflow and did not exclusively integrate mouse and human datasets. I was wondering if anyone has experience with a similar integration or can offer insights into the best practices for cross-species single-cell integration.
I appreciate any suggestions. Thank you!
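The ortholog-table step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the table is a hypothetical list of (mouse symbol, human symbol) pairs, the `GeneX` rows are invented to stand in for a one-to-many case, and keeping only one-to-one orthologs is one common choice rather than the only valid one. The function names are made up for this sketch.

```python
from collections import Counter

def one_to_one_orthologs(pairs):
    """Keep only mouse->human pairs where both genes appear exactly once."""
    mouse_counts = Counter(m for m, _ in pairs)
    human_counts = Counter(h for _, h in pairs)
    return {m: h for m, h in pairs
            if mouse_counts[m] == 1 and human_counts[h] == 1}

def rename_counts(counts_by_mouse_gene, mapping):
    """Rename mouse genes to human symbols, dropping unmapped genes."""
    return {mapping[g]: v for g, v in counts_by_mouse_gene.items() if g in mapping}

# Hypothetical ortholog table; the GeneX rows mimic a one-to-many relationship
pairs = [("Actb", "ACTB"), ("Gapdh", "GAPDH"),
         ("GeneX", "GENEX1"), ("GeneX", "GENEX2")]
mapping = one_to_one_orthologs(pairs)

# Hypothetical per-gene counts for one mouse cell
mouse_cell = {"Actb": 12, "Gapdh": 7, "GeneX": 3}
human_named = rename_counts(mouse_cell, mapping)
print(human_named)
```

Genes without an unambiguous partner are dropped here, so both datasets end up on a shared gene space before any integration method is applied.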
Hey all, I'm currently trying to download SRA files using Aspera Connect, but as soon as I enter the command, it asks for a password (the password is neither the IBM Aspera account password nor the computer password). One additional piece of info: Aspera Connect 4.2 versions don't need SSH keys.
Description
Human migration introduces new genetic variants to host populations that may be passed on and eventually reach modern populations. For millennia, the Silk Roads facilitated the exchange of genetic information between the East (China) and the West (the Roman Empire). However, we know little about WHO and WHAT traveled these roads. This is the first attempt to study the Silk Roads genetically by sequencing the first ancient DNA of the mysterious Parthians who paved the First Silk Road and disappeared centuries later, almost without leaving any written evidence.
By harnessing AI and analyzing ancient genomes, we will gain insights into their ancestry, social practices, dietary habits, and more. This is a novel study that focuses on a poorly known civilization that ruled Central Asia for 500 years and a historical highway of ideas, beliefs, and genes.
Requirements
Applicants must have a Ph.D. or equivalent degree (within three years of the application deadline, with exceptions for special circumstances) in a relevant field such as machine learning, mathematics, biostatistics, or statistical genetics. Essential skills include:
The full post is here: http://www.eranelhaiklab.org/PostdocAd.html
Start: The expected start date is 1/3/25 or as soon as possible.
Questions and contact: please contact eran dot elhaik at biol.lu.se for questions
Keywords: #SilkRoads, #AncientDNA, #AI, #MachineLearning
I've been working with sequencing data and found the following:
The first image shows the per base sequence quality graph. Quality usually decreases towards the end, but here the minimum values are low across all positions, yet the basic statistics state that 0 sequences were flagged as poor quality. How should I trim this? The second image belongs to the same fastq file.
In the third image I encountered a really weird per base sequence content graph. Usually, there is a lot of variation toward the beginning of the graph, but this one is all mixed up. There are two overrepresented sequences, but I really don't know to what extent they influence this.
Both graphs are from different fastq files
Hello,
I'm performing an analysis that is fairly new to me and would like to check that my statistics are correct. I have quantities for <100 proteins measured across M samples. These samples fall into Z demographics, which include demographics of interest, each of which is paired with 1 control demographic (e.g. 'diseased old person' vs. 'healthy old person'). In the table below, you see 1 protein, 1 demographic of interest (Demographic 1, samples s1 - s3) and 1 control for that demographic of interest (samples s4 - s6):
Protein | s1 (Demographic 1) | s2 (Demographic 1) | s3 (Demographic 1) | s4 (Control 1) | s5 (Control 1) | s6 (Control 1) |
---|---|---|---|---|---|---|
Protein 1 | Amount | Amount | Amount | Amount | Amount | Amount |
I am pulling out interesting proteins by doing a Mann Whitney U test, using samples in the demographic of interest vs samples in the control for that demographic. These are represented as a Volcano Plot, with one plot per demographic of interest.
Question: Should I be doing multiple testing correction to set an alpha for the test p value? I was under the impression this is only needed if I am doing a lot of redundant tests (e.g. Demographic 1 vs Control, Demographic 1 vs Demographic 2, ...). But it seems to be a common step before making Volcano plots, and so it might just be a case of 'do it if you do a lot of tests in general'.
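For reference, each protein tested is its own hypothesis, so a volcano plot covering up to 100 proteins already involves up to 100 tests per demographic, whether or not the comparisons are redundant. Below is a minimal sketch of the Benjamini-Hochberg adjustment in plain Python, assuming one Mann-Whitney p-value per protein. The p-values are invented for illustration, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values, one per protein (illustrative only)
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
padj = benjamini_hochberg(pvals)
for p, q in zip(pvals, padj):
    print(f"p = {p:.3f}  adjusted = {q:.5f}")
```

In this toy example the proteins with raw p-values of 0.039 and 0.041 both end up just above 0.05 after adjustment, which is the kind of change that moves points across the significance line on a volcano plot.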
I analyze a lot of high-dimensional biological data. Usually, I have 25-50 biomarkers that I compare between two conditions. My go-to analysis is to perform a Wilcoxon test across these variables, followed by a correction for multiple testing (Benjamini & Hochberg). Usually, we don't have another dataset to validate findings, unless we generate this data ourselves.
Often, the biological effects are sufficiently large that I end up with a subset of significant biomarkers (P.adjust < 0.05, ~5-10 biomarkers) that we can evaluate further. I now encountered a setting in which none of the biomarkers are significant after multiple testing correction. However, (as expected or would occur by chance), I do find a set of biomarkers that are significant before correcting.
If I cluster based on these markers, I get a distinct clustering that almost perfectly separates two patient groups (n = 40) with a limited set (8) of biomarkers. This seems interesting to me, but I don't want to be over-optimistic, as I'm now entering "cherry picking territory".
Are there any alternatives to this typical "test-correct" pipeline to navigate this? I want to keep the analysis simple and robust. As I'm not working on RNA-seq data, the typical packages for that type of data do not apply.
Hi, I am a bioinformatics assistant who works primarily with RNA sequencing. The DESeq2 package is amazing, but I have noticed that I often cannot get the comparisons I want with the results function, and I do not know if it's because I lack enough data for sufficient calculations and/or because I am struggling to understand experimental design.
Here is an example of how I find DEGs for samples; I want to know if it is a good strategy or if I have a misunderstanding. Say I have three controls, C1, C2, and C3, as well as PT1. I have nonstimulated samples and stimulated samples: C1_NS, C2_NS, C3_NS, PT1_NS, C1_STIM, C2_STIM, C3_STIM, PT1_STIM. My current strategy is to separate the controls into a separate dataframe, then run
library(DESeq2)
# Build the dataset from the control-only count matrix
dds_control <- DESeqDataSetFromMatrix(countData = control,
                                      colData = colData_control,
                                      design = ~ stimulation)
dds_control <- DESeq(dds_control)
Now I can use results comparing Stim with NS:
res_control <- results(dds_control, contrast = c("stimulation", "STIM", "NS"))
With res_control I can remove genes based on log2FC and p-value and any other statistical criteria. The remaining rownames are what I consider DEGs based on stimulation, and I subset my original dataframe, which includes the patients, to just those DEGs.
While this seems to work logically, for whatever reason it leaves a bad taste in my mouth. Can anyone validate this strategy, or if it's bad, do you have any others you can recommend? I always feel like I am missing an important step or a better way to do it. Thanks!
Hi,
I am looking to perform differential gene expression analysis using DESeq2 in R. I initially used TPM data for this, which I now realize was incorrect. My question is: where do I get TCGA raw count data that is appropriate for DESeq2? I looked at Xena and they had log-transformed raw counts, but if my understanding is correct, I can't use that for DESeq2. This is specifically for TCGA KIRC.
Thx
Has anyone used these tools? I have cloned the repositories. I need to use both tools to generate a complex between two small proteins, one around 117 and the other 10.
Hi all, I am an undergrad pursuing a degree in bioinformatics. I want to do something at the intersection of bioinformatics and agriculture for my upcoming research, specifically drought tolerance gene research on an African orphan crop. I've seen that this heavily limits what I can do in terms of data availability, but I've been able to find RNA-Seq data for cowpea and I'm looking to work with that. My plan right now is to use ML and bioinformatics to identify and prioritize drought-responsive genes in cowpea. Given that other studies have used other methods to identify drought tolerance genes but none has used an ML approach (to the best of my knowledge), would this be considered a contribution to knowledge, or do I have to do more as a bioinformatician? Any reply will be appreciated.
Hi. I am working with a 10x Visium dataset and I would like to calculate the number of cells per spot in my dataset. Inspecting colData(spe) shows that I do not have a cell_counts column in my metadata. I would appreciate any information that can help me achieve this and add it to my SpatialExperiment object for further downstream analyses in R.
I’m a high school student, who has self-learnt RNA-Sequencing. I don’t have a supervisor or mentor. At the high school level, does this methodology seem sound for a research project:
Research question: How does Factor X impact gene expression in heart tissue of Mus musculus?
Methodology: I can’t run tests on mice because I’m in high school, and I don’t have connections to labs to make it happen. So I’ll find a publicly available online database which has data for a control group and an experimental group exposed to Factor X. For each group, I’ll make sure that there are enough mouse replicates. I’ll find two more datasets from different experiments that also have an experimental group of mice which received Factor X. Then I’ll download the FASTQ files, do QC, trimming, and alignment, get count files, find DEGs, and do GO and GSEA. Then I’ll look at the data from each dataset and see what’s in common between them. Then I’ll conclude something like this: “Genes A, B, etc. were downregulated and play a role in function C in the heart, suggesting that heart function C may be negatively affected when the heart tissue is exposed to Factor X.”
Please critique this methodology, but do keep in mind that I’m a high schooler with very beginner knowledge without the means to do my own experimentation.
Thank you for your assistance and guidance.
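The last step, checking what the datasets have in common, can be sketched in plain Python. This assumes each dataset has already been reduced to its significant genes with a log2 fold change each. The gene names, values, and function name are hypothetical placeholders, and requiring the same direction of change in every dataset is just one possible rule.

```python
def consistent_degs(datasets):
    """Return genes significant in every dataset with the same sign of log2FC."""
    shared = set(datasets[0])
    for d in datasets[1:]:
        shared &= set(d)
    result = {}
    for gene in shared:
        # True means upregulated, False means downregulated
        signs = {d[gene] > 0 for d in datasets}
        if len(signs) == 1:
            result[gene] = "up" if signs.pop() else "down"
    return result

# Hypothetical significant genes (gene -> log2 fold change) from three datasets
ds1 = {"GeneA": -1.8, "GeneB": -1.2, "GeneC": 2.0}
ds2 = {"GeneA": -1.1, "GeneB": 0.9, "GeneC": 1.5}
ds3 = {"GeneA": -2.3, "GeneB": -1.0, "GeneD": 1.7}
common = consistent_degs([ds1, ds2, ds3])
print(common)
```

In this toy input only GeneA survives: GeneB changes direction in one dataset, and GeneC and GeneD are not significant everywhere.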
Hi everyone,
I wanted to ask if anyone knows how to retrieve "UniProt keywords" for UniProt IDs. Is there an R package for this? I'm familiar with accessing GO and KEGG with clusterProfiler, but this is my first time seeing the ability to classify proteins according to post-translational modification, as seen in this figure, and I would like to try it with my proteomics dataset.
Here's the link to the paper: Engineered nanoparticles enable deep proteomics studies at scale by leveraging tunable nano–bio interactions | PNAS, as well as the figure I want to replicate.
On the note of retrieving info from UniProt, is there also an easy way to retrieve the number of amino acids per protein in R?
Thanks very much!
Is Illumina's Dragen RNA aligner based on the STAR aligner? They have similar output formats including a one-pass / two-pass alignment approach, but nowhere could I see conclusively that Dragen RNA is based on STAR.
If anyone has had experience using both, I'd appreciate it if you could share your experience and if there are notable alignment differences between the two.
So I'm trying the MAKER pipeline to generate GFF files for fungal species, but I'm not able to download some of its prerequisites, like SNAP and Exonerate; the site I have to download them from is not opening. Is there any other way to download them? Or do you know of any other pipeline to generate GFF files for my data?
Hello. I have paired-end sequencing data of the V3-V4 region of the 16S rRNA gene; the libraries were sequenced on the MiSeq Sequencing System. How can I find out which adapters were used, so that I can trim them with cutadapt?
Hi everybody,
Could someone please explain why sequence quality decreases after using fastp? I am currently analyzing small RNA-Seq data, specifically miRNAs. Could this be due to the removal of adapters by fastp?
Hello guys!
So, straight to the problem.
I have a proteomics dataset in the form of a matrix, with 20 samples (as columns), and 6000 proteins (as rows). It's inside the picture inside this post. Protein expression is already log2 transformed.
Performing a PCA with the FactoMineR and factoextra packages, with the following code:
library(factoextra)
res.pca <- prcomp(datiprova_df_numeric, center = TRUE, scale. = FALSE)
fviz_pca_var(res.pca)
I obtain the PCA labeled 1 in the picture inside this post.
By writing
res.pca <- prcomp(datiprova_df_numeric, center = TRUE, scale. = TRUE)
fviz_pca_var(res.pca)
I obtain PCA 2 instead.
Now, when I transpose the matrix, and by writing
res.pca_t <- prcomp(datiprova_df_numeric_t, center = TRUE, scale. = TRUE)
fviz_pca_ind(res.pca_t)
I obtain PCA 3.
Why do the PCAs look different? I mean, using the same matrix I should get the same results, just with the plots inverted if I transpose the matrix. I get why variables become individuals if I transpose, but not why the PCA itself changes.
Can someone help?
Thanks!
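One possible factor, offered as a sketch rather than a diagnosis of the posted data: prcomp treats rows as observations and columns as variables, and it centers (and optionally scales) each column. Transposing the matrix therefore changes which quantity gets centered, so the two analyses do not start from the same preprocessed matrix. The toy matrix and helper names below are hypothetical and only illustrate that column centering does not commute with transposition.

```python
def center_columns(matrix):
    """Subtract each column's mean, analogous to centering per column."""
    n_rows = len(matrix)
    means = [sum(col) / n_rows for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, means)] for row in matrix]

def transpose(matrix):
    """Swap rows and columns."""
    return [list(col) for col in zip(*matrix)]

# Hypothetical toy matrix: 3 proteins (rows) x 2 samples (columns)
X = [[1.0, 2.0],
     [3.0, 6.0],
     [5.0, 10.0]]

# Proteins x samples: each sample column is centered
centered_by_sample = center_columns(X)
# Samples x proteins: each protein column is centered, then flipped back
centered_by_protein = transpose(center_columns(transpose(X)))

print(centered_by_sample)
print(centered_by_protein)
```

The two printed matrices differ, so a decomposition of one is not simply the mirror image of a decomposition of the other. The same reasoning applies to scaling, which is also applied per column.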
This question is intended to be broad because I hope to gain a variety of perspectives on the potential for AI to enhance and accelerate research in the field. Whether it's generating code for analysis or summarizing articles with LLMs, exploring literature more efficiently, using tools like AlphaFold or genomic LLMs for specific problems, or applying traditional machine learning techniques to make discoveries. Whatever way you use AI, feel free to share it.
Hello,
In my data, I have nine different types of samples (group 0 to group 8). I want to know whether group 0 is a "group" so there is within-group similarity, while I also want to know whether group 0 is different from 1,2,3,4... and so on.
I know I can run DGE, but I need a global assessment. I want something besides PCA or t-SNE.
Do you know what I can do?
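One option that is sometimes used for this kind of global question is a label permutation test on pairwise distances. The sketch below is only an illustration under stated assumptions: samples are represented as hypothetical 2-D points, the statistic (mean between-group distance minus mean within-group distance) is one simple choice among several, and the function names are made up. It is a swapped-in toy technique, not an established package's method.

```python
import math
import random

def between_minus_within(points, labels):
    """Mean between-group distance minus mean within-group distance."""
    within, between = [], []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            (within if labels[i] == labels[j] else between).append(d)
    return sum(between) / len(between) - sum(within) / len(within)

def permutation_pvalue(points, labels, n_perm=999, seed=0):
    """Share of label shufflings with a statistic at least as large as observed."""
    rng = random.Random(seed)
    observed = between_minus_within(points, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if between_minus_within(points, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical samples: five from "group0" and five from another group
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.1, 0.1),
          (5.0, 5.0), (5.2, 5.1), (5.1, 4.9), (4.9, 5.2), (5.3, 5.0)]
labels = ["group0"] * 5 + ["other"] * 5
p_value = permutation_pvalue(points, labels)
print(p_value)
```

A small p-value here suggests that samples sharing a label are closer to each other than to the rest, which addresses both within-group similarity and between-group difference in one number.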
Hi everyone, I am working on a project where I use nanopore sequencing to compare methylation between two different conditions of A549 cells. I'd like to compare promoter methylation, but I am not sure how to define the promoters. I thought about using TSS data and then defining the promoters as x bases upstream and y bases downstream of the TSS, but I am unsure how to choose those values. Do you have any ideas about what kind of resources I might want to look at to answer this? Or if you have a completely different approach to solving my problem, that would also be highly appreciated. Thanks for the help!
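The window idea described above, x bases upstream and y bases downstream of the TSS, can be sketched in plain Python. The point worth getting right is strand: on the minus strand, upstream means higher genomic coordinates. The function name and the default sizes (1000 and 200) are arbitrary placeholders for illustration, not recommended values.

```python
def promoter_window(tss, strand, upstream=1000, downstream=200):
    """Return (start, end) of a window around a TSS, 1-based and inclusive.

    The upstream/downstream defaults are placeholders, not recommendations.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        # On the minus strand, "upstream" lies at higher coordinates
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clip at the start of the chromosome
    return max(1, start), end

plus = promoter_window(5000, "+")
minus = promoter_window(5000, "-")
near_start = promoter_window(300, "+")
print(plus, minus, near_start)
```

Keeping the sizes as parameters makes it easy to repeat the comparison with several window definitions and see how sensitive the result is to the choice.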
Not sure where else to ask this question, but I'm interested in working on the Rosalind problems and have never received the email link to activate my Rosalind account. It's been days, too. There's also no contact info on the site to report the issue to. Has anyone else experienced the same issue, and can you shed some light? Thanks.
I'm trying to wrap my head around multiple sequence alignment, but I'm at a loss as to how well the algorithms manage to reduce sequence bias.
When doing a multiple alignment, you seemingly have to select sequences, choose an algorithm, filter, and repeat. But within the algorithm step there are several sub-algorithms (tree building and weighting). How effective are these at reducing sequence bias? Can I just upload any set of sequences and have it sort things out, yielding output similar to what I would get from a subset of my initial set of sequences?