/r/bioinformatics
A subreddit dedicated to bioinformatics, computational genomics and systems biology.
- If you have a specific bioinformatics-related question, there is also the question and answer site BioStar and the next generation sequencing community SEQanswers.
- If you want to read more about genetics or personalized medicine, please visit /r/genomics.
- Information about curated, biologically relevant databases can be found in /r/BioDatasets.
- Multicore, cluster, and cloud computing news, articles and tools can be found over at /r/HPC.
Hello Bioinformatics lovers,
I spent the holiday writing this tutorial https://crazyhottommy.github.io/reproduce_genomics_paper_figures/
to replicate this figure
Happy Learning!
Tommy
Hi guys, I’m pretty new to PLINK. I’m trying to run a GWAS and have a binary covariate. Does it get encoded as 0/1 or as 1/2, like sex and phenotype? I’m a bit confused, and the documentation isn’t very clear about this case. Appreciate the help!
Hi Everyone,
I have been trying for weeks but am having a hard time analyzing 16S PICRUSt2 data. I have tried ggpicrust2 and it does not seem to work. Could anyone please guide me on how to calculate mean proportions, 95% confidence intervals, and p-values for this type of graph? I would really appreciate it.
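One generic way to get a difference in mean proportions, a 95% confidence interval, and a p-value for a single predicted pathway is resampling. The sketch below is a minimal illustration under stated assumptions, not the method ggpicrust2 uses: the function name `compare_proportions` and the example values are hypothetical, the interval is a percentile bootstrap, and the p-value comes from a two-sided permutation test.

```python
import random
import statistics


def compare_proportions(group_a, group_b, n_resamples=2000, seed=0):
    """Compare relative abundances of ONE feature between two groups.

    Returns (difference in means, (ci_low, ci_high), permutation p-value).
    """
    rng = random.Random(seed)
    observed = statistics.mean(group_a) - statistics.mean(group_b)

    # Percentile bootstrap: resample each group with replacement.
    boot = []
    for _ in range(n_resamples):
        a = rng.choices(group_a, k=len(group_a))
        b = rng.choices(group_b, k=len(group_b))
        boot.append(statistics.mean(a) - statistics.mean(b))
    boot.sort()
    ci_low = boot[int(0.025 * n_resamples)]
    ci_high = boot[int(0.975 * n_resamples) - 1]

    # Permutation test: shuffle the group labels and recompute the difference.
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        a = pooled[:len(group_a)]
        b = pooled[len(group_a):]
        if abs(statistics.mean(a) - statistics.mean(b)) >= abs(observed):
            extreme += 1
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, (ci_low, ci_high), p_value


# Hypothetical proportions of one pathway in six samples per condition.
control = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16]
treated = [0.05, 0.06, 0.04, 0.07, 0.05, 0.06]
diff, ci, p = compare_proportions(control, treated)
print(diff, ci, p)
```

When this is repeated over many pathways, the p-values would still need a multiple-testing correction before interpretation.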
Hello from the wet lab side. I would appreciate some advice; I'm very new to bioinformatics.
I'm conducting some experiments in neurodegenerative models using a fairly large pooled shRNA library that I've been cloning for the past few weeks, and I'm unsure how to analyze the data and verify the library composition.
I looked into MAGeCK, which is what my supervisor suggested, although he's not very experienced in bioinformatics either. From the information I've gathered so far, it seems it can be used for shRNA libraries and not just CRISPR, but I'm unsure how this will work since the sequencing approach will be quite different: we will be sequencing barcodes/UMIs rather than the actual guide sequence.
The manuals and tutorials seem quite straightforward, but maybe it's more complicated in the shRNA context and I'll have to modify the run parameters somehow?
There seem to be a fair number of other programs that one can use for this sort of application. I was looking into edgeR and pinapl.py as well, but I'm unsure whether they're more suitable.
Would anyone be able to give me some guidance?
I’m on a cluster, and I want to copy some zipped fasta files to another folder on the cluster.
Whenever I try the cp command, the files get corrupted. What gives?
Does anyone have any advice? Is there a cp command specifically for gz files?
And yes, for the inevitable Captain Obvious: I have ensured the OG files are still intact.
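A quick way to pin down where the corruption happens is to compare checksums of the source and the copy, and to check that the copy still decompresses end to end. The following is a minimal sketch using only the Python standard library; the helper names (`md5sum`, `gzip_is_readable`, `copy_and_verify`) and the file paths are hypothetical.

```python
import gzip
import hashlib
import shutil
import zlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def gzip_is_readable(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream decompresses without error."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        return False


def copy_and_verify(src, dst_dir):
    """Copy src into dst_dir, then confirm the copy is identical and intact."""
    dst = Path(dst_dir) / Path(src).name
    shutil.copy2(src, dst)
    return md5sum(src) == md5sum(dst) and gzip_is_readable(dst)


# Hypothetical usage:
# ok = copy_and_verify("genomes/sample1.fasta.gz", "backup/")
```

If the checksums match but the file still looks corrupted, the problem is more likely in how the copy is being opened than in the copy itself.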
I'm learning bioinformatics in baby steps and I wanted to annotate some E. coli genomes. After a quick search, it seems that bakta is still being developed and maintained while prokka isn't, so I gave bakta a try. At the end of the annotation process, the terminal shows that AMRFinderPlus has failed and suggests updating it via a command. I did, and the same error popped up on the next run. From what I found on GitHub, it seems that whenever AMRFinderPlus updates, it breaks bakta. Since I installed bakta two days ago, it looks like it arrived broken out of the box. Now I somehow need to downgrade it inside my conda environment to make it work properly. My question is: is bakta any better than prokka at all? Prokka has not had any updates in years, but at least it seems to work, from what I've seen from my colleagues.
I have a wild-type TP53 and a variant. I have already aligned them using BLAST. But how do I annotate the mutation type? How can I find the mutation hotspots? I have tried to use Ensembl VEP and other tools, but I can't seem to get it. Please help me 🙏
Hi everyone,
I'm wrapping up my PhD work in a lab that does small molecule drug discovery. I have become the go-to compbio/bioinformatics person (and I love it!) but I am mostly self-trained. I have pretty good experience with R, some Python.
As a "parting gift" (and maybe as a good demo of my skills for employers...) I would like to turn one of our SAR databases into something more interactive and memory-friendly. It is currently one of those massive, PC-freezing excel spreadsheets. The data is compound name, compound structure (ChemDraw object pasted in, sometime as image -_-), then different columns with activities in different assays.
Does anyone have a link to a friendly tutorial or github for a project like this? I am open to using R, python, SQL, or any other language. It seems simple but the chemical structure column is where I'm caught up. Also while I'm familiar with creating and working with databases in R, I have no experience turning them into something user-friendly.
I have tried searching both the subreddits and Google, I have mostly just found results for making databases in excel. It would be okay if the end product was in excel, but what I'm really picturing is something where you could just type the compound name, pull up the isolated data and structure, and easily add to it as well.
I really appreciate any advice or resources you could give me!
Lately, I have been solving algorithm problems on Rosalind, which has helped me improve my problem-solving and coding skills immensely. Now I am looking for a statistics/data analysis equivalent. Does anyone have any recommendations?
Hello everyone,
So, a little bit of information: I've been put on a project about organ grafts in sheep and the use of treatments to help keep the graft from being rejected. The project was designed around a Control vs Treatment 1 vs Treatment 2 methodology. Biopsies were taken and analysed with DIA mass spectrometry.
Now, I have access to a bunch of files, the most interesting ones being a DataMatrix with every protein's expression for each sample and a Candidate file with a 1-vs-1 comparison for each protein, giving us regulation information between groups.
Based on that, I have been given two tasks: first, to do a statistical analysis of the data (the software linked to the MS only does 1-vs-1 comparisons and not the multiple comparisons that were needed here), and second, to find the human homologs of those sheep proteins.
The first part is done (though they don't like the result, which... well, I followed protocols, so what else do you want from me?), but the second part is harder because I have 7000 different protein IDs and absolutely no idea how to do it.
So far the only thing that worked required me to enter each of the 7000 proteins one by one in the database and manually export each result, which would be absolute hell.
I've been told that it could be possible using STRING and FASTA sequences, which I do not have. Also, the STRING website doesn't handle 7000 proteins at once, and I was redirected to Cytoscape with stringApp.
Now, I'm an absolute noob when it comes to all of this. I've tried the Cytoscape tutorial and manual, but I can't seem to make it work the way it does in the tutorial.
Is it even possible to do this with a DIA MS data file, or do I need something more? And can you actually find homologous proteins between two different species?
Thanks a lot people
Hi everyone. I'm a beginner in bioinformatics and I'm working on the biodiversity of zooplankton using metatranscriptomics. I have 14 samples of zooplankton communities and had these sequenced using Illumina. Post sequencing, I'm working towards assigning taxonomic identification.
Problem: I ran a BUSCO analysis after assembly and got really bad results for completeness. More than 90% of the BUSCOs are missing and very few are complete. These are the post-sequencing processing steps I have done so far:
QC- adapter trimming and filtering out of low-quality bases using Cutadapt.
Normalization- sampled 1,300,000 sequences from the paired-end reads after QC using seqtk.
Assembly- I assembled the paired-end reads using MIRA Sequence Assembler.
Results Sample 1:
Coverage assessment (calculated from contigs >= 1000 with coverage >= 12):
Avg. total coverage: 19.04
Solexa: 19.61
All contigs:
Length assessment:
Number of contigs: 104995
Total consensus: 11770051
Largest contig: 2732
N50 contig size: 121
N90 contig size: 45
N95 contig size: 37
Coverage assessment:
Max coverage (total): 256
Solexa: 256
Quality assessment:
Average consensus quality: 67
Consensus bases with IUPAC: 0 (excellent)
Strong unresolved repeat positions (SRMc): 4 (you might want to check these)
Weak unresolved repeat positions (WRMc): 44 (you might want to check these)
Sequencing Type Mismatch Unsolved (STMU): 0 (excellent)
Contigs having only reads wo qual: 0 (excellent)
Contigs with reads wo qual values: 0 (excellent)
How should I approach this problem?
-use another assembler?
-test completeness using a diff. software?
-is there something wrong with my assembly from MIRA?
Hope you can help me. Really want to graduate this semester.
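For readers comparing assemblers on the statistics above, it can help to recompute N50 and N90 directly from the contig lengths. The sketch below is a minimal illustration using one common definition (the length of the contig at which the running total, from longest to shortest, first reaches the given fraction of all assembled bases); the function name `nx` and the toy lengths are hypothetical. Read this way, an N50 of 121 means half of the assembled bases sit in contigs of 121 bp or longer, which points to a highly fragmented assembly.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx statistic (N50 by default) for a list of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # First contig at which the cumulative length reaches the threshold.
        if running >= threshold:
            return length
    return 0  # empty input


# Toy contig lengths, not taken from the assembly above.
contigs = [10, 8, 6, 4, 2]
print(nx(contigs))        # N50
print(nx(contigs, 0.9))   # N90
```

The same function can be run on the contig lengths from two assemblers to compare them on equal terms.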
First of all, I am an absolute beginner and have no idea what tools I should use. My teacher gave me a problem: mutation analysis of the TP53 gene, where I should compare a wild-type sequence with some mutated version of the gene. I chose R175H. So I downloaded both sequences and tried to analyze and compare the two using BLAST and ClustalW, but I don't understand how to do that at all. I have watched videos and even discussed it with my teacher, but I can't understand anything. Can anyone please help me?
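Once the wild-type and variant protein sequences are aligned, listing the differences in the usual notation (for example R175H, meaning arginine at position 175 replaced by histidine) is a short loop. The sketch below is a minimal illustration that assumes the two sequences are already aligned to the same length with `-` for gaps; the function name `list_substitutions` and the toy sequences are hypothetical and are not real TP53 sequence.

```python
def list_substitutions(wild_type, variant):
    """List amino-acid substitutions between two aligned protein sequences.

    Positions are 1-based and counted on the wild type, skipping gaps.
    Insertions and deletions are ignored in this sketch.
    """
    if len(wild_type) != len(variant):
        raise ValueError("sequences must be aligned to the same length")
    changes = []
    position = 0
    for ref, alt in zip(wild_type, variant):
        if ref != "-":
            position += 1
        if ref != alt and ref != "-" and alt != "-":
            changes.append(f"{ref}{position}{alt}")
    return changes


# Toy aligned sequences for illustration only.
print(list_substitutions("MKTAYR", "MKTAYH"))  # ['R6H']
```

A single substitution that swaps one amino acid for another is a missense mutation; whether a position counts as a hotspot has to come from a mutation database rather than from the alignment itself.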
I started a new position and they gave me the task of interpreting some epigenomic-related results. Now, my prior roles have generally been more wet lab-focused, so bioinformatic analyses fall out of my expertise area and I would appreciate some advice.
More concretely, the study used the Illumina Infinium MethylationEPIC BeadChip, which gave them information on 800,000 CpG positions and their methylation state. With this, they obtained a series of differentially methylated positions (DMPs) when comparing two different pathological conditions with a control group.
My PI is interested in the methylation state of the miRNA regions. The bioinformatician conducted two different analyses in this direction, covering the miRNA sequence +/- 1kb and +/- 20kb (two separate analyses with different range widths).
I have been reading some literature on the subject, and I wanted to know if the approach (taking the range +/- 1kb and +/- 20kb) makes any biological sense. I would think that analysing the epigenetic modifications in the promoters of the genes that encode these miRNAs would make more sense, but again, I'm not entirely sure that can be done.
I am a first-year Chemistry PhD student who plans on looking for a small molecule immune checkpoint inhibitor, immune potentiator, or immunomodulator for the treatment of cancer (or other conditions). Before I start running syntheses, assays, etc., I wanted to perform a thorough computational screening using docking, molecular dynamics, etc. Is there some way we could computationally test for off-targets? Are there any data sets already created? Maybe also looking at how the drug is potentially metabolized and excreted by the liver and kidneys.
I would also appreciate any good reading materials for people doing projects of this type.
Hi everyone,
Any ideas where I might find resources or a tutorial on how to prepare for a bacterial GWAS? I'm interested in figuring out how many control organisms I need to compare with. PowerBacGWAS looks promising, but I'm not sure if it covers all bacteria.
Thanks!
Hi everyone, I’m a GROMACS beginner.
I want to perform some MD simulations of a protein that has been resolved by NMR spectroscopy (thus it has multiple structure models). Can someone kindly explain to me how to correctly prepare the NMR PDB before running the topology?
Any advice would be welcome!
Thanks in advance !
I have prepared a receptor-ligand complex for molecular dynamics simulations using Maestro on my local machine. However, since I am working on MacOS, I'm unable to run the simulation. Are there any web-based tools that would allow me to conduct molecular dynamics simulations? Or is the easiest solution to install a virtual machine? Additionally, do you know of any guides that could assist me in this process?
We use a few different commercial vendors for WGS sequencing. Recently, as they seem to have upgraded to the NovaSeq platforms, they have offered a significant price drop for the same number of reads per sample. However, I have noticed a drastic increase in the number of optical duplicate read pairs from these platforms and wonder if anyone else has experienced something similar. These are pretty standard orders, where we ship genomic DNA and they take care of library preparation and sequencing. In terms of quantification, I compared two cohorts of a few dozen samples each, one from 2021 and one from the past year. The percentage of reads determined to be optical duplicates for the two was 1.7% vs 48.8%.
Hi. This is basically an SOS call.
I have been trying to make my data public on ENA but despite checking all the boxes, no files are public. My submission deadline is running out. I don't expect a timely response from ENA support, and that's why I chose to post here.
I am sharing the screenshots here.
If you have had a similar experience, I would appreciate your help.
As I search through structures in PDB I'm seeing a few come across with flop in its title. What does flop mean?
Here's an example of one - RCSB PDB - 6FQG: GluA2(flop) G724C ligand binding core dimer bound to L-Glutamate (Form A) at 2.34 Angstrom resolution
Any info. helps
Thanks
Hello All,
I have a metagenomic dataset made up of Illumina short reads. I want to know how often this protein is encoded across individual samples within the metagenomic dataset to compare them later. i.e., Does sample A encode for this protein more than sample B? What tools could I use and how would I be able to find this information?
I'm currently looking into maybe using BLAST, where the metagenomics would be a custom database and the protein FASTA would be my query. However, I'm a noob at BLAST and am not sure if this will give me what I want.
Any insight you can provide is appreciated.
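One way to make per-sample comparisons from a search like the one described is to count the distinct reads that match the protein and scale by sequencing depth. The sketch below is a minimal illustration under stated assumptions: each sample was searched separately with the protein as query and the reads as the database, the results are in BLAST tabular format (outfmt 6, where column 2 is the subject ID, column 3 the percent identity, and column 11 the e-value), and the total read count per sample is known. The function name `hits_per_million` and the thresholds are hypothetical placeholders, not recommended cutoffs.

```python
def hits_per_million(blast_lines, total_reads, min_identity=50.0, max_evalue=1e-5):
    """Count distinct matching reads in outfmt 6 lines, per million reads."""
    matched_reads = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        read_id = fields[1]          # subject ID (the read, in this setup)
        identity = float(fields[2])  # percent identity
        evalue = float(fields[10])   # e-value
        if identity >= min_identity and evalue <= max_evalue:
            matched_reads.add(read_id)
    return 1e6 * len(matched_reads) / total_reads


# Hypothetical usage, one results file per sample:
# with open("sampleA_vs_protein.tsv") as handle:
#     print(hits_per_million(handle, total_reads=25_000_000))
```

Dividing by the total read count keeps a deeply sequenced sample from looking enriched simply because it has more reads.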
Hi r/bioinformatics!
I'm working on filtering a large protein dataset for sequence similarity and looking for advice on the most efficient approach.
**Dataset:**
- ~330K protein sequences (1.75GB FASTA file)
I need to perform all-vs-all comparison (diamond told me 54.5B comparisons) to remove sequences with ≥25% sequence identity.
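As a sanity check on the scale of the problem, the number of unordered pairs among n sequences is n(n-1)/2. The short sketch below works this out for 330,000 sequences; the exact sequence count is not given in the post, so the figure is only an approximation of the number DIAMOND reported.

```python
def unordered_pairs(n):
    """Number of distinct pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


n_sequences = 330_000  # the post says "~330K"; the exact count is an assumption
print(f"{unordered_pairs(n_sequences):,}")  # 54,449,835,000
```

This is in line with the roughly 54.5 billion comparisons quoted above, and shows why the cost grows quadratically with the number of sequences.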
**Current Pipeline:**
**Issues:**
- DIAMOND is taking ~75s per block with auto thread detection on 4 vCPUs
- Total processing time unclear due to unknown number of blocks.
- Wondering if this two-step approach even makes sense
- BLAST is too slow
**Questions:**
I'm flexible with either Windows or Linux-based tools
**Available Environments:**
Local Windows PC:
- Intel i7 Raptor Lake (14 physical cores, 20 total)
- RTX 4060 (8GB VRAM)
- 32GB RAM
Linux Cloud Environment:
- LightningAI cluster
- Either L40S GPU or 4 vCPU Intel Xeon, unclear version but pretty powerful
- 15GB RAM limit
Thanks in advance for any insights!
My teacher assigned us a final project to develop a bioinformatics pipeline using Python or R. It can be any kind of pipeline. While the task is simple, I have no idea what to do since I’m more familiar with working in structural biology.
At the moment, I’m considering a phylogeny project: something that integrates genome assembly, quality control, multiple sequence alignment, and tree construction. However, I’m struggling with how to get started. I would truly appreciate any insights, comments, or suggestions on this project! :)
Hi. I have single cell RNA seq data for which I have performed batch correction with harmony, mutual nearest neighbors. Can I use the batch corrected data for differential expression analysis?
Hey, I am an engineering student from India. I am currently working on a college project where I want to train an ML model to predict fuel production based on a bioreactor's physical and chemical properties. I have read several research papers on this topic and am currently struggling to find a source for the data. It would be very helpful if someone could guide me or point me to some resources. Thanks for reading.
Hello!
I'm trying to recover a specific family of genes in my study species (olfactory receptors). I've blasted my reference genome using receptor sequences that were recovered in a similar species and available on genbank (output, format 6, below). I'd like to use the coordinates to pull out homologs in my samples (whole genome sequencing) and compare diversity of these regions to the rest of the genome.
What I'm having trouble understanding is why the regions are not contiguous in my search results - does this just have to do with poor matching/sequence evolution? Is there a better tool I should be using, or downstream analyses to help me recover complete homologs?
Thank you so much in advance. I'm teaching myself on the fly and it is slow going...
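Fragmented hits along the same scaffold can be grouped into candidate loci by merging hit intervals that lie close together. The sketch below is a minimal illustration that assumes BLAST tabular output (outfmt 6, where column 2 is the subject ID and columns 9 and 10 are subject start and end, with start greater than end for minus-strand hits); the function name `merge_hits` and the `max_gap` value are hypothetical, and the sketch ignores strand and which query produced each hit.

```python
from collections import defaultdict


def merge_hits(blast_lines, max_gap=1000):
    """Merge outfmt 6 hits on each subject that are within max_gap bases."""
    by_subject = defaultdict(list)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        subject = fields[1]
        # Minus-strand hits are reported with sstart > send; normalise them.
        start, end = sorted((int(fields[8]), int(fields[9])))
        by_subject[subject].append((start, end))

    merged = {}
    for subject, intervals in by_subject.items():
        intervals.sort()
        regions = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start - regions[-1][1] <= max_gap:
                # Close enough to the previous region: extend it.
                regions[-1][1] = max(regions[-1][1], end)
            else:
                regions.append([start, end])
        merged[subject] = [tuple(region) for region in regions]
    return merged


# Hypothetical usage:
# with open("receptors_vs_genome.tsv") as handle:
#     for scaffold, regions in merge_hits(handle).items():
#         print(scaffold, regions)
```

The merged coordinates can then be used to extract each candidate region with some flanking sequence for closer inspection.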
Hi,
I'm new to genomics and I'm wondering what I should do from here.
I've assembled some bacterial organisms and I ran prokka on them. I now have fasta files and predicted genome annotation files.
My question is: what are common things to do from here to investigate these files? I want to do metabolic reconstruction and also transposable element analysis. A lot of these organisms have unique plasmids, so I'd like to investigate those too. Are there good tools for any of these things?
Hi, so I’m trying to install FoldX into YASARA, and I have tried the method shown in the FoldX manual and the YASARA manual. But for some reason, under Analyze, I don’t get the clickable FoldX option. Am I doing something wrong? By the way, I have a MacBook Air M2.
For me it was today when I found out about the PyMOL plugin PyMod.
✅ Beautiful UI ✅ Integration of a lot of tools I use (PSI-BLAST, Clustal Omega, HMMER, MUSCLE, CAMPO, PSIPRED, and MODELLER) ✅ Open source
I’m very new to this, but we have scATAC and scRNA data, and we are looking to see whether there is acetylation or methylation of certain histones under certain conditions, mainly H3K27ac and H3K4me1. If there are changes, that would indicate trained immunity.
When I look into how to do the analysis, it says we need ChIP-seq data. But my postdoc says it can be done with scATAC as well, as seen in the publications below:
https://pubmed.ncbi.nlm.nih.gov/25258085/
https://www.sciencedirect.com/science/article/pii/S0092867422003932
https://www.sciencedirect.com/science/article/pii/S0092867417315118
I’d appreciate any help! I’m not sure how to do this at all.