/r/bioinformatics

A subreddit to discuss the intersection of computers and biology.


A subreddit dedicated to bioinformatics, computational genomics and systems biology.

Information
  • If you have a specific bioinformatics related question, there is also the question and answer site BioStar and the next generation sequencing community SEQanswers
  • If you want to read more about genetics or personalized medicine, please visit /r/genomics
  • Information about curated, biological-relevant databases can be found in /r/BioDatasets
  • Multicore, cluster, and cloud computing news, articles and tools can be found over at /r/HPC.

107,542 Subscribers

0

How to do a phylogenetic tree please help me

My professor wants me to make a circular phylogenetic tree on the theme of the tree of life. It needs to show 97 groups at once and express the evolutionary relationships between them. The specific diagram is similar to the following. Please help me: I do not have any relevant experience. Can someone teach me how to do this, or share some ideas?

0 Comments
2024/04/29
14:18 UTC

1

Nuclear Genes Dominate Cluster in Single-Cell Data

Hi,

A specific cluster in my data has a high amount of nucleus-specific genes compared to the rest of the cells. How should I interpret this cluster? Is it a technical artifact or some kind of contamination?

2 Comments
2024/04/29
12:37 UTC

4

Recommendations for papers on applications of RNA secondary structure prediction

Does anyone care to recommend some interesting papers you found and read that use prediction of RNA secondary structure (RNAfold, etc.) as part of their methods? I'm particularly interested in how RNA secondary structure affects the behavior of viral RdRps and thus viral evolution, but I know that's kinda niche, so anything you've found interesting would be cool.

It's also fine if it's on the techniques of RNA secondary structure prediction (so more bioinformatics and less application). Even surveys or reviews are fine.

Thanks!

0 Comments
2024/04/29
11:40 UTC

2

How to convert the SNP matrix (.vcf) to an Excel file like below:

I am sorry if this is the wrong subreddit to ask for help.
I am new to bioinformatics, and I am unable to figure out how to convert my filtered SNP file to an Excel file like the one in the image below.
I used this command: bcftools query -f '%ID\t%CHROM\t%POS\t[%GT\t]\n' Bi.dp10.gq20.miss1.recode.vcf > samples.tsv
but instead of the alleles, the codes 0,1 or 1,1 get printed.
How do I determine the alleles from the genotype calls (GT)?

I am not formally trained in bioinformatics; I am learning it by myself, using YouTube tutorials, research papers and Stack Overflow. I use Ubuntu on Windows.
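
For context, each number in a GT call is an index into that row's allele list: 0 is REF, 1 is the first ALT allele, 2 the second, and so on. `bcftools query` also has a translated-genotype field, `%TGT`, that can be used in place of `%GT`. The same lookup can be sketched in Python as below. This is only a minimal sketch, assuming an uncompressed VCF with the standard tab-separated columns; the function names `gt_to_alleles` and `vcf_line_to_row` are made up for illustration.

```python
import re


def gt_to_alleles(gt, ref, alt):
    """Translate a GT string such as '0/1' into alleles such as 'A/G'."""
    alleles = [ref] + alt.split(",")      # index 0 = REF, 1.. = ALT alleles
    sep = "|" if "|" in gt else "/"       # keep the phased/unphased separator
    out = []
    for code in re.split(r"[/|]", gt):
        # '.' means a missing call and is passed through unchanged
        out.append("." if code == "." else alleles[int(code)])
    return sep.join(out)


def vcf_line_to_row(line):
    """Return [ID, CHROM, POS, alleles of sample 1, alleles of sample 2, ...]."""
    f = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, fmt = f[0], f[1], f[2], f[3], f[4], f[8]
    gt_idx = fmt.split(":").index("GT")   # position of GT within FORMAT
    row = [vid, chrom, pos]
    for sample in f[9:]:
        row.append(gt_to_alleles(sample.split(":")[gt_idx], ref, alt))
    return row


# Hypothetical data line with three samples (header lines starting with '#' are skipped in real use)
example = "chr1\t100\trs1\tA\tG,T\t50\tPASS\t.\tGT:DP\t0/1:12\t1/2:15\t./.:0"
print("\t".join(vcf_line_to_row(example)))
```

The resulting tab-separated rows can be written to a .tsv file and opened directly in Excel.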

https://preview.redd.it/6txxyzxdxdxc1.png?width=1907&format=png&auto=webp&s=0f166b96c5e6c3c18a03a00ad2857fea6e89c0d7

https://preview.redd.it/55mz7gv1wdxc1.png?width=868&format=png&auto=webp&s=c6457252ac971f314254f43fd3345562bc04190a

7 Comments
2024/04/29
09:12 UTC

0

Comparing peptide sequences with MS/MS peptide data using MaxQuant

I have a few peptide sequences which I want to validate using MS data. Does anyone know how I can do it? Thanks!

0 Comments
2024/04/29
06:40 UTC

1

MEGA11/TimeTree and Accurate Times

Good evening, I’m Dalis Magnus. I’m a Marine Biology student working on a paper for an Evolutionary Biology course. The focus of the paper is the validity of a species complex and its radiation across its range, using a mix of CO1 and ddRAD data.

I was hoping someone could explain or help me with how to get accurate times out of a TimeTree generated using MEGA11. I have divergence times for my two outgroups and the species complex as a whole. I can’t seem to get the program to let me specify those times.

How do I go about this? Or am I using this program incorrectly?

5 Comments
2024/04/28
23:09 UTC

5

Seeking Guidance: next steps in data Analysis for neoantigen identification from just vcf files

Hi, I'm currently working with VCF files (from WGS, with normal and tumor samples) from the ICGC database. We aim to identify immunogenic neoantigens (of protein or DNA nature) in cohorts of pancreatic cancer patients (specifically, those from Canada and Australia) using machine learning. Following the workflow outlined in a paper (PMID: 37816353), I have annotated the VCF files for each patient (SNVs and indels) using VEP and filtered them to include only variants that affect expressed protein-coding genes (although a variant may also affect several non-protein-coding transcripts).

Now, I'm stuck at the next steps. We can only use the VCF files as we don't have access to FASTA files and lack the memory capacity to work with the BAM files (which are around 20TB). According to the image I posted (PMID: 36698417), I need to:

  1. Perform HLA typing.
  2. Obtain TCR-seq data for TCR-pMHC prediction.
  3. Generate 11-mers of the variant amino acids/nucleotides, discarding those that match the wild-type (WT) 11-mer.

For the first problem, I have two options. I can use bcftools consensus (chr6:28,510,120-33,480,577) to generate a FASTA sequence of the HLA region from the VCFs and then perform HLA typing. Alternatively, I can use pharmaCat to directly perform HLA typing. I'm leaning towards using pharmaCat, but I'm unsure if it will provide the necessary input for MHC-binding prediction. Additionally, if I opt for the first option, I'm not sure how to create the consensus using only the normal sample (I don't totally understand the bcftools instructions), and I haven't found a predictor that doesn't require paired reads.

For the second problem, I was considering using bcftools consensus, but I'm not sure which region of the genome this sequence corresponds to, unlike the HLA region which I've identified. I know that the alpha and beta chains are located on chromosomes 14 and 7, respectively, but I'm uncertain if this approach would work.

For the third problem, I've identified three options:

  • Using the ANNOVAR argument --coding_change.

  • Utilizing FastaAlternateReferenceMaker or bcftools consensus to convert the VCF file into a FASTA file for the gene, and then gffread to extract protein sequences from the FASTA + GTF files, followed by filtering and obtaining the mers.

  • The more direct approach: read the GTF and VCF simultaneously, and for each variant look up the overlapping transcripts. Then, for each transcript:
      + Compute the local reading frame (for translation)
      + Compute the new amino acid (if synonymous, stop)
      + Compute each 11-mer overlapping the position in the amino acid sequence

My preference is the third option, but I'm not very confident in my ability to write a script for this task. That said, this is currently where I'm putting most of my effort. I've searched for papers on immunogenicity prediction, but they don't really make clear how to get the mers.
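
The last step of the third option (collecting the 11-mers that overlap the changed position and dropping any that also occur in the wild type) could be sketched roughly as follows. This is a minimal sketch, assuming a single amino-acid substitution at a known 0-based position in an already translated protein. The function names are made up for illustration, and indels, frameshifts and stop-gains would need separate handling.

```python
def kmers_covering(seq, pos, k):
    """All k-mers of seq that include 0-based position pos."""
    first_start = max(0, pos - k + 1)     # leftmost window that still covers pos
    last_start = min(pos, len(seq) - k)   # rightmost window that fits in seq
    return [seq[s:s + k] for s in range(first_start, last_start + 1)]


def neo_kmers(wt_protein, pos, new_aa, k=11):
    """Mutant k-mers covering pos that do not occur anywhere in the wild type."""
    if wt_protein[pos] == new_aa:
        return []                         # synonymous change: nothing to report
    mutant = wt_protein[:pos] + new_aa + wt_protein[pos + 1:]
    wt_kmers = {wt_protein[i:i + k] for i in range(len(wt_protein) - k + 1)}
    return [m for m in kmers_covering(mutant, pos, k) if m not in wt_kmers]


# Toy protein (hypothetical), substitution M -> A at 0-based position 10
wild_type = "ACDEFGHIKLMNPQRSTVWY"
for peptide in neo_kmers(wild_type, 10, "A", k=11):
    print(peptide)
```

Near the ends of a protein fewer than k windows fit, so the function simply returns fewer peptides there.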

So, this post is essentially a request for guidance and opinions on how to approach my three main problems. I'm relatively new to the field of bioinformatics, coming from a biotechnological background, so please pardon my ignorance if I'm asking something obvious.

https://preview.redd.it/v4dqtl3njaxc1.png?width=799&format=png&auto=webp&s=34b73730020d656eb352ccb519ad42e3939adc91

Yellow: what I have

Red: what I want

5 Comments
2024/04/28
20:19 UTC

16

Would you recommend PacBio over nanopore for any reason?

As title. PacBio is popping up a lot in my Twitter ads (red flag tbh), and I heard they may get delisted(?).

Is there anyone out there who would recommend PacBio over Nanopore right now? Why?

75 Comments
2024/04/28
19:53 UTC

1

KBase Prokka Output Problem

Hey everybody!

I'm trying to conduct metagenomic analysis within the Department of Energy Systems Biology Knowledgebase, or KBase for short. I'm following this paper: https://www.nature.com/articles/s41596-022-00747-x

Unfortunately, I'm having a problem with functional annotation, the 7th stage of the protocol. At this stage, the user may choose to use RAST or Prokka, and we chose the latter due to its better performance.

Unfortunately, there is no Prokka app in KBase that takes an AssemblySet object as an input. The apps available take individual Assembly objects only. As such, I've been running the Prokka app in batch mode, but this outputs a series of AnnotatedMetagenomeAssembly objects rather than Genome objects. This is important because subsequent analysis is not possible if the objects are in this state.

Does anyone know how to solve this issue in KBase? Ideally, I'd want to do this without exporting the data, doing the necessary tweaks, and then reuploading to KBase.

0 Comments
2024/04/28
17:32 UTC

1

Calculate sequence divergence from 4-fold degenerate sites of a pairwise whole genome alignment (MAF)

I'm trying to calculate pairwise sequence divergence between 2 species in a pairwise whole genome alignment (MAF file). The genomes were aligned using LASTZ. I would like to extract 4-fold degenerate sites and then measure pairwise distance (ideally under Kimura 2-P or similar) between the whole alignment. A lot of the tools I see require everything to be on a single chromosome or won't work for files of this size. I'm hoping to find something that works with a MAF file, but if I have to convert to FASTA or HAL that's fine.

I've used degenotate package to extract 4D sites from a FASTA file of CDS alignments and then used 'distmat' from EMBOSS (https://www.bioinformatics.nl/cgi-bin/emboss/help/distmat) to calculate K2P divergence, but it outputs a distance matrix so I have to carefully format input files to be only 2 sequences so it doesn't take forever. I'm not sure how to format my MAF WGA to do the same. Galaxy takes too long, and RPHAST won't compile on my laptop (UNIX).
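
One way around the distance-matrix overhead is to compute the two-sequence Kimura 2-parameter distance directly, using d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions. The sketch below keeps running counts, so the 4D columns from each alignment block can be added one block at a time. It is a minimal sketch that assumes the 4D sites have already been extracted as pairs of aligned strings; the function names and the toy blocks are made up for illustration.

```python
import math

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def count_differences(seq1, seq2):
    """Return (sites, transitions, transversions) for two aligned sequences."""
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue                      # skip gaps, N and other ambiguity codes
        sites += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1              # A<->G or C<->T
        else:
            transversions += 1
    return sites, transitions, transversions


def k2p_from_counts(sites, transitions, transversions):
    """Kimura 2-parameter distance from pooled site counts."""
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    if 1 - 2 * q <= 0 or 1 - 2 * p - q <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Pool counts over several (hypothetical) blocks of 4D sites
blocks = [("ACGTACGTAC", "ACGTACGTAC"), ("AAAAAAAAAA", "GAAAAAAAAA")]
totals = [0, 0, 0]
for s1, s2 in blocks:
    for i, c in enumerate(count_differences(s1, s2)):
        totals[i] += c
print(k2p_from_counts(*totals))
```

Because only three integers are kept, memory use does not depend on the size of the alignment.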

1 Comment
2024/04/28
17:20 UTC

3

narrowPeak file generated using Genrich for differential accessibility analysis using DiffBind

Hi, I have generated .narrowPeak files using Genrich and normalised .bigwig files using bamCoverage from deepTools. I want to perform differential accessibility analysis, but I am unable to use DiffBind because dba says it cannot find peaks. Can I not use DiffBind with Genrich peak files?

This is my .narrowPeak file

I created a samples.csv file with the following columns: SampleID,Condition,CellType,Donor,PeakFile,BigWigFile

7 Comments
2024/04/28
16:45 UTC

13

Platform for teaching bioinformatics

Hello fellow bioinformaticians, I ended up in a situation where I have to teach some students (8 in total) some very basic bioinformatics/genomics. I initially thought that the Galaxy training materials would be very cool, but usegalaxy is too slow to properly do hands-on work within the short time frame of each lesson. I was thinking of putting up a JupyterHub instance on my lab's server, but it might be a hassle to set up. Do you have suggestions or tips on platforms for hands-on sessions?

7 Comments
2024/04/28
11:25 UTC

51

What are the odds of transitioning into Bioinformatics in mid 30s?

So I made a similar post a while back, asking about the books to learn binf for a newbie.

I studied electrical engineering but it wasn't my thing. Never had much self awareness and being brought up by a single parent who was not educated, there was not much guidance or nudge in the right direction. So, I worked in e-commerce data management and UX related job for 8 yrs.

I never knew what really interested me, to learn it as a skill for a job, especially STEM related. I'm not talking about passion. A job is just a job. But even to do something for work, you need a little bit of interest and inquisitiveness just enough to do it day after day.

But in my late 20s I picked up the habit of reading, mostly non-fiction and also science-related books: Why We Sleep, books by David Eagleman, Siddhartha Mukherjee and a few others. It was the books by Siddhartha that piqued my interest in genetics, after reading The Gene and The Emperor of All Maladies. I started to realise that I love life science, especially neuroscience and genetics.

And since then I've been toying with the idea of doing binf. I even applied to a binf programme as my third choice in my master's applications in Sweden for fall 2024. But I happened to get into my second choice, which was information systems (waitlisted for my 1st choice, DA). I originally had binf as my second choice, but at the last moment I switched it to third. The reason was that I saw many binf grads struggling to secure a job even with deep biology knowledge. So I wasn't confident, and the investment was a lot for a 2-year course as opposed to a 1-year one, so I let fate decide.

I have also applied to Georgia Tech's online master's in analytics. If I get in, I might do both master's degrees simultaneously.

But what are some ways I could get into binf with this profile? Or should I consider doing a master's in binf? Should I even try, or just drop the idea of transitioning and work as a DA/DS in tech?

I have SQL knowledge, and I have done R and Python certification courses by Google as well as Jose Portilla's Udemy course.

47 Comments
2024/04/28
10:49 UTC

1

Coding or non-coding strand?

So I’m currently working on a DNA sequence analyzer in Python that takes user input (a DNA sequence) that can then be used for a variety of other functions.

For my transcription function, does the directionality (5’-3’ vs. 3’-5’) of the DNA strand matter for creating the appropriate transcript? The online tools I've seen do not seem to distinguish between the directionalities, and I’m unsure whether the function should flip all letters or just replace T’s with U’s. Do those tools just treat all sequences as the coding strand, or as the non-coding strand?
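
To make the two cases concrete: if the input is the coding strand written 5’-3’, the transcript is the same sequence with T replaced by U, and if the input is the template (non-coding) strand written 5’-3’, the transcript is its reverse complement with U instead of T. The sketch below shows both. It assumes the input is always written 5’-3’ and contains only A, C, G and T; the `strand` parameter is a made-up name for illustration.

```python
# Complement table that also converts to RNA (A pairs with U in the transcript)
DNA_TO_RNA_COMPLEMENT = str.maketrans("ACGT", "UGCA")


def transcribe(dna, strand="coding"):
    """Return the mRNA (5'->3') for a DNA sequence written 5'->3'.

    strand="coding":   the input has the same sequence as the mRNA, so T -> U.
    strand="template": the input is the strand the polymerase reads,
                       so the mRNA is its reverse complement.
    """
    dna = dna.upper()
    if strand == "coding":
        return dna.replace("T", "U")
    if strand == "template":
        return dna.translate(DNA_TO_RNA_COMPLEMENT)[::-1]
    raise ValueError("strand must be 'coding' or 'template'")


print(transcribe("ATGGCC", strand="coding"))
print(transcribe("GGCCAT", strand="template"))
```

Both calls describe the same double-stranded molecule, so they give the same transcript. A template strand supplied 3’-5’ would need the complement without the reversal.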

10 Comments
2024/04/27
23:34 UTC

0

RNA-seq

Hi everyone, I have a question...
What would be the appropriate threshold for removing genes with very low or null counts in RNA-seq data analysis?

Thanks.
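
There is no single correct threshold. A common rule of thumb is to keep genes with at least about 10 raw counts in at least as many samples as the smallest experimental group, and packages such as edgeR (`filterByExpr`) and DESeq2 (independent filtering in `results`) provide their own filtering as well. A minimal sketch of the rule of thumb, where the default values and the toy count table are assumptions for illustration only:

```python
def filter_low_counts(counts, min_count=10, min_samples=3):
    """Keep genes with >= min_count reads in at least min_samples samples.

    counts: dict mapping gene name -> list of raw counts, one per sample.
    """
    kept = {}
    for gene, values in counts.items():
        n_expressed = sum(1 for c in values if c >= min_count)
        if n_expressed >= min_samples:
            kept[gene] = values
    return kept


# Hypothetical raw counts for 6 samples (two groups of 3)
raw = {
    "geneA": [0, 0, 1, 0, 2, 0],      # essentially unexpressed
    "geneB": [15, 22, 9, 30, 11, 0],  # expressed in most samples
    "geneC": [10, 10, 10, 0, 0, 0],   # expressed in one group only
}
print(sorted(filter_low_counts(raw)))
```

Setting `min_samples` to the size of the smallest group keeps genes such as geneC that are only expressed in one condition.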

4 Comments
2024/04/27
20:17 UTC

4

scRNA-seq data in GEO with .csv files

Hello! I am new to scRNA-seq analysis and Seurat, and I am sorry if this is a stupid question. I have gone through some tutorials online and from what I understand, the output of Cell Ranger is either .mtx or .h5ad. I have been asked by my boss to analyze some publicly available single-cell data on GEO (GSE218000), and the files present in the RAW.tar are all .csv files. How can I process these .csv files to make a Seurat object?

Also, the data includes single-cell data from 12 patients (4 control, 4 with disease 1 and 4 with disease 2). What would be the best way of integrating the data? Should I combine all 12 .csv files (row-wise) before making the Seurat object, or is there another way? I would like to keep track of the biological condition (control or disease), obviously.

Thank you! I appreciate any help in this matter :)

6 Comments
2024/04/27
17:05 UTC

13

Is there an online Masters Bioinformatics program (or computational biology) with Python?

As the title says, I am thinking of applying for an MS program in bioinformatics and am currently self-studying Python. I know that R is also necessary, but I am wondering if there is any online MS in bioinformatics with a Python focus?

Any ideas, comments or advice are welcome!

6 Comments
2024/04/27
16:14 UTC

0

An easy way to know which nucleotide the donor resides in PLIP?

So I am trying to analyze this PLIP data:

https://preview.redd.it/6zbxkjfx61xc1.png?width=1001&format=png&auto=webp&s=caef3b758a7aa465da4f2a6f42fec948095b7625

The receptor molecule is an enzyme but the ligand is a long nucleotide chain. It is straightforward to know which amino acid is interacting with the nucleotide, but it is not as straightforward to know which nucleotide is actually interacting with the enzyme. PLIP only gives the atom number, as you can see under "Acceptor Atom".

Does anyone know how to do this?

I am thinking of just opening the nucleotide chain in Chimera and entering "select @ 3478" in the command line, but that does not seem to work. If I highlight the atom, I can hover over it and see which nucleotide it is attached to.
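
Outside the viewer, the atom number can also be looked up in the structure file itself. The sketch below is a minimal example that assumes the number PLIP reports is the atom serial number in the PDB file that was analysed, and that the file follows the fixed-column PDB format (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26). The function name and the example line are made up for illustration.

```python
def residue_for_atom(pdb_lines, serial):
    """Return atom name, residue name, chain and residue number for an atom serial."""
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue                          # ignore header, TER, CONECT, etc.
        if int(line[6:11]) == serial:         # columns 7-11: atom serial number
            return {
                "atom": line[12:16].strip(),     # columns 13-16: atom name
                "resname": line[17:20].strip(),  # columns 18-20: residue name
                "chain": line[21],               # column 22: chain identifier
                "resseq": int(line[22:26]),      # columns 23-26: residue number
            }
    return None                               # serial not found


# Hypothetical ATOM record for a guanine O6 atom in chain B, residue 12
example_lines = [
    "HEADER    EXAMPLE",
    "ATOM   3478  O6   DG B  12",
]
print(residue_for_atom(example_lines, 3478))
```

For a real file, `open("complex.pdb")` (hypothetical filename) can be passed in place of the list, since the function only iterates over lines.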

0 Comments
2024/04/27
14:26 UTC

1

Trouble finding information about barcodes in PRO-IP-seq protocol

I'm trying to reproduce the pipeline for PRO-IP-seq analysis (https://github.com/Vihervaara/PRO-IP-seq/blob/main/aligning_PRO-IP-seq), but I can't seem to find information about the barcodes used in the article (https://www.nature.com/articles/s41467-023-42715-3). Am I misunderstanding the workflow, or is there just no information about the barcodes used?

5 Comments
2024/04/27
12:56 UTC

11

Transition Advice: From Industrial PhD to Data Science Career

Hi everyone,

I'm a third-year industrial PhD student from Europe, currently working on computational methods in Single-Cell transcriptomics. My project primarily focuses on method development rather than on biological questions. Over the past three years, I've completed two main sub-projects, a pipeline and an R-package, and have been able to publish in a Q1 journal.

A bit about my situation: The company I'm working with typically hires native speakers and barely manages to stay afloat. I suspect my hiring was mainly to secure fellowship funding. Furthermore, my academic PI shows little interest in my project, leaving a lot of uncertainty about when I will defend.

I am contemplating a significant shift towards a data science or development role within the industry (preferably in the bioinformatics/health domain), as the fellowship will expire in the coming six months, and the company doesn't have enough resources to hire me. Based on the history of students in the PIs' lab, most finish their Ph.D. living on unemployed wages. There is some added pressure as I am an immigrant. Thus, I have decided to transition to support myself and put my thesis second on the priority list (I wonder if this is a good idea).

I enjoy developing tools and have a good grasp of statistical modelling, Shiny apps, Snakemake workflows, and containers. As far as I know, Python is more prevalent in industry roles. Thus, I am also considering refreshing my Python skills. The company I'm part of uses a development stack that includes Java, a language I assume is not widely recognized in modern bioinformatics. Thus, I am also considering picking up some web development with JS, as Shiny apps are quite a niche.

In summary, I'm at a crossroads regarding whether to learn new skills or continue focusing solely on the PhD project. I am eager to hear from anyone who has navigated a similar transition. What skills should I prioritize? Is it worthwhile to focus on specific libraries or a broader understanding of data structures in Python? How can I effectively leverage my current knowledge and new skills to secure a job in the industry? Also, I understand that the goal of learning substantial web dev in 6 months is quite unrealistic, but it would be worthwhile to explore something that can complement my skills.

Any advice on manoeuvring through these challenges and securing a job in the industry would be constructive.

0 Comments
2024/04/27
10:18 UTC

2

Integrating two scRNA-seq datasets with overlapping samples

I have two public scRNA-seq datasets from two sequencing platforms, and two of the samples were sequenced on both platforms. Besides the standard Seurat integration, is there any way to take advantage of those two samples to normalize the two datasets?

7 Comments
2024/04/27
09:45 UTC

6

How to find and read benchmark papers

Hello everyone!

I've heard people talk about benchmark papers a lot, and while I do get what they do, no one has taken the time to explain the details:

How do I go about finding benchmark papers? Do they self-declare that they completed a benchmarking or is there a webpage you would recommend that collects them by sequencing method?

And how would you go about reading them? Like your usual papers or do you have any other helpful recommendations?

Thanks a lot!

3 Comments
2024/04/27
09:40 UTC

2

Book/article recommendations - history and technical aspects of sequencing tech

Hello everyone! I am writing my Bachelor thesis on benchmarking breast cancer scRNA-seq data, and in the process I got curious regarding the historical background of sequencing technologies. Now I'd like to read on it on my own, but looking through Google Scholar I couldn't find much.

Does anyone have any recommendations regarding books or articles that talk about the history of sequencing tech, and that go at least somewhat in depth on single cell technologies?

0 Comments
2024/04/27
08:01 UTC

1

MEGA11 - concatenate sequences in the correct order??

Hi!!

I am having trouble concatenating DNA sequences in MEGA11 in the correct order. The MEGA11 website states that it will order inputs alphabetically. I have tried labelling the four FASTA files both alphabetically and numerically; however, when I go to concatenate the data, the sequences are in a random order?!

Has anyone experienced this issue and got any advice??

Ta - sad honours student

4 Comments
2024/04/27
05:07 UTC

0

Single cell RNA

Can anyone tell me how to learn and understand scRNA-seq technology? Is there any way to get started?

I would really appreciate your help. Thanks!

7 Comments
2024/04/26
21:27 UTC

43

Feeling overwhelmed...am I not cut out for this or is this feeling reasonable?

I've started working at a university bioinformatics lab, having just graduated from a cs degree a few weeks ago.

My job is to work on a pipeline that was built by a team of postdocs (who are now gone). Anyways, I have limited domain knowledge and I'm just coming up to speed. I'm feeling overwhelmed with the HPC environment, and the 20+ tools involved in the pipeline with scripts written in several languages. Additionally it was left with known bugs and multiple sections of the pipeline not working...and when working it had a lot of manual intervention needed.

Unfortunately there is zero documentation. There are just sporadic esoteric comments in the code that make no sense.

Lots of the pipeline has stuff commented out and obviously changed during some debugging process but no documentation as to why, and no explanation on how to run it.

Anyways, I feel defeated and I've just started. The creators of this pipeline have, each, like a decade of experience in this field and I am here two weeks in trying to comprehend the pipeline while also learning the biology behind everything and trying to debug in multiple languages I am unfamiliar with, in a complicated HPC environment.

I feel like I should maybe give up on bioinformatics. I don't think I have the ability to get through this alone (and no, there is no one else working on this project, so there is nobody to help me...). I think they maybe overestimated the abilities of a CS grad...

28 Comments
2024/04/26
21:27 UTC

0

Cloud computing for NGS analyses

I'm running a bulk RNA-seq project at the minute but my Macbook is very much out of storage from one sample itself after trimming and I can't proceed to do alignment. I've been looking at cloud computing but I'm not sure which one is the best for pricing and also NGS analyses (I'm very new to cloud computing, so I don't have a clear idea on how to use it). If anyone could help giving me some possible choices of cloud computing options, I'd be very grateful. Thank you :)

Also, if there is any other way to do the pre-processing of RNA-seq, for example through R, please let me know. I'm more than willing to use R to do it

6 Comments
2024/04/26
20:33 UTC

2

Finding Reads Mapped to Mitochondria Chromosomes for Filtering

Sorry if this seems like a noob question. The NCBI reference genome https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_023375975.1/ has only chromosomes 1-25 and doesn't have an explicit MT chromosome like the Homo sapiens genome does. In a reference paper for my analysis, the authors were able to filter the reads mapped to the mitochondria, but I have no idea how they did this. If anyone can give any pointers, please let me know.

9 Comments
2024/04/26
20:12 UTC

0

How do I extract fastq files from analysed data?

Completely new to analyzing single cell data here. I am trying to teach myself to do analysis but it’s a long way to go. I have an analysed dataset with raw and filtered feature barcode matrix files. How would I be able to extract the original fastq files from that? From what I have read so far, BAM files are needed too, but they’re not provided. Any help or clues appreciated.

5 Comments
2024/04/26
19:05 UTC

22

Results of 4 months of job hunting in the UK

Long-time lurker here. I see there's plenty of discussion about the state of the job market, which I find helpful to read, so I thought I'd share my experience applying for jobs in the UK.

[This is "to be continued..." because I'm still searching.]

I had a decent amount of initial interest, and only a minority ghosted me, which is nice. But the main problem is just a lack of relevant jobs to apply for - right now I'm finding maybe 1 per week, even in recent weeks looking in Europe or Asia.

If anyone is interested, the two offers I declined were at universities, because the wage was lower than what I'm making doing ad-hoc tutoring now. I declined the interview invitations because I realised I just didn't have the required skills and it felt pointless to even try, e.g. RNA-Seq, single-cell analyses.

Good luck to everyone else out there applying! [hope this doesn't get deleted because I just registered]

https://preview.redd.it/l5c8e3oydvwc1.png?width=2006&format=png&auto=webp&s=91fd70ca92fa302d4a99eea89bcf5563404d52b0

11 Comments
2024/04/26
19:00 UTC
