/r/bioinformatics

A subreddit to discuss the intersection of computers and biology.


A subreddit dedicated to bioinformatics, computational genomics and systems biology.

Information
  • If you have a specific bioinformatics related question, there is also the question and answer site BioStar and the next generation sequencing community SEQanswers
  • If you want to read more about genetics or personalized medicine, please visit /r/genomics
  • Information about curated, biological-relevant databases can be found in /r/BioDatasets
  • Multicore, cluster, and cloud computing news, articles and tools can be found over at /r/HPC.

111,500 Subscribers

0

Reaching out to potential PI as an international student

I am planning on applying to comp bio/bioinformatics programs for fall 2025 soon, and I was wondering whether it is advisable for an international student in the States to reach out to professors they would like to work with. I ask because I am an international student and a lot of the programs I looked into are rotational (for example, University of Michigan), and I remember reading on one of the universities' websites that it is better not to email if you are an international student. For background, I am a Canadian student.

If I do need to, what would be the best time to email professors? Any tips/suggestions would be greatly appreciated.

If anyone can shed some light on this, it would really help me. Any help would be greatly appreciated. Thank you!

0 Comments
2024/06/30
14:15 UTC

0

Immunology BSc to Bioinformatics MSc (UK)

Hi,

I did a dry lab dissertation for my BSc, evaluating conserved amino acids from different malarial parasite genomes, and really enjoyed it.

I have spent a few years out of education, but am now looking into an MSc in Bioinformatics.

Definitely something I'm interested in, but it appears the job market is saturated in certain areas.

My question: is there a high demand for bioinformatics, and does anyone have any sources to back that up?

Thank you :)

0 Comments
2024/06/30
12:44 UTC

1

ClueGO for a custom organism

Hello, I'm currently working with transcriptomics of Hydractinia symbiolongicarpus, a hydrozoan, and I built a WGCNA network which I've loaded into Cytoscape for further analysis. My problem is that when I try to make an ontology network using the GAF file or a custom file, as the authors suggest both in the original paper and in a "tutorial"-like window that shows up in the package when I choose the option to load a custom organism, the package doesn't recognize the files. I've tried messaging the authors for about three months and they haven't answered yet, so I thought I'd post here in case someone has had the same problem. Thank you for reading.

0 Comments
2024/06/30
04:49 UTC

0

Bacterial Sequence variation comparison

When comparing a region of a genome (20-30 genes), does the presence or absence of genes count as nucleotide variation, or is it not considered?

Example: my reference strain has these genes present. Both the case and control strains also have genes of that class, but a different class of the same genes. Is that considered sequence variation?

2 Comments
2024/06/29
19:49 UTC

2

Novice phylogeneticist on choosing outgroup species

Hello! I'm just starting out doing a project on cryptic diversity, trying to determine if my focal taxon (a moth) is actually two distinct species. I chose some species—the focal taxon, its species-group species (e.g. it's in the X. x species group so I included the X. x. species as well), another couple of species in its genus (selected based on sequence availability), and a representative species from the sister genus to my focal taxon's genus. I performed my analysis (details below), and it seems to very strongly suggest that the one species is actually two distinct monophyletic clades with substantial divergence—yay! Then I read that recent study again and realized that that author used different outgroups: they used different species within the genus. No problem, I changed up my approach to replace my three within-genus outgroups with the species-group species (as before) and the one within-genus outgroup species that paper used, and ran the analysis as before. Now my results are allll over the place—my focal species is now weirdly split up (is no longer even paraphyletic in some of my bootstraps), but more concerningly the outgroup genus is often not out anymore. The main part of my phylogeny is a weird mix of the different species in a much less organized way than before. And, sadly, my focal population is not a clearly-divergent clade from the rest of my focal species. I'm tempted to say "well screw it, clearly my other set of species was better!" but I have great incentive to do so (the other analysis provides evidence that my focal taxon is two species, which is the cool result I was hoping for), so I want to take a step back and ask "why do these analyses differ? How can I improve both of them and try (desperately) to approach the truth?"

So the question I bring to you all is: how do I know if I've chosen the right outgroups? How do I know if I can improve them? Why would swapping out some species for some other species (which presumably should do the same purpose, even if I did decrease the number of individuals and species) cause such ruination in my phylogenies? And the ever present question of did I do something stupid here—is there something I need to know but don't know that I don't know to do this properly?

Thank you in advance! (P.S. Please be gentle—this is my first time doing phylogenetics, and I'm doing everything I can to be meticulous while still embracing my enthusiasm for this project!)

Workflow:

- Access BOLD for fasta files, concatenate them using `cat`
- MAFFT to align sequences

- ModelTest-NG to get the right model (testing maximum likelihood JC+G, F81+G, HKY+G, K80+G, SYM+G, GTR+G). I'll be honest, I just used the default settings for what RAxML could handle, and I know that I should learn them, but right now I don't know what these mean beyond +G = gamma. AIC, AICc, and BIC all agreed that GTR+G is the best-fit model

- RAxML to interpret the trees using model GTR+G, `--all` setting enabled, and 500 bootstraps

- FigTree to visualize the trees (open the *mlTrees file), and go through each tree closing all monophyletic clades

5 Comments
2024/06/29
15:52 UTC

1

How do I analyze GC-MS data for metabolomics ? I have MS1 data. Any tools you can recommend. I want to integrate the same with metagenome as well

I have nanopore full length raw reads and metabolome raw data. I want to analyze and integrate those results. Please let me know if you know any tools.

Thank you..

1 Comment
2024/06/29
15:04 UTC

4

How do i integrate metabolic and microbiome data?

Hello! I'm performing data integration for a research project. I have the sample's gene expression data, the sample, and the source of the sample. I got all of this from GEO and did not collect anything myself. Next, I need to integrate metabolomic and microbiome data for my studies into the same dataset. I searched through the research paper that is the basis of my previous dataset and they used kraken2 for microbial identification. I will integrate the metagenomic data using kraken2 as mentioned. Is there a way i can link gene expression data with metabolomic data and then use that metabolomic data in my research? Please help!

I'm very very new to this field, this is my first project.

10 Comments
2024/06/29
08:45 UTC

3

Is it common to get "connection timed out" when accessing ftp.ensembl.org/pub/ to download whole FASTA files?

My command line is `wget -r -l /pep -A all.fa.gz https://ftp.ensembl.org/pub/current_fasta/`, and the connection times out again and again.
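One thing that may be worth checking in the command above: wget's `-l` option expects a numeric recursion depth, not a path. As an alternative to crawling the whole tree, here is a minimal Python sketch that lists one species' `pep/` directory and downloads only the `*.pep.all.fa.gz` file, with a timeout and retries. It assumes the `<species>/pep/` layout under `current_fasta/`; the helper names (`pep_links`, `fetch`, `download_peptides`) and the retry settings are hypothetical.

```python
import re
import time
import urllib.request

BASE = "https://ftp.ensembl.org/pub/current_fasta/"  # URL from the post


def pep_links(listing_html):
    """Return hrefs in a directory listing that end in .pep.all.fa.gz."""
    return re.findall(r'href="([^"]+\.pep\.all\.fa\.gz)"', listing_html)


def fetch(url, retries=5, timeout=60):
    """Download url, retrying with a growing pause after a timeout or error."""
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except OSError:  # URLError and socket timeouts are OSError subclasses
            if attempt == retries:
                raise
            time.sleep(10 * attempt)


def download_peptides(species):
    """Fetch the all-peptide FASTA for one species directory name."""
    pep_dir = f"{BASE}{species}/pep/"  # assumed directory layout
    listing = fetch(pep_dir).decode()
    for name in pep_links(listing):
        with open(name, "wb") as out:
            out.write(fetch(pep_dir + name))
```

Usage would look like `download_peptides("homo_sapiens")`, looped over the species of interest with a pause between requests so the server is not hit with many connections at once.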

13 Comments
2024/06/29
07:59 UTC

1

Non bonded atoms in PDB file

I generated a PDB file using AlphaFold2. However, it wouldn't open in AutoDock, which threw a Python stack trace. So I cleaned the file using ChimeraX: removed its water and ions, added hydrogens, and assigned charges. The cleaned PDB file now opens in AutoDock, but after adding polar hydrogens it shows a mergeNPHSGC warning about 1 hydrogen with no bonds. When I went to save, it showed an error that the protein has non-bonded atoms. Please help!
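One way to track down the offending atom is to look for hydrogens that have no heavy atom within bonding distance. The sketch below reads the fixed-width ATOM/HETATM columns of a PDB file and reports such hydrogens. The 1.3 angstrom cutoff and the function names are assumptions for illustration, not anything AutoDock itself uses.

```python
import math


def parse_atoms(pdb_text):
    """Yield (serial, name, element, (x, y, z)) for ATOM/HETATM records."""
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            name = line[12:16].strip()
            # Element is in columns 77-78; fall back to the first letter
            # of the atom name when that field is empty.
            element = line[76:78].strip().upper() or name.lstrip("0123456789")[:1]
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            yield int(line[6:11]), name, element, xyz


def lone_hydrogens(pdb_text, cutoff=1.3):
    """Return (serial, name) of hydrogens with no heavy atom within cutoff."""
    atoms = list(parse_atoms(pdb_text))
    heavy = [a for a in atoms if a[2] != "H"]
    lone = []
    for serial, name, element, xyz in atoms:
        if element != "H":
            continue
        if not any(math.dist(xyz, h[3]) <= cutoff for h in heavy):
            lone.append((serial, name))
    return lone
```

Calling `lone_hydrogens(open("cleaned.pdb").read())` (file name hypothetical) lists candidates; removing the stray hydrogen, or stripping and re-adding hydrogens, is a reasonable next thing to try.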

0 Comments
2024/06/29
06:04 UTC

4

How to convert a bam file to a FASTQ file?

I am trying to upload a BAM file to Illumina's BaseSpace so I can run it through DRAGEN Germline. But I can NOT get it to upload! It's the file size, 95 GB. I tried freeing up computing power on my laptop and connecting to the internet over Ethernet, but it still wouldn't budge. It's taking so long that I keep getting kicked out of BaseSpace and the upload gets cancelled. I want to try with a FASTQ file but I have no idea how to convert it. I tried looking up tools for it, but they're all raw coding tools. I stay in genetic variant analysis, coding is not my strong suit, and I feel like I'm reading gibberish. Does anyone have a beginner-friendly way to talk me through how to do this?
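For the conversion itself, the usual command-line route is samtools: `samtools collate` groups read pairs together and `samtools fastq` writes them out. The sketch below wraps that two-step pipe in Python so it can be run as one function call. It assumes samtools is installed and on the PATH and that the BAM holds paired-end reads; the function names and file names are hypothetical.

```python
import subprocess


def bam_to_fastq_cmds(bam, r1, r2, threads=4):
    """Build the samtools collate | samtools fastq commands as argument lists."""
    collate = ["samtools", "collate", "-u", "-O", "-@", str(threads), bam]
    fastq = ["samtools", "fastq", "-@", str(threads),
             "-1", r1, "-2", r2,
             "-0", "/dev/null", "-s", "/dev/null", "-n", "-"]
    return collate, fastq


def bam_to_fastq(bam, r1, r2, threads=4):
    """Run the two commands as a pipe; requires samtools on PATH."""
    collate, fastq = bam_to_fastq_cmds(bam, r1, r2, threads)
    p1 = subprocess.Popen(collate, stdout=subprocess.PIPE)
    p2 = subprocess.Popen(fastq, stdin=p1.stdout)
    p1.stdout.close()  # let collate see a closed pipe if fastq exits early
    if p2.wait() != 0 or p1.wait() != 0:
        raise RuntimeError("samtools failed; check the messages above")
```

A call such as `bam_to_fastq("sample.bam", "sample_R1.fastq.gz", "sample_R2.fastq.gz")` would produce the two FASTQ files. Note that gzipped FASTQ for the same data is not guaranteed to be much smaller than the BAM, so the upload may still be slow.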

8 Comments
2024/06/29
01:14 UTC

1

Need help converting a PDB file to a PDBQT file

Hello, I need assistance converting a PDB file to a PDBQT file. I have tried to get AutoDock on my macbook but it keeps shutting down. If someone can help me with the file conversion, I'd appreciate it.

Thank you.

2 Comments
2024/06/28
22:35 UTC

1

Feature highlight: Visual Fisher's Exact test in R2platform

0 Comments
2024/06/28
22:10 UTC

1

Rosetta fold all atom

Hi guys, I couldn't set up RFAA on my computer, which is why I am using the tamarindbio website to test some structures on RFAA. However, it limits the sequence length to 700 amino acids, so I cannot submit most of the structures that I want. I want to see if RFAA predicts 8FLW (an antibody-virus envelope complex) correctly, because AlphaFold3 cannot do it. Could someone please help me with this? I would really appreciate it if someone could run it and send me the result. Thank you in advance! Note: I am just doing it to learn!!!

0 Comments
2024/06/28
21:12 UTC

1

Issue with clusterProfiler simplify function

Hi all. I am having an issue with the clusterProfiler simplify function. I ran enrichGO of BP ontology and am now trying to simplify the data and it is taking so long to run. I have not let it complete the run and stop it after like 15 or so minutes. It runs fine if I do CC or MF ontology and will finish in less than a minute. Any ideas why BP is making it run so slow? I have been using this same code for a while and it has never taken this long to run. Thanks for any help!

0 Comments
2024/06/28
19:21 UTC

2

Am I overthinking this 16S analysis?

I am studying the faecal microbiome of dogs (16S rRNA sequencing). I was planning on using QIIME2 but stuck on which database to use (Silva, GreenGenes, Kraken2, etc).

I was going to look at each database and see how many genes have originated from canine studies. I was also thinking of assembling MAGS from canine metagenome studies, extracting the 16S gene to make a custom database more representative of the dog microbiome. Or should I just use GTDB?

Am I completely overthinking this? Should I just pick one of the databases available in QIIME and call it a day?

10 Comments
2024/06/28
19:09 UTC

19

Why can microbes be detected in scRNAseq data of tumor tissue?

The intratumoral microbiome has been verified by 16S rRNA and LPS detection across several tumors. Furthermore, some investigators have tried to characterize canonically pathogenic microbes in bulk RNA-seq data like TCGA and even in 10X scRNA-seq data. I can provide papers, including https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9556978/ and https://www.cell.com/cell-host-microbe/fulltext/S1931-3128(20)30663-6.

What confuses me is why we can find microbes in sequencing data generated with a polyA enrichment approach. These papers emphasize that they fully considered contaminants and detected microbes in data generated by independent labs. However, no one explains why we can find them. Could someone explain it to me?

26 Comments
2024/06/28
17:08 UTC

0

Comparison of RNA Seq and qPCR?

I did an RNASeq experiment with interferon treatment, however in the DESeq2 analysis afterwards I didn't see any differences in my untreated vs my interferon treated conditions, so I checked the RNA I had left over from preparing the libraries with qPCR for a known ISG - and I could confirm that I had an upregulation of ~40-fold of one ISG in qPCR. But I don't get any upregulation of the same ISG in the DESeq2 results - not sure how to check if my data is actually meaningful now or not.
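A quick sanity check is to put the two measurements on the same scale. Under the 2^-ddCt method, a roughly 40-fold induction is a log2 fold change of about 5.3, which can be compared directly with the log2 fold change computed from normalized counts for the same ISG. The counts below are made-up placeholders and the function names are hypothetical.

```python
import math


def log2_fold_change(treated, untreated, pseudocount=1.0):
    """log2 ratio of mean normalized counts, with a pseudocount to avoid log(0)."""
    mean_t = sum(treated) / len(treated)
    mean_u = sum(untreated) / len(untreated)
    return math.log2((mean_t + pseudocount) / (mean_u + pseudocount))


def qpcr_log2_fold_change(ddct):
    """Under the 2^-ddCt method, the log2 fold change is simply -ddCt."""
    return -ddct


# A ~40-fold qPCR induction expressed on the log2 scale.
expected = math.log2(40)

# Hypothetical normalized counts for one ISG (replace with your own values).
isg_treated = [4100.0, 3900.0, 4300.0]
isg_untreated = [100.0, 110.0, 95.0]
observed = log2_fold_change(isg_treated, isg_untreated)
print(f"qPCR log2FC ~ {expected:.2f}, RNA-seq log2FC ~ {observed:.2f}")
```

If the count-based value for a known ISG sits near zero while qPCR on the same RNA shows strong induction, it may be worth checking for swapped sample labels or a mismatch between the count matrix columns and the design table before suspecting the biology.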

9 Comments
2024/06/28
14:24 UTC

0

How to speed up GSEA (or a faster alternative)

I'm analysing a very large dataset of ranked gene lists (7500) by GSEA. This is of course very slow, and on top of that I have a lot of gene sets to analyse. The dataset isn't suitable for enrichment analysis (which would be much faster) because I am looking at very global changes in gene expression and the effect sizes are subtle, so I am looking for any ways I can speed up the GSEA. I'm using the GSEA function in clusterProfiler with the FGSEA algorithm.

My current approach is to just split the data over 128 cores on our HPC, but even that is taking hours. Are there any other effective ways I can speed up a GSEA? Will changing any of the parameters have a big impact on speed, or is there a way I can perhaps do a speedy, less accurate calculation and then re-run the significant hits to refine the P values?

8 Comments
2024/06/28
13:13 UTC

8

Databases for prokaryotic type strains

Hi guys. Is there any database similar to NCBI (in terms of reach) to search for genomes/assemblies or specific genes of prokaryotes?

3 Comments
2024/06/28
12:43 UTC

0

What does 11 and 1 mean in parasail.sw_trace_scan_16(x, y, 11, 1, parasail.blosum62)?

x and y are two sequences, and parasail.blosum62 is the score matrix. But what are 11 and 1?

Can parasail.sw_trace_scan_16 also handle non-equal sequence pairs?
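In parasail's alignment functions the two integers after the sequences are the gap open and gap extension penalties, so 11 and 1 here mean "opening a gap costs 11, each further gap position costs 1". Local alignment has no requirement that the sequences be the same length. The pure-Python sketch below is not parasail's code; it only illustrates how the two penalties enter a Smith-Waterman recurrence. It uses a simple match/mismatch score in place of BLOSUM62, and the convention that the first gap position costs only `gap_open` is an assumption here.

```python
def sw_affine(x, y, gap_open=11, gap_extend=1, match=5, mismatch=-4):
    """Smith-Waterman local alignment score with affine gap penalties."""
    neg = float("-inf")
    rows, cols = len(x) + 1, len(y) + 1
    H = [[0] * cols for _ in range(rows)]    # best score ending at (i, j)
    E = [[neg] * cols for _ in range(rows)]  # ... ending with a gap in x
    F = [[neg] * cols for _ in range(rows)]  # ... ending with a gap in y
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            # Either open a new gap (pay gap_open) or extend one (pay gap_extend).
            E[i][j] = max(H[i][j - 1] - gap_open, E[i][j - 1] - gap_extend)
            F[i][j] = max(H[i - 1][j] - gap_open, F[i - 1][j] - gap_extend)
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

With a high open penalty such as 11, the aligner prefers a few long gaps over many short ones; lowering `gap_open` makes gaps cheaper to start.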

2 Comments
2024/06/28
12:11 UTC

3

Finding exact location in genome of SNPs

I have a collection of SNPs in .vcf format. Unfortunately, they were generated using a non-genomic reference dataset in bowtie, so the locations and names do not refer to the genome. How do I go about finding the location of my SNPs in the reference genome of this species so I can compare them to the population as a whole?

I think I can extract the sequence upstream/downstream of the SNPs, but other than BLASTing I'm unaware of a straightforward way of finding the location in the genome.
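The flank-and-search idea can be sketched in a few lines: take the sequence around each SNP from the reference the VCF was called against, then look for that sequence (on either strand) in the genome and translate the SNP's offset into a genomic coordinate. This version uses exact string matching only, which is an assumption that will fail when a flank spans an exon junction or contains other variants; an aligner would be the more robust route. All function names are hypothetical.

```python
COMP = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMP)[::-1]


def snp_context(ref_seq, pos, flank=50):
    """Return (sequence around a 1-based VCF position, SNP offset within it)."""
    start = max(0, pos - 1 - flank)
    return ref_seq[start:pos + flank], pos - 1 - start


def locate_snp(context, offset, genome):
    """Find exact matches of the context in a dict of chromosome sequences.

    Returns a list of (chrom, 1-based position of the SNP, strand).
    """
    hits = []
    for chrom, seq in genome.items():
        for strand, query in (("+", context), ("-", revcomp(context))):
            idx = seq.find(query)
            while idx != -1:
                off = offset if strand == "+" else len(context) - 1 - offset
                hits.append((chrom, idx + off + 1, strand))
                idx = seq.find(query, idx + 1)
    return hits
```

A SNP with exactly one hit can be lifted over directly; zero or multiple hits are the cases worth inspecting by hand or re-running with an aligner.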

5 Comments
2024/06/28
08:30 UTC

7

Interest in working with paleogenomics and/or endangered or other animal conservation work, any suggestions for groups or companies to look at after graduation?

Sorry if the question is not relevant, and please delete it if it breaks the posting rules. For a little background, I am currently in the second semester of my online bioinformatics program and am based in the US. I’m not sure exactly what I want to do in the bioinformatics space, but I have always loved anything to do with animals. I’ve always had an interest in conservation and would love to be able to pursue some kind of bioinformatics career in that direction. Paleogenomics also seems like an interesting field to work in, but I am not too well versed in the space. I also know about Colossal and their hopes for de-extinction, which sounds really interesting as well, but I am unaware of any other groups or companies trying anything similar.

So my main questions are: if I am interested in working in a bioinformatics capacity on research or work concerning conservation or paleogenomics of some kind, does anyone have suggestions for publications to look at in those fields, books on these topics in the bioinformatics space, or other sources of information to learn more about the fields? Also, if anyone knows companies or groups working in these fields that I could check out, that’d be great as well!

Appreciate any help or advice and hope y’all have a good day!

3 Comments
2024/06/27
21:01 UTC

8

Classifying metabolites from host vs bacteria?

I have metabolomic data from infection sites and from host circulation. Is there a well-established way to identify the metabolites that can only be produced by bacteria? I imagine there will be a lot of overlap but there has to be a way to do this.

4 Comments
2024/06/27
19:59 UTC

2

Error with AutoPSF in VMD

I am conducting research for my masters thesis in biotechnology and was interested in performing MD simulations in VMD, however I keep running into an error in AutoPSF. No matter what molecule I have open in VMD, selecting "guess and split chains" results in an error saying "couldn't open '<input file name>_autopsf_preformat_glycan.pdb': no such file or directory". I can't seem to find anyone else talking about this error online, so I'm not sure if I'm just making a really obvious mistake. I'm very new to this program so I would greatly appreciate any insight!

0 Comments
2024/06/27
19:08 UTC

3

Help with understanding Read Alignment/Assembly

I am looking into sequence alignment/read alignment. Please help clarify if any of these are wrong. I am italicizing questions. Feel free to answer any!

  1. Sequence Alignment

This is used when you have a reference string and a query string, for example, identifying difference between two genomes or finding where one gene is located in your reference.

There are two approaches: local alignment and global alignment. Local alignment uses a DP approach that resets to 0 instead of negative, allowing some parts to be aligned and others not. Global alignment aligns the entire query sequence to the reference.

The result of a sequence alignment is a score of the best match and by backtracking, the alignment that supports this best match.
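The difference between the two modes is small in code. The sketch below scores both with the same dynamic-programming table: global alignment penalizes leading and trailing gaps and reads the score from the last cell, while local alignment clamps every cell at 0 and takes the maximum anywhere in the table. The match/mismatch/gap values are arbitrary illustration choices (protein alignments would normally use a substitution matrix such as BLOSUM62), and backtracking is left out for brevity.

```python
def align_score(query, ref, match=1, mismatch=-1, gap=-1, local=False):
    """Score of the best global (Needleman-Wunsch) or local (Smith-Waterman)
    alignment, using a linear gap penalty."""
    rows, cols = len(query) + 1, len(ref) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment charges for leading gaps; local alignment does not.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if query[i - 1] == ref[j - 1] else mismatch
            cell = max(H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # the "reset to 0" step
            H[i][j] = cell
            best = max(best, cell)
    return best if local else H[-1][-1]
```

For a short query embedded in a longer reference, the local score counts only the matching stretch, while the global score also pays for every unaligned reference base.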

What are possible scoring functions? How does sequence alignment differ for genome/RNA seq/proteins? How are multiple best match sequences delivered?

The most common heuristic for sequence alignment is a seed, filter, extend paradigm. What are other novel approaches? Do you think this heuristic is here to stay?

  2. Read Alignment: I am much less clear about this

There are two approaches: de novo read alignment and mapping to a reference genome

De novo read alignment heuristics have two steps: overlap detection and then sequence alignment for each read. In overlap detection, the algorithm tries to identify overlapping reads using seeding matches. Then, the best overlaps are aligned to each other.

Are there any good diagrams/papers about the general space?

Mapping to a reference genome is similar to sequence alignment except you do it for each read. You take a read and then seed, filter, and extend it on the reference genome. Then you move to the next read and so on.
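The seed, filter, extend loop for one read can be sketched as follows: index every k-mer of the reference once, look up the read's k-mers as seeds, let each seed vote for the reference start it implies (the filter), and then verify the best candidates by counting mismatches (a simplified, gap-free stand-in for the extend step). Real mappers use compressed indexes and gapped extension; the names and parameters here are illustrative assumptions.

```python
from collections import Counter, defaultdict


def build_index(ref, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index


def map_read(read, ref, index, k, max_mismatches=2):
    """Seed with exact k-mers, filter by vote, extend by counting mismatches.

    Returns (0-based reference start, mismatches) or None if nothing passes.
    """
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            votes[pos - offset] += 1  # implied start of the read on the reference
    for start, _ in votes.most_common():
        if start < 0 or start + len(read) > len(ref):
            continue
        window = ref[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            return start, mismatches
    return None
```

Because the index is built once and each read is mapped independently of the others, reads can be processed in any order or in parallel, and overlapping reads simply end up at overlapping coordinates.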

How does read alignment deal with overlapping read alignments? Can the read alignment be done all at once, without respect to which read it comes from?

0 Comments
2024/06/27
18:08 UTC

0

RNA virus (HIV, HCV, Influenza, or SARS-CoV-2) datasets with corresponding fitness values.

Hi everyone,

Does anyone know where I can find RNA virus (HIV, HCV, Influenza, or SARS-CoV-2) datasets with corresponding fitness values?

Thanks in advance!

2 Comments
2024/06/27
17:44 UTC

1

How do you process methylation data? No IDAT

Basically, I downloaded some data from GEO. I needed the raw methylation data, but GEO does not have the IDAT files. It only gives two CSVs, methylated and unmethylated. How do I process these into one matrix?

I wanted to download multiple different datasets from GEO, get the raw matrices, and then ultimately combine and normalise them.

Any help is greatly appreciated, this problem is driving me insane!
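When only methylated (M) and unmethylated (U) intensity tables are available, the two are commonly combined per probe and sample into a beta value, M / (M + U + offset), or an M-value, log2((M + alpha) / (U + alpha)). The sketch below does this with the standard library only. The CSV layout (probe ID in the first column, one column per sample, same sample order in both files), the offset of 100, and alpha of 1 are assumptions to check against each dataset.

```python
import csv
import io
import math


def read_matrix(handle):
    """Read a CSV with probe IDs in column 1 and one column per sample."""
    rows = csv.reader(handle)
    samples = next(rows)[1:]
    return samples, {r[0]: [float(v) for v in r[1:]] for r in rows}


def beta_and_m(meth, unmeth, offset=100.0, alpha=1.0):
    """Combine intensities into beta values and M-values per probe."""
    beta, mval = {}, {}
    for probe in meth.keys() & unmeth.keys():  # probes present in both tables
        pairs = list(zip(meth[probe], unmeth[probe]))
        beta[probe] = [m / (m + u + offset) for m, u in pairs]
        mval[probe] = [math.log2((m + alpha) / (u + alpha)) for m, u in pairs]
    return beta, mval


# Tiny in-memory stand-ins for the two GEO CSVs (values are made up).
meth_csv = io.StringIO("ID,s1,s2\ncg01,900,100\n")
unmeth_csv = io.StringIO("ID,s1,s2\ncg01,0,100\n")
samples, meth = read_matrix(meth_csv)
_, unmeth = read_matrix(unmeth_csv)
beta, mval = beta_and_m(meth, unmeth)
```

For real files, `open("methylated.csv")` and `open("unmethylated.csv")` (names hypothetical) replace the `StringIO` objects. Keeping the M and U tables themselves, rather than only the beta values, leaves room for intensity-based normalisation when the datasets are combined later.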

6 Comments
2024/06/27
15:55 UTC

7

Approximate per sample file size for gut microbiome sequencing.

Hey folks, I just want to know what the approximate FASTQ file size would be for one human gut microbiome sample if I want to do 16S rRNA and WGS sequencing. I want to purchase storage space for 500 samples accordingly. Thanks
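File size follows almost directly from read count and read length, so a back-of-the-envelope estimate is easy to script. Each FASTQ record has four lines (header, sequence, `+`, qualities), which gives roughly header + 2 x read length + 5 bytes uncompressed. The sequencing depths, the 60-byte header, and the 4x gzip ratio below are all assumptions, not figures from the post; substitute the numbers from your sequencing provider.

```python
def fastq_bytes(n_reads, read_len, paired=True, header_bytes=60):
    """Approximate uncompressed FASTQ size in bytes for one sample.

    Each record has four lines: header, sequence, '+', qualities.
    """
    per_record = header_bytes + 1 + (read_len + 1) + 2 + (read_len + 1)
    mates = 2 if paired else 1
    return n_reads * mates * per_record


def storage_gb(n_samples, n_reads, read_len, gzip_ratio=4.0, **kwargs):
    """Total gzipped size in GB, assuming a fixed compression ratio."""
    total = n_samples * fastq_bytes(n_reads, read_len, **kwargs)
    return total / gzip_ratio / 1e9


# Assumed depths (not from the post): 50,000 read pairs of 2x250 bp for 16S,
# 20 million read pairs of 2x150 bp for shotgun WGS.
print(f"16S, 500 samples: ~{storage_gb(500, 50_000, 250):.0f} GB")
print(f"WGS, 500 samples: ~{storage_gb(500, 20_000_000, 150):.0f} GB")
```

The shotgun data dominates the total by orders of magnitude, so the WGS depth is the number worth pinning down first. Leaving headroom for intermediate files (trimmed reads, alignments, assemblies) is also sensible, since those can exceed the raw data.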

5 Comments
2024/06/27
15:49 UTC

4

Subseq object sizes

Hi,

I wonder if I’m being dense here or not. I’m aligning islands from a few different, related bacterial strains. I’ve identified the coordinates of the islands in each assembly that I’m working with, and I’ve managed to pull out the sequence from those assemblies using subseq().

What’s confusing me is that the original DNAStringSet with ~100 chromosomes, and the DNAStringSet I’m generating from subseq are the same size - I figured that the subseq product would be smaller. Am I being dumb?

1 Comment
2024/06/27
14:44 UTC

2

Need help classifying species by phylum

I have a txt list of a few thousand species names and I need to categorize them by their different phyla.

I'm sure there's an online tool, or a script I can write, that will do that, but I have no idea what appropriate tools exist or which modules to use in Python (I tried using Biopython, but Spyder refuses to recognize that I installed it).

TIA!
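One dependency-free option is to use the NCBI taxonomy dump (`nodes.dmp` and `names.dmp` from the taxdump archive) and walk up from each species to the ancestor whose rank is "phylum". The sketch below parses the dump's `tab|tab`-separated lines with plain Python; the function names are hypothetical, and names that do not match an NCBI scientific name exactly (synonyms, misspellings) simply return None and need manual handling.

```python
def load_taxdump(nodes_lines, names_lines):
    """Parse NCBI nodes.dmp / names.dmp lines into lookup dictionaries."""
    parent, rank = {}, {}
    for line in nodes_lines:
        tax_id, parent_id, node_rank = line.split("\t|\t")[:3]
        parent[tax_id], rank[tax_id] = parent_id, node_rank
    name_to_id, id_to_name = {}, {}
    for line in names_lines:
        tax_id, name, _, name_class = line.split("\t|\t")[:4]
        if name_class.rstrip("\t|\n") == "scientific name":
            name_to_id[name.lower()] = tax_id
            id_to_name[tax_id] = name
    return parent, rank, name_to_id, id_to_name


def phylum_of(species, parent, rank, name_to_id, id_to_name):
    """Walk up the taxonomy from a species name to its phylum, if any."""
    tax_id = name_to_id.get(species.strip().lower())
    # The root node is its own parent, which ends the walk.
    while tax_id is not None and parent.get(tax_id) != tax_id:
        if rank.get(tax_id) == "phylum":
            return id_to_name[tax_id]
        tax_id = parent.get(tax_id)
    return None
```

With the dump files on disk, `load_taxdump(open("nodes.dmp"), open("names.dmp"))` builds the tables once, and `phylum_of` can then be called for every line of the species list. On the Spyder issue: a package that installs fine but is "not recognized" has often gone into a different Python environment than the one Spyder is running, so that is worth checking first.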

3 Comments
2024/06/27
14:00 UTC
