/r/bioinformatics

Photograph via snooOG

A subreddit to discuss the intersection of computers and biology.


A subreddit dedicated to bioinformatics, computational genomics and systems biology.

The Biology Network
science askscience biology
microbiology bioinformatics biochemistry
evolution
Bioinformatics

news for genome hackers

Frequently Asked Questions
New to Reddit?
Learning Bioinformatics
#bioinformatics IRC at Freenode
Information
  • If you have a specific bioinformatics related question, there is also the question and answer site BioStar and the next generation sequencing community SEQanswers
  • If you want to read more about genetics or personalized medicine, please visit /r/genomics
  • Information about curated, biological-relevant databases can be found in /r/BioDatasets
  • Multicore, cluster, and cloud computing news, articles and tools can be found over at /r/HPC.
Getting a job in bioinformatics
Friends

/r/bioinformatics

107,007 Subscribers

6

I’m a research tech training tk be a bioinformaticist and I feel like 3 months into this I’m going to be let go

For background, I work for a research lab under a brand new PI. It’s a new lab, obviously, so there’s only my PI, myself, another tech and our lab manager. I’m the only one in the group that has shown interest in bioinformatics. My background is I’ve worked in pre-clinical (animal) and lab animal research for 5 years, most animal care/project planning for researchers, and in my free time I’m self-taught in Python and R. I have a passion for using deep learning to mitigate the need for so many animals in research, but I’m self taught.. so I know there’s gaps in my knowledge.

Anyways, fast forward to last year when I saw this tech position open up. I applied and brought up my interest in bioinformatics, since the new PI’s research strongly uses it for their work. He said he would hire me on as 50% wet lab and 50% dry lab. I felt like I landed my absolute dream path to my dream job. So I took it. I have a biology undergrad, but it’s been some years since I used a lot of the genetics and molecular biology knowledge, so I brushed up on it and basic bioinformatics tools like Seurat/Signac,bioconductor, etc.

4 months into the job now and I’m absolutely miserable. Well, mostly. I love the work I’m doing. I’ve been given tons of computational projects, anything from basic preprocessing our massive multiome data for downstream analyses to actually doing cell-type-specific analyses like motif discovery of selector enhancers or chromatin accessibility changes within similar cell types across development. And it’s been fun. But my PI.. is not fun to deal with. Every week we have a meeting to go over my scripts and talk about projects. Any time I ask a question to clarify, he says “we discussed this” or “we already talked about this”. But when I don’t ask questions, I make very stupid mistakes in my scripts that he catches. Today he told me that by now (not quite 4 months) he feels like he shouldn’t have to hold my hand and guide me through these things and that I should be capable. I was shocked. I have a basic biology background, basic coding background, next to no wet lab nor neuroscience experience.. I feel completely saturated with information, and I can’t retain it all. So of course I’m not going to be a fleshed out bioinformaticist yet.. is this how it usually is? My title is research tech, and I only wanted training for a bioinformaticist position that I would consider in the future. Like, I’m thinking a few years, not months. I just don’t know what to do. I’m starting to feel so discouraged and I hate going into meetings with him because I know he’s going to shred me. I want to be okay with my mistakes and learn from them, but our interactions give me so much anxiety that I feel like I don’t even want to try so I can’t fail. He’s so smart, and learns things SO fast. I don’t even know how to breach this subject with him because we’re just so different in our learning speeds and modalities. I feel like I should expect to be let go soon but I’m doing my best and I feel like I’m making real progress. I feel so defeated and I wanted this job more than anything.

5 Comments
2024/04/19
23:07 UTC

1

Git hooks in bioinformatics

An external consultant recently proposed in my company the usage of git hooks in our workflows. I am not familiar with the concept as we are small and work individually in our projects, but I would like to understand what is the intended use for them.

An I understood they are functions executed before and after a commit, but I find that concept limited and would like to know some examples on how do they improve the work.

1 Comment
2024/04/19
21:42 UTC

2

Why p value changes when using different number of pathway score?

Hi all,

I use GSVA (Gene Set Variation Analysis) package to calculate pathway scores. Then I compare pathway scores between 3 groups using limma. When I use a few thousand pathways, p value of pathway A will different (bigger) with when I use only a few pathways. I think it should be the same because each pathway is independent (adj p value changes because number of comparison change). Basically, it is a Anova test. Would you please have an explanation and which practice I should do in this case? Thank you so much!

4 Comments
2024/04/19
20:50 UTC

2

How are Diffusion Maps from ScanPy interpreted?

My main goal is to be able to use Diffusion Maps in a way similar to sklearn.decomposition.PCA and understand how to interpret the output (e.g., eigenvalues, eigenvectors, and transformed space).

The functionality I'm looking for is to be able to use .fit. and .transform to a model object. I found 2 packages that do this but they are not actively developed AFAIK:

However, mapalign doesn't support the .transform method.

I found another implementation in ScanPy which is actively developed so I'd like to use this and make a model that can support .fit and .transform methods.

I'm trying to adapt this to a very simple case which is the iris dataset:

import anndata as ad
import scanpy as sc
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix
from sklearn import datasets

iris = datasets.load_iris()
X_iris = pd.DataFrame(
    iris.data,
    columns=iris.feature_names,
)
X_iris.index = X_iris.index.map(str)
df_meta = pd.Series(iris.target).map(lambda x: iris.target_names[x]).to_frame("species")
df_meta.index = df_meta.index.map(str)


adata = ad.AnnData(X_iris, obs=df_meta)
sc.pp.neighbors(adata, n_neighbors=10, use_rep='X', method='gauss', metric="euclidean")
sc.tl.diffmap(adata, n_comps=min(adata.shape))
adata.obsm["X_diffmap_"] = adata.obsm["X_diffmap"][:, 1:] 
sc.pl.embedding(adata, "diffmap_", color=["species"])

Unknown|660x429

If anyone is familiar with ScanPy or Diffusion Maps, any insight regarding the following questions would be extremely useful:

  1. How are the distances from ScanPy stored in adata.obsp["distances"] calculated?

It looks like they are calculated with the following code but I was wrong:

neigh = NearestNeighbors(n_neighbors=10,metric="euclidean")
neigh.fit(adata.X)
distances = neigh.kneighbors_graph(np.ascontiguousarray(adata.X, dtype=np.float32), mode="distance")
np.allclose(distances.toarray(), adata.obsp["distances"].toarray())
# False
  1. Similarly, how are the connectivities calculated?

I thought it would be from similar code but Sklearn's NearestNeighbors returns 1/0 not continuous floats:

neigh.kneighbors_graph(np.ascontiguousarray(adata.X, dtype=np.float32), mode="connectivity")
  1. How are the eigenvalues interpreted by the Diffusion Map?

According to this thread, you are supposed to drop the first dimension because it's the steady state and not informative: https://github.com/scverse/scanpy/issues/2254

adata.uns["diffmap_evals"]
# array([1.        , 1.        , 0.9813439 , 0.94277596], dtype=float32)
  1. Are the eigenvalues typically transformed and is that what is happening in the backend of ScanPy?

So if you are projecting into 4D as I am above, you would only analyze the last 3 dimensions.

In the literature, it says the steady state eigenvector's eigenvalue is 0 and the importance is inversely proportional to the non-zero eigenvalue.

The Laplacian eigenvectors that are identified in the processare also of interest for a different reason. Thenth eigenvectorvncontains one element corresponding to each of the species, whichis related to thenth i-trait for that species. The corresponding eigenvalue λn is inversely proportional to the relative importance of this nth trait axis (SI Appendix).

Ryabov et al. 2022 doi:10.1073/pnas.2118156119

  1. Are eigenvalues in descending order of importance?

Based on the context from the previous question, what does it mean that the 2nd eigenvector is also 1? Does that mean this is also uninformative?

  1. What does it mean when you get negative eigenvalues?

In many cases, I get negative eigenvalues back. For example, the following:

adata = ad.AnnData(X_iris, obs=df_meta)
sc.pp.neighbors(adata, n_neighbors=10, use_rep='X', method='gauss', metric="euclidean")
sc.tl.diffmap(adata, n_comps=25)
adata.uns["diffmap_evals"]
array([ 1.        ,  1.        ,  0.9813439 ,  0.94277596,  0.9376262 ,
        0.8767288 ,  0.8594042 ,  0.7913928 ,  0.7608171 ,  0.66026914,
        0.62744224,  0.62703806,  0.5787338 ,  0.51345664,  0.5119588 ,
        0.47451586,  0.4733336 ,  0.42002645,  0.41362438,  0.39693147,
        0.35458508, -0.33341438, -0.34987825, -0.3791933 , -0.4204018 ],
      dtype=float32)
  1. Why does the Diffusion Map process allow for more dimensions than the starting space? Are any of these dimensions informative? Is there a rule of thumb when selecting output dimensions?

For the iris dataset, I was able to use n_comps=200 which is more dimensions than both the samples and the features in the original data frame. However, most of them were negative:

sc.tl.diffmap(adata, n_comps=200)
adata.uns["diffmap_evals"]
array([ 1.          ,  1.          ,  0.9813439   ,  0.94277596  ,
        0.9376262   ,  0.8767288   ,  0.8594042   ,  0.7913928   ,
        0.7608171   ,  0.66026914  ,  0.62744224  ,  0.62703806  ,
        0.5787338   ,  0.51345664  ,  0.5119588   ,  0.47451586  ,
        0.4733336   ,  0.42002645  ,  0.41362438  ,  0.39693147  ,
        0.35458508  ,  0.2911088   ,  0.2740729   ,  0.25649452  ,
        0.22609575  ,  0.20761564  ,  0.19950579  ,  0.18004218  ,
        0.15906486  ,  0.13327035  ,  0.1303541   ,  0.13028115  ,
        0.11071147  ,  0.10258841  ,  0.075313695 ,  0.051683918 ,
        0.048783954 ,  0.043516714 ,  0.028213583 ,  0.023989521 ,
        0.016956184 ,  0.014117415 ,  0.0069459495, -0.010597833 ,
       -0.011460057 , -0.01403651  , -0.019394804 , -0.021086005 ,
       -0.030575594 , -0.032255445 , -0.03802568  , -0.041902117 ,
       -0.045432337 , -0.048568282 , -0.05079374  , -0.05370096  ,
...
0 Comments
2024/04/19
19:10 UTC

1

need help regarding an assignment around protein nucleation using monte carlo algorithm.

https://preview.redd.it/23evipq93hvc1.jpg?width=925&format=pjpg&auto=webp&s=16897149533510b26e136bb90a8cb298fd0d8613

i just need to know what type of graph should 2.a) and 2.b) parts of the question produce.

3 Comments
2024/04/19
17:42 UTC

1

Why is high N50 value is correlated with better quality?

The above

8 Comments
2024/04/19
17:41 UTC

6

Bioinformatics consulting

Any other full time bioinformatics consultants on this subreddit? Ive been doing full time work for several years (post-postdoc) and would love to hear from other consultants about their approaches/resources.

2 Comments
2024/04/19
15:53 UTC

1

BED file containing neutral regions in the hg38 genome

Dear all,

I am looking for a BED file containing neutral regions in the hg38 genome so that I can make a neutral model to do conservation scoring (phyloP) on a series of species (Primates) I have in a WGA. Where can I find such a BED file?

Additionally, say I want the BigWig file for each species I individually run phyloP for (where the reference genome is hg38), do I need a neutral model for each of these species? Or is a neutral model simply based on the reference genome of hg38 satisfactory?

0 Comments
2024/04/19
15:02 UTC

0

Help with visualisation of large communties in R

Hello communitiy, I am currently working with R (I am a beginner) where I am trying to analyse a given dataset with the Louvain algorhitm to cluster genes in communties. My data included roughly 10.000 genes from 155 samples. I would like to analyse and in the best case visualize the communties. Right now my code caps out at 2000 genes. I already filtered the genes to only use sognificant ones and ones with higher logFC than 1. Reducing the number to 7000. I would really appreciate if anybody can recommend some methods or packages. Thank you so much! If my code or the source of the dataset is needed, I will gladly provide more Info!

3 Comments
2024/04/19
14:36 UTC

2

cancer related databases

Dear All,

I am writing this post cause I am giving up on this super simple task. I have an excel sheet of genes found mutated in our samples, according to one analysis these genes were found to be of unknown significance, in terms of being pathological or not.

Now I need to cross reference this huge list of genes with other databases like dbSNB or comsmic or whatever but I really cannot get my hand on one clear data base that will allow me to compare if these mutations are truly of unknown significance.

Any help?

PS: the bioinformatician who did the analysis is out of reach, and I don't have access to the raw data, the only thing I have is this excel sheet.

9 Comments
2024/04/19
13:36 UTC

0

Extracting matching sequences mmseqs2

Hi everyone !
For my analysis, I'm using mmseqs2 (bash command) to search for homologous sequences using the easy-rbh (reciprocal best hit) option. In the output obtained, I only get the identifier of my best hit, as well as the hit values (e-value, identity, etc...) but I'm wondering if there's a way to also get the fasta sequence associated with this identifier directly in the output file ? This would allow me to avoid wasting time matching files in my database.
Thanks in advance ;)

1 Comment
2024/04/19
12:10 UTC

1

I'm analyzing pre-sequenced and annotated bacterial genomes, need help interpreting.

for a school project I am using Artemis to analyze thee bacteria. I have downloaded whole, completed genomes for these thee, and am noting down what genes have been annotated with interesting function descriptions. But what i truly wanna know is what these bacteria do as proven by their genome. These bacteria should inhibit nitrogen and carbon emissions from manure, and I'd love to indicate this function from their genome. But I am not sure how to.

currently, I am using NiH database to get my info and KAAS (KEGG) (genome.jp) to BLAST the sequence to get pathways. But KEGG only shows me 2 pathways when i run it through, and both are quorum sensing. What engine can i use, or what can i do to get a clear function read from the genomes?

Help,

An overachieving bachelor student.

2 Comments
2024/04/19
10:49 UTC

1

Sanger sequensing analysis

I am trying to learn sanger sequencing (how to interprate the chromatogram). In lab we inserted mCherry gene into the PLAC plasmid. Restriction enzyme digestion indicated a succesful insertion. The sequencing results came back lite this:

The seqencing primer is the dar green arrow before the gene. I am thinking, shouldn't the alignment be further to the right? If someone could give me a push the right way...

Might sound like a dumb question but I am new to this.

https://preview.redd.it/4z3rkmz5bevc1.png?width=1920&format=png&auto=webp&s=d31bd9adbc00f2326030f2dd7b6cc3e593a69f71

region before prime

https://preview.redd.it/i6hohaa9bevc1.png?width=1920&format=png&auto=webp&s=7b40deaef00a0a7111f96f074a8372d770564a0b

region after

10 Comments
2024/04/19
06:58 UTC

1

Request for Guidance on Incorporating MIPC Lipid in MD Simulations of Peptide/Fungal Membrane Complex

Dear expert,

I am Sonal MSc second year Bioinformatics student. I am writing to discuss an issue I encountered while working on building a peptide/fungal membrane complex for molecular dynamics (MD) simulation on CHARMM - GUI

Specifically, I am facing difficulties in incorporating MIPC (Mannose-(inositol-P)2-ceramide) lipid into the membrane model. MIPC is an important glycolipid component for fungal membranes, and its inclusion is crucial for accurately representing the membrane environment in our simulations.

I would appreciate your guidance on how to effectively add MIPC lipid to the membrane model. One suggestion is to categorize MIPC under the ceramide category, considering its structural and functional similarities. Alternatively, if there are other lipid options from the existing list that you recommend as alternatives to MIPC, I would be eager to explore those possibilities.

Thank you very much for your attention to this matter. I look forward to your valuable advice and suggestions.

We tried making the MIPC by myself in CHARMM-GUI below is the picture which did not work out.

Option available on the CHARMM GUI

2 Comments
2024/04/19
05:59 UTC

2

Ties in FGSEA

Hi all,

I am running a FGSEA using, "https://stephenturner.github.io/deseq-to-fgsea/". I was looking through my data as well as R-studio to figure out this issue.

In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (4.08% of the list).

Even though the 4% may seem relatively small, some pathway goes from signficant to non-signigifcant. Using any duplicate feature on R, it returned as zero.

Does anyone know how to resolve this issue? Maybe a new ranking metric; if so how would I rewrite the code for FGSEA?

4 Comments
2024/04/19
01:44 UTC

1

Trouble With Indexing My GRCh38

I am currently a university student using software provided to me by my lab, and am on the cloud, having downloaded the fa file of my genome, now trying to index it. There is enough storage where I have placed the genome, however I am still getting an error when I do "bwa index ___.fa" after many iterations "Failed to allocate 3088336136 bytes at bwtindex.c line 158: Cannot allocate memory". any clue why this may be, or how I can fix it?

Sorry I know this is hyperspecific but I don't know where else to turn.

4 Comments
2024/04/18
23:52 UTC

2

How to determine copy number variations given read depth data?

Very new to this.
Here is the example data (1-7 patient samples). Probes in the “non” region have CN=2. The depth at a probe is expected to be linearly proportional to the copy number of the DNA at that site. How would you normalize the data (if necessary) and is there a R package for this type of analysis?

https://preview.redd.it/0ocgk1b31bvc1.png?width=679&format=png&auto=webp&s=b78c9fde9186518a0da2a7943940cf3416b04851

8 Comments
2024/04/18
21:17 UTC

2

Ensembl Reference For Mouse Strain GTF Types: Abinitio, Chr And One Other

I have noticed this before when I was looking at the Ensembl Zebrafish fasta files. There are 3 types of `gtf` file that Ensembl gives. I am looking at this page for mice:

Mus_musculus.GRCm39.111.abinitio.gtf.gz2023-10-08 14:49 110 

Mus_musculus.GRCm39.111.chr.gtf.gz2023-10-08 14:49 31M 

Mus_musculus.GRCm39.111.gtf.gz2023-10-08 14:49 31M 

Which one is best to use with STAR for alignment the one ending with chr.gtf.gz or gtf.gz? The README explains what the abinito one is for but not the other 2. Any guidance is much appreciated.

0 Comments
2024/04/18
21:02 UTC

0

Question about removing sequence before primers using R Biostrings

hello,

I have a set of data that I wish to clean before analysis.

basically I have multiple sequences of different species and want to trim the edge of the sequences based on the primer sequence.

I am able to remove the primer sequence when it is just right at the edge of the sequence but when it's a bit further I can't figure out how to do it on R.

so basically I have

sequence <- "AGTCATGCGACTGCATA-rightprimer-AFTGACATACA

and I wish to obtain after trimming

sequence <- "AGTCATGCGACTGCATA"

however since there is a bit of parasitic sequence before the primer, the function matchLRPatterns doesn't want to trim the edges.

thanks !

7 Comments
2024/04/18
20:27 UTC

28

Happy Lab Week!

It is lab week. Happy Lab Week!

The bioinformaticians at my place of employment were not invited to any of the lab week celebrations.

I wanted to wish everyone here a "Happy Lab Week" in case you were also forgotten.

11 Comments
2024/04/18
17:03 UTC

4

downloading pdb files on rcsb.org

Hello everyone I am working on molecular docking and ı want to download all pdb's together as pdb format according to search (name of protein, name of organism) on rcsb.org . can someone help me if there is a way to do it, how it can be done? thanks for any help

6 Comments
2024/04/18
15:59 UTC

6

Seeking Recommendations for Bioinformatics Tools in Single-Cell RNA-Seq Analysis

Hi everyone,

I'm currently engaged in a project where we aim to replicate the computational analysis of a paper that explored inter- and intratumour heterogeneity in metastatic breast cancer through single-cell RNA-Seq analysis. Our focus is to use different tools/pipelines compared to the ones used by the original authors. So far, we've used HISAT2 for alignment, sorting, and indexing, but we're exploring alternatives for the other stages of analysis.

We need a replacement for the rsubreads function (used by the authors to generate counts) and tools similar to the griph package for cell cycle correction. We aim to produce a count matrix using the different tools and then apply it in a Seurat pipeline for PCA, differential gene expression analysis, and gene set enrichment analysis.

Can anyone recommend tools that are relatively easy to learn and efficient to use? Time is of the essence, and while we're keen on exploring methods, we can't afford a steep learning curve right now. Your suggestions would be invaluable!

Thanks in advance!

5 Comments
2024/04/18
12:50 UTC

0

Opening seurat v5 .rds data with v4

Hi,

Is there a good way to open postQC seurat v5 object on v4?

0 Comments
2024/04/18
12:42 UTC

1

RIN/DIN for downstream multi-omic analysis

Hi there,

Has anyone extracted RNA and/or DNA from small intestinal neuroendocrine tumours and sent for DNA/ RNA Seq analysis. What DIN/RIN values have you seen- anything above 5? just wondering if this is an isolated issue or related to the tissue itself. Tissue is usually fresh frozen on the day of surgery and extracted after one thaw or immediate extraction after transport on ice. Extracted with Qiagen kits.

Any thoughts/ help/ experiences would be useful

0 Comments
2024/04/18
09:59 UTC

2

Efficient SMILES database

I am creating my Final project for my bachelor degree, my idea is creating a little "framework" that predicts drugs or molecular properties from smiles sequences(BBB permiability, toxicity, reactivity..), This part is working fine.

My question is how do I store this sequences on a Database efficiently?, my idea is that if I give one input sequence the database should output the top 5 most similar sequences.

I would appreciate if anyone know about a resource or could give me advice about the type of database or algorithm i should be looking for.

6 Comments
2024/04/18
08:23 UTC

1

Is there currently a metric for measuring the "distance" between two proteins in Reactome or any other biological pathway database?

I'm interested in seeing how the knockout of one gene may affect the expression patterns of other genes (kind of like an effect size), and I wonder if the distance between two given proteins in Reactome may be a good predictor. Not sure how I would define distance, one trivial definition may be the number of nodes between the pair. Any help would be appreciated!

4 Comments
2024/04/18
07:50 UTC

8

An issue with peptide modeling

Dear fellow Redditors, I was asked at my lab to predict a 3D structure of a peptide derived from a protein (for which the structure was resolved). I have a little experience with that, but started the research.

I ended up using AlphaFold, PEPstrMOD and PEP-FOLD for this purpose. Well, nearly each best model of the three software looked different. I performed the RMSD (by aligning the models) and the values were high (from 4 to over 15). They rather did not look alike. Sometimes softwares returned similar models for other peptides I've tried (but at most 2 of 3 models were kind of similar).

I also looked at the Ramachandran plot. There were some outliers. I used the tool to refine the structure, but looking at the 3D model, not much changed - not the general peptide overview, but rotations of some bonds.

I do not know if I am thinking right, but to me for the model to be correct, all the three software for modeling would have to agree (ergo, RMSD values will be low, around 1). I would feel calmer.

I've found information that peptides are "highly flexible molecules", "do not assume a single defined structure in solution", "have high conformational flexibility".

Is there any way to predict 3D structure of a peptide (7-25 aa) at all? What should I aim for? How can I choose the best model? How can I compare models from different sources? Any hints will be highly appreciated, as I am lost.

9 Comments
2024/04/18
07:48 UTC

1

Convert .cloupe file

Hello,

I recently got send .cloupe files and I would like to analyse them in R (just starting to learn it). Is there a possibility to convert .cloupe back to a different file type, as I would like to use R to analyse it. I have been googling for quiet some time but I cannot find a solution.

Thank you in advance

1 Comment
2024/04/18
07:33 UTC

2

PhD Co-supervision set up question (wet and dry lab professors)

I am doing a fully computational Bioinformatics PhD, and I have the option to be co-supervised by a pure wet lab PI and a pure dry lab PI. This is an existing collaboration between the two PIs and they are co-writing grants on several projects. If I were to join these labs, I would be working on modelling data generated by the wet lab PI. I will not be doing any experiments. The wet lab PI offers outstanding clinical and biological knowledge and can interpret results while the dry lab PI is very methodologically heavy, but has little biological intuition.

The issue is that the computational PI is at a satellite campus and I will be situated in the wet lab space as this is where my graduate program is based. Therefore, I will only be able to work with the computational PI and their lab through slack and zoom.

I believe I can learn a lot from this joint mentorship, but I worry about the lack of computational grad students and postdocs available to help in person in the wet lab space. I'm certain I will get excellent biological knowledge and guidance, but lack in-person computational guidance.

I'm wondering if people have experience with these setups? Do you think this is a reasonable setup, or would I lose a lot from not having access to in-person computational support?

6 Comments
2024/04/18
04:02 UTC

1

qiime2 on putty

Hi! Big time newbie in bioinformatics. Is there a way to activate, or call up qiime2 on putty for microbiome analysis?

5 Comments
2024/04/18
01:52 UTC

Back To Top