/r/bioinformatics
A subreddit dedicated to bioinformatics, computational genomics and systems biology.
science | askscience | biology |
microbiology | bioinformatics | biochemistry |
evolution |
news for genome hackers
- If you have a specific bioinformatics related question, there is also the question and answer site BioStar and the next generation sequencing community SEQanswers
- If you want to read more about genetics or personalized medicine, please visit /r/genomics
- Information about curated, biological-relevant databases can be found in /r/BioDatasets
- Multicore, cluster, and cloud computing news, articles and tools can be found over at /r/HPC.
/r/bioinformatics
For background, I work for a research lab under a brand new PI. It’s a new lab, obviously, so there’s only my PI, myself, another tech and our lab manager. I’m the only one in the group that has shown interest in bioinformatics. My background is I’ve worked in pre-clinical (animal) and lab animal research for 5 years, most animal care/project planning for researchers, and in my free time I’m self-taught in Python and R. I have a passion for using deep learning to mitigate the need for so many animals in research, but I’m self taught.. so I know there’s gaps in my knowledge.
Anyways, fast forward to last year when I saw this tech position open up. I applied and brought up my interest in bioinformatics, since the new PI’s research strongly uses it for their work. He said he would hire me on as 50% wet lab and 50% dry lab. I felt like I landed my absolute dream path to my dream job. So I took it. I have a biology undergrad, but it’s been some years since I used a lot of the genetics and molecular biology knowledge, so I brushed up on it and basic bioinformatics tools like Seurat/Signac,bioconductor, etc.
4 months into the job now and I’m absolutely miserable. Well, mostly. I love the work I’m doing. I’ve been given tons of computational projects, anything from basic preprocessing our massive multiome data for downstream analyses to actually doing cell-type-specific analyses like motif discovery of selector enhancers or chromatin accessibility changes within similar cell types across development. And it’s been fun. But my PI.. is not fun to deal with. Every week we have a meeting to go over my scripts and talk about projects. Any time I ask a question to clarify, he says “we discussed this” or “we already talked about this”. But when I don’t ask questions, I make very stupid mistakes in my scripts that he catches. Today he told me that by now (not quite 4 months) he feels like he shouldn’t have to hold my hand and guide me through these things and that I should be capable. I was shocked. I have a basic biology background, basic coding background, next to no wet lab nor neuroscience experience.. I feel completely saturated with information, and I can’t retain it all. So of course I’m not going to be a fleshed out bioinformaticist yet.. is this how it usually is? My title is research tech, and I only wanted training for a bioinformaticist position that I would consider in the future. Like, I’m thinking a few years, not months. I just don’t know what to do. I’m starting to feel so discouraged and I hate going into meetings with him because I know he’s going to shred me. I want to be okay with my mistakes and learn from them, but our interactions give me so much anxiety that I feel like I don’t even want to try so I can’t fail. He’s so smart, and learns things SO fast. I don’t even know how to breach this subject with him because we’re just so different in our learning speeds and modalities. I feel like I should expect to be let go soon but I’m doing my best and I feel like I’m making real progress. I feel so defeated and I wanted this job more than anything.
An external consultant recently proposed in my company the usage of git hooks in our workflows. I am not familiar with the concept as we are small and work individually in our projects, but I would like to understand what is the intended use for them.
An I understood they are functions executed before and after a commit, but I find that concept limited and would like to know some examples on how do they improve the work.
Hi all,
I use GSVA (Gene Set Variation Analysis) package to calculate pathway scores. Then I compare pathway scores between 3 groups using limma. When I use a few thousand pathways, p value of pathway A will different (bigger) with when I use only a few pathways. I think it should be the same because each pathway is independent (adj p value changes because number of comparison change). Basically, it is a Anova test. Would you please have an explanation and which practice I should do in this case? Thank you so much!
My main goal is to be able to use Diffusion Maps in a way similar to sklearn.decomposition.PCA
and understand how to interpret the output (e.g., eigenvalues, eigenvectors, and transformed space).
The functionality I'm looking for is to be able to use .fit.
and .transform
to a model object. I found 2 packages that do this but they are not actively developed AFAIK:
However, mapalign
doesn't support the .transform
method.
I found another implementation in ScanPy
which is actively developed so I'd like to use this and make a model that can support .fit
and .transform
methods.
I'm trying to adapt this to a very simple case which is the iris dataset:
import anndata as ad
import scanpy as sc
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix
from sklearn import datasets
iris = datasets.load_iris()
X_iris = pd.DataFrame(
iris.data,
columns=iris.feature_names,
)
X_iris.index = X_iris.index.map(str)
df_meta = pd.Series(iris.target).map(lambda x: iris.target_names[x]).to_frame("species")
df_meta.index = df_meta.index.map(str)
adata = ad.AnnData(X_iris, obs=df_meta)
sc.pp.neighbors(adata, n_neighbors=10, use_rep='X', method='gauss', metric="euclidean")
sc.tl.diffmap(adata, n_comps=min(adata.shape))
adata.obsm["X_diffmap_"] = adata.obsm["X_diffmap"][:, 1:]
sc.pl.embedding(adata, "diffmap_", color=["species"])
If anyone is familiar with ScanPy
or Diffusion Maps, any insight regarding the following questions would be extremely useful:
ScanPy
stored in adata.obsp["distances"] calculated?It looks like they are calculated with the following code but I was wrong:
neigh = NearestNeighbors(n_neighbors=10,metric="euclidean")
neigh.fit(adata.X)
distances = neigh.kneighbors_graph(np.ascontiguousarray(adata.X, dtype=np.float32), mode="distance")
np.allclose(distances.toarray(), adata.obsp["distances"].toarray())
# False
I thought it would be from similar code but Sklearn's NearestNeighbors returns 1/0 not continuous floats:
neigh.kneighbors_graph(np.ascontiguousarray(adata.X, dtype=np.float32), mode="connectivity")
According to this thread, you are supposed to drop the first dimension because it's the steady state and not informative: https://github.com/scverse/scanpy/issues/2254
adata.uns["diffmap_evals"]
# array([1. , 1. , 0.9813439 , 0.94277596], dtype=float32)
So if you are projecting into 4D as I am above, you would only analyze the last 3 dimensions.
In the literature, it says the steady state eigenvector's eigenvalue is 0 and the importance is inversely proportional to the non-zero eigenvalue.
The Laplacian eigenvectors that are identified in the processare also of interest for a different reason. Thenth eigenvectorvncontains one element corresponding to each of the species, whichis related to thenth i-trait for that species. The corresponding eigenvalue λn is inversely proportional to the relative importance of this nth trait axis (SI Appendix).
Ryabov et al. 2022 doi:10.1073/pnas.2118156119
Based on the context from the previous question, what does it mean that the 2nd eigenvector is also 1? Does that mean this is also uninformative?
In many cases, I get negative eigenvalues back. For example, the following:
adata = ad.AnnData(X_iris, obs=df_meta)
sc.pp.neighbors(adata, n_neighbors=10, use_rep='X', method='gauss', metric="euclidean")
sc.tl.diffmap(adata, n_comps=25)
adata.uns["diffmap_evals"]
array([ 1. , 1. , 0.9813439 , 0.94277596, 0.9376262 ,
0.8767288 , 0.8594042 , 0.7913928 , 0.7608171 , 0.66026914,
0.62744224, 0.62703806, 0.5787338 , 0.51345664, 0.5119588 ,
0.47451586, 0.4733336 , 0.42002645, 0.41362438, 0.39693147,
0.35458508, -0.33341438, -0.34987825, -0.3791933 , -0.4204018 ],
dtype=float32)
For the iris dataset, I was able to use n_comps=200
which is more dimensions than both the samples and the features in the original data frame. However, most of them were negative:
sc.tl.diffmap(adata, n_comps=200)
adata.uns["diffmap_evals"]
array([ 1. , 1. , 0.9813439 , 0.94277596 ,
0.9376262 , 0.8767288 , 0.8594042 , 0.7913928 ,
0.7608171 , 0.66026914 , 0.62744224 , 0.62703806 ,
0.5787338 , 0.51345664 , 0.5119588 , 0.47451586 ,
0.4733336 , 0.42002645 , 0.41362438 , 0.39693147 ,
0.35458508 , 0.2911088 , 0.2740729 , 0.25649452 ,
0.22609575 , 0.20761564 , 0.19950579 , 0.18004218 ,
0.15906486 , 0.13327035 , 0.1303541 , 0.13028115 ,
0.11071147 , 0.10258841 , 0.075313695 , 0.051683918 ,
0.048783954 , 0.043516714 , 0.028213583 , 0.023989521 ,
0.016956184 , 0.014117415 , 0.0069459495, -0.010597833 ,
-0.011460057 , -0.01403651 , -0.019394804 , -0.021086005 ,
-0.030575594 , -0.032255445 , -0.03802568 , -0.041902117 ,
-0.045432337 , -0.048568282 , -0.05079374 , -0.05370096 ,
...
i just need to know what type of graph should 2.a) and 2.b) parts of the question produce.
The above
Any other full time bioinformatics consultants on this subreddit? Ive been doing full time work for several years (post-postdoc) and would love to hear from other consultants about their approaches/resources.
Dear all,
I am looking for a BED file containing neutral regions in the hg38 genome so that I can make a neutral model to do conservation scoring (phyloP) on a series of species (Primates) I have in a WGA. Where can I find such a BED file?
Additionally, say I want the BigWig file for each species I individually run phyloP for (where the reference genome is hg38), do I need a neutral model for each of these species? Or is a neutral model simply based on the reference genome of hg38 satisfactory?
Hello communitiy, I am currently working with R (I am a beginner) where I am trying to analyse a given dataset with the Louvain algorhitm to cluster genes in communties. My data included roughly 10.000 genes from 155 samples. I would like to analyse and in the best case visualize the communties. Right now my code caps out at 2000 genes. I already filtered the genes to only use sognificant ones and ones with higher logFC than 1. Reducing the number to 7000. I would really appreciate if anybody can recommend some methods or packages. Thank you so much! If my code or the source of the dataset is needed, I will gladly provide more Info!
Dear All,
I am writing this post cause I am giving up on this super simple task. I have an excel sheet of genes found mutated in our samples, according to one analysis these genes were found to be of unknown significance, in terms of being pathological or not.
Now I need to cross reference this huge list of genes with other databases like dbSNB or comsmic or whatever but I really cannot get my hand on one clear data base that will allow me to compare if these mutations are truly of unknown significance.
Any help?
PS: the bioinformatician who did the analysis is out of reach, and I don't have access to the raw data, the only thing I have is this excel sheet.
Hi everyone !
For my analysis, I'm using mmseqs2 (bash command) to search for homologous sequences using the easy-rbh (reciprocal best hit) option. In the output obtained, I only get the identifier of my best hit, as well as the hit values (e-value, identity, etc...) but I'm wondering if there's a way to also get the fasta sequence associated with this identifier directly in the output file ? This would allow me to avoid wasting time matching files in my database.
Thanks in advance ;)
for a school project I am using Artemis to analyze thee bacteria. I have downloaded whole, completed genomes for these thee, and am noting down what genes have been annotated with interesting function descriptions. But what i truly wanna know is what these bacteria do as proven by their genome. These bacteria should inhibit nitrogen and carbon emissions from manure, and I'd love to indicate this function from their genome. But I am not sure how to.
currently, I am using NiH database to get my info and KAAS (KEGG) (genome.jp) to BLAST the sequence to get pathways. But KEGG only shows me 2 pathways when i run it through, and both are quorum sensing. What engine can i use, or what can i do to get a clear function read from the genomes?
Help,
An overachieving bachelor student.
I am trying to learn sanger sequencing (how to interprate the chromatogram). In lab we inserted mCherry gene into the PLAC plasmid. Restriction enzyme digestion indicated a succesful insertion. The sequencing results came back lite this:
Might sound like a dumb question but I am new to this.
region before prime
region after
Dear expert,
I am Sonal MSc second year Bioinformatics student. I am writing to discuss an issue I encountered while working on building a peptide/fungal membrane complex for molecular dynamics (MD) simulation on CHARMM - GUI
Specifically, I am facing difficulties in incorporating MIPC (Mannose-(inositol-P)2-ceramide) lipid into the membrane model. MIPC is an important glycolipid component for fungal membranes, and its inclusion is crucial for accurately representing the membrane environment in our simulations.
I would appreciate your guidance on how to effectively add MIPC lipid to the membrane model. One suggestion is to categorize MIPC under the ceramide category, considering its structural and functional similarities. Alternatively, if there are other lipid options from the existing list that you recommend as alternatives to MIPC, I would be eager to explore those possibilities.
Thank you very much for your attention to this matter. I look forward to your valuable advice and suggestions.
We tried making the MIPC by myself in CHARMM-GUI below is the picture which did not work out.
Hi all,
I am running a FGSEA using, "https://stephenturner.github.io/deseq-to-fgsea/". I was looking through my data as well as R-studio to figure out this issue.
In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are ties in the preranked stats (4.08% of the list).
Even though the 4% may seem relatively small, some pathway goes from signficant to non-signigifcant. Using any duplicate feature on R, it returned as zero.
Does anyone know how to resolve this issue? Maybe a new ranking metric; if so how would I rewrite the code for FGSEA?
I am currently a university student using software provided to me by my lab, and am on the cloud, having downloaded the fa file of my genome, now trying to index it. There is enough storage where I have placed the genome, however I am still getting an error when I do "bwa index ___.fa" after many iterations "Failed to allocate 3088336136 bytes at bwtindex.c line 158: Cannot allocate memory". any clue why this may be, or how I can fix it?
Sorry I know this is hyperspecific but I don't know where else to turn.
Very new to this.
Here is the example data (1-7 patient samples). Probes in the “non” region have CN=2. The depth at a probe is expected to be linearly proportional to the copy number of the DNA at that site. How would you normalize the data (if necessary) and is there a R package for this type of analysis?
I have noticed this before when I was looking at the Ensembl Zebrafish fasta files. There are 3 types of `gtf` file that Ensembl gives. I am looking at this page for mice:
Mus_musculus.GRCm39.111.abinitio.gtf.gz2023-10-08 14:49 110
Mus_musculus.GRCm39.111.chr.gtf.gz2023-10-08 14:49 31M
Mus_musculus.GRCm39.111.gtf.gz2023-10-08 14:49 31M
Which one is best to use with STAR for alignment the one ending with chr.gtf.gz or gtf.gz? The README explains what the abinito one is for but not the other 2. Any guidance is much appreciated.
hello,
I have a set of data that I wish to clean before analysis.
basically I have multiple sequences of different species and want to trim the edge of the sequences based on the primer sequence.
I am able to remove the primer sequence when it is just right at the edge of the sequence but when it's a bit further I can't figure out how to do it on R.
so basically I have
sequence <- "AGTCATGCGACTGCATA-rightprimer-AFTGACATACA
and I wish to obtain after trimming
sequence <- "AGTCATGCGACTGCATA"
however since there is a bit of parasitic sequence before the primer, the function matchLRPatterns doesn't want to trim the edges.
thanks !
It is lab week. Happy Lab Week!
The bioinformaticians at my place of employment were not invited to any of the lab week celebrations.
I wanted to wish everyone here a "Happy Lab Week" in case you were also forgotten.
Hello everyone I am working on molecular docking and ı want to download all pdb's together as pdb format according to search (name of protein, name of organism) on rcsb.org . can someone help me if there is a way to do it, how it can be done? thanks for any help
Hi everyone,
I'm currently engaged in a project where we aim to replicate the computational analysis of a paper that explored inter- and intratumour heterogeneity in metastatic breast cancer through single-cell RNA-Seq analysis. Our focus is to use different tools/pipelines compared to the ones used by the original authors. So far, we've used HISAT2 for alignment, sorting, and indexing, but we're exploring alternatives for the other stages of analysis.
We need a replacement for the rsubreads function (used by the authors to generate counts) and tools similar to the griph package for cell cycle correction. We aim to produce a count matrix using the different tools and then apply it in a Seurat pipeline for PCA, differential gene expression analysis, and gene set enrichment analysis.
Can anyone recommend tools that are relatively easy to learn and efficient to use? Time is of the essence, and while we're keen on exploring methods, we can't afford a steep learning curve right now. Your suggestions would be invaluable!
Thanks in advance!
Hi,
Is there a good way to open postQC seurat v5 object on v4?
Hi there,
Has anyone extracted RNA and/or DNA from small intestinal neuroendocrine tumours and sent for DNA/ RNA Seq analysis. What DIN/RIN values have you seen- anything above 5? just wondering if this is an isolated issue or related to the tissue itself. Tissue is usually fresh frozen on the day of surgery and extracted after one thaw or immediate extraction after transport on ice. Extracted with Qiagen kits.
Any thoughts/ help/ experiences would be useful
I am creating my Final project for my bachelor degree, my idea is creating a little "framework" that predicts drugs or molecular properties from smiles sequences(BBB permiability, toxicity, reactivity..), This part is working fine.
My question is how do I store this sequences on a Database efficiently?, my idea is that if I give one input sequence the database should output the top 5 most similar sequences.
I would appreciate if anyone know about a resource or could give me advice about the type of database or algorithm i should be looking for.
I'm interested in seeing how the knockout of one gene may affect the expression patterns of other genes (kind of like an effect size), and I wonder if the distance between two given proteins in Reactome may be a good predictor. Not sure how I would define distance, one trivial definition may be the number of nodes between the pair. Any help would be appreciated!
Dear fellow Redditors, I was asked at my lab to predict a 3D structure of a peptide derived from a protein (for which the structure was resolved). I have a little experience with that, but started the research.
I ended up using AlphaFold, PEPstrMOD and PEP-FOLD for this purpose. Well, nearly each best model of the three software looked different. I performed the RMSD (by aligning the models) and the values were high (from 4 to over 15). They rather did not look alike. Sometimes softwares returned similar models for other peptides I've tried (but at most 2 of 3 models were kind of similar).
I also looked at the Ramachandran plot. There were some outliers. I used the tool to refine the structure, but looking at the 3D model, not much changed - not the general peptide overview, but rotations of some bonds.
I do not know if I am thinking right, but to me for the model to be correct, all the three software for modeling would have to agree (ergo, RMSD values will be low, around 1). I would feel calmer.
I've found information that peptides are "highly flexible molecules", "do not assume a single defined structure in solution", "have high conformational flexibility".
Is there any way to predict 3D structure of a peptide (7-25 aa) at all? What should I aim for? How can I choose the best model? How can I compare models from different sources? Any hints will be highly appreciated, as I am lost.
Hello,
I recently got send .cloupe files and I would like to analyse them in R (just starting to learn it). Is there a possibility to convert .cloupe back to a different file type, as I would like to use R to analyse it. I have been googling for quiet some time but I cannot find a solution.
Thank you in advance
I am doing a fully computational Bioinformatics PhD, and I have the option to be co-supervised by a pure wet lab PI and a pure dry lab PI. This is an existing collaboration between the two PIs and they are co-writing grants on several projects. If I were to join these labs, I would be working on modelling data generated by the wet lab PI. I will not be doing any experiments. The wet lab PI offers outstanding clinical and biological knowledge and can interpret results while the dry lab PI is very methodologically heavy, but has little biological intuition.
The issue is that the computational PI is at a satellite campus and I will be situated in the wet lab space as this is where my graduate program is based. Therefore, I will only be able to work with the computational PI and their lab through slack and zoom.
I believe I can learn a lot from this joint mentorship, but I worry about the lack of computational grad students and postdocs available to help in person in the wet lab space. I'm certain I will get excellent biological knowledge and guidance, but lack in-person computational guidance.
I'm wondering if people have experience with these setups? Do you think this is a reasonable setup, or would I lose a lot from not having access to in-person computational support?
Hi! Big time newbie in bioinformatics. Is there a way to activate, or call up qiime2 on putty for microbiome analysis?