/r/bioinformatics
A subreddit dedicated to bioinformatics, computational genomics and systems biology.
science | askscience | biology |
microbiology | bioinformatics | biochemistry |
evolution |
news for genome hackers
- If you have a specific bioinformatics related question, there is also the question and answer site BioStar and the next generation sequencing community SEQanswers
- If you want to read more about genetics or personalized medicine, please visit /r/genomics
- Information about curated, biological-relevant databases can be found in /r/BioDatasets
- Multicore, cluster, and cloud computing news, articles and tools can be found over at /r/HPC.
/r/bioinformatics
The basis behind minimap2 is several heuristic chaining before a final alignment (which is considered optional). Considering, string alignment is the gold standard taught in intro bioinformatics classes, what is the benefit or biological significance of chaining seed hits/chains together?
Is there any reason besides the performance benefit of chaining as opposed to alignment?
Hello, I have 600+ ASV’s generated from dada2, using the Batra 12S primer set from this paper (https://pubmed.ncbi.nlm.nih.gov/26479867/). I am targeting for amphibians, but these primers seem to pick up everything. The vast majority of those are actually bacterial and algal 16S sequences, but from just preliminarily blasting some sequences, there are definitely vertebrates present as well. I have spent months building RDP classifiers and fasta files to try and assign taxonomy down to species level, but with little success. I’ve built classifiers using rCRUX, and changed the formatting around to make it dada2 compatible, but the assignments are poor and don’t pick up on things I know are there. I’ve tried this to varying degrees including just amphibians and even including the entire nucleotide database. I’ve also used PrimerMiner to pull down all amphibian 12S sequences from GenBank and BOLD and combined fasta files for those, but the fit in dada2 is too specific when using straight fasta files and comes back with almost no assignments. I’m just not sure what the best way to go about this is, and I’m still new to this. I really don’t want to blast 600+ sequences. I do have the blast+ suite downloaded, along with a local copy of the nucleotide database (yes it’s huge). Maybe there is a way I could use this somehow? I’d love to stay in dada2 if possible, but I’m open to anything. Any suggestions on achieving species-level classifications would be welcomed at this point.
We decided not to concatenate sequencing files in the beginning of the pipeline. VCF files for algal DNA-seq data were acquired but there seems to be a lot of variation between the same sample and the two lanes it was ran in. Less than 50% of the variants appear with similar frequency and over 50% have wildly different frequencies among variants.
Might there have been a problem during sequencing?
New lecturer here, again, teaching subjects I have no experience in.
So, I was teaching the students how to align sequences using JALVIEW, and JALVIEW can can construct trees, should I keep working with JAL for phylogenetic tree building, or use MEGA?
Hello. I'm gonna check the physicochemical properties of my developed protein sequences. However, I'm seeing there are different tools to do so. But the outputs of most tools are varying from one-another. This resulted in a confusion to reach to a decision which tool should I rely on. I've sorted out two tools that seemed reliable to me for serving this purpose. Will be happy to know which of the following tools should I go for:
You can also suggest me other tools in the comment. Regards.
Looking to Reddit because I don't know where else to go...
I am a humble graduate student attempting to use the Mesquite program on my Macbook Pro to align multiple genetic sequences (in FASTA format). When I try to align using the automated tools (ClustalW, MUSCLE, or MAFFT, I have tried them all) nothing happens. I have downloaded these programs separately as binary files, I have the MUSCLE one as a Unix Executable file. I continually get this error message that says "error=86, Bad CPU type in executable". I have no Mesquite experience before this. Not really sure how to fix this, any help would be very very appreciated!! Thanks!
I have a file from GEO where RPKMs were generated from the ucsc mm10 gtf. On the otherhand, i have a normalized count matrix from my DESEq workflow. I want to combine these datasets and create a PCA plot to see how the samples in these datasets are similar.
I really need help because i am wondering is that even possible? Is there any links for a guide on this? The goal of this project we are doing in our lab is that we have ran deseq2 and we believe that the samples we have may correspond to developmental stages. We have then decided to do PCA with publicly available dataset.
Retrieving these dataset has proven difficult as they are not count matrix but rather RPKMs matrix or .bw etc from GEO.
Is there a way to retrieve these raw counts?
Edit: CGC (Cancer Genomics Cloud), not CDC.
I have some files under my account there that I want to access via API calls on R on my local machine, but the API calls only seem to return metadata about the files, not the actual contents of the files themselves.
Anyone have experience with this?
Hi. I have a few thousand whole genome sequences (from a parasite) that are around 100kb in length each. I want to explore the relatedness between these sequences. In our previous studies on smaller groups of samples, using multiple sequence alignment and visually inspecting phylogenetic trees allowed us to see that the sequences grouped on the tree in a way that closely reflected geographic origin. We would like to carry out a similar analysis based on our much larger cohort but I'm struggling to run my usual pipeline of MAFFT/trimAI on such a large dataset, even on a AWS HPC. Does anyone have suggestions of other tools that are better suited to large datasets, how to reduce the dataset, or any alternative approaches.
Thanks!
I am working on modelling carbon metabolism in the chemolithoautotrophic bacteria Cupriavadius necator. I plan to model how carbon dioxide enters the cell and is fixed by the CBB cycle.
At the time of writing this, I have modelled a basic Calvin Benson Bassham (CBB) cycle with included carbon dioxide diffusion mechanisms. However, the model does not reach steady state as it has no sources of ATP regeneration, and lacks a carbon outflow.
Despite many different attempts at achieving steady state, all have caused the model to break down. Listed below is the current setup for the cycle on Copasi:
This model is simple as I am fairly new to copasi, but when no outflow is included, the model works as expected but does not reach steady state (also expected).
I am aware how vague this may seem to those with more experience, but any help would be greatly appreciated.
The whole interface has changed, and is not showing any results even after uploading a pdb file. Is there any major update going on? How long will it take to get better? I have a final on Monday, and very much need PDBsum for that.
I'm trying to write a IGV like tool in R for fun. How does IGV visualise the reads? Should I map the reads first. I'm using a synthetic data where instead of nucleotides I'm using alphabets in random. I have made random read like sequence for this. I have generated a read count and made a table for unique read and count. I'm having trouble how to move forward.
Hello reddit, I am working on a gene analysis program and I was wondering if anyone could provide any insight into how you might go about aligning two genomes for closely related species so that they start in roughly the same place. I am aware that there are other programs out there that eliminate the need to do this, but I am attempting this as skill development to become competitive for graduate programs in bioinformatics. Is this something that can be done through an existing library (in Python, which I am using) or should I defer this to an existing program (such as ClustalOmega)?
Hello
I am trying to annotate hundreds of vcf.gz files with bcftools using this command
ls *.vcf.gz | parallel -j 200 "bcftools annotate -a dbSNP156.gz -c ID -O z -o {.}.rsid.vcf.gz --threads 1 {}"
When I open the annotated files, I see an ID column, but instead of rs ids I only see thousands of dots.
Why?
Help, please
Currently working on nanopore data, I realise running Fastqc is ideal for illumina and Pacbio reads. I’ve come across nanoplot, nanocomp and nanostat, are they a good alternative? Would you recommend running both Fastqc and the above mentioned nano alternatives? #bioinformatics#nanopore#illumina#fastqc
Hello!
I usually use Inkscape to assemble the different figures for papers because I can easily add the panels generated in R or Python in SVG format to the figure and make small changes effortlessly. Like when the wet lab team doesn't like the colors I chose for the stromal cells, I can adjust them without having to load 20Millon of cells again.
So, I was wondering if anyone could recommend an online or collaborative way to work on the same SVG-based image.
Thks!
Hi bioinformaticians, doing a postgrad in Bioinformatics so still getting used to this area and would appreciate a little help! Currently working on an assignment to reproduce the analysis of a previous RNA-seq paper (with quite vague methods) from their sequencing data.
We had to use RSEM (with Bowtie2 as aligner) for alignment and counts using the reference genome specified in the paper, but afterwards we found all 6 of our samples had ~63% successful alignment of reads. This doesn't seem great and there was no mention of this in the paper. It seems unlikely to me to be contamination of their original samples as they are all between 61-65%, so I'm thinking it's something to do with my alignment settings.
For the reference genome, RSEM requires a .gtf and .fa file, there are several versions of the reference genome the paper linked to. I used the genomic.gtf and genomic.fa versions, as it was the only gtf file in the directory, although there were rna.fa and rna_from_genomic.fa files too (this is all from NCBI GCF database).
Could the fact that I used a genomic reference instead of an RNA reference affect my alignment rate? If so, how can I use the RNA reference with this tool if there's no RNA gtf file? Please don't suggest using any other software tools instead of Bowtie2 and RSEM, I have to follow the same pipeline as the original paper.
Thanks very much.
Hi I’m trying to perform the NMF analysis, differential expression, drug targeting and WGCNA analysis on a couple of publicly available datasets. I have already started and I am using the publicly available raw counts available from GEO and TCGA. I am performed the batch effect removal using combat_seq and have continued my analysis since it worked well I would say. But what I’m wondering now in retrospect, is “is it okay to use raw counts?” Even tho the batch was removed successfully I could provide the PCA if needed. Sorry if this is something that is well known or something but I’m struggling with it and as far as I can see multiple published articles have used raw counts for their analysis. Thanks in advance!
Hello everyone, I am working on differential expression analysis for Multiformis using DESeq2. However, I encounter a strange summary after running the res
function. What I found strange is the equal number of upregulated and downregulated genes (a coincidence?), and that I observed zero outliers and zero low counts. Can someone explain whether this is normal or if there might be an issue with the preprocessing of my RNA-seq data?
out of 2804 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 788, 28%
LFC < 0 (down) : 788, 28%
outliers [1] : 0, 0%
low counts [2] : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
And when I used this command summary(res_all_times, alpha=.0001)
I got this:
out of 2804 with nonzero total read count
adjusted p-value < 1e-04
LFC > 0 (up) : 318, 11%
LFC < 0 (down) : 260, 9.3%
outliers [1] : 0, 0%
low counts [2] : 0, 0%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
Also, could you explain me what mean count < 0 does it mean?
Hey, I was just wondering if anyone has any advice on how I can fix this error saying that not all atoms have an autodock_element field. It appears on every protein I prep but has not just started recently. I download the pdb from the protein databank and do the usual prep (remove inhibitors and heteroatoms, remove water, add polar hydrogens, and add Kollman charges) but it still appears when I go to write the pdbqt file for any molecule. Any advice is appreciated
Hi everyone,
I've recently seen a volcano plot for the differential expression between two clusters (in single cell sequencing) that used a variable to represent the difference in number of cells that express each gene instead of the -log10(p value). I'd like to try this with my data but unfortunately I can't remember the paper where I saw this plot. Does anybody know what I'm talking about and can show me a reference where it's used?
Thanks!
I’m planning on doing an RNA-seq meta-analysis but not all studies provide raw data. In fact, some of the largest studies just provide their normalized counts. My original plan was just to get raw reads, then realign all to hg38, and use these new normalized counts in my meta-analysis. Because that’s not possible I was thinking of using the studies raw counts, converting the gene labels to a unified system and then do a meta analysis using either metaSeq (https://www.bioconductor.org/packages/release/bioc/html/metaSeq.html) or MetaRNASeq (https://cran.r-project.org/web/packages/metaRNASeq/index.html). My question is, will the fact that the studies have difference preprocessing pipelines be an issue still? Or because they’re be compared within studies and then just the differences are compared across studies it shouldn’t be as big an issue?
Let’s say I want to compare how a drug differentially impacts two cell types using single cell sequencing.
As a simple example, say I want to identify shared/unique dysregulated pathways between cell type 1 and 2 after the addition of a drug. I would first compare control and drug transcriptomes for cell type 1 type to get DEGs in type 1 due to the drug. Then would do the same for cell type 2. Then I would compare the lists of DEGs from cell types 1 and 2 to find which DEGs are unique vs shared.
My question is, would this be best performed with a discrete list of DEGs and GO, or with GSEA? Because DGE analysis gives me a discrete list, I can easily compare them and then use differential DEGs to find the shared/unique pathways through GO. But GSEA looks at all genes expressed, so I’m not sure how I would compare differentially impacted programs.
I would prefer GSEA because it is a more un-biased approach without an arbitrary p value cut off and takes into account the totality of gene expression. However, I don’t think I can use GSEA to compare differentially impacted pathways. Is there any way this is possible using GSEA or am I better to stick with DGE followed by overrepresentation analysis on unique and shared DEGs? Thanks for your advice in advance!
Anyone know of a program that will run on Windows that will detect copy number variants in a CRAM file? Must I go Python?
I frequently need to simulate molecular dynamics for my in silico drug design. But there are less facilities for the molecular dynamics simulation in my lab. Can anyone please suggest me what alternatives may I get?
Previously, we used WebGro for this purpose.
I just fall into this job but I need to show results asap so I'm sorry for this
I have control and one treatment (stress) for some plants and basically interested in how some specific genes and biological functions are differential expressed between control and treatment.
my question is: how to present those results?
I did Trinity De Novo assembly, ran Salmon, DESeq2 and EggNOG but now what? I was told I could use heatmaps, volcano plots, venn, GO enrichment, revigo...
And the most predominant doubt I have is where do I see the difference across treatments? deseq2 and eggnog produce tables and results kind of like just one thing mixed together? I mean I'm confused on how to actually say "hey here you can see the difference between treatment and control" you know what I mean?
literally anything that clears my mind will help lol thank you
Dear all,
I have been trying for weeks to get this to work, and it doesn't, which is why I am reaching out for help.
We let a service provider produce antibodies for us, based on a protein sequence we provide (= antibody protein sequence). The service provider reverse translates that amino acid sequence into cDNA and generates plasmids for transfection and protein production (= antibody cDNA sequence). The service provider then sequences the plasmids with multiple primers, obtaining multiple, ideally, overlapping reads per plasmid (antibody), to check that the sequence aligns with the in silico design. We also reveice these sequencing reads (= cDNA sequencing reads) in order to perform our own double checks on whether everything is correct. The format of the cDNA sequencing read is .ab1.
Example:
Antibody A is reverse translated into cDNA, synthesized and cloned into plasmid_antibody_a.
Plasmid_antibody_a is sequenced with 3 primers to span the whole coding region (to make it simple I'll only focus on the heavy chain locus of the anitbody), from which we obtain cDNA sequencing reads read_1.ab1, read_2.ab1 and read_3.ab1.
I am to use the .ab1 cDNA sequencing reads and align them to the corresponding antibody sequence (protein), in this case antibody A. I do not want to use online tools (company sensitive information) or software because we are looking at over a 100 sequences and I want a high throughput, automated approach, which is less error-prone and robust.
Using some CLI and ptyhon I am following this general pattern:
While this works fine in cases where the cDNA sequencing reads are highly similar, it does not for sequencing reads that are very different from each other. Imagin this situation:
read_1.ab1, read_2.ab1 and read_3.ab1 all have different amino acids at a given position, in the merged consensus sequence this will be indicated as a gap or x because there is no consensus, and when aligned to the antibody protein sequence result in a alignment mistake. At the same time, one of sequencing reads actually has the correct amino acids as the original protein sequence, but this gets lost during the merging.
Another approach I tried first, is to align all sequening reads under each other to the original protein sequence, like a typical mafft or clustal alignment. However, this makes it really hard to (i) see whether the whole protein sequence is covered, (ii) find the differences between all sequencing reads since you have to manually look at each sequence read and (iii) to do this with more than 100 sequences.
Is there some other, smarter, faster, better way/tool that I am missing? I can imagine that others might have done something similar, so I don't want to reinvent the wheel (which I feel that I am already doing).
I am happy to hear your feedback!
Hi all! I recently completed my MSc in bioinformatics and I've noticed the job market getting increasingly saturated and I'm finding it difficult to secure an interview. I understand that my lack of non-academic experience may hinder me, and many applicants will likely have a better understanding of certain job specifications than myself. I am simply looking for advice on dealing with burnout and not being discouraged by the 100s of people applying for the same job. Imposter syndrome type deal you know?
Hey all! I am a Bioinformatician PhD student working on a Ubuntu system on scRNAseq. I have some Visium HD data and I have applied the Spaceranger count pipeline. Now the data I have to transfer the Spaceranger output to a fellow colleague who will do the downstream analysis.
We don't not have a server/cluster and I am struggling with transferring the data to my colleague. We tried copying it in a hard disk and using an FTP transfer but some files were corrupted and/or missing. Would anyone have any suggestion how to fix this simple yet complicated issue of transfering the Spaceranger output?
Thanks a lot!