/r/bioinformatics
A subreddit dedicated to bioinformatics, computational genomics and systems biology.
science | askscience | biology |
microbiology | bioinformatics | biochemistry |
evolution |
news for genome hackers
- If you have a specific bioinformatics related question, there is also the question and answer site BioStar and the next generation sequencing community SEQanswers
- If you want to read more about genetics or personalized medicine, please visit /r/genomics
- Information about curated, biological-relevant databases can be found in /r/BioDatasets
- Multicore, cluster, and cloud computing news, articles and tools can be found over at /r/HPC.
/r/bioinformatics
Hi folks,
I am having a hard time with mapping a large number of sequences using Interproscan (through Galaxy) and it appears to be that the tool is not recognizing my sequences as valid. I am able to correctly map some sequences that look like this:
>d1f1aa_ b.1.8.1 (A:) Cu,Zn superoxide dismutase, SOD {Baker's yeast (Saccharomyces cerevisiae) [TaxId: 4932]}
vqavavlkgdagvsgvvkfeqasesepttvsyeiagnspnaergfhiqefgdatngcvsa
gphfnpfkkthgaptdevrhvgdmgnvktdengvakgsfkdslikligptsvvgrsvvih
agqddlgkgdteeslktgnagprpacgvigltn
They are in fasta format like this:
Interproscan works with them:
but when I attempt to input other sequences (from another source) that look like this:
>COX_I_YP_001518912.1_Cu_user;_Acaryochloris_marina_MBIC11017_(Bacteria/Cyanobacteria)
MTEAQAPHLEEVEVTPWREYFSFSTDHKVIGIQYLVTSFVFYLIGGLLAELVRTELATPASDFVPRETYNELFTMHATIMIFLWIIPTLTGGFGNFLVPLMIGARDMAFPKLNAIAFWIIPPTSILLLCSFFVGPASAGWTSYPPLSLMTNKAGEAIWILGVILLGTSSIMAGLNFLVTILKMRIPSMTLNDMPLFCWAMLATSALQLVATPVLSGAMVLLGFDLLVGTNFFNPAGGGDPIVYQHMFWFYSHPAVYIMILPAFGLISEILPVHARKPIFGYQAIAYSSIAISFLGLIVWAHHMFTSGTPDWLRMFFMIATMVIAVPTGIKVFSWVATVWGGKLNLCSAMLFGMAFVSMFVVGGLSGVMVASVPFDIHVHDTYFVVAHLHYVLFGGSVFGIYAGLYHWFPKMTGRMLNEFWGKVHFAMTFVGFNICFLPMHVLGLQGMNRRIAEYDPKFAALNVVCTIGSYILATSTIPFVVNAVWSWLAGPRANSNPWKGLTLEWTVPSPPPVENFEEDPVLAIGPYDYGTPKALDFVAATLAPAHALAAESLE
in the fasta like this:
The analysis fails. The bug report looks like this:
I tried simplifying the header format, but am unclear why this is still not being picked up as a valid sequence. Both files are correctly imported as fastas.
We are looking at genomic DNA between two populations, multiple individuals sequenced in each population. I pooled samples by phenotype using mpileup to get two .vcf files. One file is for a selected population, the second is for a control / unselected population. My system has a reference genome. My sample sizes are different between the two populations. To normalize my data at a genomic positions, I want to divide the depth of the alternate allele by the total depth at that position; resulting in proportion data for each value tested. I will do the same thing for alleles that match the reference genome.
My alternative hypothesis is that the frequency of a variant is different in the selected population than the control population. Basically, I want to find variants that differ between the two phenotypes.
My bosses suggested running a fisher exact test, but this cannot handle proportion data. Therefore, I need to look for analyses that can take proportion data. I’ve tried Chi-squared, but it can’t handle the zeros in the control (which I describe in the paragraph below). Are logistic regressions or generalized linear models appropriate for this type of data set and analysis? Are there more appropriate tests?
But I have a second issue. The genomic sequencing data we want to use was generated on an illumina MiSeq, which provides relatively small sample depth/coverage. Therefore, there are many instances in my dataset where the selected population has variants detected and the control popultion has 0 reference or alternate alleles at the position of the variant in the selected population. I could just ignore these positions, but it seems possible that if the variant is present in the selected but absent in the control, this position could be associated with our selected phenotype. Are there any tests that can handle these zeros, or do I need to just ignore them for the current analysis until I get a dataset with greater read depth at variant positions (an Illumina NovaSeq6000 run will be completed in the near future).
So, tl/dr:
Question 1) what are some standard / acceptable statistical tests I can run on a dataset that is normalized with proportional read depth?
Question 2) Are there statistical tests I can run to analyze a dataset with zeros at the control variant site? Can it also accommodate proportional data?
Hello! someone knows some easy way of calculating the mcdonald-kreitman test in r, from fasta data?
Hello!
I have a list of GO terms sourced from Uniprot. For example:
cardiac muscle cell development [GO:0055013]; cardiac muscle contraction [GO:0060048]; cardiac muscle hypertrophy [GO:0003300]; cardiac muscle tissue morphogenesis [GO:0055008]; cardiac myofibril assembly [GO:0055003]; detection of muscle stretch [GO:0035995]; mitotic chromosome condensation [GO:0007076]; muscle contraction [GO:0006936]; positive regulation of gene expression [GO:0010628]; positive regulation of protein secretion [GO:0050714]; protein kinase A signaling [GO:0010737]; response to calcium ion [GO:0051592]; sarcomere organization [GO:0045214]; sarcomerogenesis [GO:0048769]; skeletal muscle myosin thick filament assembly [GO:0030241]; skeletal muscle thin filament assembly [GO:0030240]; striated muscle contraction [GO:0006941]
Can I check if there are any standard ways for visualising this kind of information? It kind of looks like these terms are hierarchical - e.g. "response to calcium ion" seems like it would come above all of the muscle contraction terms. I was thinking a graph of nodes, with directed edges running from a parental term to a descendent term. I'm happy making the network, but am not too sure how to get the edge information - ideally I'd like to do this via the api.
Can anyone give me a head start before I start reading?
Thanks!
Edit: sorry I should have been clearer, this is all going be part of a python app with networks made on the fly, so I can't really use external tools. ideally I'd like to get this information via some kind of API call. Seen some interesting answers that I'll follow up on tomorrow, as it's quite late for me - thanks for the help!
Hi everyone, I am analyzing some full-length 16S sequencing data and I have a doubt.
For decontamination I use decontam. I have 3 extraction batches, each one with a negative control, and I don't know if I have to run decontam per-batch. I know that decontam have a batch option, but according to this issue ( https://github.com/benjjneb/decontam/issues/130 ) decontam uses all contaminants identified on all controls in all the batches (the only thing it does is to calculate the decontam scores per batch).
Decontam devs say combining all neg controls is the way to go, but I don't know if paring each batch with its negative control is more correct.
What do you think? Thank you very much!
Hello everyone,
I am using GEO data: GSE182001 to figure out differential gene expression of eosinophils in different tissues and disease states. Originally I was planning to start from the raw data, align it to a reference genome and analyze it in R. However, I have now realised that the raw data isn't labelled as the metadata just states: mixed samples. I have been working on this for a couple weeks now and I am not sure what to do. Does anyone know what I can do with this data to analyse differential gene expression in R and create different figures and plots? Any code and links to learning material will be greatly appreciated.
Thank you
Hey,
I am looking for a good alternative to BLAST. BLAST is not really ideal for me, since I have Ns in my input and AFAIR BLAST doesn't accept these. Also I want to specify a number of mismatches and in BLAST I would need to play around with the word size and the gap penalty and stuff, but I just want to use sth. like --allowed-mismatches 5
or so. Also It would be great, if I could define a region within a sequence where these mismatches are allowed and others, where no mismatches are allowed. So, I need sth. that takes this as input
sequence: CGTAGCATCGATCGATCGATNT # N can be any base
mismatches: up to 5 in the first 10 bases, none in the last 10
and output any area in a reference FASTA that would match these criteria (like BLAST), not just one.
Does anyone know such an aligner?
Hey everyone!
Looking for a bunch of bulk RNA-seq datasets to test some stuff on. Does anyone have any recommendations for publicly available RNA-seq data from interesting, impactful papers? I am more looking for data that supports a specific biological hypothesis (like, what is the effect of treatment X on Y cells at the RNA level) rather than something more general.
Hi. I am pursuing a hobby project in genomics as a learning exercise. I've already done a lot of reading about which tools to use. However for my particular situation I'm not sure which tools would even work on my data.
The idea is to scan a set of complete E coli genomes and see if it is possible to identify critically important genes a priori by finding those genes that have anomalously few SNPs across the whole population. I'm unconcerned as to whether the question is naive, already solved, or statistically infeasible; this is just a learning exercise.
I have 3000 complete E-coli genomes downloaded from the NCBI.
According to the best guide I found so far, the workflow I need to follow is:
Do assembly
do quality control
do alignment with bowtie
do variant calling with bcftools
However, I'm confused about a few things.
The complete genomes from NCBI are in fasta format (not fastq), and so there's no quality scores associated with them. However given that they are all complete published genomes (as opposed to reads) can I assume that the quality is high?
The description lines for the fasta files all look like this:
lcl|NZ_OZ040669.1_cds_WP_000673464.1_2 [gene=dnaN] [locus_tag=ABGU86_RS00010] [protein=DNA polymerase III subunit beta] [protein_id=WP_000673464.1] [location=1409..2509] [gbkey=CDS]
One can see that the data from NCBI has already been annotated and aligned. This would mean that feeding all of the fasta files into bowtie for alignment would be unnecessary, because I already know from the protein Ids which sequence I need to match between genomes. And bowtie only takes fastq files anyway!
Meanwhile BCFtools needs .bam files as input, however I can't figure out how to input bam files for the whole population of 3000 so that I can get allele frequencies as output in the .vcf files.
In summary:
If I want to get allele frequencies across my whole population, given that the data that I'm working with is in fasta format that has already been annotated with protein ids, what software should I use?
Thanks in advance for having patience with my newbie question.
Hey all, I am curious if there is a way to individually move a singular bootstrap value in iTOL so I do not have to re-adjust it by manually putting them in or doing it in something like photoshop. I have attached a photo for reference. Please let me know if this is possible. Also my first time making a phylogenetic tree and its been fun so far!!! Also are there arrows in this program for us to use to point at things? Thanks!
Hi everyone! I have RADseq Illumina data (paired-end) that I have aligned to a reference sequence using the Stacks pipeline. I want to do a phylogenetic analysis with an outgroup so I downloaded a whole genome of a species in the same family. My plan was to work through the first part of the pipeline with my sequences to create bam files and then do the same with my outgroup sequence. Because the whole genome is not paired-end, I don't think I can use BWA because it asks for 2 fasta files. Instead, I indexed the reference genome with bowtie2 and aligned the whole genome (outgroup) to that reference and then used samtools to convert the .sam file to a .bam file. So, I have bam files for my samples from stacks and my outgroup bam file from bowtie. Now, I have run into an issue with my outgroup file when I try to do SNP calling because
"Error: BAM file is not properly sorted; 2th record 'SRR7208975.1' at CM073125.1:33600596' should come before unmapped records.
Error: (At the 2th record in file '/media/nrsa-202/DATA3/Hartman/SpeciesDelimitation/v1.3/output/03_bam/mum.bam'.)"
Does anyone have any idea how to sort the bam file? Or, does anyone have any other suggestions on how to add an outgroup to the Stacks pipeline?
Hey yall! I'd like to train a neural network to identify where introns have been spliced out of the final mRNA so I can work backwards and write genes with non-ambiguous splicing patterns. I have a year of AI/ML exposure but I'm a little shaky on choosing the best network architecture for working with sequences. Any help would be much appreciated! Thanks :)
Hey Reddit, I’m a postdoc working in an unrelated field who was asked by metabolic engineering PI to assist with a project as I’m the only “computational guy” available.
I’m attempting to model heterogenous pathways in Ecoli using cobrapy and GEMs but am finding the accuracy (both qual and quant) to be unusable.
What are the next logical steps to move this from steady state mass balance to something more feasible so I can start at least qualitatively screening through gene deletions and potential hetero paths?
I’ve been in this space only for a couple months now so please be patient.
I have no prior experience in bioinformatics (or informatics at all) , but I've been able to solve most of my problems. I'm in a small lab and Noone else is able to help me with this. I have sair samples from different caves and we intend to visualize the different fungi species/genera found. I went from Oxford nanopore fastqs to kraken2 classification followed by bracken.
I want to visualize the data in a relative abundance heatmap. I've looked at tools like qiime2 (which is honestly intimidating) and metaphlan4, that does not use k-mer based classification.
I've looked at some pipelines on github that promise being able to go from kraken/bracken to heatmaps, or even from fastq to heatmaps, but I haven't been able to get them to work.
-> What would be the best way to achieve this, or to turn bracken outputs to some kind of table compatible with qiime2 or other softwares that are able to then form these heatmaps? I'd appreciate any guidance you can give me, either steps to take or direct me towards a paper that explains how to get here. Thanks in advance!
Hello everyone! 😊
This isn’t an advertisement or a job post—just a genuine hope to meet some like-minded people who are eager to grow and dive deeper into the technical world of bioinformatics.
I’m reaching out with a lot of humility and hope to connect with a few like-minded individuals who share a passion for bioinformatics. My goal is to find some friends and peers with whom I can exchange knowledge and skills in bioinformatics analysis, especially in replicating figures and tables from research papers to strengthen our practical abilities.
If anyone is interested in teaming up to learn and grow together, please feel free to reach out! Let’s build a strong team that helps each other deepen our understanding and become proficient in bioinformatics. Together, we can accelerate our journey into the technical world of bioinformatics and make learning even more enjoyable.
Looking forward to connecting with some amazing folks!
Hi, in order to annotate a mouse prostate tumor sample and a mouse spleen sample (spatial transcriptomics), what reference datasets in singleR could be used? any recommendations?
Thanks
I have generated a list of differentially expressed (DE) genes following bulk RNA-seq analysis. The lab I'm working with is interested in classifying these genes according to their association with specific stem cell developmental stages, namely 2i, epiblast, and post-epiblast.
Could you advise me on how to go about this classification? Is there a specific database or resource that I can utilize to determine which developmental stage these genes are associated with?
Thank you for your assistance!
Hi! I got a tricky problem: I generate a few million short DNA sequences (31-200 lenght, most are 31-40 long) from closely related organisms. Some of these sequences are part of HGT events between these organisms. I can BLAST (nr DB) the sequences but I'm not sure what a sufficient indicator for a HGT event in the BLAST results would be.
I for example found some results where my sequence was also part of a retrovirus affecting the organism which would be a strong indicator that the virus transferred the sequence into another closely-related organism. What other indicators might be there? I'm looking at everything, from description over similar organisms to coverage and percent identity.
Another idea is to compare my sequences to a HGT database of some sorts but the only one I found is the HGT-DB which is more than 20 years old and does not include my organisms. My input is coming from a phylogenetic analysis which means I can't build a tree and try to look at those results. Comparing GC content in 40-long sequences is also not something I would consider worthwhile.
Hi,
New to bioinformatics, interested in kidney cancer. I'm trying to repeat the findings from a paper I like (https://www.nature.com/articles/s41467-021-21068-9). They used the TCGA KIRC, TCGA KIRP, and TCGA KIRH (as well as some external data). They performed GSEA using the Hallmark gene sets (MSigDb). For simplicity sake, I've been using just TCGA KIRC.
I'm doing the analysis and the results honestly look really similar, expect for the fact that my NES values are flipped (i.e. HALLMARK_MYC_TARGETS_V2 has an NES of -2.5 but in the paper its a +2.5). The result is consistent, so I am not sure what I am doing wrong. Code below (R)
# Step 1: obtain the sarcomatoid/non-sarcomatoid definition from Bakouny, Z, et al.
phenotypefile <- read_excel("file.csv") # This file is the supplementary file from the paper which assigns sarcomatoid to specific TCGA samples
phenotypefile$sample_id <- gsub("\\-", ".", phenotypefile$sample_id) # Adjusts the sample_id column so that it lines up with the transcriptomic data
phenotype_kirc <- subset(phenotypefile, phenotypefile$study_id == "kirc_tcga") # removes the TCGA KIRH and TCGA KIRP
# Step 2: obtain the mRNA-seq data which is in KPM
expression_data <- read.delim("C:../kirc_tcga/data_mrna_seq_v2_rsem.txt", header = TRUE) #TCGA KIRC RNASeq data from cBioPortal
expression_data <- expression_data %>% select(-Entrez_Gene_Id) # Remove the Entrez Gene ID column
expression_data <- expression_data %>% select(-TCGA.DV.A4W0.05) # Remove the non-primary tumor sample
# Step 3: obtain the sample IDs from the column names of the expression file
sample_id <- colnames(expression_data)
sample_id <- sample_id[-c(1)] # remove the label of Hugo_Label
# There are mRNA expression data for 533 patient samples
# Step 4: Make phenotype_data with the row names corresponding to the sample IDs obtained from the expression data
phenotype_data <- data.frame(Sample_ID = sample_id)
phenotype_data <- left_join(phenotype_data, phenotype_kirc, by = c('Sample_ID' = "sample_id"))
phenotype_data <- phenotype_data %>% select(-study_id,-patient_id,-sarc_or_rhab,-rhabdoid)
phenotype_data$sarcomatoid <- phenotype_data$sarcomatoid %>%
replace(is.na(.),0)
# There are 533 total samples. 46 of these samples are sarcomatoid.
# Step 5: set Sample_ID as row name in phenotype_data and then remove it from the data frame
rownames(phenotype_data) <- phenotype_data$Sample_ID
phenotype_data$Sample_ID <- NULL # Remove the Sample_ID column now that it's set as row names
phenotype_data$sarcomatoid <- as.factor(phenotype_data$sarcomatoid)
# Step 6: clean up the expression data file
# Convert to data.table
expression_data_dt <- as.data.table(expression_data)
# Aggregate any repeats by Hugo_Symbol and calculate the mean for each gene
expression_data_dt <- expression_data_dt[, lapply(.SD, mean), by = Hugo_Symbol]
# There were 17 Hugo_Symbol genes listed twice
# Convert back to data.frame
expression_data <- as.data.frame(expression_data_dt)
# Remove any missing Hugo_Symbols (1 missing)
expression_data <- expression_data %>% filter(!is.na(Hugo_Symbol))
# Set row names and remove the Hugo_Symbol column
rownames(expression_data) <- expression_data$Hugo_Symbol
expression_data$Hugo_Symbol <- NULL
# Step 7: Log 2 transformation of expression data
log_expression_data <- log2(expression_data +1)
log_expression_data_matrix <- as.matrix(log_expression_data)
# Step 8: Extract the Hallmark Gene Signatures------------
hallmark_gmt_file <- "C:/.../h.all.v2024.1.Hs.symbols.gmt"
hallmark_gene_sets <- getGmt(hallmark_gmt_file)
hallmark_gene_set_list <- geneIds(hallmark_gene_sets)
# Step 9: make a named vector with the patient sample ID and sarcomatoid status -------------
sarcomatoid_status <- setNames(phenotype_data$sarcomatoid, phenotype_data$PATIENT_ID)
# Step 10: perform differential gene expression analysis ------------
t_test_results <- apply(log_expression_data, 1, function(x) {
t.test(x ~ sarcomatoid_status)$statistic
})
# Rank genes by t-statistic (higher means more differentially expressed)
test_gene_list <- sort(t_test_results, decreasing = TRUE)
# Step 11: perform GSEA--------
# Assuming hallmark_gene_set_list is a list where names are gene set names and elements are gene vectors
# Convert list to data frame
hallmark_gene_set_df <- data.frame(
gs_name = rep(names(hallmark_gene_set_list), times = sapply(hallmark_gene_set_list, length)),
gene_symbol = unlist(hallmark_gene_set_list)
)
# Perform GSEA
gsea_results <- GSEA(geneList = test_gene_list,
TERM2GENE = hallmark_gene_set_df,
pvalueCutoff = 0.05,
verbose = FALSE)
# Convert results to dataframe for viewing
gsea_results_df <- as.data.frame(gsea_results)
Hello, i am trying to build a hidden markov model for CpG islands, as it is the simplest in terms of parameters. Now i am trying to found a dataset of genome and CpG sequence to estimate the transition matrix between different state Q and an emission probability. But i had no luck in finding a dataset.
I have been struggling with rosetta fold for some time trying to use the pmutscan protocol with a mutants list to perform a saturation mutagenesis scan on the entire protein. My protein is a cytochrome tetramer. I want to perform a scan of all possible mutations at most spots(besides the residues where the heme binds) to find the mutations that stabilize the structure. Are there other software packages that can handle this task? preferably that can handle a protein with multiple chains.
Hi everyone,
I've received WGS files in fastq format from a company that does WGS for a human cell line (treated and untreated). Files are huge.
I've used samtools to create a) a BAM file and b) a BAM index file
When loading the BAM file into IGV (2.18.2) the files are not read properly.
Running samtools quickcheck gives this error:
sample_xzy.bam had no targets in header.
What am I doing wrong here? Is the conversion process wrong? Does the original fastq file not have enough information in it? Would appreciate any pointers that could help.
Best regards
Hi guys!
On my campus, everyone uses different alignment algorithms and, consequently, different apps. So here I am—what's the best alignment method when it comes to phylogenetic analysis on small genomes? I'm currently working on one and need the most convenient apps for my graduate work.
I have 9 soil samples. I am doing analysis of shotgun metagenomic data which near about 25 millions reads in each sample. Now after quality control I have tried megahit for assembly. I have done coassembly and then I have done metabat2, maxbin and concoct but it gives near 180 bins all together. After derep things, It gives hardly 5 bins of good quality. I have searched many literature and all the stuffs but I cant understand what I can do to solve the issue. my project is focusing pesticide degrading gene identification.
Also how fulfilling is Bioinformatics as a job and also sociably?
Is it possible to run sleuth for differential gene analysis but with just one biological replicate? I have a treatment group with three biological replicates and a non-treatment group but with one biological replicate. Will it work?
Hi, My friend works in a virology lab at a university, and I am currently developing an AI model for part of his research. He is willing to use this model as a pre-screening method for ChIP to reduce number of needed experiments. So I already have a prototype that can: Take a list of DNA sequences (either short sequences under 50 bp or longer sequences/promoters) and a list of possible transcription factors (TFs) and identify the probability of binding between each DNA sequence and transcription factor.
I have a couple of questions for those working in genomics or virology labs:
Hello everyone,
I have data representing mean fluorescence intensity (MFI) measured using a Luminex device. Due to the high number of samples, I measured them across four plates, each containing control samples as well.
I applied log base 2 transformation to the data, calculated the median of all values in each plate, and subtracted the median from all values in that plate. However, I am encountering zero or negative values in my results.
I would like advice on how to handle these negative values. Should I add a constant to shift all values? If so, what constant should I use, and at which step should this addition occur? Additionally, should the constant be the same across all plates, or should each plate have its own constant?
Thank you in advance for your help!
I have spatial transcriptomics data (Merscope and Visium) and I want to answer the following question: "do cells in the surroundings of cell type A express increased custom signature compared to cells in the surroundings of cell type B?". I believe that proximity analysis would help me in this case but I would like an opinion on which tool/workflow to choose and if you can recommend tutorial/publications. Thanks
I want to determine the maximum memory and maximum time for an assembly job. I’m wondering if I can predict it using a complexity analysis tool on the fastq.
Does anyone recommend either a way to predict assembly resource requirements from fastq or a complexity analysis tool specifically for fastq (not assembled fasta)?