/r/bioinformatics
A subreddit dedicated to bioinformatics, computational genomics and systems biology.
news for genome hackers
- If you have a specific bioinformatics related question, there is also the question and answer site BioStar and the next generation sequencing community SEQanswers
- If you want to read more about genetics or personalized medicine, please visit /r/genomics
- Information about curated, biologically relevant databases can be found in /r/BioDatasets
- Multicore, cluster, and cloud computing news, articles and tools can be found over at /r/HPC.
Guys, I'm a first-semester molecular biology student and I'd like to know the pathway to becoming proficient in bioinformatics within 1 year, because after that I need to apply to different labs and find a job. Can anyone help me with this?
Hey!
I'm studying two populations, one that lives on islands and one that lives on the mainland, and we differentiate them at the subspecies level. I have scaffold level genomes for both subspecies, and only the island population is annotated. Any suggestions on routes for synteny analysis? End goal is to decide which genome to use for GATK.
When you have a novel gene fusion that is not known in the literature, how do you interpret it simply by looking at the breakpoints and domains? What would the correct approach look like: what does one have to look out for, what is important, and so on? I checked whether this fusion is known in the literature, but literally nobody has ever mentioned it. The gene fusion occurred in a patient with melanoma. Please see attached. Thank you so much for your help. I am new to bioinformatics and a bit lost in this case.
Hello, I am fairly new and not really experienced in bioinformatics and genomics.
I have one FASTA file of a bacterial isolate. I was wondering what are the different things I can do with this?
So far I have identified it using PubMLST, and I have run Prokka and Abricate.
I want to learn to use newer tools. I would appreciate any suggestions and help getting into bacterial genome sequencing and bioinformatics.
PS - I use Linux which I am learning to use as well
Hello,
I am trying to understand RNA deconvolution. I started reading review papers and feel overwhelmed, so I wanted to ask for help.
What are the current methods used? I found around 50 technologies, but I don't know which of them are actually used. Which of them are the current standard?
Are there any good papers I should read to get into the topic? I am especially interested in texts explaining the theory and math in detail.
Thanks.
Hi all,
Recently I performed a quite comprehensive single cell sequencing analysis: 5', GE, TCR with antibody hashing, overall a beautiful experiment. I have been analyzing the data with CellRanger v9.0, however I am running into the challenge that I can't annotate the cell type of my wells and samples.
Is annotation for mouse samples supported by CellRanger, or do I have to move to a different tool? On the same note, is aggregation supported for hashing experiments?
Thanks a lot <3
Hi All,
Has anyone here integrated single-cell data from multiple species using Seurat and Harmony? If so, is this approach generally recommended?
I was considering identifying orthologous genes between each species and human, then merging the datasets and performing integration using Harmony for downstream analysis. I’d love to hear about any experiences or insights others might have on this approach.
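The ortholog step described above can be sketched in plain Python. This is only an illustration, assuming the ortholog table has already been exported as (species gene, human gene) pairs and that only one-to-one orthologs are kept; the function names and the toy identifiers are made up for the example, and the integration itself (Seurat/Harmony) is not shown.

```python
from collections import Counter


def one_to_one_orthologs(pairs):
    """Return {species_gene: human_gene} for pairs where both genes occur exactly once.

    `pairs` is a list of (species_gene, human_gene) tuples from an ortholog table.
    """
    species_counts = Counter(species for species, _ in pairs)
    human_counts = Counter(human for _, human in pairs)
    return {
        species: human
        for species, human in pairs
        if species_counts[species] == 1 and human_counts[human] == 1
    }


def rename_to_human(counts, mapping):
    """Rename {gene: per-cell counts} keys to human symbols, dropping unmapped genes."""
    return {mapping[gene]: values for gene, values in counts.items() if gene in mapping}


# Toy ortholog table (made-up identifiers): m3 and m4 both map to H3, so both are dropped.
pairs = [("m1", "H1"), ("m2", "H2"), ("m3", "H3"), ("m4", "H3")]
mapping = one_to_one_orthologs(pairs)
counts = {"m1": [0, 2, 5], "m3": [1, 1, 0], "m9": [4, 0, 0]}
print(rename_to_human(counts, mapping))
```

Restricting to one-to-one orthologs is a conservative choice: it avoids ambiguous gene names after merging, at the cost of discarding genes with duplications in either species.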
Thanks!
Hi,
I'm trying to wrap my head around nf-core/nextflow, and have read and followed many of the tutorials online that write basic nextflow workflows that kinda touch 1-2 tools. However, I haven't been able to find a tutorial/guide on a larger pipeline, where outputs are chained (output from one goes as input to one or more downstream modules), or even how to manage a sample sheet, break it down into a map, tuple etc.
I've kinda written a test pipeline that I had to really play around with to manage my sample sheet (input of sample, some bams, and some sequences of interest) and it feels kinda clunky for short workflows.
What's really confusing is: how do I actually use an nf-core module? I have installed a few, such as HSMetrics, but how do I supply the proper inputs to the module in my workflow? From what I can tell, the module is just a bit of wrapper code and not really an image or anything, so I would still need to have picard installed (which is fine, I already do).
Hello everyone,
While working with RNA-seq data during the mapping step using STAR, I observed an unusually high percentage of reads mapped to multiple loci (80-90%) against the Candidatus Nitrosocosmicus franklandus C13 genome. Additionally, during the quality control step with FastQC, I noticed an extremely high duplication rate (~90%) across all samples.
Could anyone help explain these unexpected results? Any insights or suggestions would be greatly appreciated!
Hi, I've generated a FASTA file using writeXStringSet in R. Afterwards, I used awk to transform the multi-line sequences into single lines:
awk '/^>/ {printf("\n%s",$0);next; } { printf("%s",$0);} END {printf("\n");}'
However, I saw differences in the length calculations between awk and R:
awk '/^>/ {if (seqlen) print seqlen;print;seqlen=0;next} {seqlen+=length($0)}END{print seqlen}'
I don't know why these discrepancies arose.
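Two things might be worth checking here. First, as written, the header format "\n%s" in the first awk command has no trailing newline, so the header and its sequence end up on the same line (the commonly used form is "\n%s\n"). Second, awk's length($0) counts every character on the line, including a carriage return from Windows line endings or trailing spaces. A minimal Python sketch for cross-checking the lengths while ignoring such characters follows; the function name is made up for this example.

```python
def fasta_lengths(lines):
    """Return {header: sequence length}, ignoring whitespace and line endings."""
    lengths = {}
    header = None
    for raw in lines:
        line = raw.strip()  # drops "\n", "\r" and surrounding spaces
        if not line:
            continue  # skip blank lines, such as a leading empty line
        if line.startswith(">"):
            header = line[1:]
            lengths[header] = 0
        elif header is not None:
            lengths[header] += len(line)
    return lengths


# Toy input mixing Windows line endings, a blank line and a trailing space.
example = [">seq1\r\n", "ACGT\r\n", "AC\r\n", "\n", ">seq2\n", "GGG \n"]
print(fasta_lengths(example))
```

For a real file, the same function can be called on an open file handle, e.g. `fasta_lengths(open("sequences.fasta"))`, and compared against the widths reported in R.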
Hello! I have a bunch of molecules, of which only the molecular formula and SMILES are available. PubChem and other databases do not have 3D structures but 2D structures. Can anyone suggest tools or a way to generate or estimate the 3D structures of the compounds?
I recently started working with PrimalScheme (didn't realize it was used for designing ARTIC primers for COVID sequencing—really cool!) to design primers for sequencing the whole genome of the rabies virus from clinical samples. These samples were detected as positive by ELISA, and we validate further using a PCR that targets the G gene.
To design primers, I started with the highest percent match from a BLAST search as my reference genome. Then, I downloaded a set of genomes from NCBI and used FastANI to identify highly similar sequences. From this, I selected about 10 representative sequences and fed them into PrimalScheme. After some trial and error, I managed to get a set of primers divided into 2 pools that cover about 98% of the genome.
I then used PrimerPooler to evaluate the pools for potential primer dimer issues, and everything checks out there with deltaG values mostly positive. The Tm values are within 5°C of each other, GC content is within acceptable ranges, and the primer lengths are consistent.
At this point, I'm ready to order the primers from IDT, but I feel like I might be overlooking something. Is there anything else I should check before proceeding?
Nobody seems to think this is a big deal, but I do and can't figure out what's going on.
In all 16 samples, greater than 99.5% of all sequences were grouped into a family size of one (UMI-based grouping), meaning they're all unique. I used qualimap to generate mapping QC metrics against hg38; the duplication plots show that greater than 99.5% of the sequences are not duplicates.
But in the library prep, there were nine PCR cycles. At the very least, we should be seeing duplicates, right? It feels crazy to me that the PCR failed somehow for 16 samples in a row. Something feels off, and I don't know what.
I'm following this pipeline: fgbio
Y'all got any idea what to try? I might try removing the UMIs entirely (+5 nucleotides), remapping with another tool, and seeing if we still get the same results. Maybe I'm mishandling the UMIs somehow or missing something fundamental?
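One way to sanity-check the family-size distribution independently of the pipeline is to tally reads that share the same mapping position and UMI. The sketch below assumes each read has already been reduced to a (chrom, pos, strand, umi) tuple and uses exact matching only, so it does not tolerate UMI sequencing errors the way dedicated tools can; the function name and the toy reads are invented for illustration.

```python
from collections import Counter


def family_size_histogram(reads):
    """Map family size -> number of families, grouping by exact (chrom, pos, strand, umi)."""
    families = Counter(reads)          # reads per family
    return Counter(families.values())  # families per size


reads = [
    ("chr1", 100, "+", "ACGTA"),
    ("chr1", 100, "+", "ACGTA"),  # same position and UMI: counted as a duplicate
    ("chr1", 100, "+", "TTGCA"),  # same position, different UMI: a distinct molecule
    ("chr2", 500, "-", "ACGTA"),
]
hist = family_size_histogram(reads)
singleton_fraction = hist[1] / sum(hist.values())
print(dict(hist), round(singleton_fraction, 2))
```

If nearly every family is still a singleton with this kind of independent count, one possibility (not a diagnosis) is that the sequencing depth is low relative to library complexity: PCR copies only show up as duplicates when the same original molecule is sequenced more than once.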
In the near future, I'll get a bunch of forward/reverse reads (in .ab1 format) from Sanger sequencing, and I'm looking for a free/open-source command-line workflow to process them.
I was thinking about using a Biopython script to trim according to the quality scores provided by the ab1 files. The script would then call MUSCLE via subprocess (I'm running Linux) to align the forward and reverse seqs (after converting them to FASTA format).
I've got 3 questions:
EDIT: each pair of forward and reverse seqs corresponds to an amplicon (COI DNA barcode) of a species to identify, so I have no "reference sequence" for the alignment.
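The trimming step described above could look roughly like this. It is a plain-Python sketch of a simple sliding-window trim, not the algorithm of any particular tool; the threshold, window size and function names are arbitrary choices for illustration. Biopython's SeqIO can read .ab1 files with the "abi" format name and exposes the scores as `record.letter_annotations["phred_quality"]`, which would supply the `quals` list used here. The reverse read needs to be reverse-complemented before it is aligned to the forward read.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def trim_by_quality(seq, quals, threshold=20, window=10):
    """Trim both ends until a window of `window` bases has mean quality >= threshold."""
    n = len(seq)
    if n < window:
        return ""
    start = None
    for i in range(n - window + 1):  # scan from the 5' end
        if sum(quals[i:i + window]) / window >= threshold:
            start = i
            break
    if start is None:  # no window passes: discard the read
        return ""
    end = start + window
    for j in range(n, start + window - 1, -1):  # scan from the 3' end
        if sum(quals[j - window:j]) / window >= threshold:
            end = j
            break
    return seq[start:end]


# Toy read: two low-quality bases at each end.
seq = "NNACGTACGTNN"
quals = [5, 5] + [30] * 8 + [5, 5]
trimmed = trim_by_quality(seq, quals, threshold=25, window=4)
print(trimmed, reverse_complement(trimmed))
```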
I just noticed that the HMMER online tool is down and I can't find any information on when it'll be back. In the meantime, are there any alternatives with the same functionality?
I have RNA-Seq counts data from 10 samples (5 condition and 5 control) along with corresponding protein isoform measurements. I normalized the RNA-Seq counts using VST and RLD, and the protein isoform measurements were converted to z-scores. I attempted both Spearman and Pearson correlations, but neither resulted in statistically significant correlations or extremely low p-values. I suspect this is due to the small sample size. I would appreciate any suggestions on how to proceed.
Hi all! I have a dilemma choosing between two software packages, Geneyx and VarsomeClinical, for finding disease-specific genes in short/long sequence reads. Both look like they can do similar things, but I don't know which to choose. For now it is for research purposes, but we will switch to clinical use very soon. As for price, Varsome is approximately double.
Thank you for any help!
Could anyone suggest tools or web servers for prophage detection in bacterial genomes?
Hello everyone,
I am trying to split my Visium spatial fastq file into many fastq files according to barcodes. So my desire is to have a fastq file for every barcode. My barcode.txt file is something like this...
sp1 AAACAACGAATAGTTC
sp2 AAACAAGTATCTCCCA
sp3 AAACAATCTACTAGCA
sp4 AAACACCAATAACTGC
...the R2 file is something like this...
@SRR19762149.14266 NB552055:200:HWCH5BGXH:1:11101:26830:2925 length=90
AATGCAAACAGTACCTAACAAACCCACAGGTCCTAAACTACCAAACCTGCATTAAAAATTTCGGTTGGGGCGACCTCGGAGCAGAACCCA
+SRR19762149.14266 NB552055:200:HWCH5BGXH:1:11101:26830:2925 length=90
AAAAAE/AEEEEEEEAEAEEEE<EEEEE//<EEE/EEEEEEEEEAEEEAEEE<AEEAE////6EAAA/EE/EA<EEAE/<EEEE//EEA/
...while the R1 file is something like this:
@SRR19762149.13421 NB552055:200:HWCH5BGXH:1:11101:23424:2818 length=28
CTCCGAGTAAATCCGCTCCTCAGTTGAC
+SRR19762149.13421 NB552055:200:HWCH5BGXH:1:11101:23424:2818 length=28
AAAAAEEEEEEEEEEEEEEEEEEEEEAE
I tried this command in Linux from https://github.com/Debian/fastx-toolkit/tree/debian/unstable (with both --eol and --bol):
zcat R2.fastq.gz | ./fastx_barcode_splitter.pl --bcfile barcodes.txt --eol --exact --prefix /output/ --suffix "_R2.fastq" --debug
But unfortunately it keeps on saying:
"matched barcode: unmatched"
I also tried https://bitbucket.org/princeton_genomics/barcode_splitter/src/master/ but again no luck :(
Could you kindly help me to find a solution, please?
Thank you so much in advance!
Matteo
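A possible explanation, judging only from the reads shown: R1 is 28 bases long, which is consistent with a layout where the barcode sits at the start of R1 (followed by a UMI) rather than in R2, so searching R2 for the barcodes would report everything as unmatched. Below is a minimal Python sketch that assigns each R2 record by the barcode found in its R1 mate. It assumes the first 16 bases of R1 are the spot barcode and that R1 and R2 list the reads in the same order; the function names and toy records are invented for the example.

```python
import itertools


def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from an iterable of FASTQ lines."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in itertools.islice(it, 4)]
        if len(record) < 4:
            return
        yield tuple(record)


def split_r2_by_r1_barcode(r1_lines, r2_lines, barcodes, barcode_len=16):
    """Group R2 records by the barcode found at the start of the paired R1 read.

    `barcodes` maps barcode sequence -> spot name (e.g. parsed from barcodes.txt).
    Reads whose barcode is not listed are collected under "unmatched".
    """
    groups = {}
    for r1, r2 in zip(read_fastq(r1_lines), read_fastq(r2_lines)):
        name = barcodes.get(r1[1][:barcode_len], "unmatched")
        groups.setdefault(name, []).append(r2)
    return groups


barcodes = {"AAACAACGAATAGTTC": "sp1", "AAACAAGTATCTCCCA": "sp2"}
r1 = ["@read1\n", "AAACAACGAATAGTTCACGTACGTACGT\n", "+\n", "I" * 28 + "\n",
      "@read2\n", "TTTTTTTTTTTTTTTTACGTACGTACGT\n", "+\n", "I" * 28 + "\n"]
r2 = ["@read1\n", "GATTACA\n", "+\n", "IIIIIII\n",
      "@read2\n", "CCCC\n", "+\n", "IIII\n"]
groups = split_r2_by_r1_barcode(r1, r2, barcodes)
print({name: len(records) for name, records in groups.items()})
```

For real data the line iterables could come from `gzip.open(path, "rt")`, and each group would be written to its own file instead of being held in memory.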
Hi, I'm exploring different aligners and their parameters, and I would like to ask what kinds of evaluation should be done for long-read data. Would mapping quality be enough, e.g. calculating the average mapping quality, or do other statistics need to be considered? Which software would you recommend for these evaluations?
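As a starting point, a few statistics besides mean mapping quality can be pulled straight from SAM output: the fraction of reads mapped and the number of supplementary (split) alignments. The sketch below parses SAM text lines in plain Python using the standard FLAG bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary); the function name and toy records are made up. Keep in mind that MAPQ scales are aligner-specific, so mean MAPQ is not directly comparable between different aligners.

```python
def summarize_sam(lines):
    """Summarize primary alignments from an iterable of SAM text lines."""
    primary = mapped = supplementary = 0
    mapq_total = 0
    for line in lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x800:  # supplementary alignment (split read)
            supplementary += 1
            continue
        if flag & 0x100:  # secondary alignment
            continue
        primary += 1
        if not flag & 0x4:  # bit 0x4 set means the read is unmapped
            mapped += 1
            mapq_total += mapq
    return {
        "primary_reads": primary,
        "mapped_fraction": mapped / primary if primary else 0.0,
        "mean_mapq": mapq_total / mapped if mapped else 0.0,
        "supplementary": supplementary,
    }


sam = [
    "@HD\tVN:1.6\n",
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\t*\n",
    "read1\t2048\tchr2\t500\t30\t5M\t*\t0\t0\tACGTA\t*\n",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t*\n",
    "read3\t16\tchr1\t200\t20\t4M\t*\t0\t0\tACGT\t*\n",
]
stats = summarize_sam(sam)
print(stats)
```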
As you can see, I have the assays, but I cannot start to integrate.
> obj@assays
$RNA
Assay (v5) data with 41569 features for 15368 cells
Top 10 variable features:
CLC, HBB, LYZ, PEAR1, CTSG, PF4, IL17RE, BOLL, ELANE, PRTN3
Layers:
counts, data, scale.data
> obj <- IntegrateLayers(
object = obj, method = RPCAIntegration,
orig.reduction = "pca", new.reduction = "integrated.rpca",
k.weight = 40,
verbose = TRUE
)
Computing within dataset neighborhoods
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04s
Finding all pairwise anchors
| | 0 % ~calculating
Error in UseMethod(generic = "Assays", object = object) :
no applicable method for 'Assays' applied to an object of class "NULL"
> Assays(obj)
[1] "RNA"
I got data from GEO (GSE189161). There is a file named metadata; I read the file and tried to load it into a Seurat object.
> seurat_obj$meta.data <- metadata
Error in `[[<-`:
! None of the meta data requested was found in the data frame
Run `rlang::last_trace()` to see where the error occurred.
I want to know how to solve this.
I have 8 BM samples and I'm not sure how to analyze them after UMAP creation. How do you classify clusters?
Can I just map a few markers we are interested in? I have a whole marker list for some BM tissues but it always gives me an error probably because not enough genes can be mapped onto clusters.
Also my PI wants me to compare a few samples against each other. How do you do comparative analysis in Single cell ATAC seq? Is it just checking the difference in types of cells mapped in the UMAP between 2 samples?
Any helpful links? The Satija Lab doesn't have an ATAC-seq-only tutorial.
Context: I have little knowledge of molecular modeling but have some years of bioinformatics experience as an engineer (mainly working with omics data).
I recently tried AlphaFold 2 but got lost very fast. I also have a good idea of how ColabFold works but have never used it myself (I plan to start there).
Basically, I am interested in protein-protein interactions, something that could help me find potential targets to test with laboratory methods such as two-hybrid or cross-linking.
So, I was wondering if any of you have experience integrating these tools into your studies, and maybe published the results along with other data (RNA-seq, ChIP-seq, metagenomics, or the ones above)?
Also, don't hesitate to share if you know some good resources about it!
Cheers
I was trying to reinforce my manual annotation of scRNA-seq data through reference mapping and label transfer from a well-annotated dataset. There are a lot of atlases for human datasets, but I am working on mouse samples. The only source of mouse references I know is https://cellxgene.cziscience.com/collections , but I cannot find a satisfactory one that matches my own dataset, which is mostly immune cells from autoimmune models. Does anybody know of other good resources for well-annotated reference atlases?
Hello everyone,
I am currently working on a protein-protein docking study where both the proteins are metalloenzymes containing iron-sulfur/iron clusters. Additionally, the interaction involves a monomeric protein binding to a multimeric protein complex, which adds another layer of complexity. I’ve been finding it challenging to navigate this, as the available literature and resources on such systems are quite limited.
I’d like to seek advice on the following points:
Docking Tools: Are there specific docking tools or software that can effectively handle metalloenzymes, especially those with iron-sulfur clusters?
Monomer-Multimer Considerations: What is the best approach for modeling interactions between a monomer and a multimer? Should the multimer be treated as a rigid body, or is it better to explore flexibility in the subunits?
Validation of Results: How can I verify the biological plausibility and structural accuracy of the docking results? Are there specific metrics or workflows commonly used for such complex systems?
Recommended Literature or Protocols: Are there key papers, reviews, or guidelines you would recommend for docking studies involving metalloenzymes and multimeric proteins?
I would greatly appreciate any insights, suggestions, or references that might help me address these challenges. Thank you in advance for your time and expertise!
(SORRY FOR THE CHATGPT FORMAT.)
I'm working with BEAUti and BEAST for phylogenetic analysis. I've worked through the tutorials and can come up with results, but they don't make much sense to me and I'm not sure I really understand what I am doing. I've done some phylogenetic analysis before, but this program and method are completely new to me.
Any help would be greatly appreciated with trying to figure this all out.
Unfortunately, I was using TargetNet, but it is no longer available (I don't know why).
Hello. I am new to DepMap and am having trouble understanding things. I want to see in which cell lines my gene of interest is essential. Any help or suggestions will be greatly appreciated. Thanks
Hi everyone, I've been stumped for a couple of weeks on a gene annotation task that I've been working on with SAG data. I've run eggNOG-mapper and gotten a table with columns such as the KEGG KO number and KEGG Pathway, and I was tasked with mapping the KO numbers to pathways. How do I do this programmatically? I'm pretty confused and would appreciate some help, thanks! I mainly code in Python, by the way.
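One way to approach this in Python is to download the KO-to-pathway link table from the KEGG REST service once and look each KO up in it. The sketch below assumes the endpoint https://rest.kegg.jp/link/pathway/ko returns tab-separated lines such as "ko:K00001<TAB>path:map00010", and that the eggNOG-mapper KEGG_ko column holds comma-separated entries like "ko:K00001,ko:K00002" with "-" for missing values; both assumptions are worth checking against the actual files. The function names are made up for this example, and the demo runs on a small inline sample rather than a live download.

```python
from collections import defaultdict
from urllib.request import urlopen

KEGG_LINK_URL = "https://rest.kegg.jp/link/pathway/ko"  # assumed endpoint, see lead-in


def parse_ko_pathway_links(text):
    """Parse 'ko:K00001<TAB>path:map00010' lines into {KO: sorted pathway IDs}."""
    links = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue
        ko, pathway = line.split("\t")
        links[ko.split(":", 1)[-1]].add(pathway.split(":", 1)[-1])
    return {ko: sorted(paths) for ko, paths in links.items()}


def fetch_ko_pathway_links(url=KEGG_LINK_URL):
    """Download the full link table (requires network access; not called below)."""
    with urlopen(url) as response:
        return parse_ko_pathway_links(response.read().decode("utf-8"))


def pathways_for_annotation(kegg_ko_field, links):
    """Map one KEGG_ko cell (e.g. 'ko:K00001,ko:K00002' or '-') to pathway IDs."""
    pathways = set()
    for entry in kegg_ko_field.split(","):
        ko = entry.strip().split(":", 1)[-1]
        pathways.update(links.get(ko, []))
    return sorted(pathways)


# Small inline sample in the assumed response format.
sample = "ko:K00001\tpath:map00010\nko:K00001\tpath:ko00010\nko:K00002\tpath:map00040\n"
links = parse_ko_pathway_links(sample)
print(pathways_for_annotation("ko:K00001,ko:K00002", links))
print(pathways_for_annotation("-", links))
```

Applied to the full table, `pathways_for_annotation` can be run over the KEGG_ko column row by row, for example with the csv module or pandas.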