/r/bioinformatics
A subreddit dedicated to bioinformatics, computational genomics and systems biology.
Hi, on my new Windows computer, I've tried installing Bioconductor packages in R. Most of them install fine, but I cannot install GO.db and org.Hs.eg.db. I've tried uninstalling and reinstalling R, RStudio, and Rtools, and even resetting my entire computer. The same installation works on my old laptop. Any ideas?
> BiocManager::install('org.Hs.eg.db')
'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
CRAN: https://cran.rstudio.com/
Bioconductor version 3.20 (BiocManager 1.30.25), R 4.4.1 (2024-06-14 ucrt)
Installing package(s) 'org.Hs.eg.db'
installing the source package ‘org.Hs.eg.db’
trying URL 'https://bioconductor.org/packages/3.20/data/annotation/src/contrib/org.Hs.eg.db_3.20.0.tar.gz'
Content type 'application/x-gzip' length 98233708 bytes (93.7 MB)
downloaded 93.7 MB
* installing *source* package 'org.Hs.eg.db' ...
** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
Warning message:
package 'IRanges' was built under R version 4.4.2
ERROR: lazy loading failed for package 'org.Hs.eg.db'
* removing 'C:/Users/gabri/R/win-library/4.4/org.Hs.eg.db'
The downloaded source packages are in
‘C:\Users\gabri\AppData\Local\Temp\RtmpailNer\downloaded_packages’
Installation paths not writeable, unable to update packages
path: C:/Program Files (x86)/R/R-4.4.1/library
packages:
boot, cluster, foreign, MASS, Matrix, nlme, survival
Warning message:
In install.packages(...) :
installation of package ‘org.Hs.eg.db’ had non-zero exit status
> session_info()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)
Matrix products: default
locale:
[1] LC_COLLATE=English_Singapore.utf8 LC_CTYPE=English_Singapore.utf8
[3] LC_MONETARY=English_Singapore.utf8 LC_NUMERIC=C
[5] LC_TIME=English_Singapore.utf8
time zone: Asia/Singapore
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] BiocManager_1.30.25 compiler_4.4.1 tools_4.4.1
We've been offered a few runs of long-read sequencing for our environmental DNA samples (think soil). I've only ever used 16S data, so I'm a bit fuzzy on what it is possible to find with long-read metagenome sequencing. In the papers I've read, people tend to use 16S for abundance and long reads for functional analysis.
Is it likely to be possible to analyse diversity and species abundance between samples? It's likely to be a VERY mixed population of microbes in the samples.
I am not sure if this belongs in this sub, but does anyone have experience doing any of these in R? ImageJ is the dedicated software, of course, but has anyone had better, more reliable results with R?
Thanks!
I want to use ANOVA to compare biological samples. What software do people generally use?
Hi,
I'm new here and would really appreciate some advice!
I'd like to run some plant genomes through a program, and the program only takes protein annotation files as inputs. I'm trying to find annotation data for these genomes but this doesn't seem to be straightforward.
For example, this genome appears to have been annotated separately twice (?): one annotation from the original authors (I'm guessing some time during or before 2014) and one annotation conducted by ENSEMBL (which seems to be continually updated/improved): https://plants.ensembl.org/Oryza_glumipatula/Info/Annotation/#genebuild
The ENSEMBL annotations include a predicted protein annotation file ('Fasta Peptide dump'), which seems to be the only protein annotation source I can find for some of the genomes I'm searching for: https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-60/fasta/oryza_glumipatula/pep/
I'm just wondering whether these ENSEMBL predicted protein files are appropriate/reliable to use, and whether they are preferred over the original (much older) annotation data?
Apologies if I've misinterpreted the way ENSEMBL annotation works in my above explanations. Any help or advice would be very appreciated.
I’m still new to bioinformatics, so please 🙏, I would love your help. I’ve asked this question before in a different form, but this version includes updates.
So after a long process of learning how to do bulk RNA sequencing alignment using STAR, I got the raw data for about 20 samples (replicates included). I wanted to combine these raw counts with my own experimental samples and see how they are similar.
But even with my limited knowledge, I know there will be a problem with batch effects. In fact, running DESeq2 and using vst (variance stabilizing transformation) and rlog to create PCA plots showed all my samples clustering together.
Is there another way you would recommend to check the similarity of these samples from the raw counts? I’m asking about options besides removing batch effects with ComBat (I still have to learn that)…
Hey!
I'm studying two populations, one that lives on islands and one that lives on the mainland, and we differentiate them at the subspecies level. I have scaffold level genomes for both subspecies, and only the island population is annotated. Any suggestions on routes for synteny analysis? End goal is to decide which genome to use for GATK.
Hello, I am fairly new and not really experienced in bioinformatics and genomics.
I have one FASTA file of a bacterial isolate. I was wondering what are the different things I can do with this?
So far I have identified it using PubMLST, and run Prokka and Abricate.
I want to learn to use newer tools. I would appreciate any suggestions and help getting into bacterial genome sequencing and bioinformatics.
PS - I use Linux which I am learning to use as well
Hello,
I am trying to understand RNA deconvolution. I started reading review papers and feel overwhelmed, so I wanted to ask for help.
What are the current methods used? I found around 50 technologies, but I don't know which of them are actually used. Which of them are the current standard?
Are there any good papers I should read to get into the topic? I am especially interested in texts explaining the theory and math in detail.
Thanks.
Hi all,
Recently I performed a quite comprehensive single cell sequencing analysis: 5', GE, TCR with antibody hashing, overall a beautiful experiment. I have been analyzing the data with CellRanger v9.0, however I am running into the challenge that I can't annotate the cell type of my wells and samples.
Is annotation for mouse samples supported by CellRanger, or do I have to move to a different tool? On the same note, is aggregation supported for hashing experiments?
Thanks a lot <3
Hi All,
Has anyone here integrated single-cell data from multiple species using Seurat and Harmony? If so, is this approach generally recommended?
I was considering identifying orthologous genes between each species and human, then merging the datasets and performing integration using Harmony for downstream analysis. I’d love to hear about any experiences or insights others might have on this approach.
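The ortholog step described above could be sketched roughly like this, assuming you already have a table of ortholog pairs from some source. The gene pairs, the count values and the function names `one_to_one_orthologs` and `rename_counts` are made up for illustration. The sketch keeps only one-to-one pairs, so every gene in the other species maps to exactly one human symbol before the matrices are merged.

```python
from collections import Counter

def one_to_one_orthologs(pairs):
    """Keep only pairs whose genes appear exactly once on each side."""
    pairs = list(dict.fromkeys(pairs))            # drop exact duplicate rows
    src_counts = Counter(src for src, _ in pairs)
    dst_counts = Counter(dst for _, dst in pairs)
    return {
        src: dst
        for src, dst in pairs
        if src_counts[src] == 1 and dst_counts[dst] == 1
    }

def rename_counts(counts, mapping):
    """Rename genes in {gene: [count per cell]}; genes without an ortholog are dropped."""
    return {mapping[gene]: values for gene, values in counts.items() if gene in mapping}

# Hypothetical (mouse gene, human gene) pairs
pairs = [
    ("Cd3e", "CD3E"),
    ("Ptprc", "PTPRC"),
    ("Gm1", "GENEA"),   # one-to-many: dropped
    ("Gm1", "GENEB"),
    ("Gm2", "GENEC"),   # many-to-one: dropped
    ("Gm3", "GENEC"),
]
mapping = one_to_one_orthologs(pairs)

# Hypothetical counts for two cells
mouse_counts = {"Cd3e": [3, 0], "Gm1": [1, 1], "Xist": [5, 2]}
human_named = rename_counts(mouse_counts, mapping)
print(mapping)
print(human_named)
```

Dropping one-to-many and many-to-one genes is the conservative choice. It loses some features, but it avoids having to decide how to split or sum counts across paralogs.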
Thanks!
Hi,
I'm trying to wrap my head around nf-core/nextflow, and have read and followed many of the tutorials online that write basic nextflow workflows that kinda touch 1-2 tools. However, I haven't been able to find a tutorial/guide on a larger pipeline, where outputs are chained (output from one goes as input to one or more downstream modules), or even how to manage a sample sheet, break it down into a map, tuple etc.
I've kinda written a test pipeline that I had to really play around with to manage my sample sheet (input of sample, some bams, and some sequences of interest) and it feels kinda clunky for short workflows.
What's really confusing is how I actually use an nf-core module. I have installed a few, such as HSMetrics, but how do I supply the proper inputs to the module in my workflow? From what I can tell, the module is just a bit of wrapper code, and not really an image or anything, so I would still need to have picard installed (which is fine, I already do).
Hello everyone,
While working with RNA-seq data during the mapping step using STAR, I observed an unusually high percentage of reads mapped to multiple loci (80-90%) against the Candidatus Nitrosocosmicus franklandus C13 genome. Additionally, during the quality control step with FastQC, I noticed an extremely high duplication rate (~90%) across all samples.
Could anyone help explain these unexpected results? Any insights or suggestions would be greatly appreciated!
Hi, I've generated a FASTA file using writeXStringSet in R. Afterwards, I used awk to put each multi-line sequence on a single line:
awk '/^>/ {printf("\n%s",$0);next; } { printf("%s",$0);} END {printf("\n");}'
However, I saw differences between the sequence lengths calculated in R and those calculated with awk:
awk '/^>/ {if (seqlen) print seqlen;print;seqlen=0;next} {seqlen+=length($0)}END{print seqlen}'
I don't know why these discrepancies arose.
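A small sketch for cross-checking the lengths independently of both awk and R. Two things may be worth ruling out, though neither is confirmed by the post. First, as quoted, the linearizing command prints no newline after the header, so the header and its sequence end up on the same line. Second, if the file has Windows (CRLF) line endings, awk's length($0) counts the trailing carriage return on every line. The function name `fasta_lengths` is just for illustration.

```python
def fasta_lengths(lines):
    """Return {header: sequence length}, ignoring line endings and blank lines."""
    lengths = {}
    header = None
    for line in lines:
        line = line.strip()              # drops "\n", "\r" and stray spaces
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            lengths[header] = 0
        elif header is not None:
            lengths[header] += len(line)
    return lengths

# In-memory example with CRLF line endings
example = [">seq1\r\n", "ACGT\r\n", "AC\r\n", ">seq2\r\n", "GGG\r\n"]
result = fasta_lengths(example)
print(result)   # {'seq1': 6, 'seq2': 3}
```

For a real file, pass an open handle, e.g. `fasta_lengths(open("my.fasta", newline=""))`. If these numbers agree with R but not with awk, line endings are a likely suspect.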
Hello! I have a bunch of molecules for which only the molecular formula and SMILES are available. PubChem and other databases have only 2D structures for them, not 3D ones. Can anyone suggest tools or a way to generate or estimate the 3D structures of these compounds?
I recently started working with PrimalScheme (didn't realize it was used for designing ARTIC primers for COVID sequencing—really cool!) to design primers for sequencing the whole genome of the rabies virus from clinical samples. These samples were detected as positive by ELISA, and we validate further using a PCR that targets the G gene.
To design primers, I started with the highest percent match from a BLAST search as my reference genome. Then, I downloaded a set of genomes from NCBI and used FastANI to identify highly similar sequences. From this, I selected about 10 representative sequences and fed them into PrimalScheme. After some trial and error, I managed to get a set of primers divided into 2 pools that cover about 98% of the genome.
I then used PrimerPooler to evaluate the pools for potential primer dimer issues, and everything checks out there with deltaG values mostly positive. The Tm values are within 5°C of each other, GC content is within acceptable ranges, and the primer lengths are consistent.
At this point, I'm ready to order the primers from IDT, but I feel like I might be overlooking something. Is there anything else I should check before proceeding?
Nobody seems to think this is a big deal, but I do and can't figure out what's going on.
In all 16 samples, more than 99.5% of all sequences were grouped into a family size of one (UMI-based grouping), meaning they're all unique. I used qualimap to generate QC metrics for the mapping to hg38; the duplication plots show that more than 99.5% of the sequences are not duplicates.
But in the library prep, there were nine PCR cycles. At the very least, we should be seeing duplicates, right? It feels crazy to me that the PCR failed somehow for 16 samples in a row. Something feels off, and I don't know what.
I'm following this pipeline: fgbio
Y'all got any ideas on what to try? I might try removing the UMIs entirely (+5 nucleotides), remapping with another tool, and seeing if we still get the same results. Maybe I'm mishandling the UMIs somehow or missing something fundamental?
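One way to sanity-check the grouping is to count family sizes yourself on a small subset of reads, once with the UMI and once by position alone. The sketch below assumes you can export (chrom, start, strand, UMI) per read; the tuples and the function name `family_size_histogram` are hypothetical. Keep in mind that the share of duplicates you see depends on sequencing depth relative to library complexity, not just on the number of PCR cycles, so a shallow run of a complex library can legitimately look almost duplicate-free.

```python
from collections import Counter

def family_size_histogram(reads, use_umi=True):
    """Return {family size: number of families}.

    reads: iterable of (chrom, start, strand, umi) tuples.
    With use_umi=False, reads are grouped by position and strand only.
    """
    families = Counter(
        (chrom, start, strand, umi if use_umi else None)
        for chrom, start, strand, umi in reads
    )
    return Counter(families.values())

# Hypothetical reads
reads = [
    ("chr1", 100, "+", "AACGT"),
    ("chr1", 100, "+", "AACGT"),   # same position and UMI: same family
    ("chr1", 100, "+", "GGTCA"),   # same position, different UMI
    ("chr2", 500, "-", "AACGT"),
]
with_umi = family_size_histogram(reads)
without_umi = family_size_histogram(reads, use_umi=False)
print(with_umi)
print(without_umi)
```

If grouping by position alone also gives almost only singletons, the UMI handling is probably not the cause. If position-only grouping shows plenty of larger families, the UMI extraction would be the first thing to look at.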
In the near future, I'll get a bunch of Forward/Reverse reads (in .ab1 format) from Sanger sequencing, and I'm looking for a free/open-source command-line workflow to process them.
I was thinking about using a biopython script to trim according to the quality scores provided by ab1 files. The script would then subprocess MUSCLE (I'm running Linux) to align Forward and Reverse seqs (after turning them into fasta format).
I've got 3 questions:
EDIT: each pair of Forward and Reverse seqs correspond to an amplicon (COI DNA Barcode) of a species to identify, I thus have no "reference sequence" for the alignment
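A rough sketch of the trimming step in plain Python, under the assumption that you have already pulled the base calls and Phred scores out of each .ab1 file (Biopython's `SeqIO.read(path, "abi")` exposes them as `letter_annotations["phred_quality"]`). The window size, threshold and function names are illustrative choices, not a recommendation. The reverse read needs to be reverse-complemented before it can be aligned to the forward read.

```python
def trim_by_quality(seq, quals, window=10, min_mean_q=20):
    """Keep the span from the first to the last window whose mean quality passes.

    Low-quality stretches in the middle of the read are not removed.
    Returns (trimmed sequence, trimmed qualities); both empty if nothing passes.
    """
    n = len(seq)
    if n != len(quals):
        raise ValueError("sequence and qualities differ in length")
    passing = [
        i for i in range(n - window + 1)
        if sum(quals[i:i + window]) / window >= min_mean_q
    ]
    if not passing:
        return "", []
    start, end = passing[0], passing[-1] + window
    return seq[start:end], quals[start:end]

def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N in either case)."""
    return seq.translate(str.maketrans("ACGTNacgtn", "TGCANtgcan"))[::-1]

# Toy read: poor ends, good middle
seq = "NNNNACGTACGTNNNN"
quals = [5] * 4 + [40] * 8 + [5] * 4
trimmed_seq, trimmed_quals = trim_by_quality(seq, quals, window=4, min_mean_q=35)
print(trimmed_seq)                   # ACGTACGT
print(reverse_complement("AACG"))    # CGTT
```

Biopython also has an "abi-trim" format that applies Mott's trimming algorithm while reading, which may save you from writing the trimming yourself.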
I just noticed that the HMMER online tool is down and I can't find any information on when it'll be back. In the meantime, are there any alternatives with the same functionality?
I have RNA-Seq count data from 10 samples (5 condition and 5 control) along with corresponding protein isoform measurements. I normalized the RNA-Seq counts using VST and rlog, and the protein isoform measurements were converted to z-scores. I tried both Spearman and Pearson correlations, but neither gave statistically significant correlations or very low p-values. I suspect this is due to the small sample size. I would appreciate any suggestions on how to proceed.
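With only 10 samples, one option is a permutation test, which makes no distributional assumption about the correlation. Below is a minimal standard-library sketch of Spearman's rho with a two-sided permutation p-value. The expression and protein values are made up, and the function names are only for illustration. If you test many genes, the p-values still need multiple-testing correction, which is hard to survive at n = 10.

```python
import random

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; fails with ZeroDivisionError if x or y is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def permutation_pvalue(x, y, n_perm=10000, seed=0):
    """Two-sided p-value from shuffling y; the +1 terms avoid a p-value of zero."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Made-up values for one gene across 10 samples
rna = [5.1, 6.3, 5.8, 7.2, 6.9, 8.1, 7.7, 9.0, 8.4, 9.6]
protein = [0.2, 0.1, 0.5, 0.9, 0.4, 1.3, 1.1, 1.2, 1.8, 1.6]
rho = spearman(rna, protein)
p_value = permutation_pvalue(rna, protein, n_perm=2000)
print(rho, p_value)
```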
Hi all! I'm trying to choose between two software packages, Geneyx and VarsomeClinical, for finding disease-specific genes in short- and long-read sequencing data. Both look like they can do similar things, but I don't know which to choose. For now it is for research purposes, but we will switch to clinical use very soon. As for price, Varsome costs approximately double.
Thank you for any help!
Could anyone suggest tools or web servers for prophage detection in bacterial genomes?
Hello everyone,
I am trying to split my Visium spatial fastq file into many fastq files according to barcodes. So my desire is to have a fastq file for every barcode. My barcode.txt file is something like this...
sp1 AAACAACGAATAGTTC
sp2 AAACAAGTATCTCCCA
sp3 AAACAATCTACTAGCA
sp4 AAACACCAATAACTGC
...the R2 file is something like this...
@SRR19762149.14266 NB552055:200:HWCH5BGXH:1:11101:26830:2925 length=90
AATGCAAACAGTACCTAACAAACCCACAGGTCCTAAACTACCAAACCTGCATTAAAAATTTCGGTTGGGGCGACCTCGGAGCAGAACCCA
+SRR19762149.14266 NB552055:200:HWCH5BGXH:1:11101:26830:2925 length=90
AAAAAE/AEEEEEEEAEAEEEE<EEEEE//<EEE/EEEEEEEEEAEEEAEEE<AEEAE////6EAAA/EE/EA<EEAE/<EEEE//EEA/
...while the R1 file is something like this:
@SRR19762149.13421 NB552055:200:HWCH5BGXH:1:11101:23424:2818 length=28
CTCCGAGTAAATCCGCTCCTCAGTTGAC
+SRR19762149.13421 NB552055:200:HWCH5BGXH:1:11101:23424:2818 length=28
AAAAAEEEEEEEEEEEEEEEEEEEEEAE
I tried this command in Linux from https://github.com/Debian/fastx-toolkit/tree/debian/unstable (with both --eol and --bol):
zcat R2.fastq.gz | ./fastx_barcode_splitter.pl --bcfile barcodes.txt --eol --exact --prefix /output/ --suffix "_R2.fastq" --debug
But unfortunately it keeps on saying:
"matched barcode: unmatched"
I also tried https://bitbucket.org/princeton_genomics/barcode_splitter/src/master/ but again no luck :(
Could you kindly help me to find a solution, please?
Thank you so much in advance!
Matteo
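In Visium libraries the 16 bp spatial barcode sits at the start of R1, followed by the UMI, which fits the 28 bp R1 reads shown above. That may be why matching the barcodes against R2 reports everything as unmatched. Here is a sketch that assigns each R2 record by the barcode found in its R1 mate. It assumes R1 and R2 list the same reads in the same order and that barcodes.txt has a name column and a barcode column. The function names and toy reads are hypothetical.

```python
import io
from collections import defaultdict
from itertools import islice

def read_fastq(handle):
    """Yield (header, seq, plus, qual) tuples from an open text handle."""
    while True:
        record = [line.rstrip("\r\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return
        yield tuple(record)

def load_barcodes(lines):
    """Parse 'name<whitespace>barcode' lines into {barcode: name}."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            table[fields[1]] = fields[0]
    return table

def split_r2_by_r1_barcode(r1, r2, barcodes, bc_len=16):
    """Group R2 records by the barcode at the start of the paired R1 read."""
    groups = defaultdict(list)
    for rec1, rec2 in zip(read_fastq(r1), read_fastq(r2)):
        name = barcodes.get(rec1[1][:bc_len], "unmatched")
        groups[name].append(rec2)
    return groups

# Toy in-memory data: read1 carries the sp1 barcode, read2 an unknown one
barcodes = load_barcodes(["sp1 AAACAACGAATAGTTC", "sp2 AAACAAGTATCTCCCA"])
r1 = io.StringIO(
    "@read1\nAAACAACGAATAGTTC" + "T" * 12 + "\n+\n" + "I" * 28 + "\n"
    "@read2\n" + "C" * 16 + "G" * 12 + "\n+\n" + "I" * 28 + "\n"
)
r2 = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nTTTTGGGG\n+\nIIIIIIII\n"
)
groups = split_r2_by_r1_barcode(r1, r2, barcodes)
for name, records in groups.items():
    print(name, len(records))
```

For real files, open them with `gzip.open("R1.fastq.gz", "rt")` and write each group to its own file afterwards. This version holds everything in memory and only does exact matching, so for a full run you would probably want to process the data in chunks or append to the output files as you go.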
Hi, I'm exploring different aligners and their parameters, and I would like to ask what kinds of evaluation should be done for long-read data. Would mapping quality be enough (e.g. calculating the average mapping quality), or do other statistics need to be considered? Which software would you recommend for these evaluations?
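As a starting point, here is a sketch that summarizes primary alignments from SAM text: the fraction mapped, the mean MAPQ, and the fraction at or above a MAPQ threshold. The threshold and the example records are arbitrary. Note that MAPQ scales differ between aligners, so the mean MAPQ is more useful for comparing parameter settings of one aligner than for comparing different aligners.

```python
def summarize_sam(lines, mapq_threshold=20):
    """Summarize primary alignments from SAM-formatted text lines."""
    total = mapped = high_q = mapq_sum = 0
    for line in lines:
        if line.startswith("@"):            # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:                # not a complete alignment record
            continue
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x100 or flag & 0x800:    # secondary or supplementary
            continue
        total += 1
        if flag & 0x4:                      # unmapped
            continue
        mapped += 1
        mapq_sum += mapq
        if mapq >= mapq_threshold:
            high_q += 1
    return {
        "primary_reads": total,
        "mapped": mapped,
        "mapped_fraction": mapped / total if total else 0.0,
        "mean_mapq": mapq_sum / mapped if mapped else 0.0,
        "fraction_high_mapq": high_q / mapped if mapped else 0.0,
    }

# Made-up SAM records
sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t16\tchr1\t200\t10\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r1\t2048\tchr2\t50\t60\t5M5H\t*\t0\t0\tACGTA\tIIIII",
]
stats = summarize_sam(sam)
print(stats)
```

For BAM input you could pipe `samtools view -h` into a script like this. Long reads are often split into several alignments, so it may also be worth tracking how many supplementary records each read produces.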
As you can see, I have the assays. But I cannot start the integration.
> obj@assays
$RNA
Assay (v5) data with 41569 features for 15368 cells
Top 10 variable features:
CLC, HBB, LYZ, PEAR1, CTSG, PF4, IL17RE, BOLL, ELANE, PRTN3
Layers:
counts, data, scale.data
> obj <- IntegrateLayers(
object = obj, method = RPCAIntegration,
orig.reduction = "pca", new.reduction = "integrated.rpca",
k.weight = 40,
verbose = TRUE
)
Computing within dataset neighborhoods
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=04s
Finding all pairwise anchors
| | 0 % ~calculating
Error in UseMethod(generic = "Assays", object = object) :
no applicable method for 'Assays' applied to an object of class "NULL"
> Assays(obj)
[1] "RNA"
I downloaded a dataset from GEO (GSE189161) that includes a file named metadata. I read the file and tried to load it into a Seurat object.
> seurat_obj$meta.data <- metadata
Error in `[[<-`:
! None of the meta data requested was found in the data frame
Run `rlang::last_trace()` to see where the error occurred.
I'd like to know how to solve this.
I have some 8 BM samples and I’m not sure how to analyze it after UMAP creation. How do you classify clusters?
Can I just map a few markers we are interested in? I have a whole marker list for some BM tissues but it always gives me an error probably because not enough genes can be mapped onto clusters.
Also my PI wants me to compare a few samples against each other. How do you do comparative analysis in Single cell ATAC seq? Is it just checking the difference in types of cells mapped in the UMAP between 2 samples?
Any helpful links? The Satija lab site doesn’t have an ATAC-seq-only tutorial.
Context: I have little knowledge of molecular modelling but several years of bioinformatics experience as an engineer (mainly working with omics data).
I recently tried AlphaFold 2 but got lost very quickly. I also have a fairly good idea of how ColabFold works but have never used it myself (I plan to start there).
Basically, I am interested in protein-protein interactions, something that could help me find potential targets to test with laboratory methods such as two-hybrid or cross-linking.
So I was wondering whether any of you have experience integrating these tools into your studies, and maybe published the results alongside other data (RNA-seq, ChIP-seq, metagenomics, or the methods above)?
Also, don't hesitate to share any good resources about it!
Cheers
I was trying to reinforce my manual annotation of scRNA-seq data through reference mapping and label transfer from a well-annotated dataset. There are many atlases for human datasets, but I am working on mouse samples. The only source of mouse references I know is https://cellxgene.cziscience.com/collections , but I cannot find a satisfactory one that matches my own dataset, which is mostly immune cells from autoimmune models. Does anybody know of other good resources for well-annotated reference atlases?
Hello everyone,
I am currently working on a protein-protein docking study where both the proteins are metalloenzymes containing iron-sulfur/iron clusters. Additionally, the interaction involves a monomeric protein binding to a multimeric protein complex, which adds another layer of complexity. I’ve been finding it challenging to navigate this, as the available literature and resources on such systems are quite limited.
I’d like to seek advice on the following points:
Docking Tools: Are there specific docking tools or software that can effectively handle metalloenzymes, especially those with iron-sulfur clusters?
Monomer-Multimer Considerations: What is the best approach for modeling interactions between a monomer and a multimer? Should the multimer be treated as a rigid body, or is it better to explore flexibility in the subunits?
Validation of Results: How can I verify the biological plausibility and structural accuracy of the docking results? Are there specific metrics or workflows commonly used for such complex systems?
Recommended Literature or Protocols: Are there key papers, reviews, or guidelines you would recommend for docking studies involving metalloenzymes and multimeric proteins?
I would greatly appreciate any insights, suggestions, or references that might help me address these challenges. Thank you in advance for your time and expertise!
(SORRY FOR THE CHATGPT FORMAT.)