/r/bioinformatics
A subreddit dedicated to bioinformatics, computational genomics and systems biology.
Hi all, how do you go about annotating cell types in a definitive, comprehensive way, beyond SingleR and manual annotation? My pipeline is mainly in Python and covers 5 different mouse tissues. I've tried a bunch of tools, but each time I'm either missing key cell types or the relevant reference tissue itself. I'm looking for an extremely thorough and accurate way of annotating so that I don't miss key cell types. Any comments appreciated, thanks.
I am working with a human 5’ scRNA-seq dataset with scTCR-seq and have identified several highly expanded TCRs. I would now like to explore possible antigen specificity and have been doing so in a basic manner so far by searching databases like IEDB and VDJdb. Most of the hits are naturally viral antigens which is somewhat but not entirely helpful to me.
Can anyone recommend another database/software that can predict specificity to human proteins? Does this even exist? Is my search futile?
I am faced with a collection of hundreds of genome assemblies built from shotgun sequencing reads.
Some assemblies have just several hundred contigs, so they seem pretty good. However, some have contig counts in the tens of thousands. The target genome size is 1 Gb.
I'm trying to decide on a threshold for excluding some genomes from downstream analysis. It's important that I be able to speak to local syntenic variation, so assemblies that are too fragmented will produce lots of false negatives.
What do you think would be a reasonable cutoff for deciding an assembly is "good enough" vs "bad/incomplete"?
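Contig count alone says little about contiguity, so one hedged starting point is to compute N50 for each assembly and filter on that instead. The sketch below is a minimal example; the `min_n50` cutoff and the toy contig lengths are made up for illustration, and the right threshold depends on how large the syntenic blocks you care about are.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def passes_contiguity(contig_lengths, min_n50):
    """Hypothetical filter: keep an assembly only if its N50 reaches min_n50."""
    return n50(contig_lengths) >= min_n50


# Toy contig lengths in bp (illustrative only, not real data).
toy_assembly = [400, 300, 200, 100]
print(n50(toy_assembly))
print(passes_contiguity(toy_assembly, min_n50=250))
```

Plotting the N50 distribution across all assemblies before picking a cutoff may show a natural break between the good and the fragmented ones.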
Hello,
The drug sequence is GCG TTT GCT CTT CTT CTT GCG. I’m not sure whether the starting GCG TTT... is from the 3' or 5' end, but assuming it’s from the 3' end, the complementary mRNA sequence would be 5'-CGC AAA CGA GAA GAA GAA CGC-3'.
This sequence can be transcribed from the following DNA double strand:
DNA(5'): 5'-CGC AAA CGA GAA GAA GAA CGC-3'
DNA(3'): 3'-GCG TTT GCT CTT CTT CTT GCG-5'
When I use NCBI BLAST with the 5' sequence, I get the correct result. However, using the 3' sequence fails. Why is that?
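A likely explanation, assuming the bottom strand was pasted exactly as written above: BLAST reads a nucleotide query as 5'→3', so a string typed in 3'→5' order is interpreted as a different sequence. Nucleotide BLAST searches both strands by default, so the reversed (5'→3') form of the bottom strand should find the same locus as the top strand. A small Python sketch of the conversion, using the sequences from the post:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def clean(seq):
    """Remove spaces and normalise case."""
    return "".join(seq.split()).upper()


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return clean(seq).translate(COMPLEMENT)[::-1]


top_5to3 = clean("CGC AAA CGA GAA GAA GAA CGC")     # top strand, 5'->3'
bottom_3to5 = clean("GCG TTT GCT CTT CTT CTT GCG")  # bottom strand as written, 3'->5'

# Reverse the bottom strand so it reads 5'->3' before pasting it into BLAST.
bottom_5to3 = bottom_3to5[::-1]

print(bottom_5to3)
print(bottom_5to3 == reverse_complement(top_5to3))
```

If the drug sequence was actually given 5'→3', the same logic applies in the other direction: its target is the reverse complement, not the plain complement.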
Hi, I want to get the CDS length for all available genes from Ensembl BioMart, but when I run the following search, it gives a table with more than one CDS length for some genes. What is the reason for this? How can I avoid it?
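The usual reason is that CDS length is a property of a transcript, not of a gene, so a gene with several coding transcripts gets several rows. One hedged way to collapse the table is to keep a single value per gene, for example the longest CDS. The sketch below assumes a tab-separated export; the column names and toy IDs are placeholders, so adjust them to match your actual header.

```python
import csv
import io


def longest_cds_per_gene(rows, gene_col="Gene stable ID", cds_col="CDS Length"):
    """Collapse transcript-level rows to one CDS length per gene (the longest)."""
    best = {}
    for row in rows:
        value = (row.get(cds_col) or "").strip()
        if not value:  # transcripts without a CDS have an empty field
            continue
        length = int(value)
        gene = row[gene_col]
        if length > best.get(gene, 0):
            best[gene] = length
    return best


# Toy export with made-up IDs (illustrative only).
toy_export = (
    "Gene stable ID\tTranscript stable ID\tCDS Length\n"
    "GENE_A\tTX_A1\t900\n"
    "GENE_A\tTX_A2\t1200\n"
    "GENE_A\tTX_A3\t\n"
    "GENE_B\tTX_B1\t300\n"
)
rows = csv.DictReader(io.StringIO(toy_export), delimiter="\t")
result = longest_cds_per_gene(rows)
print(result)
```

Whether the longest CDS is the right representative depends on the analysis; picking one designated transcript per gene is another common choice.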
Hi everyone,
I’m currently working on analyzing structural variants (SVs) from VCF files and have completed the annotation of my variants. However, I’m now looking for tools or pipelines that can help me prioritize and rank these mutations effectively.
If anyone has experience with this or can recommend specific software, algorithms, or workflows that could assist in this process, I would greatly appreciate your input!
Thanks in advance for your help!
Hey, I ran AlphaFold and my outputs include a bunch of *.pkl files:
['result_model_1_multimer_v3_pred_1.pkl', 'result_model_1_multimer_v3_pred_0.pkl', 'result_model_1_multimer_v3_pred_2.pkl', 'result_model_1_multimer_v3_pred_4.pkl', 'result_model_1_multimer_v3_pred_3.pkl', 'result_model_2_multimer_v3_pred_0.pkl', 'result_model_2_multimer_v3_pred_2.pkl', 'result_model_2_multimer_v3_pred_1.pkl', 'result_model_2_multimer_v3_pred_4.pkl', 'result_model_2_multimer_v3_pred_3.pkl', 'result_model_3_multimer_v3_pred_2.pkl', 'result_model_3_multimer_v3_pred_1.pkl', 'result_model_3_multimer_v3_pred_0.pkl', 'result_model_3_multimer_v3_pred_4.pkl', 'result_model_3_multimer_v3_pred_3.pkl', 'result_model_4_multimer_v3_pred_0.pkl', 'result_model_4_multimer_v3_pred_1.pkl', 'result_model_4_multimer_v3_pred_4.pkl', 'result_model_4_multimer_v3_pred_3.pkl', 'result_model_4_multimer_v3_pred_2.pkl', 'result_model_5_multimer_v3_pred_1.pkl', 'result_model_5_multimer_v3_pred_0.pkl', 'result_model_5_multimer_v3_pred_4.pkl', 'result_model_5_multimer_v3_pred_3.pkl', 'result_model_5_multimer_v3_pred_2.pkl']
I'm just wondering, what is the difference between model_1_multimer_v3_pred_1.pkl, *_pred_0.pkl, *_pred_1.pkl?
I loaded each file in Python and I'm trying to obtain the confidence scores. If I average over the five *pred.pkl files, is that a good approximation of the overall confidence of each result_model?
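As far as I understand the AlphaFold-Multimer output, model_1 to model_5 are the five parameter sets and pred_0 to pred_4 are independent runs of the same parameter set with different random seeds. A mean over the five predictions mostly tells you how consistent that parameter set is; the usual practice is to rank all 25 predictions and look at the best one. The sketch below reports both. It assumes each pickle is a dict with a `ranking_confidence` entry, so check `result.keys()` on your own files first. Unpickling real AlphaFold results may also need numpy (and possibly jax) installed.

```python
import pickle
import re
from collections import defaultdict
from pathlib import Path
from statistics import mean

# File name pattern taken from the listing in the post.
PATTERN = re.compile(r"result_(model_\d+_multimer_v3)_pred_(\d+)\.pkl$")


def confidence_by_model(result_dir, key="ranking_confidence"):
    """Group per-prediction confidence scores by model and summarise them.

    `key` is an assumption about the pickle contents; change it if your
    result dicts use a different field.
    """
    scores = defaultdict(list)
    for path in sorted(Path(result_dir).glob("result_model_*_pred_*.pkl")):
        match = PATTERN.search(path.name)
        if match is None:
            continue
        with open(path, "rb") as handle:
            result = pickle.load(handle)
        scores[match.group(1)].append(float(result[key]))
    return {
        model: {"mean": mean(vals), "best": max(vals), "n": len(vals)}
        for model, vals in scores.items()
    }


# Example call (the directory name is a placeholder):
# print(confidence_by_model("my_alphafold_output"))
```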
I need to teach primer design to first-year undergraduates. The thing is, I've never performed PCR before and have never had to design primers.
Any tips or tools to consider? What's important to avoid primer failure?
Cheers
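For teaching, a few rule-of-thumb checks can be coded in a handful of lines so students see what the design tools are testing. The sketch below computes GC content, a rough melting temperature using the Wallace rule (2 °C per A/T plus 4 °C per G/C, only a crude estimate for short oligos) and a 3' GC clamp. The thresholds are commonly quoted guidelines, not hard limits, and the example primer is made up. For real designs, tools such as Primer3 or NCBI Primer-BLAST also check specificity and primer-dimer formation, which this sketch does not.

```python
def gc_fraction(primer):
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough Tm in degrees C via the Wallace rule: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def quick_check(primer, min_len=18, max_len=25, min_gc=0.40, max_gc=0.60):
    """Apply rule-of-thumb checks; the thresholds are illustrative defaults."""
    primer = primer.upper()
    return {
        "length_ok": min_len <= len(primer) <= max_len,
        "gc_ok": min_gc <= gc_fraction(primer) <= max_gc,
        "gc_clamp": primer[-1] in "GC",  # G or C at the 3' end
        "tm_wallace": wallace_tm(primer),
    }


toy_primer = "ATGCGTACGTTAGCCTAGGC"  # made-up 20-mer for illustration
report = quick_check(toy_primer)
print(report)
```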
I have a medical degree and am interested in continuing my education in bioinformatics. I am curious whether it is possible to work remotely as a volunteer in research. Message me if you have such an opportunity.
I tried running InterProScan on my sequences in Blast2GO Basic and got an error message like this:
11:38 InterProScan (xxxxx) started...
11:40 The following message originates directly from the EMBL-EBI servers, please contact them directly:
Invalid parameters:
Applications -> Value for "appl" is not valid: Currently "TIGRFAM" but should be one of the restricted values: NCBIfam, SFLD, Phobius, SignalP, SignalP_EUK, SignalP_GRAM_POSITIVE, SignalP_GRAM_NEGATIVE... (check parameter documentation for more details)
What should I do?
Has anyone tried metacells? How do I know if I should or should not exclude a gene module?
I'm a bench scientist with limited bioinformatics knowledge/experience so please pardon my ignorance. I'm interested in determining how expression of a particular gene correlates with different immune populations within tumors, using LM22 as my Gene Signature Matrix, and using a TCGA dataset for my mixture matrix. Is it possible to use CIBERSORTx in this way? If so, would it make sense to Impute Cell Fractions?
e.g. On cBioPortal, I select a TCGA breast cancer study, and look up BRCA1 as my gene of interest, but also add all of the LM22 genes to my query so that I can download a table of gene expression values for BRCA1 + all LM22 genes.
Would appreciate any feedback.
I am new to handling big data such as this and only know small scale data produced from our lab.
What I want to do is pull the values for only some genes from GTEx and compare them across cells.
What kind of normalization or corrections should I do to the TPM data?
I noticed that the sum over the genes is not 10^6. Why is that?
Thanks!
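One possible reason, assuming the sum was taken over a subset of genes: TPM is scaled so that all genes in a sample add up to 10^6, so any subset adds up to less. I can't tell from the post which file was summed, so treat this as a sketch of the arithmetic rather than a diagnosis. The toy counts and lengths below are made up. The last step shows log2(TPM + 1), a common transform before comparing genes across samples.

```python
import math


def tpm(counts, lengths_kb):
    """Compute TPM from raw counts and gene lengths in kilobases."""
    rates = [count / length for count, length in zip(counts, lengths_kb)]
    total = sum(rates)
    return [rate / total * 1e6 for rate in rates]


# Toy sample with three genes (illustrative values only).
counts = [10, 20, 30]
lengths_kb = [1.0, 2.0, 1.0]

values = tpm(counts, lengths_kb)
print(sum(values))      # all genes together: about 1e6
print(sum(values[:2]))  # a subset of genes: less than 1e6

# log2(TPM + 1) compresses the dynamic range for cross-sample comparison.
log_values = [math.log2(v + 1) for v in values]
print(log_values)
```

Renormalising a subset back to 10^6 is usually not what you want, because it would make each gene's value depend on which other genes were selected.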
Title tells it all. We have 2 biology and 2 AI related Nobel prizes so far: microRNAs, AlphaFold, and memory. (The author might be factually wrong, but the question still stands.)
When I try to install biopython by typing "pip install biopython" in command prompt, this happens:
Can anyone help? I went to this link and installed the updated Microsoft Visual C++. I have no idea what to do next :/
I’m doing a homework assignment and it gives us 9 sequences. How do you put them into a multi-FASTA file to then align using T-Coffee? Thanks in advance!
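A multi-FASTA file is just the individual FASTA records placed one after another in a single text file: a header line starting with `>` followed by the sequence lines, repeated for each sequence. You can do this by hand in a plain-text editor, or with a few lines of Python as in the sketch below. The names, sequences and output file name are placeholders.

```python
def format_multifasta(records, width=60):
    """Format (name, sequence) pairs as multi-FASTA text.

    Whitespace inside sequences is removed and lines are wrapped at `width`.
    """
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        seq = "".join(seq.split())
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    return "\n".join(lines) + "\n"


# Placeholder records; replace with the 9 sequences from the assignment.
records = [
    ("seq1", "ATGGCCATTGTAATGGGCCGC"),
    ("seq2", "ATGGCGATTGTTATGGGACGC"),
]
text = format_multifasta(records)
print(text)

# To save it for upload to an aligner (file name is a placeholder):
# with open("sequences.fasta", "w") as handle:
#     handle.write(text)
```

Keep the header names short and unique, since alignment tools use them as labels in the output.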
At university we use YASARA for energy minimization. Since I don't really want to spend 300€ to do the same thing at home, does anyone know a decent alternative?
Awarded for protein design (D. Baker) and protein structure prediction (D. Hassabis and J. Jumper).
What are your thoughts?
My first takeaway points are
Hi everyone,
I’m planning to conduct bulk RNA sequencing on mouse intestinal tissue using our disease model and a healthy WT control. I would like to submit samples from both the diseased and control mice for sequencing. I know that we typically perform experiments in triplicate to ensure robustness. Does the same rule apply to RNA sequencing, and if so, how many replicates would be ideal for this experiment?
Thank you so much for your guidance!
Hi all! Newbie bioinformatics tech here.
I’m reading this paper (https://www.nature.com/articles/s41587-023-01716-9) as I’m wanting to apply this algorithm on our data, specifically our ATAC modality rather than our RNA. They mention in the methods that they use ArchR to preprocess their ATAC data, and exclude the 1st dimension for downstream analyses because it correlates strongly to technical variation (makes sense). But in our case, we have a Seurat object. Which is fine.. I know how to do the standard preprocessing and convert it to an anndata object. And for most downstream analyses where the dimensions matter, you can specify dims 2:X
My confusion, though, comes from their GitHub scripts for processing the ATAC data (https://github.com/dpeerlab/SEACells/tree/main/notebooks/ArchR): where the heck are they excluding the first dimension? They don’t specify it anywhere, and they seem to do this weird thing where they export the SVD assay as a csv file and then add it to an anndata object. I just want to know how to exclude it, since it doesn’t seem to be a default argument when computing SEACells. Hope someone can help me, and thanks in advance!
Edit: just to clarify, when I preprocessed my ATAC data, I did RunUMAP with the correct dims (2:X) but I don’t quite know if this algorithm uses that or if there’s another way I should exclude that first dimension.
Hello! I'm a forensic biologist and I'm looking to create a personal database in which I can keep sequences from different kinds of organisms, without duplicates.
So I would like to ask whether there's a way to find out the exact composition (sequences, annotation, species, organisms) of the subdatabases in the list below without downloading them, because I don't have enough space to download each one:
I would also like to ask whether some of the smaller subdatabases in the list (like LSU, SSU, 16S, 18S, etc.) are included in the bigger ones (like "nt_euk-nucl-metadata.json" or "ref_prok_rep_genomes-nucl-metadata.json").
Does "nt-nucl-metadata.json" include all the information or sequences deposited in the other subdatabases of the same list? It's only 11K in size, so I've supposed that
Thank you!
Hi All,
I just received the sequencing result for my 36 samples, and I have some questions/concerns:
Usually when we send RNA for sequencing we get Stats, Basemean, L2FCSE, etc., usually generated by DESeq2. However, the results we got back from Novogene don't include the items listed above for FGSEA analysis.
To generate those results I fed the un-normalized gene counts into DESeq2 (which normally works); however, the L2FC values for samples with only one normalized count with 0.05 pAdj values are vastly different. For example, Sample 3 has an L2FC difference of 0.77, whereas for all of the other data the difference is around 10^-5 or lower (including L2FC, p-value, and padj).
I attempted to contact Novogene technical support in CA, but no one is picking up or responding to technical support inquiries.
Any suggestions?
Hi!
I'm new to RNA-seq and am trying to do it on Galaxy. While my HISAT scores are high (85%-ish), the featureCounts successful alignments are only around 40%; the rest are nearly all NoFeatures. Can I ask if having many hypothetical proteins in the annotation affects this?
Thank you so much
The reason I am considering this route is because I'm coming from a GIS and Wildlife Sciences background. Both have provided me a sort of "weak" background in data science and biology, respectively. My GPA is 3.13, and I don't have upper level molecular biology/biochemistry coursework.
However, I seem to be able to get into Birmingham's online MSc in Bioinformatics.
I guess one important note is that I will soon be living abroad (I'm in the States) for 1 year (though the MSc will last 2.5 years). If I weren't, I might think it would be better to just take a couple of upper-division extension classes and perhaps volunteer at a lab. But is this still a potentially better route?
Hi Reddit Fam. Training bioinformatician here.
I am using BV-BRC (formerly PATRIC) to annotate Klebsiella pneumoniae genome assemblies, the output of which is NOT a gene prediction (only contig IDs, locations, and functional proteins). I am using BV-BRC to further validate my PROKKA annotations.
Two things:
What program do you suggest I use to call pathogenic bacterial genes, aside from PROKKA?
Has anyone managed to annotate multiple genomes in BV-BRC using the CLI? My method was to p3-cat them into a combined file, then p3-submit that for genome annotation. However, the job always rejects my output path, saying it does not exist, even when Klebs-output3 is an empty folder and I overwrite it. The file path is also correct, so no mistakes there. (Error: user@bvbrc/home/Experiments/Klebs-output3: No such file or directory).
The command submitted: p3-submit-genome-annotation -f --contigs-file combined2.fasta --scientific-name "Klebsiella pneumoniae subsp. pneumoniae KPX" --taxonomy-id 573 --domain "Bacteria" /user@bvbrc/home/Experiments/Klebs-output3 combined3.fasta
The format: p3-submit-genome-annotation [-f overwrite] [--parameters] output-path output-name
Anyway, any advice or thoughts would be much appreciated!
I'm trying to delineate the promoter region of gene A in my strain, and to do that I wanted to compare a reference organism (containing an annotated TSS for gene A) with my strain. However, when I BLAST the upstream region from my strain, I don't get a hit upstream of gene A in the reference; instead I get a strong hit downstream.
At this point it seems a chromosomal rearrangement is the only thing that might explain this, but the same thing happens when I do the same for gene B. As this seems a bit excessive, I suspect I'm not fully understanding the BLAST output... Do you guys have an idea of what could be going on?
My methodology has been to take the upstream region of the genes annotated by Augustus, so for instance if the annotation was 1167-1596 nt I would take the region 1000-1166 nt. I would trim this from the Augustus output so that only the bases (ACTCTC etc.) remain. I'd then paste these into BLAST and search, and I'd get a 92% hit with a query match of several hundred bp to the region downstream :/
It's a bit confusing because the BLAST match overlaps with a proposed gene which is oriented in the 3'-5' direction, so I don't know if it's matching the opposite strand. But can promoters be located on the opposite strand if you want to express a gene on the complementary strand?
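One thing worth ruling out, assuming some of these genes are annotated on the minus strand: annotation coordinates are always given on the forward strand, so for a minus-strand gene the region before the start coordinate is actually downstream of the gene, and the true upstream region lies after the end coordinate. Taking 1000-1166 for every gene regardless of strand would produce exactly this kind of "hit downstream". A minimal strand-aware sketch, with a made-up toy genome and 1-based inclusive coordinates as in GFF:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def upstream_region(genome, start, end, strand, length):
    """Return `length` bases upstream of a gene, read 5'->3' on the gene's strand.

    `start` and `end` are 1-based inclusive forward-strand coordinates.
    """
    if strand == "+":
        lo = max(1, start - length)
        return genome[lo - 1:start - 1].upper()
    if strand == "-":
        hi = min(len(genome), end + length)
        return reverse_complement(genome[end:hi])
    raise ValueError("strand must be '+' or '-'")


toy_genome = "AAAACCCCGGGATTTT"  # illustrative sequence, gene at positions 5-8
print(upstream_region(toy_genome, 5, 8, "+", 4))
print(upstream_region(toy_genome, 5, 8, "-", 4))
```

The strand for each gene is in the strand column of the Augustus GFF output, so it can be passed straight into a function like this.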
I have a large fastq.gz file that I have been trying to sort by a set of barcodes for months. My setup uses a unique outer barcode, followed by an adapter sequence which is the same between all individuals, followed by a unique inner barcode sequence. Each unique outer barcode by inner barcode combination corresponds to a unique individual / sample. And this fastq.gz file contains approximately 700 unique individuals.
I have tried a few different scripts, mostly using the help of ChatGPT. I had thought my script was working, because I sorted by the outer barcode first and got 95% of my reads matching a sequence. But when I sorted those outer barcode sorted reads by the adapter plus the inner barcode, only 5% of those reads matched a specified sequence.
For some reason when I run my script to sort by all outer barcodes, adapters, and inner barcode combinations at the same time, my script finds no reads at all.
So I took a step back and used grep, to try and identify read counts per individual, and it appears I can find some, but the numbers are still very low, approximately 3,000 reads per individual.
I feel like I am still doing something wrong and I don’t know how to progress. Is there anyone out there who can provide some help, guidance, or a better script than an AI-made one? I’d be willing to share my script or anything else that might be necessary to help you help me. I don't know. I feel a bit lost at this point.
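Without seeing the script it is hard to say what is wrong, but here is a minimal Python sketch of the matching step for comparison. It assumes the read starts with outer barcode + adapter + inner barcode and requires exact matches; both are assumptions, and the barcodes below are made up. If the outer barcode matches 95% of reads but the inner one only 5%, it may be worth checking whether the inner barcode is reverse-complemented in the reads, sits on the other mate, or is preceded by a variable-length spacer. Dedicated tools such as cutadapt also support demultiplexing with mismatches.

```python
import gzip
from collections import Counter


def build_prefix_table(samples, adapter):
    """Map each expected read prefix (outer + adapter + inner) to a sample name.

    `samples` maps (outer_barcode, inner_barcode) tuples to sample names.
    """
    return {outer + adapter + inner: name for (outer, inner), name in samples.items()}


def assign_read(seq, prefix_table, prefix_lengths):
    """Return the sample whose expected prefix starts the read, or None."""
    for length in prefix_lengths:
        name = prefix_table.get(seq[:length])
        if name is not None:
            return name
    return None


def count_reads(fastq_gz, samples, adapter):
    """Count reads per sample in a gzipped FASTQ (exact prefix matches only)."""
    table = build_prefix_table(samples, adapter)
    lengths = sorted({len(prefix) for prefix in table}, reverse=True)
    counts = Counter()
    with gzip.open(fastq_gz, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # the sequence line of each 4-line record
                counts[assign_read(line.strip().upper(), table, lengths)] += 1
    return counts


# Made-up barcodes and adapter for illustration.
toy_samples = {("AAAA", "CCCC"): "ind1", ("GGGG", "TTTT"): "ind2"}
toy_adapter = "ACGT"
toy_table = build_prefix_table(toy_samples, toy_adapter)
print(assign_read("AAAAACGTCCCCGATTACA", toy_table, [12]))
```

Reads counted under `None` are the unassigned ones; printing a few of them next to the expected prefixes usually shows quickly where the layout assumption breaks.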
Is there an easy way (not scraping) to get gene expression values (any gene product) for the 20k or so human genes across varying cells and varying individuals?
The dataset I have tried comes from CAGE experiments, which are limited in number. I tried GEO (Gene Expression Omnibus), but it seems I don't get the whole set of genes per experiment, and it is hard to mix and match genes, the units they are measured in, or even the gene product.
I am not particular about the gene product. I am after the quantity of data.
Thank you!
Hi! So I’m planning a distant experiment. I’ve created protocols to differentiate iPSCs into cells of different organs (e.g. cardiomyocytes, blood cells, neurons, intestinal cells etc.). I plan to collect RNA from each of the derived cell types. I want to show that each cell type has gene expression patterns/activated pathways corresponding to its respective primary tissue. I’m guessing bulk RNA-seq would be more suitable, since I would hopefully have distinct homogeneous populations? Also, what online databases can I compare my results against? Thank you so much!
Hey, I've been trying to run a MATLAB script that uses GENIE3, but the required compiled files were built for Intel Macs. Is there any way I can run it on my Mac?
This is the error I've been getting:
Invalid MEX-file '/Users/omarhasannin/Documents/Research Projects/Chapter 3/MLGRN/RTP-STAR/GENIE3_MATLAB/RT/rtenslearn_c.mexmaci64': dlopen(/Users/omarhasannin/Documents/Research Projects/Chapter 3/MLGRN/RTP-STAR/GENIE3_MATLAB/RT/rtenslearn_c.mexmaci64, 0x0006): Library not loaded: @loader_path/libmex.dylib
Referenced from: <05A0769C-5D89-7E42-44D1-9D9AA1BBE4DA> /Users/omarhasannin/Documents/Research Projects/Chapter 3/MLGRN/RTP-STAR/GENIE3_MATLAB/RT/rtenslearn_c.mexmaci64
Reason: tried: '/Users/omarhasannin/Documents/Research Projects/Chapter 3/MLGRN/RTP-STAR/GENIE3_MATLAB/RT/libmex.dylib' (no such file), '/usr/local/lib/libmex.dylib' (no such file), '/usr/lib/libmex.dylib' (no such file, not in dyld cache)
Error in genie3_single (line 145)
[tree,varimp]=rtenslearn_c(expr_matrix(:,input_idx),output_norm,int32(1:nb_samples),[],ok3ensparam,0);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error in genie3 (line 85)
VIM(i,:) = genie3_single(expr_matrix,i,input_idx,tree_method,K,nb_trees);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error in run_regressiontree (line 171)
results = genie3(matrix,input_vec, 'RF', 'sqrt', 10000);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error in regression_tree_pipeline (line 158)
[~,~,bg2,clusterhub] = run_regressiontree(expression_data,time_data,filename_cluster,symbol,istimecourse,[],i,timethreshold,edgenumber);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error in RTPSTAR_MAIN (line 116)
regression_tree_pipeline(expression,timecourse,clustering,symbol,...