/r/bioinformatics


A subreddit to discuss the intersection of computers and biology.


A subreddit dedicated to bioinformatics, computational genomics and systems biology.

Information
  • If you have a specific bioinformatics related question, there is also the question and answer site BioStar and the next generation sequencing community SEQanswers
  • If you want to read more about genetics or personalized medicine, please visit /r/genomics
  • Information about curated, biological-relevant databases can be found in /r/BioDatasets
  • Multicore, cluster, and cloud computing news, articles and tools can be found over at /r/HPC.


118,761 Subscribers

2

How do you annotate cell types in single-cell analysis?

Hi all, I would like to know how you go about annotating cell types, beyond SingleR and manual annotation, in a reasonably definitive and comprehensive way. I'm mainly working in Python, with 5 different mouse tissues in my pipeline. I've tried a bunch of tools, but I'm either missing key cell types or the relevant reference tissue itself. I'm looking for an extremely thorough and accurate way of annotating, because I don't want to miss key cell types. Any comments appreciated, thanks.

4 Comments
2024/10/10
19:45 UTC

1

Predicting TCR antigen specificity from scTCR-seq

I am working with a human 5’ scRNA-seq dataset with scTCR-seq and have identified several highly expanded TCRs. I would now like to explore possible antigen specificity and have been doing so in a basic manner so far by searching databases like IEDB and VDJdb. Most of the hits are naturally viral antigens which is somewhat but not entirely helpful to me.

Can anyone recommend another database/software that can predict specificity to human proteins? Does this even exist? Is my search futile?

2 Comments
2024/10/10
17:40 UTC

3

[Opinion] When would you consider a genome assembly "good enough" for syntenic analysis?

I am faced with a collection of hundreds of genome assemblies, built from shotgun sequencing reads

Some assemblies have just several hundred contigs, so they seem pretty good. However, some have contig counts in the tens of thousands. Target genome size is 1 Gb.

I'm trying to decide on the threshold for excluding some genomes from downstream analysis. It's important that I be able to speak to local syntenic variation, so assemblies that are too fragmented will result in lots of false negatives.

What do you think would be a reasonable cutoff for deciding an assembly is "good enough" vs "bad/incomplete"?
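Contig count alone can be misleading, so one common way to put a number on fragmentation is N50: the contig length at which contigs of that size or longer cover half of the assembly. The sketch below is only an illustration with made-up contig lengths. Any cutoff remains a judgment call, for example requiring N50 to be comfortably larger than the syntenic blocks being compared.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths (bp) for two toy assemblies of equal total size.
contiguous = [400, 300, 200, 100]
fragmented = [100] * 10

print("contiguous N50:", n50(contiguous))
print("fragmented N50:", n50(fragmented))
```

Both toy assemblies total 1,000 bp, but the fragmented one has a much lower N50, which is the property that hurts local synteny calls.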

6 Comments
2024/10/10
17:34 UTC

3

mRNA Transcription and NCBI BLAST Results

Hello,

The drug sequence is GCG TTT GCT CTT CTT CTT GCG. I’m not sure whether the starting GCG TTT... is from the 3' or 5' end, but assuming it’s from the 3' end, the complementary mRNA sequence would be 5'-CGC AAA CGA GAA GAA GAA CGC-3'.

This sequence can be transcribed from the following DNA double strand:

DNA(5'): 5'-CGC AAA CGA GAA GAA GAA CGC-3'
DNA(3'): 3'-GCG TTT GCT CTT CTT CTT GCG-5'

When I use NCBI BLAST with the 5' sequence, I get the correct result. However, using the 3' sequence fails. Why is that?
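A likely explanation is orientation: sequence search tools read a query as 5'->3', and nucleotide BLAST searches both strands by default. The bottom strand above is written 3'->5', so pasting it as-is presents the plain complement rather than the reverse complement, which is a different sequence. A minimal sketch using the two strands from the post (the variable names are illustrative):

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence, read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


# Top strand from the post, written 5'->3'.
top_5_to_3 = "CGCAAACGAGAAGAAGAACGC"

# Bottom strand exactly as written in the post, i.e. 3'->5' left to right.
bottom_3_to_5 = "GCGTTTGCTCTTCTTCTTGCG"

# The same bottom strand rewritten 5'->3', the orientation a search expects.
bottom_5_to_3 = bottom_3_to_5[::-1]

print("bottom strand, 5'->3':", bottom_5_to_3)
print("matches revcomp of top:", reverse_complement(top_5_to_3) == bottom_5_to_3)
print("as-written string equals revcomp of top:",
      bottom_3_to_5 == reverse_complement(top_5_to_3))
```

Reversing the 3' string before searching should give the same hit as the 5' strand, just reported on the opposite strand.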

18 Comments
2024/10/10
15:11 UTC

0

CDS Length

Hi, I want to get the CDS length for all the available genes from Ensembl BioMart, but when I run the following search, it gives a table with more than one CDS length for some of the genes. What is the reason for this? How can I avoid it?
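The usual reason is that CDS length is a per-transcript attribute, so a gene with several coding transcripts gets one row per transcript. One way to collapse this is to keep a single value per gene, for instance the longest CDS. The sketch below assumes a tab-separated export with the columns "Gene stable ID", "Transcript stable ID" and "CDS Length"; the gene and transcript IDs are made up, and choosing the longest CDS (rather than, say, a canonical transcript) is only one possible convention.

```python
import csv
import io

# Hypothetical BioMart-style export: one row per transcript, so a gene with
# several coding transcripts appears several times.
export = (
    "Gene stable ID\tTranscript stable ID\tCDS Length\n"
    "GENE_A\tTX_A1\t900\n"
    "GENE_A\tTX_A2\t1200\n"
    "GENE_B\tTX_B1\t450\n"
    "GENE_C\tTX_C1\t\n"
)


def longest_cds_per_gene(handle):
    """Return {gene_id: longest CDS length} from a tab-separated export."""
    longest = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if not row["CDS Length"]:
            continue  # transcript without a CDS (non-coding)
        gene = row["Gene stable ID"]
        length = int(row["CDS Length"])
        longest[gene] = max(length, longest.get(gene, 0))
    return longest


print(longest_cds_per_gene(io.StringIO(export)))
```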

https://preview.redd.it/ozlkiv8bixtd1.png?width=394&format=png&auto=webp&s=dfba897273beb1f45ba8752801daf818fa91a1fd

3 Comments
2024/10/10
13:14 UTC

2

Title: Seeking Tools and Pipelines to Prioritize and Rank Mutations in Structural Variants Analysis

Hi everyone,

I’m currently working on analyzing structural variants (SVs) from VCF files and have completed the annotation of my variants. However, I’m now looking for tools or pipelines that can help me prioritize and rank these mutations effectively.

If anyone has experience with this or can recommend specific software, algorithms, or workflows that could assist in this process, I would greatly appreciate your input!

Thanks in advance for your help!

0 Comments
2024/10/10
12:19 UTC

3

AlphaFold Outputs

Hey, I ran AlphaFold and my outputs include a bunch of *.pkl files:

['result_model_1_multimer_v3_pred_1.pkl', 'result_model_1_multimer_v3_pred_0.pkl', 'result_model_1_multimer_v3_pred_2.pkl', 'result_model_1_multimer_v3_pred_4.pkl', 'result_model_1_multimer_v3_pred_3.pkl', 'result_model_2_multimer_v3_pred_0.pkl', 'result_model_2_multimer_v3_pred_2.pkl', 'result_model_2_multimer_v3_pred_1.pkl', 'result_model_2_multimer_v3_pred_4.pkl', 'result_model_2_multimer_v3_pred_3.pkl', 'result_model_3_multimer_v3_pred_2.pkl', 'result_model_3_multimer_v3_pred_1.pkl', 'result_model_3_multimer_v3_pred_0.pkl', 'result_model_3_multimer_v3_pred_4.pkl', 'result_model_3_multimer_v3_pred_3.pkl', 'result_model_4_multimer_v3_pred_0.pkl', 'result_model_4_multimer_v3_pred_1.pkl', 'result_model_4_multimer_v3_pred_4.pkl', 'result_model_4_multimer_v3_pred_3.pkl', 'result_model_4_multimer_v3_pred_2.pkl', 'result_model_5_multimer_v3_pred_1.pkl', 'result_model_5_multimer_v3_pred_0.pkl', 'result_model_5_multimer_v3_pred_4.pkl', 'result_model_5_multimer_v3_pred_3.pkl', 'result_model_5_multimer_v3_pred_2.pkl']

I'm just wondering, what is the difference between result_model_1_multimer_v3_pred_0.pkl, *_pred_1.pkl, *_pred_2.pkl and so on?

I loaded each file in Python and I'm trying to obtain the confidence scores. If I average over the five *_pred_*.pkl files, is that a good approximation of the overall confidence of each result_model?
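As far as I understand the multimer pipeline, each pred_N file is a separate prediction from the same model weights with a different random seed, so the five files per model are independent samples rather than pieces of one result. A rough way to compare them is to pull one scalar score from every pickle and group by model. The sketch below assumes each pickle is a dict with a `ranking_confidence` entry and that file names follow the pattern in the listing above. Reading real AlphaFold pickles may also need numpy (and possibly jax) installed.

```python
import glob
import os
import pickle
import re
from collections import defaultdict

NAME_PATTERN = re.compile(r"result_model_(\d+)_multimer_v3_pred_(\d+)\.pkl$")


def confidence_by_model(result_dir, key="ranking_confidence"):
    """Return {model_number: {prediction_number: score}} for one output dir.

    `key` is assumed to be a scalar entry in each result dictionary.
    """
    scores = defaultdict(dict)
    for path in sorted(glob.glob(os.path.join(result_dir, "result_model_*.pkl"))):
        match = NAME_PATTERN.search(os.path.basename(path))
        if match is None:
            continue  # skip files that do not follow the expected naming
        with open(path, "rb") as handle:
            result = pickle.load(handle)
        model, prediction = int(match.group(1)), int(match.group(2))
        scores[model][prediction] = float(result[key])
    return scores


def summarize(scores):
    """Return {model_number: (mean score, best score)}."""
    summary = {}
    for model, by_prediction in scores.items():
        values = list(by_prediction.values())
        summary[model] = (sum(values) / len(values), max(values))
    return summary
```

Calling `summarize(confidence_by_model("path/to/output"))` gives a mean and a best score per model. The mean says how consistently a model scores across seeds. For picking a structure, the single best-scoring prediction is usually what matters.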

0 Comments
2024/10/10
11:40 UTC

0

Best tool and tips for primer design

I need to teach primer design to first-year undergraduates. The thing is, I've never performed PCR before, and never had to design primers.

Any tips or tools to consider? What's important to avoid primer failure?

Cheers
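For a first-year class, a few simple checks can be computed by hand or in a short script: GC content, a rough melting temperature, and the longest single-base run. The sketch below uses the Wallace rule, Tm = 2(A+T) + 4(G+C), which is only a rule of thumb for short oligos. The primer sequence is made up for illustration, and dedicated design tools apply more careful thermodynamic models.

```python
def gc_content(primer):
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in deg C: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def longest_run(primer):
    """Length of the longest run of one repeated base (e.g. AAAA -> 4)."""
    best = run = 0
    previous = None
    for base in primer.upper():
        run = run + 1 if base == previous else 1
        best = max(best, run)
        previous = base
    return best


# Hypothetical 20-mer used only to exercise the functions.
primer = "ATGCGTACCTGAAGCTAGCA"
print("GC fraction:", gc_content(primer))
print("Wallace Tm:", wallace_tm(primer))
print("longest run:", longest_run(primer))
```

Students can compare a forward and reverse primer this way and check that the two Tm values are close, since a large mismatch is a common cause of failed reactions.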

6 Comments
2024/10/10
11:20 UTC

1

Research volunteer

I have a medical degree and am interested in continuing my education in bioinformatics. I am curious whether there is a possibility of working remotely as a research volunteer. Message me if you have such an opportunity.

1 Comment
2024/10/10
10:44 UTC

3

Blast2GO basic help?

I tried running InterProScan on my sequences in Blast2GO basic and I got an error message like this:

11:38 InterProScan (xxxxx) started...

11:40 The following message originates directly from the EMBL-EBI servers, please contact them directly:

Invalid parameters:

Applications -> Value for "appl" is not valid: Currently "TIGRFAM" but should be one of the restricted values: NCBIfam, SFLD, Phobius, SignalP, SignalP_EUK, SignalP_GRAM_POSITIVE, SignalP_GRAM_NEGATIVE... (check parameter documentation for more details)

What should I do?

0 Comments
2024/10/10
04:46 UTC

1

metacells

Has anyone tried metacells? How do I know if I should or should not exclude a gene module?

https://preview.redd.it/gqconbp9putd1.png?width=430&format=png&auto=webp&s=9a503932c1fb4d6d59c48f77014f4b50d23f2748

1 Comment
2024/10/10
03:49 UTC

6

Is deconvoluting bulk RNA-seq data with cBioPortal possible?

I'm a bench scientist with limited bioinformatics knowledge/experience so please pardon my ignorance. I'm interested in determining how expression of a particular gene correlates with different immune populations within tumors, using LM22 as my Gene Signature Matrix, and using a TCGA dataset for my mixture matrix. Is it possible to use CIBERSORTx in this way? If so, would it make sense to Impute Cell Fractions?

e.g. On cBioPortal, I select a TCGA breast cancer study, and look up BRCA1 as my gene of interest, but also add all of the LM22 genes to my query so that I can download a table of gene expression values for BRCA1 + all LM22 genes.

Would appreciate any feedback.

2 Comments
2024/10/10
02:52 UTC

1

Do I need to do any correction or preprocessing on GTEx data?

I am new to handling big data such as this and only know the small-scale data produced by our lab.

From GTEx, I only want to get the values for some genes and compare them across cells.

What kind of normalization or corrections should I do to the TPM data?

I noticed that the sum over the genes is not 10^6. Why is that?

Thanks!
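One possible reason, assuming the table being summed is a filtered subset of the genes the TPM values were computed on: TPM only adds up to 10^6 over the full feature set used for normalization. A small worked example with made-up counts and gene lengths:

```python
def tpm(counts, lengths_kb):
    """Transcripts per million from raw counts and gene lengths in kb."""
    rates = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    scale = sum(rates.values()) / 1e6
    return {gene: rate / scale for gene, rate in rates.items()}


# Hypothetical three-gene "transcriptome".
counts = {"g1": 100, "g2": 300, "g3": 600}
lengths_kb = {"g1": 1.0, "g2": 3.0, "g3": 2.0}

all_tpm = tpm(counts, lengths_kb)
subset = {gene: all_tpm[gene] for gene in ("g1", "g2")}

print("sum over all genes:", sum(all_tpm.values()))
print("sum over a subset: ", sum(subset.values()))
```

For comparing a handful of genes across samples, the values are typically left on the original TPM scale rather than rescaled so the subset sums to 10^6, since rescaling would change what each value means.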

2 Comments
2024/10/09
23:05 UTC

22

What's going to be the next tech-based idea that's gonna win a Nobel Prize in biology?

Title says it all. We have 2 biology-related and 2 AI-related Nobel Prizes so far: microRNAs, AlphaFold, and memory. (The author might be factually wrong but the question still stands.)

19 Comments
2024/10/09
21:54 UTC

3

Problems installing biopython

When I try to install biopython by typing "pip install biopython" in command prompt, this happens:

https://preview.redd.it/c7u04rboqstd1.png?width=1600&format=png&auto=webp&s=7a791323a2a6398a5b90a6c18bce4822aad68b1f

Can anyone help? I went to this link and installed the updated Microsoft Visual C++. I have no idea what to do next :/

11 Comments
2024/10/09
21:15 UTC

0

How to create a multi-FASTA file?

I’m doing a homework assignment and it gives us 9 sequences. How do you put them into a multi-FASTA file to then align using T-Coffee? Thanks in advance!
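A multi-FASTA file is just several FASTA records one after another in a single text file: a header line starting with ">" followed by the sequence lines. It can be typed in any plain-text editor, or generated with a few lines of code. A minimal sketch, with made-up names and sequences standing in for the nine from the assignment:

```python
def to_multi_fasta(records, width=60):
    """Format (name, sequence) pairs as one multi-FASTA string."""
    lines = []
    for name, sequence in records:
        lines.append(">" + name)
        # Wrap the sequence so no line is longer than `width` characters.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Hypothetical records; replace with the real names and sequences.
records = [
    ("seq1", "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"),
    ("seq2", "ATGGCCATTGTTATGGGCCGCTGAAAGGGTGCCCGATAG"),
]

text = to_multi_fasta(records)
print(text)

# To save it for an aligner, write the string to a plain-text file, e.g.:
# with open("sequences.fasta", "w") as handle:
#     handle.write(text)
```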

9 Comments
2024/10/09
19:24 UTC

1

Energy Minimization Program

At university we are using YASARA for energy minimizations. Since I don't really want to spend 300€ to do the same thing at home, I wanted to ask if someone knows a decent alternative.

7 Comments
2024/10/09
17:21 UTC

138

Nobel Prize in Chemistry for David Baker, Demis Hassabis and John Jumper!

Awarded for protein design (D.Baker) and protein structure prediction (D.Hassabis and J.Jumper).

What are your thoughts?

My first takeaway points are

  • Good to have another Nobel in the field after Michael Levitt!
  • AFDB was instrumental in them being awarded the Nobel Prize, I wonder if DeepMind will still support it now that they’ve got it or the EBI will have to find a new source of funding to maintain it.
  • Other key contributors to the field of protein structure prediction have been left out, namely John Moult, Helen Berman, David Jones, Chris Sander, Andrej Sali and Debora Marks.
  • Will AF3 be the last version that will eventually see the light of day, or can we expect an AF4 as well?
  • The community is still quite mad that AF3 is still not public to this day, will that be rectified soon-ish?
18 Comments
2024/10/09
17:13 UTC

1

Question regarding RNA Seq

Hi everyone,

I’m planning to conduct bulk RNA sequencing on mouse intestinal tissue using our disease model and a healthy WT control. I would like to submit samples from both the diseased and control mice for sequencing. I know that, typically, we perform experiments in triplicate to ensure robustness. Does the same rule apply for RNA sequencing, and if so, how many replicates would be ideal for this experiment?

Thank you so much for your guidance!

9 Comments
2024/10/09
17:10 UTC

2

SEACells on ATAC data

Hi all! Newbie bioinformatics tech here.

I’m reading this paper (https://www.nature.com/articles/s41587-023-01716-9) as I’m wanting to apply this algorithm on our data, specifically our ATAC modality rather than our RNA. They mention in the methods that they use ArchR to preprocess their ATAC data, and exclude the 1st dimension for downstream analyses because it correlates strongly to technical variation (makes sense). But in our case, we have a Seurat object. Which is fine.. I know how to do the standard preprocessing and convert it to an anndata object. And for most downstream analyses where the dimensions matter, you can specify dims 2:X

My confusion, though, comes from their GitHub scripts for processing ATAC data (https://github.com/dpeerlab/SEACells/tree/main/notebooks/ArchR): where the heck are they excluding the first dimension? They don’t specify it anywhere, and they seem to do this weird thing where they export the SVD assay as a csv file and then add it to an anndata object. I just want to know how to exclude it, since it doesn’t seem to be a default argument when computing SEACells. Hope someone can help me, and thanks in advance!

Edit: just to clarify, when I preprocessed my ATAC data, I did RunUMAP with the correct dims (2:X) but I don’t quite know if this algorithm uses that or if there’s another way I should exclude that first dimension.

3 Comments
2024/10/09
16:41 UTC

1

sub databases blast --> .json extension

Hello! I'm a forensic biologist and I'm looking to create a personal database in which I can keep sequences from different kinds of organisms, without duplicates.

So I would like to ask whether there's a way to find out in detail the exact composition (sequences, annotation, species, organisms) of the sub-databases in the list below without downloading them, because I don't have enough space to download each one:

  • 16S_ribosomal_RNA-nucl-metadata.json
  • 18S_fungal_sequences-nucl-metadata.json
  • 28S_fungal_sequences-nucl-metadata.json
  • ITS_RefSeq_Fungi-nucl-metadata.json
  • ITS_eukaryote_sequences-nucl-metadata.json
  • LSU_eukaryote_rRNA-nucl-metadata.json
  • LSU_prokaryote_rRNA-nucl-metadata.json
  • SSU_eukaryote_rRNA-nucl-metadata.json
  • core_nt-nucl-metadata.json
  • env_nt-nucl-metadata.json
  • human_genome-nucl-metadata.json
  • mito-nucl-metadata.json
  • mouse_genome-nucl-metadata.json
  • nt-nucl-metadata.json
  • nt_euk-nucl-metadata.json
  • nt_others-nucl-metadata.json
  • nt_prok-nucl-metadata.json
  • nt_viruses-nucl-metadata.json
  • patnt-nucl-metadata.json
  • pdbnt-nucl-metadata.json
  • ref_euk_rep_genomes-nucl-metadata.json
  • ref_prok_rep_genomes-nucl-metadata.json
  • ref_viroids_rep_genomes-nucl-metadata.json
  • ref_viruses_rep_genomes-nucl-metadata.json
  • refseq_rna-nucl-metadata.json
  • refseq_select_rna-nucl-metadata.json
  • taxdb-metadata.json
  • tsa_nr-prot-metadata.json
  • tsa_nt-nucl-metadata.json

I would also like to ask whether some of the smaller sub-databases in the list (like LSU, SSU, 16S or 18S) are included in the bigger ones (like "nt_euk-nucl-metadata.json" or "ref_prok_rep_genomes-nucl-metadata.json").

Does "nt-nucl-metadata.json" include all the information or sequences deposited in the other sub-databases of the same list? Its size is 11K, so that is what I assumed.

Thank you!

2 Comments
2024/10/09
15:49 UTC

3

Questions Regarding Novogene RNA sequencing Results

Hi All,
I just received the sequencing results for my 36 samples, and I have some questions/concerns:

Usually when we send RNA for sequencing we get Stats, Basemean, L2FCSE, etc., usually generated by DESeq2. However, the results we got back from Novogene don't include the items listed above for FGSEA analysis.

To generate those results I put the un-normalized gene counts into DESeq2 (which normally works); however, the L2FC values for samples with only one normalized count with 0.05 pAdj values are vastly different. For example, Sample 3 has an L2FC difference of 0.77, whereas for all of the other data the difference is around 10^-5 or lower (including L2FC, p-value and pAdj value).

I attempted to contact Novogene technical support in CA, but no one is picking up or responding to technical support inquiries.

Any suggestions?

0 Comments
2024/10/09
15:45 UTC

1

hypothetical proteins and featurecounts

Hi!

I'm new to RNA-seq and am trying to do it on Galaxy. While my HISAT scores are high (85%-ish), the featureCounts successful alignments are only around 40%; the rest are nearly all NoFeatures. Can I ask if having many hypothetical proteins in the annotation affects this?

Thank you so much

0 Comments
2024/10/09
15:40 UTC

20

Has anyone gone from a MS in bioinformatics to a PhD in Molecular Biology?

The reason I am considering this route is because I'm coming from a GIS and Wildlife Sciences background. Both have provided me a sort of "weak" background in data science and biology, respectively. My GPA is 3.13, and I don't have upper level molecular biology/biochemistry coursework.

However, I seem to be able to get into Birmingham's online MSc in Bioinformatics.

I guess one important note is that I will be living abroad (I'm in the States) for 1 year (though the MS will last 2.5 years) soon. If I wasn't, I might think it would be better to just take a couple upper division extension classes and perhaps volunteer at a lab. But is this still a potential better route?

27 Comments
2024/10/09
15:30 UTC

6

Gene Calling in Bacterial Annotation

Hi Reddit Fam. Training bioinformatician here.

I am using BV-BRC (formerly PATRIC) to annotate Klebs pneumoniae genome assemblies, the output of which is NOT a gene prediction (only contigs id, location, and functional protein). I am using BV-BRC to further validate my PROKKA annotations.

Two things:

  1. What program do you suggest I use to call pathogenic bacterial genes, aside from PROKKA?

  2. Has anyone managed to annotate multiple genomes in BV-BRC (using the CLI)? My method was to p3-cat them into a combined file, then p3-submit that genome annotation. However, the job always rejects my output path, saying it does not exist, even when Klebs-output3 is an empty folder and I overwrite it. It also has the correct file path, so no mistakes there. (Error: user@bvbrc/home/Experiments/Klebs-output3: No such file or directory).

The command submitted: p3-submit-genome-annotation -f --contigs-file combined2.fasta --scientific-name "Klebsiella pneumoniae subsp. pneumoniae KPX" --taxonomy-id 573 --domain "Bacteria" /user@bvbrc/home/Experiments/Klebs-output3 combined3.fasta

The format: p3-submit-genome-annotation [-f overwrite] [--parameters] output-path output-name

Anyway, any advice or thoughts would be much appreciated!

6 Comments
2024/10/09
09:15 UTC

3

Can you trust BLAST? Or are chromosomal rearrangements a common feature in filamentous fungi?

I'm trying to delineate the promoter region of gene A in my strain, and to do that I wanted to compare a reference organism (containing an annotated TSS for gene A) with my strain. However, when I BLAST the upstream region from my strain, I don't get a hit upstream of gene A in the reference; instead I get a strong hit downstream.

At this point it seems chromosomal rearrangement is the only thing that might explain this, but the same thing happens when I try to do the same for gene B. As this seems a bit excessive, I suspect I'm not fully understanding the BLAST output... Do you guys have an idea of what could be going on?

My methodology has been to take the upstream region of the genes annotated by Augustus, so for instance if the annotation was 1167-1596 nt I would take the region 1000-1166 nt. I would strip the Augustus output down so that only the bases (ACTCTC etc.) remain. I'd then paste these into BLAST and search, and I'd get a 92% hit with a several-hundred-bp query match to the region downstream :/

It's a bit confusing because the BLAST match overlaps with a proposed gene which is oriented in the 3'-5' direction, so I don't know if it's matching the opposite strand. But can promoters be located on the opposite strand if you want to express a gene on the complementary strand?
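One thing worth checking before concluding there is a rearrangement is the strand column of the annotation. For a gene on the minus strand, "upstream" lies at higher forward-strand coordinates, so taking the bases just before the start coordinate actually gives the region downstream of the gene. A minimal sketch of strand-aware extraction, using a toy contig and hypothetical coordinates (1-based, inclusive, as in GFF):

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def upstream_region(contig, start, end, strand, size):
    """Return up to `size` bases upstream of a gene, read 5'->3' on the
    gene's own strand. `start` and `end` are 1-based inclusive coordinates
    on the forward strand."""
    if strand == "+":
        begin = max(0, start - 1 - size)
        return contig[begin:start - 1]
    # Minus strand: the promoter side is after `end` on the forward strand,
    # and must be reverse-complemented to read in the gene's direction.
    region = contig[end:end + size]
    return region.translate(COMPLEMENT)[::-1]


# Toy contig with a "gene" at positions 5-8.
contig = "AAAACCCCGGTTACGT"

print("plus strand: ", upstream_region(contig, 5, 8, "+", 4))
print("minus strand:", upstream_region(contig, 5, 8, "-", 4))
```

If gene A or gene B is annotated on the minus strand in either genome, a "downstream" hit in forward coordinates can be exactly what a conserved promoter looks like.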

5 Comments
2024/10/09
09:13 UTC

4

Barcode sorting issues

I have a large fastq.gz file that I have been trying to sort by a set of barcodes for months. My setup uses a unique outer barcode, followed by an adapter sequence which is the same for all individuals, followed by a unique inner barcode sequence. Each unique combination of outer and inner barcode corresponds to a unique individual / sample, and this fastq.gz file contains approximately 700 unique individuals.

I have tried a few different scripts, mostly using the help of ChatGPT. I had thought my script was working, because I sorted by the outer barcode first and got 95% of my reads matching a sequence. But when I sorted those outer barcode sorted reads by the adapter plus the inner barcode, only 5% of those reads matched a specified sequence.

For some reason when I run my script to sort by all outer barcodes, adapters, and inner barcode combinations at the same time, my script finds no reads at all.

So I took a step back and used grep, to try and identify read counts per individual, and it appears I can find some, but the numbers are still very low, approximately 3,000 reads per individual.

I feel like I am still doing something wrong and I don’t know how to progress. Is there anyone out there who can provide some help, guidance, or a better script than the one an AI made? I’d be willing to share my script or anything else that might be necessary to help you help me. Idk. I feel a bit lost at this point.
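Without seeing the script it is hard to say what is going wrong, but a sharp drop at the inner-barcode step often comes from read orientation or from an offset between the expected and actual barcode positions. The sketch below shows one way to assign reads by the full outer + adapter + inner layout while also trying the reverse complement. The barcodes, adapter and reads are invented, exact matching at the very start of the read is an assumption, and real data may need mismatch tolerance or a dedicated demultiplexing tool.

```python
import io
from collections import Counter

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def fastq_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for index, line in enumerate(handle):
        if index % 4 == 1:
            yield line.strip()


def assign_read(read, samples, adapter, outer_len, inner_len):
    """Return the sample name for a read, or None if nothing matches.

    Expects outer barcode + adapter + inner barcode at the start of the
    read, and also tries the reverse complement of the read.
    """
    for candidate in (read, reverse_complement(read)):
        outer = candidate[:outer_len]
        rest = candidate[outer_len:]
        if not rest.startswith(adapter):
            continue
        inner = rest[len(adapter):len(adapter) + inner_len]
        sample = samples.get((outer, inner))
        if sample is not None:
            return sample
    return None


# Hypothetical adapter, barcode pairs and reads.
ADAPTER = "TCGATCGA"
SAMPLES = {("AAGG", "CCTT"): "ind_1", ("AAGG", "GGAA"): "ind_2"}
forward_read = "AAGG" + ADAPTER + "CCTT" + "ACGTACGTAC"
flipped_read = reverse_complement("AAGG" + ADAPTER + "GGAA" + "TTTTTT")

toy_fastq = io.StringIO(
    f"@r1\n{forward_read}\n+\n{'I' * len(forward_read)}\n"
    f"@r2\n{flipped_read}\n+\n{'I' * len(flipped_read)}\n"
    f"@r3\nTTTTTTTTTTTTTTTTTTTT\n+\n{'I' * 20}\n"
)

counts = Counter(
    assign_read(seq, SAMPLES, ADAPTER, outer_len=4, inner_len=4)
    for seq in fastq_sequences(toy_fastq)
)
print(counts)
```

For a real file, the same loop can read from `gzip.open(path, "rt")` instead of the in-memory handle. Counting how many reads fall under `None` at each stage (outer only, outer + adapter, full layout) usually shows where the matches are being lost.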

14 Comments
2024/10/09
01:09 UTC

9

Place to Get Big Data of Gene Expression (Any gene product) in an easy way?

Is there an easy way (not scraping) to get gene expression (any gene product) for the 20k or so human genes across varying cells and varying humans?

The dataset I have tried comes from CAGE experiments, which are limited in number. I tried GEO (Gene Expression Omnibus), but it seems I don't get the whole set of genes per experiment, and it is hard to mix and match genes, the units they are measured in, or even the gene product.

I am not so particular about the gene product. I am after the quantity of data.

Thank you!

16 Comments
2024/10/09
00:57 UTC

9

Bulk vs single - which to use for my research question

Hi! So I’m planning a distant experiment. I’ve created protocols to differentiate iPSCs into cells of different organs (e.g. cardiomyocytes, blood cells, neurons, intestinal cells etc.). I plan to collect RNA from each of the derived cell types. I want to show that each cell type has gene expression patterns/activated pathways corresponding to its respective primary tissue. I’m guessing bulk RNA-seq would be more suitable, since I would hopefully have distinct homogeneous populations? Also, what online databases can I use to compare my results against? Thank you so much!

20 Comments
2024/10/08
21:05 UTC

0

Running MATLAB on Mac silicon

Hey, I've been trying to run a script in MATLAB that uses GENIE3, but the required compiled files were built for Intel Macs. Is there any way that I can run it on my Mac?

This is the error I've been getting:

Invalid MEX-file '/Users/omarhasannin/Documents/Research Projects/Chapter 3/MLGRN/RTP-STAR/GENIE3_MATLAB/RT/rtenslearn_c.mexmaci64': dlopen(/Users/omarhasannin/Documents/Research Projects/Chapter 3/MLGRN/RTP-STAR/GENIE3_MATLAB/RT/rtenslearn_c.mexmaci64, 0x0006): Library not loaded: @loader_path/libmex.dylib
Referenced from: <05A0769C-5D89-7E42-44D1-9D9AA1BBE4DA> /Users/omarhasannin/Documents/Research Projects/Chapter 3/MLGRN/RTP-STAR/GENIE3_MATLAB/RT/rtenslearn_c.mexmaci64
Reason: tried: '/Users/omarhasannin/Documents/Research Projects/Chapter 3/MLGRN/RTP-STAR/GENIE3_MATLAB/RT/libmex.dylib' (no such file), '/usr/local/lib/libmex.dylib' (no such file), '/usr/lib/libmex.dylib' (no such file, not in dyld cache)

Error in genie3_single (line 145)
[tree,varimp]=rtenslearn_c(expr_matrix(:,input_idx),output_norm,int32(1:nb_samples),[],ok3ensparam,0);

Error in genie3 (line 85)
VIM(i,:) = genie3_single(expr_matrix,i,input_idx,tree_method,K,nb_trees);

Error in run_regressiontree (line 171)
results = genie3(matrix,input_vec, 'RF', 'sqrt', 10000);

Error in regression_tree_pipeline (line 158)
[~,~,bg2,clusterhub] = run_regressiontree(expression_data,time_data,filename_cluster,symbol,istimecourse,[],i,timethreshold,edgenumber);

Error in RTPSTAR_MAIN (line 116)
regression_tree_pipeline(expression,timecourse,clustering,symbol,...

5 Comments
2024/10/08
18:36 UTC
