/r/bioinformatics
A subreddit dedicated to bioinformatics, computational genomics and systems biology.
- If you have a specific bioinformatics related question, there is also the question and answer site BioStar and the next generation sequencing community SEQanswers
- If you want to read more about genetics or personalized medicine, please visit /r/genomics
- Information about curated, biological-relevant databases can be found in /r/BioDatasets
- Multicore, cluster, and cloud computing news, articles and tools can be found over at /r/HPC.
Hello everyone,
I am writing because I am just getting started with R and spatial transcriptomics. I am pretty confident with Python (I have used it for AI and DL), but I want to tackle this topic in R, and I would like to learn as much as I can.
Can you kindly suggest a simple step-by-step tutorial/course/video that can help me to learn spatial transcriptomics analysis, please? I tried to follow:
https://bookdown.org/sjcockell/ismb-tutorial-2023/practical-session-2.html
https://satijalab.org/seurat/articles/spatial_vignette
https://bioconductor.org/books/3.13/OSCA.intro/the-singlecellexperiment-class.html (this one for single cell)
https://www.youtube.com/watch?v=L_7VdCeJ4Z8&list=PLOLdjuxsfI4N1SdaQQYXGoa5Z93hPxWVY&index=8
but I am finding them a bit difficult. I know that is on me; I just wanted to know whether there is something simpler, or whether these links are the state of the art for spatial transcriptomics tutorials.
Thank you so much in advance.
Hi everyone,
I'm about to start a new job in the field of metagenomics. I have a microbiology background, and my thesis was about symbiotic microbiomes.
I would be really grateful if you could share some tips, skills, or learning paths for a new bioinformatician. I am really happy to have this opportunity and don't want to mess it up.
Thank you all in advance.
Hello,
I have a list of mutation hotspots (for example, gene name: NOTCH1, amino acid change: L1574). I have data in this format for several different genes. Is there any way to get the genomic coordinates, and the exon each hotspot belongs to, all at once? Right now I am searching one by one and it is taking a lot of time. Any help would be much appreciated.
Thanks in advance :)
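A possible first step for batching this, sketched under assumptions: parse each "gene + amino acid change" entry and convert the residue number into the 1-based CDS nucleotide range of its codon (residue p is encoded by CDS positions 3p-2 to 3p). The function name `parse_hotspot` and the output layout are hypothetical. Mapping the CDS range onto genomic coordinates and an exon still needs a transcript model (for example from a GTF or an annotation service) and depends on which transcript is chosen; that part is not shown.

```python
import re

# Matches entries such as "L1574": one reference amino acid letter plus a position.
HOTSPOT = re.compile(r"^([A-Z])(\d+)$")

def parse_hotspot(gene, change):
    """Return the codon's 1-based CDS nucleotide range for an entry like L1574."""
    match = HOTSPOT.match(change.strip())
    if match is None:
        raise ValueError(f"unrecognised amino acid change: {change!r}")
    ref_aa, position = match.group(1), int(match.group(2))
    return {
        "gene": gene,
        "ref_aa": ref_aa,
        "aa_position": position,
        "cds_start": 3 * position - 2,  # first base of the codon
        "cds_end": 3 * position,        # last base of the codon
    }

hotspots = [("NOTCH1", "L1574")]
for gene, change in hotspots:
    print(parse_hotspot(gene, change))
```

Once every entry is in this form, the whole list can be sent through whichever annotation lookup you settle on in one loop instead of one search per hotspot.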
Hi,
I've been investigating this issue for a work-related project, but I'm hitting some dead ends.
Essentially, from the FastQC documentation, we know that the "sequence length distribution" criterion fails when there is at least one read of length 0. My educated guess was that a read length of 0 indicates an issue with the sequencer, but I can't find any documentation about this; I've checked the sequencer documentation and nothing like this is mentioned. Does anyone have any ideas?
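One way to narrow this down is to count how many zero-length records are in the file and compare the raw FASTQ with any processed version. The sketch below assumes a plain four-line-per-record FASTQ; the function name `read_length_histogram` is made up for illustration. An upstream trimming step that leaves empty records is one possible source besides the sequencer, though whether that applies here is only a guess.

```python
import io
from collections import Counter

def read_length_histogram(handle):
    """Count reads per length in a four-line-per-record FASTQ stream."""
    lengths = Counter()
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\r\n")
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        lengths[len(sequence)] += 1
    return lengths

# Tiny in-memory example; r2 is an empty (zero-length) read.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\n\n+\n\n@r3\nAC\n+\nII\n")
histogram = read_length_histogram(demo)
print("zero-length reads:", histogram[0])
```

For a real file, pass an open file handle (or `gzip.open(path, "rt")` for compressed input) instead of the `StringIO` demo.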
A colleague shared a Dropbox folder with a bunch of FASTA files I need to work on. I want to get them into my root directory on the university server. I can download each file to my local machine by clicking the Dropbox link, then move it to the university ownCloud, and finally to my root directory on the server. There must be a better way. A Dropbox app? A Python library designed for this? Thanks for any ideas.
I could try wget or curl but there are so many files. Even getting the exact file name links from Dropbox would take some time. Maybe I could wget the directories? IDK. Thanks again.
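A small sketch that may help with the wget/curl route: Dropbox shared links usually end in `dl=0`, and switching that to `dl=1` requests a direct download instead of the preview page (for a shared folder link this typically yields a zip of the folder, as far as I know). The example link below is hypothetical, and the helper name `direct_download_url` is made up.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def direct_download_url(shared_link):
    """Return the shared link with its dl query parameter set to 1."""
    parts = urlsplit(shared_link)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["dl"] = "1"  # assumption: dl=1 asks Dropbox for the raw file
    return urlunsplit(parts._replace(query=urlencode(query)))

example = "https://www.dropbox.com/s/abc123/sample.fasta?dl=0"  # hypothetical link
print(direct_download_url(example))
# https://www.dropbox.com/s/abc123/sample.fasta?dl=1
```

The rewritten URL can then be passed to wget or curl directly on the server, which skips the local machine and ownCloud steps.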
Are the coding challenges in the interview the same as the ones in the practice set? I am referring to the coding challenges on the websites they offer. Thank you.
I saw a script somewhere to convert TPM/TP10K to raw counts, but cannot find it anymore. Can anyone help?
You can just assume that the lowest non-zero value corresponds to 1 and apply the same scaling to all values. I think the script I saw accounted for various edge cases.
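The heuristic above can be sketched roughly as follows, for a single cell or sample at a time. It assumes the smallest non-zero value really does correspond to one count, which will not always hold, and it does not handle the edge cases the original script reportedly covered. Note also that TPM divides by gene length, so for TPM (unlike TP10K) the result is only approximate. The function name `estimate_counts` is hypothetical.

```python
def estimate_counts(normalized):
    """Rescale one sample's values so the smallest non-zero value becomes 1."""
    nonzero = [value for value in normalized if value > 0]
    if not nonzero:
        return [0 for _ in normalized]  # nothing expressed in this sample
    smallest = min(nonzero)  # assumed to correspond to a single count
    return [int(round(value / smallest)) for value in normalized]

tp10k = [0.0, 2.5, 5.0, 12.5]
print(estimate_counts(tp10k))
# [0, 1, 2, 5]
```

If the rescaled values are far from integers before rounding, that is a sign the "lowest value equals one count" assumption does not hold for that sample.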
Is there any way to correct ligand-protein relationships using pyMOL or is this impossible without writing the code myself?
I'm trying to highlight my experience using popular packages on my resume, but I'm not sure the best way to go about it. Should I include a line under the skills section listing which packages I'm proficient in (e.g. "Packages: SAMtools, BEDtools,...")? Or should I only mention my experience in context (e.g. describing the packages I used to accomplish a certain task in my bullet points)? Or should I do both?
I've been doing single-cell analyses for a couple of years now, and one thing I've consistently observed is that papers with single-cell analyses almost never make the Seurat object(s) they constructed (the most common single-cell analysis structure in R) available in their data & materials section. It's almost always just SRA links to the raw sequencing data, a GitHub link to the code (which may or may not be what they actually used for the figures in the paper) and maybe a few spreadsheets indicating annotations for cluster labels, clustering coordinates, etc.
Now, I'm code savvy enough that I can normally reconstruct the original Seurat object using the bits and pieces they've left behind, but it would save me a heck of a lot of time if authors saved their Seurat object and uploaded it online. Plus, a lot of people use different versions of the software, so even if I do run through the whole analysis again with the code they've left behind, it's common to get different results. Sometimes it just doesn't work out and I've had to contact the original authors and beg them for their Seurat object.
So if you are reading this and you are planning on publishing your single cell data soon, please make everyone's life easier and save your Seurat object as a .RDS (R object) or .h5seurat (Seurat object).
Hi all,
This might seem trivial, but I've been struggling to get the input needed for GATK even after googling and combing through Biostars threads. I am simply trying to get a genome BED file. I have the .fasta, .fna, .gtf, and .gff files, all from Ensembl/NCBI, and have tried turning my genome.fasta into genome.sam into genome.bam into genome.bed using bowtie and samtools, but nothing has worked.
My organism's BED files aren't on UCSC either :(. Does anyone have something that works?
Here's what I've tried:
bowtie2-build klactis.fasta klactis
# map reads
bowtie2 -p 4 -x klactis -U klactis.fasta -S klactis.sam  # stops here with Error Abort 6
And with pyfaidx:
faidx --transform bed klactis.fasta > klactis.bed
And with awk:
samtools faidx klactis.fasta
awk 'BEGIN {FS="\t"}; {print $1 FS "0" FS $2}' klactis.fasta.fai > klactis.bed
My bed file currently looks like:
head klactis.bed
A 0 1062590
B 0 1320834
C 0 1753957
D 0 1715506
E 0 2234072
F 0 2602197
Any help is greatly appreciated.
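For what it's worth, a whole-genome BED is just one interval per sequence, with start 0 and end equal to the sequence length, which is the shape of the output shown above. Here is a rough Python sketch that builds the same thing straight from the FASTA, with no alignment step. The function name `fasta_to_bed` is made up, and whether GATK accepts the result will also depend on the contig names matching the reference dictionary, which is an assumption to verify.

```python
import io

def fasta_to_bed(handle):
    """Yield (name, 0, length) for every record in a FASTA stream."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, 0, length
            # Keep only the first word of the header as the sequence name.
            fields = line[1:].split()
            name, length = (fields[0] if fields else ""), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, 0, length

# Tiny in-memory FASTA standing in for klactis.fasta.
demo = io.StringIO(">A chromosome A\nACGT\nAC\n>B\nGGG\n")
for name, start, end in fasta_to_bed(demo):
    print(f"{name}\t{start}\t{end}")
```

Mapping the genome against itself with bowtie2 is not needed for this; the sequence lengths alone define the intervals.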
Ran a qPCR around a month ago now and since then I have been trying to analyze my results. I'm using this program https://github.com/RDML-consortium/rdml-tools and Ubuntu terminal has been giving me a bunch of errors. Does anyone have experience with this and is willing to lend a quick hand?
Hi all,
I am trying to start using AutoDock Vina in PyRx. I have downloaded Enamine's solubility fragments (18K molecules) https://enamine.net/compound-collections/fragment-collection. In PyRx I have used Open Babel to prepare the ligand files. I started by converting all files into pdbqt, and there are a lot of errors.
The error message is:
AD4LigandPreparation wrote less atoms that present in the molecule: 232.63 Location:C:\Program Files (x86)\PyRx\lib\site-packages\PyRx\vsModel.py:PrepareLigandMol
When looking at the renders of some of the ligands, there are clear issues (broken bonds). However, with so many failures, manually going through the entire list of fragments to see where the pdbqt conversion failed is not practical. There doesn't appear to be any way to search for the chemical that failed or to easily remove the failures. The only way to remove them is to go to the AutoDock tab, manually find the ligand, and delete it.
Does anyone have any insight into either this error, or how to best filter the results so I can clean up the sdf library?
Thanks,
Background:
I am trimming some data and I have used a couple of references for what is involved in trimming. The easiest one to read is here. However, I cannot seem to find what quality cutoffs are good to use for quality trimming with CutAdapt. CutAdapt does give these examples:
3' only:
cutadapt -q 10 -o output.fastq input.fastq
Both 3' and 5':
cutadapt -q 15,10 -o output.fastq input.fastq
But here the numbers seem to be largely for illustrative purposes and not the recommendation?
Where `-q` is described as (linked here):
The -q (or --quality-cutoff) parameter can be used to trim low-quality ends from reads. If you specify a single cutoff value, the 3’ end of each read is trimmed
Question/Concerns:
Is the `-q` cutoff different for each end, or should I only trim the 3' end? I cannot seem to find information about quality trimming cutoffs on the Illumina site, and I did not see anything that directly applied to Illumina's TruSeq. Any suggestions or help would be much appreciated.
So far I have only found cancer-specific ones. I'm interested in general co-mutations info across different genes.
And no, this isn't exactly the same as looking for protein-protein interactions. Also, gnomAD only contains information on co-occurring variants within the same gene.
Any help would be greatly appreciated!
Hi, about 4 years ago I created an open source Python library for visualizing set intersections called supervenn: https://github.com/gecko984/supervenn . It has since received more than 250 stars on GitHub.
My post about it in this subreddit received a warm welcome, so I decided that another one after 4 years would do no harm. I've also implemented a new feature today: you can now use just the intersection sizes instead of the sets themselves. Hope you find it useful, have a great day.
I am a college junior who recently switched tracks from pre-med to bioinformatics (I kept my Biology major and my Chemistry and Bioinformatics minors) with a 3.8 GPA. It has been a little difficult finding bioinformatics opportunities for the summer with no previous experience in the field, so I was wondering if anyone could tell me what I should be doing right now, just starting out. Or should I not worry too much about college internships and just focus on a Master's and post-graduate work?
uhmm, so I'm diving deep into my thesis, and I'm all about that phylogenetic tree life right now. But yo, I'm hella lost on what softwares I should be usin'. Like, there's so many out there, and I don't wanna waste time tryin' 'em all out.
I need your help, squad! What softwares do y'all recommend for makin' phylogenetic trees? I need somethin' that's user-friendly 'cause, let's be real, I ain't no computer whiz. But it also gotta be legit, you know? Can't be usin' some janky software that's gonna mess up my data.
Hit me up with your suggestions and tips, y'all! And if you got any insider tricks on how to make these trees pop, I'm all ears. This thesis ain't gonna write itself, and I could use all the help I can get.
Hi all, I have two samples, and for each I've run a differential expression analysis comparing two time points. I'm interested in finding the enriched GO terms of the high-ranking genes and comparing the results for each sample in a sort of network plot.
I’ve seen really cool plots that group GO terms by their term hierarchy, with node size based on -log2(p value) , and I’m wondering how I might be able to reproduce this sort of plot in Python? Any insight would be appreciated!
I recently did a personal project on investigating differential gene expression in breast cancer samples (primary vs metastatic sites). I have 8 sequences and I was just wondering whether deduplicating is necessary after the alignment step?
Hi everyone
I am currently a Master's student in Molecular Biology and Bioinformatics and will graduate soon. During this time I realized that the wet lab is not for me and that I would rather build up my computational skills to apply for jobs in bioinformatics or computational biology once I graduate. I have experience in Python and RStudio, I have data analysis skills too, and I recently implemented a mathematical model in Python. However, I do not feel this is enough to land a job. I have been looking at bioinformatics positions and they require skills in scRNA-seq, RNA-seq, and other omics. In my lab I do not have the opportunity to work with these, and that is why I am worried. I feel like I am going to be behind once I graduate, so I am looking for advice. How can I develop these skills? How long would it take? Do you know of any resources or internships that are useful for learning them? Are there jobs that will take you on and train you?
I know these are a lot of questions; that is because I really want to get trained and succeed in my job search.
I would appreciate your comments.
I currently don't have my own RNA-seq data, but I've found two publicly available datasets that I can use to start answering my question. Each of the datasets has its own disease condition and control, so I would be combining controls and disease conditions from two different papers. I have their data as count matrices; both were processed with HTSeq. I was thinking about using edgeR to analyze them, but I'm concerned about batch effects interfering with the analysis. I was thinking of normalizing before combining the datasets, but edgeR prefers raw counts. Sorry if this is an easy question, I don't typically deal with this. Thanks for your help!
I am helping an online friend in China find an answer for her thesis. She first emailed the author of the article "Peptide-guided lipid nanoparticles deliver mRNA to the neural retina of rodents and nonhuman primates", but we're not sure if the author will reply. So I suggested trying an internet forum, but unfortunately she can't access Reddit in China. So I'm posting this for her, and FYI, I have no knowledge of this topic. Below is her problem:
"I have bought M13 phage peptide display library from NEB, but I can’t electroporate it to electrocompetent cell following its instructions. The parameter is 25 µF, 200 Ω, 2.5 kV."
If you can help, please post references or materials I can share. I will pass your answer on to her as well. If you are willing to be contacted for follow-up questions, please let me know too. Thank you!
Hi all,
I'm having a major issue with bracken when trying to do the third step of the bracken-build command where it attempts to convert kmer mappings into read classifications (step 1b from the manual at https://ccb.jhu.edu/software/bracken/index.shtml?t=manual). I keep getting a segmentation fault during this step, even when requesting the max memory available by my university's supercomputer (2988GB). Has anyone else dealt with this before? If so, PLEASE HELP ME. I have come across the same issue on github (https://github.com/jenniferlu717/Bracken/issues/54), but the creator of bracken doesn't seem to know how to fix it and hasn't responded to my issue thread on github. Here is the script that I am using:
#!/bin/bash
#SBATCH --partition=hugemem
#SBATCH --account=XX
#SBATCH --time=50:00:00
#SBATCH --mem=2988GB
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=80
#SBATCH --job-name=kraken2_bracken_build
/fs/scratch/PAS1725/metagenomics/Bracken-2.6.2/bracken-build -d /fs/scratch/PAS1725/metagenomics/kraken2_standard -t 24 -k 35 -l 100
#####
I am not great at bioinformatics problem solving, and don't really feel confident in the way I set up the script, but the job seems to run okay until I hit that segfault error. Is this just a memory issue? Is there even any way of fixing it or is it a hopeless cause?
I'm set to graduate soon and am freaking out because I've been struggling with bracken and kraken2 for WEEKS and have essentially made no progress, aside from completing the kraken2 steps. If it helps, here is the link that my university's supercomputer tech support directed me to: (https://www.osc.edu/resources/technical_support/supercomputers/pitzer/batch_limit_rules). While they are helping me with this issue, it takes a while for them to respond to my questions, and even more time waiting for my large job script to be queued on the supercomputer to find out it didn't even run properly.
TLDR: I desperately need help fixing the segmentation fault error in bracken-build. I don't have time to troubleshoot this myself, and I am at my wits' end with bracken.
Edit: Thanks to those with helpful comments and suggestions! After talking with my tech support, the issue is with the installed kraken2 libraries being built on another system that is not compatible with the one I'm using, not a memory issue. It has been suggested to install everything again using a container and go from there. Hopefully it works! 🙏
Hello, I'm a freshman biology student and my professor required us to download RAxML for an upcoming activity. I'm not tech savvy so I've been struggling in installing RAxML. All help is appreciated!
Hi all, I have just started using the new R10 chips from oxford nanopore which include loading beads for loading the sample library onto the flowcell. However I found that when going to wash the flow cell for re-using it later I was unable to draw back any fluid from the flowcell in the initial wash step (the "initial draw back 20-30ul of fluid before priming/loading any wash fluid into the flowcell") before you are meant to load the wash mix. I faffed around very carefully repeating this process but to no avail, and had to continue to exert pressure at one point until suddenly the entire chip emptied into my pipette in one go, destroying all the remaining pores all together. I have never had this issue before using any previous chips and it seems as if this issue was caused by the loading beads causing a blockage in one of the channels, preventing me from drawing up any fluid in that initial step.
Has anyone else had this issue or know how to avoid it? And also whether it would be worth contacting nanopore for a replacement chip?
Cheers
Hello, sometimes when searching for proteins within an organism, I come across short protein sequences, which are often referred to as hypothetical proteins. When I blast these sequences, they often do not show a 100% match with other organisms, but they can be closely related proteins.
What I'm curious about is whether I can take these short protein sequences, perform docking, and use them as a novel, untested peptide. What principles should I follow, and has there been any research done with such an approach in the literature? Can anyone familiar with this provide guidance?
Thank you.
Example link: https://www.ncbi.nlm.nih.gov/protein?term=txid562%5Borganism%3Aexp%5D+AND+((%2210%22%5BSLEN%5D+%3A+%2220%22%5BSLEN%5D)&cmd=DetailsSearch
edit: I apologize for not being clear enough, due to my English. What I actually meant to say is, would it be too absurd if I synthetically produce these hypothetical proteins (especially those around 10 to 40 amino acids in length) and investigate their anti-cancer or antimicrobial properties? The amino acid sequences are available in the link I provided above. The reason I'm asking this is whether these hypothetical proteins, which are small peptides, are truly unique entities on their own, or are they just small, meaningless fragments of larger proteins encountered during MS/MS analysis? In other words, are they protein fragments with no discernible properties? Therefore, I'm wondering if it's worth producing them using solid-phase peptide synthesis and whether it's worth researching their properties or not.
I am doing a PhD in AI/Computer Vision. I have applied for an ML Engineer role at a Bion Technology startup. I have been given a dataset (a CSV file) that contains three columns: InChIKey, SMILES, and Activity. There are three activity types: active, inactive, and intermediate.
I know ML and DL classification algorithms for classifying objects given input features. However, as I have no domain knowledge in the life sciences, I can't work out what to do with these 2 input features.
What I understood so far is that InChIKey is a 27-character string or a key value of a chemical compound. SMILES is a chemical structure of that chemical compound or molecule (I am not sure what I mean by a molecule or chemical compound, that is what I thought would be correct to name).
How should I preprocess these features before feeding them into the model? Is there any demo notebook that replicates this task?
Help me understand the task!!!
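As a starting point only, here is a toolkit-free sketch of one common way to turn SMILES into model input: split each string into tokens, map tokens to integer ids, and pad to a fixed length. The InChIKey is a hashed identifier, so it is better suited to spotting duplicate compounds than to being a feature. The regex, the function names, the label mapping, and the three example molecules are all assumptions for illustration; in practice people often compute molecular fingerprints with a cheminformatics toolkit instead, which is not shown here.

```python
import re

# Two-letter halogens, bracket atoms, and two-digit ring closures are kept
# as single tokens; everything else is one character per token.
SMILES_TOKEN = re.compile(r"Cl|Br|\[[^\]]+\]|%\d{2}|.")

# Hypothetical label mapping for the three activity classes.
LABELS = {"inactive": 0, "intermediate": 1, "active": 2}

def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return SMILES_TOKEN.findall(smiles)

def build_vocab(smiles_list):
    """Map each token to an integer id; id 0 is reserved for padding."""
    tokens = sorted({tok for s in smiles_list for tok in tokenize(s)})
    return {tok: i + 1 for i, tok in enumerate(tokens)}

def encode(smiles, vocab, max_len):
    """Encode a SMILES string as a fixed-length list of token ids."""
    # Simplification: unseen tokens also map to 0, the padding id.
    ids = [vocab.get(tok, 0) for tok in tokenize(smiles)][:max_len]
    return ids + [0] * (max_len - len(ids))

rows = [("CCO", "inactive"), ("c1ccccc1Cl", "active"), ("CC(=O)[O-]", "intermediate")]
vocab = build_vocab([smiles for smiles, _ in rows])
features = [encode(smiles, vocab, max_len=12) for smiles, _ in rows]
targets = [LABELS[activity] for _, activity in rows]
print(features)
print(targets)
```

The resulting integer sequences can feed an embedding layer in a sequence model, while the targets are ordinary three-class labels.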
Biologist and statistician (more statistician) here. I got my bachelor's more than 10 years ago. I am starting to get involved in scRNA-seq research, and I am quite rusty in genetics and everything related to bioinformatics. I was looking here, and the most recent comment was about the Biostar Handbook.
I want your opinions on this resource; it seems quite affordable. Any suggestions for other resources to get myself up to date and reasonably well informed for scRNA-seq work will be appreciated.