nithishbn/motif_finder

wasmer run nithishbn/motif_finder

Motif Finder

Welcome to Motif Finder! This is a command-line utility that takes a FASTA file and a few parameters and (hopefully) finds motifs prevalent in the sequences.

Prerequisites

This utility should work on any platform it has been compiled for. If there is no build for your platform, you can build it yourself (this requires the Rust toolchain, which provides cargo) by cloning this repository:

git clone https://github.com/nithishbn/MotifFinder.git

and running

cargo build --release

in the source directory. This leaves an executable in the target/release/ directory, which you can then run from the command line: motif_finder.exe on Windows, or motif_finder on other platforms.

Data format

This tool technically accepts any FASTA file, but it is meant to be used with a particular approach to motif finding, described in the next section.

RNASeq

By aligning RNASeq data back to a reference genome, we can identify the alignment sites of transcripts. From these alignment sites, we can extract the sequences x bp upstream of each site and search them for motifs, specifically transcription factor binding sites.

This method requires an organism with available RNASeq data and a reference genome, plus a few bioinformatics tools including samtools, bamtools, and bedtools.
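
The upstream-extraction step can be illustrated with a small Rust sketch. It is not part of Motif Finder and does not replace the tools listed above. It assumes the chromosome is held in memory as an ASCII string and that alignment coordinates are 0-based. The function names upstream and upstream_reverse are hypothetical.

```rust
/// Returns up to `x` bases immediately upstream of a 0-based alignment start
/// on the forward strand. Hypothetical helper, not part of Motif Finder.
fn upstream(chromosome: &str, site: usize, x: usize) -> &str {
    // Clamp so that a site near the chromosome start yields a shorter window.
    let end = site.min(chromosome.len());
    let start = end.saturating_sub(x);
    &chromosome[start..end]
}

/// Reverse-strand variant: the upstream region lies after the alignment end
/// (0-based, exclusive) in reference coordinates and is returned as its
/// reverse complement.
fn upstream_reverse(chromosome: &str, site_end: usize, x: usize) -> String {
    let start = site_end.min(chromosome.len());
    let end = (start + x).min(chromosome.len());
    chromosome[start..end]
        .chars()
        .rev()
        .map(|base| match base {
            'A' => 'T',
            'C' => 'G',
            'G' => 'C',
            'T' => 'A',
            other => other, // leave N and other symbols untouched
        })
        .collect()
}
```

Each extracted window would then become one record in the FASTA file passed to the tool.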

Examples

You can try to find the motifs present in promoters.fasta, a set of four known promoters from P. tricornutum, a relatively unknown diatom species.

Gibbs Sampler

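Gibbs Sampler is an algorithm that iteratively searches for the best set of motifs in a set of sequences. In its standard form, each iteration throws out one motif at random and resamples it using a profile built from the remaining motifs, and the search continues until all iterations are finished.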

motif_finder.exe -i promoters.fasta -e 4 -k 10 -o promotifs.txt gibbs -t 100 -r 100
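
For readers who want to see the idea in code, here is a minimal sketch of a textbook Gibbs sampler in Rust. It is not taken from the Motif Finder source, so it may differ from what the tool actually does. It assumes uppercase ACGT sequences that are each at least k long, uses pseudocounts of 1, and uses a small xorshift generator to avoid external crates. All names (Rng, counts_without, score, gibbs_sampler) are hypothetical.

```rust
/// Tiny xorshift64 generator so the sketch needs no external crates.
struct Rng(u64);

impl Rng {
    /// Uniform value in [0, 1).
    fn unit(&mut self) -> f64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
    /// Uniform index in 0..n (n must be at least 1).
    fn below(&mut self, n: usize) -> usize {
        ((self.unit() * n as f64) as usize).min(n - 1)
    }
}

fn index(base: u8) -> usize {
    match base { b'A' => 0, b'C' => 1, b'G' => 2, _ => 3 } // anything else counts as T
}

/// Per-column base counts (pseudocount 1) over all motifs except row `skip`.
fn counts_without(dna: &[&str], starts: &[usize], k: usize, skip: usize) -> Vec<[f64; 4]> {
    let mut counts = vec![[1.0_f64; 4]; k];
    for (row, seq) in dna.iter().enumerate() {
        if row == skip { continue; }
        let motif = &seq.as_bytes()[starts[row]..starts[row] + k];
        for (col, &base) in motif.iter().enumerate() {
            counts[col][index(base)] += 1.0;
        }
    }
    counts
}

/// Number of mismatches against the column-wise consensus (lower is better).
fn score(dna: &[&str], starts: &[usize], k: usize) -> usize {
    (0..k).map(|col| {
        let mut counts = [0usize; 4];
        for (seq, &s) in dna.iter().zip(starts) {
            counts[index(seq.as_bytes()[s + col])] += 1;
        }
        dna.len() - *counts.iter().max().unwrap()
    }).sum()
}

/// Runs a fixed number of iterations and returns the best motif set seen.
fn gibbs_sampler<'a>(dna: &[&'a str], k: usize, iterations: usize, seed: u64) -> Vec<&'a str> {
    let mut rng = Rng(seed.max(1)); // xorshift state must be non-zero
    // Start from one randomly placed k-mer per sequence.
    let mut starts: Vec<usize> = dna.iter().map(|s| rng.below(s.len() - k + 1)).collect();
    let mut best = starts.clone();
    let mut best_score = score(dna, &best, k);
    for _ in 0..iterations {
        // Throw out the motif of one randomly chosen sequence...
        let i = rng.below(dna.len());
        let counts = counts_without(dna, &starts, k, i);
        // ...and weight every k-mer of that sequence by the remaining profile.
        let seq = dna[i].as_bytes();
        let weights: Vec<f64> = (0..=seq.len() - k)
            .map(|p| (0..k).map(|c| counts[c][index(seq[p + c])]).product::<f64>())
            .collect();
        // Draw the replacement position in proportion to its weight.
        let mut target = rng.unit() * weights.iter().sum::<f64>();
        starts[i] = weights.len() - 1; // fallback against rounding error
        for (p, w) in weights.iter().enumerate() {
            if target < *w { starts[i] = p; break; }
            target -= *w;
        }
        let current = score(dna, &starts, k);
        if current < best_score {
            best_score = current;
            best = starts.clone();
        }
    }
    dna.iter().zip(&best).map(|(&s, &p)| &s[p..p + k]).collect()
}
```

Because the sampler is random, a single run can settle on a poor motif set. A common remedy is to repeat it from several seeds and keep the lowest-scoring result.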

Randomized Motif Search

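Randomized Motif Search is an algorithm that iteratively searches for the best set of motifs in a set of sequences. In its standard form, it starts from randomly chosen motifs, one per sequence, and repeatedly replaces them with the k-mers that best fit the current profile until the score can no longer be improved.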

motif_finder.exe -i promoters.fasta -e 4 -k 10 -o promotifs.txt randomized -r 100
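
The loop described above can be sketched in Rust as follows. This is a textbook-style illustration, not code from Motif Finder. It assumes uppercase ACGT sequences at least k long and pseudocounts of 1, and restarts from `runs` random starting points while keeping the best result. The names most_probable, score, and randomized_motif_search are hypothetical.

```rust
fn index(base: u8) -> usize {
    match base { b'A' => 0, b'C' => 1, b'G' => 2, _ => 3 } // anything else counts as T
}

/// Number of mismatches against the column-wise consensus (lower is better).
fn score(dna: &[&str], starts: &[usize], k: usize) -> usize {
    (0..k).map(|col| {
        let mut counts = [0usize; 4];
        for (seq, &s) in dna.iter().zip(starts) {
            counts[index(seq.as_bytes()[s + col])] += 1;
        }
        dna.len() - *counts.iter().max().unwrap()
    }).sum()
}

/// Position of the profile-most-probable k-mer in each sequence, where the
/// profile is built from the k-mers at `starts` with a pseudocount of 1.
fn most_probable(dna: &[&str], starts: &[usize], k: usize) -> Vec<usize> {
    let mut counts = vec![[1.0_f64; 4]; k];
    for (seq, &s) in dna.iter().zip(starts) {
        for (col, &base) in seq.as_bytes()[s..s + k].iter().enumerate() {
            counts[col][index(base)] += 1.0;
        }
    }
    dna.iter().map(|seq| {
        let bytes = seq.as_bytes();
        let mut best = (0, f64::MIN);
        for p in 0..=bytes.len() - k {
            let w: f64 = (0..k).map(|c| counts[c][index(bytes[p + c])]).product();
            if w > best.1 { best = (p, w); } // ties keep the leftmost k-mer
        }
        best.0
    }).collect()
}

fn randomized_motif_search<'a>(dna: &[&'a str], k: usize, runs: usize, seed: u64) -> Vec<&'a str> {
    let mut state = seed.max(1); // xorshift state must be non-zero
    let mut best: Vec<usize> = vec![0; dna.len()]; // returned only if runs == 0
    let mut best_score = usize::MAX;
    for _ in 0..runs {
        // Pick one random starting k-mer per sequence (xorshift64 step).
        let mut starts: Vec<usize> = dna.iter().map(|s| {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            (state % (s.len() - k + 1) as u64) as usize
        }).collect();
        let mut current = score(dna, &starts, k);
        // Refine until the score cannot be improved anymore.
        loop {
            let next = most_probable(dna, &starts, k);
            let next_score = score(dna, &next, k);
            if next_score >= current { break; }
            starts = next;
            current = next_score;
        }
        if current < best_score {
            best_score = current;
            best = starts;
        }
    }
    dna.iter().zip(&best).map(|(&s, &p)| &s[p..p + k]).collect()
}
```

Each refinement loop ends because the score strictly decreases and cannot go below zero. The restarts matter because a single random start often converges to a weak local optimum.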

Median String

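Median String is an algorithm that scores each candidate k-mer by its Hamming distance to the sequences (for each sequence, the distance to the closest k-mer it contains, summed over all sequences) and returns the k-mer that minimizes this total. This algorithm is very slow, since in its standard form the number of candidate k-mers grows as 4^k, but it can find very accurate, though short, motifs. Be careful when using large k values.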

motif_finder.exe -i promoters.fasta -e 4 -k 8 -o promotifs.txt median
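
A brute-force sketch in Rust makes the cost of this search concrete. It follows the standard formulation, which enumerates all 4^k candidate k-mers, and is not taken from the Motif Finder source. It assumes uppercase ACGT input and k of at least 1. The names min_distance and median_string are hypothetical.

```rust
/// Smallest Hamming distance between `pattern` and any k-mer of `text`.
fn min_distance(pattern: &[u8], text: &[u8]) -> usize {
    text.windows(pattern.len())
        .map(|w| w.iter().zip(pattern).filter(|(a, b)| a != b).count())
        .min()
        // A text shorter than the pattern has no k-mers; count it as a full mismatch.
        .unwrap_or(pattern.len())
}

/// Returns the k-mer with the smallest total distance to all sequences.
/// Ties are resolved in favour of the alphabetically first k-mer.
fn median_string(dna: &[&str], k: usize) -> String {
    const BASES: [u8; 4] = [b'A', b'C', b'G', b'T'];
    let mut best = (usize::MAX, Vec::new());
    let mut pattern = vec![0u8; k];
    // Enumerate all 4^k k-mers by counting in base 4.
    for code in 0..4usize.pow(k as u32) {
        let mut rest = code;
        for slot in pattern.iter_mut().rev() {
            *slot = BASES[rest % 4];
            rest /= 4;
        }
        let total: usize = dna
            .iter()
            .map(|s| min_distance(&pattern, s.as_bytes()))
            .sum();
        if total < best.0 {
            best = (total, pattern.clone());
        }
    }
    String::from_utf8(best.1).unwrap()
}
```

Every increment of k multiplies the number of candidates by four, and each candidate is compared against every window of every sequence, which is why large k values quickly become impractical.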

Find motifs using the Gibbs Sampler, Median String, and Randomized Motif Search algorithms in a FASTA-formatted file of reads. Refer to the README to understand the input data.
