Name Description Link EDAM category
The AB WT Analysis Pipeline is an off-instrument SOLiD data analysis software package for the analysis of experiments run with the SOLiD™ System. Currently it supports functionality for mapping SOLiD reads from a transcript sample to a reference genome and assigning tag counts to features of the reference genome. The package can be run on PBS and LSF job management systems. The output from this package can be viewed directly in the UCSC genome browser. Oligonucleotide alignment construction
This tool, of course, can not only be used for comparing methods, but also for investigating potential regulators and co-regulators of gene sets from experiments such as ChIP-chip and microarrays. Sequence motif discovery
AMOS is a collection of tools and class interfaces for the assembly of DNA sequencing reads. The package includes a robust infrastructure, modular assembly pipelines, and tools for overlapping, consensus generation, contigging, and assembly manipulation. Sequence assembly (de-novo assembly)
ANTLR, ANother Tool for Language Recognition, is a language tool that provides a framework for constructing recognizers, interpreters, compilers, and translators from grammatical descriptions containing actions in a variety of target languages. ANTLR provides excellent support for tree construction, tree walking, translation, error recovery, and error reporting.
A grounder for logic programs. Current ASP solvers work on variable-free programs. Hence, a grounder is needed that, given an input program with first-order variables, computes an equivalent ground (variable-free) program. Gringo is such a grounder.
Potassco, the Potsdam Answer Set Solving Collection bundles tools for Answer Set Programming developed at the University of Potsdam, among them, the answer set solver clasp, the grounder gringo, and their combinations clingo and iclingo.
A_purva is a Contact Map Overlap maximization (CMO) solver. Given two protein structures represented by two contact maps, A_purva computes the amino-acid alignment that maximizes the number of common contacts. Pairwise structure alignment construction (global)
ABySS is a de novo, parallel, paired-end sequence assembler that is designed for short reads. The single-processor version is useful for assembling genomes up to 100 Mbases in size. The parallel version is implemented using MPI and is capable of assembling larger genomes. Sequence assembly (de-novo assembly)
GCG provides over 150 programs for sequence analysis. These programs meet a wide range of analysis needs: from pattern, motif, and reference database searches to RNA secondary structure prediction; from restriction mapping to evolutionary analysis; from fragment assembly to sequence comparison. It comes with SeqLab’s graphical user interface.
ActiveTcl, by ActiveState, is a distribution of Tcl and Tk, coupled with a repository (conceptually similar to CPAN, as used by Perl, or the various Linux distribution package repositories). It started out as a Batteries Included effort to bring together a variety of Tcl extensions in one binary distribution.
ALLPATHS-LG is a whole-genome shotgun assembler that can generate high-quality genome assemblies using short reads (~100bp) such as those produced by the new generation of sequencers. Sequence assembly (de-novo assembly)
Apache Ant is a Java library and command-line tool whose mission is to drive processes described in build files as targets and extension points dependent upon each other. The main known usage of Ant is the build of Java applications. Ant supplies a number of built-in tasks for compiling, assembling, testing and running Java applications. Ant can also be used effectively to build non-Java applications, for instance C or C++ applications. More generally, Ant can be used to pilot any type of process that can be described in terms of targets and tasks.
AUGUSTUS is a program that predicts genes in eukaryotic genomic sequences.
C++ API & command-line toolkit for working with BAM data Sequence alignment construction
Bcftools is a program for variant calling and manipulating VCFs and BCFs. It replaces the original samtools BCF calling from bcftools subdirectory of samtools and renders obsolete most of the perl VCFtools.
A powerful toolset for genome arithmetic.
BS Seeker 2 is a seamless and versatile pipeline for fast and accurate mapping of bisulfite-treated short reads.
BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the other two are for longer sequences ranging from 70bp to 1Mbp. BWA-MEM and BWA-SW share similar features such as long-read support and split alignment, but BWA-MEM, which is the latest, is generally recommended for high-quality queries as it is faster and more accurate. BWA-MEM also has better performance than BWA-backtrack for 70-100bp Illumina reads. Oligonucleotide alignment construction
BayeScan aims at identifying candidate loci under natural selection from genetic data, using differences in allele frequencies between populations. BayeScan is based on the multinomial-Dirichlet model. One of the simplest possible scenarios covered consists of an island model in which subpopulation allele frequencies are correlated through a common migrant gene pool from which they differ in varying degrees. The difference in allele frequency between this common gene pool and each subpopulation is measured by a subpopulation-specific FST coefficient. Therefore, this formulation can consider realistic ecological scenarios where the effective size and the immigration rate may differ among subpopulations.

Being Bayesian, BayeScan incorporates the uncertainty on allele frequencies due to small sample sizes. In practice, very small sample sizes can be used, with the risk of low power, but with no particular risk of bias. Allele frequencies are estimated using different statistical models depending on the type of genetic marker used. In BayeScan, three different types of data can be used: (i) codominant data (such as SNPs or microsatellites), (ii) dominant binary data (such as AFLPs) and (iii) AFLP amplification intensity, which is considered neither dominant nor codominant.

Selection is introduced by decomposing FST coefficients into a population-specific component (beta) shared by all loci, and a locus-specific component (alpha) shared by all the populations using a logistic regression. Departure from neutrality at a given locus is assumed when the locus-specific component is necessary to explain the observed pattern of diversity (alpha significantly different from 0). A positive value of alpha suggests diversifying selection, whereas negative values suggest balancing or purifying selection. This leads to two alternative models for each locus, including or not the alpha component to model selection. For each locus, a reversible-jump MCMC explores models with and without selection (alpha component being either present or absent, respectively) and estimates their relative posterior probabilities.
Phylogenetic tree analysis (natural selection)
Genetic variation analysis
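The logistic decomposition described in the BayeScan entry above can be illustrated with a small numeric sketch. It assumes the form logit(FST) = alpha + beta for one locus and one population; the function name and the example values are hypothetical and are not part of BayeScan.

```python
import math


def fst_from_components(alpha: float, beta: float) -> float:
    """Return FST for one locus/population pair under a logistic model.

    Assumes logit(FST) = alpha + beta, where alpha is the locus-specific
    component and beta is the population-specific component.
    """
    return 1.0 / (1.0 + math.exp(-(alpha + beta)))


# Hypothetical values: one population (beta = -2.0) and three loci.
beta = -2.0
neutral = fst_from_components(0.0, beta)       # alpha = 0: neutral locus
diversifying = fst_from_components(1.5, beta)  # alpha > 0: FST is raised
balancing = fst_from_components(-1.5, beta)    # alpha < 0: FST is lowered
```

Under this assumption, a locus with alpha = 0 has an FST set by the population component alone, which is why a model without alpha serves as the neutral alternative.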
BayeScan aims at identifying candidate loci under natural selection from genetic data, using differences in allele frequencies between populations. BayeScan is based on the multinomial-Dirichlet model.
Embedded database
BioMart is a community-driven project to provide unified access to distributed research data to facilitate the scientific discovery process.
The BioMart project provides free software and data services to the international scientific community in order to foster scientific collaboration and facilitate the scientific discovery process. The project adheres to the open source philosophy that promotes collaboration and code reuse.
Bioquali is software for validating and predicting qualitative data coupled to a biological network represented as an interaction graph. Gene regulatory network analysis
Bismark is a program to map bisulfite treated sequencing reads to a
genome of interest and perform methylation calls in a single step.
The output can be easily imported into a genome viewer, such as
SeqMonk, and enables a researcher to analyse the methylation levels
of their samples straight away.
The PacBio® long read aligner
NCBI Blast+ to perform sequence similarity searches. Sequence database search
Blast2GO® is an ALL in ONE tool for functional annotation of (novel) sequences and the analysis of annotation data. Sequence annotation
BLAT (BLAST-like alignment tool) is a pairwise sequence alignment algorithm that was developed by Jim Kent at the University of California Santa Cruz (UCSC) in the early 2000s to assist in the assembly and annotation of the human genome.
Boost provides free peer-reviewed portable C++ source libraries.
We emphasize libraries that work well with the C++ Standard Library.
Boost libraries are intended to be widely useful, and usable across a broad spectrum of applications
Bowtie, an ultrafast, memory-efficient short read aligner for short DNA sequences (reads) from next-gen sequencers. Sequence database search (by sequence using local alignment-based methods)
BreakDancer, released under GPLv3, is a C++ package that provides genome-wide detection of structural variants from next-generation paired-end sequencing reads. It includes two complementary programs. BreakDancerMax predicts five types of structural variants: insertions, deletions, inversions, inter- and intra-chromosomal translocations from next-generation short paired-end sequencing reads using read pairs that are mapped with unexpected separation distances or orientation. BreakDancerMini focuses on detecting small indels (usually between 10bp and 100bp) using normally mapped read pairs. Please read our paper for a detailed algorithmic description. http://www.nature.com/nmeth/journal/v6/n9/abs/nmeth.1363.html
Assessing genome assembly and annotation completeness with Benchmarking Universal Single-Copy Orthologs
A user-friendly utility for processing and analyzing 454 GS-FLX data in biodiversity studies.
CEGMA (Core Eukaryotic Genes Mapping Approach) is a pipeline for building a highly reliable set of gene annotations in virtually any eukaryotic genome.
CMake is a cross-platform build system generator.
COPE (Connecting Overlapped Pair-End reads) is a method to align and connect Illumina paired-end reads whose insert size is smaller than the sum of the two read lengths. The connected reads can be used in genome assembly, resequencing and transcriptome research.
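The overlap condition in the COPE entry above (insert size smaller than the sum of the two read lengths) can be sketched as follows. This is a minimal illustration that assumes error-free reads and exact matching; the function names and the `min_overlap` parameter are hypothetical and do not reflect COPE's actual algorithm or interface.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def connect_pair(read1: str, read2: str, min_overlap: int = 3):
    """Connect a read pair whose ends overlap exactly.

    read2 is assumed to come from the opposite strand, so it is
    reverse-complemented first. Returns the connected sequence, or None
    when no exact overlap of at least min_overlap bases is found.
    """
    mate = reverse_complement(read2)
    # Try the longest possible overlap first.
    for size in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-size:] == mate[:size]:
            return read1 + mate[size:]
    return None


# Hypothetical 13 bp fragment sequenced with two 8 bp reads:
# 8 + 8 > 13, so the reads overlap by 3 bases.
fragment = "ACGTTGCAAGGCT"
read1 = fragment[:8]
read2 = reverse_complement(fragment[-8:])
connected = connect_pair(read1, read2)
```

A production tool would also have to tolerate sequencing errors and ambiguous overlaps, which this sketch deliberately ignores.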
Using RNA-seq, tens of thousands of novel transcripts and isoforms have been identified (Djebali et al., Nature, 2012; Cabili et al., Genes & Development, 2011). The discovery of this hidden transcriptome has renewed the need to distinguish coding from noncoding RNA. However, most previous coding potential prediction methods rely heavily on alignment, either pairwise alignment to search for protein evidence or multiple alignments to calculate a phylogenetic conservation score (such as CPC, PhyloCSF and RNACode). This is because most previously identified transcripts, including protein-coding RNAs and short housekeeping or regulatory RNAs such as snRNAs, snoRNAs and tRNAs, are highly conserved. While still very useful, these approaches have several limitations:

Most lncRNAs are less conserved and tend to be lineage specific, which greatly limits the discriminative power of alignment-based methods. For example, of 550 lncRNAs detected in zebrafish, only 29 had detectable sequence similarity with putative mammalian orthologs (Ulitsky et al., Cell, 2011).
A significant fraction of protein-coding genes may have an alternatively processed isoform or one transcribed from an alternative promoter. These ncRNAs cannot be correctly classified through homology search because they have significant matches to protein-coding genes.
Alignment-based methods are extremely slow. For example, CPC takes 6050 CPU minutes (> 4 days) to evaluate 14,000 lncRNA transcripts.
Reliability depends on alignment quality. Most multiple-alignment tools use heuristic search and do not guarantee optimal alignments.
CPAT overcomes the above problems by using a logistic regression model based on four pure sequence linguistic features: 1) ORF size, 2) ORF coverage, 3) Fickett TESTCODE and 4) hexamer usage bias. A linguistic-feature-based method does not require other genomes or protein databases to perform alignment and is more robust. Because it is alignment-free, it runs much faster and is easier to use. For example, CPAT takes several minutes to evaluate the above 14,000 lncRNAs. More importantly, compared with alignment-based approaches, it performs better, with improved sensitivity and specificity (0.966 tested on human gene annotation).
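The first two features named in the CPAT entry above, ORF size and ORF coverage, can be computed without any alignment. The sketch below assumes that an ORF runs from an ATG to the next in-frame stop codon on the forward strand and that coverage is ORF length divided by transcript length; these definitions and the function names are illustrative assumptions, not CPAT's own code. The Fickett TESTCODE and hexamer features are not covered.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq: str) -> int:
    """Length in nucleotides of the longest forward-strand ORF.

    An ORF is taken to start at an ATG and end at the first in-frame
    stop codon (the stop codon is counted in the length).
    """
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best


def orf_features(seq: str):
    """Return (ORF size, ORF coverage) for one transcript sequence."""
    size = longest_orf_length(seq)
    coverage = size / len(seq) if seq else 0.0
    return size, coverage


# Hypothetical 16-nt transcript containing ATG AAA TTT TAG.
size, coverage = orf_features("CCATGAAATTTTAGGG")
```

Both values depend only on the transcript itself, which is the reason such features are cheap to compute compared with a homology search.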
Coding Potential Calculator (CPC) is a Support Vector Machine-based
classifier to assess the protein-coding potential of a transcript (i.e.,
whether a cDNA/RNA transcript could encode a peptide or not) based on
six biologically meaningful sequence features. It takes nucleotide
FASTA sequences as input and generates output about the coding status
and the "supporting evidence" for the sequence.
file access.ilm) - Academic Research license"
We propose a novel way of analyzing reads that integrates genomic locations and local coverage, and delivers all above mentioned predictions in a single step. Our program, CRAC, uses a double k-mer profiling approach to detect candidate mutations, indels, splice or fusion junctions in each single read.
CarthaGène is genetic/radiation hybrid mapping software. CarthaGène looks for multiple-population maximum likelihood consensus maps using a fast EM algorithm for maximum likelihood estimation and powerful ordering algorithms.
ChIPDiff provides a solution for the identification of Differential Histone Modification Sites (DHMSs) by comparing two ChIP-seq libraries (L1 and L2). An HMM is employed in ChIPDiff to infer the states of histone modification changes.
ChromHMM is software for learning and characterizing chromatin states. ChromHMM can integrate multiple chromatin datasets such as ChIP-seq data of various histone modifications to discover de novo the major recurring combinatorial and spatial patterns of marks. ChromHMM is based on a multivariate Hidden Markov Model that explicitly models the presence or absence of each chromatin mark. The resulting model can then be used to systematically annotate a genome in one or more cell types. By automatically computing state enrichments for large-scale functional and annotation datasets, ChromHMM facilitates the biological characterization of each state. ChromHMM also produces files with genome-wide maps of chromatin state annotations that can be directly visualized in a genome browser.
Clover is a program for identifying functional sites in DNA sequences. If you give it a set of DNA sequences that share a common function, it will compare them to a library of sequence motifs (e.g. transcription factor binding patterns), and identify which if any of the motifs are statistically overrepresented in the sequence set.
COMMET (“COmpare Multiple METagenomes”) provides a global similarity overview between all datasets of a large metagenomic project.

Directly from non-assembled reads, all-against-all comparisons are performed through an efficient indexing strategy. The results are then stored as bit vectors, a compressed representation of read files, which can be used to further combine read subsets by common logical operations. Finally, COMMET computes a clustering of the metagenomic datasets, which is visualized with dendrograms and heatmaps.
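The idea of representing read subsets as bit vectors and combining them with logical operations can be sketched in a few lines. The example below uses plain Python integers as bit vectors, with bit i standing for read i of a dataset; the helper names and the read indices are hypothetical and say nothing about COMMET's actual file format.

```python
def to_bit_vector(selected_indices) -> int:
    """Encode a subset of reads as an integer; bit i is set if read i is kept."""
    bits = 0
    for i in selected_indices:
        bits |= 1 << i
    return bits


def from_bit_vector(bits: int):
    """Decode a bit vector back into the sorted list of read indices."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Hypothetical results: reads of dataset A that are similar to dataset B,
# and reads of dataset A that are similar to dataset C.
a_similar_to_b = to_bit_vector([0, 2, 3, 7])
a_similar_to_c = to_bit_vector([2, 3, 5])

in_both = a_similar_to_b & a_similar_to_c     # AND: similar to B and to C
in_either = a_similar_to_b | a_similar_to_c   # OR: similar to B or to C
only_b = a_similar_to_b & ~a_similar_to_c     # AND NOT: similar to B only
```

Each subset costs one bit per read, so combining subsets does not require re-reading or copying the read files themselves.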
Compareads is a tool designed to extract similar reads between potentially huge metagenomic datasets (i.e., hundreds of millions reads per dataset)

WARNING: Compareads is no longer maintained. You can now use the newer COMMET software, which does more and performs better than Compareads.
This program determines consensus patterns in unaligned sequences. The algorithm is based on a matrix representation of a consensus pattern. Sequence motif discovery
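A matrix representation of a consensus pattern, as mentioned in the entry above, can be pictured as one column of base counts per motif position. The sketch below builds such a matrix from sites that are already aligned, which is an assumption made for brevity (the program itself works on unaligned sequences); the function names and the example sites are hypothetical.

```python
from collections import Counter


def count_matrix(sites):
    """Return one Counter of base counts per position of equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(site[pos] for site in sites) for pos in range(length)]


def consensus(matrix) -> str:
    """Return the most frequent base at each position (ties resolved as A, C, G, T)."""
    return "".join(max("ACGT", key=lambda base: column[base]) for column in matrix)


# Hypothetical aligned binding sites.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
matrix = count_matrix(sites)
pattern = consensus(matrix)
```

Finding the matrix when the site positions are unknown is the hard part of motif discovery and is not attempted here.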
Corset is a command-line program to go from a de novo transcriptome assembly to gene-level counts. The software takes a set of reads that have been multi-mapped to the transcriptome (where multiple alignments per read were reported) and hierarchically clusters the transcripts based on the proportion of shared reads and expression patterns. It reports the clusters and gene-level counts for each sample, which are easily tested for differential expression with count-based tools such as edgeR and DESeq.
CrossMap is a program for convenient conversion of genome coordinates (or annotation files) between different assemblies (such as Human hg18 (NCBI36) <> hg19 (GRCh37), Mouse mm9 (MGSCv37) <> mm10 (GRCm38)).
It supports most commonly used file formats including SAM/BAM, Wiggle/BigWig, BED, GFF/GTF, VCF.
CrossMap is designed to liftover genome coordinates between assemblies. It’s not a program for aligning sequences to reference genome.
We do not recommend using CrossMap to convert genome coordinates between species.
Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. It accepts aligned RNA-Seq reads and assembles the alignments into a parsimonious set of transcripts. Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one.
A tool for clustering large numbers of short, similar DNA sequences.
This program was originally designed for clustering targeted 16S rRNA pyrosequencing reads.
DSK is a k-mer counting software, similar to Jellyfish. Jellyfish is very fast but limited to large-memory servers and k ≤ 32. In contrast, DSK supports large values of k, and runs with (almost-)arbitrarily low memory usage and reasonably low temporary disk usage. DSK can count k-mers of large Illumina datasets on laptops and desktop computers.
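The trade-off between memory and repeated passes mentioned in the DSK entry above can be illustrated with a small sketch. The first function counts all k-mers in memory at once; the second counts only one hash-selected slice of the k-mers per pass, so each pass holds fewer distinct k-mers. This is a simplified in-memory illustration under stated assumptions, not DSK's implementation: the function names and the use of `zlib.crc32` as the partitioning hash are hypothetical, and reverse complements are not merged.

```python
import zlib
from collections import Counter


def iter_kmers(read: str, k: int):
    """Yield every substring of length k of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers(reads, k: int) -> Counter:
    """Count all k-mers at once (memory grows with the number of distinct k-mers)."""
    counts = Counter()
    for read in reads:
        counts.update(iter_kmers(read, k))
    return counts


def count_kmers_in_passes(reads, k: int, n_passes: int):
    """Yield (k-mer, count) pairs, keeping only one hash partition in memory per pass.

    reads must be re-iterable (for example a list), because it is scanned
    once per pass.
    """
    for current in range(n_passes):
        counts = Counter()
        for read in reads:
            for kmer in iter_kmers(read, k):
                if zlib.crc32(kmer.encode("ascii")) % n_passes == current:
                    counts[kmer] += 1
        yield from counts.items()


# Hypothetical reads.
reads = ["ACGTACGT", "CGTA"]
all_at_once = count_kmers(reads, 4)
in_passes = dict(count_kmers_in_passes(reads, 4, 3))
```

More passes mean less memory per pass but more scans of the input, which is the kind of compromise a disk-based counter has to tune.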
A classifier that predicts 5′ Drosha processing sites in hairpins that are candidate miRNAs. The classifier, called Microprocessor SVM, correctly predicts the processing site for 50% of known human 5′ miRNAs, and 90% of its predictions are within two nucleotides of the true site.
Reference-free short read error correction from diploid genomes, with explicit modeling of heterozygous sites.
EMBOSS (European Molecular Biology Open Software Suite) is a new suite of freely available programs and libraries for sequence analysis.
Eval is a flexible tool for analyzing the performance of gene-structure prediction programs. It provides summaries and graphical distributions for many statistics describing any set of annotations, regardless of their source. It also compares sets of predictions to standard annotations and to one another.
Eye of GNOME is the official image viewer for the GNOME desktop environment. Unlike some other image viewers, Image Viewer will only display images. It does, however, provide basic effects for improved viewing, such as zooming, full-screen, rotation, and transparent image background control.
Annotate long non-coding RNAs (lncRNAs) based on reconstructed transcripts from RNA-seq data (either with or without a reference genome)
FRCbam is a tool for evaluating and analyzing de novo assemblies and assemblers. The tool has already been successfully applied in several de novo .

FRCbam: tool to compute Feature Response Curves in order to validate and rank assemblies and assemblers
Distance Based Phylogeny Reconstruction Packages
FastQ Screen is a simple application which allows you to search a large sequence dataset against a panel of different genomes to determine from where the sequences in your data originate. It was built as a QC check for sequencing pipelines but may also be useful in characterising metagenomic samples. When running a sequencing pipeline it is useful to know that your sequencing runs contain the types of sequence they're supposed to. Your search libraries might contain the genomes of all of the organisms you work on, along with PhiX, Vectors or other contaminants commonly seen in sequencing experiments.
A quality control tool for high throughput sequence data.
FastSimBac is a simulator of the coalescent process with bacterial recombination that simulates genealogies spatially
across chromosomes as a Markov process. Rather than implementing the coalescent with cross-over, it uses the coalescent
with gene conversion only (Wiuf and Hein, 2000). Parts of the code are based on MaCS (Chen et al 2009). Recombination
events occur only on the local ARG at the current position on the sequence instead of anywhere on the ARG, and can
coalesce to any lineage on the local ARG.
These changes make the algorithm faster than simMLST (Didelot et al 2009) or SimBac (Brown et al 2015) for whole genomes
and elevated recombination rates.

FastSimBac also supports all the demographic history semantics of MS. Typing ./fastSimBac with no arguments at the
command line lists the usage parameters. Most command line arguments are the same as those in ms.
The FindPeaks application can be used for several purposes:

Converting Eland, Maq (.map), BED or other files into WIG files (See Supported Input formats)
Identifying areas of enrichment (ChIP-Seq analysis)
Noise estimation in quantifying enrichment (in progress)
Flexbar is a software to preprocess high-throughput sequencing data efficiently. It demultiplexes barcoded runs and removes adapter sequences. Moreover, trimming and filtering features are provided. Flexbar increases mapping rates and improves genome and transcriptome assemblies. It supports next-generation sequencing data from Illumina, Roche 454, and the SOLiD platform. Recognition is based on exact overlap sequence alignment.
GAM (Genomic Assemblies Merger) is a free parallel software tool that integrates two different assemblies and improves the overall quality of the genome sequences by merging them.
The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyze high-throughput sequencing data. The toolkit offers a wide variety of tools, with a primary focus on variant discovery and genotyping as well as strong emphasis on data quality assurance. Its robust architecture, powerful processing engine and high-performance computing features make it capable of taking on projects of any size.
The GNU Compiler Collection includes front ends for C, C++, Objective-C, Fortran, Java, Ada, and Go, as well as libraries for these languages (libstdc++, libgcj,...). GCC was originally written as the compiler for the GNU operating system. The GNU system was developed to be 100% free software, free in the sense that it respects the user's freedom.
Gblocks is a computer program written in ANSI C language that eliminates poorly aligned positions and divergent regions of an alignment of DNA or protein sequences. These positions may not be homologous or may have been saturated by multiple substitutions and it is convenient to eliminate them prior to phylogenetic analysis.
Gene discovery for eukaryotic genomes. Whole gene prediction
GeneTorrent is a set of executables for accessing data in the Cancer Genomics Hub (CGHub), a secure repository for storing, cataloging, and accessing cancer genome sequences, alignments, and mutation information from the Cancer Genome Atlas (TCGA) consortium and related projects.
GENEPOP is a population genetics software package originally developed by Michel Raymond (Raymond@isem.univ-montp2.fr) and Francois Rousset (Rousset@isem.univ-montp2.fr), at the Laboratoire de Génétique et Environnement, Montpellier, France.
The Wise2 package is now a rather stately bioinformatics package that has been around for a while. Its key programs are genewise, a program for aligning proteins or protein HMMs to DNA, and dynamite, a rather cranky "macro language" which automates the production of dynamic programming.

Wise2 is maintained by Ewan Birney. I had thought that most of Wise2 development would have stopped around 3 years ago due to the development of Guy Slater's excellent exonerate package, which handles many of the problems the Wise2 package does but around 1,000-fold faster.
The GenomeTools genome analysis system is a free collection of bioinformatics tools (in the realm of genome informatics) combined into a single binary named gt. It is based on a C library named “libgenometools” which consists of several modules.
The Gibbs Motif Sampler will allow you to identify motifs, conserved regions, in DNA or protein sequences. Sequence motif discovery
Gist contains software tools for support vector machine classification and for kernel principal components analysis.
Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.

Git is easy to learn and has a tiny footprint with lightning fast performance. It outclasses SCM tools like Subversion, CVS, Perforce, and ClearCase with features like cheap local branching, convenient staging areas, and multiple workflows.
Glint is a general purpose nucleic acid sequence aligner.

It is based on a seed-and-extend alignment approach.
Initially designed to compare large genomes, it now handles different alignment tasks:

aligning large chromosomes of genomes
aligning a genome against itself
aligning a set of reads against a genome
aligning a set of paired-end reads against a genome

In addition, glint encapsulates a conversion utility between a large number of formats (fastq, fasta, bam, sam, bed)
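The seed-and-extend approach named in the Glint entry above can be sketched generically: index short exact words (seeds) of the target, look up the words of the query, and extend each hit in both directions. The version below extends only while bases match exactly and allows no gaps; it is a textbook-style illustration with hypothetical names and parameters, not Glint's algorithm.

```python
def seed_and_extend(query: str, target: str, seed_len: int = 4):
    """Return (query_start, target_start, length) of the longest exact match.

    Seeds are exact words of seed_len bases shared by query and target;
    each seed hit is extended left and right while the bases are identical.
    Returns (0, 0, 0) when no seed is shared.
    """
    # Index every seed-length word of the target by its start positions.
    index = {}
    for t in range(len(target) - seed_len + 1):
        index.setdefault(target[t:t + seed_len], []).append(t)

    best = (0, 0, 0)
    for q in range(len(query) - seed_len + 1):
        for t in index.get(query[q:q + seed_len], []):
            q_start, t_start = q, t
            # Extend to the left.
            while q_start > 0 and t_start > 0 and query[q_start - 1] == target[t_start - 1]:
                q_start -= 1
                t_start -= 1
            q_end, t_end = q + seed_len, t + seed_len
            # Extend to the right.
            while q_end < len(query) and t_end < len(target) and query[q_end] == target[t_end]:
                q_end += 1
                t_end += 1
            if q_end - q_start > best[2]:
                best = (q_start, t_start, q_end - q_start)
    return best


# Hypothetical sequences sharing the word ACGTACG.
hit = seed_and_extend("CCACGTACGAA", "GGGACGTACGTTTT")
```

Real aligners replace the exact extension with a scored, gap-tolerant one, but the seeding step that limits where extension is attempted follows the same pattern.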
GMAP is a standalone program for mapping and aligning cDNA sequences to a genome. The program maps and aligns a single sequence with minimal startup time and memory requirements, and provides fast batch processing of large sequence sets. The program generates accurate gene structures, even in the presence of substantial polymorphisms and sequence errors, without using probabilistic splice site models. Methodology underlying the program includes a minimal sampling strategy for genomic mapping, oligomer chaining for approximate alignment, sandwich DP for splice site detection, and microexon identification with statistical significance testing.
Gnuplot is a portable command-line driven graphing utility for Linux, OS/2, MS Windows, OSX, VMS, and many other platforms
Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks.
GUIDANCE is meant to be used for weighting, filtering or masking unreliably aligned positions in sequence alignments before subsequent analysis. For example, align codon sequences (nucleotide sequences that code for proteins) using PAGAN, remove columns with low GUIDANCE scores, and use the remaining alignment to infer positive selection using the branch-site dN/dS test. Other analyses where GUIDANCE filtering could be useful include phylogeny reconstruction, reconstruction of the history of specific insertion and deletion events, inference of recombination events, etc.
GUIDANCE2 also provides a set of alternative alignments that can be used when adopting a statistical point of view, i.e., performing statistical analyses that rely on many possible alignments supported by the data.
HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.
HG-CoLoR (Hybrid Graph for the error Correction of Long Reads) is a hybrid method for the error correction of long reads that follows the main idea from NaS to produce corrected long reads from assemblies of related accurate short reads.
The HH-suite is an open-source software package for sensitive protein sequence searching. It contains programs that can search for similar protein sequences in protein sequence databases. These sequence searches are a standard tool in modern biology with which the function of unknown proteins can be inferred from their sequence.
HMMER is an implementation of profile HMM methods for sensitive database searches using multiple sequence alignments as queries. Sequence motif recognition
Sequence motif discovery
HMPP (Hybrid Multicore Parallel Programming) is a set of development tools for hybrid multicore programming. HMPP is a commercial product of CAPS entreprise.
The HPG Variant suite is an ambitious project aimed to provide a complete suite of tools to work with genomic variation data, from VCF tools to variant profiling or genomic statistics. It is being implemented using High Performance Computing technologies to provide the best performance possible.
A C library for reading/writing high-throughput sequencing data
Analysing high-throughput sequencing data with Python
The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations.
IThOS is a software package dedicated to the design of primers
InStruct implements the Markov Chain Monte Carlo algorithm for the generalized Bayesian clustering method to estimate the self-fertilization rates and cluster individuals into subpopulations simultaneously using genotype data consisting of unlinked markers.
Efficient target prediction incorporating accessibility of interaction sites.
Intel C++ Compiler, also known as icc or icl, is a group of C and C++ compilers from Intel available for OS X, Linux, Windows and Intel-based Android devices.
Java is a general-purpose computer programming language that is concurrent, class-based, object-oriented, and specifically designed to have as few implementation dependencies as possible. It is intended to let application developers "write once, run anywhere" (WORA), meaning that compiled Java code can run on all platforms that support Java without the need for recompilation. Java applications are typically compiled to bytecode that can run on any Java virtual machine (JVM) regardless of computer architecture. As of 2015, Java is one of the most popular programming languages in use, particularly for client-server web applications, with a reported 9 million developers.
(Source : Wikipedia)
KMC is a disk-based program for counting k-mers from (possibly gzipped) FASTQ/FASTA files.
The Kepler Project is dedicated to furthering and supporting the capabilities, use, and awareness of the free and open source, scientific workflow application, Kepler. Kepler is designed to help scientists, analysts, and computer programmers create, execute, and share models and analyses across a broad range of scientific and engineering disciplines. Kepler can operate on data stored in a variety of formats, locally and over the internet, and is an effective environment for integrating disparate software components, such as merging "R" scripts with compiled "C" code, or facilitating remote, distributed execution of models.
KoriBlast is a reliable graphical environment dedicated to sequence data mining. KoriBlast combines
BLAST searches with advanced data visualisation and management capabilities driven by a friendly
graphical user interface.
Linear Algebra PACKage.
LAPACK is written in Fortran 90 and provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems.
LASTZ is a program for aligning DNA sequences, a pairwise aligner. Originally designed to handle sequences the size of human chromosomes and from different species, it is also useful for sequences produced by NGS sequencing technologies such as Roche 454.
LoRDEC is a program to correct sequencing errors in long reads from 3rd generation sequencing with high error rate, and is especially intended for PacBio reads. It uses a hybrid strategy, meaning that it uses two sets of reads: the reference read set, whose error rate is assumed to be small, and the PacBio read set, which is then corrected using the reference set. Typically, the reference set contains Illumina reads.
LoRMA is a tool for correcting sequencing errors in long reads such as those produced by Pacific Biosciences sequencing machines.
Logo generator for hmmer profile hmm
Lombarde proposes a method that integrates predictions of transcription factors, binding sites and operons with gene associations induced by transcriptomic data in order to produce a realistic regulatory graph. This graph results from the solution of a combinatorial problem implemented using Answer Set Programming, a logic-based paradigm which enables an effective encoding and solving of complex combinatorial problems.
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
With the improvement of sequencing techniques, chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq) is becoming popular for studying genome-wide protein-DNA interactions. To address the lack of powerful ChIP-Seq analysis methods, we present a novel algorithm, named Model-based Analysis of ChIP-Seq (MACS), for identifying transcription factor binding sites. MACS captures the influence of genome complexity to evaluate the significance of enriched ChIP regions, and MACS improves the spatial resolution of binding sites by combining the information of both sequencing tag position and orientation. MACS can easily be used for ChIP-Seq data alone, or with a control sample to increase specificity.
MAFFT is a multiple sequence alignment program for unix-like operating systems
The MCL algorithm is short for the Markov Cluster Algorithm, a fast and scalable unsupervised cluster algorithm for graphs (also known as networks) based on simulation of (stochastic) flow in graphs.
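The flow simulation can be sketched as alternating two matrix operations on a column-stochastic matrix: expansion (squaring) and inflation (element-wise powering followed by renormalisation). The version below is a small pure-Python sketch under assumptions, not the MCL program itself: the function names, the added self-loops, the inflation value of 2.0, the fixed iteration count and the 1e-6 threshold are all illustrative choices.

```python
def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def mcl(adjacency, inflation=2.0, iterations=50):
    """Cluster a graph given as a square 0/1 adjacency matrix."""
    n = len(adjacency)
    # Assumption: add self-loops before normalising, a common preprocessing step.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        # Expansion: one step of flow spreading (matrix squaring).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones, renormalise.
        m = normalize_columns([[x ** inflation for x in row] for row in m])
    # Each non-empty row lists the members of one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)
```

Higher inflation values tend to produce finer-grained clusters; dense list-of-lists matrices like these are only practical for very small graphs.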
Mobile Element Locator Tool
The MEME Suite allows you to:
* discover motifs using MEME or GLAM2 on groups of related DNA or protein sequences,
* search sequence databases using motifs,
* compare a motif to all motifs in a database of motifs, and
* associate motifs with Gene Ontology terms via their putative target genes.
Sequence motif discovery
Sequence motif recognition
MEtools is a software suite for automatic metabolic network reconstruction. Gene regulatory network analysis
MIRA is able to perform true hybrid de-novo assemblies using reads gathered through Sanger, 454 or Solexa sequencing technologies Sequence assembly (de-novo assembly)
ModuleOrganizer : detecting modules in families of transposable elements
Mass Spectrometry driven BLAST: a specialised BLAST-based protocol developed for identification of proteins by sequence similarity searches using peptide sequences produced by the interpretation of tandem mass spectra. MS-BLAST is described in detail in: Shevchenko, A., Sunyaev, S., Loboda, A., Shevchenko, A., Bork, P., Ens, W., and Standing, K. (2001). Charting the proteomes of organisms with unsequenced genomes by MALDI-Quadrupole Time-of-Flight mass spectrometry and BLAST homology searching, Anal Chem 73, 1917-1926.
MUMmer is a modular system for the rapid whole genome alignment of finished or draft sequence. This package provides an efficient suffix tree library, seed-and-extend alignment, SNP detection, repeat detection, and visualization tools
MAKER is a portable and easily configurable genome annotation pipeline. Its purpose is to allow smaller eukaryotic and prokaryotic genome projects to independently annotate their genomes and to create genome databases. MAKER identifies repeats, aligns ESTs and proteins to a genome, produces ab-initio gene predictions and automatically synthesizes these data into gene annotations having evidence-based quality values. MAKER is also easily trainable: outputs of preliminary runs can be used to automatically retrain its gene prediction algorithm, producing higher quality gene-models on subsequent runs. MAKER's inputs are minimal and its outputs can be directly loaded into a GMOD database. They can also be viewed in the Apollo genome browser; this feature of MAKER provides an easy means to annotate, view and edit individual contigs and BACs without the overhead of a database. MAKER should prove especially useful for emerging model organism projects with minimal bioinformatics expertise and computer resources.
Maq is a set of programs that map and assemble fixed-length Solexa/SOLiD reads in a fast and accurate way. Sequence assembly (de-novo assembly)
MendelSoft is open source software that detects marker genotyping incompatibilities (Mendelian errors only) in complex pedigrees using weighted constraint satisfaction techniques. The input of the software is pedigree data with genotyping data at a single locus. The output of the software is a list of individuals for which the removal of their genotyping data restores consistency. This list is of minimum size when the program ends.

Another possibility is to find the most probable consistent correction with respect to a Bayesian formulation of the problem. In this case, the output of the software is a list of individuals for which predicted genotypes differ from their genotyping data and such that the corresponding joint probability for the whole problem is maximum.
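To make the notion of a Mendelian error concrete, the sketch below checks parent-child trios at a single locus. It is a deliberately simplified stand-in, not MendelSoft's weighted constraint satisfaction approach: it looks at each trio in isolation, and the function names and data layout are hypothetical.

```python
def is_mendelian_consistent(child, father, mother):
    """Return True if the child's genotype can be inherited from the parents.

    Each genotype is a pair of alleles, e.g. ("A", "B").
    """
    c1, c2 = child
    return (c1 in father and c2 in mother) or (c2 in father and c1 in mother)


def find_mendelian_errors(pedigree, genotypes):
    """List children whose genotype conflicts with their two typed parents.

    pedigree maps child -> (father, mother); genotypes maps individual -> alleles.
    Trios with an untyped member are skipped.
    """
    errors = []
    for child, (father, mother) in pedigree.items():
        if all(ind in genotypes for ind in (child, father, mother)):
            if not is_mendelian_consistent(
                genotypes[child], genotypes[father], genotypes[mother]
            ):
                errors.append(child)
    return errors


pedigree = {"kid1": ("dad", "mom"), "kid2": ("dad", "mom")}
genotypes = {
    "dad": ("A", "A"),
    "mom": ("A", "B"),
    "kid1": ("A", "B"),
    "kid2": ("B", "B"),  # impossible: dad carries no B allele
}
errors = find_mendelian_errors(pedigree, genotypes)
```

A trio-by-trio check flags the child even when the wrong genotype is actually a parent's, and it cannot see inconsistencies that only appear across several generations, which is why a global formulation over the whole pedigree is needed in practice.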
Generator of reads or mate-pairs from an adaptable sequencing error model (Sanger, 454 or Illumina) and a database.
Minia is a short-read assembler based on a de Bruijn graph, capable of assembling a human genome on a desktop computer in a day. The output of Minia is a set of contigs. Minia produces results of similar contiguity and accuracy to other de Bruijn assemblers (e.g. Velvet).
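For readers unfamiliar with de Bruijn graph assembly, the following is a naive sketch of the idea: nodes are (k-1)-mers, each k-mer in a read contributes an edge, and a contig is spelled by following unambiguous edges. It says nothing about how Minia itself stores or traverses the graph; the function names are hypothetical, and the traversal only checks out-degree, which is a simplification.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes it points to."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Follow unambiguous edges from start; stop at a branch, dead end, or cycle."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break
        contig += nxt[-1]  # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ACGTT", "CGTTG", "GTTGA"]
graph = build_de_bruijn(reads, k=4)
contig = extend_contig(graph, "ACG")
```

Real assemblers also have to handle sequencing errors, reverse complements and repeats, none of which this toy traversal attempts.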
Package management system
miRDeep2 is a completely overhauled tool which discovers microRNA genes by analyzing sequenced RNAs. The tool reports known and hundreds of novel microRNAs with high accuracy in seven species representing the major animal clades. The low consumption of time and memory combined with user-friendly interactive graphic output makes miRDeep2 accessible for straightforward application in current research.

Molecular Descriptor Lab for computing structural and physicochemical properties of molecules from their 3D structures.
This project seeks to develop a single piece of open-source, expandable software to fill the bioinformatics needs of the microbial ecology community
MrBayes is a program for Bayesian inference and model choice across a wide range of phylogenetic and evolutionary models. MrBayes uses Markov chain Monte Carlo (MCMC) methods to estimate the posterior distribution of model parameters.
Software for aligning several biological sequences simultaneously on computers running UNIX.
NCBI C++ Toolkit provides free, portable, public domain libraries with no restrictions on use, on Unix, MS Windows, and Mac OS platforms.
Entrez Direct (EDirect) is an advanced method for accessing the NCBI’s set of interconnected databases (publication, sequence, structure, gene, variation, expression, etc.) from a UNIX terminal window. Functions take search terms from command-line arguments. Individual operations are combined to build multi-step queries. Record retrieval and formatting normally complete the process.
The NCBI Software Development Toolkit was developed for the production and distribution of GenBank, Entrez, BLAST, and related services by NCBI
NetLogo is a multi-agent programmable modeling environment. It is used by tens of thousands of students, teachers and researchers worldwide. It also powers HubNet participatory simulations. It is authored by Uri Wilensky and developed at the CCL. You can download it free of charge.
Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages.

Its fluent DSL simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters.
An alignment tool for aligning short sequences against an indexed set of reference sequences. Typically used for aligning Illumina single end and paired end reads. Uses base qualities and affine gap penalties to find the most probable alignment location of the read.
The OBITools package is a set of programs specifically designed for analyzing NGS data in a DNA metabarcoding context, taking into account taxonomic information. It is distributed as an open source software available on the following website: http://metabarcoding.org/obitools.

Citation: Boyer F., Mercier C., Bonin A., Taberlet P., Coissac E. (2014) OBITools: a Unix-inspired software package for DNA metabarcoding. Molecular Ecology Resources, submitted.

De novo transcriptome assembler for very short reads
GNU Octave is a high-level interpreted language, primarily intended for numerical computations. It provides capabilities for the numerical solution of linear and nonlinear problems, and for performing other numerical experiments. It also provides extensive graphics capabilities for data visualization and manipulation. Octave is normally used through its interactive command line interface, but it can also be used to write non-interactive programs. The Octave language is quite similar to Matlab so that most programs are easily portable.
Accurate inference of orthogroups, orthologues, gene trees and rooted species tree made easy!
Ortholog groups of protein sequences
This tool infers population structure and identifies outlier loci that are candidates for local adaptation. It is based on a hierarchical factor model in which population structure is captured using latent factors. To identify candidates for local adaptation, the hierarchical factor model searches for loci that are atypically related to population structure as measured by the latent factors. Parameter inference is based on a Markov chain Monte Carlo (MCMC) algorithm. The tool returns (1) a matrix of latent factors (also called scores) to capture population structure, (2) a matrix of factor loadings to measure the relationships between SNPs and latent factors, and (3) a list of Bayes factors. The SNPs with the largest Bayes factors are the candidates for local adaptation.
An additional, faster version is also provided that is suitable for very large datasets containing more than half a million genetic markers. The fast version is based on Principal Component Analysis and does not use an MCMC algorithm. It also returns a matrix of scores and factor loadings. Instead of using Bayes factors to rank SNPs, outlier SNPs are found using a summary statistic based on factor loadings.
PHYLIP is a free package of programs for inferring phylogenies
PAGAN is a general-purpose method for the alignment of sequence graphs. PAGAN is based on the phylogeny-aware progressive alignment algorithm and uses graphs to describe the uncertainty in the presence of characters at certain sequence positions. However, graphs also allow describing the uncertainty in input sequences and modelling e.g. homopolymer errors in Roche 454 reads, or representing inferred ancestral sequences against which other sequences can then be aligned. PAGAN is still under development and will hopefully evolve to an easy-to-use, general-purpose method for phylogenetic sequence alignment.
Phylogenetic Analysis by Maximum Likelihood (PAML)
A versatile software package for read mapping and integrative analysis of genomic and epigenomic variation using massively parallel DNA sequencing
Scan a DNA sequence with a position-specific scoring matrix (PSSM) Sequence motif recognition
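Scanning with a PSSM amounts to sliding a window over the sequence and summing one per-column score per base. The sketch below shows that idea only; the matrix values, the threshold and the function name `scan_pssm` are made up for illustration, and only the forward strand is scanned.

```python
def scan_pssm(sequence, pssm, threshold):
    """Return (position, score) for every window scoring at or above threshold.

    pssm is a list with one dict per motif column, mapping base -> score.
    """
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Bases missing from a column (e.g. N) contribute 0.0 by assumption.
        score = sum(column.get(base, 0.0) for column, base in zip(pssm, window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Hypothetical 3-column matrix favouring the motif "TAT".
toy_pssm = [
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 2.0},
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 2.0},
]
hits = scan_pssm("GGTATACC", toy_pssm, threshold=5.0)
```

In practice the scores are usually log-odds values derived from observed base frequencies, and the threshold controls the trade-off between sensitivity and false positives.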
PeakAnalyzer is an open-source solution for the automated processing and annotation of genome-wide signal enrichment peaks derived through ChIP-seq, ChIP-chip, DamID and other functional genomic techniques. The software can accurately subdivide multimodal regions of signal enrichment into distinct subpeaks corresponding to binding sites or chromatin modifications, retrieve genomic sequences encompassing the computed subpeak summits, and identify positional features of interest such as intersection with exon/intron gene components, proximity to up- or downstream transcriptional start sites and cis-regulatory elements.
PeakAnalyzer is implemented as a Java program encompassing two software components, PeakSplitter and PeakAnnotator. These subprograms are also implemented separately in C++ and Java so that users can choose a distribution suited to their requirements. Core facilities needing to process numerous datasets have the option to incorporate the faster C++ versions into a production workflow, whereas the Java implementations can be run either as individual command-line utilities or as a single cross-platform desktop application with an intuitive graphical interface.
PeakSeq is a program for identifying and ranking peak regions in ChIP-Seq experiments. It takes as input mapped reads from a ChIP-Seq experiment and mapped reads from a control experiment, and outputs a file with peak regions ranked by increasing Q-value. Sequence alignment analysis
PeptideShaker is a search engine independent platform for interpretation of proteomics identification results from multiple search engines, currently supporting X!Tandem, MS-GF+, MS Amanda, OMSSA, MyriMatch, Comet, Tide, Mascot, Andromeda and mzIdentML. By combining the results from multiple search engines, while re-calculating PTM localization scores and redoing the protein inference, PeptideShaker attempts to give you the best possible understanding of your proteomics data!
Perl 5 is a highly capable, feature-rich programming language with over 27 years of development.
PfamScan is used to search a FASTA sequence against a library of Pfam HMMs.
The program PHASE implements a Bayesian statistical method for reconstructing haplotypes from population genotype data. The software can deal with SNP, microsatellite, and other multi-allelic loci (e.g. tri-allelic SNPs, and HLA alleles), in any combination, and missing data are allowed.
PhyloCSF is a method to determine whether a multi-species nucleotide sequence alignment is likely to represent a protein-coding region. PhyloCSF does not rely on homology to known protein sequences; instead, it examines evolutionary signatures characteristic to alignments of conserved coding regions, such as the high frequencies of synonymous codon substitutions and conservative amino acid substitutions, and the low frequencies of other missense and non-sense substitutions (CSF = Codon Substitution Frequencies). One of PhyloCSF’s main current applications is to help distinguish protein-coding and non-coding RNAs represented among novel transcript models obtained from high-throughput transcriptome sequencing.
Phylogenetic Consensus for Regulatory Motif Identification
Picard comprises Java-based command-line utilities that manipulate SAM files, and a Java API (SAM-JDK) for creating new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported.
a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads
PLAST is a parallel alignment search tool for comparing large protein banks.

PLAST runs 5 to 50 times faster than the NCBI-BLAST software when processing large amounts of data.

The software is freely available for download under the Affero GPL version 3 License.
PRANK is a probabilistic multiple alignment program for DNA, codon and amino-acid sequences. It’s based on a novel algorithm that treats insertions correctly and avoids over-estimation of the number of deletion events. In addition, PRANK borrows ideas from maximum likelihood methods used in phylogenetics and correctly takes into account the evolutionary distances between sequences. Lastly, PRANK allows for defining a potential structure for sequences to be aligned and then, simultaneously with the alignment, predicts the locations of structural units in the sequences.
Pratt is a tool for discovering patterns that match some minimum number of a set of sequences. The sequences are input all in one file in either FastA or Swiss-Prot format. Sequence motif discovery
Primer3 is a program for designing PCR primers (PCR = "Polymerase Chain Reaction").
You can use the PRINSEQ (PReprocessing and INformation of SEQuences) tool to:

- Generate statistics of your sequence data for sequence length, GC content, quality scores, n-plicates, complexity, tag sequences, poly-A/T tails, odds ratios, ...
- Filter your data to remove sequence copies, short or long sequences, low quality or complexity sequences, sequences with Ns, ...
- Reformat your data to convert between FASTQ and FASTA+QUAL format or DNA and RNA, rename sequence IDs, change the width of sequence lines, ...
- Trim sequences to a certain length, trim poly-A/T tails, trim low quality ends, trim bases from the ends, ...
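Two of the operations listed above, GC-content statistics and basic filtering, can be sketched in a few lines. This is an illustrative sketch rather than PRINSEQ's implementation; the function names and the default thresholds are assumptions.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def filter_reads(reads, min_len=1, max_n=0):
    """Drop exact duplicates, short reads, and reads with too many Ns."""
    seen = set()
    kept = []
    for read in reads:
        if read in seen or len(read) < min_len or read.upper().count("N") > max_n:
            continue
        seen.add(read)
        kept.append(read)
    return kept


reads = ["ACGT", "ACGT", "AC", "ACNT"]
kept = filter_reads(reads, min_len=3)
gc_values = [gc_content(r) for r in kept]
```

Only exact duplicates are removed here; reverse-complement or prefix duplicates would need additional checks.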
Protomata for pattern discovery and matching using automata Sequence motif comparison
Motif database search
Sequence motif discovery
Sequence motif recognition
Predict Protein Secondary Structure
The pyrocleaner is intended to clean the reads included in an sff file in order to ease the assembly process. It filters sequences on criteria such as length, complexity, and number of undetermined bases, which has been shown to correlate with poor quality, as well as multiple-copy reads. It can also clean paired-end sff files, generating on one side an sff file with the validated paired-ends and on the other the sequences that can be used as shotgun reads.
Python is a programming language that lets you work more quickly and integrate your systems more effectively. You can learn to use Python and see almost immediate gains in productivity and lower maintenance costs.
QTL Cartographer is a suite of programs to map quantitative traits using a map of molecular markers.
QTLMap is a software dedicated to the detection of QTL from experimental designs in outbred population. QTLMap software is developed by the Animal Genetics Division at INRA (French National Institute for Agronomical Research). The statistical techniques used are linkage analysis (LA) and linkage disequilibrium linkage analysis (LDLA) using interval mapping. Different versions of the LA are proposed from a quasi Maximum Likelihood approach to a fully linear (regression) model. The LDLA is a regression approach (Legarra and Fernando, 2009). The population may be sets of half-sib families or mixture of full- and half- sib families. The computations of Phase and Transmission probabilities are optimized to be rapid and as exact as possible. QTLMap is able to deal with large numbers of markers (SNP) and traits (eQTL).
QIIME (canonically pronounced "chime") stands for Quantitative Insights Into Microbial Ecology. QIIME is an open source software package for comparison and analysis of microbial communities, primarily based on high-throughput amplicon sequencing data (such as SSU rRNA) generated on a variety of platforms, but also supporting analysis of other types of data (such as shotgun metagenomic data). QIIME takes users from their raw sequencing output through initial analyses such as OTU picking, taxonomic assignment, and construction of phylogenetic trees from representative sequences of OTUs, and through downstream statistical analysis, visualization, and production of publication-quality graphics. QIIME has been applied to studies based on billions of sequences from thousands of samples.
Quake is a package to correct substitution sequencing errors in experiments with deep coverage (e.g. >15X), specifically intended for Illumina sequencing reads.
R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.
RAxML (Randomized Axelerated Maximum Likelihood) is a program for sequential and parallel Maximum Likelihood based inference of large phylogenetic trees. It can also be used for post-analyses of sets of phylogenetic trees, analyses of alignments, and evolutionary placement of short reads. It was originally derived from fastDNAml, which in turn was derived from Joe Felsenstein's dnaml, which is part of the PHYLIP package.
The RNAmmer 1.2 server predicts 5s/8s, 16s/18s, and 23s/28s ribosomal RNA in full genome sequences
RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data.
Ray is parallel software that computes de novo genome assemblies of next-gen sequencing data using the message passing interface (MPI).
RepeatMasker is a program that screens DNA sequences for low-complexity regions and repeated sequences. It can use libraries of repeated sequences such as those available in RepBase.
Reptile is a software developed in C++ for correcting sequencing errors in short reads from next-gen sequencing platforms
Easy identification and removal of rRNA-like sequences.
RISOTTO discovers motifs composed of many binding sites separated by spacers. Each binding site is called a box.
Collection of tools for 454 sequencer
Regulatory Sequence Analysis Tools
A dynamic, open source programming language with a focus on simplicity and productivity. It has an elegant syntax that is natural to read and easy to write.
SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
SEG program of Wootton and Federhen, for identifying and masking segments of low compositional complexity in amino acid sequences. This program is inappropriate for masking nucleotide sequences and, in fact, may strip some nucleotide ambiguity codes from nt. sequences as they are being read.
SHRiMP2 is a software package for mapping reads from a donor genome against a target (reference) genome. SHRiMP2 was primarily developed to work with short reads produced by Next Generation Sequencing (NGS) machines.
SOAP has evolved from a single alignment tool into a tool package that provides a full solution for next generation sequencing data analysis. Currently, it consists of a new alignment tool (SOAPaligner/soap2), a re-sequencing consensus sequence builder (SOAPsnp), an indel finder (SOAPindel), a structural variation scanner (SOAPsv) and a de novo short reads assembler (SOAPdenovo).
SOAP3 is a GPU-based software for aligning short reads with a reference sequence.
The tools in this project provide the ability to create de novo assemblies from SOLiD colorspace reads, thus facilitating the characterization of genomic sequences for which no closely related reference genome exists. The short-read assembler at the core of this pipeline is Velvet, developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute. The tools in this package, together with Velvet, can be used to assemble high-coverage SOLiD reads from microbial genomes into nucleotide sequence contigs of several thousand bases, and scaffolds of tens of thousands of bases. Sequence assembly (de-novo assembly)
SPAdes – St. Petersburg genome assembler – is intended for both standard isolates and single-cell MDA bacteria assemblies.
SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine.
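The "serverless, zero-configuration, transactional" properties can be seen with Python's standard `sqlite3` module. In this small sketch the table and row contents are invented for the example; the second transaction fails on a primary-key conflict and is rolled back as a whole.

```python
import sqlite3

# An in-memory database: no server process and no configuration file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tools (name TEXT PRIMARY KEY, category TEXT)")

# Used as a context manager, the connection commits on success
# and rolls back if an exception escapes the block.
with conn:
    conn.execute("INSERT INTO tools VALUES (?, ?)", ("MAFFT", "alignment"))

try:
    with conn:
        conn.execute("INSERT INTO tools VALUES (?, ?)", ("Minia", "assembly"))
        # Violates the primary key, so the whole transaction is undone.
        conn.execute("INSERT INTO tools VALUES (?, ?)", ("MAFFT", "duplicate"))
except sqlite3.IntegrityError:
    pass

names = [row[0] for row in conn.execute("SELECT name FROM tools ORDER BY name")]
```

Replacing `":memory:"` with a file path stores the same database in a single ordinary file.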
The NCBI SRA Toolkit enables reading ("dumping") of sequencing files from the SRA database and writing ("loading") files into the .sra format.
SSPACE standard is a stand-alone program for scaffolding pre-assembled contigs using NGS paired-read data.
SWI-Prolog offers a comprehensive free Prolog environment.
Sablotron is an XML toolkit which implements XSLT, DOM, and XPath. Sablotron is written in C++, and it can be used from C, Perl, Python, PHP, ObjectPascal, and via a command line interface. It supports the XSLT 1.0, XPath 1.0, and DOM Level 2 W3C specifications. It is designed to be as compact and portable as possible, and is maintained as an Open Source project by Ginger Alliance.
Scripture is a method for transcriptome reconstruction that relies solely on RNA-Seq reads and an assembled genome to build a transcriptome ab initio. The statistical methods to estimate read coverage significance are also applicable to other sequencing data. Scripture also has modules for ChIP-Seq peak calling.
The SEQIO package is a set of C functions which can read and write biological sequence files formatted using various file formats and which can be used to perform database searches on biological databases. All of the code is packaged together into a single file, making it easy to incorporate into your programs.
SICStus is a state-of-the-art, ISO standard compliant, Prolog development system.
SignalP 4.1 predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks.
align an expressed DNA sequence with a genomic sequence, allowing for introns
Enabling users to have full control of their environment
SMILE is a tool for inferring motifs in sets of sequences. It was originally designed to detect exceptional sites (transcription factor binding or regulatory sites) in DNA sequences. The strength of SMILE is that it can infer structured motifs separated by constrained distance intervals. WARNING: SMILE is obsolete; RISOTTO is its replacement. Sequence motif discovery
Genetic variant annotation and effect prediction toolbox.
Sputnik searches DNA sequence files in FASTA format for microsatellite repeats.
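A microsatellite is a short unit repeated in tandem, which a regular expression with a back-reference can find when the repeat is perfect. The sketch below is not Sputnik's algorithm; the unit length range (2 to 5 bases), the minimum of three copies and the function name are assumptions made for the example.

```python
import re

# A unit of 2-5 bases followed by at least two more copies of itself.
MICROSAT = re.compile(r"([ACGT]{2,5}?)\1{2,}")


def find_microsatellites(seq):
    """Return (start, unit, copies) for each perfect tandem repeat found."""
    hits = []
    for m in MICROSAT.finditer(seq.upper()):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits


hits = find_microsatellites("TTACACACACGG")
```

Because the pattern requires exact copies, repeats interrupted by a mismatch or an indel are reported as shorter repeats or missed altogether.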
Squizz is a sequence/alignment format checker, but it has some conversion capabilities too. Most common sequence and alignment formats are supported : - EMBL, FASTA, GCG, GDE, GENBANK, IG, NBRF, PIR (codata), RAW, and SWISSPROT. - CLUSTAL, FASTA, MEGA, MSF, NEXUS, PHYLIP (interleaved and sequential) and STOCKHOLM.
Stacks is a software pipeline for building loci from short-read sequences, such as those generated on the Illumina platform. Stacks was developed to work with restriction enzyme-based data, such as RAD-seq, for the purpose of building genetic maps and conducting population genomics and phylogeography. Nucleic acid sequence analysis
A fully developed set of DNA sequence assembly (Gap4 and Gap5), editing and analysis tools (Spin) for Unix, Linux, MacOSX and MS Windows.
Suffix-tree analyser (STAN): looking for nucleotidic and peptidic patterns in chromosomes
Spliced Transcripts Alignment to a Reference © Alexander Dobin, 2009-2016 https://www.ncbi.nlm.nih.gov/pubmed/23104886
Stitch/audy-stitch joins overlapping paired-end Illumina reads.
Stitch assembles overlapping paired-end reads into a single contig for each pair. This increases the read length and hopefully the quality of a de novo or reference assembly. Stitch is multi-threaded and will automatically use all cores on your system unless told otherwise. Stitch currently only reads FASTQ format. QSEQ and FASTA formats to come. Reads that are not found to overlap are dumped in a file called <prefix>-singletons and are in FASTQ format. These can then be trimmed and combined with contigs to do a de novo assembly.
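The core idea of joining a pair can be sketched as follows: reverse-complement the second read, then look for the longest suffix of the first read that equals a prefix of it. This is a simplified illustration with exact matching only, not Stitch's actual code; the function names and the `min_overlap` default are assumptions.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def stitch_pair(forward, reverse, min_overlap=5):
    """Merge a read pair into one contig, or return None if no overlap is found."""
    rc = reverse_complement(reverse)
    # Try the longest possible overlap first.
    for size in range(min(len(forward), len(rc)), min_overlap - 1, -1):
        if forward[-size:] == rc[:size]:
            return forward + rc[size:]
    return None


fragment = "ACGTTGCAAGGCTTAC"
read1 = fragment[:10]                       # forward read
read2 = reverse_complement(fragment[6:])    # reverse read from the other end
contig = stitch_pair(read1, read2, min_overlap=4)
```

A pair for which `stitch_pair` returns `None` corresponds to the non-overlapping reads that the description says are written to the singletons file. A practical tool would also tolerate mismatches in the overlap and use base qualities to resolve them.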
Subread package: high-performance read alignment, quantification and mutation discovery

The Subread package comprises a suite of software programs for processing next-gen sequencing read data including:
- Subread: an accurate and efficient aligner for mapping both genomic DNA-seq reads and RNA-seq reads (for the purpose of expression analysis).
- Subjunc: an RNA-seq aligner suitable for all purposes of RNA-seq analyses.
- featureCounts: a highly efficient and accurate read summarization program.
- exactSNP: a SNP caller that discovers SNPs by testing signals against local background noises.
SWIG is a software development tool that connects programs written in C and C++ with a variety of high-level programming languages. SWIG is used with different types of target languages including common scripting languages such as Javascript, Perl, PHP, Python, Tcl and Ruby.
SyMAP is a system for computing, displaying, and analyzing syntenic alignments between genomes. Its features include the following:

- GUI manager to run synteny computations and view results.
- Multiple display modes (dot plot, circular, side-by-side, closeup, 3D).
- Draft sequence ordering by synteny.
- Construction of cross-species gene families.
- Complete annotation-based queries.
- All displays accessible through the web.
- Can align FPC maps to sequenced genomes.
T-COFFEE is a multiple sequence alignment program.
Transcription Element Search System: Partial Weight Matrices (tessWms)

This program reads (selected) PWMs from a file and predicts binding sites on DNA sequences read from another file.
TFBS is a computational framework for transcription factor binding site analysis. It can also be used for analysis involving other DNA patterns representable by matrices, e.g. splice sites.
TMAP is a fast and accurate alignment software for short and long nucleotide sequences produced by next-generation sequencing technologies.
Prediction of transmembrane helices in proteins
Tablet is a lightweight, high-performance graphical viewer for next generation sequence assemblies and alignments.
TakeABreak is a tool that can detect inversion breakpoints directly from raw NGS reads, without the need of any reference genome and without de novo assembling the genomes. Its implementation has a very limited memory impact allowing its usage on common desktop computers and acceptable runtime (Illumina reads simulated at 2x40x coverage from human chromosome 22 can be treated in less than two hours, with less than 1GB of memory). Nucleic acid feature prediction
A fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.
TRANSCompel is the only database presenting information on combinatorial gene transcriptional regulation and protein-protein interactions between different transcription factors bound to their cognate promoter elements.
Trim Galore! is a wrapper script to automate quality and adapter trimming as well as quality control, with some added functionality to remove biased methylation positions for RRBS sequence files (for directional, non-directional (or paired-end) sequencing)
RNA-Seq De novo Assembly
UCSC genome browser 'kent' bioinformatic utilities
These are only the command line bioinformatic utilities from the kent source tree. This is not the genome browser install.
Highly acclaimed software package for predicting the folding or hybridization of one or two single-stranded RNA or DNA molecules using experimentally derived free energy parameters. Free energy minimization methods predict optimal and close to optimal structures, while partition function calculations predict base pair probabilities, stochastic samples of foldings or hybridizations, and melting profiles as a function of increasing temperature.
UCLUST is a clustering algorithm that uses USEARCH as a subroutine to achieve exceptionally high speed and sensitivity.
VCFtools is a program package designed for working with VCF files, such as those generated by the 1000 Genomes Project. The aim of VCFtools is to provide easily accessible methods for working with complex genetic variation data in the form of VCF files.
Valgrind is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile your programs in detail. You can also use Valgrind to build new tools.
The VEP determines the effect of your variants (SNPs, insertions, deletions, CNVs or structural variants) on genes, transcripts, and protein sequence, as well as regulatory regions.
RNA Secondary Structure Prediction and Comparison RNA structure prediction
Nucleic acid structure comparison
a versatile software tool for efficiently solving large scale sequence matching tasks
weighted automata pattern matching
See Wapam?
"wconsensus" differs from the "consensus" program in that the user does not directly supply the width of the pattern being sought. However, the user still needs to vary a bias that directly influences the width of the pattern. Sequence motif discovery
WebLogo 3 (http://code.google.com/p/weblogo/) is a tool for creating sequence logos from biological sequence alignments. It can be run on the command line, as a standalone webserver, as a CGI webapp, or as a python library.
ZGRViewer is a graph visualizer implemented in Java and based upon the Zoomable Visual Transformation Machine. It is specifically aimed at displaying graphs expressed using the DOT language from AT&T GraphViz and processed by programs dot, neato or others such as twopi.
bamUtil is a repository that contains several programs that perform operations on SAM/BAM files. All of these programs are built into a single executable, bam.
Illumina sequencing instruments generate per-cycle BCL basecall files as primary sequencing output, but many downstream analysis applications use per-read FASTQ files as input. bcl2fastq combines these per-cycle BCL files from a run and translates them into FASTQ files. bcl2fastq can begin BCL conversion as soon as the first read has been completely sequenced.
While converting, bcl2fastq also separates multiplexed samples (demultiplexing). Multiplexed sequencing allows you to run multiple individual samples in one lane. The samples are identified by index sequences that were attached to the template during sample prep. The multiplexed sample FASTQ files are assigned to projects and samples based on a user-generated sample sheet, and stored in corresponding project and sample directories. If no sample sheet is provided, a single Project directory is created named with a suffix of the FlowCell ID.
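The demultiplexing idea described above can be sketched in a few lines of Python. This is only an illustration of assigning reads to samples by index sequence, not how bcl2fastq is implemented; the function name `demultiplex`, the in-memory data layout, and the `"Undetermined"` bucket for unknown indexes are assumptions made for the example.

```python
def demultiplex(reads, sample_sheet, unknown="Undetermined"):
    """Group reads by sample, using the index sequence attached to each read.

    reads        -- iterable of (index_sequence, read_sequence) pairs (hypothetical layout)
    sample_sheet -- dict mapping an index sequence to a sample name
    unknown      -- bucket for reads whose index is not in the sheet (assumption)
    """
    by_sample = {}
    for index_seq, read in reads:
        sample = sample_sheet.get(index_seq, unknown)
        by_sample.setdefault(sample, []).append(read)
    return by_sample


if __name__ == "__main__":
    sheet = {"ACGT": "sampleA", "TTAG": "sampleB"}
    reads = [("ACGT", "GATTACA"), ("TTAG", "CCCGGG"), ("NNNN", "TTTT")]
    print(demultiplex(reads, sheet))
```

Exact matching on the index is the simplest possible policy; tolerating mismatches in the index would need an extra comparison step that is left out here.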
bedClip - Remove lines from bed file that refer to off-chromosome places.
bedClip input.bed chrom.sizes output.bed
-verbose=2 - set to get list of lines clipped and why
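As a rough illustration of what clipping "off-chromosome" lines means, the following Python sketch keeps only BED records whose interval fits inside the chromosome given in a chrom.sizes table. It is a simplified stand-in, not the bedClip source; the function name `clip_bed` and the exact rejection rules (unknown chromosome, negative start, end beyond the chromosome size, empty interval) are assumptions.

```python
def clip_bed(bed_lines, chrom_sizes):
    """Return the BED lines whose interval lies within the chromosome bounds.

    bed_lines   -- iterable of BED lines (chrom, start, end, then optional fields)
    chrom_sizes -- dict mapping a chromosome name to its size in bases
    """
    kept = []
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or malformed line, not a BED record
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        size = chrom_sizes.get(chrom)
        # Assumed rejection rules: unknown chromosome, out-of-range or empty interval.
        if size is None or start < 0 or end > size or start >= end:
            continue
        kept.append(line)
    return kept


if __name__ == "__main__":
    sizes = {"chr1": 100}
    print(clip_bed(["chr1\t0\t50", "chr1\t90\t120"], sizes))
```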
bedGraphToBigWig v 4 - Convert a bedGraph file to bigWig format.
bedGraphToBigWig in.bedGraph chrom.sizes out.bw
where in.bedGraph is a four column file in the format:
<chrom> <start> <end> <value>
and chrom.sizes is two column: <chromosome name> <size in bases>
and out.bw is the output indexed big wig file.
Use the script: fetchChromSizes to obtain the actual chrom.sizes information
from UCSC, please do not make up a chrom sizes from your own information.
The input bedGraph file must be sorted, use the unix sort command:
sort -k1,1 -k2,2n unsorted.bedGraph > sorted.bedGraph
-blockSize=N - Number of items to bundle in r-tree. Default 256
-itemsPerSlot=N - Number of data points bundled at lowest level. Default 1024
-unc - If set, do not use compression.
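The sort order required above (chromosome name first, then numeric start) can also be reproduced or checked in Python. This is a small sketch; the function names are hypothetical, and the chromosome comparison uses plain code-point string order, which corresponds to running the unix sort in the C locale.

```python
def bedgraph_key(line):
    """Sort key matching `sort -k1,1 -k2,2n`: chromosome name, then numeric start."""
    chrom, start = line.split()[:2]
    return (chrom, int(start))


def sort_bedgraph(lines):
    """Return the bedGraph lines sorted by chromosome and start position."""
    return sorted(lines, key=bedgraph_key)


def is_sorted_bedgraph(lines):
    """Check whether the lines are already in the required order."""
    keys = [bedgraph_key(line) for line in lines]
    return all(a <= b for a, b in zip(keys, keys[1:]))


if __name__ == "__main__":
    unsorted = ["chr2\t5\t10\t1.0", "chr10\t0\t5\t2.0", "chr2\t0\t5\t3.0"]
    print(sort_bedgraph(unsorted))
```

Note that names are compared as strings, so "chr10" sorts before "chr2".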
BEDOPS v2.4.2 is a suite of tools to address common questions raised in genomic studies — mostly with regard to overlap and proximity relationships between data sets. It aims to be scalable and flexible, facilitating the efficient and accurate analysis and management of large-scale genomic data.
Python tools/libraries for: Influence graph analysis, Metabolic network expansion, PSN optimizer
DNA Sequence Assembly Program Sequence assembly (de-novo assembly)
caspo combines BioASP and CellNOpt to provide an easy to use software for learning Protein Signaling Logic Models from a Prior Knowledge Network in .sif format and a phospho-proteomics dataset in MIDAS format.
Cassiopee is an index and search tool and library for bioinformatics (it could be used in other fields, but provides additional features useful in sequence analysis). It is a complete rewrite of the Ruby Cassiopee gem. It scans an input genomic sequence (DNA/RNA/protein) and searches for a subsequence with exact matching or allowing substitutions (Hamming distance) and/or insertions/deletions. It also supports alphabet ambiguity.

This program provides both a binary (Cassiopee) and a shared library.

Index is based on a suffix tree with compression. It is possible to save the indexed sequence for later use without the need to reindex the whole sequence (for large data sets).

See cassiopee -h for all options.

The expected input is a one-line sequence with no header. cassiopeeknife (see later chapter) can be used to convert Fasta sequences into cassiopee input sequences.
Search and retrieval
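To make "search allowing substitutions (Hamming distance)" concrete, here is a naive Python sketch. It scans every window of the sequence directly and does not use a suffix tree, so it is not how Cassiopee works internally; the function name `hamming_search` and its parameters are hypothetical.

```python
def hamming_search(sequence, pattern, max_subst=0):
    """Return (position, mismatches) for every window of `sequence` that
    matches `pattern` with at most `max_subst` substitutions."""
    hits = []
    m = len(pattern)
    if m == 0:
        return hits
    for pos in range(len(sequence) - m + 1):
        mismatches = 0
        for a, b in zip(sequence[pos:pos + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_subst:
                    break  # too many substitutions, give up on this window
        if mismatches <= max_subst:
            hits.append((pos, mismatches))
    return hits


if __name__ == "__main__":
    print(hamming_search("acgtacgaacgt", "acgt", max_subst=1))
```

With `max_subst=0` this reduces to exact matching; insertions and deletions would require an edit-distance comparison instead.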
Cluster Database at High Identity with Tolerance
Celera Assembler is a de novo whole-genome shotgun (WGS) DNA sequence assembler. It reconstructs long sequences of genomic DNA from fragmentary data produced by whole-genome shotgun sequencing. Celera Assembler has enabled many advances in genomics, including the first whole genome shotgun sequence of a multi-cellular organism (Myers 2000) and the first diploid sequence of an individual human (Levy 2007). Celera Assembler was developed at Celera Genomics starting in 1999. It was released to SourceForge in 2004 as the wgs-assembler under the GNU General Public License. The pipeline revised for 454 data was named CABOG (Miller 2008).
Hierarchical classification by likelihood linkage analysis for heterogeneous data. The CHAVLH program provides a complete agglomerative hierarchical clustering pipeline that can classify both individuals and descriptive variables.
Circos is a software package for visualizing data and information. It visualizes data in a circular layout — this makes Circos ideal for exploring relationships between objects or positions.
clasp is an answer set solver for (extended) normal logic programs. It combines the high-level modeling capacities of answer set programming (ASP) with state-of-the-art techniques from the area of Boolean constraint solving. The primary clasp algorithm relies on conflict-driven nogood learning, a technique that proved very successful for satisfiability checking (SAT).

Unlike other learning ASP solvers, clasp does not rely on legacy software, such as a SAT solver or any other existing ASP solver. Rather, clasp has been genuinely developed for answer set solving based on conflict-driven nogood learning. clasp can be applied as an ASP solver (on SMODELS format, as output by Gringo), as a SAT solver (on a simplified version of DIMACS/CNF format), or as a PB solver (on OPB format).
Common Lisp is a high-level, general-purpose, object-oriented, dynamic, functional programming language.
Clustal Omega is the latest addition to the Clustal family. It offers a significant increase in scalability over previous versions, allowing hundreds of thousands of sequences to be aligned in only a few hours. It will also make use of multiple processors, where present. In addition, the quality of alignments is superior to previous versions, as measured by a range of popular benchmarks.
ClustalW2 is a general purpose multiple sequence alignment program for DNA or proteins.
Software contains clustering routines that can be used to analyze gene expression data. Routines for hierarchical (pairwise simple, complete, average, and centroid linkage) clustering, k-means and k-medians clustering, and 2D self-organizing maps are included.
compounds2sbml is a computer program whose purpose is to transform a list of MetaCyc identifiers into an SBML file containing only species. This is done by using a graph database containing a version of MetaCyc.

Each identifier in the tgdb is an integer, which is why a prefix is given.

Usage : ./compounds2sbml <tgdb_file> <input_file_name> <output_file_name> <prefix>

Dependencies : none, tinyxml2 and tinygraphdb are integrated

Input :
- <tgdb_file> : A graph database in tgdb format (often produced by metacyc2tgdb)
- <input_file_name> : A text file that contains a list of metabolite identifiers from MetaCyc
- <output_file_name> : The name of the output file
- <prefix> : A string that will be added at the beginning of all identifiers (because SBML does not allow numbers in id field)

Output :
- An SBML file containing species only. Each species MetaCyc ID has been mapped onto the TGDB file. The new ID is created as follows : prefix (string) + metacyc_node_id (integer)
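The ID construction rule above (prefix string followed by the integer node ID) can be sketched as follows. The function name `make_sbml_id` is hypothetical, and the validity check reflects the SBML SId syntax (a letter or underscore first, then letters, digits or underscores), which is why a bare integer needs a prefix.

```python
import re

# SBML SId syntax: starts with a letter or '_', followed by letters, digits or '_'.
SID_PATTERN = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def make_sbml_id(prefix, node_id):
    """Build an SBML identifier as prefix (string) + node_id (integer)."""
    sbml_id = f"{prefix}{node_id}"
    if not SID_PATTERN.match(sbml_id):
        raise ValueError(f"{sbml_id!r} is not a valid SBML identifier")
    return sbml_id


if __name__ == "__main__":
    print(make_sbml_id("M_", 42))
```

Using the same prefix everywhere keeps the identifiers produced by different steps of a workflow consistent with each other.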
Crass is a program that searches through raw metagenomic reads for Clustered Regularly Interspersed Short Palindromic Repeats
cutadapt removes adapter sequences from high-throughput sequencing data. This is usually necessary when the read length of the sequencing machine is longer than the molecule that is sequenced, for example when sequencing microRNAs.
A bash pipeline for ddRAD sequencing
deFuse is a software package for gene fusion discovery using RNA-Seq data. The software uses clusters of discordant paired end alignments to inform a split read alignment analysis for finding fusion boundaries. The software also employs a number of heuristic filters in an attempt to reduce the number of false positives and produces a fully annotated output for each predicted fusion.
deepTools: tools for exploring deep sequencing data
DIALIGN is a software program for multiple sequence alignment. DIALIGN constructs pairwise and multiple alignments by comparing entire segments of the sequences. No gap penalty is used.
DIAMOND is a new alignment tool for aligning short DNA sequencing reads to a protein reference database such as NCBI-NR. On Illumina reads of length 100-150bp, in fast mode, DIAMOND is about 20,000 times faster than BLASTX, while reporting about 80-90% of all matches that BLASTX finds, with an e-value of at most 1e-5. In sensitive mode, DIAMOND is about 2,500 times faster than BLASTX, finding more than 94% of all matches.
discoSnp is designed for discovering Single Nucleotide Polymorphisms (SNPs) from raw set(s) of reads obtained with Next Generation Sequencers (NGS).
Note that the number of input read sets is not constrained: it can be one, two, or more. Note also that no other data, such as a reference genome or annotations, are needed.
The software is composed of two modules. The first module, kissnp2, detects SNPs from read sets. The second module, kissreads, enhances the kissnp2 results by computing, per read set and for each SNP found, i/ its mean read coverage and ii/ the (phred) quality of the reads generating the polymorphism.

Note that from release of DiscoSnp++-2.0.0, the tool also detects close SNPs and indels.

Note that from release of DiscoSnp++-2.3.0, a new script DiscoSnpRad.sh is available for RadSeq analysis.
SNP detection
Assemble genomes and find variants with DISCOVAR & DISCOVAR de novo
DRMAA is an interface standard for job submission and control to Distributed Resource Manager (DRM) systems such as Condor, PBS/OpenPBS/Torque, Platform LSF, Sun's N1 Grid Engine and others.
Command-line tools for processing biological sequencing data. Barcode demultiplexing, adapter trimming, etc. Primarily written to support an Illumina based pipeline - but should work with any FASTQs.

* fastq-mcf - Scans a sequence file for adapters, and, based on a log-scaled threshold, determines a set of clipping parameters and performs clipping. Also does skewing detection and quality filtering.

* fastq-multx - Demultiplexes a fastq. Capable of auto-determining barcode id's based on a master set fields. Keeps multiple reads in-sync during demultiplexing. Can verify that the reads are in-sync as well, and fail if they're not.
* fastq-join - Similar to audy's stitch program, but in C, more efficient and supports some automatic benchmarking and tuning. It uses the same "squared distance for anchored alignment" as other tools.
* varcall - Takes a pileup and calculates variants in a more easily parameterized manner than some other tools.
ESPRIT is a standard implementation of the complete-linkage hierarchical clustering method. It can comfortably process several tens of thousands of sequences on a desktop computer. Its authors have used the algorithm to process 1.1M human gut sequences on a small computer cluster consisting of 100 nodes.
A generic tool for pairwise sequence comparison
FreeBayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment.
Global Alignment Short Sequence Search Tool
genBlast is a program suite consisting of two programs: genBlastA and genBlastG. genBlastA parses local alignments, or high-scoring segment pairs (HSPs), produced by local sequence alignment programs such as BLAST and WU-BLAST, and identifies groups of HSPs. Each HSP group represents a putative gene that is homologous to the query protein (gene). genBlastG is a fast homology-based gene finder. It builds on genBlastA. In particular, it takes genBlastA output as input and defines gene models based on the HSP group. The development of genBlast programs is supported by Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants, a Michael Smith Foundation for Health Research (MSFHR) Establishment Grant, and Simon Fraser University Community Trust Endowment Fund (CTEF).
Software to design primers
go-perl is a collection of perl modules for working with ontologies and data, in particular the Gene Ontology and other Open Bio-Ontologies.
grappe is a pattern matching program. It looks, in a text, for a set of patterns containing don't care symbols (wildcards) of unbounded or bounded length. The size of patterns, as well as their length, is unlimited. Sequence motif recognition
hapflk is a software implementing the hapFLK [1] and FLK [2] tests for the detection of selection signatures based on multiple population genotyping data.
Hapsembler is a haplotype-specific genome assembly toolkit that is designed for genomes that are rich in SNPs and other types of polymorphism. Hapsembler can be used to assemble reads from a variety of platforms including Illumina and Roche/454. Hapsembler is available free of charge for academic and commercial use under the GNU General Public License (GPL).
html4blast is a simple hypertext Blast output formatter. It provides links to retrieve databank entries from their accession numbers or names via www servers and, if wanted, an alignment summary image.
Concert Technology allows the user to describe a model that is quite similar to the natural mathematical programming description. ILOG Concert Technology provides a set of lightweight C++, Java and .NET objects for representing optimization problems. It is included as part of ILOG CPLEX, ILOG CP Opti
ImageMagick® is a software suite to create, edit, compose, or convert bitmap images. It can read and write images in a variety of formats (over 100) including DPX, EXR, GIF, JPEG, JPEG-2000, PDF, PNG, Postscript, SVG, and TIFF. Use ImageMagick to resize, flip, mirror, rotate, distort, shear and transform images, adjust image colors, apply various special effects, or draw text, lines, polygons, ellipses and Bézier curves.
InterProScan is a tool that combines different protein signature recognition methods into one resource. Sequence motif recognition
Reporting server that collects SGE statistics. Installed on the arco VM. Uses iReport for report creation. Data are uploaded with a local script made with Talend Studio.
If you need to submit numerous job commands via qsub in the cluster, you can use the joborganizer tool.
JobOrganizer is a utility tool we developed to submit a list of commands as jobs, while limiting the number of simultaneous jobs.
This makes it easier to submit numerous jobs without queuing everything at once.

It also provides the possibility to add dependencies between the jobs.

JobOrganizer is available at: /softs/local/utils/joborganizer

More info with:

man /softs/local/utils/man1/joborganizer.1
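The core idea, running a list of commands while capping how many run at the same time, can be sketched in Python. This is not the JobOrganizer implementation and does not call qsub: it runs commands locally through a thread pool, and the names `run_limited` and `run_shell` are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_shell(command):
    """Run one shell command locally and return its exit code."""
    return subprocess.run(command, shell=True).returncode


def run_limited(commands, max_parallel, runner=run_shell):
    """Run every command, with at most `max_parallel` running simultaneously.

    Results are returned in the same order as `commands`.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(runner, commands))


if __name__ == "__main__":
    # At most two of these run at any time.
    print(run_limited(["echo one", "echo two", "echo three"], max_parallel=2))
```

Dependencies between jobs are not modelled here; they would need an ordering step before submission.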
K2R reads a KAKSI XML output (.kxo) and outputs many kinds of information in raw formats (mainly FASTA format); see k2r -h.
"KAKSI takes a PDB file as input (or a PDB code if you have a mirror of the PDB on your computer) and prints an XML output with a lot of information from the PDB file and the secondary structure assignment."
A fast and accurate multiple sequence alignment algorithm
The khmer software is a set of command-line tools for working with DNA shotgun sequencing data from genomes, transcriptomes, metagenomes, and single cells. khmer can make de novo assemblies faster, and sometimes better. khmer can also identify (and fix) problems with shotgun data.
A local transcriptome assembler for SNPs, indels and AS events
KlastRunner (release 4.3) for the command-line provides the core engine of ngKLAST: use KLAST to compare sequences, use the data integration engine to add useful data to reported hits (such as Taxonomy, Gene Ontology, Enzyme, Interpro, etc.), use the data filtering engine to select relevant hits matching your requirements (% identity, % coverage, hit descriptions, etc.). In addition, KlastRunner comes with the ngKLAST's Databank Manager available as a standalone application for use either from the command-line or the graphical user interface.
KmerGenie estimates the best k-mer length for genome de novo assembly. Given a set of reads, KmerGenie first computes the k-mer abundance histogram for many values of k. Then, for each value of k, it predicts the number of distinct genomic k-mers in the dataset, and returns the k-mer length which maximizes this number. Experiments show that KmerGenie's choices lead to assemblies that are close to the best possible over all k-mer lengths.
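The procedure described above (build a k-mer abundance histogram for several values of k, estimate the number of distinct genomic k-mers, keep the k that maximizes it) can be sketched in Python. This is a deliberately simplified stand-in: KmerGenie's own estimate of genomic k-mers is more elaborate, whereas this sketch just assumes that k-mers seen at least `min_abundance` times are genomic. All names are hypothetical.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Map each abundance value to the number of distinct k-mers having it."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return Counter(counts.values())


def pick_k(reads, k_values, min_abundance=2):
    """Return the k with the most distinct k-mers seen at least `min_abundance`
    times (a crude proxy for 'genomic' k-mers; an assumption of this sketch)."""
    best_k, best_count = None, -1
    for k in k_values:
        hist = kmer_histogram(reads, k)
        genomic = sum(n for abundance, n in hist.items() if abundance >= min_abundance)
        if genomic > best_count:
            best_k, best_count = k, genomic
    return best_k


if __name__ == "__main__":
    reads = ["ACGTAC", "ACGTAC"]
    print(kmer_histogram(reads, 2), pick_k(reads, [4, 5]))
```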
LibSBML is a free, open-source programming library to help you read, write, manipulate, translate, and validate SBML files and data streams. It's not an application itself (though it does come with example programs), but rather a library you embed in your own applications.
library for handling page faults in user mode.
Expressive pattern modelling and search in DNA/protein sequence Sequence motif recognition
Structural variant and indel caller for mapped sequencing data
MapMaker is a set of applications that can be used in order to build genetic linkage maps and performing gene mapping.
Checks, for each fragment in the fragments.fasta file, 1) whether it is read coherent (each position is covered by at least one read) and 2) if it is read coherent, micro de novo assembles as far as possible on its left and right, using coherent reads.
Perl script for retrieving data from BioMart.
mawk is an interpreter for the AWK Programming Language. The AWK language is useful for manipulation of data files, text retrieval and processing, and for prototyping and experimenting with algorithms. mawk is a new awk, meaning it implements the AWK language as defined in Aho, Kernighan and Weinberger, The AWK Programming Language, Addison-Wesley Publishing, 1988 (hereafter referred to as the AWK book). mawk conforms to the Posix 1003.2 (draft 11.3) definition of the AWK language, which contains a few features not described in the AWK book, and mawk provides a small number of extensions.
MEGAHIT is a single node assembler for large and complex metagenomics NGS reads, such as soil. It makes use of succinct de Bruijn graph (SdBG) to achieve low memory assembly. MEGAHIT can optionally utilize a CUDA-enabled GPU to accelerate its SdBG construction. The GPU-accelerated version of MEGAHIT has been tested on NVIDIA GTX680 (4G memory) and Tesla K40c (12G memory) with CUDA 5.5, 6.0 and 6.5. Sequence assembly (genome assembly)
memap (MEtabolic network MAPping) is a tool that maps the identifiers of an SBML file onto a given namespace (in a TGDB file).

TGDB stands for tinygraphdb, a tabular graph database format.

memap first tries to find the ID fields of reaction SBML nodes in the TGDB file.
If the ID is found, memap checks whether the ID corresponds to a reaction and then generates the corresponding SBML node in the output file.
memerge (MEtabolic network MERGE) is a tool that merges two SBML files.

It uses the tinyxml2 library freely available here http://www.grinninglizard.com/tinyxml2/

The merging is based on the ID sbml field. Thus the merging of two SBML files with IDs from different namespaces will simply add everything from the two files.

With IDs from the same namespace, redundancy will be removed.

To map IDs from different namespaces onto the same namespace see memap.

Usage: ./memerge <sbml_file1> <sbml_file2> <output_file>

Dependencies : tinyxml2 library

Input :
- <sbml_file1> : the first SBML filename
- <sbml_file2> : the second SBML file name
- <output_file> : The output file name

Output :
- An SBML file containing SBML reactions from sbml_file1 and sbml_file2
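The ID-based merging rule described for memerge can be illustrated with plain dictionaries keyed by ID. This sketch does not parse SBML and is not the memerge code; the function name `merge_by_id` and the choice to keep the first file's entry when an ID appears in both are assumptions.

```python
def merge_by_id(first, second):
    """Merge two {id: element} mappings.

    IDs present in both inputs are kept once (the entry from `first` wins,
    an assumption of this sketch); all other entries are simply added.
    """
    merged = dict(first)
    for element_id, element in second.items():
        merged.setdefault(element_id, element)
    return merged


if __name__ == "__main__":
    file1 = {"R1": "reaction 1", "R2": "reaction 2"}
    file2 = {"R2": "reaction 2 (copy)", "R3": "reaction 3"}
    print(merge_by_id(file1, file2))
```

If the two inputs use different namespaces, no ID ever collides and the result is just the union of both files, which matches the behaviour described above.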
Membrane Helix Prediction
metacyc2tgdb is a tool to parse metacyc flat files as given by Pathway Tools and transform them into a tinygraphdb.

It uses tinygraphdb library to store everything.
RNA structure prediction program.
miRanda is an algorithm for finding genomic targets for microRNAs. This algorithm has been written in C and is available as an open-source method under the GPL. MiRanda was developed at the Computational Biology Center of Memorial Sloan-Kettering Cancer Center.
Portable MPI (Message Passing Interface, the Standard for message-passing libraries) Model Implementation
MPICH2 is an all-new implementation of MPI, designed to support research into high-performance implementations of MPI-1 and MPI-2 functionality.
Multiple alignment software for protein and nucleotide sequences. MUSCLE stands for multiple sequence comparison by log-expectation.
Naccess program calculates the atomic accessible surface defined by rolling a probe of given size around a van der Waals surface.
Nagios monitors hosts and services on your network. Actual host and service checks are performed by separate plugins which return the host or service status to Nagios.
coiled-coil prediction
NetBeans IDE is a modular, standards-based integrated development environment (IDE), written in the Java programming language.
Supports de novo and reference-guided assemblies. Supports amplicon analyses. Sequence assembly (de-novo assembly)
A language of the ML family of languages; it is an open source project led
PAL2NAL is a program that converts a multiple sequence alignment of proteins and the corresponding DNA (or mRNA) sequences into a codon alignment. The program automatically assigns the corresponding codon sequence even if the input DNA sequence has mismatches with the input protein sequence, or contains UTRs, polyA tails. It can also deal with frame shifts in the input alignment, which is suitable for the analysis of pseudogenes. The resulting codon alignment can further be subjected to the calculation of synonymous (dS) and non-synonymous (dN) substitution rates.

If the input is a pair of sequences, PAL2NAL automatically calculates dS and dN by the codeml program in PAML.
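The basic conversion from an aligned protein sequence plus its DNA to a codon alignment can be sketched as below. This is a minimal illustration that assumes the DNA is exactly the coding sequence of the protein, in frame and with no UTRs; the handling of mismatches, UTRs, polyA tails and frame shifts that PAL2NAL provides is not reproduced. The function name `codon_align` is hypothetical.

```python
def codon_align(aligned_protein, dna, gap="-"):
    """Thread the gaps of an aligned protein sequence into its coding DNA.

    Each residue consumes one codon (3 nucleotides) of `dna`; each gap in the
    protein alignment becomes three gap characters in the output.
    """
    codons = []
    pos = 0
    for residue in aligned_protein:
        if residue == gap:
            codons.append(gap * 3)
        else:
            codon = dna[pos:pos + 3]
            if len(codon) < 3:
                raise ValueError("DNA sequence is shorter than the protein requires")
            codons.append(codon)
            pos += 3
    return "".join(codons)


if __name__ == "__main__":
    print(codon_align("M-K", "ATGAAA"))
```

Applying this to every row of a protein alignment gives a codon alignment with the same column structure, three nucleotide columns per protein column.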
A sequence consensus algorithm implementation based on using directed acyclic graphs to encode multiple sequence alignment
Polyhedral Hybrid Automaton Verifyer - tool for verifying safety properties of hybrid systems
A hidden Markov Model capable of predicting both Transmembrane Topology and Signal peptides
Phrap is part of the Phred-Phrap-Consed suite commonly used for sequence assembly. Phrap (Phragment Assembly Program) is a program that assembles a genome from the results of phred. It takes as input the ".phr" files generated by the phred program and produces an ".ace" file. It is used in conjunction with other programs such as: * phd2fasta, which converts ".phr" files to the fasta format supported by the phred program. * cross_match and swat, which mask sequences belonging to vectors and thus help avoid interpretation errors. Special versions for many sequences (more than 64000) or long sequences (more than 64000 nucleotides) have also been installed on the GenOuest machines. They are called phrap.longreads, phrap.manyreads, cross_match.manyread.
PhyML [1] is a software package whose primary task is to estimate maximum likelihood phylogenies from alignments of nucleotide or amino acid sequences. It provides a wide range of options that were designed to facilitate standard phylogenetic analyses.
Fast and accurate identification of CRISPR repeats

Edgar, R.C. (2007) PILER-CR: fast and accurate identification of CRISPR repeats, BMC Bioinformatics, Jan 20;8:18.
Multiple alignments of protein sequences
Ptolemy II is a set of Java packages supporting heterogeneous, concurrent modeling and design. Its kernel package supports clustered hierarchical graphs, which are collections of entities and relations between those entities
pwt2assoc generates a file that contains the gene-reaction associations and the evidence used by pathway tools to create the associations.

example: Esi0165_0036 GO-TERM

reaction (MetaCyc ID) has been associated to gene Esi0165_0036 because they share identical Gene Ontology terms (GO-TERM)

Usage: ./pwt2assoc <reactions_file> <enzrsxns_file> <genes_file> <output_file>

The metacyc_dir must contain the following files (others are not used):
QuickTree allows the reconstruction of phylogenies for very large protein families that would be infeasible using other popular methods.
This is a program for estimating absolute rates ("r8s") of molecular evolution and divergence times on a phylogenetic tree.
re2c is a tool for writing very fast and very flexible scanners. re2c focuses on generating high efficient code for regular expression matching.
reactions2sbml is a computer program whose purpose is to transform a list of Metools reaction identifiers into an SBML file containing the given reactions and their species. This is done by using a graph database containing a version of MetaCyc.

Each identifier in the tgdb is an integer, which is why a prefix is given.

Always use the same prefix throughout the workflow.

Usage : ./reactions2sbml <tgdb_file> <reaction_list_file> <output_file_name> <prefix>

Dependencies : tinyxml2 library

Input :
- <tgdb_file> : A graph database in tgdb format (often produced by metacyc2tgdb)
- <reaction_list_file> : A text file that contains a list of reaction identifiers from Metools
- <output_file_name> : The name of the output file
- <prefix> : A string that will be added at the beginning of all identifiers (because SBML does not allow numbers in id field)

Output :
- An SBML file containing reactions and species involved in those reactions. Each reaction Metools ID has been mapped on the TGDB file. The new ID is created as follows : prefix (string) + metacyc_node_id (integer)
Read & reformat biosequences, Java command-line version
A tool that evaluates the accuracy of a genome assembly using mapped paired end reads.

REAPR is a tool that evaluates the accuracy of a genome assembly using mapped paired end reads, without the use of a reference genome for comparison. It can be used in any stage of an assembly pipeline to automatically break incorrect scaffolds and flag other errors in an assembly for manual inspection. It reports mis-assemblies and other warnings, and produces a new broken assembly based on the error calls.

REAPR was published in Genome Biology: REAPR: a universal tool for genome assembly evaluation. Genome Biology 2013, 14:R47, doi:10.1186/gb-2013-14-5-r47.
Evaluation and validation
RSeQC package provides a number of useful modules that can comprehensively evaluate high throughput sequence data especially RNA-seq data. Some basic modules quickly inspect sequence quality, nucleotide composition bias, PCR bias and GC bias, while RNA-seq specific modules evaluate sequencing saturation, mapped reads distribution, coverage uniformity, strand specificity, transcript level RNA integrity etc.
RXP is a validating XML parser written in C. It is used by the LT XML toolkit, and the Festival speech synthesis system.
Usage: ./sbml2assoc <tinygraphdb> <sbml_input> <output_fname>
SeqMonk is a program to enable the visualisation and analysis of mapped sequence data
The program 'sff2fastq' extracts read information from a SFF file,
produced by the 454 genome sequencer, and outputs the sequences and
quality scores in a FASTQ format.
Sickle is a tool that uses sliding windows along with quality and length thresholds to determine when quality is sufficiently low to trim the 3'-end of reads and when it is high enough to trim the 5'-end of reads. It also discards reads based on the length threshold. It takes the quality values and slides a window across them whose length is 0.1 times the length of the read. If this length is less than 1, the window is set to the length of the read. Otherwise, the window slides along the quality values until the average quality in the window rises above the threshold, at which point the algorithm determines where within the window the rise occurs and cuts the read and quality strings there for the 5'-end cut. Then, when the average quality in the window drops below the threshold, the algorithm determines where in the window the drop occurs and cuts both the read and quality strings there for the 3'-end cut. However, if the remaining sequence is shorter than the minimum length threshold, the read is discarded entirely. 5'-end trimming can be disabled.
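The sliding-window procedure described for Sickle can be sketched in Python as follows. It follows the description above rather than Sickle's source code, so several details are assumptions: the function name and defaults, treating "rises above" as "greater than or equal to", and locating the rise or drop as the first base in the window that is at or above (respectively below) the threshold.

```python
def window_trim(quals, threshold=20, min_length=20):
    """Return (start, end) of the region to keep, or None to discard the read.

    quals -- list of per-base quality values for one read
    """
    n = len(quals)
    if n == 0:
        return None
    window = int(0.1 * n)
    if window < 1:
        window = n  # very short read: one window spanning the whole read
    start, end = None, n
    for i in range(n - window + 1):
        win = quals[i:i + window]
        average = sum(win) / window
        if start is None:
            if average >= threshold:
                # 5'-end cut: first base in this window at or above the threshold
                start = next(i + j for j, q in enumerate(win) if q >= threshold)
        elif average < threshold:
            # 3'-end cut: first base in this window below the threshold
            end = next(i + j for j, q in enumerate(win) if q < threshold)
            break
    if start is None or end - start < min_length:
        return None  # never reached the threshold, or too short after trimming
    return (start, end)


if __name__ == "__main__":
    quals = [2] * 10 + [30] * 80 + [2] * 10
    print(window_trim(quals))
```

The returned pair can be used to slice both the read and its quality string, so the two stay in sync.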
Indexing web application and framework
RNA-seq aligner

Spliced Transcripts Alignment to a Reference
Sequence alignment construction
StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus. Its input can include not only the alignments of raw reads used by other transcript assemblers, but also alignments of longer sequences that have been assembled from those reads. To identify differentially expressed genes between experiments, StringTie's output can be processed either by the Cuffdiff or Ballgown programs.
Search for tRNA genes in genomic sequence
Tedna is a lightweight de novo transposable element assembler. It assembles the transposable elements directly from the raw reads.
tgdb2sbml produces an SBML file from a tgdb file, given an id prefix.

Because elements of a tgdb file are indexed by an integer, we add the prefix to the given integer to create a unique identifier valid for SBML.

This program uses the tinygraphdb library (integrated with the sources).

Usage: ./tgdb2sbml <tgdb_file> <sbml_file> <id_prefix>
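The prefixing step described for tgdb2sbml can be illustrated as follows. This is a hedged sketch, not the tool's source code: the function name `make_sbml_id` is hypothetical, and the validity check assumes the SBML `SId` syntax (a letter or underscore followed by letters, digits, or underscores), which is why a bare integer index cannot serve as an identifier on its own.

```python
import re

# SBML SId syntax: a letter or underscore, then letters, digits, or underscores.
_SID_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def make_sbml_id(prefix, index):
    """Build an SBML identifier by prepending a prefix to an integer index.

    Hypothetical helper for illustration; raises ValueError if the result
    is not a syntactically valid SId.
    """
    candidate = f"{prefix}{index}"
    if not _SID_PATTERN.fullmatch(candidate):
        raise ValueError(f"{candidate!r} is not a valid SBML identifier")
    return candidate
```

With a prefix such as `n_`, the integer index 42 becomes `n_42`; an empty prefix is rejected because the identifier would start with a digit.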
Hardware performance monitoring counters have recently received a lot of attention. They have been used by diverse communities to understand and improve the quality of computing systems: for example, architects use them to extract application characteristics and propose new hardware mechanisms; compiler writers study how generated code behaves on particular hardware; software developers identify critical regions of their applications and evaluate design choices to select the best performing implementation. We propose that counters be used by all categories of users, in particular non-experts, and we advocate that a few simple metrics derived from these counters are relevant and useful. For example, a low IPC (number of executed instructions per cycle) indicates that the hardware is not performing at its best; a high cache miss ratio can suggest several causes, such as conflicts between processes in a multicore environment.
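The two metrics mentioned above are simple ratios of raw counter values. The sketch below only shows the arithmetic; the function names are hypothetical, and reading the counters from hardware is outside its scope.

```python
def ipc(instructions, cycles):
    """Instructions per cycle: executed instructions divided by elapsed cycles."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def cache_miss_ratio(misses, accesses):
    """Fraction of cache accesses that missed."""
    if accesses <= 0:
        raise ValueError("accesses must be positive")
    return misses / accesses
```

For instance, 2,000,000 instructions executed over 4,000,000 cycles give an IPC of 0.5, and 250 misses out of 1,000 accesses give a miss ratio of 0.25.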
This program detects and displays tandem repeat sequences.
trimAl is a tool for the automated removal of spurious sequences or poorly aligned regions from a multiple sequence alignment.
A flexible read trimming tool for Illumina NGS data
Tuiuiu removes from a sequence, or from a set of sequences, areas as large as possible that do not contain the repeats being searched for. Tuiuiu is used as a preliminary step before applying a multiple local alignment tool.
A useful tool for shuffling biological sequences while preserving the k-let counts
Valkyrie is an open-source graphical user interface for the Valgrind 3.3.X line.
A simple C++ library for parsing and manipulating VCF files, plus many command-line utilities
Velvet is a de novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454. Sequence assembly (de-novo assembly)
XPLOR-NIH is a structure determination program (efficient molecular dynamics and minimization using internal coordinates such as torsion angles).