The BONSAI team has successfully proposed methods in high-throughput sequence indexing (seeding techniques and design) [13,14] and in genomic rearrangements [15,16] (see also our web server bioinfo.lifl.fr). We are looking for a motivated applicant who will design and implement an original algorithm, which will be tested on both simulated and real datasets. The applicant will benefit from the experience of the team and its collaborations. Researchers of the BONSAI team design bioinformatic algorithms with a special interest in high-throughput sequencing data processing and genome structure.
Despite the advent of a number of new methods during the last few years, the assembly of full-length eukaryotic genomes remains a challenging problem. It is difficult both because of its intrinsic computational complexity and because of the nature of genomic sequences, which may contain a high fraction of repetitive elements, low-complexity regions, rearrangements, large insertions or deletions, etc. The advent of high-throughput sequencing technologies made the challenge even harder: the sequence reads are much shorter and much more numerous. For example, a vertebrate genome puzzle contains several billion pieces with multiple overlapping copies of each piece. As a result, de novo assembly is still mainly unsolved for large eukaryotic genomes.

Generally, two approaches are considered. The first one is used mainly in resequencing projects. It is based on read mapping and assumes the availability of a reference sequence from the same species, against which the reads can be aligned. It thus allows identification of small local variations (substitutions or indels). The second one, called de novo sequence assembly, does not make use of any prior assembly. Both can take advantage of paired-end reads to detect structural variations such as large insertions or rearrangements.

Some works also explore a new path for the assembly problem, inspired both by the de novo sequencing problem [1,2] and by methods that have been developed for resequencing [3,4,5]. The key idea is that even when a reference sequence is not available, representative genome sequences now exist for most of the major phylogenetic clades, so a set of closely related sequences can be used to guide the assembly process. In  some of these approaches have been explored for conventional Sanger sequencing. More recently, in , a non-automated process has been used on four genomes at the intraspecific level.
In [8, 9], the authors took into account both assembly and mapping, but considered only one reference genome. In [10, 11], this approach has been used for scaffolding. In  the knowledge of predicted breakpoints from a set of variants has been used to improve assembly. We thus propose the exciting project of designing a tool that aims to remove these barriers, using a more sensitive read mapping process with a multiple reference set, combined with an assembly approach that takes into account paired-end information and structural variation through a guided phylogenetic approach.
Duration: 16 months
Salary: 2 621 gross/month
Monthly salary after taxes: around 2 138