Sequential modeling, generative recurrent neural networks, and their applications to audio

Mehri, Soroush

Permalink

https://hdl.handle.net/1866/18762

Thesis or Dissertation

Mehri_Soroush_2016_memoire.pdf (2.183Mb)

2016-12 (degree granted: 2017-03-28)

Author(s)

Mehri, Soroush

Advisor(s)

Bengio, Yoshua

Courville, Aaron

Level

Master's

Discipline

Informatique

Abstract(s)

Deep learning has established itself as the framework for realizing specialized artificial intelligence, which many see as the dream path toward a future where AI is ubiquitous, or toward what would be called artificial general intelligence. Throughout this project, our motivation has been the desire to tame this powerful learning approach in order to make considerable progress toward creating a "Talking Machine". This thesis describes a parametric statistical model for the unconditional, end-to-end generation of audio sequences, including speech, onomatopoeia, and music. Unlike prior work along these lines in the field of signal processing, the models we propose are based solely on raw audio samples, with no prior manipulation or feature extraction. The generality of our approach allows it to be applied to any other domain whose data require a sequential representation, such as natural language processing. Chapters 1 and 2 are devoted to the basic principles of machine learning and deep learning. The subsequent chapters detail the approach adopted to reach our goal.

Deep learning has proven to be by far the most promising avenue for achieving applied artificial intelligence, which many regard as the path toward an AI-powered future and, eventually, artificial general intelligence. In this work, we are interested in harnessing this powerful method to make larger strides toward creating a "Talking Machine". This thesis presents a parametric statistical model for the unconditional, end-to-end generation of audio sequences, including speech, onomatopoeia, and music. The proposed model does not rely on any of the handcrafted features developed over many years in the field of signal processing; rather, it operates directly on raw audio samples. As a general framework, it can also potentially be applied in other domains that require modeling sequential data, e.g. natural language processing. Chapters 1 and 2 give a brief overview of the background topics, including machine learning and the basic building blocks of deep learning algorithms. The following chapters present our work toward the aforementioned goal.

This document disseminated on Papyrus is the exclusive property of the copyright holders and is protected by the Copyright Act (R.S.C. 1985, c. C-42). It may be used for fair dealing and non-commercial purposes, for private study or research, criticism and review as provided by law. For any other use, written authorization from the copyright holders is required.