Audio samples accompanying the paper MelNet: A Generative Model for Audio in the Frequency Domain.
Samples generated by MelNet trained on the task of unconditional single-speaker speech generation using professionally recorded audiobook data from the Blizzard 2013 dataset.
Samples from the model without biasing or priming.
Samples from the model using a bias of 1.0.
The first 5 seconds of each audio clip are from the dataset and the remaining 5 seconds are generated by the model.
Samples generated by MelNet trained on the task of unconditional multi-speaker speech generation using noisy, multispeaker, multilingual speech data from the VoxCeleb2 dataset.
Samples from the model without biasing or priming.
Samples generated by MelNet trained on the task of unconditional music generation using recorded piano performances from the MAESTRO dataset.
Samples from the model without biasing or priming.
The first 5 seconds of each audio clip are from the dataset and the remaining 5 seconds are generated by the model.
Samples generated by MelNet trained on the task of single-speaker TTS using professionally recorded audiobook data from the Blizzard 2013 dataset.
The first audio clip for each text is taken from the dataset and the remaining 3 are samples generated by the model.
“My dear Fanny, you feel these things a great deal too much. I am most happy that you like the chain,”
Looking with a half fantastic curiosity to see whether the tender grass of early spring,
“I like them round,” said Mary. “And they are exactly the color of the sky over the moor.”
Lydia was Lydia still; untamed, unabashed, wild, noisy, and fearless.
“Oh, he has been away from New York—he has been all round the world. He doesn't know many people here, but he's very sociable, and he wants to know every one.”
Each unlabelled audio clip is taken from the dataset and the audio clip that directly follows is a sample generated by the model primed with that sequence.
Write a fond note to the friend you cherish.
Pluck the bright rose without leaves.
Two plus seven is less than ten.
He said the same phrase thirty times.
We frown when events take a bad turn.
Samples generated by MelNet trained on the task of multi-speaker TTS using noisy speech recognition data from the TED-LIUM 3 dataset.
Samples generated by the model conditioned on text and speaker ID. The conditioning text and speaker IDs are taken directly from the validation set (text in the dataset is unnormalized and unpunctuated).
it wasn't like i was asking for the code to a nuclear bunker or anything like that but the amount of resistance i got from this
and what that form is modeling and shaping is not cement
that every person here every decision that you've made today every decision you've made in your life you've not really made that decision but in fact
syria was largely a place of tolerance historically accustomed
and no matter what the rest of the world tells them they should be
the years went by and the princess grew up into a beautiful young woman
i spent so much time learning this language why do i only
and we were down to eating one meal a day running from place to place but wherever we could help we did at a certain point in time in
phrases and words even if you have a phd of chinese language you can't understand them
and when they came back and told us about it we really started thinking about the ways in which we see styrofoam every day
is only a very recent religious enthusiasm it surfaced only in the west
chances are that they are rooted in the productivity crisis
i cannot face your fears or chase your dreams and you can't do that for me but we can be supportive of eachother
the first law of travel and therefore of life you're only as strong
Samples generated by the model for selected speakers. Reference audio for each of the speakers can be found on the TED website.
A cramp is no small danger on a swim.
He said the same phrase thirty times.
Pluck the bright rose without leaves.
Two plus seven is less than ten.
The glow deepened in the eyes of the sweet girl.
Bring your problems to the wise chief.
Write a fond note to the friend you cherish.
Clothes and lodging are free to new men.
We frown when events take a bad turn.
Port is a strong wine with a smoky taste.
For comparison, we train WaveNet on the same three unconditional audio generation tasks used to evaluate MelNet (single-speaker speech generation, multi-speaker speech generation, and music generation).
Samples without biasing or priming.
Samples with priming: 5 seconds from the dataset followed by 5 seconds generated by WaveNet.
Samples without biasing or priming.
Samples without biasing or priming.
Samples from a two-stage model which separately models MIDI notes and then uses WaveNet to synthesize audio conditioned on the generated MIDI notes.
The following models were trained on the same data, with each model using a different number of tiers.