Trying Speech Synthesis with ChatTTS in Docker

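I want to try ChatTTS, which is currently at the top of the trending list on Hugging Face.

It appears to be a text-to-speech model specialized for conversation. Beyond plain text-to-audio conversion, it also seems to allow control over laughter, pauses, and interjections.

The Hugging Face page is here: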

2Noise/ChatTTS · Hugging Face

The GitHub repository is here:

GitHub - 2noise/ChatTTS: ChatTTS is a generative speech model for daily dialogue.

According to the description, only English and Chinese are supported.

Execution environment

I tried it in a dev container. The Dockerfile, devcontainer.json, and docker-compose.yml follow.

Dockerfile

I built this on top of a Hugging Face environment I had put together earlier.

FROM nvcr.io/nvidia/pytorch:22.04-py3

ARG USERNAME=vscode
ARG USER_UID=1000
ARG USER_GID=$USER_UID

ENV LANG=ja_JP.UTF-8
ENV LANGUAGE=ja_JP:ja
ENV LC_ALL=ja_JP.UTF-8
ENV TZ=JST-9
ENV TERM=xterm

RUN apt-get update \
    && groupadd --gid $USER_GID $USERNAME \
    && useradd -s /bin/bash --uid $USER_UID --gid $USER_GID -m $USERNAME \
    && apt-get install -y sudo \
    && echo $USERNAME ALL=\(root\) NOPASSWD:ALL > /etc/sudoers.d/$USERNAME \
    && chmod 0440 /etc/sudoers.d/$USERNAME \
    && apt-get -y install locales \
    && localedef -f UTF-8 -i ja_JP ja_JP.UTF-8

RUN apt-get -y install git

RUN pip install --upgrade pip
RUN pip install --upgrade setuptools

# pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
RUN python -m pip install \
    torchvision \
    torchaudio \
    invisible_watermark
RUN python -m pip install \
    accelerate \
    datasets \
    hf-doc-builder \
    huggingface-hub \
    Jinja2 \
    librosa \
    numpy \
    scipy \
    tensorboard \
    transformers \
    pytorch-lightning

RUN python -m pip install \
    omegaconf~=2.3.0 \
    tqdm \
    einops \
    vector_quantize_pytorch \
    vocos \
    IPython \
    nemo_text_processing \
    gradio

ENV HF_HOME=/work/.cache/huggingface
ENV TORCH_HOME=/work/.cache/torchvision

Besides the libraries listed in ChatTTS's requirements.txt, nemo_text_processing and gradio also turned out to be necessary.
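To find out which libraries are still missing after the image is built, a sketch like the following could be run inside the container. The package list is an assumption copied from the last RUN step of the Dockerfile above, and `missing_packages` is a hypothetical helper name, not something ChatTTS provides.

```python
from importlib import metadata


def missing_packages(names):
    """Return the distribution names from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)  # raises PackageNotFoundError if absent
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed list: the extra libraries installed in the final RUN step above.
    extras = ["omegaconf", "tqdm", "einops", "vector_quantize_pytorch",
              "vocos", "ipython", "nemo_text_processing", "gradio"]
    print("missing:", missing_packages(extras))
```

An empty list means every listed distribution was found. The lookup uses distribution names, which may differ from import names for some packages.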

devcontainer.json

Choose a name and a service name, and add whatever extensions you need.

{
    "name": "chat_tts_test",
    "service": "chat_tts_test",
    "dockerComposeFile": "docker-compose.yml",
    "remoteUser": "vscode",
    "workspaceFolder": "/work",
    "customizations": {
      "vscode": {
        "extensions": [
          "ms-python.python",
          "ms-toolsai.jupyter"
        ]
      }
    }
}

docker-compose.yml

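Make sure the service name matches the one set in devcontainer.json.

Since the model runs on the GPU, a deploy section is added so that the container can access it.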

version: '3'
services:
  chat_tts_test:
    container_name: 'chat_tts_test-container'
    hostname: 'chat_tts_test-container'
    build: .
    restart: always
    working_dir: '/work' 
    tty: true
    volumes:
      - type: bind
        source: ..
        target: /work
    ulimits:
      memlock: -1
      stack: -1
    shm_size: '10gb'
    deploy:
      resources:
          reservations:
              devices:
                - capabilities: [gpu]
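To confirm that the deploy section really exposes the GPU, one option is to list the devices from inside the container. The sketch below shells out to `nvidia-smi -L` and parses the result. It assumes `nvidia-smi` is on the PATH inside the container, and `parse_gpu_list` and `visible_gpus` are hypothetical helper names.

```python
import re
import shutil
import subprocess


def parse_gpu_list(output):
    """Extract GPU names from `nvidia-smi -L` style output."""
    names = []
    for line in output.splitlines():
        # Expected line shape: "GPU 0: <name> (UUID: GPU-...)"
        match = re.match(r"GPU \d+: (.+?) \(UUID: .+\)\s*$", line.strip())
        if match:
            names.append(match.group(1))
    return names


def visible_gpus():
    """Return the GPU names visible in this environment, or [] if none."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    print(visible_gpus() or "no GPU visible")
```

If this prints "no GPU visible", the deploy section or the host's NVIDIA container runtime is the first place I would look.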

Cloning the repository

Once the container is up, clone the ChatTTS repository.

Running git clone inside the container is all that is needed.

git clone https://github.com/2noise/ChatTTS.git

Then move into the cloned directory and write the programs there.

cd ChatTTS
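Working inside the cloned directory matters because Python puts the script's directory (or the current directory in an interactive session) on its module search path, so the ChatTTS package folder in the repository root presumably becomes importable without a separate install step. A minimal check, where `is_importable` is a hypothetical helper:

```python
import importlib.util


def is_importable(module_name):
    """Return True if `module_name` can be found on the current search path."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for broken parent packages or odd module state
        return False


if __name__ == "__main__":
    # Run this from inside the cloned ChatTTS directory.
    print("ChatTTS importable:", is_importable("ChatTTS"))
```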

Running the sample code

Let's run the sample code published on Hugging Face.

# Import necessary libraries and configure settings
import torch
import torchaudio
torch._dynamo.config.cache_size_limit = 64
torch._dynamo.config.suppress_errors = True
torch.set_float32_matmul_precision('high')

import ChatTTS
from IPython.display import Audio

# Initialize and load the model: 
chat = ChatTTS.Chat()
chat.load_models(compile=False) # Set to True for better performance

# Define the text input for inference (Support Batching)
texts = [
    "So we found being competitive and collaborative was a huge way of staying motivated towards our goals, so one person to call when you fall off, one person who gets you back on then one person to actually do the activity with.",
    ]

# Perform inference and play the generated audio
wavs = chat.infer(texts)
Audio(wavs[0], rate=24_000, autoplay=True)

# Save the generated audio 
torchaudio.save("output.wav", torch.from_numpy(wavs[0]), 24000)

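On the first run, the model download starts.

After that, output.wav is written.

The resulting audio was clean, and I did not notice any dropouts. Since it is English speech, though, I cannot judge whether the pronunciation is correct.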
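The sample saves the audio through torchaudio. As an alternative sketch, the same samples could be written as 16-bit PCM using only the standard library. This assumes the waveform is a flat sequence of floats in the range [-1, 1] at 24 kHz (for example `wavs[0].flatten().tolist()`, which is my assumption about the returned array). `write_pcm16` and `wav_duration_seconds` are hypothetical helpers, and the sine tone merely stands in for real model output.

```python
import math
import struct
import wave


def write_pcm16(path, samples, sample_rate=24000):
    """Write a mono sequence of floats in [-1, 1] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for value in samples:
        clipped = max(-1.0, min(1.0, float(value)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


if __name__ == "__main__":
    # Placeholder signal: one second of a 440 Hz tone instead of model output.
    tone = [0.3 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(24000)]
    write_pcm16("tone.wav", tone)
    print(wav_duration_seconds("tone.wav"))
```

Checking the duration is a cheap way to notice truncated output without having to listen to every file.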

Sentence-level control

Next, let's try the sentence-level control shown on GitHub.

import torch
import torchaudio
torch._dynamo.config.cache_size_limit = 64
torch._dynamo.config.suppress_errors = True
torch.set_float32_matmul_precision('high')

import ChatTTS

chat = ChatTTS.Chat()
chat.load_models(compile=False) # Set to True for better performance

texts = [
    "So we found being competitive and collaborative was a huge way of staying motivated towards our goals, so one person to call when you fall off, one person who gets you back on then one person to actually do the activity with.",
]

rand_spk = chat.sample_random_speaker()

params_infer_code = {
  'spk_emb': rand_spk, # add sampled speaker 
  'temperature': .3, # using custom temperature
  'top_P': 0.7, # top P decode
  'top_K': 20, # top K decode
}
params_refine_text = {
  'prompt': '[oral_2][laugh_0][break_6]'
} 
# Perform inference with the sampled speaker and the refinement prompt
wavs = chat.infer(texts, params_refine_text=params_refine_text, params_infer_code=params_infer_code)

# Save the generated audio 
torchaudio.save("output2.wav", torch.from_numpy(wavs[0]), 24000)

Running this produces output2.wav.

The generated speech did indeed have a different tone from the previous one.
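The refinement prompt is simply a concatenation of bracketed tokens. The sketch below builds such a string from three integers. The allowed ranges (oral 0 to 9, laugh 0 to 2, break 0 to 7) are what the project README appears to document, so treat them as an assumption. `build_refine_prompt` is a hypothetical helper, not a ChatTTS API.

```python
def build_refine_prompt(oral=2, laugh=0, break_level=6):
    """Build a prompt such as '[oral_2][laugh_0][break_6]'."""
    # Assumed valid ranges; adjust if the upstream README says otherwise.
    limits = {"oral": (oral, 9), "laugh": (laugh, 2), "break": (break_level, 7)}
    parts = []
    for name, (value, upper) in limits.items():
        if not 0 <= value <= upper:
            raise ValueError(f"{name} must be between 0 and {upper}, got {value}")
        parts.append(f"[{name}_{value}]")
    return "".join(parts)


params_refine_text = {"prompt": build_refine_prompt(oral=2, laugh=0, break_level=6)}
print(params_refine_text["prompt"])  # [oral_2][laugh_0][break_6]
```

Generating the prompt this way makes it easy to sweep over a few settings and compare the resulting audio files.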

Word-level control

It seems control tokens can be placed before or after individual words.

import torch
import torchaudio
torch._dynamo.config.cache_size_limit = 64
torch._dynamo.config.suppress_errors = True
torch.set_float32_matmul_precision('high')

import ChatTTS

chat = ChatTTS.Chat()
chat.load_models(compile=False) # Set to True for better performance

rand_spk = chat.sample_random_speaker()  # sampled here but not passed to infer() below

inputs_en = """
chat T T S is a text to speech model designed for dialogue applications. 
[uv_break]it supports mixed language input [uv_break]and offers multi speaker 
capabilities with precise control over prosodic elements [laugh]like like 
[uv_break]laughter[laugh], [uv_break]pauses, [uv_break]and intonation. 
[uv_break]it delivers natural and expressive speech,[uv_break]so please
[uv_break] use the project responsibly at your own risk.[uv_break]
""".replace('\n', '') # English is still experimental.

params_refine_text = {
  'prompt': '[oral_2][laugh_0][break_4]'
} 
# NOTE: `text` is defined here but never passed to infer(); only inputs_en is synthesized.
text = 'What is [uv_break]your favorite english food?[laugh][lbreak]'
wavs = chat.infer(inputs_en, params_refine_text=params_refine_text)
# Save the generated audio 
torchaudio.save("output3.wav", torch.from_numpy(wavs[0]), 24000)

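Running this produces output3.wav.

The breath-like breaks are indeed being controlled, and I did not notice any breakdown in the audio either. For English use, the quality seems good enough.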
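Word-level tags such as `[uv_break]` and `[laugh]` are embedded directly in the input string. To see which tags an input contains, or to recover the plain text, a small regex helper is enough. The tag pattern below (lowercase letters, digits, and underscores inside square brackets) is an assumption based on the tags used in this article, and both function names are hypothetical.

```python
import re

# Assumed tag shape: [uv_break], [laugh], [lbreak], [oral_2], ...
TAG_PATTERN = re.compile(r"\[[a-z0-9_]+\]")


def list_control_tags(text):
    """Return the control tags in `text`, in order of appearance."""
    return TAG_PATTERN.findall(text)


def strip_control_tags(text):
    """Remove control tags and collapse the leftover whitespace."""
    without_tags = TAG_PATTERN.sub(" ", text)
    return " ".join(without_tags.split())


sample = "What is [uv_break]your favorite english food?[laugh][lbreak]"
print(list_control_tags(sample))   # ['[uv_break]', '[laugh]', '[lbreak]']
print(strip_control_tags(sample))  # What is your favorite english food?
```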

Trying server mode (failed)

The repository contains webui.py. It is not mentioned in the Usage section, but I'll give it a try.

python webui.py

When it starts successfully, a browser window opens.

I pressed Generate and waited a while, but an error came back.

torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Found NVIDIA GeForce GTX 1080 Ti which is too old to be supported by the triton GPU compiler, which is used as the backend. Triton only supports devices of CUDA Capability >= 7.0, but your device is of CUDA capability 6.1

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

In short, it told me "your GPU is too old." I have no other GPU to switch to, so I'm giving up here.
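The error message says Triton needs CUDA capability 7.0 or higher, while the GTX 1080 Ti reports 6.1. A guard like the following could decide up front whether to request compilation at all. This is only a sketch: `can_use_triton` and `detect_capability` are hypothetical helpers, the threshold is taken from the error message above, and I have not verified whether webui.py offers a way to turn compilation off.

```python
MIN_TRITON_CAPABILITY = (7, 0)  # threshold quoted in the error message


def can_use_triton(capability):
    """Return True if a (major, minor) CUDA capability meets the threshold."""
    return tuple(capability) >= MIN_TRITON_CAPABILITY


def detect_capability():
    """Return the (major, minor) capability of GPU 0, or None if unavailable."""
    try:
        import torch  # imported lazily so the helper also runs without torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    capability = detect_capability()
    use_compile = capability is not None and can_use_triton(capability)
    print("capability:", capability, "-> compile:", use_compile)
```

In the standalone scripts earlier in this article, the resulting flag could be passed as the `compile` argument of `chat.load_models`, which was already set to False there.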