
We use pyannote.audio to separate speech and identify speakers.

pip install pyannote.audio
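
Before wiring up the full scenario, here is a minimal sketch of the basic diarization call. The file name test.wav and the token placeholder are assumptions for illustration, not part of the original setup:

from pyannote.audio import Pipeline

# Load the pretrained diarization pipeline (requires accepting the model's
# user conditions on Hugging Face and a valid access token).
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_***",  # replace with your own token
)

# Run diarization on a multi-speaker recording (file name is an assumption).
diarization = pipeline("test.wav")

# Print each detected turn with its anonymous speaker label (SPEAKER_00, ...).
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.2f}s - {turn.end:.2f}s")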

Scenario:

Separate the utterances of different speakers from an audio recording that contains multiple speakers.
The voiceprint features of a few people are already known; for each separated segment we compute the cosine distance to each known voiceprint and pick the speaker with the smallest distance.
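
The matching step itself boils down to a nearest-neighbour lookup over embeddings. A minimal sketch, assuming the voiceprints are plain 1-D NumPy vectors; the names and values below are purely illustrative:

import numpy as np
from scipy.spatial.distance import cosine

# Hypothetical known voiceprints: one or more embedding vectors per speaker.
known_embeddings = {
    "mick": [np.array([0.9, 0.1, 0.0]), np.array([0.8, 0.2, 0.1])],
    "moon": [np.array([0.1, 0.9, 0.2])],
}

# Embedding extracted from one diarized segment of the unknown audio.
segment_embedding = np.array([0.85, 0.15, 0.05])

# For each known speaker, keep the smallest cosine distance to any of
# their reference embeddings, then pick the closest speaker overall.
distances = {
    speaker: min(cosine(segment_embedding, e) for e in embeddings)
    for speaker, embeddings in known_embeddings.items()
}
recognized = min(distances, key=distances.get)
print(recognized)  # -> "mick" for these illustrative vectors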

# _*_ coding: utf-8 _*_
# @Time : 2024/3/16 10:47
# @Author : Michael
# @File : spearker_rec.py
# @desc :
import torch
from pyannote.audio import Model, Pipeline, Inference
from pyannote.core import Segment
from scipy.spatial.distance import cosine


# Return the voiceprint embedding of the first segment assigned to
# `speaker_label` in `audio_file` (uses the global `inference` defined in __main__)
def extract_speaker_embedding(pipeline, audio_file, speaker_label):
    diarization = pipeline(audio_file)
    speaker_embedding = None
    for turn, _, label in diarization.itertracks(yield_label=True):
        if label == speaker_label:
            segment = Segment(turn.start, turn.end)
            speaker_embedding = inference.crop(audio_file, segment)
            break
    return speaker_embedding

# Compare the voiceprint extracted from each segment of the audio against the known voiceprint database
def recognize_speaker(pipeline, audio_file):
    diarization = pipeline(audio_file)
    speaker_turns = []
    for turn, _, speaker_label in diarization.itertracks(yield_label=True):
        # Extract the voiceprint embedding for this segment
        embedding = inference.crop(audio_file, turn)  
        distances = {}
        for speaker, embeddings in speaker_embeddings.items():
            # Compute the cosine distance to each known speaker's voiceprints
            distances[speaker] = min([cosine(embedding, e) for e in embeddings])
        # Pick the known speaker with the smallest distance
        recognized_speaker = min(distances, key=distances.get)
        # Record the turn's time range together with the predicted speaker
        speaker_turns.append((turn, recognized_speaker))
    return speaker_turns

if __name__ == "__main__":
    token = "hf_***"  # ご自身のHugging Face Tokenに置き換えてください

    # Load the speaker diarization pipeline
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=token,  # accept the user conditions on the model page and obtain a Hugging Face token
        # cache_dir="/home/huggingface/hub/models--pyannote--speaker-diarization-3.1/"
    )

    # Load the voiceprint embedding model
    embed_model = Model.from_pretrained("pyannote/embedding", use_auth_token=token)
    inference = Inference(embed_model, window="whole")

    # pipeline.to(torch.device("cuda"))

    # Assume we have reference audio files for the known speakers
    audio_files = {
        "mick": "mick.wav",  # mickの音声
        "moon": "moon.wav",  # moonの音声
    }
    speaker_embeddings = {}
    for speaker, audio_file in audio_files.items():
        diarization = pipeline(audio_file)
        for turn, _, speaker_label in diarization.itertracks(yield_label=True):
            embedding = extract_speaker_embedding(pipeline, audio_file, speaker_label)
            # Collect the known speaker's voiceprint embeddings
            speaker_embeddings.setdefault(speaker, []).append(embedding)

    # Audio file containing the speakers to be identified
    given_audio_file = "2_voice.wav"  # the first half is mick speaking, the second half is moon

    # Identify the speakers in the audio file
    recognized_speakers = recognize_speaker(pipeline, given_audio_file)
    print("Recognized speakers in the given audio:")
    for turn, speaker in recognized_speakers:
        print(f"Speaker {speaker} spoke between {turn.start:.2f}s and {turn.end:.2f}s")

Output:

Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.8.1+cu102, yours is 2.2.1+cpu. Bad things might happen unless you revert torch to 1.x.

Recognized speakers in the given audio:
Speaker mick spoke between 0.57s and 1.67s
Speaker moon spoke between 2.47s and 2.81s
Speaker moon spoke between 3.08s and 4.47s