Service       Purpose                          Example format
ASR server    Speech recognition (WebSocket)   wss://api.example.com/asr
LLM server    AI chat (HTTP SSE)               https://api.example.com/chat
TTS server    Speech synthesis                 https://api.example.com/tts
iOS (Objective-C, iOS 15+) Technical Implementation Document

Low-latency streaming voice companion chat (push-to-talk, similar to the 猫箱 home screen)
0. Scope and Goals

Implement the home-screen voice companion conversation:

- Press and hold to talk: start recording and stream the audio to ASR in real time
- Release to finish: ASR finalizes immediately, returns the final text, and the UI displays it
- AI reply: show the text (typewriter effect) while playing server-side TTS audio
- Latency first: never wait for the complete answer or the complete audio; use "per-sentence triggering + streaming/quasi-streaming playback"
- Barge-in: while the AI is speaking, the user presses and holds again -> immediately stop playback, cancel in-flight requests, and start a new recording round
- Minimum iOS version: iOS 15
1. Overall Architecture (client modules)

KBAiMainVC
└─ ConversationOrchestrator (core state machine / module wiring / cancellation & barge-in)
   ├─ AudioSessionManager (AVAudioSession configuration and interruption handling)
   ├─ AudioCaptureManager (AVAudioEngine input tap -> 20 ms PCM frames)
   ├─ ASRStreamClient (NSURLSessionWebSocketTask streaming recognition)
   ├─ LLMStreamClient (SSE/WS token stream)
   ├─ Segmenter (sentence segmentation: trigger TTS as soon as a sentence is ready)
   ├─ TTSServiceClient (requests TTS; adapts to multiple response shapes)
   ├─ TTSPlaybackPipeline (pluggable: URL player / AAC decode / direct PCM feed)
   ├─ AudioStreamPlayer (AVAudioEngine + AVAudioPlayerNode playing PCM)
   └─ SubtitleSync (maps text progress to playback progress)
2. Audio Session (AVAudioSession) and Permissions

2.1 Microphone permission

- Request it only right before the user presses to talk for the first time
- If the user declines: prompt them to enable it in Settings
2.2 AudioSession configuration (conversation mode)

Recommended Objective-C parameters:

- category: AVAudioSessionCategoryPlayAndRecord
- mode: AVAudioSessionModeVoiceChat
- options:
  - AVAudioSessionCategoryOptionDefaultToSpeaker
  - AVAudioSessionCategoryOptionAllowBluetooth
  - (optional) AVAudioSessionCategoryOptionMixWithOthers: if you prefer not to interrupt the host app's audio (product decision)
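The parameters above can be sketched as a single helper in AudioSessionManager (a minimal sketch; the function name is illustrative, error handling intentionally thin):

```objc
#import <AVFoundation/AVFoundation.h>

// Configure the conversation-mode session described in 2.2.
static BOOL KBConfigureConversationAudioSession(NSError **error) {
    AVAudioSession *session = [AVAudioSession sharedInstance];
    AVAudioSessionCategoryOptions options =
        AVAudioSessionCategoryOptionDefaultToSpeaker |
        AVAudioSessionCategoryOptionAllowBluetooth;
    if (![session setCategory:AVAudioSessionCategoryPlayAndRecord
                         mode:AVAudioSessionModeVoiceChat
                      options:options
                        error:error]) {
        return NO;
    }
    // Activation can fail while another app holds the session; surface the error.
    return [session setActive:YES error:error];
}
```

Call this lazily on the first press-to-talk, after the microphone permission has been granted.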
2.3 Interruption and route-change handling (required)

Observe:

- AVAudioSessionInterruptionNotification
- AVAudioSessionRouteChangeNotification

Handling principles:

- Incoming call / interruption began: stop capture + stop playback + cancel network sessions
- Interruption ended: return to Idle and wait for the user to press and hold again
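A sketch of the interruption handling above (the orchestrator hook name is hypothetical):

```objc
// In AudioSessionManager:
- (void)startObserving {
    [[NSNotificationCenter defaultCenter] addObserver:self
                                             selector:@selector(handleInterruption:)
                                                 name:AVAudioSessionInterruptionNotification
                                               object:nil];
}

- (void)handleInterruption:(NSNotification *)note {
    AVAudioSessionInterruptionType type =
        [note.userInfo[AVAudioSessionInterruptionTypeKey] unsignedIntegerValue];
    if (type == AVAudioSessionInterruptionTypeBegan) {
        // Call/alarm started: stop capture, stop playback, cancel network sessions.
        [self.orchestrator handleExternalInterruption]; // hypothetical hook
    }
    // On AVAudioSessionInterruptionTypeEnded we deliberately do NOT auto-resume:
    // go back to Idle and wait for the next press-and-hold.
}
```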
3. Audio Capture (streaming upload while the button is held)

3.1 Fixed audio parameters (locked down for end-to-end stability)

- Sample rate: 16000 Hz
- Channels: 1
- Format: PCM Int16 (pcm_s16le)
- Frame duration: 20 ms
- 16 kHz * 0.02 s = 320 samples
- Bytes per frame = 320 * 2 = 640 bytes
3.2 AudioCaptureManager (AVAudioEngine input tap)

Uses:

- AVAudioEngine
- inputNode installTapOnBus:bufferSize:format:block:

Key points:

- Do no heavy work on the tap callback thread: copy the data and dispatch to audioQueue, nothing else
- Convert each AVAudioPCMBuffer to Int16 PCM NSData
- Guarantee a steady stream of 20 ms frames: tap callback buffers are rarely exactly 20 ms, so reassemble/slice them through a ring buffer
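The reassemble/slice step can be written in plain C (a sketch; a fixed-capacity FIFO stands in for the real ring buffer):

```c
#include <assert.h>
#include <string.h>

#define FRAME_BYTES 640      /* 320 samples * 2 bytes (16 kHz mono s16le, 20 ms) */
#define FIFO_CAP    8192

typedef struct {
    unsigned char buf[FIFO_CAP];
    size_t len;              /* bytes currently buffered */
} FrameFifo;

/* Append an arbitrary-sized PCM chunk from the tap callback. */
static void fifo_append(FrameFifo *f, const unsigned char *data, size_t n) {
    assert(f->len + n <= FIFO_CAP);
    memcpy(f->buf + f->len, data, n);
    f->len += n;
}

/* Pop one whole 20 ms frame if available: returns 1 and fills `frame`, else 0. */
static int fifo_pop_frame(FrameFifo *f, unsigned char frame[FRAME_BYTES]) {
    if (f->len < FRAME_BYTES) return 0;
    memcpy(frame, f->buf, FRAME_BYTES);
    memmove(f->buf, f->buf + FRAME_BYTES, f->len - FRAME_BYTES);
    f->len -= FRAME_BYTES;
    return 1;
}
```

After each tap callback, append the converted bytes and drain frames in a loop, handing each popped 640 B frame to the delegate.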
3.3 Interface definition (OC)

@protocol AudioCaptureManagerDelegate <NSObject>
- (void)audioCaptureManagerDidOutputPCMFrame:(NSData *)pcmFrame; // 20 ms / 640 B
- (void)audioCaptureManagerDidUpdateRMS:(float)rms; // optional: UI waveform
@end

@interface AudioCaptureManager : NSObject
@property (nonatomic, weak) id<AudioCaptureManagerDelegate> delegate;
- (BOOL)startCapture:(NSError **)error;
- (void)stopCapture;
@end
4. ASR Streaming Recognition (iOS 15: NSURLSessionWebSocketTask)

4.1 Suggested protocol (JSON control frames + binary audio frames)

Start (text frame)

{
  "type": "start",
  "sessionId": "uuid",
  "format": "pcm_s16le",
  "sampleRate": 16000,
  "channels": 1
}

Audio (binary frame)

- Send the raw 640 B/frame PCM directly
- Rate: 50 fps (50 frames per second)

Finalize (text frame)

{ "type": "finalize", "sessionId": "uuid" }
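The upstream half of this protocol maps directly onto NSURLSessionWebSocketTask (a sketch of ASRStreamClient internals; the URL matches the example table and is a placeholder):

```objc
NSURL *url = [NSURL URLWithString:@"wss://api.example.com/asr"];
NSURLSessionWebSocketTask *task = [[NSURLSession sharedSession] webSocketTaskWithURL:url];
[task resume];

// Start (text frame)
NSDictionary *start = @{ @"type": @"start", @"sessionId": sessionId,
                         @"format": @"pcm_s16le", @"sampleRate": @16000, @"channels": @1 };
NSData *json = [NSJSONSerialization dataWithJSONObject:start options:0 error:NULL];
NSURLSessionWebSocketMessage *startMsg = [[NSURLSessionWebSocketMessage alloc]
    initWithString:[[NSString alloc] initWithData:json encoding:NSUTF8StringEncoding]];
[task sendMessage:startMsg completionHandler:^(NSError *error) { /* report via delegate */ }];

// Audio (binary frame): called once per 20 ms frame from the capture delegate
NSURLSessionWebSocketMessage *audioMsg =
    [[NSURLSessionWebSocketMessage alloc] initWithData:pcmFrame];
[task sendMessage:audioMsg completionHandler:^(NSError *error) { /* drop frame on error */ }];
```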
4.2 Downstream events

{ "type":"partial", "text":"今天" }
{ "type":"final", "text":"今天天气怎么样" }
{ "type":"error", "code":123, "message":"..." }
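A sketch of the receive loop that dispatches these events (note NSURLSessionWebSocketTask delivers one message per receive call, so the handler must re-arm itself):

```objc
- (void)readNextMessage {
    __weak typeof(self) weakSelf = self;
    [self.task receiveMessageWithCompletionHandler:^(NSURLSessionWebSocketMessage *message,
                                                     NSError *error) {
        if (error) { [weakSelf.delegate asrClientDidFail:error]; return; }
        if (message.type == NSURLSessionWebSocketMessageTypeString) {
            NSData *data = [message.string dataUsingEncoding:NSUTF8StringEncoding];
            NSDictionary *event = [NSJSONSerialization JSONObjectWithData:data
                                                                  options:0 error:NULL];
            NSString *type = event[@"type"];
            if ([type isEqualToString:@"partial"]) {
                [weakSelf.delegate asrClientDidReceivePartialText:event[@"text"]];
            } else if ([type isEqualToString:@"final"]) {
                [weakSelf.delegate asrClientDidReceiveFinalText:event[@"text"]];
            }
        }
        [weakSelf readNextMessage]; // re-arm for the next message
    }];
}
```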
4.3 ASRStreamClient interface (OC)

@protocol ASRStreamClientDelegate <NSObject>
- (void)asrClientDidReceivePartialText:(NSString *)text;
- (void)asrClientDidReceiveFinalText:(NSString *)text;
- (void)asrClientDidFail:(NSError *)error;
@end

@interface ASRStreamClient : NSObject
@property (nonatomic, weak) id<ASRStreamClientDelegate> delegate;
- (void)startWithSessionId:(NSString *)sessionId;
- (void)sendAudioPCMFrame:(NSData *)pcmFrame; // 20 ms frame
- (void)finalize;
- (void)cancel;
@end
5. LLM Streaming Generation (token stream)

5.1 Goals

- Low latency: never wait for the full answer
- Receive tokens over SSE or WS
- Tokens feed the Segmenter; TTS is triggered as soon as a full sentence is ready
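If SSE is chosen, the client mostly needs a "data:" line extractor over the received byte stream. A C sketch of that step (the payload format is still an assumption, since the server protocol is undecided; real code would also buffer partial lines across network reads):

```c
#include <assert.h>
#include <string.h>

/* Extracts the payload of each complete "data: ..." line from an SSE byte
 * stream. `out` receives NUL-separated payloads; returns their count. */
static int sse_extract(const char *stream, char *out, size_t outcap) {
    int count = 0;
    size_t used = 0;
    const char *p = stream;
    while (*p) {
        const char *nl = strchr(p, '\n');
        size_t len = nl ? (size_t)(nl - p) : strlen(p);
        if (len > 5 && strncmp(p, "data:", 5) == 0) {
            const char *payload = p + 5;
            size_t plen = len - 5;
            if (*payload == ' ') { payload++; plen--; } /* optional space after colon */
            assert(used + plen + 1 <= outcap);
            memcpy(out + used, payload, plen);
            out[used + plen] = '\0';
            used += plen + 1;
            count++;
        }
        if (!nl) break;
        p = nl + 1;
    }
    return count;
}
```

Each extracted payload is then parsed and forwarded as one llmClientDidReceiveToken: call.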
5.2 LLMStreamClient interface (OC)

@protocol LLMStreamClientDelegate <NSObject>
- (void)llmClientDidReceiveToken:(NSString *)token;
- (void)llmClientDidComplete;
- (void)llmClientDidFail:(NSError *)error;
@end

@interface LLMStreamClient : NSObject
@property (nonatomic, weak) id<LLMStreamClientDelegate> delegate;
- (void)sendUserText:(NSString *)text conversationId:(NSString *)cid;
- (void)cancel;
@end
6. Segmenter (sentence segmentation: play the first sentence first)

6.1 Segmentation rules (recommended)

Cut a segment as soon as either condition holds:

- One of 。!?\n is encountered
- Or the accumulated character count reaches 30 (configurable)
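The two rules can be sketched over a UTF-8 buffer in plain C (illustrative only; the real Segmenter operates on NSString):

```c
#include <string.h>

#define MAX_SEGMENT_CHARS 30

/* Returns 1 if `p` starts with 。!? (UTF-8) or '\n'; *adv gets the mark's byte length. */
static int is_sentence_end(const char *p, size_t *adv) {
    static const char *marks[] = { "\xE3\x80\x82",   /* 。 */
                                   "\xEF\xBC\x81",   /* ! */
                                   "\xEF\xBC\x9F" }; /* ? */
    for (int i = 0; i < 3; i++) {
        size_t n = strlen(marks[i]);
        if (strncmp(p, marks[i], n) == 0) { *adv = n; return 1; }
    }
    if (*p == '\n') { *adv = 1; return 1; }
    return 0;
}

/* Byte length of the first ready segment (including its end mark), or 0 if
 * nothing is ready yet and we should keep buffering tokens. */
static size_t first_ready_segment(const char *utf8) {
    size_t i = 0, chars = 0;
    while (utf8[i]) {
        size_t adv;
        if (is_sentence_end(utf8 + i, &adv)) return i + adv;     /* rule 1 */
        i++;
        while ((utf8[i] & 0xC0) == 0x80) i++;  /* skip UTF-8 continuation bytes */
        if (++chars >= MAX_SEGMENT_CHARS) return i;              /* rule 2 */
    }
    return 0;
}
```

popReadySegments would apply this repeatedly, removing each returned prefix from the buffer.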
6.2 Segmenter interface (OC)

@interface Segmenter : NSObject
- (void)appendToken:(NSString *)token;
- (NSArray<NSString *> *)popReadySegments; // segments ready for immediate TTS
- (void)reset;
@end
7. TTS: output format undecided -> client builds a pluggable playback pipeline

Because the server team has not settled on an output format, the client must support any of the following four TTS output modes:

Mode A: m4a/MP3 URL (easiest to ship)

- Server returns a URL (or a base64-encoded file)
- Client plays it with AVPlayer / AVAudioPlayer
- Subtitle sync uses duration mapping (duration is available)
- Pros: simple for the server
- Cons: usually higher first-audio latency (must wait for the full clip to be generated, or at least the first bytes)

Mode B: AAC chunks (streaming)

- Server pushes AAC frames over WS
- Client decodes AAC to PCM, then feeds AudioStreamPlayer

Mode C: Opus chunks (streaming)

- Requires an Opus decoder library (higher cost on both server and client)
- Decode to PCM, then play

Mode D: PCM chunks (best for low latency)

- Server pushes raw PCM16 chunks directly (e.g. one per 100 ms)
- Client wraps each chunk in an AVAudioPCMBuffer and schedules it directly
- Lowest latency, most robust implementation
8. TTSServiceClient (unified network-layer interface)

8.1 Unified callback events (abstraction)

typedef NS_ENUM(NSInteger, TTSPayloadType) {
    TTSPayloadTypeURL,       // A
    TTSPayloadTypePCMChunk,  // D
    TTSPayloadTypeAACChunk,  // B
    TTSPayloadTypeOpusChunk  // C
};

@protocol TTSServiceClientDelegate <NSObject>
- (void)ttsClientDidReceiveURL:(NSURL *)url segmentId:(NSString *)segmentId;
- (void)ttsClientDidReceiveAudioChunk:(NSData *)chunk
                          payloadType:(TTSPayloadType)type
                            segmentId:(NSString *)segmentId;
- (void)ttsClientDidFinishSegment:(NSString *)segmentId;
- (void)ttsClientDidFail:(NSError *)error;
@end

@interface TTSServiceClient : NSObject
@property (nonatomic, weak) id<TTSServiceClientDelegate> delegate;
- (void)requestTTSForText:(NSString *)text segmentId:(NSString *)segmentId;
- (void)cancel;
@end

Whichever output format the server finally picks, you only implement the matching branch; the client architecture stays intact.
9. TTSPlaybackPipeline (playback pipeline: routes by payloadType)

9.1 Design goals

- Support both URL playback and streaming chunk playback
- Expose a unified start/stop/progress interface for subtitle sync and barge-in

9.2 Pipeline structure (recommended)

- TTSPlaybackPipeline only does routing and queue management
- URL -> TTSURLPlayer (AVPlayer)
- PCM -> AudioStreamPlayer (AVAudioEngine)
- AAC/Opus -> decoder -> PCM -> AudioStreamPlayer
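The routing step in 9.2 reduces to one switch (a sketch; the pcmPlayer/decoder properties and the decoder API are placeholders for the TODO branches):

```objc
- (void)enqueueChunk:(NSData *)chunk
         payloadType:(TTSPayloadType)type
           segmentId:(NSString *)segmentId {
    switch (type) {
        case TTSPayloadTypePCMChunk:
            [self.pcmPlayer enqueuePCMChunk:chunk sampleRate:16000.0
                                   channels:1 segmentId:segmentId];
            break;
        case TTSPayloadTypeAACChunk:   // decode -> PCM -> pcmPlayer (TODO)
        case TTSPayloadTypeOpusChunk:  // decode -> PCM -> pcmPlayer (TODO)
            [self.decoder decodeChunk:chunk segmentId:segmentId]; // hypothetical API
            break;
        case TTSPayloadTypeURL:
            NSAssert(NO, @"URL payloads go through enqueueURL:segmentId:");
            break;
    }
}
```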
9.3 Pipeline interface (OC)

@protocol TTSPlaybackPipelineDelegate <NSObject>
- (void)pipelineDidStartSegment:(NSString *)segmentId duration:(NSTimeInterval)duration;
- (void)pipelineDidUpdatePlaybackTime:(NSTimeInterval)time segmentId:(NSString *)segmentId;
- (void)pipelineDidFinishSegment:(NSString *)segmentId;
@end

@interface TTSPlaybackPipeline : NSObject
@property (nonatomic, weak) id<TTSPlaybackPipelineDelegate> delegate;

- (BOOL)start:(NSError **)error; // starts the audio engine etc.
- (void)stop;                    // stop immediately (barge-in)

- (void)enqueueURL:(NSURL *)url segmentId:(NSString *)segmentId;
- (void)enqueueChunk:(NSData *)chunk payloadType:(TTSPayloadType)type segmentId:(NSString *)segmentId;

// Optional: for subtitle sync
- (NSTimeInterval)currentTimeForSegment:(NSString *)segmentId;
- (NSTimeInterval)durationForSegment:(NSString *)segmentId;
@end
10. AudioStreamPlayer (streaming PCM playback; the low-latency core)

10.1 AVAudioEngine + AVAudioPlayerNode

- Convert each PCM chunk into an AVAudioPCMBuffer
- Play via scheduleBuffer
- Track the current segment's playback time / total duration (estimate it, or accumulate chunk durations)

10.2 Interface (OC)

@interface AudioStreamPlayer : NSObject
- (BOOL)start:(NSError **)error;
- (void)stop;
- (void)enqueuePCMChunk:(NSData *)pcmData
             sampleRate:(double)sampleRate
               channels:(int)channels
              segmentId:(NSString *)segmentId;

- (NSTimeInterval)playbackTimeForSegment:(NSString *)segmentId;
- (NSTimeInterval)durationForSegment:(NSString *)segmentId;
@end

Recommended PCM chunk granularity: 50-200 ms (smaller means scheduling too frequently; larger adds latency).
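The duration bookkeeping in 10.1 is simple arithmetic: for s16le PCM, a chunk's duration is bytes / (sampleRate * channels * 2), and a segment's total duration T is the running sum of its chunks. A minimal sketch:

```c
#include <stddef.h>

/* Duration in seconds of one s16le PCM chunk (2 bytes per sample per channel). */
static double pcm_chunk_seconds(size_t chunkBytes, double sampleRate, int channels) {
    return (double)chunkBytes / (sampleRate * (double)channels * 2.0);
}
```

For example, a 3200-byte chunk at 16 kHz mono is 100 ms; accumulate these as chunks arrive to keep durationForSegment: current.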
11. Subtitle Sync (latency first)

11.1 Strategy

For each segment's text, map playback progress to a visible character count:

visibleCount = round(text.length * (t / T))

- t: the segment's current playback position (provided by the pipeline)
- T: the segment's total duration (read directly in URL mode; accumulated/estimated in chunk mode)
11.2 SubtitleSync interface (OC)

@interface SubtitleSync : NSObject
- (NSString *)visibleTextForFullText:(NSString *)fullText
                         currentTime:(NSTimeInterval)t
                            duration:(NSTimeInterval)T;
@end
12. ConversationOrchestrator (state machine + barge-in + queues)

12.1 States

typedef NS_ENUM(NSInteger, ConversationState) {
    ConversationStateIdle,
    ConversationStateListening,
    ConversationStateRecognizing,
    ConversationStateThinking,
    ConversationStateSpeaking
};

12.2 Key flows

Event: user presses (userDidPressRecord)

- If currently Speaking/Thinking:
  - [ttsService cancel]
  - [llmClient cancel]
  - [asrClient cancel] (if still recognizing)
  - [pipeline stop] (stop playback immediately)
  - Clear the segment queue and subtitle queue
- Configure/activate the AudioSession
- Create a new sessionId
- [asrClient startWithSessionId:]
- [audioCapture startCapture:]
- state = Listening

Event: user releases (userDidReleaseRecord)

- [audioCapture stopCapture]
- [asrClient finalize]
- state = Recognizing

Callback: ASR final text

- UI shows the user's final text
- state = Thinking
- Start the LLM stream: [llmClient sendUserText:conversationId:]

Callback: LLM token

- segmenter appendToken
- segments = [segmenter popReadySegments]
- For each segment:
  - Generate a segmentId
  - Record segmentTextMap[segmentId] = segmentText
  - [ttsService requestTTSForText:segmentId:]
- When the first playable audio arrives and playback starts:
  - state = Speaking

Callback: TTS audio arrives

- URL: [pipeline enqueueURL:segmentId:]
- chunk: [pipeline enqueueChunk:payloadType:segmentId:]

Callback: pipeline playback-time update (30-60 fps, or driven by a timer)

- Look up fullText for the current segmentId
- visible = [subtitleSync visibleTextForFullText:currentTime:duration:]
- UI updates the AI's visible text
12.3 Barge-in

When the user presses and holds again:

- Stop playback immediately
- Cancel all in-flight network requests
- Drop all unplayed segments
- Start a new recording round
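The press handler combining 12.2 and 12.3 can be sketched as follows (property names follow section 1 but are otherwise illustrative; everything runs on the serial orchestratorQueue from section 13):

```objc
- (void)userDidPressRecord {
    dispatch_async(self.orchestratorQueue, ^{
        if (self.state == ConversationStateSpeaking ||
            self.state == ConversationStateThinking) {
            // Barge-in: kill everything belonging to the previous turn first.
            [self.ttsService cancel];
            [self.llmClient cancel];
            [self.asrClient cancel];
            [self.pipeline stop];
            [self.segmentTextMap removeAllObjects]; // drop unplayed segments/subtitles
        }
        NSString *sessionId = [[NSUUID UUID] UUIDString];
        [self.asrClient startWithSessionId:sessionId];
        NSError *error = nil;
        if (![self.audioCapture startCapture:&error]) {
            dispatch_async(dispatch_get_main_queue(), ^{
                if (self.onError) self.onError(error);
            });
            return;
        }
        self->_state = ConversationStateListening;
    });
}
```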
12.4 Orchestrator interface (OC)

@interface ConversationOrchestrator : NSObject
@property (nonatomic, assign, readonly) ConversationState state;

- (void)userDidPressRecord;
- (void)userDidReleaseRecord;

@property (nonatomic, copy) void (^onUserFinalText)(NSString *text);
@property (nonatomic, copy) void (^onAssistantVisibleText)(NSString *text);
@property (nonatomic, copy) void (^onError)(NSError *error);
@end
13. Threading/Queue Model (mandatory, to avoid races)

Recommended: two worker queues plus one serial orchestrator queue, with UI on the main thread:

- dispatch_queue_t audioQueue; (capture frame handling, ring buffer)
- dispatch_queue_t networkQueue; (WS send/receive and parsing)
- dispatch_queue_t orchestratorQueue; (serial state machine; the only place state and queues are mutated)
- UI updates always hop back to the main thread

Rules:

- Every network/audio callback -> dispatch_async(orchestratorQueue, ^{ ... })
- Inside the orchestrator, decide whether to emit UI callbacks (main thread)
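The hop pattern from the rules above, sketched for one ASR callback (queue labels are illustrative; the queues are created once, e.g. in init, as serial queues):

```objc
// In init:
// _orchestratorQueue = dispatch_queue_create("kb.voice.orchestrator", DISPATCH_QUEUE_SERIAL);

- (void)asrClientDidReceiveFinalText:(NSString *)text {
    dispatch_async(self.orchestratorQueue, ^{
        // Only here may state be mutated.
        self->_state = ConversationStateThinking;
        dispatch_async(dispatch_get_main_queue(), ^{ // UI strictly on main
            if (self.onUserFinalText) self.onUserFinalText(text);
        });
    });
}
```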
14. Key Parameters (latency and stability)

- Audio frames: 20 ms
- PCM: 16 kHz / mono / int16
- ASR upload: WS binary
- LLM: token stream
- TTS: prefer chunks; even in URL mode, start downloading and playing as early as possible
- Chunk playback buffer: 100-200 ms (jitter protection)
15. Implementation Roadmap (iterating while the server side is undecided)

Phase 1: get end-to-end working first (simulate with URL mode)

- TTSServiceClient initially assumes the server returns an m4a URL (or a local mock URL)
- Pipeline implements URL playback (AVPlayer)
- Get barge-in + subtitle sync working first

Phase 2: swap in the real output once the server decides

- PCM chunks: go straight to AudioStreamPlayer (most recommended)
- AAC chunks: add an AAC decoding module (AudioConverter or third-party)
- Opus chunks: integrate an Opus decoder library, then feed PCM

Key point: Orchestrator/Segmenter/ASR/subtitle sync all stay unchanged; only the TTSPlaybackPipeline branch is replaced.
16. Compliance / UX Notes

- Recording must be triggered by an explicit user action (press and hold)
- Show a clear "recording" indicator and waveform
- Never record covertly or automatically
- Allow the user to interrupt playback at any time

End of document
Additional requirements for the code-writing AI (recommended to attach alongside this document)

- Language: Objective-C (.h/.m)
- iOS 15+; WebSocket via NSURLSessionWebSocketTask
- Audio capture via AVAudioEngine + a ring buffer slicing into 20 ms frames
- The playback pipeline must support: URL playback (AVPlayer) + PCM chunk playback (AVAudioEngine)
- The AAC/Opus branches may be left as TODO/stubs, but their interfaces must be in place