Add voice WebSocket support and related code; not yet tested
keyBoard/Class/AiTalk/AI技术分析.txt (new file, 521 lines added)
@@ -0,0 +1,521 @@
Service      Purpose                          Example format
ASR server   Speech recognition (WebSocket)   wss://api.example.com/asr
LLM server   AI conversation (HTTP SSE)       https://api.example.com/chat
TTS server   Speech synthesis                 https://api.example.com/tts
iOS (Objective-C, iOS 15+) client technical implementation document
Low-latency streaming voice companion chat (press-and-hold to talk, similar to the 猫箱 home page)
0. Scope and goals

Implement voice companion conversation on the home page:

Press and hold to talk: start recording and stream the audio to ASR in real time
Release to finish: ASR finalizes immediately and returns the final text for display
AI reply: show the text with a typewriter effect while playing the server-side TTS audio
Latency first: do not wait for the full answer / full audio; use "trigger per sentence + streaming / near-streaming playback"
Barge-in: if the user presses and holds again while the AI is speaking, immediately stop playback / cancel requests and start a new recording turn

Minimum iOS version: iOS 15
1. Overall architecture (client modules)

KBAiMainVC
 └─ ConversationOrchestrator   (core state machine / module coordination / cancellation & barge-in)
     ├─ AudioSessionManager    (AVAudioSession configuration and interruption handling)
     ├─ AudioCaptureManager    (AVAudioEngine input tap -> 20ms PCM frames)
     ├─ ASRStreamClient        (streaming recognition via NSURLSessionWebSocketTask)
     ├─ LLMStreamClient        (SSE/WS token stream)
     ├─ Segmenter              (sentence segmentation: trigger TTS as soon as a sentence is ready)
     ├─ TTSServiceClient       (requests TTS; adapts to multiple response formats)
     ├─ TTSPlaybackPipeline    (pluggable: URL player / AAC decoding / direct PCM feed)
     ├─ AudioStreamPlayer      (AVAudioEngine + AVAudioPlayerNode playing PCM)
     └─ SubtitleSync           (maps text progress to playback progress)
2. Audio session (AVAudioSession) and permissions

2.1 Microphone permission

Request it only right before the user presses to talk for the first time.

If the user declines: prompt them to enable it in Settings.
2.2 AudioSession configuration (conversation mode)

Objective-C (recommended parameters):

category: AVAudioSessionCategoryPlayAndRecord
mode: AVAudioSessionModeVoiceChat
options:
  AVAudioSessionCategoryOptionDefaultToSpeaker
  AVAudioSessionCategoryOptionAllowBluetooth
  (optional) AVAudioSessionCategoryOptionMixWithOthers: if the host app's audio should not be interrupted (product decision)
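A minimal configuration sketch for the parameters above, assuming it lives in AudioSessionManager (the method name activateConversationSession: is illustrative, not part of the interfaces below):

// AudioSessionManager.m — conversation-mode setup (sketch)
#import <AVFoundation/AVFoundation.h>

- (BOOL)activateConversationSession:(NSError **)error {
    AVAudioSession *session = [AVAudioSession sharedInstance];
    AVAudioSessionCategoryOptions options =
        AVAudioSessionCategoryOptionDefaultToSpeaker |
        AVAudioSessionCategoryOptionAllowBluetooth;
    // Add AVAudioSessionCategoryOptionMixWithOthers here if the host audio must keep playing.
    if (![session setCategory:AVAudioSessionCategoryPlayAndRecord
                         mode:AVAudioSessionModeVoiceChat
                      options:options
                        error:error]) {
        return NO;
    }
    return [session setActive:YES error:error];
}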
2.3 Interruption and route-change handling (required)

Observe:

AVAudioSessionInterruptionNotification
AVAudioSessionRouteChangeNotification

Handling principles:

Incoming call / interruption began: stop capture + stop playback + cancel network sessions
Interruption ended: return to Idle and wait for the user to press again
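A sketch of the observer wiring (selector names are illustrative; the actual stop/reset work is delegated to ConversationOrchestrator):

// AudioSessionManager.m — interruption / route-change observation (sketch)
- (void)observeSessionNotifications {
    NSNotificationCenter *nc = [NSNotificationCenter defaultCenter];
    [nc addObserver:self selector:@selector(handleInterruption:)
               name:AVAudioSessionInterruptionNotification object:nil];
    [nc addObserver:self selector:@selector(handleRouteChange:)
               name:AVAudioSessionRouteChangeNotification object:nil];
}

- (void)handleInterruption:(NSNotification *)note {
    NSUInteger type = [note.userInfo[AVAudioSessionInterruptionTypeKey] unsignedIntegerValue];
    if (type == AVAudioSessionInterruptionTypeBegan) {
        // Began (e.g. incoming call): stop capture + stop playback + cancel network sessions.
    } else {
        // Ended: return to Idle; wait for the user to press again.
    }
}

- (void)handleRouteChange:(NSNotification *)note {
    // e.g. headphones unplugged: treat like an interruption for simplicity.
}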
3. Audio capture (streamed upload while the button is held)

3.1 Fixed audio parameters (locked down for end-to-end stability)

Sample Rate: 16000 Hz
Channels: 1
Format: PCM Int16 (pcm_s16le)
Frame Duration: 20 ms
  16 kHz * 0.02 s = 320 samples
  bytes per frame = 320 * 2 = 640 bytes
3.2 AudioCaptureManager (AVAudioEngine input tap)

Uses:

AVAudioEngine
inputNode installTapOnBus:bufferSize:format:block:

Key points:

Do no heavy work on the tap callback thread: only copy the data and dispatch to audioQueue.
Convert the AVAudioPCMBuffer into Int16 PCM NSData.
Guarantee a steady stream of 20 ms frames; if a tap callback buffer is not exactly 20 ms, join/slice frames with a ring buffer (see the sketch after 3.3).
3.3 Interface definition (OC)

@protocol AudioCaptureManagerDelegate <NSObject>
- (void)audioCaptureManagerDidOutputPCMFrame:(NSData *)pcmFrame; // 20ms / 640 B
- (void)audioCaptureManagerDidUpdateRMS:(float)rms; // optional: UI waveform
@end

@interface AudioCaptureManager : NSObject
@property (nonatomic, weak) id<AudioCaptureManagerDelegate> delegate;
- (BOOL)startCapture:(NSError **)error;
- (void)stopCapture;
@end
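A sketch of the tap plus Int16 conversion and 20 ms framing described in 3.2. It assumes ivars _engine, _converter, _pending (NSMutableData used as a simple accumulation buffer) and a serial audioQueue property; an AVAudioConverter handles the hardware-rate-to-16 kHz conversion:

// AudioCaptureManager.m — input tap -> 16 kHz mono Int16 -> 20 ms frames (sketch)
#import <AVFoundation/AVFoundation.h>

static const NSUInteger kFrameBytes = 640; // 320 samples * 2 bytes = 20 ms @ 16 kHz

- (BOOL)startCapture:(NSError **)error {
    _engine = [[AVAudioEngine alloc] init];
    AVAudioInputNode *input = _engine.inputNode;
    AVAudioFormat *hwFormat = [input outputFormatForBus:0];
    AVAudioFormat *target = [[AVAudioFormat alloc] initWithCommonFormat:AVAudioPCMFormatInt16
                                                             sampleRate:16000
                                                               channels:1
                                                            interleaved:YES];
    _converter = [[AVAudioConverter alloc] initFromFormat:hwFormat toFormat:target];
    _pending = [NSMutableData data]; // accumulates converted bytes until a full 20 ms frame exists

    __weak typeof(self) weakSelf = self;
    [input installTapOnBus:0 bufferSize:1024 format:hwFormat
                     block:^(AVAudioPCMBuffer *buffer, AVAudioTime *when) {
        __strong typeof(self) strongSelf = weakSelf;
        if (!strongSelf) { return; }
        // Keep the tap light: hand the buffer off and do the conversion on audioQueue.
        dispatch_async(strongSelf.audioQueue, ^{ [strongSelf convertAndEmit:buffer]; });
    }];
    return [_engine startAndReturnError:error];
}

- (void)convertAndEmit:(AVAudioPCMBuffer *)buffer {
    double ratio = 16000.0 / buffer.format.sampleRate;
    AVAudioFrameCount capacity = (AVAudioFrameCount)(buffer.frameLength * ratio) + 16;
    AVAudioPCMBuffer *out = [[AVAudioPCMBuffer alloc] initWithPCMFormat:_converter.outputFormat
                                                          frameCapacity:capacity];
    __block BOOL consumed = NO;
    [_converter convertToBuffer:out error:nil
             withInputFromBlock:^AVAudioBuffer *(AVAudioPacketCount count,
                                                 AVAudioConverterInputStatus *outStatus) {
        if (consumed) { *outStatus = AVAudioConverterInputStatus_NoDataNow; return nil; }
        consumed = YES;
        *outStatus = AVAudioConverterInputStatus_HaveData;
        return buffer;
    }];
    [_pending appendBytes:out.int16ChannelData[0] length:out.frameLength * sizeof(int16_t)];
    while (_pending.length >= kFrameBytes) { // emit stable 20 ms / 640 B frames
        NSData *frame = [_pending subdataWithRange:NSMakeRange(0, kFrameBytes)];
        [_pending replaceBytesInRange:NSMakeRange(0, kFrameBytes) withBytes:NULL length:0];
        [self.delegate audioCaptureManagerDidOutputPCMFrame:frame];
    }
}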
4. Streaming ASR (iOS 15: NSURLSessionWebSocketTask)

4.1 Suggested protocol (JSON control frames + binary audio frames)

Start (text frame)

{
  "type":"start",
  "sessionId":"uuid",
  "format":"pcm_s16le",
  "sampleRate":16000,
  "channels":1
}

Audio (binary frame)

Send the raw 640 B/frame PCM directly.
Rate: 50 fps (50 frames per second).

Finalize (text frame)

{ "type":"finalize", "sessionId":"uuid" }
4.2 Downstream events

{ "type":"partial", "text":"今天" }
{ "type":"final", "text":"今天天气怎么样" }
{ "type":"error", "code":123, "message":"..." }
4.3 ASRStreamClient interface (OC)

@protocol ASRStreamClientDelegate <NSObject>
- (void)asrClientDidReceivePartialText:(NSString *)text;
- (void)asrClientDidReceiveFinalText:(NSString *)text;
- (void)asrClientDidFail:(NSError *)error;
@end

@interface ASRStreamClient : NSObject
@property (nonatomic, weak) id<ASRStreamClientDelegate> delegate;
- (void)startWithSessionId:(NSString *)sessionId;
- (void)sendAudioPCMFrame:(NSData *)pcmFrame; // 20ms frame
- (void)finalize;
- (void)cancel;
@end
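A sketch of the WebSocket side with NSURLSessionWebSocketTask (the wss:// URL and JSON keys follow the protocol in 4.1; _task and _sessionId are ivars; JSON parsing of downstream events and reconnection are omitted):

// ASRStreamClient.m — start / send / finalize over NSURLSessionWebSocketTask (sketch)
- (void)startWithSessionId:(NSString *)sessionId {
    _sessionId = [sessionId copy];
    NSURLSession *session = [NSURLSession sessionWithConfiguration:
                                [NSURLSessionConfiguration defaultSessionConfiguration]];
    _task = [session webSocketTaskWithURL:[NSURL URLWithString:@"wss://api.example.com/asr"]];
    [_task resume];

    NSDictionary *start = @{ @"type": @"start", @"sessionId": sessionId,
                             @"format": @"pcm_s16le", @"sampleRate": @16000, @"channels": @1 };
    NSData *json = [NSJSONSerialization dataWithJSONObject:start options:0 error:nil];
    NSString *text = [[NSString alloc] initWithData:json encoding:NSUTF8StringEncoding];
    [_task sendMessage:[[NSURLSessionWebSocketMessage alloc] initWithString:text]
     completionHandler:^(NSError *error) { /* report via delegate on failure */ }];
    [self receiveNext]; // keep pulling partial/final/error events
}

- (void)sendAudioPCMFrame:(NSData *)pcmFrame {
    [_task sendMessage:[[NSURLSessionWebSocketMessage alloc] initWithData:pcmFrame]
     completionHandler:^(NSError *error) {}];
}

- (void)finalize {
    NSString *msg = [NSString stringWithFormat:@"{\"type\":\"finalize\",\"sessionId\":\"%@\"}", _sessionId];
    [_task sendMessage:[[NSURLSessionWebSocketMessage alloc] initWithString:msg]
     completionHandler:^(NSError *error) {}];
}

- (void)receiveNext {
    __weak typeof(self) weakSelf = self;
    [_task receiveMessageWithCompletionHandler:^(NSURLSessionWebSocketMessage *message, NSError *error) {
        if (error) { [weakSelf.delegate asrClientDidFail:error]; return; }
        // Parse message.string as JSON and route "partial" / "final" to the delegate...
        [weakSelf receiveNext];
    }];
}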
5. Streaming LLM generation (token stream)

5.1 Goals

Low latency: do not wait for the whole answer.
Receive tokens over SSE or WS.
Tokens feed the Segmenter; as soon as a sentence is complete, trigger TTS.
5.2 LLMStreamClient interface (OC)

@protocol LLMStreamClientDelegate <NSObject>
- (void)llmClientDidReceiveToken:(NSString *)token;
- (void)llmClientDidComplete;
- (void)llmClientDidFail:(NSError *)error;
@end

@interface LLMStreamClient : NSObject
@property (nonatomic, weak) id<LLMStreamClientDelegate> delegate;
- (void)sendUserText:(NSString *)text conversationId:(NSString *)cid;
- (void)cancel;
@end
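The exact SSE event format is not defined in this document; assuming plain "data: <token>" lines, a minimal parsing sketch inside the NSURLSessionDataDelegate callback could look like this (_buffer is an NSMutableData ivar):

// LLMStreamClient.m — minimal SSE token parsing (sketch; adjust once the server format is fixed)
- (void)URLSession:(NSURLSession *)session dataTask:(NSURLSessionDataTask *)dataTask
    didReceiveData:(NSData *)data {
    [_buffer appendData:data];
    NSString *text = [[NSString alloc] initWithData:_buffer encoding:NSUTF8StringEncoding];
    if (!text) { return; } // wait for more bytes if a UTF-8 sequence was split
    NSRange lastBreak = [text rangeOfString:@"\n" options:NSBackwardsSearch];
    if (lastBreak.location == NSNotFound) { return; }
    NSString *complete = [text substringToIndex:NSMaxRange(lastBreak)];
    [_buffer setData:[[text substringFromIndex:NSMaxRange(lastBreak)]
                         dataUsingEncoding:NSUTF8StringEncoding]];
    for (NSString *line in [complete componentsSeparatedByString:@"\n"]) {
        if ([line hasPrefix:@"data: "]) {
            [self.delegate llmClientDidReceiveToken:[line substringFromIndex:6]];
        }
    }
}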
6. Segmenter (sentence segmentation: play the first sentence first)

6.1 Segmentation rules (recommended)

Cut a segment when either condition is met:

a 。!? or \n character is encountered, or
the accumulated character count is ≥ 30 (configurable)
6.2 Segmenter interface (OC)

@interface Segmenter : NSObject
- (void)appendToken:(NSString *)token;
- (NSArray<NSString *> *)popReadySegments; // returns the segments that are ready for TTS right now
- (void)reset;
@end
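A sketch of the rules in 6.1 (_pendingText is an NSMutableString ivar, _readySegments an NSMutableArray ivar; the 30-character threshold is kept as a constant):

// Segmenter.m — cut on 。!? or \n, or after 30+ accumulated characters (sketch)
static const NSUInteger kMaxSegmentLength = 30;

- (void)appendToken:(NSString *)token {
    [_pendingText appendString:token];
    NSCharacterSet *breaks = [NSCharacterSet characterSetWithCharactersInString:@"。!?\n"];
    NSRange r = [_pendingText rangeOfCharacterFromSet:breaks];
    while (r.location != NSNotFound || _pendingText.length >= kMaxSegmentLength) {
        NSUInteger cut = (r.location != NSNotFound) ? NSMaxRange(r) : kMaxSegmentLength;
        [_readySegments addObject:[_pendingText substringToIndex:cut]];
        [_pendingText deleteCharactersInRange:NSMakeRange(0, cut)];
        r = [_pendingText rangeOfCharacterFromSet:breaks];
    }
}

- (NSArray<NSString *> *)popReadySegments {
    NSArray *ready = [_readySegments copy];
    [_readySegments removeAllObjects];
    return ready;
}

- (void)reset {
    [_pendingText setString:@""];
    [_readySegments removeAllObjects];
}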
7. TTS: output format undecided → the client builds a pluggable playback pipeline

Because the server team has not yet settled on an output format, the client must support any one of the following four TTS output modes:

Mode A: m4a/MP3 URL (easiest to ship)

The server returns a URL (or a base64-encoded file).
The client plays it with AVPlayer / AVAudioPlayer.
Subtitle sync uses "audio duration mapping" (the duration is available).
Pros: simple for the server.
Cons: first-audio latency is usually higher (must wait for the whole clip to be generated, or at least for the first response).

Mode B: AAC chunks (streaming)

The server pushes AAC frames over WS.
The client must decode AAC to PCM and feed it to AudioStreamPlayer.

Mode C: Opus chunks (streaming)

Requires an Opus decoding library (higher cost on both server and client).
Decode, then feed the PCM to the player.

Mode D: PCM chunks (best for low latency)

The server pushes raw PCM16 chunks (e.g. 100 ms per chunk).
The client wraps them directly into AVAudioPCMBuffer and schedules them.
Lowest latency, most robust to implement.
8. TTSServiceClient (unified network-layer interface)

8.1 Unified callback events (abstraction)

typedef NS_ENUM(NSInteger, TTSPayloadType) {
    TTSPayloadTypeURL,       // A
    TTSPayloadTypePCMChunk,  // D
    TTSPayloadTypeAACChunk,  // B
    TTSPayloadTypeOpusChunk  // C
};

@protocol TTSServiceClientDelegate <NSObject>
- (void)ttsClientDidReceiveURL:(NSURL *)url segmentId:(NSString *)segmentId;
- (void)ttsClientDidReceiveAudioChunk:(NSData *)chunk
                          payloadType:(TTSPayloadType)type
                            segmentId:(NSString *)segmentId;
- (void)ttsClientDidFinishSegment:(NSString *)segmentId;
- (void)ttsClientDidFail:(NSError *)error;
@end

@interface TTSServiceClient : NSObject
@property (nonatomic, weak) id<TTSServiceClientDelegate> delegate;
- (void)requestTTSForText:(NSString *)text segmentId:(NSString *)segmentId;
- (void)cancel;
@end

Whatever output the server finally chooses, you only need to implement the corresponding branch; the client architecture does not have to be redone.
9. TTSPlaybackPipeline (playback pipeline: routes by payloadType)

9.1 Design goals

Support both URL playback and streaming chunk playback.
Provide a unified start/stop/progress interface for subtitle sync and barge-in.

9.2 Pipeline structure (suggested)

TTSPlaybackPipeline only does routing and queue management:

URL → TTSURLPlayer (AVPlayer)
PCM → AudioStreamPlayer (AVAudioEngine)
AAC/Opus → Decoder → PCM → AudioStreamPlayer
9.3 Pipeline interface (OC)

@protocol TTSPlaybackPipelineDelegate <NSObject>
- (void)pipelineDidStartSegment:(NSString *)segmentId duration:(NSTimeInterval)duration;
- (void)pipelineDidUpdatePlaybackTime:(NSTimeInterval)time segmentId:(NSString *)segmentId;
- (void)pipelineDidFinishSegment:(NSString *)segmentId;
@end

@interface TTSPlaybackPipeline : NSObject
@property (nonatomic, weak) id<TTSPlaybackPipelineDelegate> delegate;

- (BOOL)start:(NSError **)error; // start the audio engine, etc.
- (void)stop;                    // stop immediately (barge-in)

- (void)enqueueURL:(NSURL *)url segmentId:(NSString *)segmentId;
- (void)enqueueChunk:(NSData *)chunk payloadType:(TTSPayloadType)type segmentId:(NSString *)segmentId;

// Optional: for subtitle sync
- (NSTimeInterval)currentTimeForSegment:(NSString *)segmentId;
- (NSTimeInterval)durationForSegment:(NSString *)segmentId;
@end
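A sketch of the routing described in 9.2. The AAC/Opus branches stay as TODO, matching the phased plan in section 15; _urlPlayer, its playURL:segmentId: method, and the assumed 16 kHz TTS sample rate are illustrative placeholders:

// TTSPlaybackPipeline.m — route by payloadType (sketch)
- (void)enqueueChunk:(NSData *)chunk payloadType:(TTSPayloadType)type segmentId:(NSString *)segmentId {
    switch (type) {
        case TTSPayloadTypePCMChunk:
            // Mode D: raw PCM16 goes straight to the low-latency player.
            // Sample rate is assumed here; it must match the server's TTS contract.
            [_streamPlayer enqueuePCMChunk:chunk sampleRate:16000 channels:1 segmentId:segmentId];
            break;
        case TTSPayloadTypeAACChunk:
            // Mode B: TODO — decode with AudioConverter, then feed PCM to _streamPlayer.
            break;
        case TTSPayloadTypeOpusChunk:
            // Mode C: TODO — decode with an Opus library, then feed PCM to _streamPlayer.
            break;
        case TTSPayloadTypeURL:
            break; // URLs arrive via enqueueURL:segmentId: instead.
    }
}

- (void)enqueueURL:(NSURL *)url segmentId:(NSString *)segmentId {
    // Mode A: hand off to the AVPlayer-based TTSURLPlayer.
    [_urlPlayer playURL:url segmentId:segmentId];
}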
10. AudioStreamPlayer (streaming PCM playback, the low-latency core)

10.1 Uses AVAudioEngine + AVAudioPlayerNode

Convert each PCM chunk into an AVAudioPCMBuffer.
Play it via scheduleBuffer.
Track the current segment's playback time / total duration (estimated, or accumulated from chunk durations).
10.2 Interface (OC)

@interface AudioStreamPlayer : NSObject
- (BOOL)start:(NSError **)error;
- (void)stop;
- (void)enqueuePCMChunk:(NSData *)pcmData
             sampleRate:(double)sampleRate
               channels:(int)channels
              segmentId:(NSString *)segmentId;

- (NSTimeInterval)playbackTimeForSegment:(NSString *)segmentId;
- (NSTimeInterval)durationForSegment:(NSString *)segmentId;
@end

Recommended PCM chunk granularity: 50 ms to 200 ms (too small schedules too frequently; too large adds latency).
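A sketch of enqueuePCMChunk: wrapping interleaved Int16 data and scheduling it. It assumes -start: already created _playerNode and a standard Float32 _format whose sample rate matches the incoming PCM; addDuration:forSegment: is a hypothetical bookkeeping helper for the duration queries above:

// AudioStreamPlayer.m — Int16 PCM chunk -> AVAudioPCMBuffer -> scheduleBuffer (sketch)
- (void)enqueuePCMChunk:(NSData *)pcmData
             sampleRate:(double)sampleRate
               channels:(int)channels
              segmentId:(NSString *)segmentId {
    AVAudioFrameCount frames = (AVAudioFrameCount)(pcmData.length / (sizeof(int16_t) * channels));
    AVAudioPCMBuffer *buffer = [[AVAudioPCMBuffer alloc] initWithPCMFormat:_format
                                                             frameCapacity:frames];
    buffer.frameLength = frames;

    const int16_t *src = (const int16_t *)pcmData.bytes;
    for (int ch = 0; ch < channels; ch++) {
        float *dst = buffer.floatChannelData[ch];
        for (AVAudioFrameCount i = 0; i < frames; i++) {
            dst[i] = src[i * channels + ch] / 32768.0f; // Int16 -> Float32
        }
    }

    // Track duration so playbackTimeForSegment:/durationForSegment: can answer for subtitles.
    [self addDuration:(frames / sampleRate) forSegment:segmentId];

    [_playerNode scheduleBuffer:buffer completionHandler:^{
        // Optionally advance per-segment bookkeeping / notify when the last chunk finished.
    }];
    if (!_playerNode.isPlaying) { [_playerNode play]; }
}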
11. Subtitle sync (latency first)

11.1 Strategy

For each segment's text, map playback progress to the number of visible characters:

visibleCount = round(text.length * (t / T))

t: the segment's current playback time (provided by the pipeline)
T: the segment's total duration (read directly in URL mode; accumulate/estimate in chunk mode)
11.2 SubtitleSync interface (OC)

@interface SubtitleSync : NSObject
- (NSString *)visibleTextForFullText:(NSString *)fullText
                         currentTime:(NSTimeInterval)t
                            duration:(NSTimeInterval)T;
@end
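A direct implementation sketch of the mapping in 11.1:

// SubtitleSync.m — map playback progress to visible character count (sketch)
- (NSString *)visibleTextForFullText:(NSString *)fullText
                         currentTime:(NSTimeInterval)t
                            duration:(NSTimeInterval)T {
    if (T <= 0 || fullText.length == 0) { return @""; }
    double progress = MIN(MAX(t / T, 0.0), 1.0);
    NSUInteger visibleCount = (NSUInteger)llround((double)fullText.length * progress);
    if (visibleCount == 0) { return @""; }
    if (visibleCount >= fullText.length) { return fullText; }
    // Do not cut a composed character (surrogate pair / emoji) in half.
    NSRange safe = [fullText rangeOfComposedCharacterSequencesForRange:NSMakeRange(0, visibleCount)];
    return [fullText substringToIndex:NSMaxRange(safe)];
}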
12. ConversationOrchestrator (state machine + barge-in + queues)

12.1 States

typedef NS_ENUM(NSInteger, ConversationState) {
    ConversationStateIdle,
    ConversationStateListening,
    ConversationStateRecognizing,
    ConversationStateThinking,
    ConversationStateSpeaking
};
12.2 Key flows

Event: user presses and holds (userDidPressRecord)

If currently Speaking/Thinking:
  [ttsService cancel]
  [llmClient cancel]
  [asrClient cancel] (if recognition is still running)
  [pipeline stop] (stop playback immediately)
  clear the segment queue and subtitle queue
Configure/activate the AudioSession
Create a new sessionId
[asrClient startWithSessionId:]
[audioCapture startCapture:]
state = Listening

Event: user releases (userDidReleaseRecord)

[audioCapture stopCapture]
[asrClient finalize]
state = Recognizing

Callback: ASR final text

Show the user's final text in the UI
state = Thinking
Start the LLM stream: [llmClient sendUserText:conversationId:]

Callback: LLM token

segmenter appendToken
segments = [segmenter popReadySegments]
For each segment:
  generate a segmentId
  record segmentTextMap[segmentId] = segmentText
  [ttsService requestTTSForText:segmentId:]
When the first playable audio arrives and playback starts:
  state = Speaking

Callback: TTS audio arrives

URL:   [pipeline enqueueURL:segmentId:]
chunk: [pipeline enqueueChunk:payloadType:segmentId:]

Callback: pipeline playback-time update (30-60 times per second, or via a timer)

Look up fullText by the current segmentId
visible = [subtitleSync visibleTextForFullText:currentTime:duration:]
Update the AI's visible text in the UI
12.3 Barge-in

When the user presses and holds again:

Stop playback immediately
Cancel all outstanding network requests
Discard all unplayed segments
Start a new recording turn
12.4 Orchestrator interface (OC)

@interface ConversationOrchestrator : NSObject
@property (nonatomic, assign, readonly) ConversationState state;

- (void)userDidPressRecord;
- (void)userDidReleaseRecord;

@property (nonatomic, copy) void (^onUserFinalText)(NSString *text);
@property (nonatomic, copy) void (^onAssistantVisibleText)(NSString *text);
@property (nonatomic, copy) void (^onError)(NSError *error);
@end
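A sketch of the press/release handlers on the orchestrator queue, following the cancellation order from 12.2. The ivars (_orchestratorQueue, _ttsService, etc.) and the activateConversationSession: call from the section 2.2 sketch are assumptions; error handling is trimmed:

// ConversationOrchestrator.m — press-to-talk with barge-in (sketch)
- (void)userDidPressRecord {
    dispatch_async(_orchestratorQueue, ^{
        if (self->_state == ConversationStateSpeaking || self->_state == ConversationStateThinking) {
            // Barge-in: stop everything from the previous turn first.
            [self->_ttsService cancel];
            [self->_llmClient cancel];
            [self->_asrClient cancel];
            [self->_pipeline stop];
            [self->_segmentTextMap removeAllObjects]; // drop queued segments / subtitles
        }
        NSError *error = nil;
        [self->_audioSessionManager activateConversationSession:&error]; // see the 2.2 sketch
        NSString *sessionId = [[NSUUID UUID] UUIDString];
        [self->_asrClient startWithSessionId:sessionId];
        [self->_audioCapture startCapture:&error];
        self->_state = ConversationStateListening;
    });
}

- (void)userDidReleaseRecord {
    dispatch_async(_orchestratorQueue, ^{
        [self->_audioCapture stopCapture];
        [self->_asrClient finalize];
        self->_state = ConversationStateRecognizing;
    });
}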
13. Thread/queue model (mandatory, to avoid races)

Suggested setup: three dispatch queues, with the orchestrator queue kept serial:

dispatch_queue_t audioQueue;        (capture frame processing, ring buffer)
dispatch_queue_t networkQueue;      (WS send/receive and parsing)
dispatch_queue_t orchestratorQueue; (serial state machine; the only place that mutates state/queues)

UI updates always go back to the main thread.

Rules:

Every network/audio callback → dispatch_async(orchestratorQueue, ^{ ... })
The Orchestrator then decides whether to emit UI callbacks (on the main thread)
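A sketch of the queue setup and the callback-hopping rule (queue labels are illustrative):

// ConversationOrchestrator.m — queue creation + callback hop (sketch)
- (instancetype)init {
    if ((self = [super init])) {
        _audioQueue        = dispatch_queue_create("ai.talk.audio", DISPATCH_QUEUE_SERIAL);
        _networkQueue      = dispatch_queue_create("ai.talk.network", DISPATCH_QUEUE_SERIAL);
        _orchestratorQueue = dispatch_queue_create("ai.talk.orchestrator", DISPATCH_QUEUE_SERIAL);
    }
    return self;
}

// Example: an ASR delegate callback arriving on networkQueue hops to the state machine,
// which then publishes UI updates on the main thread.
- (void)asrClientDidReceiveFinalText:(NSString *)text {
    dispatch_async(_orchestratorQueue, ^{
        // ...update state, kick off the LLM request...
        dispatch_async(dispatch_get_main_queue(), ^{
            if (self.onUserFinalText) { self.onUserFinalText(text); }
        });
    });
}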
14. Key parameters (latency and stability)

Audio frame: 20 ms
PCM: 16 kHz / mono / int16
ASR upload: WS binary frames
LLM: token stream
TTS: prefer chunks; even in URL mode, start downloading and playing as early as possible
Chunk playback buffer: 100-200 ms (jitter protection)
15. Development roadmap (iterating while the server side is undecided)

Phase 1: get end-to-end working first (simulate with "URL mode")

TTSServiceClient initially assumes the server returns an m4a URL (or a local mock URL)
The pipeline implements URL playback (AVPlayer)
Get barge-in + subtitle sync working first

Phase 2: swap in the real output once the server decides

If the server sends PCM chunks: go straight to AudioStreamPlayer (most recommended)
If AAC chunks: add an AAC decoding module (AudioConverter or a third-party library)
If Opus chunks: integrate an Opus decoding library, then feed the PCM to the player

Key point: Orchestrator/Segmenter/ASR/subtitle sync do not need to change; only the TTSPlaybackPipeline branch is replaced.
16. Compliance / UX notes

Recording must be triggered by a user action (press and hold)
Show a clear "recording in progress" indicator and waveform
Never record covertly or automatically
Allow the user to interrupt playback at any time
End of document

Additional requirements for the code-writing AI (recommended to attach along with this document):

Language: Objective-C (.h/.m)
iOS 15+, WebSocket via NSURLSessionWebSocketTask
Audio capture with AVAudioEngine + a ring buffer slicing 20 ms frames
The playback pipeline must support: URL playback (AVPlayer) + PCM chunk playback (AVAudioEngine)
The remaining AAC/Opus branches may stay as TODO / stubs, but the interfaces must be reserved