## Basic Concepts
In a real-time streaming system, the jitter buffer sits on the decoding side and serves several purposes:
1. Reordering RTP packets
2. De-duplicating RTP packets
3. Removing jitter
Points 1 and 2 are straightforward. The core task is jitter removal, whose goal is to make video play back smoothly rather than speeding up and slowing down with network jitter.
A simple video jitter buffer can work on RTP packets alone: it only sorts and de-duplicates them and never deals with whole video frames, as shown in the figure below.

Such a simple implementation does not care whether a video frame is corrupt, whether it can be decoded, or whether consecutive frames can be decoded in sequence. (The packets within one frame, once sorted and de-duplicated, can be assumed decodable.) All of that is left to the decoder module.
Naturally, a jitter buffer of this form cannot actually remove jitter, because for video, jitter is handled between frames, not between RTP packets. Calling it an "rtp buffer" would be more accurate.
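The sorting and de-duplication above hinge on comparing 16-bit RTP sequence numbers correctly across wraparound. A minimal sketch of such a comparison (mirroring what WebRTC's `IsNewerSequenceNumber`/`LatestSequenceNumber` helpers do; the names here are illustrative):

```cpp
#include <cstdint>

// True if `a` is newer than `b`, treating the 16-bit sequence
// number space as circular: `a` is newer when the forward distance
// from `b` to `a` is less than half the space (0x8000).
bool IsNewerSeqNum(uint16_t a, uint16_t b) {
  return a != b && static_cast<uint16_t>(a - b) < 0x8000;
}

// Returns whichever of the two sequence numbers is newer.
uint16_t LatestSeqNum(uint16_t a, uint16_t b) {
  return IsNewerSeqNum(a, b) ? a : b;
}
```

The unsigned subtraction is what makes wraparound work: right after the counter wraps, sequence number 2 is still correctly judged newer than 65534.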
## The Video JitterBuffer in WebRTC
The jitter buffer in WebRTC is also at the core of its QoS machinery: it estimates jitter and packet loss and decides whether to request retransmission via NACK. Here I will set the QoS-related logic aside and look first at the jitter buffer's core basic functions.
WebRTC's jitter buffer is built around video frames: on the RTP receiving side, RTP packets are inserted into the jitter buffer, and on the decoding side, whole video frames are taken out.

The input is RTP packets; the output is smooth (de-jittered) video frames, guaranteed to be decodable and continuously decodable. Both de-jittering and frame-error handling are implemented in this class, so if the video stutters or goes black, VCMJitterBuffer is the place to look; examine its frame-error handling logic.
### Basic Data Structures
The VCMJitterBuffer class is the concrete implementation. Since it works on video frames, its basic data structure is the frame queue, of which there are three kinds:
~~~
UnorderedFrameList free_frames_;
FrameList decodable_frames_;
FrameList incomplete_frames_;
~~~
free\_frames\_ is the list of free frame slots, decodable\_frames\_ the list of decodable frames, and incomplete\_frames\_ the list of incomplete frames.
* Frames
`VCMFrameBuffer` represents a frame; free\_frames\_, decodable\_frames\_ and incomplete\_frames\_ are a list and maps of `VCMFrameBuffer` objects. A `VCMFrameBuffer` holds a member of type `VCMSessionInfo`, which is the actual frame buffer. Looking at the VCMSessionInfo code, the frame buffer is really just `typedef std::list<VCMPacket> PacketList`, a list of RTP packets. The VCMSessionInfo class processes RTP packets and updates the frame's state according to them.
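The nesting described above can be sketched roughly as follows (deliberately simplified, illustrative stand-in types; the real WebRTC classes carry much more state):

```cpp
#include <cstdint>
#include <list>
#include <map>
#include <vector>

// Simplified stand-in for the real VCMPacket.
struct VCMPacket {
  uint16_t seqNum = 0;
  uint32_t timestamp = 0;
  std::vector<uint8_t> payload;
};

// VCMSessionInfo: one frame's worth of RTP packets, kept sorted by
// sequence number, plus the frame's derived state.
struct VCMSessionInfo {
  std::list<VCMPacket> packets_;  // typedef std::list<VCMPacket> PacketList
  bool complete_ = false;
  bool decodable_ = false;
};

// VCMFrameBuffer: a frame slot; owns the session (the packet list).
struct VCMFrameBuffer {
  uint32_t timestamp = 0;
  VCMSessionInfo session_;
};

// The three queues held by VCMJitterBuffer: free slots need no order,
// while decodable/incomplete frames are keyed by RTP timestamp.
using UnorderedFrameList = std::list<VCMFrameBuffer*>;
using FrameList = std::map<uint32_t, VCMFrameBuffer*>;
```

Keying the frame lists by RTP timestamp is what lets the buffer locate an incoming packet's frame in logarithmic time.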
### VCMSessionInfo
The core interfaces of the VCMSessionInfo class are `InsertPacket`, `UpdateCompleteSession` and `UpdateDecodableSession`.
* InsertPacket
Implements RTP packet ordering and de-duplication, and updates the frame's state after the packet has been inserted.
~~~
int VCMSessionInfo::InsertPacket(const VCMPacket& packet,
                                 uint8_t* frame_buffer,
                                 VCMDecodeErrorMode decode_error_mode,
                                 const FrameData& frame_data) {
  if (packet.frameType == kFrameEmpty) {
    // Update sequence number of an empty packet.
    // Only media packets are inserted into the packet list.
    InformOfEmptyPacket(packet.seqNum);
    return 0;
  }
  // Maximum number of RTP packets one frame may contain
  // (maximum number of NALU fragments).
  if (packets_.size() == kMaxPacketsInSession) {
    LOG(LS_ERROR) << "Max number of packets per frame has been reached.";
    return -1;
  }
  // Sort by sequence number in ascending order.
  // Find the position of this packet in the packet list in sequence number
  // order and insert it. Loop over the list in reverse order.
  ReversePacketIterator rit = packets_.rbegin();
  for (; rit != packets_.rend(); ++rit)
    if (LatestSequenceNumber(packet.seqNum, (*rit).seqNum) == packet.seqNum)
      break;
  // Check for duplicate packets.
  if (rit != packets_.rend() &&
      (*rit).seqNum == packet.seqNum && (*rit).sizeBytes > 0)
    return -2;
  // A packet with the marker bit set is the last packet of a frame.
  if (packet.codec == kVideoCodecH264) {
    frame_type_ = packet.frameType;
    if (packet.isFirstPacket &&
        (first_packet_seq_num_ == -1 ||
         IsNewerSequenceNumber(first_packet_seq_num_, packet.seqNum))) {
      first_packet_seq_num_ = packet.seqNum;
    }
    if (packet.markerBit &&
        (last_packet_seq_num_ == -1 ||
         IsNewerSequenceNumber(packet.seqNum, last_packet_seq_num_))) {
      last_packet_seq_num_ = packet.seqNum;
    }
  } else {
    // Only insert media packets between first and last packets (when
    // available).
    // Placing check here, as to properly account for duplicate packets.
    // Check if this is first packet (only valid for some codecs)
    // Should only be set for one packet per session.
    if (packet.isFirstPacket && first_packet_seq_num_ == -1) {
      // The first packet in a frame signals the frame type.
      frame_type_ = packet.frameType;
      // Store the sequence number for the first packet.
      first_packet_seq_num_ = static_cast<int>(packet.seqNum);
    } else if (first_packet_seq_num_ != -1 &&
               IsNewerSequenceNumber(first_packet_seq_num_, packet.seqNum)) {
      LOG(LS_WARNING) << "Received packet with a sequence number which is out "
                         "of frame boundaries";
      return -3;
    } else if (frame_type_ == kFrameEmpty && packet.frameType != kFrameEmpty) {
      // Update the frame type with the type of the first media packet.
      // TODO(mikhal): Can this trigger?
      frame_type_ = packet.frameType;
    }
    // Track the marker bit, should only be set for one packet per session.
    if (packet.markerBit && last_packet_seq_num_ == -1) {
      last_packet_seq_num_ = static_cast<int>(packet.seqNum);
    } else if (last_packet_seq_num_ != -1 &&
               IsNewerSequenceNumber(packet.seqNum, last_packet_seq_num_)) {
      LOG(LS_WARNING) << "Received packet with a sequence number which is out "
                         "of frame boundaries";
      return -3;
    }
  }
  // The insert operation invalidates the iterator |rit|.
  PacketIterator packet_list_it = packets_.insert(rit.base(), packet);
  // Copy the RTP payload into the frame buffer.
  size_t returnLength = InsertBuffer(frame_buffer, packet_list_it);
  // After sorting and de-duplicating the packets within the frame,
  // update the frame's state.
  UpdateCompleteSession();
  if (decode_error_mode == kWithErrors)
    decodable_ = true;
  else if (decode_error_mode == kSelectiveErrors)
    UpdateDecodableSession(frame_data);
  return static_cast<int>(returnLength);
}
~~~
Basic flow:
1. Sort packets by sequence number in ascending order
2. Remove duplicates
3. Insert the RTP payload data
4. Update the frame's state
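Steps 1 and 2 above can be sketched independently of the WebRTC classes: walk the list from the back to find the insertion point, and reject a packet whose sequence number is already present (hypothetical helper names, same reverse-iterator idiom as the real code):

```cpp
#include <cstdint>
#include <list>

struct Pkt {
  uint16_t seqNum;
};

// Wraparound-aware "a is newer than b" for 16-bit sequence numbers.
bool IsNewerSeq(uint16_t a, uint16_t b) {
  return a != b && static_cast<uint16_t>(a - b) < 0x8000;
}

// Insert `p` into `pkts`, keeping ascending sequence-number order
// (modulo wraparound). Returns false on a duplicate.
bool InsertSorted(std::list<Pkt>& pkts, const Pkt& p) {
  // Scan backwards until we find a packet `p` is newer than (or equal to).
  auto rit = pkts.rbegin();
  for (; rit != pkts.rend(); ++rit)
    if (IsNewerSeq(p.seqNum, rit->seqNum) || rit->seqNum == p.seqNum)
      break;
  // Duplicate check before inserting.
  if (rit != pkts.rend() && rit->seqNum == p.seqNum)
    return false;
  // rit.base() is the forward iterator just after `rit`.
  pkts.insert(rit.base(), p);
  return true;
}
```

Scanning from the back is a deliberate choice: packets usually arrive nearly in order, so the insertion point is almost always found within the first one or two comparisons.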
* UpdateCompleteSession
Updates the frame's state to Complete. **The condition for the Complete state is that the frame has its first packet and its last packet (the one with the marker bit), with consecutive sequence numbers in between; a frame meeting this condition is in fact decodable.**
~~~
void VCMSessionInfo::UpdateCompleteSession() {
  if (HaveFirstPacket() && HaveLastPacket()) {
    // Do we have all the packets in this session?
    bool complete_session = true;
    PacketIterator it = packets_.begin();
    PacketIterator prev_it = it;
    ++it;
    for (; it != packets_.end(); ++it) {
      if (!InSequence(it, prev_it)) {
        complete_session = false;
        break;
      }
      prev_it = it;
    }
    complete_ = complete_session;
  }
}
~~~
* UpdateDecodableSession
The decodability conditions here puzzle me somewhat. Listing them as the code comments describe:
1. The frame is not a keyframe
* It must have the first packet. As the code comment explains:
> It has the first packet: In VP8 the first packet contains all or part of the first partition, which consists of the most relevant information for decoding.
* A judgment based on the average number of RTP packets per frame:
> Either more than the upper threshold of the average number of packets per frame is present or less than the lower threshold of the average number of packets per frame is present: suggests a small frame. Such a frame is unlikely to contain many motion vectors, so having the first packet will likely suffice. Once we have more than the lower threshold of the frame, we know that the frame is medium or large-sized.
My reading:
**For a small frame, the number of RTP packets it contains is below the lower threshold of the per-frame average; such a frame will not carry many motion vectors, so having the first packet is enough.**
**For a large frame, the number of RTP packets exceeds the upper threshold of the per-frame average; is it that such a frame already carries most of the motion vectors, so that together with the first packet's information it can be decoded?**
**A frame whose RTP packet count falls between the two thresholds is not considered decodable.**
~~~
void VCMSessionInfo::UpdateDecodableSession(const FrameData& frame_data) {
  // Irrelevant if session is already complete or decodable
  if (complete_ || decodable_)
    return;
  // TODO(agalusza): Account for bursty loss.
  // TODO(agalusza): Refine these values to better approximate optimal ones.
  // Do not decode frames if the RTT is lower than this.
  const int64_t kRttThreshold = 100;
  // Do not decode frames if the number of packets is between these two
  // thresholds.
  const float kLowPacketPercentageThreshold = 0.2f;
  const float kHighPacketPercentageThreshold = 0.8f;
  if (frame_data.rtt_ms < kRttThreshold
      || frame_type_ == kVideoFrameKey
      || !HaveFirstPacket()
      || (NumPackets() <= kHighPacketPercentageThreshold
              * frame_data.rolling_average_packets_per_frame
          && NumPackets() > kLowPacketPercentageThreshold
              * frame_data.rolling_average_packets_per_frame))
    return;
  decodable_ = true;
}
~~~
The average number of packets per frame is computed with a moving-average algorithm, which estimates the stream's per-frame packet count by averaging the RTP packet counts observed over a time window.
### VCMJitterBuffer
VCMJitterBuffer operates on video frames. Below is its `InsertPacket` method:
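A rough sketch of how a rolling average of packets per frame feeds the thresholds above. The 0.2/0.8 factors come from the code shown; the averaging filter here is an illustrative exponential moving average with an assumed smoothing factor, not necessarily the exact filter WebRTC uses:

```cpp
// Illustrative exponential moving average of packets per frame.
struct PacketsPerFrameAvg {
  float avg = 0.f;
  bool initialized = false;

  void Update(int packets_in_frame) {
    if (!initialized) {
      avg = static_cast<float>(packets_in_frame);
      initialized = true;
    } else {
      // 0.9/0.1 smoothing is an assumption for illustration.
      avg = 0.9f * avg + 0.1f * static_cast<float>(packets_in_frame);
    }
  }
};

// The size test from UpdateDecodableSession: a frame whose packet count
// falls strictly between 20% and 80% of the rolling average lands in
// the "dead zone" and is NOT marked decodable.
bool InDeadZone(int num_packets, float avg_packets_per_frame) {
  return num_packets <= 0.8f * avg_packets_per_frame &&
         num_packets > 0.2f * avg_packets_per_frame;
}
```

With an average of 10 packets per frame, a frame holding 2 packets (small) or 9 packets (nearly whole) escapes the dead zone, while one holding 5 packets does not.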
~~~
VCMFrameBufferEnum VCMJitterBuffer::InsertPacket(const VCMPacket& packet,
                                                 bool* retransmitted) {
  CriticalSectionScoped cs(crit_sect_);
  if (nack_module_)
    nack_module_->OnReceivedPacket(packet);
  ++num_packets_;
  if (num_packets_ == 1) {
    time_first_packet_ms_ = clock_->TimeInMilliseconds();
  }
  // Does this packet belong to an old frame?
  if (last_decoded_state_.IsOldPacket(&packet)) {
    // Packets that arrive too late are discarded.
    // Account only for media packets.
    if (packet.sizeBytes > 0) {
      num_discarded_packets_++;
      num_consecutive_old_packets_++;
      if (stats_callback_ != NULL)
        stats_callback_->OnDiscardedPacketsUpdated(num_discarded_packets_);
    }
    // Update last decoded sequence number if the packet arrived late and
    // belongs to a frame with a timestamp equal to the last decoded
    // timestamp.
    last_decoded_state_.UpdateOldPacket(&packet);
    DropPacketsFromNackList(last_decoded_state_.sequence_num());
    // Also see if this old packet made more incomplete frames continuous.
    FindAndInsertContinuousFramesWithState(last_decoded_state_);
    if (num_consecutive_old_packets_ > kMaxConsecutiveOldPackets) {
      LOG(LS_WARNING)
          << num_consecutive_old_packets_
          << " consecutive old packets received. Flushing the jitter buffer.";
      Flush();
      return kFlushIndicator;
    }
    return kOldPacket;
  }
  num_consecutive_old_packets_ = 0;
  // Use the RTP packet's timestamp to locate the corresponding frame, and
  // its frame list, among incomplete_frames_, decodable_frames_ and
  // free_frames_.
  VCMFrameBuffer* frame;
  FrameList* frame_list;
  const VCMFrameBufferEnum error = GetFrame(packet, &frame, &frame_list);
  if (error != kNoError)
    return error;
  int64_t now_ms = clock_->TimeInMilliseconds();
  // We are keeping track of the first and latest seq numbers, and
  // the number of wraps to be able to calculate how many packets we expect.
  if (first_packet_since_reset_) {
    // Now it's time to start estimating jitter
    // reset the delay estimate.
    inter_frame_delay_.Reset(now_ms);
  }
  // Empty packets may bias the jitter estimate (lacking size component),
  // therefore don't let empty packet trigger the following updates:
  if (packet.frameType != kEmptyFrame) {
    if (waiting_for_completion_.timestamp == packet.timestamp) {
      // This can get bad if we have a lot of duplicate packets,
      // we will then count some packet multiple times.
      waiting_for_completion_.frame_size += packet.sizeBytes;
      waiting_for_completion_.latest_packet_time = now_ms;
    } else if (waiting_for_completion_.latest_packet_time >= 0 &&
               waiting_for_completion_.latest_packet_time + 2000 <= now_ms) {
      // A packet should never be more than two seconds late
      UpdateJitterEstimate(waiting_for_completion_, true);
      waiting_for_completion_.latest_packet_time = -1;
      waiting_for_completion_.frame_size = 0;
      waiting_for_completion_.timestamp = 0;
    }
  }
  // Record the frame's state before inserting this RTP packet.
  VCMFrameBufferStateEnum previous_state = frame->GetState();
  // Insert packet.
  FrameData frame_data;
  frame_data.rtt_ms = rtt_ms_;
  frame_data.rolling_average_packets_per_frame = average_packets_per_frame_;
  // Insert the RTP packet and obtain the frame's updated state.
  VCMFrameBufferEnum buffer_state =
      frame->InsertPacket(packet, now_ms, decode_error_mode_, frame_data);
  if (previous_state != kStateComplete) {
    TRACE_EVENT_ASYNC_BEGIN1("webrtc", "Video", frame->TimeStamp(), "timestamp",
                             frame->TimeStamp());
  }
  /* Any buffer_state greater than 0 is a normal frame state, one of:
   *   kIncomplete       // Frame incomplete
   *   kCompleteSession  // at least one layer in the frame complete
   *   kDecodableSession // Frame incomplete, but ready to be decoded
   *   kDuplicatePacket  // We're receiving a duplicate packet
   */
  if (buffer_state > 0) {
    incoming_bit_count_ += packet.sizeBytes << 3;
    if (first_packet_since_reset_) {
      latest_received_sequence_number_ = packet.seqNum;
      first_packet_since_reset_ = false;
    } else {
      if (IsPacketRetransmitted(packet)) {
        frame->IncrementNackCount();
      }
      if (!UpdateNackList(packet.seqNum) &&
          packet.frameType != kVideoFrameKey) {
        buffer_state = kFlushIndicator;
      }
      latest_received_sequence_number_ =
          LatestSequenceNumber(latest_received_sequence_number_, packet.seqNum);
    }
  }
  // Is the frame already in the decodable list?
  bool continuous = IsContinuous(*frame);
  switch (buffer_state) {
    case kGeneralError:
    case kTimeStampError:
    case kSizeError: {
      // The frame is broken; discard it back to the free list.
      free_frames_.push_back(frame);
      break;
    }
    case kCompleteSession: {
      if (previous_state != kStateDecodable &&
          previous_state != kStateComplete) {
        CountFrame(*frame);
        if (continuous) {
          // Signal that we have a complete session.
          frame_event_->Set();
        }
      }
      FALLTHROUGH();
    }
    // Note: There is no break here - continuing to kDecodableSession.
    case kDecodableSession: {
      *retransmitted = (frame->GetNackCount() > 0);
      if (continuous) {
        decodable_frames_.InsertFrame(frame);
        FindAndInsertContinuousFrames(*frame);
      } else {
        incomplete_frames_.InsertFrame(frame);
        // If NACKs are enabled, keyframes are triggered by |GetNackList|.
        if (nack_mode_ == kNoNack &&
            NonContinuousOrIncompleteDuration() >
                90 * kMaxDiscontinuousFramesTime) {
          return kFlushIndicator;
        }
      }
      break;
    }
    case kIncomplete: {
      if (frame->GetState() == kStateEmpty &&
          last_decoded_state_.UpdateEmptyFrame(frame)) {
        free_frames_.push_back(frame);
        return kNoError;
      } else {
        incomplete_frames_.InsertFrame(frame);
        // If NACKs are enabled, keyframes are triggered by |GetNackList|.
        if (nack_mode_ == kNoNack &&
            NonContinuousOrIncompleteDuration() >
                90 * kMaxDiscontinuousFramesTime) {
          return kFlushIndicator;
        }
      }
      break;
    }
    case kNoError:
    case kOutOfBoundsPacket:
    case kDuplicatePacket: {
      // Bad RTP packet; the frame itself is left untouched.
      // Put back the frame where it came from.
      if (frame_list != NULL) {
        frame_list->InsertFrame(frame);
      } else {
        free_frames_.push_back(frame);
      }
      ++num_duplicated_packets_;
      break;
    }
    case kFlushIndicator:
      free_frames_.push_back(frame);
      return kFlushIndicator;
    default:
      assert(false);
  }
  return buffer_state;
}
~~~
Basic flow:
1. Use the RTP packet's timestamp to locate the corresponding frame, and the frame list it lives in, among incomplete\_frames\_, decodable\_frames\_ and free\_frames\_
2. Record the frame's state before inserting the packet
3. Insert the RTP packet into the frame it belongs to
4. Check the frame's state after insertion; any buffer\_state greater than 0 is a normal state: kIncomplete (frame incomplete), kCompleteSession (at least one layer in the frame complete), kDecodableSession (frame incomplete but ready to be decoded), kDuplicatePacket (a duplicate RTP packet was received; the frame's state is unaffected)
5. Determine whether, with this frame added to decodable\_frames\_, the GOP the frame belongs to is decodable (decodable\_frames\_ can be viewed as the video stream, containing multiple GOPs)
6. Decide which list the frame belongs in: if its GOP is decodable, the frame is placed into decodable\_frames\_, possibly moving there from the incomplete\_frames\_ list. Note that frames in an error state (kGeneralError, kTimeStampError, kSizeError) are discarded (inserting the new RTP packet corrupted the frame) and pushed onto the free-frame list, while kNoError, kOutOfBoundsPacket and kDuplicatePacket only discard the RTP packet and leave the frame intact, so the frame goes back to whichever list it came from.
### VCMDecodingState
The decodable\_frames\_ in VCMJitterBuffer can be thought of as a sequence of GOPs, each containing multiple video frames. By coding theory, GOPs do not reference one another, so an error inside one GOP does not propagate to the next, and the first frame of a GOP is a keyframe. A frame in the complete or decodable state is inserted into decodable\_frames\_ only after checking that it belongs to a GOP and can be decoded. VCMDecodingState is what judges these inter-frame relationships: whether frames belong to the same GOP and whether they are decodable.
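The GOP reasoning above can be illustrated with a toy continuity check: a frame extends the decodable stream only if it is a keyframe (starting a new GOP) or it directly follows the last decoded frame. This is a heavily simplified sketch under stated assumptions; the real VCMDecodingState also consults picture ids, temporal layers, and other codec-specific information:

```cpp
#include <cstdint>

// Toy stand-in for VCMDecodingState: tracks only the sequence number
// that ended the last decoded frame.
struct ToyDecodingState {
  bool in_sync = false;
  uint16_t last_frame_seq_end = 0;

  // A keyframe always (re)starts a GOP; a delta frame is continuous
  // only if its first packet directly follows the last decoded frame.
  bool IsContinuous(bool is_keyframe, uint16_t first_seq) const {
    if (is_keyframe)
      return true;
    return in_sync &&
           static_cast<uint16_t>(first_seq - last_frame_seq_end) == 1;
  }

  // Record that a frame ending at `last_seq` was decoded.
  void SetDecoded(uint16_t last_seq) {
    in_sync = true;
    last_frame_seq_end = last_seq;
  }
};
```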
### On Frame Decodability and Continuous Decodability of the Stream
**Most of this decision logic targets VP8/VP9; in particular, VCMDecodingState uses the picture id, temporal layers, and similar information to judge whether frames are continuous. This information does not seem to exist in H264 (I am not certain), so it should be specific to VP8/VP9 (not being familiar with VP8/VP9, I am also unsure what the frame-continuity criteria are based on). Take note, then: if video stutters or black screens appear after enabling H264, this logic for judging whether a frame is decodable and whether frames can be decoded continuously may be one of the causes, since these conditions may not hold for H264.**