BufReader: Zero-Copy Network Reading with Non-Contiguous Memory Buffers

TL;DR (Key Takeaways)

Core Innovation: Non-Contiguous Memory Buffer Passing Mechanism

  • Data stored as sliced memory blocks, non-contiguous layout
  • Pass references via ReadRange callback, zero-copy
  • Memory blocks reused from object pool, avoiding allocation and GC

Performance Data (Streaming server, 100 concurrent streams):

bufio.Reader: 79 GB allocated, 134 GCs, 374.6 ns/op
BufReader:    0.6 GB allocated, 2 GCs, 30.29 ns/op

Result: 98.5% GC reduction, 11.6x throughput improvement

Ideal For: High-concurrency network servers, streaming media, long-running services


1. Problem: Traditional Contiguous Memory Buffer Bottlenecks

1.1 bufio.Reader's Contiguous Memory Model

The standard library bufio.Reader uses a fixed-size contiguous memory buffer:

type Reader struct {
    buf []byte    // Single contiguous buffer (e.g., 4KB)
    r, w int      // Read/write pointers
}

func (b *Reader) Read(p []byte) (n int, err error) {
    // Copy from contiguous buffer to target
    n = copy(p, b.buf[b.r:b.w])  // Must copy
    return
}

Cost of Contiguous Memory:

Reading 16KB data (with 4KB buffer):

Network → bufio buffer → User buffer
  ↓      (4KB contiguous)    ↓
1st      [████]  →  Copy to result[0:4KB]
2nd      [████]  →  Copy to result[4KB:8KB]
3rd      [████]  →  Copy to result[8KB:12KB]
4th      [████]  →  Copy to result[12KB:16KB]

Total: 4 network reads + 4 memory copies
Allocates result (16KB contiguous memory)
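
The copy chain above can be made concrete with a small standard-library sketch (illustrative only, not taken from the document's benchmarks; the function name is made up): a 4KB bufio.Reader sits in front of the connection, and io.ReadFull assembles the 16KB result, so every byte is copied out of the internal buffer into the destination slice.

import (
    "bufio"
    "io"
    "net"
)

// readContiguous sketches the pattern above: network data lands in bufio.Reader's
// 4KB internal buffer and is then copied again into the 16KB contiguous result.
func readContiguous(conn net.Conn) ([]byte, error) {
    reader := bufio.NewReaderSize(conn, 4096)
    result := make([]byte, 16<<10) // 16KB contiguous destination, allocated up front
    if _, err := io.ReadFull(reader, result); err != nil {
        return nil, err
    }
    return result, nil
}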

1.2 Issues in High-Concurrency Scenarios

In streaming servers (100 concurrent connections, 30fps each):

// Typical processing pattern
func handleStream(conn net.Conn) {
    reader := bufio.NewReaderSize(conn, 4096)
    for {
        // Allocate contiguous buffer for each packet
        packet := make([]byte, 1024)  // Allocation 1
        n, _ := reader.Read(packet)   // Copy 1
        
        // Forward to multiple subscribers
        for _, sub := range subscribers {
            data := make([]byte, n)  // Allocations 2-N
            copy(data, packet[:n])   // Copies 2-N
            sub.Write(data)
        }
    }
}

// Performance impact:
// 100 connections × 30fps × (1 + subscribers) allocations = massive temporary memory
// Triggers frequent GC, system instability

Core Problems:

  1. Must maintain contiguous memory layout → Frequent copying
  2. Allocate new buffer for each packet → Massive temporary objects
  3. Forwarding requires multiple copies → CPU wasted on memory operations

2. Core Solution: Non-Contiguous Memory Buffer Passing Mechanism

2.1 Design Philosophy

BufReader uses non-contiguous memory block slices:

Data is no longer required to live in one contiguous buffer:
1. Data is scattered across multiple memory blocks (a slice of blocks)
2. Each block is independently managed and reused
3. Blocks are passed by reference, with no data copying

Core Data Structures:

type BufReader struct {
    Allocator *ScalableMemoryAllocator  // Object pool allocator
    buf       MemoryReader               // Memory block slice
}

type MemoryReader struct {
    Buffers [][]byte  // Multiple memory blocks, non-contiguous!
    Size    int       // Total size
    Length  int       // Readable length
}

2.2 Non-Contiguous Memory Buffer Model

Contiguous vs Non-Contiguous Comparison

bufio.Reader (Contiguous Memory):
┌─────────────────────────────────┐
│ 4KB Fixed Buffer                │
│ [Read][Available]               │
└─────────────────────────────────┘
- Must copy to contiguous target buffer
- Fixed size limitation
- Read portion wastes space

BufReader (Non-Contiguous Memory):
┌──────┐ ┌──────┐ ┌────────┐ ┌──────┐
│Block1│→│Block2│→│ Block3 │→│Block4│
│ 512B │ │ 1KB  │ │  2KB   │ │ 3KB  │
└──────┘ └──────┘ └────────┘ └──────┘
- Directly pass reference to each block (zero-copy)
- Flexible block sizes
- Recycle immediately after processing
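
The non-contiguous model boils down to holding the payload as a slice of independently sized blocks. Below is a minimal illustration with plain slices (illustration only; the real MemoryReader adds Size/Length bookkeeping and pool-backed blocks):

import "fmt"

// Illustration only: a logical 6.5KB payload held as four independent blocks,
// walked block by block instead of being assembled into one contiguous buffer.
func walkBlocks() {
    blocks := [][]byte{
        make([]byte, 512),   // Block1: 512B
        make([]byte, 1<<10), // Block2: 1KB
        make([]byte, 2<<10), // Block3: 2KB
        make([]byte, 3<<10), // Block4: 3KB
    }

    total := 0
    for _, block := range blocks {
        total += len(block) // a consumer would receive each block by reference
    }
    fmt.Println("logical payload size:", total) // 6656 bytes
}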

Memory Block Chain Workflow

sequenceDiagram
    participant N as Network
    participant P as Object Pool
    participant B as BufReader.buf
    participant U as User Code
    
    N->>P: 1st read (returns 512B)
    P-->>B: Block1 (512B) - from pool or new
    B->>B: Buffers = [Block1]
    
    N->>P: 2nd read (returns 1KB)
    P-->>B: Block2 (1KB) - reused from pool
    B->>B: Buffers = [Block1, Block2]
    
    N->>P: 3rd read (returns 2KB)
    P-->>B: Block3 (2KB)
    B->>B: Buffers = [Block1, Block2, Block3]
    
    N->>P: 4th read (returns 1KB)
    P-->>B: Block4 (1KB)
    B->>B: Buffers = [Block1, Block2, Block3, Block4]
    
    U->>B: ReadRange(4096)
    B->>U: yield(Block1) - pass reference
    B->>U: yield(Block2) - pass reference
    B->>U: yield(Block3) - pass reference
    B->>U: yield(Block4[0:512]) - pass partial reference
    
    U->>B: Processing complete
    B->>P: Recycle Block1, Block2, Block3 (Block4 keeps its unread 512B)
    Note over P: Memory blocks return to pool for reuse

2.3 Zero-Copy Passing: ReadRange API

Core API:

func (r *BufReader) ReadRange(n int, yield func([]byte)) error

How It Works:

// Internal implementation (simplified)
func (r *BufReader) ReadRange(n int, yield func([]byte)) error {
    remaining := n
    
    // Iterate through memory block slice
    for _, block := range r.buf.Buffers {
        if remaining <= 0 {
            break
        }
        
        if len(block) <= remaining {
            // Pass entire block
            yield(block)  // Zero-copy: pass reference directly!
            remaining -= len(block)
        } else {
            // Pass portion
            yield(block[:remaining])
            remaining = 0
        }
    }
    
    // Recycle processed blocks
    r.recycleFront()
    return nil
}

Usage Example:

// Read 4096 bytes of data
reader.ReadRange(4096, func(chunk []byte) {
    // chunk is reference to original memory block
    // May be called multiple times with different sized blocks
    // e.g.: 512B, 1KB, 2KB, 512B
    
    processData(chunk)  // Process directly, zero-copy!
})

// Characteristics:
// - No need to allocate target buffer
// - No need to copy data
// - Each chunk automatically recycled after processing

2.4 Advantages in Real Network Scenarios

Scenario: Read 10KB from network, each read returns 500B-2KB

bufio.Reader (Contiguous Memory):
1. Read 2KB to internal buffer (contiguous)
2. Copy 2KB to user buffer ← Copy
3. Read 1.5KB to internal buffer
4. Copy 1.5KB to user buffer ← Copy
5. Read 2KB...
6. Copy 2KB... ← Copy
... Repeat ...
Total: Multiple network reads + Multiple memory copies
Must allocate 10KB contiguous buffer

BufReader (Non-Contiguous Memory):
1. Read 2KB → Block1, append to slice
2. Read 1.5KB → Block2, append to slice
3. Read 2KB → Block3, append to slice
4. Read 2KB → Block4, append to slice
5. Read 2.5KB → Block5, append to slice
6. ReadRange(10KB):
   → yield(Block1) - 2KB
   → yield(Block2) - 1.5KB
   → yield(Block3) - 2KB
   → yield(Block4) - 2KB
   → yield(Block5) - 2.5KB
Total: Multiple network reads + 0 memory copies
No contiguous memory needed, process block by block
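
If a downstream consumer genuinely needs the 10KB as one contiguous slice (for example, a parser that cannot work incrementally), the assembly can still be done explicitly, with a single allocation and one deliberate copy per block. A minimal sketch using the ReadRange API described above (the helper name is illustrative):

// Sketch: build a contiguous buffer only when it is truly required.
func readContiguousPayload(reader *BufReader, n int) ([]byte, error) {
    out := make([]byte, 0, n) // single allocation for the contiguous result
    err := reader.ReadRange(n, func(chunk []byte) {
        out = append(out, chunk...) // explicit copy, once per block
    })
    return out, err
}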

2.5 Real Application: Stream Forwarding

Problem Scenario: 100 concurrent streams, each forwarded to 10 subscribers

Traditional Approach (Contiguous Memory):

func forwardStream_Traditional(reader *bufio.Reader, subscribers []net.Conn) {
    packet := make([]byte, 4096)  // Alloc 1: contiguous memory
    n, _ := reader.Read(packet)   // Copy 1: from bufio buffer
    
    // Copy for each subscriber
    for _, sub := range subscribers {
        data := make([]byte, n)  // Allocs 2-11: 10 times
        copy(data, packet[:n])   // Copies 2-11: 10 times
        sub.Write(data)
    }
}
// Per packet: 11 allocations + 11 copies
// 100 concurrent × 30fps × 11 = 33,000 allocations/sec

BufReader Approach (Non-Contiguous Memory):

func forwardStream_BufReader(reader *BufReader, subscribers []net.Conn) {
    reader.ReadRange(4096, func(chunk []byte) {
        // chunk is original memory block reference, may be non-contiguous
        // All subscribers share the same memory block!
        
        for _, sub := range subscribers {
            sub.Write(chunk)  // Send reference directly, zero-copy
        }
    })
}
// Per packet: 0 allocations + 0 copies
// 100 concurrent × 30fps × 0 = 0 allocations/sec

Performance Comparison:

  • Allocations: 33,000/sec → 0/sec
  • Memory copies: 33,000/sec → 0/sec
  • GC pressure: High → Very low

2.6 Memory Block Lifecycle

stateDiagram-v2
    state "Get from Pool" as GetFromPool
    state "Read Network Data" as ReadNetworkData
    state "Append to Slice" as AppendToSlice
    state "Pass to User" as PassToUser
    state "User Processing" as UserProcessing
    state "Recycle to Pool" as RecycleToPool

    [*] --> GetFromPool
    GetFromPool --> ReadNetworkData
    ReadNetworkData --> AppendToSlice
    AppendToSlice --> PassToUser
    PassToUser --> UserProcessing
    UserProcessing --> RecycleToPool
    RecycleToPool --> GetFromPool

    note right of GetFromPool
        Reuse existing blocks
        Avoid GC
    end note

    note right of PassToUser
        Pass reference, zero-copy
        May pass to multiple subscribers
    end note

    note right of RecycleToPool
        Active recycling
        Immediately reusable
    end note

Key Points:

  1. Memory blocks circularly reused in pool, bypassing GC
  2. Pass references instead of copying data, achieving zero-copy
  3. Recycle immediately after processing, minimizing memory footprint
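
The get → use → recycle cycle can be illustrated with the standard library's sync.Pool (an analogy only; BufReader uses its own ScalableMemoryAllocator, not sync.Pool, and the names below are illustrative):

import "sync"

// Analogy only: sync.Pool stands in for the object pool to show the lifecycle.
var blockPool = sync.Pool{
    New: func() any { return make([]byte, 4<<10) }, // 4KB blocks
}

func handlePacket(process func([]byte)) {
    block := blockPool.Get().([]byte) // 1. get a block from the pool (reused, no new garbage)
    defer blockPool.Put(block)        // 3. recycle immediately after processing
    // ... fill block from the network ...
    process(block) // 2. pass the reference to the user / subscribers
}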

2.7 Core Code Implementation

// Create BufReader (simplified; BufLen and feedData are fields omitted from the struct sketch above)
func NewBufReader(reader io.Reader) *BufReader {
    r := &BufReader{
        Allocator: NewScalableMemoryAllocator(16384), // Object pool
        BufLen:    16384,                             // Size hint for each network read
    }
    r.feedData = func() error {
        // Get a memory block from the pool and read network data directly into it
        buf, err := r.Allocator.Read(reader, r.BufLen)
        if err != nil {
            return err
        }
        // Append to the slice (only adds a reference, no copy)
        r.buf.Buffers = append(r.buf.Buffers, buf)
        r.buf.Length += len(buf)
        return nil
    }
    return r
}

// Zero-copy reading (simplified)
func (r *BufReader) ReadRange(n int, yield func([]byte)) error {
    // Pull more data from the network until n bytes are buffered
    for r.buf.Length < n {
        if err := r.feedData(); err != nil {
            return err
        }
    }
    
    // Pass references block by block (partial-block handling as shown in 2.3)
    for _, block := range r.buf.Buffers {
        yield(block)  // Zero-copy passing
    }
    
    // Recycle processed blocks
    r.recycleFront()
    return nil
}

// Recycle memory blocks to pool
func (r *BufReader) Recycle() {
    if r.Allocator != nil {
        r.Allocator.Recycle()  // Return all blocks to pool
    }
}

3. Performance Validation

3.1 Test Design

Real Network Simulation: Each read returns random size (64-2048 bytes), simulating real network fluctuations

Core Test Scenarios:

  1. Concurrent Network Connection Reading - Simulate 100+ concurrent connections
  2. GC Pressure Test - Demonstrate long-term running differences
  3. Streaming Server - Real business scenario (100 streams × forwarding)
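
A sketch of the kind of jittery reader this test design implies, returning a random 64-2048 bytes per Read call (illustrative only; the actual benchmark code lives in pkg/util/buf_reader_benchmark_test.go):

import "math/rand"

// jitteryReader simulates a network connection whose reads return
// unpredictable sizes between 64 and 2048 bytes.
type jitteryReader struct{}

func (jitteryReader) Read(p []byte) (int, error) {
    n := 64 + rand.Intn(2048-64+1) // random size in [64, 2048]
    if n > len(p) {
        n = len(p)
    }
    for i := 0; i < n; i++ {
        p[i] = byte(i) // dummy payload
    }
    return n, nil
}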

3.2 Performance Test Results

Test Environment: Apple M2 Pro, Go 1.23.0

GC Pressure Test (Core Comparison)

| Metric | bufio.Reader | BufReader | Improvement |
|---|---|---|---|
| Operation Latency | 1874 ns/op | 112.7 ns/op | 16.6x faster |
| Allocation Count | 5,576,659 | 3,918 | 99.93% reduction |
| Per Operation | 2 allocs/op | 0 allocs/op | Zero allocation |
| Throughput | 2.8M ops/s | 45.7M ops/s | 16x improvement |

Streaming Server Scenario

| Metric | bufio.Reader | BufReader | Improvement |
|---|---|---|---|
| Operation Latency | 374.6 ns/op | 30.29 ns/op | 12.4x faster |
| Memory Allocation | 79,508 MB | 601 MB | 99.2% reduction |
| GC Runs | 134 | 2 | 98.5% reduction |
| Throughput | 10.1M ops/s | 117M ops/s | 11.6x improvement |

Performance Visualization

📊 GC Runs Comparison (Core Advantage)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
bufio.Reader   ████████████████████████████████████████████████████████████████  134 runs
BufReader      █  2 runs  ← 98.5% reduction!

📊 Total Memory Allocation
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
bufio.Reader   ████████████████████████████████████████████████████████████████  79 GB
BufReader      █  0.6 GB  ← 99.2% reduction!

📊 Throughput Comparison
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
bufio.Reader   █████  10.1M ops/s
BufReader      ████████████████████████████████████████████████████████  117M ops/s

3.3 Why Non-Contiguous Memory Is So Fast

Reason 1: Zero-Copy Passing

// bufio - Must copy
buf := make([]byte, 1024)
reader.Read(buf)  // Copy to contiguous memory

// BufReader - Pass reference
reader.ReadRange(1024, func(chunk []byte) {
    // chunk is original memory block, no copy
})

Reason 2: Memory Block Reuse

bufio: Allocate → Use → GC → Reallocate → ...
BufReader: Allocate → Use → Return to pool → Reuse from pool → ...
         ↑ Same memory block reused repeatedly, no GC

Reason 3: Multi-Subscriber Sharing

Traditional: 1 packet → Copy 10 times → 10 subscribers
BufReader: 1 packet → Pass reference → 10 subscribers share
          ↑ Only 1 memory block, all 10 subscribers reference it

4. Usage Guide

4.1 Basic Usage

func handleConnection(conn net.Conn) {
    // Create BufReader
    reader := util.NewBufReader(conn)
    defer reader.Recycle()  // Return all blocks to pool
    
    // Zero-copy read and process
    reader.ReadRange(4096, func(chunk []byte) {
        // chunk is non-contiguous memory block
        // Process directly, no copy needed
        processChunk(chunk)
    })
}

4.2 Real-World Use Cases

Scenario 1: Protocol Parsing

// Parse FLV packet (header + data)
func parseFLV(reader *BufReader) {
    // Read packet type (1 byte)
    packetType, _ := reader.ReadByte()
    
    // Read data size (3 bytes)
    dataSize, _ := reader.ReadBE32(3)
    
    // Skip timestamp etc (7 bytes)
    reader.Skip(7)
    
    // Zero-copy read data (may span multiple non-contiguous blocks)
    reader.ReadRange(int(dataSize), func(chunk []byte) {
        // chunk may be complete data or partial
        // Parse block by block, no need to wait for complete data
        parseDataChunk(packetType, chunk)
    })
}

Scenario 2: High-Concurrency Forwarding

// Read from one source, forward to multiple targets
func relay(source *BufReader, targets []io.Writer) {
    source.ReadRange(8192, func(chunk []byte) {
        // All targets share the same memory block
        for _, target := range targets {
            target.Write(chunk)  // Zero-copy forwarding
        }
    })
}

Scenario 3: Streaming Server

// Receive RTSP stream and distribute to subscribers
type Stream struct {
    reader      *BufReader
    subscribers []*Subscriber
}

func (s *Stream) Process() {
    s.reader.ReadRange(65536, func(frame []byte) {
        // frame may be part of video frame (non-contiguous)
        // Send directly to all subscribers
        for _, sub := range s.subscribers {
            sub.WriteFrame(frame)  // Shared memory, zero-copy
        }
    })
}

4.3 Best Practices

Correct Usage:

// 1. Always recycle resources
reader := util.NewBufReader(conn)
defer reader.Recycle()

// 2. Process directly in callback, don't save references
reader.ReadRange(1024, func(data []byte) {
    processData(data)  // ✅ Process immediately
})

// 3. Explicitly copy when retention needed
var saved []byte
reader.ReadRange(1024, func(data []byte) {
    saved = append(saved, data...)  // ✅ Explicit copy
})

Wrong Usage:

// ❌ Don't save references
var dangling []byte
reader.ReadRange(1024, func(data []byte) {
    dangling = data  // Wrong: data will be recycled
})
// dangling is now a dangling reference!

// ❌ Don't forget to recycle
reader := util.NewBufReader(conn)
// Missing defer reader.Recycle()
// Memory blocks cannot be returned to pool

4.4 Performance Optimization Tips

Tip 1: Batch Processing

// ✅ Optimized: Read multiple packets at once
reader.ReadRange(65536, func(chunk []byte) {
    // One chunk may contain multiple length-prefixed packets
    for len(chunk) >= 4 {
        size := int(binary.BigEndian.Uint32(chunk[:4]))
        if len(chunk) < 4+size {
            break // incomplete packet; see the carry-buffer sketch below
        }
        packet := chunk[4 : 4+size]
        processPacket(packet)
        chunk = chunk[4+size:]
    }
})
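
A packet can also straddle two chunks. One way to handle that, extending the snippet above, is a small carry buffer. This is a hedged sketch (the variable names are illustrative, not part of the BufReader API); note the carried bytes are explicitly copied because they outlive the callback:

// Illustrative only: carry the incomplete tail across ReadRange callbacks so a
// length-prefixed packet that straddles two chunks is still parsed correctly.
var carry []byte
reader.ReadRange(65536, func(chunk []byte) {
    data := chunk
    if len(carry) > 0 {
        data = append(carry, chunk...) // copy: carried bytes must not alias recycled blocks
    }
    for len(data) >= 4 {
        size := int(binary.BigEndian.Uint32(data[:4]))
        if len(data) < 4+size {
            break // wait for the rest of this packet
        }
        processPacket(data[4 : 4+size])
        data = data[4+size:]
    }
    carry = append(carry[:0], data...) // keep the incomplete tail as an explicit copy
})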

Tip 2: Choose Appropriate Block Size

// Choose based on application scenario
const (
    SmallPacket  = 4 << 10   // 4KB  - RTSP/HTTP
    MediumPacket = 16 << 10  // 16KB - Audio streams
    LargePacket  = 64 << 10  // 64KB - Video streams
)

reader := util.NewBufReaderWithBufLen(conn, LargePacket)

5. Summary

Core Innovation: Non-Contiguous Memory Buffering

BufReader's core contribution is not "better buffering" but a fundamental change to the memory layout model:

Traditional thinking: Data must be in contiguous memory
BufReader: Data can be scattered across blocks, passed by reference

Result:
✓ Zero-copy: No need to reassemble into contiguous memory
✓ Zero allocation: Memory blocks reused from object pool
✓ Zero GC pressure: No temporary objects created

Key Advantages

| Feature | Implementation | Performance Impact |
|---|---|---|
| Zero-Copy | Pass memory block references | No copy overhead |
| Zero Allocation | Object pool reuse | 98.5% GC reduction |
| Multi-Subscriber Sharing | Same block referenced multiple times | 10x+ memory savings |
| Flexible Block Sizes | Adapt to network fluctuations | No reassembly needed |

Ideal Use Cases

| Scenario | Recommended | Reason |
|---|---|---|
| High-concurrency network servers | BufReader | 98% GC reduction, 10x+ throughput |
| Stream forwarding | BufReader | Zero-copy multicast, memory sharing |
| Protocol parsers | BufReader | Parse block by block, no complete packet needed |
| Long-running services | BufReader | Stable system, minimal GC impact |
| Simple file reading | bufio.Reader | Standard library sufficient |

Key Points

Remember when using BufReader:

  1. Accept non-contiguous data: Process each block via callback
  2. Don't hold references: Data recycled after callback returns
  3. Leverage ReadRange: This is the core zero-copy API
  4. Must call Recycle(): Return memory blocks to pool

Performance Data

Streaming Server (100 concurrent streams, continuous running):

1-hour running estimation:

bufio.Reader (Contiguous Memory):
- Allocates 2.8 TB memory
- Triggers 4,800 GCs
- Frequent system pauses

BufReader (Non-Contiguous Memory):
- Allocates 21 GB memory (133x less)
- Triggers 72 GCs (67x less)
- Almost no GC impact

Testing and Documentation

Run Tests:

sh scripts/benchmark_bufreader.sh

References

  • GoMem Project - Memory object pool implementation
  • Monibuca v5 - Streaming media server
  • Test Code: pkg/util/buf_reader_benchmark_test.go

Core Idea: Eliminate traditional contiguous buffer copying overhead through non-contiguous memory block slices and zero-copy reference passing, achieving high-performance network data processing.