# BufReader: Zero-Copy Network Reading with Non-Contiguous Memory Buffers

## Table of Contents

- [1. Problem: Traditional Contiguous Memory Buffer Bottlenecks](#1-problem-traditional-contiguous-memory-buffer-bottlenecks)
- [2. Core Solution: Non-Contiguous Memory Buffer Passing Mechanism](#2-core-solution-non-contiguous-memory-buffer-passing-mechanism)
- [3. Performance Validation](#3-performance-validation)
- [4. Usage Guide](#4-usage-guide)
- [5. Summary](#5-summary)

## TL;DR (Key Takeaways)

**Core Innovation**: Non-Contiguous Memory Buffer Passing Mechanism

- Data is stored as **sliced memory blocks** in a non-contiguous layout
- References are passed via the **ReadRange callback**: zero-copy
- Memory blocks are **reused from an object pool**, avoiding allocation and GC

**Performance Data** (streaming server, 100 concurrent streams):

```
bufio.Reader: 79 GB allocated, 134 GCs, 374.6 ns/op
BufReader:    0.6 GB allocated,   2 GCs,  30.29 ns/op

Result: 98.5% GC reduction, 11.6x throughput improvement
```

**Ideal For**: high-concurrency network servers, streaming media, long-running services

---

## 1. Problem: Traditional Contiguous Memory Buffer Bottlenecks

### 1.1 bufio.Reader's Contiguous Memory Model

The standard library's `bufio.Reader` uses a **fixed-size contiguous memory buffer**:

```go
// Simplified view of the standard library's Reader
type Reader struct {
	buf  []byte // Single contiguous buffer (e.g., 4KB)
	r, w int    // Read/write positions
}

func (b *Reader) Read(p []byte) (n int, err error) {
	// Copy from the contiguous buffer into the caller's buffer
	n = copy(p, b.buf[b.r:b.w]) // Must copy
	b.r += n
	return
}
```

**Cost of Contiguous Memory**:

```
Reading 16KB of data (with a 4KB buffer):

Network → bufio buffer → User buffer
          ↓ (4KB contiguous) ↓
1st [████] → copy to result[0:4KB]
2nd [████] → copy to result[4KB:8KB]
3rd [████] → copy to result[8KB:12KB]
4th [████] → copy to result[12KB:16KB]

Total: 4 network reads + 4 memory copies
Allocates result (16KB of contiguous memory)
```

### 1.2 Issues in High-Concurrency Scenarios

In a streaming server (100 concurrent connections, 30 fps each):

```go
// Typical processing pattern
func handleStream(conn net.Conn) {
	reader := bufio.NewReaderSize(conn, 4096)
	for {
		// Allocate a contiguous buffer for each packet
		packet := make([]byte, 1024) // Allocation 1
		n, _ := reader.Read(packet)  // Copy 1

		// Forward to multiple subscribers
		for _, sub := range subscribers {
			data := make([]byte, n) // Allocations 2-N
			copy(data, packet[:n])  // Copies 2-N
			sub.Write(data)
		}
	}
}

// Performance impact:
// 100 connections × 30 fps × (1 + subscribers) allocations = massive temporary memory
// Triggers frequent GC and destabilizes the system
```

**Core Problems**:

1. Data must stay in a contiguous memory layout → frequent copying
2. A new buffer is allocated for every packet → massive numbers of temporary objects
3. Forwarding requires one copy per subscriber → CPU wasted on memory operations

## 2. Core Solution: Non-Contiguous Memory Buffer Passing Mechanism

### 2.1 Design Philosophy

BufReader works on **slices of non-contiguous memory blocks**:

```
Data no longer has to live in contiguous memory:
1. Data is scattered across multiple memory blocks (a slice of blocks)
2. Each block is independently managed and reused
3. Blocks are passed by reference, with no data copying
```

**Core Data Structures**:

```go
type BufReader struct {
	Allocator *ScalableMemoryAllocator // Object pool allocator
	buf       MemoryReader             // Memory block slice
}

type MemoryReader struct {
	Buffers [][]byte // Multiple memory blocks, non-contiguous!
	Size    int      // Total size
	Length  int      // Readable length
}
```
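To make the non-contiguous layout concrete, here is a minimal, dependency-free sketch of the idea (a toy stand-in, not the gomem implementation): blocks are kept in a `[][]byte` and handed to a callback by reference, never reassembled into one buffer.

```go
package main

import "fmt"

// blockChain is a toy stand-in for MemoryReader: data lives in several
// independently sized blocks instead of one contiguous buffer.
type blockChain struct {
	blocks [][]byte
	length int
}

func (c *blockChain) push(b []byte) {
	c.blocks = append(c.blocks, b)
	c.length += len(b)
}

// readRange hands at most n bytes to yield, block by block,
// without copying anything into a contiguous buffer.
func (c *blockChain) readRange(n int, yield func([]byte)) {
	remaining := n
	for _, block := range c.blocks {
		if remaining <= 0 {
			break
		}
		if len(block) <= remaining {
			yield(block) // pass the whole block by reference
			remaining -= len(block)
		} else {
			yield(block[:remaining]) // pass only the needed prefix
			remaining = 0
		}
	}
}

func main() {
	var c blockChain
	c.push([]byte("hello "))     // 6B block
	c.push([]byte("non-"))       // 4B block
	c.push([]byte("contiguous")) // 10B block

	c.readRange(12, func(chunk []byte) {
		fmt.Printf("chunk (%dB): %q\n", len(chunk), chunk)
	})
	// Output:
	// chunk (6B): "hello "
	// chunk (4B): "non-"
	// chunk (2B): "co"
}
```

The callback runs once per block, so the caller sees the same 12 bytes it would have read from a contiguous buffer, just delivered in the pieces they already occupy in memory.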
### 2.2 Non-Contiguous Memory Buffer Model

#### Contiguous vs Non-Contiguous Comparison

```
bufio.Reader (contiguous memory):
┌─────────────────────────────────┐
│ 4KB Fixed Buffer                │
│ [Read][Available]               │
└─────────────────────────────────┘
- Must copy into a contiguous target buffer
- Fixed size limitation
- The already-read portion wastes space

BufReader (non-contiguous memory):
┌──────┐ ┌──────┐ ┌────────┐ ┌──────┐
│Block1│→│Block2│→│ Block3 │→│Block4│
│ 512B │ │ 1KB  │ │  2KB   │ │ 3KB  │
└──────┘ └──────┘ └────────┘ └──────┘
- A reference to each block is passed directly (zero-copy)
- Flexible block sizes
- Blocks are recycled immediately after processing
```

#### Memory Block Chain Workflow

```mermaid
sequenceDiagram
    participant N as Network
    participant P as Object Pool
    participant B as BufReader.buf
    participant U as User Code

    N->>P: 1st read (returns 512B)
    P-->>B: Block1 (512B) - from pool or new
    B->>B: Buffers = [Block1]

    N->>P: 2nd read (returns 1KB)
    P-->>B: Block2 (1KB) - reused from pool
    B->>B: Buffers = [Block1, Block2]

    N->>P: 3rd read (returns 2KB)
    P-->>B: Block3 (2KB)
    B->>B: Buffers = [Block1, Block2, Block3]

    N->>P: 4th read (returns 1KB)
    P-->>B: Block4 (1KB)
    B->>B: Buffers = [Block1, Block2, Block3, Block4]

    U->>B: ReadRange(4096)
    B->>U: yield(Block1) - pass reference
    B->>U: yield(Block2) - pass reference
    B->>U: yield(Block3) - pass reference
    B->>U: yield(Block4[0:512]) - pass partial reference
    U->>B: Processing complete
    B->>P: Recycle consumed blocks

    Note over P: Memory blocks return to the pool for reuse
```

### 2.3 Zero-Copy Passing: ReadRange API

**Core API**:

```go
func (r *BufReader) ReadRange(n int, yield func([]byte)) error
```

**How It Works**:

```go
// Internal implementation (simplified)
func (r *BufReader) ReadRange(n int, yield func([]byte)) error {
	remaining := n
	// Iterate over the memory block slice
	for _, block := range r.buf.Buffers {
		if remaining <= 0 {
			break
		}
		if len(block) <= remaining {
			// Pass the entire block
			yield(block) // Zero-copy: pass the reference directly!
			remaining -= len(block)
		} else {
			// Pass only the needed prefix
			yield(block[:remaining])
			remaining = 0
		}
	}
	// Recycle the fully consumed blocks
	r.recycleFront()
	return nil
}
```

**Usage Example**:

```go
// Read 4096 bytes of data
reader.ReadRange(4096, func(chunk []byte) {
	// chunk is a reference to an original memory block
	// The callback may be invoked multiple times with blocks of different sizes,
	// e.g. 512B, 1KB, 2KB, 512B
	processData(chunk) // Process directly, zero-copy!
})

// Characteristics:
// - No target buffer to allocate
// - No data to copy
// - Each chunk is recycled automatically after processing
```
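Because `ReadRange` returns an error (for example when the underlying connection fails), a long-lived connection handler typically wraps it in a loop. A minimal sketch, assuming the `util` package and constructor shown later in section 4; `handleFrame` is a hypothetical application callback and the import path is omitted:

```go
// Sketch of a read loop built on the ReadRange API shown above.
// handleFrame is an assumed application hook, not part of the library.
func readLoop(conn net.Conn, handleFrame func([]byte)) error {
	reader := util.NewBufReader(conn)
	defer reader.Recycle() // always return the blocks to the pool

	for {
		// Each iteration asks for 4096 bytes, delivered to handleFrame
		// as one or more non-contiguous chunks.
		if err := reader.ReadRange(4096, handleFrame); err != nil {
			return err // e.g. the connection was closed
		}
	}
}
```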
### 2.4 Advantages in Real Network Scenarios

**Scenario: read 10KB from the network, where each read returns 500B-2KB**

```
bufio.Reader (contiguous memory):
1. Read 2KB into the internal buffer (contiguous)
2. Copy 2KB to the user buffer      ← Copy
3. Read 1.5KB into the internal buffer
4. Copy 1.5KB to the user buffer    ← Copy
5. Read 2KB...
6. Copy 2KB...                      ← Copy
... repeat ...

Total: multiple network reads + multiple memory copies
Must allocate a 10KB contiguous buffer

BufReader (non-contiguous memory):
1. Read 2KB   → Block1, append to slice
2. Read 1.5KB → Block2, append to slice
3. Read 2KB   → Block3, append to slice
4. Read 2KB   → Block4, append to slice
5. Read 2.5KB → Block5, append to slice
6. ReadRange(10KB):
   → yield(Block1) - 2KB
   → yield(Block2) - 1.5KB
   → yield(Block3) - 2KB
   → yield(Block4) - 2KB
   → yield(Block5) - 2.5KB

Total: multiple network reads + 0 memory copies
No contiguous memory needed; data is processed block by block
```

### 2.5 Real Application: Stream Forwarding

**Problem Scenario**: 100 concurrent streams, each forwarded to 10 subscribers

**Traditional Approach** (contiguous memory):

```go
func forwardStream_Traditional(reader *bufio.Reader, subscribers []net.Conn) {
	packet := make([]byte, 4096) // Alloc 1: contiguous memory
	n, _ := reader.Read(packet)  // Copy 1: from the bufio buffer

	// Copy once per subscriber
	for _, sub := range subscribers {
		data := make([]byte, n) // Allocs 2-11: 10 times
		copy(data, packet[:n])  // Copies 2-11: 10 times
		sub.Write(data)
	}
}

// Per packet: 11 allocations + 11 copies
// 100 concurrent × 30 fps × 11 = 33,000 allocations/sec
```

**BufReader Approach** (non-contiguous memory):

```go
func forwardStream_BufReader(reader *BufReader, subscribers []net.Conn) {
	reader.ReadRange(4096, func(chunk []byte) {
		// chunk references an original memory block and may be non-contiguous
		// All subscribers share the same memory block!
		for _, sub := range subscribers {
			sub.Write(chunk) // Send the reference directly, zero-copy
		}
	})
}

// Per packet: 0 allocations + 0 copies
// 100 concurrent × 30 fps × 0 = 0 allocations/sec
```

**Performance Comparison**:

- Allocations: 33,000/sec → 0/sec
- Memory copies: 33,000/sec → 0/sec
- GC pressure: high → very low

### 2.6 Memory Block Lifecycle

```mermaid
stateDiagram-v2
    GetFromPool: Get from pool
    ReadNetwork: Read network data
    AppendToSlice: Append to slice
    PassToUser: Pass to user
    UserProcessing: User processing
    RecycleToPool: Recycle to pool

    [*] --> GetFromPool
    GetFromPool --> ReadNetwork
    ReadNetwork --> AppendToSlice
    AppendToSlice --> PassToUser
    PassToUser --> UserProcessing
    UserProcessing --> RecycleToPool
    RecycleToPool --> GetFromPool

    note right of GetFromPool
        Reuse existing blocks,
        avoiding GC
    end note
    note right of PassToUser
        Pass by reference, zero-copy;
        may be shared by multiple subscribers
    end note
    note right of RecycleToPool
        Actively recycled,
        immediately reusable
    end note
```

**Key Points**:

1. Memory blocks are **circularly reused** through the pool, bypassing GC
2. References are passed instead of copying data, achieving zero-copy
3. Blocks are recycled immediately after processing, minimizing the memory footprint

### 2.7 Core Code Implementation

```go
// Create a BufReader (simplified)
func NewBufReader(reader io.Reader) (r *BufReader) {
	r = &BufReader{
		Allocator: NewScalableMemoryAllocator(16384), // Object pool
	}
	r.feedData = func() error {
		// Get a memory block from the pool and read network data directly into it
		buf, err := r.Allocator.Read(reader, r.BufLen)
		if err != nil {
			return err
		}
		// Append to the slice (only adds a reference)
		r.buf.Buffers = append(r.buf.Buffers, buf)
		r.buf.Length += len(buf)
		return nil
	}
	return
}

// Zero-copy reading
func (r *BufReader) ReadRange(n int, yield func([]byte)) error {
	for r.buf.Length < n {
		if err := r.feedData(); err != nil { // Read more data from the network
			return err
		}
	}
	// Pass references block by block
	for _, block := range r.buf.Buffers {
		yield(block) // Zero-copy passing
	}
	// Recycle the consumed blocks
	r.recycleFront()
	return nil
}

// Recycle memory blocks back to the pool
func (r *BufReader) Recycle() {
	if r.Allocator != nil {
		r.Allocator.Recycle() // Return all blocks to the pool
	}
}
```

## 3. Performance Validation

### 3.1 Test Design

**Real Network Simulation**: each read returns a random size (64-2048 bytes), simulating real network fluctuations (a sketch of this simulation follows the scenario list).

**Core Test Scenarios**:

1. **Concurrent Network Connection Reading** - simulate 100+ concurrent connections
2. **GC Pressure Test** - demonstrate the difference over long-term running
3. **Streaming Server** - real business scenario (100 streams × forwarding)
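The actual benchmark lives in `pkg/util/buf_reader_benchmark_test.go`. As a rough illustration of the setup described above, a variable-size read source could be simulated like this (the `flakyConn` type and its usage are illustrative assumptions, not the real test code):

```go
package util

import (
	"io"
	"math/rand"
)

// flakyConn is a hypothetical io.Reader that mimics real network behavior:
// each Read returns a random amount of data between 64 and 2048 bytes.
type flakyConn struct {
	payload byte
}

func (c *flakyConn) Read(p []byte) (int, error) {
	n := 64 + rand.Intn(2048-64+1) // random read size in [64, 2048]
	if n > len(p) {
		n = len(p)
	}
	for i := 0; i < n; i++ {
		p[i] = c.payload
	}
	return n, nil
}

// Compile-time check that flakyConn satisfies io.Reader.
var _ io.Reader = (*flakyConn)(nil)

// In a benchmark, both readers would consume the same simulated source, e.g.:
//   reader := NewBufReader(&flakyConn{})
//   for i := 0; i < b.N; i++ {
//       reader.ReadRange(4096, func(chunk []byte) { _ = chunk })
//   }
```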
### 3.2 Performance Test Results

**Test Environment**: Apple M2 Pro, Go 1.23.0

#### GC Pressure Test (Core Comparison)

| Metric | bufio.Reader | BufReader | Improvement |
|--------|-------------|-----------|-------------|
| Operation Latency | 1874 ns/op | 112.7 ns/op | **16.6x faster** |
| Allocation Count | 5,576,659 | 3,918 | **99.93% reduction** |
| Per Operation | 2 allocs/op | 0 allocs/op | **Zero allocation** |
| Throughput | 2.8M ops/s | 45.7M ops/s | **16x improvement** |

#### Streaming Server Scenario

| Metric | bufio.Reader | BufReader | Improvement |
|--------|-------------|-----------|-------------|
| Operation Latency | 374.6 ns/op | 30.29 ns/op | **12.4x faster** |
| Memory Allocation | 79,508 MB | 601 MB | **99.2% reduction** |
| **GC Runs** | **134** | **2** | **98.5% reduction** ⭐ |
| Throughput | 10.1M ops/s | 117M ops/s | **11.6x improvement** |

#### Performance Visualization

```
📊 GC Runs Comparison (Core Advantage)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
bufio.Reader ████████████████████████████████████████████████████████████████ 134 runs
BufReader    █ 2 runs  ← 98.5% reduction!

📊 Total Memory Allocation
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
bufio.Reader ████████████████████████████████████████████████████████████████ 79 GB
BufReader    █ 0.6 GB  ← 99.2% reduction!

📊 Throughput Comparison
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
bufio.Reader █████ 10.1M ops/s
BufReader    ████████████████████████████████████████████████████████ 117M ops/s
```

### 3.3 Why Non-Contiguous Memory Is So Fast

**Reason 1: Zero-Copy Passing**

```go
// bufio - must copy
buf := make([]byte, 1024)
reader.Read(buf) // Copies into contiguous memory

// BufReader - passes a reference
reader.ReadRange(1024, func(chunk []byte) {
	// chunk is the original memory block, no copy
})
```

**Reason 2: Memory Block Reuse** (see the pool sketch at the end of this section)

```
bufio:     Allocate → Use → GC → Reallocate → ...

BufReader: Allocate → Use → Return to pool → Reuse from pool → ...
                             ↑ The same memory block is reused repeatedly, no GC
```

**Reason 3: Multi-Subscriber Sharing**

```
Traditional: 1 packet → copy 10 times  → 10 subscribers
BufReader:   1 packet → pass reference → 10 subscribers share it
                        ↑ Only 1 memory block; all 10 subscribers reference it
```
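The reuse pattern behind Reason 2 can be illustrated with the standard library's `sync.Pool`. This is an analogy only; the real allocator is gomem's `ScalableMemoryAllocator`, and the `readOnce` helper below is a made-up example:

```go
package main

import (
	"fmt"
	"sync"
)

// blockPool illustrates the reuse pattern: instead of allocating a new buffer
// per read (and leaving the old one for the GC), blocks are returned to a pool
// and handed out again on the next read.
var blockPool = sync.Pool{
	New: func() any { return make([]byte, 2048) },
}

func readOnce(fill func([]byte) int, process func([]byte)) {
	block := blockPool.Get().([]byte) // reuse an existing block if available
	n := fill(block)                  // read network data directly into the block
	process(block[:n])                // hand out a reference, zero-copy
	blockPool.Put(block)              // return the block for the next read
}

func main() {
	fill := func(b []byte) int { return copy(b, "payload") }
	for i := 0; i < 3; i++ {
		readOnce(fill, func(chunk []byte) {
			fmt.Printf("processed %dB without a fresh allocation\n", len(chunk))
		})
	}
}
```

After the first iteration, the same 2KB block keeps cycling between the pool and the read path, which is why the allocation count and GC pressure stay flat no matter how long the loop runs.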
## 4. Usage Guide

### 4.1 Basic Usage

```go
func handleConnection(conn net.Conn) {
	// Create a BufReader
	reader := util.NewBufReader(conn)
	defer reader.Recycle() // Return all blocks to the pool

	// Zero-copy read and process
	reader.ReadRange(4096, func(chunk []byte) {
		// chunk is a non-contiguous memory block
		// Process it directly, no copy needed
		processChunk(chunk)
	})
}
```

### 4.2 Real-World Use Cases

**Scenario 1: Protocol Parsing**

```go
// Parse an FLV packet (header + data)
func parseFLV(reader *BufReader) {
	// Read the packet type (1 byte)
	packetType, _ := reader.ReadByte()

	// Read the data size (3 bytes)
	dataSize, _ := reader.ReadBE32(3)

	// Skip the timestamp etc. (7 bytes)
	reader.Skip(7)

	// Zero-copy read of the data (may span multiple non-contiguous blocks)
	reader.ReadRange(int(dataSize), func(chunk []byte) {
		// chunk may be the complete data or only part of it
		// Parse block by block; no need to wait for the complete payload
		parseDataChunk(packetType, chunk)
	})
}
```

**Scenario 2: High-Concurrency Forwarding**

```go
// Read from one source and forward to multiple targets
func relay(source *BufReader, targets []io.Writer) {
	source.ReadRange(8192, func(chunk []byte) {
		// All targets share the same memory block
		for _, target := range targets {
			target.Write(chunk) // Zero-copy forwarding
		}
	})
}
```

**Scenario 3: Streaming Server**

```go
// Receive an RTSP stream and distribute it to subscribers
type Stream struct {
	reader      *BufReader
	subscribers []*Subscriber
}

func (s *Stream) Process() {
	s.reader.ReadRange(65536, func(frame []byte) {
		// frame may be part of a video frame (non-contiguous)
		// Send it directly to all subscribers
		for _, sub := range s.subscribers {
			sub.WriteFrame(frame) // Shared memory, zero-copy
		}
	})
}
```

### 4.3 Best Practices

**✅ Correct Usage**:

```go
// 1. Always recycle resources
reader := util.NewBufReader(conn)
defer reader.Recycle()

// 2. Process directly in the callback, don't keep references
reader.ReadRange(1024, func(data []byte) {
	processData(data) // ✅ Process immediately
})

// 3. Copy explicitly when the data must be retained
var saved []byte
reader.ReadRange(1024, func(data []byte) {
	saved = append(saved, data...) // ✅ Explicit copy
})
```

**❌ Wrong Usage**:

```go
// ❌ Don't keep references
var dangling []byte
reader.ReadRange(1024, func(data []byte) {
	dangling = data // Wrong: data will be recycled
})
// dangling is now a dangling reference!

// ❌ Don't forget to recycle
reader := util.NewBufReader(conn)
// Missing defer reader.Recycle() - memory blocks cannot be returned to the pool
```

### 4.4 Performance Optimization Tips

**Tip 1: Batch Processing** (a sketch handling packets that span chunk boundaries follows at the end of this section)

```go
// ✅ Optimized: read multiple packets at once
reader.ReadRange(65536, func(chunk []byte) {
	// One chunk may contain multiple length-prefixed packets
	for len(chunk) >= 4 {
		size := int(binary.BigEndian.Uint32(chunk[:4]))
		if len(chunk) < 4+size {
			break // Incomplete packet: carry the remainder over to the next chunk
		}
		packet := chunk[4 : 4+size]
		processPacket(packet)
		chunk = chunk[4+size:]
	}
})
```

**Tip 2: Choose an Appropriate Block Size**

```go
// Choose based on the application scenario
const (
	SmallPacket  = 4 << 10  // 4KB  - RTSP/HTTP
	MediumPacket = 16 << 10 // 16KB - audio streams
	LargePacket  = 64 << 10 // 64KB - video streams
)

reader := util.NewBufReaderWithBufLen(conn, LargePacket)
```
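When a packet does straddle two callbacks, only its incomplete tail needs to be copied into a small carry-over buffer; complete packets stay zero-copy. A minimal sketch of that pattern, assuming a simple 4-byte length-prefixed framing (the `frameParser` type is illustrative, not part of the library):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// frameParser consumes length-prefixed packets from a stream of
// non-contiguous chunks. Only the incomplete tail of a chunk is copied,
// because the chunk itself is recycled after the callback returns.
type frameParser struct {
	carry []byte // leftover bytes from the previous chunk
}

func (p *frameParser) feed(chunk []byte, onPacket func([]byte)) {
	data := chunk
	if len(p.carry) > 0 {
		data = append(p.carry, chunk...) // copy only when a packet spans chunks
	}
	for len(data) >= 4 {
		size := int(binary.BigEndian.Uint32(data[:4]))
		if len(data) < 4+size {
			break // wait for the rest of this packet
		}
		onPacket(data[4 : 4+size])
		data = data[4+size:]
	}
	// Keep whatever is left for the next callback (explicit copy).
	p.carry = append(p.carry[:0], data...)
}

func main() {
	var p frameParser
	// A 6-byte packet ("hello!") split across two chunks, plus two stray bytes.
	chunk1 := []byte{0, 0, 0, 6, 'h', 'e'}
	chunk2 := []byte("llo!!!")
	for _, c := range [][]byte{chunk1, chunk2} {
		p.feed(c, func(pkt []byte) { fmt.Printf("packet: %q\n", pkt) })
	}
}
```

Inside a `ReadRange` callback you would simply call `p.feed(chunk, processPacket)`, keeping the fast path (packets fully contained in a chunk) allocation-free.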
## 5. Summary

### Core Innovation: Non-Contiguous Memory Buffering

BufReader's core is not "better buffering" but a **fundamental change to the memory layout model**:

```
Traditional thinking: data must live in contiguous memory
BufReader:            data can be scattered across blocks and passed by reference

Result:
✓ Zero-copy:        no need to reassemble data into contiguous memory
✓ Zero allocation:  memory blocks are reused from an object pool
✓ Zero GC pressure: no temporary objects are created
```

### Key Advantages

| Feature | Implementation | Performance Impact |
|---------|---------------|-------------------|
| **Zero-Copy** | Pass memory block references | No copy overhead |
| **Zero Allocation** | Object pool reuse | 98.5% GC reduction |
| **Multi-Subscriber Sharing** | The same block is referenced multiple times | 10x+ memory savings |
| **Flexible Block Sizes** | Adapt to network fluctuations | No reassembly needed |

### Ideal Use Cases

| Scenario | Recommended | Reason |
|----------|------------|---------|
| **High-concurrency network servers** | BufReader ⭐ | 98% GC reduction, 10x+ throughput |
| **Stream forwarding** | BufReader ⭐ | Zero-copy multicast, memory sharing |
| **Protocol parsers** | BufReader ⭐ | Parse block by block, no complete packet needed |
| **Long-running services** | BufReader ⭐ | Stable system, minimal GC impact |
| Simple file reading | bufio.Reader | The standard library is sufficient |

### Key Points

Remember when using BufReader:

1. **Accept non-contiguous data**: process each block via the callback
2. **Don't hold references**: data is recycled after the callback returns
3. **Leverage ReadRange**: this is the core zero-copy API
4. **Always call Recycle()**: return memory blocks to the pool

### Performance Data

**Streaming server (100 concurrent streams, running continuously)**:

```
Estimate for 1 hour of running:

bufio.Reader (contiguous memory):
- Allocates 2.8 TB of memory
- Triggers 4,800 GCs
- Frequent system pauses

BufReader (non-contiguous memory):
- Allocates 21 GB of memory (133x less)
- Triggers 72 GCs (67x less)
- Almost no GC impact
```

### Testing and Documentation

**Run the tests**:

```bash
sh scripts/benchmark_bufreader.sh
```

## References

- [GoMem Project](https://github.com/langhuihui/gomem) - memory object pool implementation
- [Monibuca v5](https://m7s.live) - streaming media server
- Test code: `pkg/util/buf_reader_benchmark_test.go`

---

**Core Idea**: Eliminate the copying overhead of traditional contiguous buffers through non-contiguous memory block slices and zero-copy reference passing, achieving high-performance network data processing.