diff --git a/docs/quantization/images/wint2.png b/docs/quantization/images/wint2.png
new file mode 100644
index 000000000..a117ea8af
Binary files /dev/null and b/docs/quantization/images/wint2.png differ
diff --git a/docs/quantization/wint2.md b/docs/quantization/wint2.md
index f82b7da73..e7c586632 100644
--- a/docs/quantization/wint2.md
+++ b/docs/quantization/wint2.md
@@ -4,7 +4,7 @@ Weights are compressed offline using the [CCQ (Convolutional Coding Quantization
 - **Supported Hardware**: GPU
 - **Supported Architecture**: MoE architecture
 
 This method relies on the convolution algorithm to use overlapping bits to map 2-bit values to a larger numerical representation space, so that the model weight quantization retains more information of the original data while compressing the true value to an extremely low 2-bit size. The general principle can be seen in the figure below:
-[卷积编码量化示意图](./wint2.png)
+![卷积编码量化示意图](./images/wint2.png)
 
 CCQ WINT2 is generally used in resource-constrained and low-threshold scenarios. Taking ERNIE-4.5-300B-A47B as an example, weights are compressed to 89GB, supporting single-card deployment on 141GB H20.
diff --git a/docs/zh/quantization/images/wint2.png b/docs/zh/quantization/images/wint2.png
new file mode 100644
index 000000000..a117ea8af
Binary files /dev/null and b/docs/zh/quantization/images/wint2.png differ
diff --git a/docs/zh/quantization/wint2.md b/docs/zh/quantization/wint2.md
index cc224aabb..00e55a979 100644
--- a/docs/zh/quantization/wint2.md
+++ b/docs/zh/quantization/wint2.md
@@ -5,7 +5,7 @@
 - **支持结构**:MoE结构
 
 该方法依托卷积算法利用重叠的Bit位将2Bit的数值映射到更大的数值表示空间,使得模型权重量化后既保留原始数据更多的信息,同时将真实数值压缩到极低的2Bit大小,大致原理可参考下图:
-[卷积编码量化示意图](./wint2.png)
+![卷积编码量化示意图](./images/wint2.png)
 
 CCQ WINT2一般用于资源受限的低门槛场景,以ERNIE-4.5-300B-A47B为例,将权重压缩到89GB,可支持141GB H20单卡部署。
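
The doc above describes storing each weight in 2 bits. As a toy illustration of the storage side only (not the CCQ convolutional-coding kernel itself, and not FastDeploy's actual layout — `pack_wint2`/`unpack_wint2` are hypothetical helpers), the following sketch packs 2-bit weight codes four-per-byte, the layout that yields the ~16x size reduction over FP32 quoted for WINT2:

```python
import numpy as np

def pack_wint2(codes: np.ndarray) -> np.ndarray:
    """Pack 2-bit codes (values 0..3, length divisible by 4) four per byte."""
    c = codes.reshape(-1, 4).astype(np.uint8)
    # Each output byte holds four 2-bit fields at bit offsets 0, 2, 4, 6.
    return (c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)).astype(np.uint8)

def unpack_wint2(packed: np.ndarray) -> np.ndarray:
    """Recover the 2-bit codes by shifting out each field and masking."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    return ((packed[:, None] >> shifts) & 0b11).reshape(-1).astype(np.uint8)

codes = np.array([0, 1, 2, 3, 3, 2, 1, 0], dtype=np.uint8)
packed = pack_wint2(codes)
assert packed.nbytes == codes.size // 4            # 2 bits per weight on disk
assert np.array_equal(unpack_wint2(packed), codes)  # lossless round-trip
```

CCQ adds the convolutional-coding step on top of this, using overlapping bit windows so the 2-bit codes index a larger effective value space; the packing above only shows why the stored footprint approaches 2 bits per weight.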