qtorch package¶
- class qtorch.FixedPoint(wl, fl, clamp=True, symmetric=False)[source]¶

  Low-Precision Fixed Point Format, defined similarly to Deep Learning with Limited Numerical Precision (https://arxiv.org/abs/1502.02551).
The representable range is \([-2^{wl-fl-1}, 2^{wl-fl-1}-2^{-fl}]\) and the precision unit (smallest nonzero absolute value) is \(2^{-fl}\). Numbers outside the representable range are clamped if clamp is true. If symmetric is true, we give up the smallest representable number so that the range becomes symmetric: \([-2^{wl-fl-1}+2^{-fl}, 2^{wl-fl-1}-2^{-fl}]\).
Define \(\lfloor x \rfloor\) to be the largest representable number (a multiple of \(2^{-fl}\)) not greater than \(x\). For numbers within the representable range, fixed point quantization corresponds to
\[NearestRound(x) = \begin{cases} \lfloor x \rfloor, & \text{if } \lfloor x \rfloor \leq x \leq \lfloor x \rfloor + 2^{-fl-1} \\ \lfloor x \rfloor + 2^{-fl}, & \text{if } \lfloor x \rfloor + 2^{-fl-1} < x \leq \lfloor x \rfloor + 2^{-fl} \end{cases}\]

or

\[StochasticRound(x) = \begin{cases} \lfloor x \rfloor, & \text{with probability } 1 - \frac{x - \lfloor x \rfloor}{2^{-fl}} \\ \lfloor x \rfloor + 2^{-fl}, & \text{with probability } \frac{x - \lfloor x \rfloor}{2^{-fl}} \end{cases}\]

- Args:
  - wl (int): word length of each fixed point number
  - fl (int): fractional length of each fixed point number
  - clamp (bool): whether to clamp unrepresentable numbers
  - symmetric (bool): whether to make the representable range symmetric
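The two rounding rules above can be sketched in plain Python. This is an illustrative re-implementation of the format's semantics, not qtorch's actual tensor kernels (those live in qtorch.quant); the `stochastic` flag here is an assumption for demonstration, standing in for the rounding-mode choice.

```python
import math
import random

def fixed_point_quantize(x, wl, fl, clamp=True, symmetric=False, stochastic=False):
    """Illustrative sketch of fixed-point quantization for a single float."""
    unit = 2.0 ** (-fl)                      # precision unit 2^{-fl}
    max_val = 2.0 ** (wl - fl - 1) - unit    # largest representable number
    min_val = -(2.0 ** (wl - fl - 1))
    if symmetric:
        min_val += unit                      # give up the smallest number
    t = x / unit
    if stochastic:
        # round up with probability equal to the fractional part of t
        lo = math.floor(t)
        q = (lo + (1 if random.random() < t - lo else 0)) * unit
    else:
        # nearest rounding; an exact tie rounds down, matching NearestRound above
        q = math.ceil(t - 0.5) * unit
    if clamp:
        q = min(max(q, min_val), max_val)
    return q
```

For example, with wl=8 and fl=4 the unit is \(2^{-4} = 0.0625\), so 0.3 rounds to 0.3125, and out-of-range inputs clamp to \(2^{3} - 2^{-4} = 7.9375\).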
- class qtorch.BlockFloatingPoint(wl, dim=-1)[source]¶

  Low-Precision Block Floating Point Format.
BlockFloatingPoint shares one exponent across a block of numbers; the shared exponent is determined by the largest magnitude in the block.

- Args:
  - wl (int): word length of each number in the block
  - dim (int): dimension along which to block the tensor (-1 treats the whole tensor as one block)
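The shared-exponent idea can be sketched as follows. This is an illustrative re-implementation with one plausible mantissa scaling (one sign bit, wl-1 magnitude bits), not necessarily bit-identical to qtorch's kernel:

```python
import math

def block_quantize(block, wl):
    """Illustrative sketch: every entry shares the exponent of the
    largest-magnitude entry and keeps a wl-bit signed integer mantissa."""
    max_mag = max(abs(v) for v in block)
    if max_mag == 0.0:
        return [0.0 for _ in block]
    shared_exp = math.floor(math.log2(max_mag))   # exponent of the largest entry
    # one sign bit, wl-1 magnitude bits: quantization step 2^{shared_exp-(wl-2)}
    unit = 2.0 ** (shared_exp - (wl - 2))
    return [round(v / unit) * unit for v in block]
```

Note how the resolution of every entry depends on the block maximum: with wl=8, the value 0.026 survives (coarsely) in a block whose maximum is 1.0, but flushes to zero in a block whose maximum is 4.0.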
- class qtorch.FloatingPoint(exp, man)[source]¶

  Low-Precision Floating Point Format.
We set the exponent bias to \(2^{exp-1}\). In our simulation, we do not handle denormal/subnormal numbers or infinities/NaNs. For rounding, we apply round-to-nearest-even.
- Args:
  - exp (int): number of bits allocated for the exponent
  - man (int): number of bits allocated for the mantissa, i.e. the number of bits actually stored in hardware (not counting the virtual/implicit leading bit)
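The mantissa-rounding step can be sketched in plain Python. This is an illustrative re-implementation, not qtorch's kernel; exponent overflow/underflow is not handled (so the `exp` argument is unused here), mirroring the no-denormal/no-infinity simplification above:

```python
import math

def float_quantize(x, exp, man):
    """Illustrative sketch: keep `man` stored mantissa bits and apply
    round-to-nearest-even; exponent range checks are omitted."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))   # unbiased exponent of x
    unit = 2.0 ** (e - man)             # spacing of representable numbers near x
    # Python's round() rounds halves to the even neighbor (round-to-nearest-even)
    return round(x / unit) * unit
```

For example, with man=1 the values 1.25 and 1.75 are both exactly halfway between representable numbers; round-to-nearest-even sends them to 1.0 and 2.0 respectively.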