qtorch package

class qtorch.FixedPoint(wl, fl, clamp=True, symmetric=False)[source]

Low-Precision Fixed Point Format. Defined similarly to the fixed point format in Deep Learning with Limited Numerical Precision (https://arxiv.org/abs/1502.02551).

The representable range is \([-2^{wl-fl-1}, 2^{wl-fl-1}-2^{-fl}]\) and the precision unit (smallest nonzero absolute value) is \(2^{-fl}\). Numbers outside the representable range are clamped to it if clamp is true. If symmetric is true, the smallest representable number is given up to make the range symmetric, \([-2^{wl-fl-1}+2^{-fl}, 2^{wl-fl-1}-2^{-fl}]\).

Define \(\lfloor x \rfloor\) to be the largest multiple of \(2^{-fl}\) that does not exceed \(x\). For numbers within the representable range, fixed point quantization corresponds to

\[NearestRound(x) = \begin{cases} \lfloor x \rfloor, & \text{if } \lfloor x \rfloor \leq x \leq \lfloor x \rfloor + 2^{-fl-1} \\ \lfloor x \rfloor + 2^{-fl}, & \text{if } \lfloor x \rfloor + 2^{-fl-1} < x \leq \lfloor x \rfloor + 2^{-fl} \end{cases}\]

or

\[StochasticRound(x) = \begin{cases} \lfloor x \rfloor, & \text{with probability } 1 - \frac{x - \lfloor x \rfloor}{2^{-fl}} \\ \lfloor x \rfloor + 2^{-fl}, & \text{with probability } \frac{x - \lfloor x \rfloor}{2^{-fl}} \end{cases}\]
Args:
  • wl (int): word length of each fixed point number
  • fl (int): fractional length of each fixed point number
  • clamp (bool): whether to clamp unrepresentable numbers
  • symmetric (bool): whether to make the representable range symmetric
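
As a concrete illustration of the two rounding rules above, here is a minimal pure-PyTorch sketch of fixed point quantization. It is a sketch only, not the qtorch kernel; the helper name fixed_point_quantize_sketch and the exact clamping order are assumptions made for illustration.

```python
import torch

def fixed_point_quantize_sketch(x, wl, fl, clamp=True, symmetric=False, rounding="nearest"):
    # Illustrative sketch only, not the qtorch CUDA/CPU kernel.
    unit = 2.0 ** (-fl)                        # precision unit 2^{-fl}
    t_min = -2.0 ** (wl - fl - 1)              # smallest representable number
    t_max = 2.0 ** (wl - fl - 1) - unit        # largest representable number
    if symmetric:
        t_min += unit                          # give up the smallest number for a symmetric range

    scaled = x / unit
    floor = torch.floor(scaled)                # \lfloor x \rfloor measured in units of 2^{-fl}
    frac = scaled - floor
    if rounding == "nearest":
        # NearestRound: ties (frac == 0.5) round down, per the formula above
        q = floor + (frac > 0.5).to(x.dtype)
    else:
        # StochasticRound: round up with probability (x - \lfloor x \rfloor) / 2^{-fl}
        q = floor + (torch.rand_like(x) < frac).to(x.dtype)

    out = q * unit
    if clamp:
        out = torch.clamp(out, t_min, t_max)
    return out
```

In qtorch itself, the rounding mode is typically selected by the quantization routine that consumes a FixedPoint object, not by the format class itself.
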
class qtorch.BlockFloatingPoint(wl, dim=-1)[source]

Low-Precision Block Floating Point Format.

BlockFloatingPoint shares an exponent across a block of numbers. The shared exponent is the exponent of the largest-magnitude entry in the block.

Args:
  • wl (int): word length of the tensor
  • dim (int): block dimension along which the exponent is shared. A (*, D, *) tensor with D at position dim has D shared exponents, one per slice; use -1 to treat the entire tensor as a single block with one shared exponent. See the sketch below.
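
The following pure-PyTorch sketch illustrates the shared-exponent idea: one exponent is derived from the largest magnitude in each block, and every entry of the block is rounded on the grid that exponent defines. The helper name block_quantize_sketch and the step-size convention (one sign bit, step \(2^{shared\_exp - (wl-2)}\)) are assumptions for illustration, not the qtorch kernel.

```python
import torch

def block_quantize_sketch(x, wl, dim=-1, rounding="nearest"):
    # Illustrative sketch only, not the qtorch kernel.
    if dim == -1:
        max_abs = x.abs().max()                       # single block: one shared exponent
    else:
        # one exponent per index along `dim`: reduce over all other dimensions
        other_dims = tuple(d for d in range(x.dim()) if d != dim)
        max_abs = x.abs().amax(dim=other_dims, keepdim=True)
    max_abs = max_abs.clamp_min(torch.finfo(x.dtype).tiny)   # avoid log2(0)
    shared_exp = torch.floor(torch.log2(max_abs))             # exponent of the largest magnitude

    # Assumed convention: wl bits per entry, one of them a sign bit,
    # so the quantization step within a block is 2^(shared_exp - (wl - 2)).
    unit = 2.0 ** (shared_exp - (wl - 2))
    scaled = x / unit
    if rounding == "nearest":
        q = torch.round(scaled)
    else:
        floor = torch.floor(scaled)
        q = floor + (torch.rand_like(x) < (scaled - floor)).to(x.dtype)

    # Assumed symmetric mantissa range for the stored integers.
    q = torch.clamp(q, -(2 ** (wl - 1) - 1), 2 ** (wl - 1) - 1)
    return q * unit
```
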

class qtorch.FloatingPoint(exp, man)[source]

Low-Precision Floating Point Format.

We set the exponent bias to be \(2^{exp-1}\). In our simulation, we do not handle denormal/subnormal numbers or infinities/NaNs. The rounding mode is round-to-nearest-even.

Args:
  • exp (int): number of bits allocated for the exponent
  • man (int): number of bits allocated for the mantissa, i.e. the number of bits actually stored in hardware (not counting the virtual bit, the implicit leading bit)
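
To make the rounding behaviour concrete, below is a minimal pure-PyTorch sketch of quantizing to a reduced mantissa width with round-to-nearest-even. It is illustrative only: the helper name float_quantize_sketch is an assumption, and overflow/underflow handling is omitted, consistent with the note above that subnormals and infinities are not simulated.

```python
import torch

def float_quantize_sketch(x, exp, man):
    # Illustrative sketch only, not the qtorch kernel.
    # Keep `man` stored mantissa bits; the exponent range implied by `exp`
    # (with bias 2^{exp-1}) is not enforced here for brevity.
    e = torch.floor(torch.log2(x.abs().clamp_min(torch.finfo(x.dtype).tiny)))
    unit = 2.0 ** (e - man)                  # spacing of representable numbers around x
    # torch.round breaks ties to even, matching round-to-nearest-even
    return torch.round(x / unit) * unit
```

In practice, FixedPoint, BlockFloatingPoint, and FloatingPoint objects describe the target number format and are typically passed to the quantization utilities in qtorch.quant, which perform the actual conversion.
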