December 19, 2017

npycat for npy and npz files

PyTorch, Theano, TensorFlow, and PyCaffe are all Python-based, which means that I end up with a lot of numpy-based data and a lot of npy and npz files sitting around my filesystem, all storing my data in a way that is hard to print out. (Why this format?)

Do you have this problem? It is nice to pipe things into grep and sed and awk and less, and, as simple as it is, the npy format is a bit inconvenient for that.

So here is npycat, a cat-like swiss army knife for .npy and .npz files.

> npycat params_001.npz
 0.46768  2.4e-05 2.03e-05  2.3e-05   ...   2.4e-05  7.4e-06  5.1e-06  4.5e-06
 2.4e-05  0.46922   0.0002  1.2e-05   ...   5.2e-05  5.9e-05  2.7e-05  5.3e-06
 2.6e-05  0.00026  0.59949  8.3e-05   ...   7.4e-06  5.6e-05  5.9e-06  1.3e-05
  ...
 1.1e-05 8.59e-05  6.4e-05 9.74e-05   ...     2e-05  0.68193  2.2e-05  1.7e-05
 5.3e-06  2.8e-05  4.8e-06  8.4e-06   ...   0.00015  1.6e-05  0.49022  2.6e-05
 4.8e-06  5.6e-06 1.06e-05  1.5e-05   ...   6.3e-06  1.3e-05 2.68e-05  0.50255
xi: float32 size=6400x6400

0.08672 0.09111 0.07268 0.10268   ...  0.06562 0.0652 0.09805 0.09459
err: float32 size=6400

-0.22102 -0.2293 -0.2118 -0.2582   ...  -0.2056 -0.2106 -0.2412 -0.243
coerr: float32 size=6400

None
rho: object

0.0001388192177
delta: float64

1 1 1 1   ...  1 1 1 1
theta: float32 size=6400

0.90006 0.90004 0.90002 0.89994   ...  0.89998 0.89999 0.89996 0.89994
gamma: float32 size=6400

By default, all the data is pretty-printed to fit your current terminal column width, with a narrow field width, PyTorch-style. But the --noabbrev and --nometa flags turn off the abbreviation and the metadata lines to produce an awk-friendly format for processing.
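
If you are curious what the metadata lines in the listing amount to, here is a rough NumPy sketch that prints one "key: dtype size=..." line per array in an npz archive. It is only an approximation for illustration and not npycat's actual code; the function name describe_npz is made up, and allow_pickle=True is an assumption needed for object entries such as rho (only use it on files you trust).

```python
import io
import numpy as np

def describe_npz(source):
    """Return one 'key: dtype size=AxB' line per array in an .npz archive."""
    lines = []
    # allow_pickle=True is an assumption: object arrays cannot be read without it.
    with np.load(source, allow_pickle=True) as archive:
        for key in archive.files:
            arr = archive[key]
            # Scalars (0-d arrays) have an empty shape, so no size is shown.
            size = ""
            if arr.shape:
                size = " size=" + "x".join(str(n) for n in arr.shape)
            lines.append(f"{key}: {arr.dtype}{size}")
    return lines

# Build a tiny archive in memory so the sketch runs without any files on disk.
buf = io.BytesIO()
np.savez(buf, xi=np.eye(4, dtype=np.float32), delta=np.float64(0.0001388192177))
buf.seek(0)

demo_lines = describe_npz(buf)
print("\n".join(demo_lines))
```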

Other flags provide a swiss-army knife array of slicing and summarization options, to make it a useful tool for giving a quick view of what is happening in your data files. What is the mean and variance and L-infinity norm of a block of 14 numbers in the middle of my matrix?

> npycat params_001.npz --key=xi --slice=[25:27,3:10] --mean --std --linf
 4.91e-06    0.0001   4.9e-06  1.09e-05  1.93e-05  0.000118  1.01e-05
 0.000318  2.42e-05  0.000182   9.1e-06  1.88e-05  4.02e-05   0.00011
float32 size=2x7 mean=0.000069 std=0.000087 linf=0.000318
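
For comparison, the same kind of summary can be computed by hand in NumPy. This is a minimal sketch on made-up data, not npycat's implementation: slice_stats is a hypothetical helper, and using the population standard deviation (NumPy's default) is an assumption about what --std reports.

```python
import numpy as np

def slice_stats(arr, index):
    """Summarize a sub-block of an array: shape, mean, std, L-infinity norm."""
    block = arr[index]
    return {
        "shape": block.shape,
        "mean": float(block.mean()),
        "std": float(block.std()),            # population std (ddof=0), an assumption
        "linf": float(np.abs(block).max()),   # largest absolute value
    }

# Made-up stand-in data; a real run would load the array with np.load instead.
demo = np.arange(300, dtype=np.float32).reshape(30, 10)

# np.s_ builds the same index that the text slice [25:27,3:10] describes.
stats = slice_stats(demo, np.s_[25:27, 3:10])
print(stats)
```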

Is that theta vector really all 6400 ones from beginning to end?

> npycat params_000.npz --key=theta --min --max
1 1 1 1   ...  1 1 1 1
float32 size=6400 max=1.000000 min=1.000000

npycat is also smart about using memory mapping when possible, so the start and end of a huge array can be printed quickly without first reading the whole contents of an enormous file into memory. It is fast.
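
The general technique looks something like the sketch below, which uses NumPy's own mmap_mode option to read just the two ends of a .npy file. This illustrates the idea rather than npycat's actual code; head_and_tail and the file name are made up. Note that np.load only honors mmap_mode for plain .npy files, not for arrays inside an .npz archive.

```python
import os
import tempfile
import numpy as np

def head_and_tail(path, count=4):
    """Read only the first and last `count` values of a 1-d .npy file."""
    # mmap_mode="r" maps the file read-only instead of loading it all.
    arr = np.load(path, mmap_mode="r")
    # np.array(...) copies just the requested pieces out of the mapping.
    head = np.array(arr[:count])
    tail = np.array(arr[-count:])
    return head, tail, arr.shape

# Write a small example file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "theta.npy")
    np.save(path, np.ones(6400, dtype=np.float32))
    head, tail, shape = head_and_tail(path)

print(head, "...", tail, shape)
```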

The full usage page:

> npycat --help
usage: npycat [-h] [--slice slice] [--unpackbits [axis]] [--key key] [--shape]
              [--type] [--mean] [--std] [--var] [--min] [--max] [--l0] [--l1]
              [--l2] [--linf] [--meta] [--data] [--abbrev] [--name] [--kname]
              [--raise]
              [file [file ...]]

prints the contents of numpy .npy or .npz files.

positional arguments:
  file                 filenames with optional slices such as file.npy[:,0]

optional arguments:
  -h, --help           show this help message and exit
  --slice slice        slice to apply to all files
  --unpackbits [axis]  unpack single-bits from byte array
  --key key            key to dereference in npz dictionary
  --shape              show array shape
  --type               show array data type
  --mean               compute mean
  --std                compute stdev
  --var                compute variance
  --min                compute min
  --max                compute max
  --l0                 compute L0 norm, number of nonzeros
  --l1                 compute L1 norm, sum of absolute values
  --l2                 compute L2 norm, euclidean size
  --linf               compute L-infinity norm, max absolute value
  --meta               use --nometa to suppress metadata
  --data               use --nodata to suppress data
  --abbrev             use --noabbrev to suppress abbreviation of data
  --name               show filename with metadata
  --kname              show key name from npz dictionaries
  --raise              raise errors instead of catching them

examples:
  just print the metadata (shape and type) for data.npy
    npycat data.npy --nodata

  show every number, and the mean and variance, in a 1-d slice of a 5-d tensor
    npycat tensor.npy[0,0,:,0,1] --noabbrev --mean --var
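
The four norm flags follow the definitions given in the help text. As a quick reference, here is how those definitions read in NumPy; this is a sketch of the math only, with a made-up helper name, and it says nothing about how npycat computes them internally.

```python
import numpy as np

def norms(arr):
    """Compute the L0, L1, L2 and L-infinity norms of all values in an array."""
    flat = np.asarray(arr, dtype=np.float64).ravel()
    return {
        "l0": int(np.count_nonzero(flat)),        # number of nonzeros
        "l1": float(np.abs(flat).sum()),          # sum of absolute values
        "l2": float(np.sqrt((flat ** 2).sum())),  # euclidean size
        "linf": float(np.abs(flat).max()),        # max absolute value
    }

demo_norms = norms([3.0, 0.0, -4.0])
print(demo_norms)
```
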
Posted by David at December 19, 2017 08:57 AM