If single-precision asm versions are needed, the generate script will
need to be modified. This change makes it easier to code-generate the
single-precision BLAS routines.
No significant difference in the benchmark.
Benchmark using noasm tag:
benchmark                           old ns/op     new ns/op     delta
BenchmarkDaxpySmallBothUnitary      28.6          25.0          -12.59%
BenchmarkDaxpyMediumBothUnitary     1367          1360          -0.51%
BenchmarkDaxpyLargeBothUnitary      139217        138521        -0.50%
BenchmarkDaxpyHugeBothUnitary       16451616      16243873      -1.26%
Benchmark using amd64 assembly:
benchmark                           old ns/op     new ns/op     delta
BenchmarkDaxpySmallBothUnitary      17.7          17.4          -1.69%
BenchmarkDaxpyMediumBothUnitary     363           363           +0.00%
BenchmarkDaxpyLargeBothUnitary      72119         72107         -0.02%
BenchmarkDaxpyHugeBothUnitary       12826300      12817173      -0.07%
All benchmarks were run using Go 1.4 on an Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz.
Best of 3 runs.
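As a rough illustration of how a best-of-3 figure like the ones above could
be collected, here is a minimal sketch. The kernel axpyUnitary, the helper
benchmarkAxpy and the vector lengths are all assumptions made for this
example; they are not the benchmark code from the package.

```go
package main

import (
	"fmt"
	"testing"
)

// axpyUnitary is a hypothetical stand-in for the kernel under test.
// It computes y[i] += alpha * x[i] for unit-stride vectors.
func axpyUnitary(alpha float64, x, y []float64) {
	for i, v := range x {
		y[i] += alpha * v
	}
}

// benchmarkAxpy returns a benchmark function for vectors of length n.
func benchmarkAxpy(n int) func(b *testing.B) {
	return func(b *testing.B) {
		x := make([]float64, n)
		y := make([]float64, n)
		for i := range x {
			x[i] = float64(i)
		}
		b.ResetTimer() // exclude the setup above from the timing
		for i := 0; i < b.N; i++ {
			axpyUnitary(2, x, y)
		}
	}
}

func main() {
	// The lengths are assumed for illustration only.
	sizes := []struct {
		name string
		n    int
	}{
		{"Small", 10},
		{"Medium", 1000},
	}
	for _, s := range sizes {
		best := int64(-1)
		// Run each benchmark three times and keep the fastest result.
		for run := 0; run < 3; run++ {
			r := testing.Benchmark(benchmarkAxpy(s.n))
			if ns := r.NsPerOp(); best < 0 || ns < best {
				best = ns
			}
		}
		fmt.Printf("%s: %d ns/op (best of 3)\n", s.name, best)
	}
}
```

Note that NsPerOp reports whole nanoseconds, so the sketch cannot reproduce
the fractional values shown for the small case.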
This allows easy benchmarking against the pure Go implementation
and also provides a way to easily assess compiler improvements.
For example:
$ cd $GOPATH/src/github.com/gonum/blas/native
$ go test -bench . > asm.txt
$ go test -tags noasm -bench . > noasm.txt
$ benchcmp noasm.txt asm.txt
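
The pure Go path that the noasm tag selects can be pictured with the
following minimal sketch. The function name, its signature and the exact
build constraint are assumptions for illustration, not the package's actual
code. In a real package the fallback would sit in its own file behind a
constraint that holds when noasm is set or when the target is not amd64;
the constraint is only described in a comment here so that the sketch runs
on any platform.

```go
package main

import "fmt"

// Assumed layout (not applied in this sketch): this file would carry a build
// constraint that is satisfied when the noasm tag is given or when GOARCH is
// not amd64, and the assembly declaration would carry the opposite one.

// daxpyUnitary is a hypothetical pure Go fallback.
// It computes y[i] += alpha * x[i] for unit-stride vectors,
// assuming len(y) >= len(x).
func daxpyUnitary(alpha float64, x, y []float64) {
	for i, v := range x {
		y[i] += alpha * v
	}
}

func main() {
	x := []float64{1, 2, 3}
	y := []float64{10, 20, 30}
	daxpyUnitary(2, x, y)
	fmt.Println(y) // prints [12 24 36]
}
```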