> BTW, is there an architecture where it is known that the
> malloc+memcpy+aligned access is faster than doing unaligned accesses?
> I tested X86 and power8le and both got faster with unaligned. I can
> try on ARM when I get back home on Monday.

I just tested it (patch attached). The results I got were:

x86:
master:    1.325797007s ( +-  0.00% )
unaligned: 1.197201064s ( +-  0.00% )

tegra tk1 (arm):
master:    5.619536031s ( +-  0.05% )
unaligned: 5.650522742s ( +-  0.05% )

gcc112 (power8 LE) (no root, so lower priority):
master:    2.715273952s ( +-  0.26% )
unaligned: 2.600145948s ( +-  0.81% )

gcc110 (power7 BE) (no root, no tmpfs, using 30 calls to time instead of perf):
master:    5.85267s
unaligned: 5.47700s

So the only case with a slowdown was the tegra, and it was small
(about 0.55%). Every other machine got faster with unaligned accesses.
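
For anyone skimming the thread, here is a rough sketch of the two
strategies being compared. This is not the attached patch; the function
names are made up for illustration, and it assumes the data is a run of
host-endian 32-bit values at a possibly misaligned address.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Strategy A ("master" style): copy the whole buffer into malloc'ed
 * storage, which is suitably aligned, then use ordinary aligned loads.
 * Hypothetical helper, not taken from the patch. */
uint64_t sum_via_aligned_copy(const unsigned char *buf, size_t count) {
    uint32_t *aligned = malloc(count * sizeof(uint32_t));
    if (!aligned)
        return 0; /* allocation failed (or count == 0) */
    memcpy(aligned, buf, count * sizeof(uint32_t));

    uint64_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += aligned[i];

    free(aligned);
    return sum;
}

/* Read one 32-bit value from an address that may be misaligned.
 * A fixed-size memcpy is the portable way to spell this; compilers
 * typically lower it to a single load where the target allows it. */
static uint32_t read32_unaligned(const unsigned char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Strategy B ("unaligned" style): no allocation and no bulk copy,
 * each value is read in place. */
uint64_t sum_unaligned(const unsigned char *buf, size_t count) {
    uint64_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += read32_unaligned(buf + i * sizeof(uint32_t));
    return sum;
}
```

Both functions return the same result for the same input. The only
difference is whether the cost goes into the allocation and copy or
into the individual loads, which is what the timings above measure.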

Cheers,
Rafael