> BTW, is there an architecture where it is known that the
> malloc+memcpy+aligned access is faster than doing unaligned accesses?
> I tested X86 and power8le and both got faster with unaligned. I can
> try on ARM when I get back home on Monday.

I just tested it (patch attached). The results I got were

x86:
  master:    1.325797007s ( +- 0.00% )
  unaligned: 1.197201064s ( +- 0.00% )

tegra tk1 (arm):
  master:    5.619536031s ( +- 0.05% )
  unaligned: 5.650522742s ( +- 0.05% )

gcc112 (power8 LE) (no root, so lower priority):
  master:    2.715273952s ( +- 0.26% )
  unaligned: 2.600145948s ( +- 0.81% )

gcc110 (power7 BE) (no root, no tmpfs, using 30 calls to time instead of perf):
  master:    5.85267s
  unaligned: 5.47700s

So the only case with a small slowdown was the tegra.

Cheers,
Rafael