The attached code times two loops, each of which just calls a function that increments an integer variable. In one loop the variable is a plain one; in the other it has the thread_local qualifier. I put in ugly annotations to prevent g++ from inlining the functions even though I compile with -O3; in real code, separate compilation forces each thread_local access to be independent in the same way.

The timing difference between the two cases is EXTREME on cygwin (both 32 and 64-bit). By contrast, g++ on Linux and the Microsoft compiler on Windows both manage to keep the base of the thread-local region in a segment register, so that the thread_local overhead is minimal. The cygwin thread_local overhead is large enough to be very visible in my code as a whole.

I can see that changing to use a segment register might be a painful API change even if it were feasible, but has there been any consideration of it? Note that x86_64-w64-mingw32-g++ and clang also do not use the segment register and so suffer the same significant speed penalty, so maybe it would be hard to match what Microsoft manages?

Sample output:

simple 1.265
thread_local 33.219

Arthur
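
For anyone reading without the attachment, a test along the lines described above might look roughly like this. It is only a sketch and not the attached file itself: the names (NOINLINE, plain_counter, tl_counter, bump_plain, bump_tl, time_loop, report) and the use of std::chrono are assumptions made for illustration, and the counters are unsigned here just so that wraparound on a long run stays well defined.

```cpp
// Sketch only: NOT the original attachment. All names are hypothetical.
#include <chrono>
#include <cstdio>

// Stop the compiler from inlining the increment functions at -O3.
#if defined(_MSC_VER)
#define NOINLINE __declspec(noinline)
#else
#define NOINLINE __attribute__((noinline))
#endif

static unsigned int plain_counter = 0;            // ordinary variable
static thread_local unsigned int tl_counter = 0;  // thread-local variable

// Each call performs one independent access to its variable.
NOINLINE void bump_plain() { plain_counter++; }
NOINLINE void bump_tl() { tl_counter++; }

// Call f() n times and return the elapsed wall-clock time in seconds.
static double time_loop(void (*f)(), long n)
{
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < n; i++) f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Print one timing line per case, in the same layout as the sample output.
static void report(long n)
{
    std::printf("simple %.3f\n", time_loop(bump_plain, n));
    std::printf("thread_local %.3f\n", time_loop(bump_tl, n));
}
```

Calling report(n) from main with a large iteration count and building with -O3 on each platform should be enough to compare the two cases; the iteration count needed for a measurable time is left open here, and the numbers will of course differ from the sample output above.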